The Most Catastrophic Software Bugs in History [Part 1]: Therac-25’s Deadly Software Tragedy

On March 21, 1986, a 33-year-old oilfield worker was treated for a tumor in his upper back at the Tyler Clinic. He was supposed to be treated with electron beams. Instead, he felt a jolt like an electric shock in his back. He started to get up, but the technician did not notice because the intercom was defective and the camera was unplugged. At that moment, another shot was triggered, hitting his hand. The system displayed only an error: "Malfunction 54." The technician either ignored the error or did not understand it, and proceeded with the treatment. The patient returned to the clinic that evening with skin and back pain. He lost control of his left arm and suffered from nausea and vomiting. What he didn’t know was that the treatment had given him a severe overdose and damaged his spinal cord, resulting in paralysis of both legs. He died in September of the same year from the effects of the overdose…

As software testers, we often deal with bugs. Most of them are annoying and hurt the user experience, but they don't cause significant harm. However, this isn't always the case. Some bugs have caused serious damage, including injuries, suffering, and even deaths.

In this series, I'll share stories about the worst software bugs in history, how they occurred, and what we can learn from them. These stories serve as a reminder that thorough testing is not just beneficial but crucial, sometimes even life-saving.

The next time someone questions the importance of your job, share one of these stories.

By the way, if you haven't subscribed yet, don't miss out on my newsletter, "Test Your Mind." It covers all quality-related topics and is free!

A New Revolutionary Machine

In the 1980s, the Therac-25, developed by Atomic Energy of Canada Limited (AECL), emerged as a pioneering radiation therapy machine. The Therac-25 was a linear accelerator with two main operating modes. It could deliver a beam of low-energy electrons to treat shallow tissues, such as skin cancer, and a beam of high-energy X-ray photons for deeper tissues, like lung cancer. This dual capability minimized the need for multiple machines, streamlining logistics and maintenance for hospitals. It was among the first to depend on software-based safety systems instead of traditional hardware controls.

To understand the significance of these changes, it's important to look at the Therac-25's predecessors. The Therac-6 and Therac-20 were both electro-mechanical devices with hardware interlocks serving as safety features. The success of the Therac-6 led to the development of the Therac-20, where the computer took on more tasks but did not directly control the safety systems. In 1981, the collaboration between AECL and Thomson CGR, the French company that had co-developed the earlier machines, ended, leading AECL to develop the Therac-25 independently.

Safety System Failures and Software Issues

The Therac-25 replaced the earlier electrical and hardware safety systems with a software system. This software, comprising over 20,000 instructions, was developed by a single programmer over several years. Yes, you read that correctly: one software engineer. There is no public information about the programmer's identity, qualifications, or training, but the program was poorly documented and not thoroughly tested before deployment. The software also let concurrent tasks access shared variables without synchronization. This implementation of multitasking resulted in race conditions, which played a significant role in the incidents.

A race condition occurs when a system's behavior depends on the sequence or timing of events it does not control, leading to unpredictable results. In this case, if the machine operator quickly made a series of inputs to change the machine settings, the software tasks that configured the machine could end up working from different versions of those settings. This could result in the high-power beam being activated without the appropriate diffuser in place, leading to a massive overdose of radiation.
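To make the timing problem concrete, here is a minimal sketch in Python. It is not the Therac-25 code: the class names, fields, and the two setup functions are hypothetical, and the "race" is replayed as a fixed single-threaded interleaving so the outcome is reproducible. It only illustrates how two tasks that read a shared setting at different moments can leave a machine in an inconsistent state.

```python
from dataclasses import dataclass


@dataclass
class Console:
    """Hypothetical shared state that the operator edits."""
    mode: str = "xray"  # "xray" or "electron"


@dataclass
class Machine:
    """Hypothetical machine state set by two separate tasks."""
    beam_current: str = "off"      # "high" for X-ray, "low" for electron
    diffuser_in_place: bool = False


def setup_beam(console, machine):
    # Task A: reads the mode and sets the beam current accordingly.
    machine.beam_current = "high" if console.mode == "xray" else "low"


def setup_hardware(console, machine):
    # Task B: reads the mode and positions the diffuser accordingly.
    machine.diffuser_in_place = console.mode == "xray"


def is_hazardous(machine):
    # High current with nothing in the beam path is the dangerous combination.
    return machine.beam_current == "high" and not machine.diffuser_in_place


def unsafe_sequence():
    console, machine = Console(mode="xray"), Machine()
    setup_beam(console, machine)      # task A sees "xray"
    console.mode = "electron"         # operator's quick edit lands in between
    setup_hardware(console, machine)  # task B sees "electron"
    return machine


def safe_sequence():
    console, machine = Console(mode="xray"), Machine()
    console.mode = "electron"              # the edit is applied first
    snapshot = Console(mode=console.mode)  # both tasks use one frozen copy
    setup_beam(snapshot, machine)
    setup_hardware(snapshot, machine)
    return machine
```

In the unsafe sequence the two tasks disagree about the mode, so `is_hazardous` reports the dangerous combination. In the safe sequence both tasks work from the same snapshot. Real concurrent code would typically need a lock or a similar mechanism to guarantee that.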

Additionally, the software allowed the beam to be turned on without first confirming that the beam spreader was correctly in place. When the operator changed settings quickly and the software lost track of the machine's state, nothing stopped it from delivering high-dose radiation without proper diffusion.
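A defensive check of the kind that was missing might look like the sketch below. Everything in it is an assumption made for illustration: the `Mode` enum, the `REQUIRED_STATE` table, and the string values for sensor readings are invented, and a software check like this would complement a hardware interlock rather than replace it.

```python
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class InterlockError(RuntimeError):
    """Raised when the beam must stay off."""


# Hypothetical table of what the sensors must report for each mode.
REQUIRED_STATE = {
    Mode.ELECTRON: {"current": "low", "spreader": "electron_position"},
    Mode.XRAY: {"current": "high", "spreader": "xray_position"},
}


def beam_on(mode, measured_current, measured_spreader):
    """Enable the beam only if the measured hardware state matches the mode.

    The measured_* arguments stand for fresh sensor readings, not for
    values the software remembers from an earlier setup step.
    """
    required = REQUIRED_STATE[mode]
    if measured_current != required["current"]:
        raise InterlockError(
            f"{mode.value} mode needs {required['current']} beam current, "
            f"but sensors report {measured_current}"
        )
    if measured_spreader != required["spreader"]:
        raise InterlockError(
            f"{mode.value} mode needs spreader at {required['spreader']}, "
            f"but sensors report {measured_spreader}"
        )
    return True
```

The design choice here is to compare the requested mode against what the hardware reports at the moment of firing, so a stale internal state cannot silently enable the beam.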

Another significant challenge was the cryptic error messages displayed to technicians, which simply read "Malfunction" followed by a number from 1 to 64. These messages did not clearly indicate what had gone wrong or how the technician should proceed. Reports showed that on some days more than 40 such messages appeared, so operators tended to ignore them. The "Malfunction 54" error was particularly troubling, as it could signal either an underdose or an overdose. There were no specific error messages to differentiate between excessively high and excessively low radiation levels, further compounding the confusion.
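By contrast, an error report that separates the two cases and tells the operator what to do could be sketched as follows. The `DoseFault` type, the fault codes, the wording of the messages, and the 5% tolerance are all assumptions chosen for the example, not values from any real device.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DoseFault:
    code: str             # stable identifier for logs and documentation
    summary: str          # what happened, in plain language
    operator_action: str  # what the operator should do next


def classify_dose(prescribed_rad: float, measured_rad: float,
                  tolerance: float = 0.05) -> Optional[DoseFault]:
    """Return a specific fault for an overdose or underdose, else None.

    The 5% default tolerance is an illustrative assumption.
    """
    if prescribed_rad <= 0:
        raise ValueError("prescribed dose must be positive")
    deviation = (measured_rad - prescribed_rad) / prescribed_rad
    if deviation > tolerance:
        return DoseFault(
            code="DOSE_HIGH",
            summary=(f"Measured dose {measured_rad:.0f} rad exceeds "
                     f"prescribed {prescribed_rad:.0f} rad"),
            operator_action="Stop treatment and do not resume; "
                            "check on the patient and notify the physicist.",
        )
    if deviation < -tolerance:
        return DoseFault(
            code="DOSE_LOW",
            summary=(f"Measured dose {measured_rad:.0f} rad is below "
                     f"prescribed {prescribed_rad:.0f} rad"),
            operator_action="Pause treatment and verify the dose "
                            "measurement before continuing.",
        )
    return None
```

For example, `classify_dose(200, 15000)` yields a `DOSE_HIGH` fault, while `classify_dose(200, 100)` yields `DOSE_LOW`, so the operator is never left guessing which of the two happened.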

The Accidents

Between 1985 and 1987, six severe radiation overdoses occurred, resulting in patient deaths and serious injuries. Initially, neither the government nor the manufacturer, AECL, investigated these incidents, believing device malfunctions were impossible:

  1. Georgia, June 3, 1985: A 61-year-old woman undergoing electron beam therapy for breast cancer felt intense heat during treatment. She later developed paralysis in her shoulder and arm, requiring a mastectomy.

  2. Ontario, July 26, 1985: A 40-year-old woman receiving her 24th Therac-25 treatment for cervical cancer experienced a system error indicating no dose. The technician restarted the machine five times before it shut down. She felt a burning sensation, and later it was discovered she had received a 15,000 rad overdose instead of the intended 200 rad.

  3. Washington, December 11, 1985: After improvements following the Ontario incident, a patient in Washington suffered burns during treatment for hip skin cancer due to a faulty microswitch. She survived but was left with a stiff hip.

  4. Texas, March 21, 1986: A 33-year-old man felt an electric shock during treatment for a back tumor. The technician was unaware due to faulty communication equipment. He received multiple overdoses, leading to spinal cord damage and paralysis. He died in September 1986.

  5. Texas, April 11, 1986: A 66-year-old man treated for facial skin cancer experienced a burning sensation and saw a bright flash. Despite functioning communication equipment, he received a fatal overdose, leading to brain damage and death three weeks later.

  6. Washington, January 17, 1987: The last accident occurred when a patient received an X-ray overdose instead of electron beam therapy. He died in April 1987.

The Reaction

After the Ontario accident in July 1985, Atomic Energy of Canada Limited (AECL) sent a technician to investigate, who attributed the failure to a defective microswitch and advised users to inspect the device before treatment. Despite the hardware modification that followed, another accident occurred in Washington in December 1985, and AECL did not respond adequately.

It wasn't until the hospital physicist at the Texas clinic reproduced the accident sequence that AECL reported to the FDA, which then declared the Therac-25 defective. AECL submitted a correction plan, but another accident in Washington in January 1987 led to further investigations, revealing additional software errors. The devices were shut down in February 1987, and AECL later introduced a comprehensive improvement plan including hardware interlocks and software upgrades.

AECL's delayed and insufficient response suggests that many of the accidents could have been prevented by immediate investigations, thorough pre-use testing, and reliance on both software and mechanical safety systems. In addition, the cryptic error messages hid how serious the problems were.

Aftermath and Lessons Learned

The Therac-25 incidents uncovered significant flaws in the development and testing of software for medical devices. An internal FDA memo revealed that AECL, the manufacturer, lacked formal software specifications and a comprehensive test plan. The software wasn't independently evaluated, and development heavily relied on a hardware simulator, leading to critical safety issues being overlooked.

Focusing solely on fixing individual software bugs is insufficient for ensuring system safety. Continuous testing from the start is crucial. The primary issues weren't just specific code errors but poor development practices and an over-reliance on software for safety. Error messages need to be meaningful and well-documented, which wasn't the case with the Therac-25.

Furthermore, replacing hardware safeguards with software poses significant risks. Hardware interlocks are generally simpler and more predictable, and they provide a backstop that does not depend on the software being correct. Carelessness and unrealistic risk assessments further compounded the issues. Testing mainly on a simulator, with minimal testing of the software itself, was inadequate. Naive software reuse and prioritizing user-friendliness over safety also contributed to the failures.

The catastrophic consequences of the Therac-25 software bugs led to lawsuits and a Class I recall by the FDA, highlighting the severe risks associated with the device. This case emphasizes the importance of rigorous software development practices, thorough testing, clear error communication, and robust safety measures to prevent future disasters. The Therac-25 incidents serve as an important reminder of the critical need for attention to safety in the development and testing of medical device software.
