Computer-Based System Safety Essential Reading List
Here is a quick-start resource guide for computer-based system safety literacy. If you work on computer-based system safety and you aren't familiar with the case studies below, you really need to read them. (Not just safety engineers -- everyone!)
Essential Case Studies: Because those who have not read history are doomed to repeat it.
- 1985-1987: Therac 25 (Summary video | Wikipedia | Paper)
- 1992: Patriot Missile Timekeeping (Wikipedia | Report)
- 1996: Ariane 5 flight 501 (Wikipedia | Report)
- 1999: Mars Climate Orbiter (Wikipedia | Report)
- 2002-2010: Toyota UA (Slides | Full Video | Wikipedia)
- 2007: F-22 Date Line Bug (News | Wikipedia)
- 2009: Air France Flight 447 (Wikipedia | Report)
- 2010: Stuxnet (security) (Wired | Wikipedia)
- 2011: Wenzhou Train Crash (News | Wikipedia)
- 2015: Seville A400M Crash (News | Wikipedia)
- 2015: Ukraine Power Outage (security) (News | CERT)
- 2017: USS John S. McCain (ProPublica | Wikipedia)
- 2018: Uber Testing Fatality (News | NTSB report | NTSB Hearing)
- 2019: Boeing 737 Max/MCAS (News | US House Prelim Findings | PBS video | Wikipedia)
- 2021: Horizon IT scandal (News | Journal paper | Wikipedia | Computerphile)
Additional Case Studies:
- 1997: Mars Pathfinder ("What Really Happened on Mars")
- 2004: Spirit Mars Rover file system (Wikipedia)
- 2005: V-22 Osprey Crashes (Wired | Wikipedia)
- 2016: Schiaparelli Mars Lander crash (Wikipedia)
- See also: https://en.wikipedia.org/wiki/List_of_software_bugs
Recommended Supplemental Materials
- Carnegie Mellon University course 18-642 on embedded software code quality, safety, security (Koopman)
- Other reading/viewing
- Discussion of aircraft human/computer interaction issues with automation (New Yorker article)
- They Write the Right Stuff: software for grown-ups 1996 (FastCompany)
- Leveson (Paper Index)
- Software Safety: why, what and how 1986 (CiteSeer) (longer version: 1995 Safeware Book)
- High-Pressure Steam Engines and Computer Software
- Engineering a Safer World (link for free PDF download)
- Read safety standards for yourself:
- IEC 61508 functional safety standard (older version, but still representative for learning purposes) (Wikipedia | 2005 Full Text)
- UL 4600 autonomous product system safety standard (2019 Voting Draft)
- US DoD MIL-STD-882E System Safety (DLA)
- UK MoD Defstan 00-55 and 00-56 are available via free account on: StanMIS (web searches will often turn up cached copies)
- US FDA Medical Device Software (Principles | Submissions)
Other Case Studies: (Still important, and should be read by anyone digging deep into safety. But less specifically related to computer-based system risks.)
- 1814: London Beer Flood (Wikipedia)
- 1860: Pemberton Mill (Wikipedia)
- 1889: Johnstown Flood (Wikipedia)
- 1912: RMS Titanic (Wikipedia)
- 1919: Great Molasses Flood (Wikipedia)
- 1937: LZ 129 Hindenburg (Wikipedia)
- 1957: Windscale Fire (Wikipedia)
- 1959: Minamata Disease (Wikipedia)
- 1963: USS Thresher (SSN-593) (Inquiry transcripts (Part 1 / Part 2) | Release of Records Story | Wikipedia | Subsafe | Webinar)
- 1966: Aberfan Spoil Tip Collapse (Wikipedia)
- 1972: DC-10 Cargo Door / AA Flight 96 (Wikipedia)
- 1973: PBB contamination in Michigan (Wikipedia | Science)
- 1977: Tenerife Airport (Wikipedia)
- 1979: Three Mile Island (Wikipedia | Report | Lecture)
- 1980: Damascus Titan missile explosion (Wikipedia)
- 1981: Ford Pinto (Wikipedia | Museum Summary with video, Writeup)
- 1981: Kansas City Hotel Walkway Collapse (Wikipedia)
- 1982: Speedbird 9 (Wikipedia)
- 1983: Gimli Glider (Wikipedia)
- 1984: Bhopal (Wikipedia)
- 1986: Challenger Space Shuttle (IEEE Spectrum | Wikipedia | Report)
- 1986: Chernobyl (Wikipedia | INSAG Report | WHO Web Page | Physics Video)
- 1988: Piper Alpha (Wikipedia | Report part 1, part 2)
- 1991: Lauda Air Flight 004 (Wikipedia) -- said to have prompted the DO-178B update.
- 1998: Eschede derailment (Wikipedia)
- 2001: Petrobras P-36 (Mishap Report | NASA Case Study)
- 2003: Columbia Space Shuttle (Wikipedia | Report)
- 2006: Nimrod MR2 (Wikipedia | Report)
- 2008: British Airways Flight 38 (Wikipedia)
- 2010: Deepwater Horizon (Wikipedia | Reports)
- 2011: Fukushima Daiichi (Wikipedia | Report Summary | IEEE Spectrum)
- 2013: Takata airbags (Wikipedia)
- 2014: GM ignition switch (Wikipedia | Long-Read)
- 2015: Volkswagen emissions scandal (Wikipedia)
- 2014-2019: Flint MI Water Crisis (Wikipedia)
- 2023: OceanGate Titan implosion (Wikipedia | Long-Read)
- 2023: UK NATS Air Traffic Control failure (Report)
- Wikipedia List of Industrial Disasters | List of disasters by death toll | Compendium of computer-related aviation accidents
- Safety Culture (Wikipedia | NASA Handbook)
- Mode Confusion (Wikipedia)
- Compilation of US software-related Automotive Safety Recalls (Blog)
- Systems Engineering Body of Knowledge on Safety Engineering (SEBoK)
- NASA Safety library (index | Safety Guidebook)
- NASA Real System Failure story collection by Kevin Driscoll (Home | slides)
- FAA System Safety Handbook (FAA)
- USAF System Safety Handbook (USAF)
- List of NHTSA software-related automotive recalls (Blog)
- SafeComp annual conferences
- Safety Critical Systems Club
- RISKS Digest
- System Safety Mailing List
- Journal of System Safety (open access)
- Safety of Work podcast (Rae & Provan) (Podcast)
- Fundamentals of Dependable Computing (John Knight)
- Better Embedded System Software Blog (Koopman)
- Safe Autonomy Blog (Koopman)
- Security Engineering (Ross Anderson)
- Marsh 100 Largest Losses in the Hydrocarbon Industry 1974-2019
- List of freely available reference books by Stephen Thomas
- To Engineer is Human, Petroski, 1992.
Advanced Specialty Topics/Research:
- Computer System Diversity, Independence, and Bootstrapping Safety (Lorenzo Strigini)
- Radiation-induced upsets. (YouTube, OK for background on the phenomenon, but has an inaccurate summary of Toyota UA findings)
- Knight, "Safety Standards -- a New Approach" about the equity issues with safety standards that need to be resolved: https://scsc.uk/r126/1:1
NOTE: While Wikipedia is not always an authoritative source, for these sorts of events it tends to present useful summary descriptions.
If you think something important is missing, let me know!
Last update 9/10/2023
Leveson, "Engineering A Safer World", PDF download from https://mitpress.mit.edu/books/engineering-safer-world.
Neumann, "Computer-Related Risks" (based on the Risks Digest archives as of 1994).
Excellent list. While I've read most of these, seeing this list there's a few I want to add to my reading list.
The Eschede train accident should be included under Other Mishap Case Studies, not only because it highlights the importance of proper maintenance procedures, but also the legal aftermath where officials and engineers were charged with manslaughter.
https://en.wikipedia.org/wiki/Eschede_derailment