The recent CrowdStrike IT outage has focused attention on a topic that is rarely discussed but fundamental to mission-critical systems: common cause failures. While this particular outage affected businesses and airlines more than cars, cars have been and will be affected by similar issues.
The most important thing to think about for highly dependable critical systems is ensuring a lack of common cause failures. This includes single point failures (one component that breaks brings the whole system down), as well as failures due to something shared across multiple otherwise independent subsystems or replicated systems.
We’ll talk about cars here to keep the examples concrete, but the principles apply across all types of systems. A few examples include:
A software design defect that is the same across every car is introduced by a botched update.
A third-party software component that is installed in every car is the subject of a botched update (this is the CrowdStrike situation — a security plug-in installed in massive numbers of Windows computers failed in every computer at the same time due to some sort of defective update distribution).
A novel cyberattack or social media trend emerges that might affect all cars from a particular manufacturer. (For example, the Kia TikTok challenge.)
A hardware design defect that is the same across every car is activated by a rare circumstance encountered by every car. (Yes, these happen. The poster child is the Intel FDIV bug.)
A latent, known defect is not fixed in legacy systems, perhaps because they stay in operation longer than expected but are not being updated. There have been multiple rounds of this issue with GPS week rollover. And the Unix 2038 date rollover is not as far away as it might seem.
A central service goes down. For example, what if nobody can start their cars because a driver monitoring system to screen for impaired drivers depends on a failed cloud service?
A central database is corrupted. For example, what if a cloud-based mapping service is contaminated with an update that corrupts information?
A requirements gap is revealed by a novel but broadly experienced situation. For example, a city-wide power outage disables traffic lights, and a whole fleet of self-driving cars cannot handle intersections with dark traffic lights.
A natural disaster invalidates the operational scenarios used by the designers. For example, riders who have given up their cars and depend on robotaxis for transportation need to evacuate a town surrounded by wildfires, but the robotaxis have been programmed not to drive on a road surrounded by flames, and no override capability is available.
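To make the date rollover examples above concrete, here is a minimal Python sketch of when those counters wrap. It assumes a signed 32-bit seconds counter for Unix time and the 10-bit week number of the legacy GPS navigation message. The function names are made up for illustration, and GPS leap seconds are ignored, so the GPS results are good to the day rather than the second.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)  # start of GPS week 0

INT32_MAX = 2**31 - 1          # largest value of a signed 32-bit counter
GPS_WEEKS_PER_ROLLOVER = 1024  # a 10-bit week number wraps every 1024 weeks


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a 32-bit two's complement signed value."""
    return (value + 2**31) % 2**32 - 2**31


def unix_last_good_instant() -> datetime:
    """Last instant a signed 32-bit Unix seconds counter can represent."""
    return UNIX_EPOCH + timedelta(seconds=INT32_MAX)


def unix_instant_after_wrap() -> datetime:
    """What a wrapped counter reports one second after the last good instant."""
    return UNIX_EPOCH + timedelta(seconds=wrap_int32(INT32_MAX + 1))


def gps_rollover_dates(count: int) -> list:
    """Approximate UTC dates of the first `count` GPS week rollovers."""
    return [
        GPS_EPOCH + timedelta(weeks=GPS_WEEKS_PER_ROLLOVER * n)
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    print("Unix 32-bit limit:", unix_last_good_instant().isoformat())
    print("After wrapping:   ", unix_instant_after_wrap().isoformat())
    for rollover in gps_rollover_dates(3):
        print("GPS week rollover:", rollover.date().isoformat())
```

The point for common cause failures is that every unit built with the same counter width reaches the limit at the same moment, no matter how many independent copies are in the field.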
You might say these are all “rare” conditions or “edge cases,” but they can’t be dismissed due to comparative rarity when people’s lives are on the line. Too many companies seem to believe in the Andrew Carnegie saying: “Put all your eggs in one basket — and then watch that basket!” But experienced safety folks know that never really works out in the long term for technical systems at scale. Defense in depth is the smart play.
Hurricanes are a rare event for any particular town. But they can do a lot of damage when they hit. In particular, a hurricane hit does not take out a single house — but tends to destroy entire towns or groups of towns. It is, in effect, a common cause source of losses for an entire town. Defense in depth does not simply bet a hurricane won’t come to town, but also buys insurance and updates the local building codes for hurricane-resistant roofs and the like to reduce losses when the hurricane eventually shows up.
Regardless of defenses, all buildings will be subject to the same problem at the same time when the hurricane shows up. The damage is not random and independent due to the common cause loss source of the hurricane raging through the town.
We can analogize this type of event in the software world as a cyber-hurricane. A large number of systems suffer a loss due to some sort of common cause effect.
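A small worked calculation can show why common cause losses differ from random, independent ones. The sketch below compares two hypothetical fleets with the same expected number of failed cars: one where each car fails independently, and one where a single shared cause takes down every car at once. The fleet size, the probabilities, and the function names are illustrative assumptions, not data from any real fleet.

```python
from math import comb

FLEET_SIZE = 1000         # hypothetical number of cars
P_INDEPENDENT = 0.001     # assumed chance each car fails on its own in a given period
P_COMMON_CAUSE = 0.001    # assumed chance a shared cause fails every car in that period
THRESHOLD = 100           # "a large part of the fleet is down at the same time"


def tail_independent(n: int, p: float, k: int) -> float:
    """P(at least k of n cars fail) when failures are independent (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def tail_common_cause(n: int, q: float, k: int) -> float:
    """P(at least k of n cars fail) when one shared cause fails all n cars
    with probability q and otherwise none fail."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return q


if __name__ == "__main__":
    # Both models lose the same number of cars on average...
    print("Expected failures, independent: ", FLEET_SIZE * P_INDEPENDENT)
    print("Expected failures, common cause:", FLEET_SIZE * P_COMMON_CAUSE)
    # ...but only one of them has a realistic chance of a fleet-wide event.
    print("P(>= 100 down), independent: ",
          tail_independent(FLEET_SIZE, P_INDEPENDENT, THRESHOLD))
    print("P(>= 100 down), common cause:",
          tail_common_cause(FLEET_SIZE, P_COMMON_CAUSE, THRESHOLD))
```

Under these assumptions the average loss is identical, yet a simultaneous loss of 100 or more cars is vanishingly unlikely with independent failures and happens with the full common cause probability otherwise. Real systems mix both kinds of failure, so treat this as a way to see the shape of the risk rather than a prediction.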
The insurance industry has a financial mechanism called reinsurance that spreads risk around on a secondary market. If any one company gets an unlucky break due to selling too many policies in a town hit by a hurricane (or earthquake, flood, tornado, etc.), that company does not bear the whole loss alone. All the companies lose, but an unlucky break probably won’t kill any particular insurance company. Even so, insurance companies might still require mitigation efforts to issue insurance, or at least reward such efforts with lower rates.
If a car company suffers a cyber-hurricane, how might they be able to spread that risk around? The car insurance companies might be fine due to reinsurance. But the car company itself has its name in the news. Just ask Hyundai/Kia how that’s been working out for them.
For a cyber-hurricane the sources of losses might be more diverse than just strong winds and rising flood waters. But you can expect we will see such events on a regular basis as more of our society comes to depend on interconnected software-intensive systems.
If you’re in the car business, ask yourself what a cyber-hurricane might look like for your system and what you should be doing to prevent it.