Saturday, July 20, 2024

Cyber-Hurricanes and Automotive Software: Common cause failures and the risk of widespread automotive losses

The recent CrowdStrike IT outage has focused attention on a topic rarely discussed, but fundamental to mission-critical systems: common cause failures. While this particular outage affected businesses and airlines more than cars, cars have been and will be affected by similar issues.

The most important thing to think about for highly dependable critical systems is ensuring a lack of common cause failures. This includes single point failures (one component that breaks brings the whole system down), as well as failures due to something shared across multiple otherwise independent subsystems or replicated systems.
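To make that concrete, here is a minimal sketch (all names, values, and the 0xFFFF "poison" input are invented for illustration) of how a shared defect defeats redundancy: two "independent" sensor channels run identical firmware, so the same rare input takes both out at once.

```python
# Hypothetical sketch: two redundant channels run identical firmware, so a
# shared design defect fails both at the same time, defeating the redundancy.
def firmware_read(raw):
    # Shared defect: a rare sentinel value crashes the parser on every unit.
    if raw == 0xFFFF:
        raise ValueError("firmware parse failure")
    return raw

def redundant_range(raw_a, raw_b):
    # Average whichever channels survive; None means total loss of function.
    readings = []
    for raw in (raw_a, raw_b):
        try:
            readings.append(firmware_read(raw))  # same code on both channels
        except ValueError:
            pass  # this channel is voted out
    return sum(readings) / len(readings) if readings else None

print(redundant_range(500, 502))        # 501.0 -- normal operation
print(redundant_range(0xFFFF, 0xFFFF))  # None -- common cause takes out both
```

Note that the redundancy works fine against independent random failures; it is only the shared defect, triggered by a shared circumstance, that produces a total loss.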

We’ll talk about cars here to keep the examples concrete, but the principles apply across all types of systems. A few examples include:

  • A software design defect that is the same across every car is introduced by a botched update.

  • A third-party software component that is installed in every car is the subject of a botched update (this is the CrowdStrike situation — a security plug-in installed in massive numbers of Windows computers failed in every computer at the same time due to some sort of defective update distribution).

  • A novel cyberattack or social media trend emerges that might affect all cars from a particular manufacturer. (For example, the Kia TikTok challenge.)

  • A hardware design defect that is the same across every car is activated by a rare circumstance encountered by every car. (Yes, these happen. The poster child is the Intel FDIV bug.)

  • A latent, known defect is not fixed in legacy systems, perhaps because they stay in operation longer than expected but are not being updated. There have been multiple rounds of this issue with GPS week rollover. And the Unix 2038 date rollover is not as far away as it might seem.

  • A central service goes down. For example, what if nobody can start their cars because a driver monitoring system to screen for impaired drivers depends on a failed cloud service?

  • A central database is corrupted. For example, what if a cloud-based mapping service is contaminated with an update that corrupts information?

  • A requirements gap that is revealed by a novel but broadly experienced situation. For example, a city-wide power outage disables traffic lights, and a whole fleet of self-driving cars cannot handle intersections with dark traffic lights.

  • A natural disaster invalidates the operational scenarios used by the designers. For example, riders who have given up their cars and depend on robotaxis for transportation need to evacuate a town surrounded by wildfires — but the robotaxis have been programmed not to drive on a road surrounded by flames. With no override capability available.
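The date-rollover example above is easy to demonstrate. This sketch (helper name invented) wraps an epoch timestamp into a signed 32-bit time_t, showing the instant in January 2038 when time on legacy systems appears to jump back to 1901:

```python
# Illustrative sketch: simulate the Unix 2038 rollover, where a signed 32-bit
# time_t overflows at 2038-01-19 03:14:08 UTC. The helper name is invented.
import datetime

def to_time32(dt):
    """Seconds since the Unix epoch, wrapped into a signed 32-bit integer."""
    seconds = int((dt - datetime.datetime(1970, 1, 1)).total_seconds())
    wrapped = seconds & 0xFFFFFFFF          # keep only the low 32 bits
    return wrapped - 0x100000000 if wrapped >= 0x80000000 else wrapped

# One second before and after the rollover boundary:
before = to_time32(datetime.datetime(2038, 1, 19, 3, 14, 7))
after  = to_time32(datetime.datetime(2038, 1, 19, 3, 14, 8))
print(before)  # 2147483647 (INT32_MAX)
print(after)   # -2147483648 -- interpreted as a date back in 1901
```

Every unpatched 32-bit system hits this boundary at the same moment, which is exactly what makes it a common cause failure rather than a random one.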

You might say these are all “rare” conditions or “edge cases,” but they can’t be dismissed due to comparative rarity when people’s lives are on the line. Too many companies seem to believe in the Andrew Carnegie saying: “Put all your eggs in one basket — and then watch that basket!” But experienced safety folks know that never really works out in the long term for technical systems at scale. Defense in depth is the smart play.

Hurricanes are a rare event for any particular town. But they can do a lot of damage when they hit. In particular, a hurricane hit does not take out a single house — but tends to destroy entire towns or groups of towns. It is, in effect, a common cause source of losses for an entire town. Defense in depth does not simply bet a hurricane won’t come to town, but also buys insurance and updates the local building codes for hurricane-resistant roofs and the like to reduce losses when the hurricane eventually shows up.

Regardless of defenses, all buildings will be subject to the same problem at the same time when the hurricane shows up. The damage is not random and independent due to the common cause loss source of the hurricane raging through the town.

We can analogize this type of event in the software world as a cyber-hurricane. A large number of systems suffer a loss due to some sort of common cause effect.

The insurance industry has a financial mechanism called reinsurance to spread risk around so if any one company gets an unlucky break due to selling too many policies in a town hit by a hurricane (or earthquake, flood, tornado, etc.) the risk gets spread around on a secondary market. All the companies lose, but an unlucky break probably won’t kill any particular insurance company. Even so, insurance companies might still require mitigation efforts to issue insurance, or at least reward such efforts with lower rates.

If a car company suffers a cyber-hurricane, how might they be able to spread that risk around? The car insurance companies might be fine due to reinsurance. But the car company itself has its name in the news. Just ask Hyundai/Kia how that’s been working out for them.

For a cyber-hurricane the sources of losses might be more diverse than just strong winds and rising flood waters. But you can expect we will see such events on a regular basis as more of our society comes to depend on interconnected software-intensive systems.

If you’re in the car business, ask yourself what a cyber-hurricane might look like for your system and what you should be doing to prevent it.

Sunday, July 14, 2024

Architectural Coupling Killed The Software Defined Vehicle

Poor architectural cohesion and coupling might be a critical factor in SDV failures.

We’re reading about high profile software fiascos in car companies, and how they might be handling them, for example: The $5B VW bet on Rivian; Volvo refunding car owners over poor software. And don’t forget a steady stream of recalls over infotainment screen failures related to vehicle status indication and rear-view cameras.

There are business forces at play here to be sure, such as a mad rush to catch up to Tesla for EVs. But I think there might be a system architecture issue that is also playing an outsized role — both technical and business.

The technical side of this issue is directly related to the move from a bunch of boxes from a tiered supplier system to a single big computer that is a key aspect of so-called Software Defined Vehicles.

Architectural coupling and cohesion

Two key architectural principles that differentiate a good architecture from a bad one are cohesion and coupling. High cohesion is good; low coupling is good. The opposite can easily kill a system due to drowning in complexity.

Here are some definitions:

  • Cohesion: how well all the functions in a particular hardware or software module are related. Are they all cousins (high cohesion)? Or is it miscellaneous cats and dogs, with a hamster tossed in for good measure (low cohesion)? As an analogy, an apartment building (all apartments) has higher cohesion than a mixed use building (shops+dining+offices+apartments+garage+metro station). Low cohesion might have some advantages, but it is more complex to maintain and operate.

  • Coupling: how many data connections there are into/out of each module. Low coupling is good (a few cleanly defined data types); high coupling is bad. High coupling amounts to data flow spaghetti. Not the tasty kind of spaghetti — the kind that causes system design failures analogous to spaghetti code, but in the middleware universe. As a more strained analogy, think of an apartment with a dozen exit doors — a different one for going to the shops, office, a neighbor, the garage, the metro, the sidewalk, the cafeteria, your patio, etc. — and what it means to check that all the exit doors are locked at bedtime.
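One way to make the coupling definition concrete is to simply count distinct data connections per module. This sketch (module names and the link list are invented) flags the "apartment with a dozen exit doors" problem numerically:

```python
# Hypothetical sketch: a crude coupling metric over a module connection map.
# All module names and links below are invented for illustration.
def coupling(connections, module):
    """Count distinct inbound plus outbound data connections for one module."""
    inbound = {src for (src, dst) in connections if dst == module}
    outbound = {dst for (src, dst) in connections if src == module}
    return len(inbound) + len(outbound)

# A dedicated brake ECU with one sensor in and one actuator out (low coupling)
# versus a do-everything module wired to many peers (high coupling).
links = [
    ("wheel_speed", "brake_ecu"), ("brake_ecu", "brake_actuator"),
    ("gps", "infotainment"), ("camera", "infotainment"),
    ("radar", "infotainment"), ("infotainment", "hvac"),
    ("infotainment", "cluster"), ("infotainment", "telematics"),
]
print(coupling(links, "brake_ecu"))     # 2
print(coupling(links, "infotainment"))  # 6
```

Real architecture metrics are more sophisticated than this, but even a crude count like this makes growing coupling visible before it becomes spaghetti.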

The old days: technology & supplier approaches incentivized low coupling and high cohesion

In traditional vehicles, the use of wires to connect different Electronic Control Units (ECUs) placed an artificial limit on coupling. In the old days you only got so many wires in a wire bundle before it would no longer fit in the available space in the car. And with the transition to networks, you only got so many messages per second on a comparatively slow CAN bus (250K or 500K bits/sec in the bad old days).

Moreover, in the old days each ECU more or less did a single function created by a single supplier. This was due in large part to a functional architecture approach in which OEMs could mix-and-match functions inside dedicated boxes from different suppliers.

Sure, there was duplication and potential wasteful use of compute resources. But a single box doing a single function hung onto a low-bandwidth network cable had no choice but to get high cohesion and low coupling.

New technology removes previous architectural incentives

Now we have much higher speed networks (perhaps hundreds of megabits/sec, with the sky being the limit). And if all the software is on the same computer, dramatically faster than that.

We also have middleware that is pushing software architectures away from passing data via procedure calls that follow the flow of control, and toward publish/subscribe broadcast models (pub/sub). Sure, that started with CAN, but it has gotten a lot more aggressive with the advent of higher speed interconnects and middleware frameworks such as the one provided by AUTOSAR.

The combination of higher connection bandwidth between modules and pub/sub middleware has effectively removed the technical cost of high coupling.

Now we are promised that Software Defined Vehicles will let us aggregate all the functions into a single big box (or perhaps a set of medium-big boxes). With high bandwidth networks. And all sorts of functions all competing for resources on shared hardware.

High bandwidth communications, pub/sub models, centralized hardware, and centralized software implicitly incentivize approaches with high coupling and low cohesion.
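The point about coupling becoming free can be sketched in a few lines (all names here are invented): subscribing to a topic costs one line of code, hidden data dependencies accumulate with no wiring-harness or bus-bandwidth pushback, and the publisher never knows who is listening.

```python
# Minimal pub/sub sketch (names invented). Nothing in this architecture pushes
# back when yet another feature quietly latches onto a signal it has no
# business depending on -- the coupling is invisible to the publisher.
class Bus:
    def __init__(self):
        self.subs = {}

    def subscribe(self, topic, handler):
        self.subs.setdefault(topic, []).append(handler)

    def publish(self, topic, msg):
        for handler in self.subs.get(topic, []):
            handler(msg)

bus = Bus()
seen = []
# Three unrelated features each subscribe to wheel speed with one line of code.
for feature in ("cruise", "infotainment_anim", "eco_coach"):
    bus.subscribe("wheel_speed", lambda msg, f=feature: seen.append((f, msg)))
bus.publish("wheel_speed", 88)
print(seen)  # [('cruise', 88), ('infotainment_anim', 88), ('eco_coach', 88)]
```

Contrast that with a dedicated wire or a slot in a saturated CAN schedule, where each new dependency had a visible, negotiated cost.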

SDV architectures destroy all the incentives that pushed toward low coupling and high cohesion in older system designs. You should expect to get what you incentivize. Or in this case, stop getting what you have stopped incentivizing.

Any student of system architecture should find it no surprise that we’re seeing systems with high coupling (and likely low cohesion) in SDVs. Accompanied by spectacular failures.

I’m not saying it is impossible to design a good architecture with current system design approaches. And I’m not saying the only solution is to go back to slow CAN networks. Or go back 50 years in software/system architectures. And I’m only arguing a tiny bit that painful constraints build character. What I’m saying is that the incentives that used to push designers to better architectures have evaporated.

Business architecture imitating technical architecture

Consider Conway’s law: organizations and the systems they create tend to have similar structures. In my experience this is a two-way street. We can easily get organizations that evolve to match the architecture of the system being built. It is possible that the software/system architecture itself is the tail, and the dog is that organizations over time have aligned themselves with low cohesion/high coupling system approaches, and are therefore suffering from these same architectural problems.

So it might not be the middleware-centric architectural trends that are as much of a problem as the way those reflect back into the organizations that create them.

Despite the headline I don’t think the SDV is actually dead. But the failures we’re seeing will not be resolved simply by hiring more and smarter people. There are some fundamental architectural issues that need to be addressed, and the incentives for addressing them are strategic rather than tactical, which makes the return on investment harder to explain. Beyond that, there are some serious issues with how software engineering practices have evolved and their suitability for life critical systems.

I think the spectacular failures are just getting started. It will take some really fundamental changes to get things back on track. And probably more corporate fiascos.

Sunday, July 7, 2024

Mercedes Benz DRIVE PILOT and driver blame

MB has softened their stance on Level 3 liability, but they still don’t really have your back when it matters.

Good news: Mercedes Benz has improved their position on driver liability.

Bad news: But they’re not there yet. A soft promise to pay insurance isn’t the biggest issue. Tort liability in a courtroom is.

MB is starting yet another round of “we take responsibility for crashes” for their Level 3 DRIVE PILOT traffic jam assist automation feature, approved for use in California & Nevada as a “Level 3” system and some places outside the US as an ALKS traffic jam assist system. (h/t to Susana Gallun for this gem promising they will cover insurance cost of crashes in Australia.)

Gone is the wording of the driver having to notice “irregularities .. in the traffic situation” while watching a movie on the dashboard. Because that was just plain stupid. And was an admission that they were selling an SAE Level 2+ system and not a Level 3 system, because Level 3 requires no driver attention to what is happening on the roadway.

Now MB has adopted terminology from SAE J3016. The driver “must take control of the vehicle if evident irregularities concerning performance relevant system failures are detected in the vehicle.” Straight out of J3016 Level 3 requirements. So that at least means they are consistent with the standard they invoke, and they are legit selling a Level 3 system according to their description. Too bad J3016 is not actually a safety standard and conformance to Level 3 does not bestow safety.

(Excerpt from Drive Pilot supplement: https://www.mbusa.com/content/dam/mb-nafta/us/owners/drive-pilot/QS_Sedan_Operators_Manual_US.pdf)

There are two issues here. The first is what a “performance relevant system failure” inside the vehicle actually might be, and how the driver is supposed to notice if the system does not tell them. Let’s say there is a complete lidar failure for some reason (for example, a common cause failure inside the lidar firmware for all redundant lidar units), and the system does not tell the driver to take over. Crash ensues. That is clearly a “performance relevant system failure” — including the part where the system fails to inform the driver that the ADS is driving blind while the driver is still watching the movie. Yes, conforms to Level 3. But see the part where J3016 is not actually a safety standard.

So the driver is still potentially hung out to dry if the automation fails and there is a crash.

At this point MB typically says something about how smart their engineers are (which is credible) and they will pay insurance (if they don’t walk it back) and it won’t fail because MB is trustworthy (mixed experiences on that). But the fact is the current wording still leaves drivers exposed to significant downside if the technology fails. And technology eventually fails, no matter who builds it. (MB has recalls, just like everyone else.)

But let’s say MB actually honors its soft PR-promise to pay up insurance claims. In practice this kind of doesn’t matter, because the driver’s insurance was going to pay anyway. For anyone who can afford DRIVE PILOT, who pays insurance is not an existential economic threat. That comes from the wrongful death lawsuit, which MB is NOT saying it will cover.

What matters is who takes responsibility for the $10M+ (or whatever) wrongful death tort lawsuit that blows well past the insurance coverage of the driver, and which MB might (or might not) conveniently find an excuse to walk away from.

Mercedes Benz still wants us to think they accept liability in case of a Level 3 computer driver crash. Their current posture is not as ridiculous as it was before, because now they’re flirting with tort liability instead of disingenuously meaning only product liability as they were saying last year.

But we’re still not there yet. Maybe next year we will see them actually say they accept a duty of care (liability) for any crash that occurs when DRIVE PILOT is engaged, and liability returns to the driver only after more than 10 seconds have elapsed from a takeover request or the driver performs a takeover before then. Human factors questions about takeover situational awareness and safety remain, but at least this would be relatively clear-cut and not leave the potential for the human driver to be hung out to dry for watching a movie — as MB encourages them to do — when a crash happens.

For a longer writeup with detailed history, see:
No, Mercedes-Benz will NOT take the blame for a Drive Pilot crash