
Friday, November 11, 2022

Shiny vs. Critical Software

Coverage of Lucid software problems (cars bricked; wrong direction of travel) might be written off as growing pains for a new company. But I think this is just yet another story about a deeper industry-wide problem. (The article notes other, more established companies have problems too.) This week's story: https://www.businessinsider.com/electric-vehicle-startup-lucid-struggling-production-reveal-insiders-owners-2022-11

[abstract oil and water photo]
Shiny and critical software and developer skills mix like oil and water.

Not all software is created equal. For cars, I see three types:

  • Shiny: Infotainment and other software that provides shiny customer features might be less reliable and still sell cars. There are limits to tolerance for problems, but more forgiveness if the features are shiny enough. Apparently "coding" is enough to build valuable companies, as is slapping "beta" on the label as a pretext for further reducing quality expectations for a product sold at retail. Maximizing lines of code per day has made lots of founders rich.
  • Critical: Deeply embedded control firmware has to be rock solid or you're going to get serious malfunctions. "Coding" without software engineering invariably leads to deeper problems, some of which result in harm to people. Maximizing lines of code per day at the cost of impaired quality and skipped safety engineering practices has made customers dead.
  • "AI": Software based on machine learning, which is being asked to do safety-critical work, but often without the foundational skills and processes of the critical software experts.
(You can argue that cloud services are yet another type, but in my experience that divides up into shiny services software and critical infrastructure support software.)

Journalists should differentiate when reporting. If you must call a shiny software problem a "glitch," so be it. But critical software failures are due to *defects* that reflect an important lapse in engineering, not glitches caused by being in a hurry to deploy the new hotness.


Companies get in trouble, sometimes very seriously, via several mechanisms:

  • Treating all software the same. Software needs to be thought of as either shiny or critical. The two are like oil and water. (Maybe this could be different, but in the world we live in this is the only pragmatic approach.)
  • Treating "software staff" as fungible. Developers for shiny vs. critical software are deeply different. The skill sets, mindsets, and workflows are quite different. So is the training required. While some can do both, few can do both well. (This is not about being "smart." It is about being different.) To a first approximation, anyone talking about "coding" is in the shiny software business, especially if they indicate that knowing how to code is equivalent to being a software engineer.
  • Mixed components and features. We see an endless parade of NHTSA recalls for malfunctioning backup cameras. The usual story is that a critical function (backup camera; by definition safety critical per FMVSS) is hosted on a platform optimized for shiny (infotainment display and OS). The only surprise is that car companies persist in thinking that plan will work out well.
  • We're still sorting out how to fit AI software into the mix. A lot of that will end up forcing a choice between the shiny bin or the critical bin, with a human/machine interface suitable to the choice. Making AI software shiny can be a reasonable choice -- but only if we don't pretend shiny AI is fit for purpose as critical software. (Critical machine learning might be done, but there is a significant gap to overcome in skills, workflows, etc. that the industry is only beginning to wrestle with.)
Cost is pushing companies to mix shiny with critical more than they should. That pressure will continue to generate news stories. Instead of just pretending the two mix well, companies should be rethinking their software architectures to maintain a harmonious separation of technical aspects, staff skills, and culture. Pretending these differences don't exist will continue to lead to bad outcomes.
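
To make the separation concrete, here is a minimal sketch in C (all names, message fields, and limits are hypothetical, invented for illustration and not taken from any particular vehicle) of a critical-side controller treating data that arrives from the shiny infotainment domain as untrusted input, rather than hosting the critical function on the shiny platform:

```c
/* Minimal illustrative sketch: the critical domain validates anything
 * coming from the infotainment (shiny) domain before acting on it.
 * All names, fields, and limits are hypothetical. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t seq;           /* rolling sequence counter from the sender   */
    uint32_t timestamp_ms;  /* sender's send time, in milliseconds        */
    int16_t  setpoint_c10;  /* requested setpoint, tenths of a degree C   */
} infotainment_msg_t;

#define SETPOINT_MIN_C10  160   /* 16.0 C */
#define SETPOINT_MAX_C10  300   /* 30.0 C */
#define MAX_MSG_AGE_MS    500   /* reject stale messages */

static uint16_t last_seq;
static int16_t  safe_setpoint_c10 = 210;  /* fallback value: 21.0 C */

/* Return the setpoint the critical control loop should actually use.
 * Messages that fail freshness, range, or liveness checks are ignored
 * and the last known-good value is retained. */
int16_t accept_setpoint(const infotainment_msg_t *msg, uint32_t now_ms)
{
    bool fresh    = (now_ms - msg->timestamp_ms) <= MAX_MSG_AGE_MS;
    bool in_range = (msg->setpoint_c10 >= SETPOINT_MIN_C10) &&
                    (msg->setpoint_c10 <= SETPOINT_MAX_C10);
    bool alive    = (msg->seq != last_seq);   /* detect a stuck sender */

    if (fresh && in_range && alive) {
        last_seq = msg->seq;
        safe_setpoint_c10 = msg->setpoint_c10;
    }
    return safe_setpoint_c10;  /* shiny side can misbehave; control keeps going */
}
```

The design point is that the infotainment side can crash, lag, or send garbage without dragging the critical function down with it; the critical side never assumes the shiny side is correct.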


Friday, October 14, 2022

The Software Defined Vehicle Is Still More Wish Than Reality

Here is a Software Defined Vehicle video that covers a lot of ground. Car companies are all talking a big game about adding software to their vehicles, including big data, software updates, connectivity, and more. The possibilities are exciting, but you only have to read the news to know that the road to get there is proving bumpier than they'd like. (See this story too.)

Getting the mix of Silicon Valley software + automotive system integration + vehicle automation technology right is still a big challenge. This video talks about the possibilities. But to get there, OEMs still have a lot of work to do to achieve a viable culture that addresses these inherent tensions:
  • Cutting edge cloud software vs. life critical embedded systems
  • Role of automation vs. realistic expectations of human drivers
  • A shift from "recall" mentality to continuous improvement processes
  • Fast updates vs. assured safety integrity
  • Role of suppliers vs. OEM, especially for autonomous vehicle functions
  • Monetizing data vs. consumer rights
  • OEMs stepping up to the system integration challenges
  • Getting a regulatory approach that balances risks and benefits across all stakeholders
(Sadly, the video includes an incorrect statement that "95% to 96% of the accidents happen because of distracted driving" in the context of fatalities. Drivers are not perfect, but distracted driving only contributes to about 9% of fatalities per US DOT, about one-tenth of what was stated.)


Saturday, April 30, 2022

OTA updates won't save buggy autonomous vehicle software

There is a feeling that it's OK for software to ship with questionable quality if you have the ability to send out updates quickly. You might be able to get away with this with human-driven vehicles, but for autonomous vehicles (no human driver responsible for safety) this strategy might collapse.


Right now, companies are all pushing hard to do quick-turn Over The Air (OTA) software updates, with Tesla being the poster child for both shipping dodgy software and pushing out quick updates (not all of which actually solve the problem as intended). There is a moral hazard that comes with the ability to do quick OTAs: you might not spend much time on quality because you know you can just send another update if the first one doesn't turn out as you hoped.

"There's definitely the mindset that you can fix fast so you can take a higher risk," said Florian Rohde, a former Tesla validation manager. (https://www.reuters.com/article/tesla-recalls-idTRNIKBN2KN171)

For now, companies across an increasing number of industries have been getting away with shipping lower quality software, and the ability to do internet-based updates has let them get by with that strategy. The practice is so prevalent that the typical troubleshooting step for any software, right after "is the power turned on," has become "have you downloaded the latest updates."

But the reason this approach works (after a fashion) is that there is a human user or vehicle operator present to recognize something is wrong, work around the problem, and participate in the troubleshooting. In a fully automated vehicle, that human isn't going to be there to save the day.

What happens when there is no human to counteract defective, potentially fatally dangerous software behavior? The biggest weakness of any automation is typically that it is not "smart" enough to know when something is going wrong that is not supposed to happen. People are pretty good at this, which is why even for very serious software defects in cars we often see huge numbers of complaints but few actual instances of harm -- because human drivers have compensated for the bad software behavior.

Here's a concrete example of a surprising software defect pulled from my extensive list of problematic automotive software defects: NHTSA Recall 14V-204:

  • Due to a software calibration error, the vehicle may be in and display "drive" but engage "reverse" for 1.5 seconds.

If a human driver notices the vehicle going the wrong direction, they'll stop accelerating pretty quickly. They might hit something at slow speed during the reaction time, but they'll realize something is wrong without having explicit instructions for that particular failure scenario. In contrast, a computer-based system that has been taught the car always moves in the direction of the transmission display might not even realize something is wrong and accelerate into a collision.

Obviously a computer can be programmed to deal with such a situation if it has been thought of at design time. But the whole point here is that this is something that isn't supposed to happen -- so why would you waste time programming a computer to handle an "impossible" event? Safety engineering uses hazard analysis to mitigate low-probability risks, but even that often overlooks "impossible" events until after they've occurred. Sure, you can send an OTA update after the crash -- but that doesn't bring crash victims back to life.
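
For illustration only, here is a minimal sketch in C (hypothetical names and thresholds, not taken from the recall) of the kind of plausibility cross-check that would catch a gear/direction mismatch like the one above, but only if a designer thought to ask the question at design time:

```c
/* Minimal illustrative sketch: cross-check the displayed/commanded gear
 * against measured wheel motion instead of trusting the transmission
 * status alone. Hypothetical names and thresholds. */
#include <stdbool.h>

typedef enum { GEAR_PARK, GEAR_REVERSE, GEAR_NEUTRAL, GEAR_DRIVE } gear_t;

#define MOTION_THRESHOLD_MPS  0.5f   /* ignore creep and sensor noise */

/* wheel_speed_mps is signed: positive = forward, negative = backward */
bool gear_motion_plausible(gear_t displayed_gear, float wheel_speed_mps)
{
    if (wheel_speed_mps > -MOTION_THRESHOLD_MPS &&
        wheel_speed_mps <  MOTION_THRESHOLD_MPS) {
        return true;   /* essentially stationary; nothing to contradict */
    }
    if (displayed_gear == GEAR_DRIVE   && wheel_speed_mps < 0.0f) return false;
    if (displayed_gear == GEAR_REVERSE && wheel_speed_mps > 0.0f) return false;
    return true;
}
```

A supervisory function that sees this check fail might inhibit further acceleration and raise a fault rather than, as in the scenario above, accelerating into a collision. The hard part is not the few lines of code; it is having the safety engineering discipline to decide such a check is needed in the first place.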

In practice, the justification that it is OK to ship less-than-perfect automotive software has been that human drivers can compensate for problems. (In the ISO 26262 functional safety standard, one takes credit for "controllability" in reducing the risk of a potential defect.) When there is no human driver, that justification falls apart, and shipping defective software is more likely to result in harm to a vehicle occupant or other road user before anyone notices there is a problem for an OTA update to correct.

Right now, a significant challenge to OTA updates is the moral hazard that software will be a bit more dangerous than it should be due to pushing the boundaries of human drivers' ability to compensate for defects. With fully automated vehicles there will be a huge cliff in that compensating ability, and even small OTA update defects could result in large numbers of crashes across a fleet before there is time to correct the problem. (If you push a bad update to millions of cars, you can have a lot of crashes even in a single day from a defect that affects a common driving situation.)

The industry is going all-in on fast and loose OTAs to be more "agile" and iterate software changes more quickly without worrying as much about quality. But I think they're heading straight for a proverbial brick wall that they will hit when human drivers are taken out of the loop. Getting software quality right will become more important than ever for fully autonomous vehicles.