Safe Autonomy: feedback

Wednesday, May 18, 2022

SEAMS Keynote talk: Safety Performance Indicators and Continuous Improvement Feedback

Abstract: Successful autonomous ground vehicles will require a continuous improvement strategy after deployment. Feedback from road testing and deployed operation will be required to ensure enduring safety in the face of newly discovered rare events. Additionally, the operational environment will change over time, requiring the system design to adapt to new conditions. The need to ensure life-critical safety is likely to limit how much real-time adaptation can be relied upon. Beyond runtime responses, lifecycle safety approaches will need to incorporate significant field engineering feedback based on safety performance indicator monitoring.

A continuous monitoring and improvement approach will require a fundamental shift in the safety world-view for automotive applications. Previously, a useful fiction was maintained that vehicles were safe for their entire lifecycle when deployed, and any safety defect was an unwelcome surprise. This approach too often provoked denial and minimization of the risk presented by evidence of operational safety issues so as to avoid expensive recalls and blame. In the future, the industry will need to embrace a model in which issues are proactively detected and corrected in a way that avoids most loss events, and that uses field incident data as a primary driver of improvement. Responding to automatically generated field incident reports to avoid later losses should be a daily practice in the normal course of business rather than evidence of an engineering mistake for which blame is assigned. This type of engineering feedback approach should complement any on-board runtime adaptation and fault mitigation.

Talk video: https://youtu.be/mRXotHN0Z6I
Slides: https://users.ece.cmu.edu/~koopman/lectures/L127-2022-04-SEAMS-Feedback-SPIs.pdf
Archive.org downloadable mirror: http://archive.org/details/l127-safety-performance-indicators-and-continuous-improvement-feedback_202205

Tuesday, March 1, 2022

Maturation path for safety & security practices

Brief informal notes from a quick wrap-up position statement talk I gave at a workshop today.

Safety and security have a lot in common in how they are maturing over time. Without getting into a religious debate about the difference between them, I note that their trajectories seem to include the following steps, especially for autonomous systems. I'd argue that each step is in a sense more mature than the previous step.

1. Get the system to work. Safety/security can come later.
2. Get the system to work almost all the time. Conflate this with safety/security even though you're still really just getting it to work in the common cases (safety for a vehicle is "doesn't hit stuff" while security is "doesn't get taken down by the usual continuous stream of automated attacks").
3. Brute-force problem fixes: fly/crash/fix/fly (air) and drive/crash/fix/drive (ground).
4. Create a set of best practices in the nature of a building code ("build your system this way").
   - Create a useful fiction that you have completely characterized the requirements and operational environment, and that your building code will always work.
   - Any failure is an embarrassing piece of bad news that violates the fiction of complete understanding.
5. As the system matures, complain about false-alarm safety/security shutdowns.
   - It might feel like this means your system has problems, but in fact you're a lot more safe and secure than systems that operate oblivious to their vulnerabilities.
6. Start permitting exceptions to the building code rules by arguing that the exceptions still result in equivalent safety/security.
7. Evolve to full-up deductive assurance cases to argue safety/security beyond building codes.
   - Still maintain the fiction of complete knowledge of requirements and environment.
8. Start operating in more open environments and admit you didn't really understand the requirements or the environment.
   - Spend a lot of time chasing down problems that reveal defects in your safety case (the safety case does not match environmental assumptions, or might not even match the deployed system).
9. Switch to an inductive safety case approach:
   - Account for risk from epistemic uncertainty (unknown unknowns).
   - Instrument the system for failure precursors (e.g., safety performance indicators tied to safety case claims).
   - Treat incidents as an opportunity to fix problems before there is a loss event.
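
To make the instrumentation idea a bit more concrete, here is a minimal Python sketch of a safety performance indicator tied to a safety case claim. Everything in it is an assumption for illustration: the class name, the field names, the example claims, and the thresholds are hypothetical and are not taken from the talk or from any standard. It also compares a raw observed rate against a threshold, which ignores statistical confidence; a real monitoring program would need to deal with that.

```python
from dataclasses import dataclass


@dataclass
class SafetyPerformanceIndicator:
    """Hypothetical SPI: a precursor event rate checked against a threshold
    that a particular safety case claim depends on."""
    claim: str                   # safety case claim this SPI supports
    max_events_per_hour: float   # rate the claim assumes will not be exceeded
    events: int = 0              # precursor events observed so far
    exposure_hours: float = 0.0  # operating hours observed so far

    def record(self, events: int, hours: float) -> None:
        """Accumulate field data (e.g., from automatically generated reports)."""
        self.events += events
        self.exposure_hours += hours

    def observed_rate(self) -> float:
        """Events per operating hour; 0.0 until there is any exposure."""
        if self.exposure_hours == 0:
            return 0.0
        return self.events / self.exposure_hours

    def is_violated(self) -> bool:
        """True when field data contradicts the assumption behind the claim."""
        return self.observed_rate() > self.max_events_per_hour


def violated_claims(spis):
    """Return the claims whose SPIs are violated, as input to engineering feedback."""
    return [spi.claim for spi in spis if spi.is_violated()]


# Illustrative data only: two made-up claims with made-up thresholds.
spis = [
    SafetyPerformanceIndicator("Pedestrians are detected in time to stop", 0.001),
    SafetyPerformanceIndicator("Vehicle stays within its lane", 0.01),
]
spis[0].record(events=2, hours=1000.0)  # 0.002 per hour, above the 0.001 threshold
spis[1].record(events=5, hours=1000.0)  # 0.005 per hour, below the 0.01 threshold
print(violated_claims(spis))
```

The point of the design is that a violated SPI points back at a specific claim in the safety case, so the incident becomes a prompt to revisit that claim before there is a loss event rather than evidence of a mistake to be blamed on someone.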