
Wednesday, December 7, 2022

SCSC Talk: Bootstrapping Safety Assurance

Bootstrapping Safety Assurance

Abstract:
The expense and general impracticability of doing enough real-world testing to demonstrate safety for autonomous systems motivates finding some sort of shortcut. A bootstrapped testing approach is often proposed, using evidence from initial mishap-free testing to argue that continued testing is safe enough. In this talk I'll explain why both pure bootstrapping based on testing exposure and arguments involving "probably perfect" bootstrapping expose public road users to undue risk. Moreover, phased deployments, which are often used to argue that updates are safe to release, have the same problem. An approach that bootstraps on the safety case rather than on vehicle testing is proposed as a potentially better alternative. While the examples given involve autonomous ground vehicles, the principles involved apply to any argument that safety will be demonstrated via a bootstrap testing process.

This talk was recorded as part of the SCSC Future of Testing for Safety-Critical Systems seminar on Dec. 1, 2022.
Talks and videos are available here (access with paid annual club membership):  https://scsc.uk/e966prog

Free public-access copy of slides here: 




Friday, May 20, 2022

A gentle introduction to autonomous vehicle safety cases

I recently ran into this readable article about AV safety cases by Thomas & Vandenberg from 2019. While things have changed a bit, it is still a reasonable introduction for anyone asking "what exactly would an AV safety case look like?"

A real industry-strength safety case is going to be complicated in many ways. In particular, there are many different approaches for breaking down G1, which will significantly affect the result. On the other hand, all the pieces will need to be there somewhere, so choosing this high-level breakdown is more of an architectural choice (for the safety case, not necessarily the system). We do not yet have a consensus on an optimal strategy for building such safety cases, but this is not a bad starting place from safety folks who were previously at Uber ATG.

Thomas & Vandenberg, Harnessing Uncertainty in Autonomous Vehicle Safety, Journal of System Safety, Vol. 55, No. 2 (2019)

https://doi.org/10.56094/jss.v55i2.46


(Uber ATG also published a much more complex safety case. However, if you are just getting started, I recommend this overview paper rather than that full safety case to get insight.)

Wednesday, May 18, 2022

SEAMS Keynote talk: Safety Performance Indicators and Continuous Improvement Feedback

Abstract: Successful autonomous ground vehicles will require a continuous improvement strategy after deployment. Feedback from road testing and deployed operation will be required to ensure enduring safety in the face of newly discovered rare events. Additionally, the operational environment will change over time, requiring the system design to adapt to new conditions. The need for ensuring life critical safety is likely to limit the amount of real time adaptation that can be relied upon. Beyond runtime responses, lifecycle safety approaches will need to incorporate significant field engineering feedback based on safety performance indicator monitoring.

A continuous monitoring and improvement approach will require a fundamental shift in the safety world-view for automotive applications. Previously, a useful fiction was maintained that vehicles were safe for their entire lifecycle when deployed, and any safety defect was an unwelcome surprise. This approach too often provoked denial and minimization of the risk presented by evidence of operational safety issues so as to avoid expensive recalls and blame. In the future, the industry will need to embrace a model in which issues are proactively detected and corrected in a way that avoids most loss events, and that uses field incident data as a primary driver of improvement. Responding to automatically generated field incident reports to avoid later losses should be a daily practice in the normal course of business rather than evidence of an engineering mistake for which blame is assigned. This type of engineering feedback approach should complement any on-board runtime adaptation and fault mitigation.






Tuesday, March 1, 2022

Maturation path for safety & security practices

Brief informal notes from a quick wrap-up position statement talk I gave at a workshop today.

Safety and security have a lot in common in terms of how they are maturing over time. Without getting into a religious debate about the difference between them, I note that their trajectory seems to include the following steps, especially for autonomous systems. I'd argue that each step is in a sense more mature than the previous one.

  1. Get the system to work. Safety/security can come later.
  2. Get the system to work almost all the time. Conflate this with safety/security even though you're still really just getting it to work in the common cases (safety for a vehicle is "doesn't hit stuff," while security is "doesn't get taken down by the usual continuous stream of automated attacks").
  3. Brute force problem fixes: fly/crash/fix/fly (air) and drive/crash/fix/drive (ground)
  4. Create a set of best practices in the nature of a building code ("build your system this way")
    • Create a useful fiction that you have completely characterized the requirements and operational environment and that your building code will always work.
    • Any failure is an embarrassing piece of bad news that violates the fiction of complete understanding.
  5. As the system matures, complain about false alarm safety/security shutdowns
    • It might feel like this means your system has problems, but in fact you're a lot safer and more secure than systems that operate oblivious to their vulnerabilities
  6. Start permitting breaking the building code standard rules by arguing that exceptions still result in equivalent safety/security
  7. Evolve to full-up deductive assurance cases to argue safety/security beyond building codes
    • Still maintain the fiction of complete knowledge of requirements and environment
  8. Start operating in more open environments and admit you didn't really understand the requirements or the environment
    • Spend a lot of time chasing down problems that reveal defects in your safety case (the safety case does not match environmental assumptions, or might not even match the deployed system)
  9. Switch to an inductive safety case approach:
    • Account for risk from epistemic uncertainty (unknown unknowns)
    • Instrument system for failure precursors (e.g., safety performance indicators tied to safety case claims)
    • Treat incidents as an opportunity to fix problems before there is a loss event.

Sunday, June 20, 2021

A More Precise Definition for ANSI/UL 4600 Safety Performance Indicators (SPIs)

Safety Performance Indicators (SPIs) are defined by chapter 16 of ANSI/UL 4600 in the context of autonomous vehicles as performance metrics that are specifically related to safety (4600 at 16.1.1.6.1).


This is a fairly general definition that is intended to encompass both leading metrics (e.g., number of failed detections of pedestrians for a single sensor channel) and lagging metrics (e.g., number of collisions in real world operation).  

However, it is so general that there can be a tendency to call metrics that are not related to safety SPIs when, more properly, they are really KPIs. As an example, ride quality smoothness when cornering is a Key Performance Indicator (KPI) that is highly desirable for passenger comfort. But it might have little or nothing to do with the crash rate for a particular vehicle. (It might be correlated: sloppy control might be associated with crashes, but it might not be.)

So we've come up with a more precise definition of SPI (with special thanks to Dr. Aaron Kane for long discussions and for crystallizing the concept).

An SPI is a metric supported by evidence that uses a threshold comparison to condition a claim in a safety case.

Let's break that down:

  • SPI - Safety Performance Indicator - a {metric, threshold} pair that measures some aspect of safety in an autonomous vehicle.
  • Metric - a value, typically related to one or more of product performance, design quality, process quality, or adherence to operational procedures. Often metrics are related to time (e.g., incidents per million km, maintenance mistakes per thousand repairs) but can also be related to particular versions (e.g., significant defects per thousand lines of code; unit test coverage; peer review effectiveness)
  • Evidence - the metric values are derived from measurement rather than theoretical calculations or other non-measurement sources
  • Threshold - a metric on its own is not an SPI because context within the safety case matters. For example, the number of false negative detections on a sensor is not an SPI by itself because it misses the part about how good it has to be to provide acceptable safety when fused with other sensor data in a particular vehicle's operational context. ("We have 1% false negatives on camera #1. Is that good enough? Well, it depends...") There is no limit to the complexity of the threshold, which might be, for example, whether a very complicated state space is inside or outside a safety envelope. But in the end the answer is some sort of comparison between the metric and the threshold that results in "true" or "false." (Analogous multi-valued operations and outputs are OK if you are using multi-valued logic in your safety case.) We call the state of an SPI output being "false" an SPI Violation.
  • Condition a claim - each SPI is associated with a claim in a safety case. If the SPI is true the claim is supported by the SPI. If the SPI is false then the associated claim has been falsified. (SPIs based on time series data could be true for a long time before going false, so this is a time and state dependent outcome in many cases.)
  • Safety case - Per ANSI/UL 4600 a safety case is "a structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment." In the context of that standard, anything that is related to safety is in the safety case. If it's not in the safety case, it is by definition not related to safety.
A direct conclusion of the above is that if a metric does not have a threshold, or does not condition a claim in a safety case, then it can't be an SPI.
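
To make the definition concrete, here is a minimal Python sketch of how an SPI might be represented. The claim identifier, metric source, threshold, and example values are all hypothetical, made up for illustration rather than taken from any real safety case.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class SPI:
        """A {metric, threshold} pair that conditions a claim in a safety case."""
        claim_id: str                  # the safety case claim this SPI conditions (hypothetical ID)
        metric: Callable[[], float]    # measurement derived from evidence, not theory
        threshold: float               # how good the metric must be in this safety case context
        higher_is_worse: bool = True   # direction of the threshold comparison

        def evaluate(self) -> bool:
            """True means the claim is supported; False is an SPI violation."""
            value = self.metric()
            return value <= self.threshold if self.higher_is_worse else value >= self.threshold

    # Hypothetical example: camera false negative rate conditioning a perception claim.
    camera_fn_spi = SPI(
        claim_id="G2.3-camera-detection",
        metric=lambda: 0.012,          # e.g., measured false negative fraction from field data
        threshold=0.01,
    )

    if not camera_fn_spi.evaluate():
        print(f"SPI violation: claim {camera_fn_spi.claim_id} is no longer supported")

The sketch captures the key point: the metric alone is not the SPI; the threshold comparison and the link to a specific claim are what make it one.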

Less formally, the point of an SPI is that you've built up a safety case, but there is always the chance you missed something in the safety case argument (forgot a relevant reason why a claim might not be true), or made an assumption that isn't as true as you thought it was in the real world, or otherwise have some sort of a problem with your safety case. An SPI violation amounts to: "Well, you thought you had everything covered and this thing (claim) was always true. And yet, here we are with the claim being false when we encountered a particular unforeseen situation in validation or real world operation. Better update your safety argument!"

In other words, an SPI is a measurement you take to make sure that if your safety case is invalidated, you'll detect it, notice that your safety case has a problem, and fix it.

An important point of all this is that not every metric is an SPI. "SPI" is a very specific term. The rest are all KPIs.

KPIs can be very useful, for example in measuring progress toward a functional system. But they are not SPIs unless they meet the definition given above.

NOTES:

The ideas in this posting are due in large part to the efforts of Dr. Aaron Kane. He should be cited as a co-author of this work.

(1) Aviation uses SPI for metrics related to the operational phase and Safety Management System (SMS) activities. The definition given here is rooted in ANSI/UL 4600 and is a superset of the aviation usage, including technical metrics and design cycle metrics as well as operational metrics.

(2) In this formulation an SPI is not quite the same as a safety monitor. It might well be that some SPI violations also happen to trigger a vehicle system shutdown. But for many SPI violations there might not be anything actionable at the individual vehicle level. Indeed, some SPI violations might only be detectable at the fleet level in retrospect. For example, if you have a budget of 1 incident per 100 million km of a particular type, an individual vehicle having such an incident does not necessarily mean the safety case has been invalidated. Rather, you need to look across the fleet data history to see if such an incident just happens to be that budgeted one in 100 million based on operational exposure, or is part of a trend of too many such incidents.
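
As a rough illustration of that fleet-level judgment, here is a small Python sketch assuming a simple Poisson model. The exposure, incident count, and alerting threshold are all made up for the example.

    from math import exp, factorial

    def poisson_tail(k_observed: int, expected: float) -> float:
        """P(at least k_observed events) for a Poisson process with the given expected count."""
        return 1.0 - sum(exp(-expected) * expected**i / factorial(i) for i in range(k_observed))

    # Hypothetical numbers: budget of 1 incident per 100 million km,
    # fleet exposure of 250 million km, 6 such incidents observed so far.
    budget_per_km = 1 / 100_000_000
    fleet_km = 250_000_000
    observed = 6

    expected = budget_per_km * fleet_km     # 2.5 incidents expected if the budget holds
    p = poisson_tail(observed, expected)    # chance of seeing this many at the budgeted rate

    if p < 0.05:                            # illustrative alerting threshold
        print(f"Possible SPI violation: {observed} incidents vs. {expected:.1f} budgeted (p={p:.3f})")
    else:
        print(f"Within budget so far: {observed} observed, {expected:.1f} expected (p={p:.3f})")

A single incident would not trip this check, but a trend well above the budgeted rate would.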

(3) We pronounce "SPI" as "S-P-I" rather than "spy" after a very confusing conversation in which we realized we needed to explain to a government official that we were not actually proposing that the CIA become involved with validating autonomous vehicle safety.


Sunday, December 13, 2020

Safety Performance Indicator (SPI) metrics (Metrics Episode 14)

SPIs help ensure that assumptions in the safety case are valid, that risks are being mitigated as effectively as you thought they would be, and that fault and failure responses are actually working the way you thought they would.

Safety Performance Indicators, or SPIs, are safety metrics defined in the Underwriters Laboratories 4600 standard. The 4600 SPI approach covers a number of different ways to approach safety metrics for a self-driving car, divided into several categories.

One type of 4600 SPI safety metric is a system-level safety metric. Some of these are lagging metrics such as the number of collisions, injuries, and fatalities. But others have some leading-metric characteristics because, while they're taken during deployment, they're intended to predict loss events. Examples are incidents for which no loss occurs, sometimes called near misses or near hits, and the number of traffic rule violations. While by definition neither of these actually results in a loss, it's a pretty good bet that if you have many, many near misses and many traffic-rule infractions, eventually something worse will happen.
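
A minimal sketch of such leading metrics might look like the following. The exposure numbers and thresholds are entirely made up for illustration.

    # Hypothetical leading-metric SPIs: near misses and traffic rule violations
    # normalized per million km. All numbers and thresholds are illustrative.

    def per_million_km(count: int, km: float) -> float:
        return count / (km / 1_000_000)

    fleet_km = 4_200_000
    near_misses = 37
    rule_violations = 12

    leading_spis = {
        "near_misses_per_Mkm": (per_million_km(near_misses, fleet_km), 5.0),
        "rule_violations_per_Mkm": (per_million_km(rule_violations, fleet_km), 2.0),
    }

    for name, (value, threshold) in leading_spis.items():
        status = "OK" if value <= threshold else "SPI VIOLATION"
        print(f"{name}: {value:.1f} (threshold {threshold}) -> {status}")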

Another type of 4600 metric is intended to deal with ineffective risk mitigation. An important type of SPI measures whether hazards and faults are occurring more frequently in the field than expected.

Here's a narrow but concrete example. Let's assume your design takes into account that you might lose one in a million network packets due to corrupted data being detected. But out in the field, you're dropping every tenth network packet. Something's clearly wrong, and there's a pretty good chance that undetected errors are slipping through. You need to do something about that situation to maintain safety.
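
A minimal sketch of that packet-loss check might look like this; the assumed loss rate, alert margin, and field counts are all illustrative.

    # Hypothetical SPI: the design analysis assumed roughly 1 in a million packets lost
    # to detected corruption; flag a field rate far above that assumption.

    ASSUMED_LOSS_RATE = 1e-6      # from the (hypothetical) design analysis
    ALERT_FACTOR = 10             # illustrative margin before declaring a violation

    packets_sent = 5_000_000
    packets_dropped = 500_000     # the "every tenth packet" scenario described above

    observed_rate = packets_dropped / packets_sent
    if observed_rate > ASSUMED_LOSS_RATE * ALERT_FACTOR:
        print(f"SPI violation: observed packet loss {observed_rate:.2%} "
              f"vs. assumed {ASSUMED_LOSS_RATE:.0e}")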

A broader example is that a very rare hazard might be deemed not to be risky because it just essentially never happens. But just because you think it almost never happens doesn’t mean that’s what happens in the real world. You need to take data to make sure that something you thought would happen to one vehicle in the fleet every hundred years isn’t in fact happening every day to someone, because if that’s the case, you badly misestimated your risk.

Another type of SPI for field data is measuring how often components fail or behave badly. For example, you might have two redundant computers so that if one crashes, the other one will keep working. Suppose one of those computers is failing every 10 minutes. You might drive around for an entire day and not really notice there's a problem because there's always a second computer there for you. But if your calculations assume a failure once a year and it's failing every 10 minutes, you're going to get unlucky and have both fail at the same time a lot sooner than you expected.

So it’s important to know that you have an underlying problem, even though it’s being masked by the fault tolerance strategy.
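
Here is a rough back-of-the-envelope sketch of why that matters, assuming independent random failures, a fixed recovery time, and entirely illustrative numbers.

    HOURS_PER_YEAR = 8766

    def mean_time_to_double_failure(failure_rate_per_hr: float, recovery_hr: float) -> float:
        # Either computer fails (rate 2*lambda), and the other fails during the
        # first one's recovery window (probability ~ lambda * recovery time).
        double_failure_rate = 2 * failure_rate_per_hr * (failure_rate_per_hr * recovery_hr)
        return 1 / double_failure_rate

    recovery_hr = 0.05   # ~3 minutes to detect and restart the failed computer (illustrative)

    assumed = mean_time_to_double_failure(1 / HOURS_PER_YEAR, recovery_hr)   # once a year
    observed = mean_time_to_double_failure(6.0, recovery_hr)                 # every 10 minutes

    print(f"At the assumed rate, both computers are down together roughly every "
          f"{assumed / HOURS_PER_YEAR:,.0f} years")
    print(f"At the observed rate, both are down together roughly every {observed:.1f} hours")

The exact model is not the point; the point is that a failure rate error masked by redundancy collapses the expected time to a coincident failure from essentially never to well within a single day of operation.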

A related type of SPI has to do with classification algorithm performance for self-driving cars. When you're doing your safety analysis, it's likely you're assuming certain false positive and false negative rates for your perception system. But just because you see those rates in testing doesn't mean you'll see them in the real world, especially if the operational design domain changes and new things pop up that you didn't train on. So you need an SPI to monitor the false negative and false positive rates to make sure that they don't change from what you expected.

Now, you might be asking: how do you figure out false negatives if you never detected the object in the first place? In fact, there's a way to approach this problem with automatic detection. Let's say that you have three different types of sensors for redundancy, and you vote the three sensors and go with the majority. That means every once in a while one of the sensors can be wrong and you still get safe behavior. But what you want to do is measure how often that single-sensor miss happens, because if it happens frequently, or the faults on that sensor correlate with certain types of objects, those are important things to know to make sure your safety case is still valid.
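
A minimal sketch of that idea is to log which channel gets outvoted. The three channels and the per-frame detections below are hypothetical stand-ins.

    from collections import Counter

    # Hypothetical sketch: vote three redundant sensor channels per object and record
    # which channel disagreed with the majority, giving a per-channel miss estimate
    # without needing ground truth labels.

    def vote(detections: dict[str, bool]) -> tuple[bool, list[str]]:
        """Return the majority detection result and the channels that disagreed with it."""
        majority = Counter(detections.values()).most_common(1)[0][0]
        dissenters = [ch for ch, seen in detections.items() if seen != majority]
        return majority, dissenters

    frames = [
        {"camera": True,  "lidar": True,  "radar": True},
        {"camera": False, "lidar": True,  "radar": True},   # camera missed the object
        {"camera": True,  "lidar": True,  "radar": False},  # radar missed the object
        {"camera": False, "lidar": True,  "radar": True},   # camera missed again
    ]

    outvoted = Counter()
    for frame in frames:
        _, dissenters = vote(frame)
        outvoted.update(dissenters)

    # These counts feed an SPI that compares each channel's estimated miss rate
    # against the rate assumed in the safety case.
    print(dict(outvoted))   # e.g., {'camera': 2, 'radar': 1}

The same tally can be bucketed by object type to catch the correlated-fault case mentioned above.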

A third type of 4600 metric is intended to measure how often surprises are encountered. There's another segment on surprises, but examples are the frequency at which an object is classified with poor confidence, or at which a safety-relevant object flickers between classifications. These give you a hint that something is wrong with your perception system and that it's struggling with some type of object. If this happens constantly, it indicates a problem with the perception system, or it might indicate that the environment has changed and includes novel objects not accounted for by the training data. Either way, monitoring for excessive perception issues is important so you know that your perception performance is degraded, even if an underlying tracking system or other mechanism is keeping your system safe.
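
One such surprise metric, counting classification flicker on a tracked object, might be sketched like this. The labels, track data, and threshold are all hypothetical.

    # Hypothetical surprise SPI input: count "flicker" events, where a tracked object's
    # classification changes between consecutive frames.

    def count_flickers(label_history: list[str]) -> int:
        """Number of frame-to-frame label changes for a single tracked object."""
        return sum(1 for prev, cur in zip(label_history, label_history[1:]) if prev != cur)

    track = ["pedestrian", "pedestrian", "bicycle", "pedestrian", "unknown", "pedestrian"]
    flickers = count_flickers(track)

    FLICKER_THRESHOLD = 2   # illustrative per-track threshold
    if flickers > FLICKER_THRESHOLD:
        print(f"Surprise SPI flag: {flickers} label changes on one track")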

A fourth type of 4600 metric is related to recoveries from faults and failures. It is common to argue that safety-critical systems are in fact safe because they use fail-safes and fall-back operational modes. So if something bad happens, you argue that the system will do something safe. It’s good to have metrics that measure how often those mechanisms are in fact invoked, because if they’re invoked more often than you expected, you might be taking more risks than you thought. It’s also important to measure how often they actually work. Nothing’s going to be perfect. And if you’re assuming they work 99% of the time but they only work 90% of the time, that dramatically changes your safety calculations.
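
A minimal sketch of those two fall-back SPIs could look like this, with illustrative assumed rates and made-up field counts.

    # Hypothetical SPIs for a fall-back mechanism: how often it is invoked, and how
    # often it actually works, compared against what the safety analysis assumed.

    operating_hours = 10_000
    fallback_invocations = 140
    successful_fallbacks = 126

    invocation_rate = fallback_invocations / operating_hours
    success_rate = successful_fallbacks / fallback_invocations

    ASSUMED_INVOCATION_RATE = 0.005   # per hour, from the (hypothetical) safety analysis
    ASSUMED_SUCCESS_RATE = 0.99       # effectiveness credited in the safety argument

    if invocation_rate > ASSUMED_INVOCATION_RATE:
        print(f"Fall-back invoked {invocation_rate:.3f}/hr vs. assumed {ASSUMED_INVOCATION_RATE}/hr")
    if success_rate < ASSUMED_SUCCESS_RATE:
        print(f"Fall-back success rate {success_rate:.0%} vs. assumed {ASSUMED_SUCCESS_RATE:.0%}")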

It's useful to differentiate between two related concepts. One is safety performance indicators, SPIs, which is what I've been talking about. The other is key performance indicators, KPIs. KPIs are used in project management and are very useful for measuring product performance and the utility provided to the customer. KPIs are a great way of tracking whether you're making progress on the intended functionality and general product quality, but not every KPI is useful for safety. For example, a KPI for fuel economy is great stuff, but normally it doesn't have that much to do with safety.

In contrast, an SPI is supposed to be something that's directly traced to parts of the safety case and provides evidence for the safety case. Different types of SPIs help make sure that the assumptions in the safety case are valid, that risks are being mitigated as effectively as you thought they would be, and that fault and failure responses are actually working the way you thought they would. Overall, SPIs have more to do with whether the safety case is valid and whether the rate of unknown surprise arrivals is tolerable. All these areas need to be addressed one way or another to deploy a safe self-driving car.