
Sunday, June 20, 2021

A More Precise Definition for ANSI/UL 4600 Safety Performance Indicators (SPIs)

Safety Performance Indicators (SPIs) are defined by chapter 16 of ANSI/UL 4600 in the context of autonomous vehicles as performance metrics that are specifically related to safety (4600 at 16.1.1.6.1).

Woman in Taxi looking at phone.  Photo by Uriel Mont from Pexels

This is a fairly general definition that is intended to encompass both leading metrics (e.g., number of failed detections of pedestrians for a single sensor channel) and lagging metrics (e.g., number of collisions in real world operation).  

However, that definition is so general that there can be a tendency to call metrics that are not actually related to safety SPIs when, more properly, they are KPIs. As an example, ride quality smoothness when cornering is a Key Performance Indicator (KPI) that is highly desirable for passenger comfort, but it might have little or nothing to do with the crash rate for a particular vehicle. (The two might be correlated -- sloppy control might be associated with crashes -- but they might not be.)

So we've come up with a more precise definition of SPI (with special thanks to Dr. Aaron Kane for long discussions and crystallizing the concept).

An SPI is a metric supported by evidence that uses a
threshold comparison to condition a claim in a safety case.

Let's break that down:

  • SPI - Safety Performance Indicator - a {metric, threshold} pair that measures some aspect of safety in an autonomous vehicle.
  • Metric - a value, typically related to one or more of product performance, design quality, process quality, or adherence to operational procedures. Metrics are often related to time (e.g., incidents per million km, maintenance mistakes per thousand repairs) but can also be related to particular versions (e.g., significant defects per thousand lines of code; unit test coverage; peer review effectiveness).
  • Evidence - the metric values are derived from measurement rather than theoretical calculations or other non-measurement sources.
  • Threshold - a metric on its own is not an SPI because context within the safety case matters. For example, the raw number of false negative detections on a sensor is not an SPI, because it omits how good detection has to be to provide acceptable safety when fused with other sensor data in a particular vehicle's operational context. ("We have 1% false negatives on camera #1. Is that good enough? Well, it depends...") There is no limit to the complexity of the threshold, which might be, for example, whether a very complicated state space is inside or outside a safety envelope.  But in the end the answer is some sort of comparison between the metric and the threshold that results in "true" or "false."  (Analogous multi-valued operations and outputs are OK if you are using multi-valued logic in your safety case.) We call the state of an SPI output being "false" an SPI Violation.
  • Condition a claim - each SPI is associated with a claim in a safety case. If the SPI is true the claim is supported by the SPI. If the SPI is false then the associated claim has been falsified. (SPIs based on time series data could be true for a long time before going false, so this is a time and state dependent outcome in many cases.)
  • Safety case - Per ANSI/UL 4600 a safety case is "a structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment." In the context of that standard, anything that is related to safety is in the safety case. If it's not in the safety case, it is by definition not related to safety.
A direct conclusion of the above is that if a metric does not have a threshold, or does not condition a claim in a safety case, then it can't be an SPI.
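
To make that breakdown concrete, here is a minimal sketch in Python of an SPI as a {metric, threshold} pair whose comparison result conditions a claim, and whose "false" result is an SPI violation. All names and numbers are hypothetical illustrations, not anything specified by ANSI/UL 4600.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class SPI:
        claim_id: str                               # safety case claim this SPI conditions
        metric: Callable[[], float]                 # evidence: a measured value, not a theoretical one
        threshold: float                            # how good the metric must be per the safety case
        acceptable: Callable[[float, float], bool]  # the threshold comparison

        def evaluate(self) -> bool:
            """True: the claim remains supported. False: an SPI violation."""
            return self.acceptable(self.metric(), self.threshold)

    # Hypothetical example: camera false negative rate must stay under the budget
    # assumed by the sensor fusion argument for this vehicle's operational context.
    def measured_camera_false_negative_rate() -> float:
        return 42 / 10_000   # stand-in for a value derived from field measurement

    camera_fn_spi = SPI(
        claim_id="camera false negatives within the sensor fusion budget",
        metric=measured_camera_false_negative_rate,
        threshold=0.01,
        acceptable=lambda value, limit: value < limit,
    )

    if camera_fn_spi.evaluate():
        print("Claim remains supported by this SPI")
    else:
        print(f"SPI violation -- revisit safety case claim: {camera_fn_spi.claim_id}")

The point the sketch captures is that the metric by itself is not the SPI; the SPI is the metric plus the threshold comparison, tied to a specific claim in the safety case.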

Less formally, the point of an SPI is that you've built up a safety case, but there is always the chance you missed something in the safety case argument (forgot a relevant reason why a claim might not be true), or made an assumption that isn't as true as you thought it was in the real world, or otherwise have some sort of a problem with your safety case. An SPI violation amounts to: "Well, you thought you had everything covered and this thing (claim) was always true. And yet, here we are with the claim being false when we encountered a particular unforeseen situation in validation or real world operation. Better update your safety argument!"

In other words, an SPI is a measurement you take to make sure that if your safety case is invalidated you'll notice the problem so that you can fix it.

An important point of all this is that not every metric is an SPI. SPI is a very specific term; everything else is a KPI.

KPIs can be very useful, for example in measuring progress toward a functional system. But they are not SPIs unless they meet the definition given above.

NOTES:

The ideas in this posting are due in large part to efforts of Dr. Aaron Kane. He should be cited as a co-author of this work.

(1) Aviation uses SPI for metrics related to the operational phase and Safety Management System (SMS) activities. The definition given here is rooted in ANSI/UL 4600 and is a superset of the aviation use, including technical metrics and design cycle metrics as well as operational metrics.

(2) In this formulation an SPI is not quite the same as a safety monitor. It might well be that some SPI violations also happen to trigger a vehicle system shutdown. But for many SPI violations there might not be anything actionable at the individual vehicle level. Indeed, some SPI violations might only be detectable at the fleet level in retrospect. For example, if you have a budget of 1 incident of a particular type per 100 million km, an individual vehicle having such an incident does not necessarily mean the safety case has been invalidated. Rather, you need to look across the fleet data history to see whether such an incident just happens to be the budgeted one in 100 million based on operational exposure, or is part of a trend of too many such incidents.
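
As a small illustration of that fleet-level reasoning, the sketch below checks whether an observed incident count is plausible under a budgeted rate of 1 incident per 100 million km given accumulated fleet exposure. The numbers and the use of a Poisson tail check are our own illustrative choices, not something prescribed by ANSI/UL 4600.

    import math

    def poisson_tail_probability(observed: int, expected: float) -> float:
        """P(X >= observed) for a Poisson process with the budgeted expected count."""
        cumulative = sum(math.exp(-expected) * expected**i / math.factorial(i)
                         for i in range(observed))
        return 1.0 - cumulative

    BUDGET_PER_KM = 1 / 100_000_000      # budgeted incident rate from the safety case
    fleet_exposure_km = 250_000_000      # hypothetical accumulated fleet exposure
    observed_incidents = 2               # hypothetical incidents of this type

    expected = BUDGET_PER_KM * fleet_exposure_km
    p_value = poisson_tail_probability(observed_incidents, expected)

    # One possible threshold comparison: flag an SPI violation only if this many
    # incidents would be implausible (e.g., < 5% likely) under the budgeted rate.
    if p_value < 0.05:
        print("SPI violation: incident trend exceeds the budgeted rate -- revisit the safety case")
    else:
        print(f"Within budget so far (P(X >= {observed_incidents}) = {p_value:.2f})")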

(3) We pronounce "SPI" as "S-P-I" rather than "spy" after a very confusing conversation in which we realized we needed to explain to a government official that we were not actually proposing that the CIA become involved with validating autonomous vehicle safety.


Tuesday, March 5, 2019

The Human Filter Pitfall for autonomous system safety

The Human Filter Pitfall:
Data from human-driven vehicles has gaps corresponding to situations a human knows to avoid getting into in the first place, but which an autonomous system might experience.



The operational history (and thus the failure history) of many systems is filtered by human control actions, preemptive incident avoidance actions and exposure to operator-specific errors. This history might not cover the entirety of the required functionality, might primarily cover the system in a comparatively low risk environment, and might under-represent failures that manifest infrequently with human operators present.

From a proven-in-use perspective, data from human-operated vehicles, such as historical crash data, might be insufficient to establish safety requirements or a basis for autonomous vehicle safety metrics. The situations in which human-operated vehicles have trouble may not be the same situations that autonomous systems find difficult. When looking at existing data about situations that humans get wrong, it is easy to overlook the situations humans are good at navigating but which may cause problems for autonomous systems. An autonomous system cannot be validated only against problematic human driving scenarios (e.g., the NHTSA (2007) pre-crash typology). The autonomy might handle these hazardous situations perfectly yet fail often in otherwise common situations that humans regularly perform safely. Thus, an argument that a system is safe solely because it has been checked to properly handle situations that have high rates of human mishaps is incomplete: it does not address the possibility of new types of mishaps.

This pitfall can also occur when arguing safety for existing components being used in a new system. For example, consider a safety shutdown system used as a backup to a human operator, such as an Automated Emergency Braking (AEB) system. It might be that human operators tend to systematically avoid putting an AEB system in some particular situation that it has trouble handling. As a hypothetical example, consider an AEB system that has trouble operating effectively when encountering obstacles in tight curves. If human drivers habitually slow down on such curves, there might be no significant historical data indicating this is a potential problem. Autonomous vehicles that operate at the vehicle dynamics limit rather than at a sight distance limit on such curves will then be exposed to collisions, because the bias in the historical operational data hides the AEB weakness. A proven-in-use argument in such a situation has a systematic bias and is based on incomplete evidence. It could be unsafe, for example, to base a safety argument primarily upon that AEB system for a fully autonomous vehicle, since the vehicle would be exposed to situations that would normally be pre-emptively handled by a human driver, even though field data does not directly reveal this as a problem.
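
As a toy illustration of how such a gap can hide in field data (hypothetical scenario names; this sketch is not part of the excerpted paper), one way to surface the problem is to compare the scenario classes the autonomous vehicle will encounter in its operational domain against the scenario classes actually represented in human-driven operational data:

    # Hypothetical scenario classes the autonomous vehicle will face in its operational domain.
    odd_scenarios = {
        "obstacle beyond sight distance on tight curve",
        "tight curve taken at vehicle dynamics limit",
        "pedestrian at mid-block crossing",
        "rear-end approach on highway",
    }

    # Hypothetical scenario classes represented in historical human-driven data
    # (crashes and near misses). Humans slowed down on tight curves, so those
    # scenarios never show up as problems in the data.
    human_data_scenarios = {
        "pedestrian at mid-block crossing",
        "rear-end approach on highway",
    }

    for scenario in sorted(odd_scenarios - human_data_scenarios):
        print(f"No human-driven evidence for: {scenario} -- needs dedicated validation")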

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • NHTSA (2007) Pre-Crash Scenario Typology for Crash Avoidance Research, National Highway Traffic Safety Administration, DOT HS-810-767, April 2007.