
Sunday, June 20, 2021

A More Precise Definition for ANSI/UL 4600 Safety Performance Indicators (SPIs)

Safety Performance Indicators (SPIs) are defined by chapter 16 of ANSI/UL 4600 in the context of autonomous vehicles as performance metrics that are specifically related to safety (4600 at 16.1.1.6.1).

Woman in Taxi looking at phone.  Photo by Uriel Mont from Pexels

This is a fairly general definition that is intended to encompass both leading metrics (e.g., number of failed detections of pedestrians for a single sensor channel) and lagging metrics (e.g., number of collisions in real world operation).  

However, it is so general that there can be a tendency to label metrics that are not related to safety as SPIs when, more properly, they are KPIs. As an example, ride quality smoothness when cornering is a Key Performance Indicator (KPI) that is highly desirable for passenger comfort. But it might have little or nothing to do with the crash rate for a particular vehicle. (It might be correlated -- sloppy control might be associated with crashes -- but it might not be.)

So we've come up with a more precise definition of SPI (with special thanks to Dr. Aaron Kane for long discussions and for crystallizing the concept).

An SPI is a metric supported by evidence that uses a
threshold comparison to condition a claim in a safety case.

Let's break that down:

  • SPI - Safety Performance Indicator - a {metric, threshold} pair that measures some aspect of safety in an autonomous vehicle.
  • Metric - a value, typically related to one or more of product performance, design quality, process quality, or adherence to operational procedures. Often metrics are rates accumulated over operational exposure (e.g., incidents per million km, maintenance mistakes per thousand repairs) but can also be tied to particular versions (e.g., significant defects per thousand lines of code; unit test coverage; peer review effectiveness)
  • Evidence - the metric values are derived from measurement rather than theoretical calculations or other non-measurement sources
  • Threshold - a metric on its own is not an SPI because context within the safety case matters. For example, the false negative detection rate for a sensor is not, by itself, an SPI because it misses how good that rate has to be to provide acceptable safety when fused with other sensor data in a particular vehicle's operational context. ("We have 1% false negatives on camera #1. Is that good enough? Well, it depends...") There is no limit to the complexity of the threshold, which might be, for example, whether a very complicated state space is either inside or outside a safety envelope.  But in the end the answer is some sort of comparison between the metric and the threshold that results in "true" or "false."  (Analogous multi-valued operations and outputs are OK if you are using multi-valued logic in your safety case.) We call the state of an SPI output being "false" an SPI Violation.
  • Condition a claim - each SPI is associated with a claim in a safety case. If the SPI is true the claim is supported by the SPI. If the SPI is false then the associated claim has been falsified. (SPIs based on time series data could be true for a long time before going false, so this is a time and state dependent outcome in many cases.)
  • Safety case - Per ANSI/UL 4600 a safety case is "a structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment." In the context of that standard, anything that is related to safety is in the safety case. If it's not in the safety case, it is by definition not related to safety.
A direct conclusion of the above is that if a metric does not have a threshold, or does not condition a claim in a safety case, then it can't be an SPI.
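
To make that definition concrete, here is a minimal sketch (in Python) of a {metric, threshold} pair conditioning a claim; the class, names, and example numbers are illustrative assumptions, not anything specified by ANSI/UL 4600.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SPI:
    """A {metric, threshold} pair that conditions one claim in the safety case."""
    claim: str                           # the safety case claim this SPI conditions
    metric: Callable[[], float]          # measurement-derived value (the evidence)
    threshold: Callable[[float], bool]   # comparison; False means an SPI violation

    def evaluate(self) -> bool:
        """True: the claim remains supported.  False: SPI violation."""
        return self.threshold(self.metric())

def camera_false_negative_rate() -> float:
    # Stand-in for a real measurement pipeline; fixed value for illustration.
    return 0.012   # 1.2% missed detections

pedestrian_spi = SPI(
    claim="Camera channel misses few enough pedestrians for safe sensor fusion",
    metric=camera_false_negative_rate,
    threshold=lambda rate: rate < 0.01,   # context-specific acceptability threshold
)

if not pedestrian_spi.evaluate():
    print(f"SPI violation: revisit claim '{pedestrian_spi.claim}'")
```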

Less formally, the point of an SPI is that you've built up a safety case, but there is always the chance you missed something in the safety case argument (forgot a relevant reason why a claim might not be true), or made an assumption that isn't as true as you thought it was in the real world, or otherwise have some sort of a problem with your safety case. An SPI violation amounts to: "Well, you thought you had everything covered and this thing (claim) was always true. And yet, here we are with the claim being false when we encountered a particular unforeseen situation in validation or real world operation. Better update your safety argument!"

In other words, an SPI is a measurement you take to make sure that if your safety case is invalidated you'll detect the problem so that you can fix it.

An important point of all this is that not every metric is an SPI. SPI is a very specific term; the rest are all KPIs.

KPIs can be very useful, for example in measuring progress toward a functional system. But they are not SPIs unless they meet the definition given above.

NOTES:

The ideas in this posting are due in large part to efforts of Dr. Aaron Kane. He should be cited as a co-author of this work.

(1) Aviation uses SPI for metrics related to the operational phase and Safety Management System (SMS) activities. The definition given here is rooted in ANSI/UL 4600 and is a superset of the aviation use, including technical metrics and design cycle metrics as well as operational metrics.

(2) In this formulation an SPI is not quite the same as a safety monitor. It might well be that some SPI violations also happen to trigger a vehicle system shutdown. But for many SPI violations there might not be anything actionable at the individual vehicle level. Indeed, some SPI violations might only be detectable at the fleet level in retrospect. For example, if you have a budget of 1 incident per 100 million km of a particular type, an individual vehicle having such an incident does not necessarily mean the safety case has been invalidated. Rather, you need to look across the fleet data history to see if such an incident just happens to be that budgeted one in 100 million based on operational exposure, or is part of a trend of too many such incidents.
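
As a sketch of what that fleet-level check might look like (the budget, the 95% confidence choice, and the Poisson model are illustrative assumptions, not anything prescribed by the standard):

```python
import math

def prob_at_least(k: int, mu: float) -> float:
    """P[X >= k] for X ~ Poisson(mu)."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

def incident_budget_spi(incidents: int, fleet_km: float,
                        budget_per_km: float = 1e-8,   # 1 incident per 100 million km
                        alpha: float = 0.05) -> bool:
    """True: observed fleet incidents are still consistent with the budgeted rate.
       False: SPI violation -- the trend exceeds what the safety case budgeted."""
    expected = budget_per_km * fleet_km
    if incidents == 0:
        return True
    return prob_at_least(incidents, expected) >= alpha

# One incident in 50 million fleet km could plausibly be the budgeted rare event...
print(incident_budget_spi(incidents=1, fleet_km=50e6))   # True
# ...but five such incidents in the same exposure indicate a problem with the claim.
print(incident_budget_spi(incidents=5, fleet_km=50e6))   # False
```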

(3) We pronounce "SPI" as "S-P-I" rather than "spy" after a very confusing conversation in which we realized we needed to explain to a government official that we were not actually proposing that the CIA become involved with validating autonomous vehicle safety.


Tuesday, February 11, 2020

Positive Trust Balance for Self Driving Car Deployment

By Philip Koopman and Michael Wagner, Edge Case Research

Self-driving cars promise improved road safety. But every publicized incident chips away at confidence in the industry’s ability to deliver on this promise, with zero-crash nirvana nowhere in sight. We need a way to balance long term promise vs. near term risk when deciding that this technology is ready for deployment. A “positive trust balance” approach provides a framework for making a responsible deployment decision by combining testing, engineering rigor, operational feedback, and transparent safety culture.


MILES AND DISENGAGEMENTS AREN’T ENOUGH

Too often, discussions about why the public should believe a particular self-driving car platform is well designed center around number of miles driven. Simply measuring the number of miles driven has a host of problems, such as distinguishing “easy” from “hard” miles and ensuring that miles driven are representative of real world operations. That aside, accumulating billions of road miles to demonstrate approximate parity to human drivers is an infeasible testing goal. Simulation helps, but still leaves unresolved questions about including enough edge cases that pose problems for deployment at scale.

By the time a self-driving car design is ready to deploy, the rate of potentially dangerous disengagements and incidents seen in on-road testing should approach zero. But that isn’t enough to prove safety. For example, a hypothetical ten million on-road test miles with no substantive incidents would still be a hundred times too little to prove that a vehicle is as safe as a typical human driver. So getting to a point that dangerous events are too rare to measure is only a first step.
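
For a rough sense of the arithmetic, here is a sketch using the standard "rule of three" for failure-free testing; the one-fatality-per-100-million-miles benchmark and the 95% confidence level are illustrative assumptions, and demonstrating an actual margin of improvement over the benchmark pushes the requirement higher still.

```python
# Rule of three: zero failures observed in N miles gives an approximate 95%
# upper confidence bound of 3/N on the true failure rate.
benchmark_rate = 1e-8               # assumed: ~1 fatality per 100 million miles
required_miles = 3 / benchmark_rate
print(f"{required_miles:,.0f} failure-free miles")   # 300,000,000 miles at minimum

test_miles = 10_000_000             # the hypothetical ten million incident-free miles
print(f"shortfall: about {required_miles / test_miles:.0f}x")
# Roughly a 30x gap at this confidence level; requiring a margin of improvement
# over the human benchmark widens the gap toward the ~100x cited above.
```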

In fact, competent human drivers are so good that there is no practical way to measure that a newly developed self-driving car has a suitably low fatality rate. This should not be news. We don’t fly new aircraft designs for billions of hours before deployment to measure the crash rate. Instead, we count on a combination of thorough testing, good engineering, and safety culture. Self-driving cars typically rely on machine learning to sense the world around them, so we will also need to add significant feedback from vehicles operating in the field to plug inevitable gaps in training data.


POSITIVE TRUST BALANCE

The self-driving car industry is invested in achieving a “positive risk balance” of being safer than a human driver. And years from now actuarial data will tell us if we succeeded. But there will be significant uncertainty about risk when it’s time to deploy. So we’ll need to trust development and deployment organizations to be doing the right things to minimize and manage that risk.

To be sure, developers already do better than brute force mileage accumulation. Simulations backed up by comprehensive scenario catalogs ensure that common cases are covered. Human copilots and data triage pipelines flag questionable self-driving behavior, providing additional feedback. But those approaches have their limits.

Rather than relying solely on testing, other industries use safety standards to ensure appropriate engineering rigor. While traditional safety standards were never intended to address self-driving aspects of these vehicles, new standards such as Underwriters Laboratories 4600 and ISO/PAS 21448 are emerging to set the bar on engineering rigor and best practices for self-driving car technology.

The bad news is that nobody knows how to prove that machine learning based technology will actually be safe. Although we are developing best practices, when deploying a self-driving car we’ll only know whether it is apparently safe, and not whether it is actually as safe as a human driver. Going past that requires real world experience at scale.

Deploying novel self-driving car technology without undue public risk will involve being able to explain why it is socially responsible to operate these systems in specific operational design domains. This requires addressing all of the following points:

Is the technology as safe as we can measure? This doesn’t mean it will be perfect when deployed. Rather, at some point we will have reached the limits of viable simulation and testing.

Has sufficient engineering rigor been applied? This doesn’t mean perfection. Nonetheless, some objective process such as establishing conformance to sufficiently rigorous engineering standards that go beyond testing is essential.

Is a robust feedback mechanism used to learn from real world experience? There must be proactive, continual risk management over the life of each vehicle based on extensive field data collection and analysis.

Is there a transparent safety culture? Transparency is required in evolving robust engineering standards, evaluating that best practices are followed, and ensuring that field feedback actually improves safety. A proactive, robust safety culture is essential. So is building trust with the public over time.

Applying these principles will potentially change how we engineer, regulate, and litigate automotive safety. Nonetheless, the industry will be in a much better place when the next adverse news event occurs if their figurative public trust account has a positive balance.

Philip Koopman is the CTO of Edge Case Research and an expert in autonomous vehicle safety. In addition to his role as a faculty member at Carnegie Mellon University, Koopman has been helping government, commercial, and academic self-driving developers improve safety for over 20 years. He is a principal contributor to the Underwriters Laboratories 4600 safety standard.

Michael Wagner is the CEO of Edge Case Research. He started working on autonomy at Carnegie Mellon over 20 years ago.


(Original post here:  https://medium.com/@pr_97195/positive-trust-balance-for-self-driving-car-deployment-ff3f04a7ef93)

Wednesday, May 1, 2019

Other Autonomous Vehicle Safety Argument Observations

Other AV Safety Issues:
We've seen some teams get it right. And some get it wrong. Don't make these mistakes if you're trying to ensure your autonomous vehicle is safe.

Defective disengagement mechanisms. Generally this involves an arbitrary fail-active autonomy failure being able to prevent successful disengagement by a human supervisor. As a concrete example, a system might read the state of the disengagement activation mechanism (the “big red button”) as an I/O device fed directly into the primary autonomy computer rather than using an independent safing mechanism. This is a special case of a single point of failure in the form of the autonomy computer.

Assuming perception failures are independent. Some arguments assume independent failures of multiple perception modes. While there is clearly utility in creating a safety case for the non-perception parts of an autonomous vehicle, one must argue rather than assume the safety of perception to create a credible safety case at the vehicle level.

Requiring perfect human supervision of autonomy. Humans are well known to struggle when assigned such monitoring tasks. Koopman et al. (2019) cover this topic in more detail as it relates to autonomous vehicle road testing safety.

Dismissing a potential fault as “unrealistic” without supporting data. For example, argumentation might state that a lightning strike on a moving vehicle is unrealistic or could not happen in the “real world,” despite data to the contrary (e.g. Holle 2008). To be sure, this does not mean that something like a lightning strike must be completely mitigated via keeping the vehicle fully operational. Rather, such faults must be considered in risk analysis. Dismissing hazards without risk analysis based on a subjective assertion that they are “unrealistic” results in a safety case with insufficient evidence.

Using multi-channel comparison approaches for autonomy. In general autonomy algorithms are nondeterministic, sensitive to initial conditions, and have many acceptable (or at least safe) behaviors for any given situation. Architectural approaches based on voting diverse autonomy algorithms tend to run into a problem of deciding whether the outputs are close enough to be valid. Averaging and other similar approaches are not necessarily appropriate. As a simple example, the average of veering to the right and veering to the left to avoid an obstacle could result in hitting the obstacle dead-on.
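
A toy numeric illustration of that last point (the steering values are arbitrary):

```python
# Two independently "safe" avoidance plans for an obstacle straight ahead:
channel_a_steering = +0.2   # veer right (steering command, illustrative units)
channel_b_steering = -0.2   # veer left

# A naive voter that averages two individually valid outputs:
voted_steering = (channel_a_steering + channel_b_steering) / 2
print(voted_steering)   # 0.0 -- drive straight at the obstacle both channels avoided
```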

Confusion about fault vs. failure. While there is a widely recognized terminology document for dependable system design (Avizienis 2004), we have found that there is widespread confusion about the terms fault and failure in practical use. This is especially true when discussing malfunctions that are not due to a component fault, but rather a requirements gap or an excursion from the intended operational environment. It is beyond the scope of this paper to attempt to resolve this, but we note it as an area worthy of future work and particular attention in interdisciplinary discussions of autonomy safety.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Avizienis, A., Laprie, J.-C., Randell B., Landwehr, C. (2004) “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, 2004.
  • Holle, R. (2008) Lightning-caused deaths and injuries in the vicinity of vehicles, American Meteorological Society Conference on Meteorological Applications of Lightning Data, 2008.
  • Koopman, P. and Latronico, B., (2019) Safety Argument Considerations for Public Road Testing of Autonomous Vehicles, SAE WCX, 2019. 

Wednesday, April 24, 2019

Human Test Scenario Bias in Autonomous Vehicle Validation

Human Test Scenario Bias:
Machine Learning perceives the world differently than you do. That means your intuition is not necessarily a good source for test planning.


Simulation-based testing (including especially closed-course testing of real vehicles) can suffer from a test planning bias. The problem is that a test plan is often made according to human perception of the scenario being tested. For example, a test scenario might be “child crossing in a painted cross-walk.” Details of the test scenario might explore various corner cases involving child clothing, size, weather conditions, scene clutter, and so on.

Commonly test scenarios map to a human-interpretable taxonomy of the system and environmental state space. However, autonomy systems might have a different internal state space representation than humans, meaning that they classify the world in ways that differ from how humans do so. This in turn can lead to a situation in which a human believes apparent complete coverage via a testing plan has been achieved, while in reality significant aspects of the autonomy system have not been tested.

As a hypothetical example, the autonomy system might have deduced that a human’s shirt color is a predictor of whether that human will step into a street because of accidental correlations in a training data set. But the test plan might not specify shirt color as a variable, because test designers did not realize it was a relevant autonomy feature for pedestrian motion prediction. That means that a human-designed vision test for an autonomous vehicle can be an excellent starting point to make sure that obstacles, pedestrians, and other objects can be detected by the system. But, more is required.

Machine-learning based systems are known to be vulnerable to learning bias that is not recognized by human testers, at least initially. Some such failures have been quite dramatic (e.g. Grush 2015). Thus, simplistic tests such as having an average body size white male in neutral summer clothing cross a street to test pedestrian avoidance do not demonstrate a robust safety capability. Rather, such tests tend to demonstrate a minimum performance capability.

Interpreting the results of human-constructed test designs, including humans interpreting why a particular on-road scenario failed, is also subject to human test scenario bias. A credible safety argument that relies upon human-constructed tests, or upon human interpretation of root cause analysis in claiming that test failures have been fixed, should address this pitfall.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

Wednesday, April 10, 2019

Safety Argument Consideration for Public Road Testing of Autonomous Vehicles

Beth Osyk and I are presenting our paper at SAE WCX today on how to argue sufficient road test safety for self-driving car technology.

See the slideshare below or follow this link for presentation slides.

Preprint of paper here:
https://users.ece.cmu.edu/~koopman/pubs/koopman19_TestingSafetyCase_SAEWCX.pdf

Abstract:
Autonomous vehicle (AV) developers test extensively on public roads, potentially putting other road users at risk. A safety case for human supervision of road testing could improve safety transparency. A credible safety case should include: (1) the supervisor must be alert and able to respond to an autonomy failure in a timely manner, (2) the supervisor must adequately manage autonomy failures, and (3) the autonomy failure profile must be compatible with effective human supervision.

Human supervisors and autonomous test vehicles form a combined human-autonomy system, with the total rate of observed failures including the product of the autonomy failure rate and the rate of unsuccessful failure mitigation by the supervisor. A difficulty is that human ability varies in a nonlinear way with autonomy failure rates, counter-intuitively making it more difficult for a supervisor to assure safety as autonomy maturity improves. Thus, road testing safety cases must account for both the expected failures during testing and the practical effectiveness of human supervisors given that failure profile. This paper outlines a high level safety case that identifies key factors for credibly arguing the safety of an on-road AV test program. A similar approach could be used to analyze potential safety issues for high capability semi-autonomous production vehicles.






Wednesday, April 3, 2019

Nondeterministic Behavior and Legibility in Autonomous Vehicle Validation

Nondeterministic Behavior and Legibility:
How do you know your autonomous vehicle passed the test for the right reason? What if it just got lucky, or is gaming the test?



The nature of the algorithms used by autonomy systems creates problems for modelling and testing that go beyond typical safety critical software. Some autonomy algorithms, such as randomized path planning, are inherently non-deterministic. Others can be brittle, failing dramatically with subtle variations in data, such as perception false negatives induced by adversarial attacks (Szegedy et al. 2013) or false negatives induced by slight image degradation due to haze or defocus (Pezzementi et al. 2018).


A related issue is over-fitting to the test, in which an autonomy system learns how to beat a fixed test. By analogy, this is the pitfall of the system cheating by having memorized the correct answers. A proposed way to deal with this risk is to randomly vary aspects of test cases.

In such a fuzzing or variable testing approach it is important to randomly vary all relevant aspects of a problem. For example, varying geometries for traffic situations can be helpful, but probably does not address potential over-fitting for perception algorithms that perform object classification.

The use of potentially non-deterministic test scenarios combined with non-deterministic system behaviors and opaque system designs means it is difficult to know whether a system has passed a test, because there is no single correct answer. Rather, there must be some algorithmic way to determine whether a particular system response is acceptable or not, making that test oracle algorithm safety critical.
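
A minimal sketch of what such an algorithmic acceptance check might look like for one scenario type: the oracle evaluates safety properties of whatever behavior was produced rather than comparing against a single expected trajectory (the clearance threshold and data layout are invented for illustration).

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Snapshot:
    ego_xy: Tuple[float, float]         # vehicle position (m)
    pedestrian_xy: Tuple[float, float]  # pedestrian position (m)

def oracle_accepts(trace: List[Snapshot], required_clearance_m: float = 1.5) -> bool:
    """Pass/fail is defined over safety properties of the observed behavior, not
    over equality with one 'correct' answer -- veering left, veering right, or
    braking to a stop can all be acceptable responses to the same scenario."""
    clearance = min(math.dist(s.ego_xy, s.pedestrian_xy) for s in trace)
    return clearance >= required_clearance_m

# Two very different runs can both (legitimately) pass the same test:
run_veer  = [Snapshot((0.0, 0.0), (10.0, 0.0)), Snapshot((9.0, 3.0), (10.0, 0.0))]
run_brake = [Snapshot((0.0, 0.0), (10.0, 0.0)), Snapshot((7.0, 0.0), (10.0, 0.0))]
print(oracle_accepts(run_veer), oracle_accepts(run_brake))   # True True
```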

Moreover, it is possible that a system has passed a particular test by chance. For example, a pedestrian might be avoided due to a properly functioning detection and avoidance algorithm. But a pedestrian might also be avoided merely because a random path planner by chance picked a path that did not intersect the pedestrian, or responded to a completely unrelated aspect of the environment that caused it to pick a fortuitously safe path. Similarly, a pedestrian might be detected in one image, but undetected in another that differs in ways that are essentially imperceptible to a human.

It is unclear if resolving this issue requires solving the difficult problem of explainable AI (Gunning 2018). As a minimum, a credible safety argument will need to address the problem of how plans to test vehicles with less than a statistically valid amount of real-world exposure data can avoid these pitfalls. It seems likely that a credible argument will also have to establish that each type of test has been passed due to safe operation of the system rather than simply by chance (Koopman & Wagner 2018).

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Gunning, D. (2018), Explainable Artificial Intelligence (XAI), Defense Advanced Research Projects Agency, https://www.darpa.mil/program/explainable-artificial-intelligence (accessed October 27, 2018).
  • Koopman, P. & Wagner, M., (2018) "Toward a Framework for Highly Automated Vehicle Safety Validation," SAE World Congress, 2018. SAE-2018-01-1071.
  • Pezzementi, Z., Tabor, T., Yim, S., Chang, J., Drozd, B., Guttendorf, D., Wagner, M., Koopman, P., "Putting image manipulations in context: robustness testing for safe perception," IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Aug. 2018.
  • Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. (2013) "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).

Wednesday, March 27, 2019

Missing Rare Events in Autonomous Vehicle Simulation

Missing Rare Events in Simulation:
A highly accurate simulation and system model doesn't solve the problem of what scenarios to simulate. If you don't know what edge cases to simulate, your system won't be safe.




It is common, and generally desirable, to use vehicle-level simulation rather than on-road operation as a proxy field testing strategy. Simulation offers a number of potential advantages over field testing of a real vehicle, including lower marginal cost per mile, better scalability, and reduced risk to the public from testing. Ultimately, simulation is based upon data that generates scenarios used to exercise the system under test, commonly called the simulation workload. The validity of the simulation workload is just as relevant as the validity of the simulation models and software.

Simulation-based validation is often accomplished with a weighting of scenarios that is intentionally different than the expected operational profile. Such an approach has the virtue of being able to exercise corner cases and known rare events with less total exposure than would be required by waiting for such situations to happen by chance in real-world testing (Ding 2017). To the extent that corner cases and known rare events are intentionally induced in physical vehicle field testing or closed course testing, those amount to simulation in that the occurrence of those events is being simulated for the benefit of the test vehicle.
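
A minimal sketch of that intentional re-weighting: rare, risky scenario categories are sampled far more often than their real-world frequency, and each result carries its operational-profile weight so that aggregate estimates are not distorted (the categories, rates, and weights are invented for illustration).

```python
import random

# Assumed operational profile: probability of each scenario category in real driving.
operational_profile = {"nominal": 0.97, "cut_in": 0.02, "pedestrian_darts_out": 0.01}

# Deliberately skewed simulation workload that over-samples the rare categories.
test_workload = {"nominal": 0.20, "cut_in": 0.40, "pedestrian_darts_out": 0.40}

def sample_scenarios(n: int, rng: random.Random):
    categories, weights = zip(*test_workload.items())
    for category in rng.choices(categories, weights=weights, k=n):
        # Importance weight: how much this result counts toward estimates of
        # performance under the real operational profile.
        yield category, operational_profile[category] / test_workload[category]

rng = random.Random(0)
for category, weight in sample_scenarios(5, rng):
    print(f"{category:22s} weight={weight:.3f}")
```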

A more sophisticated simulation approach should use a simulation “stack” with layered levels of abstraction. High level, faster simulation can explore system-level issues while more detailed but slower simulations, bench tests, and other higher fidelity validation approaches are used for subsystems and components (Koopman & Wagner 2018).

Regardless of the mix of simulation approaches, simulation fidelity and realism of the scenarios is generally recognized as a potential threat to validity. The simulation must be validated to ensure that it produces sufficiently accurate results for aspects that matter to the safety case. This might include requiring conformance of the simulation code and model data to a safety-critical software standard.


Even with a conceptually perfect simulation, the question remains as to what events to simulate. Even if simulation were to cover enough miles to statistically assure safety, the question would remain as to whether there are gaps in the types of situations simulated. This corresponds to the representativeness issue with field testing and proven in use arguments. However, representativeness is a more pressing matter if simulation scenarios are being designed as part of a test plan rather than being based solely on statistically significant amounts of collected field data.

Another way to look at this problem is that simulation can remove the need to do field testing for rare events, but it does not remove the need to determine what rare events matter. All things being equal, simulation does not reduce the number of road miles needed for data collection to observe rare events. Rather, it permits a substantial fraction of data collection to be done with a non-autonomous vehicle. Thus, even if simulating billions of miles is feasible, there needs to be a way to ensure that the test plan and simulation workload exercise all the aspects of a vehicle that would have been exercised in field testing of the same magnitude.

As with the fly-fix-fly anti-pattern, fixing defects identified in simulation requires additional simulation input data to validate the design. Simply re-running the same simulation and fixing bugs until the simulation passes invokes the “pesticide paradox.” (Beizer 1990) This paradox holds that a system which has been debugged to the point that it passes a set of tests can’t be considered completely bug free. Rather, it is simply free of the bugs that the test suite knows how to find, leaving the system exposed to bugs that might involve only very subtle differences from the test suite.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

Wednesday, March 20, 2019

Dealing with Edge Cases in Autonomous Vehicle Validation

Dealing with Edge Cases:
Some failures are neither random nor independent. Moreover, safety is typically more about dealing with unusual cases. This means that brute force testing is likely to miss important edge case safety issues.


A significant limitation to a field testing argument is the assumption of random independent failures inherent in the statistical analysis. Arguing that software failures are random and independent is clearly questionable, since multiple instances of a system will have identical software defects. 

Moreover, arguing that the arrival of exceptional external events is random and independent across a fleet is clearly incorrect in the general case. A few simple examples of correlated events between vehicles in a fleet include:
  • Timekeeping events (e.g. daylight savings time, leap second)
  • Extreme weather (e.g. tornado, tsunami, flooding, blizzard white-out, wildfires) affecting multiple systems in the same geographic area
  • Appearance of novel-looking pedestrians occurring on holidays (e.g. Halloween, Mardi Gras)
  • Security vulnerabilities being attacked in a coordinated way

For life-critical systems, proper operation in typical situations needs to be validated. But this should be a given. Progressing from baseline functionality (a vehicle that can operate acceptably in normal situations) to a safe system (a vehicle that safely handles unusual situations and unexpected situations) requires dealing with unusual cases that will inevitably occur in the deployed fleet.

We define an edge case as a rare situation that will occur only occasionally, but still needs specific design attention to be dealt with in a reasonable and safe way. The quantification of “rare” is relative, and generally refers to situations or conditions that will occur often enough in a full-scale deployed fleet to be a problem but have not been captured in the design or requirements process. (It is understood that the process of identifying and handling edge cases makes them – by definition – no longer edge cases. So in practice the term applies to situations that would not have otherwise been handled had special attempts not been made to identify them during the design and validation process.)

It is useful to distinguish edge cases from corner cases. Corner cases are combinations of normal operational parameters. Not all corner cases are edge cases, and not all edge cases are corner cases. An example of a corner case could be a driving situation with an iced over road, low sun angle, heavy traffic, and a pedestrian in the roadway. This is a corner case since each item in that list ought to be an expected operational parameter, and it is the combination that might be rare. This would be an edge case only if there is some novelty to the combination that produces an emergent effect with system behavior. If the system can handle the combination of factors in a corner case without any special design work, then it’s not really an edge case by our definition. In practice, even difficult-to-handle corner cases that occur frequently will be identified during system design.

Only corner cases that are both infrequent and present novelty due to the combination of conditions are edge cases. It is worth noting that changing geographic location, season of year, or other factors can result in different corner cases being identified during design and test, and leave different sets of edge cases unresolved. Thus, in practice, edge cases that remain after normal system design procedures could differ depending upon the operational design domain of the vehicle, the test plan, and even random chance occurrences of which corner cases happened to appear in training data and field trials.

Classically an edge case refers to a type of boundary condition that affects inputs or reveals gaps in requirements. More generally, edge cases can be wholly unexpected events, such as the appearance of a unique road sign, or an unexpected animal type on a highway. They can be a corner case that was thought to be impossible, such as an icy road in a tropical climate. They can also be an unremarkable (to a human), non-corner case that somehow triggers an autonomy fault or stumbles upon a gap in training data, such as a light haze that results in perception failure. The thing that makes something an edge case is that it unexpectedly activates a requirements, design, or implementation defect in the system.

There are two implications to the occurrence of such edge cases in safety argumentation. One is that fixing edge cases as they arrive might not improve safety appreciably if the population of edge cases is large due to the heavy tail distribution problem (Koopman 2018c). This is because removing even a large number of individual defects from an essentially infinite-size pool of rarely activated defects does not materially improve things. Another implication is that the arrival of edge cases might be correlated by date, time, weather, societal events, micro-location, or combinations of these triggers. Such a correlation can invalidate an assumption that losses from activation of a safety defect will result in small losses between the time the defect first activates and the time a fix can be produced. (Such correlated mishaps can be thought of as the safety equivalent of a “zero day attack” from the security world.)
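
A small simulation sketch of the first implication (the Pareto shape, pool size, and cutoff are arbitrary assumptions chosen only to show the effect): when defect activation rates are heavy-tailed, fixing even the ten thousand most frequently triggered defects leaves most of the aggregate activation rate in place.

```python
import random

rng = random.Random(1)

# A large pool of latent defects whose individual activation rates follow a
# heavy-tailed (Pareto) distribution: a few common defects, a vast number of
# individually rare ones.
defect_rates = sorted((rng.paretovariate(2.0) for _ in range(1_000_000)), reverse=True)

total_rate = sum(defect_rates)
# Field experience surfaces (and fixes) the most frequently activated defects first.
residual_rate = sum(defect_rates[10_000:])   # after fixing the 10,000 most common
print(f"residual: {residual_rate / total_rate:.0%} of the original activation rate")
# With these assumptions roughly 90% of the aggregate rate remains even after
# fixing the top 1% of defects.
```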

It is helpful to identify edge cases to the degree possible within the constraints of the budget and resources available to a project. This can be partially accomplished via corner case testing (e.g. Ding 2017). The strategy here would be to test essentially all corner cases to flush out any that happen to present special problems that make them edge cases. However, some edge cases also require identifying likely novel situations beyond combinations of ordinary and expected scenario components. And other edge cases are exceptional to an autonomous system, but not obviously corner cases in the eyes of a human test designer.

Ultimately, it is unclear if it can ever be shown that all edge cases have been identified and corresponding mitigations designed into the system. (Formal methods could help here, but the question would be whether any assumptions that needed to be made to support proofs were themselves vulnerable to edge cases.) Therefore, for immature systems it is important to be able to argue that inevitable edge cases will be dealt with in a safe way frequently enough to achieve an appropriate level of safety. One potential argumentation approach is to aggressively monitor and report unusual operational scenarios and proactively respond to near misses and incidents before a similar edge case can trigger a loss event, arguing that the probability of a loss event from unhandled edge cases is sufficiently low. Such an argument would have to address potential issues from correlated activation of edge cases.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Koopman, P. (2018c) "The Heavy Tail Safety Ceiling," Automated and Connected Vehicle Systems Testing Symposium, June 2018.
  • Ding, Z. (2017) “Accelerated evaluation of automated vehicles,” http://www-personal.umich.edu/~zhaoding/accelerated-evaluation.html, accessed 10/15/2017.






Monday, March 11, 2019

The Fly-Fix-Fly Antipattern for Autonomous Vehicle Validation

Fly-Fix-Fly Antipattern:
Brute force mileage accumulation to fix all the problems you see on the road won't get you all the way to being safe. For so very many reasons...



This A400M crash was caused by corrupted software calibration parameters.
In practice, autonomous vehicle field testing has typically not been a deployment of a finished product to demonstrate a suitable integrity level. Rather, it is an iterative development approach that can amount to debugging followed by attempting to achieve improved system dependability over time based on field experience. In the aerospace domain this is called a “fly-fix-fly” strategy. The system is operated and each problem that crops up in operations is fixed in an attempt to remove all defects. While this can help improve safety, it is most effective for a rigorously engineered system that is nearly perfect to begin with, and for which the fixes do not substantially alter the vehicle’s design in ways that could produce new safety defects.


It's a moving target: A significant problem with argumentation based on fly-fix-fly occurs when designers try to take credit for previous field testing despite a major change. When arguing field testing safety, it is generally inappropriate to take credit for field experience accumulated before the last change to the system that can affect safety-relevant behaviour. (The “small change” pitfall has already been discussed in another blog post.)

Discounting Field Failures: There is no denying the intuitive appeal to an argument that the system is being tested until all bugs have been fixed. However, this is a fundamentally flawed argument. That is because this amounts to saying that no matter how many failures are seen during field testing, none of them “count” if a story can be concocted as to how they were non-reproducible, fixed via bug patches, and so on.

To the degree such an argument could be credible, it would have to find and fix the root cause of essentially all field failures. It would further have to demonstrate (somehow) that the field failure root cause had been correctly diagnosed, which is no small thing, especially if the fix involves retraining a machine-learning based system. 

Heavy Tail: Additionally, it would have to argue that a sufficiently high fraction of field failures have actually been encountered, resulting in a sufficiently low probability of encountering novel additional failures during deployment.  (Probably this would be an incorrect argument if, as expected, field failure types have a heavy tail distribution.)

Fault Reinjection: A fly-fix-fly argument must also address the fault reinjection problem. Fault reinjection occurs when a bug fix introduces a new bug as a side effect of that fix. Ensuring that this has not happened via field testing alone requires resetting the field testing clock to zero after every bug fix. (Hybrid arguments that include rigorous engineering analysis are possible, but simply assuming no fault reinjection without any supporting evidence is not credible.)

It Takes Too Long: It is difficult to believe an argument that claims that a fly-fix-fly process alone (without any rigorous engineering analysis to back it up) will identify and fix all safety-relevant bugs. If there is a large population of bugs that activate infrequently compared to the amount of field testing exposure, such a claim would clearly be incorrect. Generally speaking, fly-fix-fly requires an infeasible amount of operation to achieve the ultra-dependable results required for life critical systems, and typically makes unrealistic assumptions such as no new faults being injected by fixing a fault identified in testing (Littlewood and Strigini 1993). A specific issue is the matter of edge cases, discussed in an upcoming blog post.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Littlewood, B., Strigini, L. (1993) “Validation of Ultra-High Dependability for Software-Based Systems,” Communications of the ACM, 36(11):69-80, November 1993.




Thursday, February 28, 2019

The Discounted Failure Pitfall for autonomous system safety

The Discounted Failure Pitfall:
Arguing that something is safe because it has never failed before doesn't work if you keep ignoring its failures based on that reasoning.


A particularly tricky pitfall occurs when a proven in use argument is based upon a lack of observed field failures when field failures have been systematically under-reported or even not reported at all. In this case, the argument is based upon faulty evidence.

One way that this pitfall manifests in practice is that faults that result in low consequence failures tend to go unreported, with system redundancy tending to reduce the consequences of a typical incident. It can take time and effort to report failures, and there is little incentive to report each incident if the consequence is small, the system can readily continue service, and the subjective impression is that the system is perceived as overall safe. Perversely, reporting numerous recovered malfunctions or warnings can actually increase user confidence in a system even while events go unreported. This can be a significant issue when removing backup systems such as mechanical interlocks based on a lack of reported loss events. If the interlocks were keeping the system safe but interlock or other failsafe engagements go unreported, removing those interlocks can be deadly. Various aspects of this pitfall came into play in the Therac 25 loss events (Leveson 1993).

An alternate way that this pitfall manifests is when there is a significant economic or other incentive for suppressing or mischaracterizing the cause of field failures. For example, there can be significant pressure and even systematic approaches to failure reporting and analysis that emphasize human error over equipment failure in mishap reporting (Koopman 2018b). Similarly, if technological maturity is being measured by a trend of reduced safety mechanism engagements (e.g. autonomy disengagements during road testing (Banerjee et al. 2018)), there can be significant pressure to artificially reduce the number of events reported. Claiming proven in use integrity for a component subject to artificially reduced or suppressed error reports is again basing an argument on faulty evidence.

Finally, arguing that a component cannot be unsafe because it has never caused a mishap before can result in systems that are unsafe by induction. More specifically, if each time a mishap occurs you argue it wasn't that component's fault, and therefore never attribute cause to that component, then you can continue that behavior forever, all the while treating an unsafe component as if it were safe.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Leveson, N. (1993) An investigation of the Therac-25 Accidents, IEEE Computer, July 1993, pp. 18-41.
  • Koopman, P. (2018b), "Practical Experience Report: Automotive Safety Practices vs. Accepted Principles," SAFECOMP, Sept. 2018.
  • Banerjee, S., Jha, S., Cyriac, J., Kalbarczyk, Z., Iyer, R. (2018) “Hands Off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data,” 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018.

Tuesday, February 26, 2019

The “Small” Change Fallacy for autonomous system safety

The “small” change fallacy:
In software, even a single character change to source code can cause catastrophic failure.  With the possible exception of a very rigorous change analysis process, there is no such thing as a "small" change to software.


Headline: July 22, 1962: Mariner 1 Done In By A Typo

Another threat to validity for a proven in use strategy is permitting “small” changes without re-validation (e.g. as implicitly permitted by (NHTSA 2016), which requires an updated safety assessment for significant changes). Change analysis can be difficult in general for software, and ineffective for software that has high coupling between modules and resultant complex inter-dependencies.

Because even one line of bad code can result in a catastrophic system failure, it is highly questionable to argue that a change is “small” because it only affects a small fraction of the source code base. Rigorous analysis should be performed on the change and its possible effects, which generally requires analysis of design artifacts beyond just source code.
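
A toy illustration of why "only a tiny fraction of the code changed" is not by itself a safety argument (the function is entirely hypothetical):

```python
# Hypothetical brake request clamp. The "small" change below altered exactly one
# character (an added minus sign) in one line of a large code base.
def clamp_brake_request(requested_decel: float, max_decel: float = 9.0) -> float:
    # Before the change: return min(max(requested_decel, 0.0), max_decel)
    return min(max(requested_decel, 0.0), -max_decel)   # one-character edit

print(clamp_brake_request(6.0))   # -9.0: every brake request now clamps to an
                                  # invalid value -- braking is effectively lost
```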

A variant of this pitfall is arguing that a particular type of technology is generally “well understood” and implying based upon this that no special attention is required to ensure safety. There is no reason to expect that a completely new implementation – potentially by a different development team with a different engineering process – has sufficient safety integrity simply because it is not a novel function to the application domain.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • NHTSA (2016) Federal Automated Vehicles Policy, National Highway Traffic Safety Administration, U.S. Department of Transportation, September 2016.

Thursday, February 21, 2019

Assurance Pitfalls When Using COTS Components

Assurance Pitfalls When Using COTS Components:
Using a name-brand, familiar component doesn't automatically ensure safety.



It is common to repurpose Commercial Off-The-Shelf (COTS) software or components for use in critical autonomous vehicle applications. These include components originally developed for other domains such as mine safety, low volume research components such as LIDAR units, and automotive components such as radars previously used in non-critical or less critical ADAS applications.

Generally such COTS components are being used in a somewhat different way than the original non-critical commercial purpose, and are often modified for use as well. Moreover, even field proven automotive components are typically customized for each vehicle manufacturer to conform to customer-specific design requirements. When arguing that a COTS item is proven in use, it is important to account for at least whether there is in fact sufficient field experience, whether the field experience is for a previous or modified version of the component, and other factors such as potential supply-chain changes, manufacturing quality fade, and the possibility of counterfeit goods.

In some cases we have seen proven in use arguments attempted for which the primary evidence relied upon is the reputation of a manufacturer based on historical performance on other components. While purchasing from a reputable manufacturer is often a good start, a brand name label by itself does not necessarily demonstrate that a particular component is fit for purpose, especially if a complex supply chain is involved.

COTS components can be problematic if they don't come with the information needed for safety assessment. (Hint: source code is only a starting point. But often even that isn't provided.) While third party certification certainly is not a panacea, looking for independent evaluation that relevant technical, development process, and assurance process activities have been performed is a good start to making sure COTS components are fit for purpose.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

Tuesday, February 19, 2019

Proven In Use: The Violated Assumptions Pitfall for Autonomous Vehicle Validation

The Violated Assumptions Pitfall:
Arguing something is Proven In Use must address whether the new use has sufficiently similar assumptions and operational conditions as the old one.


The proven in use argumentation pattern uses field experience of a component (or, potentially, an engineering process) to argue that field failure rates have been sufficiently low to assure a necessary level of integrity.

Historically this has been used as a way to grandfather existing components into new safety standards (for example, in the process industry under IEC 61508). The basis of the argument is generally that if a huge number of operational hours of experience in the actual application have been accumulated with a sufficiently low number of safety-relevant failures, then there is compelling experimental data supporting the integrity claim for a component. We assume in this section that sufficient field experience has actually been obtained to make a credible argument. A subsequent section on field testing addresses the issue of how much experience is required to make a statistically valid claim.

A plausible variant could be that a component is initially deployed with a temporary additional level of redundancy or other safety backup mechanism such as a Doer/Checker pair. (By analogy, the Checker is a set of training wheels for a child learning to ride a bicycle.) When sufficient field experience has been accumulated to attain confidence regarding the safety of the main system, the additional safety mechanism might be removed to reduce cost.
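
A minimal sketch of the Doer/Checker arrangement described above (the envelope limits and fallback values are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Command:
    steering: float     # illustrative units
    accel_mps2: float

def doer_plan(sensor_snapshot: dict) -> Command:
    # Complex, possibly ML-based planner (stubbed with a fixed output here).
    return Command(steering=0.05, accel_mps2=2.5)

def checker(cmd: Command, speed_mps: float) -> Command:
    """Simple, independently assured envelope check. If the doer's command falls
    outside the safety envelope, substitute a conservative fallback action."""
    if abs(cmd.steering) > 0.3 or cmd.accel_mps2 > 2.0 or speed_mps > 30.0:
        return Command(steering=0.0, accel_mps2=-3.0)   # fallback: slow down
    return cmd

print(checker(doer_plan({}), speed_mps=12.0))   # doer's request exceeds the
                                                # envelope, so the fallback is used
```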

For autonomous vehicles, the safety backup mechanism frequently manifests as a human safety driver, and the argument being made is that a sufficient amount of safe operation with a human safety driver enables removal of that safety driver for future deployments. The ability of a human safety driver to adequately supervise autonomy is itself a difficult topic (Koopman and Latronico 2019, Gao 2016), but is beyond the scope of this paper.

A significant threat to validity for proven in use argumentation is that the system will be used in an operational environment that differs in a safety-relevant way from its field history. In other words, a proven in use argument is invalidated if the component in question is exposed to operational conditions substantively different than the historical experience.

Diverging from historical usage can happen in subtle ways, especially for software components. Issues of data value limits, timing, and resource exhaustion can be relevant and might be caused by functionally unrelated components within a system. Differences in fault modes, sensor input values, or the occurrence of exceptional situations can also cause problems.

A proven in use safety argument must address how the risk of the system being deployed outside this operational profile is mitigated (e.g. by detecting distributional shifts in the environment, by external limitations on the environment, or by procedural mitigations). It is important to realize that an operational profile for a safety critical system must include not only typical cases, but also rare but expected edge cases as well. An even more challenging issue is that there might be operational environment assumptions that designers do not realize are being made, resulting in a residual risk of violated assumptions even after a thorough engineering review of the component reuse plan.

A classic example of this pitfall is the failure of Ariane 5 flight 501. That mishap involved reusing an inertial navigation system from the Ariane 4 design employing a proven in use argument (Lyons 1996). A contributing factor was that software functionality used by Ariane 4 but not required for Ariane 5 flight was left in place to avoid having to requalify the component.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Koopman, P. and Latronico, B., (2019) Safety Argument Considerations for Public Road Testing of Autonomous Vehicles, SAE WCX, 2019.
  • Gao, F. (2016) Modeling Human Attention and Performance in Automated Environments with Low Task Loading, Ph.D. Dissertation, Massachusetts Institute of Technology, Feb. 2016.

Thursday, February 14, 2019

Pitfall: Arguing via Compliance with an Inappropriate Safety Standard

For a standard to help ensure you are actually safe, not only do you have to actually follow it, but it must be a suitable standard in domain, scope, and level of integrity assured.  Proprietary safety standards often don't actually make you safe.


Historically some car makers have used internal, proprietary software safety guidelines as their way of attempting to assure an appropriate level of safety (Koopman 2018b). If a safety argument is based in part upon conformance to a safety standard, it is essential that the comprehensiveness of the safety standard itself be supported by evidence. For public standards that argument can involve group consensus by experts, public scrutiny, assessments of historical effectiveness, improvement in response to loss events, and general industry acceptance.

On the other hand, proprietary standards lack at least public scrutiny, and often lack other aspects such as general industry acceptance as well as falling short of accepted practices in technically substantive ways. One potential approach to argue that a proprietary safety standard is adequate is to map it to a publicly available standard, although there might be other viable argumentation approaches.

Sometimes arguing compliance with a well-known and documented industry safety standard may be inadequate for a particular safety case. Although all passenger vehicle subsystems in the United States must comply with the Federal Motor Vehicle Safety Standards (FMVSS) issued by the National Highway Traffic Safety Administration (NHTSA), these standards concentrate on mechanical and electrical issues, and only high level electronics functionality. FMVSS compliance is insufficient for a safety case that must encompass software (Koopman 2018b).

Similarly, use of particular standards might be an accepted or recommended practice, but not a means of ensuring safety. For example, the use of AUTOSAR (Autosar.org 2018) and Model Based Design techniques are sometimes proposed as indicators of safe software, but use of these technologies has essentially no predictive power with respect to system safety.

 For automotive systems with safety-critical software, ISO 26262 is typically an appropriate standard, although other standards such as IEC 61508 (IEC 1998) can be relevant as well. For some autonomous system functions, conformance to ISO PAS 21448 (ISO 2018) might well be appropriate. A new standard, UL 4600, is specifically intended to cover autonomous vehicles and is expected to be introduced in 2019.

Arguments of the form "technology X, standard Y, or methodology Z is adequate because it has historically produced safe systems in that past" should address the potential issues with "proven in use" argumentation considered in the following section.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Autosar.org (2018) The Standard Software Framework for Intelligent Mobility, https://www.autosar.org/ accessed Nov. 5, 2018.
  • Koopman, P. (2018b), "Practical Experience Report: Automotive Safety Practices vs. Accepted Principles," SAFECOMP, Sept. 2018.
  • IEC (1998) IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems, International Electrotechnical Commission, December 1998.
  • ISO (2018) Road vehicles – Safety of the Intended Function, ISO/PRF PAS 21448 (under development), International Organization for Standardization, 2018.


Monday, February 11, 2019

The Implicit Controllability Pitfall for autonomous system safety

The Implicit Controllability Pitfall:
A safety case must account for not only failures within the autonomy system, but also failures within the vehicle. With a fully autonomous vehicle, responsibility for managing potentially unsafe equipment malfunctions that were previously mitigated by a human driver falls to the autonomy.


A subtle pitfall when arguing based on conformance to a safety standard is neglecting that assumptions made when assessing the subsystem might have been violated or changed by the use of autonomy. Of particular concern for ground vehicles is the “controllability” aspect of an ISO 26262 ASIL analysis. (Severity and exposure might also change for an autonomous vehicle due to different usage patterns and should also be considered, but are beyond the scope of this discussion.)

The risk analysis of an underlying conventional vehicle according to ISO 26262 requires taking into account the severity, exposure, and controllability of each hazard (ISO 2011). The controllability aspect assumes a competent human driver is available to react to and mitigate equipment malfunctions. Taking credit for some degree of controllability generally reduces the integrity requirements of a component. This idea is especially relevant for Advanced Driver-Assistance Systems (ADAS) safety arguments, in which it is assumed that the driver will intervene in a timely manner to correct any vehicle misbehavior, including potential ADAS malfunctions.

With a fully autonomous vehicle, responsibility for managing potentially unsafe equipment malfunctions that were previously mitigated by a human driver falls to the autonomy. That means that all the assumed capabilities of a human driver that have been built into the safety arguments regarding underlying vehicle malfunctions are now imposed upon the autonomy system.
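To make the effect of the controllability credit concrete, here is a minimal sketch. It uses the commonly cited additive approximation of the ISO 26262-3 risk graph rather than quoting the standard's normative table, and the function name and hazard values are illustrative assumptions rather than anything from the paper:

```python
# Illustrative sketch only: approximates the ISO 26262-3 ASIL determination
# with the commonly cited additive rule. Consult the standard for the
# normative table before using this for real work.

def asil(severity: int, exposure: int, controllability: int) -> str:
    """Map ISO 26262 class numbers (S1-S3, E1-E4, C1-C3) to an ASIL or QM."""
    assert 1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3
    total = severity + exposure + controllability
    # QM for total <= 6; required integrity rises with the total.
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")

# Hazard assessed for a conventional vehicle, crediting a competent human
# driver with good controllability (C1):
print(asil(severity=3, exposure=4, controllability=1))   # ASIL B

# Same hazard with no human driver available to intervene (C3):
print(asil(severity=3, exposure=4, controllability=3))   # ASIL D
```

The point of the sketch is the delta: the same hazard assessed without a human driver available to intervene jumps from ASIL B to ASIL D, and that added integrity burden lands on the autonomy system.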

If the autonomy design team does not have access to the analysis behind underlying equipment safety arguments, there might be no practical way to know what all the controllability assumptions are. In other words, the makers of an autonomy kit might be left guessing what failure response capabilities they must provide to preserve the correctness of the safety argumentation for the underlying vehicle.

The need to mitigate some malfunctions is likely obvious, but we have found that “obvious” is in the eye of the beholder. Some examples of assumed human driver interventions we have noted, or even experienced first-hand, include:
  • Pressing hard on the brake pedal to compensate for loss of power assist 
  • Controlling the vehicle in the event of a tire blowout 
  • Path planning after catastrophic windshield damage from debris impact 
  • Manually pumping brakes when anti-lock brake mechanisms time out due to excessive activation on slick surfaces
  • Navigating by ambient starlight after a lighting system electrical failure at speed while bringing the vehicle to a stop
  • Attempting to mitigate the effects of uncommanded propulsion power 
To the degree that the integrity level requirements of the vehicle were reduced by the assumption that a human driver could mitigate the risk inherent in such malfunction scenarios, the safety case is insufficient if an autonomy kit does not have comparable capabilities.

Creating a thorough safety argument will require either obtaining or reverse engineering all the controllability assumptions made in the design of the underlying vehicle. Then, the autonomy must be assessed to have an adequate ability to provide the safety relevant controllability assumed in the vehicle design, or an alternate safety argument must be made.

For cases in which the controllability assumptions are not available, there are at least two approaches that should both be used by a prudent design team. First, FMEA, HAZOP, and other appropriate analyses should be performed on vehicle components and safety relevant functions to ensure that the autonomy can react in a safe way to malfunctions. Such an analysis will likely struggle with whether it is safe to assume that the worst types of malfunctions will be adequately mitigated by the vehicle without autonomy intervention.

Second, defects reported on comparable production vehicles should be considered as credible malfunctions of the non-autonomous portions of any vehicle control system since they have already happened in production systems. Such malfunctions include issues such as the drivetrain reporting the opposite of the current direction of motion, uncommanded acceleration, significant braking lag, loss of headlights, and so on (Koopman 2018a).
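One lightweight way to manage this is a malfunction catalog that maps each credible vehicle malfunction to the mitigation the autonomy is claimed (and verified) to provide, with anything unmapped flagged as a safety case gap. The sketch below is a hypothetical illustration; the malfunction names and mitigations are examples I have invented, not a complete or authoritative list:

```python
# Hypothetical malfunction-catalog sketch: every credible vehicle malfunction
# must map to a verified autonomy mitigation, or it is reported as a gap.

CREDIBLE_MALFUNCTIONS = [
    "loss_of_brake_power_assist",
    "tire_blowout_at_speed",
    "drivetrain_reports_wrong_direction",
    "uncommanded_acceleration",
    "headlight_failure_at_speed",
]

# Mitigations the autonomy team claims it provides and has verified.
VERIFIED_MITIGATIONS = {
    "loss_of_brake_power_assist": "increase brake actuator command authority",
    "tire_blowout_at_speed": "stability control plus controlled stop maneuver",
    "uncommanded_acceleration": "independent propulsion cutoff path",
}

def coverage_gaps(malfunctions, mitigations):
    """Return the malfunctions that have no verified autonomy mitigation."""
    return [m for m in malfunctions if m not in mitigations]

if __name__ == "__main__":
    for gap in coverage_gaps(CREDIBLE_MALFUNCTIONS, VERIFIED_MITIGATIONS):
        print(f"SAFETY CASE GAP: no verified mitigation for {gap}")
```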

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • ISO (2011) Road vehicles -- Functional Safety -- Management of functional safety, ISO 26262, International Standards Organization, 2011.
  • Koopman, P., (2018a), Potentially deadly automotive software defects, https://betterembsw.blogspot.com/2018/09/potentially-deadly-automotive-software.html, Sept. 25, 2018.

Wednesday, February 6, 2019

Edge Cases and Autonomous Vehicle Safety -- SSS 2019 Keynote

Here is my keynote talk for SSS 2019 in Bristol UK.

Edge Cases and Autonomous Vehicle Safety

Making self-driving cars safe will require a combination of techniques. ISO 26262 and the draft SOTIF standards will help with the vehicle control and trajectory stages of the autonomy pipeline. Planning might be made safe using a doer/checker architectural pattern that uses deterministic safety envelope enforcement of non-deterministic planning algorithms. Machine-learning based perception validation will be more problematic. We discuss the issue of perception edge cases, including the potentially heavy-tailed distribution of object types and brittleness to slight variations in images. Our Hologram tool injects modest amounts of noise to cause perception failures, identifying brittle aspects of perception algorithms. More importantly, in practice it is able to identify context-dependent perception failures (e.g., false negatives) in unlabeled video.
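The general idea behind this kind of perception stress testing can be sketched in a few lines. This is an illustration of the noise-injection concept only, not the Hologram tool itself; run_detector, the noise level, and the detection comparison are placeholder assumptions:

```python
import numpy as np

def run_detector(image):
    """Placeholder for a real perception stack; returns a set of detected object labels."""
    raise NotImplementedError("plug in your detector here")

def find_brittle_frames(frames, noise_sigma=2.0, trials=10, seed=0):
    """Flag frames where small pixel perturbations change what the detector reports."""
    rng = np.random.default_rng(seed)
    brittle = []
    for idx, frame in enumerate(frames):
        baseline = run_detector(frame)
        for _ in range(trials):
            noisy = frame.astype(np.float32) + rng.normal(0.0, noise_sigma, frame.shape)
            noisy = np.clip(noisy, 0, 255).astype(frame.dtype)
            if run_detector(noisy) != baseline:  # e.g., a pedestrian detection disappears
                brittle.append(idx)
                break
    return brittle
```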



Monday, February 4, 2019

Command Override Anti-Pattern for autonomous system safety

Command Override Anti-Pattern:
Don't let a non-critical "Doer" override the safety-critical "Checker."  If you do this, the Doer can tell the Checker that something is safe when it isn't.


https://pixabay.com/en/base-jump-jump-base-jumper-leaping-1600668/

(First in a series of postings on pitfalls and fallacies we've seen used in safety assurance arguments for autonomous vehicles.)

A common pitfall when identifying safety relevant portions of a system is overlooking the safety relevance of sensors, actuators, software, or some other portion of a system. A common example is creating a system that permits the Doer to perform a command override of the Checker. (In other words, the designers think they are building a Doer/Checker pattern in which only the Checker is safety relevant, but in fact the Doer is safety relevant due to its ability to override the Checker’s functionality.)

The usual claim being made is that a safety layer will prevent any malfunction of an autonomy layer from creating a mishap. This claim generally involves arguing that an autonomy failure will activate a safing function in the safety layer, and that an attempt by the autonomy to do something unsafe will be prevented. A typical (deficient) scheme is to have autonomy failure detected via some sort of self-test combined with the safety layer monitoring an autonomy layer heartbeat signal. It is further argued that the safety layer is designed in conformance with a suitable functional safety standard, and therefore acts as a safety-rated Checker as part of a Doer/Checker pair.

The flaw in that safety argumentation approach is that the autonomy layer has been presumed to fail silent via a combination of self-diagnosed fault detection and lack of heartbeat. However, self-diagnosis and heartbeat detection methods provide only partial fault detection (Hammett 2001). For example, there is always the possibility that the checking function itself has been compromised by a fault that leads to false negatives in checks for incorrect functionality.

As a simple example, a heartbeat signal might be generated by a timer interrupt in the autonomy computer that continues to function even if significant portions of the autonomy software have crashed or are generating incorrect results. In general, such an architectural pattern is unsafe because it permits a non-silent failure in the autonomy layer to issue an unsafe vehicle trajectory command that overrides the safety layer. Fixing this fault requires making the autonomy layer safety-critical, which defeats a primary purpose of using a Doer/Checker architecture.
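A toy sketch of why a heartbeat proves liveness of a timer but not correctness of the autonomy (all names and timings here are illustrative assumptions, not any particular vehicle's design):

```python
import threading
import time

class ToyAutonomyComputer:
    """Illustrative only: the heartbeat comes from a timer thread that keeps
    running even after the planner has silently stopped working."""

    def __init__(self):
        threading.Thread(target=self._heartbeat, daemon=True).start()
        threading.Thread(target=self._planner, daemon=True).start()

    def _heartbeat(self):
        while True:
            print("HEARTBEAT")      # this is all the safety layer monitors
            time.sleep(0.1)

    def _planner(self):
        time.sleep(0.2)
        return                      # planner silently stops producing trajectories

if __name__ == "__main__":
    ToyAutonomyComputer()
    time.sleep(1.0)                 # heartbeats keep arriving after the planner dies
```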

In practice, safety layer logic is usually less permissive than the autonomy layer. By less permissive, we mean that it under-approximates the safe state space of the system in exchange for simplifying computation (Machin et al. 2018). As a practical example, the safety layer might leave a larger spatial buffer area around obstacles to simplify computations, resulting in a longer total path length for the vehicle or even denying the vehicle entry into situations such as a tight alleyway that is only slightly larger than the vehicle.
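As a toy illustration of that under-approximation (the buffer size, gap values, and function names are invented for illustration; a real safety envelope is far richer):

```python
VEHICLE_HALF_WIDTH_M = 1.0

def doer_considers_clear(gap_to_obstacle_m: float) -> bool:
    # The planner reasons with a tight, high-fidelity model of the vehicle.
    return gap_to_obstacle_m > VEHICLE_HALF_WIDTH_M

def checker_considers_clear(gap_to_obstacle_m: float, buffer_m: float = 0.5) -> bool:
    # The safety layer adds a fixed buffer to keep its check simple and conservative.
    return gap_to_obstacle_m > VEHICLE_HALF_WIDTH_M + buffer_m

# A 1.3 m gap: the Doer would take the tight alleyway, the Checker refuses.
print(doer_considers_clear(1.3), checker_considers_clear(1.3))   # True False
```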

A significant safety compromise can occur when vehicle designers attempt to increase permissiveness by enabling a non-safety-rated autonomy layer to say “trust me, this is OK” to override the safety layer. This creates a way for a malfunctioning autonomy layer to override the safety layer, again subverting the safety of the Doer/Checker pair architecture.

Eliminating this command override anti-pattern requires that the designers accept that there is an inherent trade-off between permissiveness and simplicity. A simple Checker tends to have limited permissiveness. Increasing permissiveness makes the Checker more complex, increasing the fraction of the system design work that must be done with high integrity. Permitting a lower integrity Doer to override the safety-relevant behavior of a high integrity Checker in an attempt to avoid Checker complexity is unsafe.
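To contrast the anti-pattern with the intended Doer/Checker pattern, here is a hedged sketch; the trajectory interface, command names, and "trust me" flag are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    command: str
    doer_says_trust_me: bool = False   # the override flag that creates the anti-pattern

def checker_considers_safe(traj: Trajectory) -> bool:
    """Stand-in for the high-integrity safety envelope check (conservative by design)."""
    return traj.command in {"slow_in_lane", "stop_in_lane"}

# Anti-pattern: the low-integrity Doer can talk its way past the Checker.
def actuate_unsafe(traj: Trajectory) -> bool:
    return checker_considers_safe(traj) or traj.doer_says_trust_me

# Doer/Checker done properly: the Checker's verdict is final, and the fallback
# action is chosen by the Checker itself.
def actuate_safe(traj: Trajectory) -> str:
    return traj.command if checker_considers_safe(traj) else "stop_in_lane"

if __name__ == "__main__":
    risky = Trajectory("enter_tight_alley_fast", doer_says_trust_me=True)
    print(actuate_unsafe(risky))   # True: a malfunctioning Doer overrides the Checker
    print(actuate_safe(risky))     # stop_in_lane: the Checker cannot be overridden
```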

There are two related pitfalls. The first is a system in which the Checker only covers a subset of the safety properties of the system. This implicitly trusts the Doer to not have certain classes of defects, including potentially requirements defects. If the Checker does not actually check some aspects of safety, then the Doer is in fact safety relevant. The second pitfall is having the Checker supervise a diagnosis operation for a Doer health check. Even if the Doer passes a health check, that does not mean its calculations are correct. At best it means that the Doer is operating as designed – which might be unsafe since the Doer has not been verified with the level of rigor required to assure safety.

We have found it productive to conduct the following thought experiment when evaluating Doer/Checker architectural patterns and other systems that rely upon assuring the integrity of only a subset of safety-related functions. Ask this question: “Assume the Doer (a portion of the system, including software, sensors, and actuators, that are not ‘safety related’) maliciously attempts to compromise safety in the worst way possible, with full and complete knowledge of every aspect of the design. Could it compromise safety?” If the answer is yes, then the Doer is in fact safety relevant, and must be designed to a sufficiently high level of integrity. Attempts to argue that such an outcome is unlikely in practice must be supported by strong evidence.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Hammett, R., (2001) “Design by extrapolation: an evaluation of fault-tolerant avionics,” 20th Conference on Digital Avionics Systems, IEEE, 2001.
  • Machin, M., Guiochet, J., Waeselynck, H., Blanquart, J-P., Roy, M., Masson, L., (2018) “SMOF – a safety monitoring framework for autonomous systems,” IEEE Trans. Systems, Man, and Cybernetics: Systems, 48(5) May 2018, pp. 702-715.


Friday, February 1, 2019

Credible Autonomy Safety Argumentation Paper

Here is our paper on pitfalls in safety argumentation for autonomous systems for SSS 2019.  My keynote talk will mostly be about perception stress testing but I'm of course happy to talk about this paper as well at the meeting.

Credible Autonomy Safety Argumentation
Philip Koopman, Aaron Kane, Jen Black
Carnegie Mellon University and Edge Case Research, Pittsburgh, PA, USA

Abstract   A significant challenge to deploying mission- and safety-critical autonomous systems is the difficulty of creating a credible assurance argument. This paper collects lessons learned from having observed both credible and faulty assurance argumentation attempts, with a primary emphasis on autonomous ground vehicle safety cases. Various common argumentation approaches are described, including conformance to a non-autonomy safety standard, proven in use, field testing, simulation, and formal verification. Of particular note are argumentation faults and anti-patterns that have shown up in numerous safety cases that we have encountered. These observations can help both designers and auditors detect common mistakes in safety argumentation for autonomous systems.

Download the full paper:
https://users.ece.cmu.edu/~koopman/pubs/Koopman19_SSS_CredibleSafetyArgumentation.pdf


Monday, July 16, 2018

A Safe Way to Apply FMVSS Principles to Self-Driving Cars

As the self-driving car industry works to create safer vehicles, it is facing a significant regulatory challenge.  Complying with existing Federal Motor Vehicle Safety Standards (FMVSS) can be difficult or impossible for advanced designs. For conventional vehicles the FMVSS structure helps ensure a basic level of safety by testing some key safety capabilities. However, it might be impossible to run these tests on advanced self-driving cars that lack a brake pedal, steering wheel, or other components required by test procedures.

While there is industry pressure to waive some FMVSS requirements in the name of hastening progress, doing so is likely to result in safety problems. I’ll explain a way out of this dilemma based on the established technique of using safety cases. In brief, auto makers should create an evidence-based explanation as to why they achieve the intended safety goals of current FMVSS regulations even if they can’t perform the tests as written. This does not require disclosure of proprietary autonomous vehicle technology, and does not require waiting for the government to design new safety test procedures.

Why the Current FMVSS Structure Must Change

Consider an example of FMVSS 138, which relates to tire pressure monitoring. At some point many readers have seen a tire pressure telltale light, warning of low tire pressure:

FMVSS 138 Low Tire Pressure Telltale

This light exists because of FMVSS, which specifies tests to make sure that a driver-visible telltale light turns on for under-inflation and blow-out conditions with specified road surface conditions, vehicle speed, and so on.

But what if an unmanned vehicle doesn’t have a driver seat?  Or even a dashboard for mounting the telltale? Should we wait years for the government to develop an alternate self-driving car FMVSS series? Or should we simply waive FMVSS compliance when the tests don’t make sense as written?

Simplistic, blanket waivers are a bad idea. It is said that safety standards such as FMVSS are written in the blood of past victims. Self-driving cars are supposed to improve safety. We shouldn’t grant FMVSS waivers that will result in having more blood spilled to re-learn well understood lessons for self-driving cars.

The weakness of the FMVSS approach is that the tests don’t explicitly capture the “why” of the safety standard. Rather, there is a very prescriptive set of rules, operating in a manner similar to building codes for houses. Like building codes, they can take time to update when new technology appears. But just as it is a bad idea to skip a building inspection on your new house, you shouldn’t let vehicle makers skip FMVSS tests for your new car – self-driving or otherwise. Despite the fear of hindering progress, something must be done to adapt the FMVSS framework to self-driving cars.

A Safety Case Approach to FMVSS

A way to permit rapid progress while still ensuring that we don’t lose ground on basic vehicle safety is to adopt a safety case approach. A safety case is a written explanation of why a system is appropriately safe. Safety cases include: a safety goal, a strategy for meeting the goal, and evidence that the strategy actually works.

To create an FMVSS 138 safety case, a self-driving car maker would first need to identify the safety goals behind that standard. A number of public documents that precede FMVSS 138 state safety goals of detecting low tire pressure and avoiding blowouts. Those goals were, in turn, motivated by dozens of deaths resulting from tire blowouts that provoked the 2000 TREAD Act.

The next step is for the vehicle maker to propose a safety strategy compatible with its product. For example, vehicle software might set internal speed and distance limits in response to a tire failure, or simply pull off the road to await service. The safety case would also propose tests to provide concrete evidence that the safety strategy is effective. For example, instead of demonstrating that a telltale light illuminates, the test might instead show that the vehicle pulls to the side of the road within a certain timeframe when low tire pressure is detected. There is considerable flexibility in safety strategy and evidence so long as the safety goal is adequately met.
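A concrete acceptance test for such a safety case might look like the following sketch. The vehicle_sim interface, the pressure threshold, and the time bound are hypothetical placeholders that a vehicle maker would define and justify in its own safety case:

```python
# Hypothetical acceptance-test sketch for the FMVSS 138 safety goal:
# "on detected tire failure, reach a safe roadside stop within a time bound."

LOW_PRESSURE_KPA = 150        # placeholder threshold the safety case would justify
MAX_RESPONSE_SECONDS = 60     # placeholder time bound the safety case would justify

def test_low_tire_pressure_response(vehicle_sim):
    # vehicle_sim is an assumed simulation/test-bench interface, not a real API.
    vehicle_sim.set_tire_pressure(tire="front_left", kpa=LOW_PRESSURE_KPA - 10)
    vehicle_sim.run(seconds=MAX_RESPONSE_SECONDS)
    assert vehicle_sim.is_stopped_on_shoulder(), \
        "vehicle did not reach a safe roadside stop after tire failure"
    assert vehicle_sim.elapsed_since_fault() <= MAX_RESPONSE_SECONDS
```

The evidence produced by tests like this stands in for the telltale-light demonstration while still tracing back to the same underlying safety goal.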

Regulators will need a process for documenting the safety case for each requested FMVSS deviation. They must decide whether they should evaluate safety cases up front or employ less direct feedback approaches such as post-mishap litigation. Regardless of approach, the safety cases can be made public, because they will describe a way to test vehicles for basic safety, and not the inner workings of highly proprietary autonomy algorithms.

Implementing this approach only requires vehicle makers to do extra work for FMVSS deviations that provide their products with a competitive advantage. Over time, it is likely that a set of standardized industry approaches for typical vehicle designs will emerge, reducing the effort involved. And if an FMVSS requirement is truly irrelevant, a safety case can explain why.

While there is much more to self-driving car safety than FMVSS compliance, we should not be moving backward by abandoning accepted vehicle safety requirements. Instead, a safety case approach will enable self-driving car makers to innovate as rapidly as they like, with a pay-as-you-go burden to justify why their alternative approaches to providing existing safety capabilities are adequate.

Author info: Prof. Koopman has been helping government, commercial, and academic self-driving developers improve safety for 20 years.
Contact: koopman@cmu.edu

Originally published in The Hill 6/30/2018:
http://thehill.com/opinion/technology/394945-how-to-keep-self-driving-cars-safe-when-no-one-is-watching-for-dashboard