Showing posts with label heavy tail.

Wednesday, March 20, 2019

Dealing with Edge Cases in Autonomous Vehicle Validation

Dealing with Edge Cases:
Some failures are neither random nor independent. Moreover, safety is typically more about dealing with unusual cases than with everyday operation. This means that brute force testing is likely to miss important edge case safety issues.


A significant limitation of a field testing argument is the assumption of random independent failures inherent in the statistical analysis. Arguing that software failures are random and independent is clearly questionable, since multiple instances of a system will have identical software defects.

Moreover, arguing that the arrival of exceptional external events is random and independent across a fleet is clearly incorrect in the general case. A few simple examples of correlated events between vehicles in a fleet include:
  • Timekeeping events (e.g. daylight savings time, leap second)
  • Extreme weather (e.g. tornado, tsunami, flooding, blizzard white-out, wildfires) affecting multiple systems in the same geographic area
  • Appearance of novel-looking pedestrians on holidays (e.g. Halloween, Mardi Gras)
  • Security vulnerabilities being attacked in a coordinated way
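
To make the correlation concern concrete, here is a minimal simulation sketch (my illustration, not from the paper) comparing a fleet whose failures arrive independently against the same fleet hit by a single shared trigger such as a timekeeping bug. The fleet size, daily failure probability, and affected fraction are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
FLEET, DAYS = 10_000, 365
P_FAIL = 1e-4                       # assumed per-vehicle, per-day failure probability

# Independent model: failures scatter thinly across the fleet and the year.
independent = rng.binomial(FLEET, P_FAIL, size=DAYS)

# Correlated model: same background rate, plus one shared trigger day
# (e.g., a daylight savings defect) that affects an assumed 10% of the fleet at once.
correlated = rng.binomial(FLEET, P_FAIL, size=DAYS)
correlated[86] += int(FLEET * 0.10)

print("worst single day, independent model:", independent.max())   # a handful of losses
print("worst single day, correlated model: ", correlated.max())    # on the order of 1,000
```

Most days look the same in both models; the difference is that the correlated model concentrates many losses into a single day, which is exactly the clustering that a random-independent statistical argument cannot account for.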

For life-critical systems, proper operation in typical situations needs to be validated. But this should be a given. Progressing from baseline functionality (a vehicle that can operate acceptably in normal situations) to a safe system (a vehicle that safely handles unusual and unexpected situations) requires dealing with unusual cases that will inevitably occur in the deployed fleet.

We define an edge case as a rare situation that will occur only occasionally, but still needs specific design attention to be dealt with in a reasonable and safe way. The quantification of “rare” is relative, and generally refers to situations or conditions that will occur often enough in a full-scale deployed fleet to be a problem but have not been captured in the design or requirements process. (It is understood that the process of identifying and handling edge cases makes them – by definition – no longer edge cases. So in practice the term applies to situations that would not have otherwise been handled had special attempts not been made to identify them during the design and validation process.)

It is useful to distinguish edge cases from corner cases. Corner cases are combinations of normal operational parameters. Not all corner cases are edge cases, nor are all edge cases corner cases. An example of a corner case could be a driving situation with an iced over road, low sun angle, heavy traffic, and a pedestrian in the roadway. This is a corner case since each item in that list ought to be an expected operational parameter, and it is the combination that might be rare. This would be an edge case only if there is some novelty to the combination that produces an emergent effect on system behavior. If the system can handle the combination of factors in a corner case without any special design work, then it’s not really an edge case by our definition. In practice, even difficult-to-handle corner cases that occur frequently will be identified during system design.

Only corner cases that are both infrequent and present novelty due to the combination of conditions are edge cases. It is worth noting that changing geographic location, season of year, or other factors can result in different corner cases being identified during design and test, and leave different sets of edge cases unresolved. Thus, in practice, edge cases that remain after normal system design procedures could differ depending upon the operational design domain of the vehicle, the test plan, and even random chance occurrences of which corner cases happened to appear in training data and field trials.

Classically an edge case refers to a type of boundary condition that affects inputs or reveals gaps in requirements. More generally, edge cases can be wholly unexpected events, such as the appearance of a unique road sign, or an unexpected animal type on a highway. They can be a corner case that was thought to be impossible, such as an icy road in a tropical climate. They can also be an unremarkable (to a human), non-corner case that somehow triggers an autonomy fault or stumbles upon a gap in training data, such as a light haze that results in perception failure. The thing that makes something an edge case is that it unexpectedly activates a requirements, design, or implementation defect in the system.

There are two implications of the occurrence of such edge cases in safety argumentation. One is that fixing edge cases as they arrive might not improve safety appreciably if the population of edge cases is large, due to the heavy tail distribution problem (Koopman 2018c). This is because removing even a large number of individual defects from an essentially infinite-size pool of rarely activated defects does not materially improve things. Another implication is that the arrival of edge cases might be correlated by date, time, weather, societal events, micro-location, or combinations of these triggers. Such a correlation can invalidate an assumption that activation of a safety defect will result in only a small number of losses between the time the defect first activates and the time a fix can be deployed. (Such correlated mishaps can be thought of as the safety equivalent of a “zero day attack” from the security world.)
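
A small numerical sketch makes the heavy tail point concrete (this is my illustration; the pool size and the Zipf-like distribution of activation rates are assumptions, not measurements):

```python
import numpy as np

N_DEFECTS = 1_000_000                 # assumed size of the latent edge-case defect pool
ranks = np.arange(1, N_DEFECTS + 1)
rates = 1.0 / ranks                   # Zipf-like activation rates: a few common, most very rare
rates /= rates.sum()                  # normalize so the total activation rate is 1.0

for n_fixed in (0, 100, 10_000, 100_000):
    # Field experience surfaces the most frequently activated defects first,
    # so "fixing" removes the n_fixed highest-rate defects from the pool.
    residual = rates[n_fixed:].sum()
    print(f"fixed {n_fixed:>7} defects -> {residual:.2f} of the original activation rate remains")
```

Under these assumptions, fixing the 100 most frequently seen defects still leaves roughly two thirds of the original activation rate, and fixing 100,000 still leaves about one sixth, which is why fixing edge cases as they arrive does not, by itself, drive risk down appreciably.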

It is helpful to identify edge cases to the degree possible within the constraints of the budget and resources available to a project. This can be partially accomplished via corner case testing (e.g. Ding 2017). The strategy here would be to test essentially all corner cases to flush out any that happen to present special problems that make them edge cases. However, some edge cases also require identifying likely novel situations beyond combinations of ordinary and expected scenario components. And other edge cases are exceptional to an autonomous system, but not obviously corner cases in the eyes of a human test designer.
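
As a rough illustration of what "test essentially all corner cases" means in practice, here is a minimal enumeration sketch; the parameter names and values are my own illustrative assumptions rather than a real operational design domain specification.

```python
from itertools import product

# Each list holds ordinary, expected values of one operational parameter.
road_surface = ["dry", "wet", "icy"]
lighting     = ["daylight", "low sun angle", "night"]
traffic      = ["light", "heavy"]
pedestrian   = ["none", "in roadway"]

# Every combination of ordinary parameters is a corner case candidate.
corner_cases = list(product(road_surface, lighting, traffic, pedestrian))
print(f"{len(corner_cases)} combinations to exercise")     # 36 for this toy example

# Only combinations that turn out to need special design attention are edge cases.
for case in corner_cases[:3]:
    print(case)
```

Even this toy example produces 36 combinations, and each added parameter multiplies the count, which is one reason corner case testing can only flush out some edge cases and the novel, autonomy-specific situations described above still need separate attention.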

Ultimately, it is unclear if it can ever be shown that all edge cases have been identified and corresponding mitigations designed into the system. (Formal methods could help here, but the question would be whether any assumptions that needed to be made to support proofs were themselves vulnerable to edge cases.) Therefore, for immature systems it is important to be able to argue that inevitable edge cases will be dealt with in a safe way frequently enough to achieve an appropriate level of safety. One potential argumentation approach is to aggressively monitor and report unusual operational scenarios and proactively respond to near misses and incidents before a similar edge case can trigger a loss event, arguing that the probability of a loss event from unhandled edge cases is sufficiently low. Such an argument would have to address potential issues from correlated activation of edge cases.
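
One way to picture the monitoring approach in that argument is as a simple field monitor that reports any scenario whose feature combination was never covered during design and validation. The feature tuples and catalog below are hypothetical placeholders, not a real scenario taxonomy.

```python
from collections import Counter

# Hypothetical catalog of scenario descriptors covered during design and validation.
validated = Counter({
    ("dry", "daylight", "light traffic"): 1_000,
    ("wet", "daylight", "heavy traffic"): 250,
    ("dry", "night", "light traffic"): 400,
})

def monitor(scenario: tuple) -> None:
    """Log the scenario; flag it for engineering review if it was never validated."""
    if scenario not in validated:
        # In a deployed fleet this would file a near-miss / incident report so the
        # potential edge case can be analyzed before it triggers a loss event.
        print("novel scenario - report for review:", scenario)
    validated[scenario] += 1

monitor(("dry", "daylight", "light traffic"))    # previously validated: no report
monitor(("icy", "low sun", "heavy traffic"))     # novel combination: reported
```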

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Koopman, P. (2018c) "The Heavy Tail Safety Ceiling," Automated and Connected Vehicle Systems Testing Symposium, June 2018.
  • Ding, Z. (2017) "Accelerated evaluation of automated vehicles," http://www-personal.umich.edu/~zhaoding/accelerated-evaluation.html, accessed 10/15/2017.






Monday, March 11, 2019

The Fly-Fix-Fly Antipattern for Autonomous Vehicle Validation

Fly-Fix-Fly Antipattern:
Brute force mileage accumulation to fix all the problems you see on the road won't get you all the way to being safe. For so very many reasons...



This A400M crash was caused by corrupted software calibration parameters.
In practice, autonomous vehicle field testing has typically not been a deployment of a finished product to demonstrate a suitable integrity level. Rather, it is an iterative development approach that can amount to debugging followed by attempting to achieve improved system dependability over time based on field experience. In the aerospace domain this is called a “fly-fix-fly” strategy. The system is operated and each problem that crops up in operations is fixed in an attempt to remove all defects. While this can help improve safety, it is most effective for a rigorously engineered system that is nearly perfect to begin with, and for which the fixes do not substantially alter the vehicle’s design in ways that could produce new safety defects.


It's a moving target: A significant problem with argumentation based on fly-fix-fly occurs when designers try to take credit for previous field testing despite a major change. When arguing field testing safety, it is generally inappropriate to take credit for field experience accumulated before the last change to the system that can affect safety-relevant behaviour. (The “small change” pitfall has already been discussed in another blog post.)

Discounting Field Failures: There is no denying the intuitive appeal of an argument that the system is being tested until all bugs have been fixed. However, this is a fundamentally flawed argument, because it amounts to saying that no matter how many failures are seen during field testing, none of them “count” if a story can be concocted as to how they were non-reproducible, fixed via bug patches, and so on.

To the degree such an argument could be credible, it would have to find and fix the root cause of essentially all field failures. It would further have to demonstrate (somehow) that the field failure root cause had been correctly diagnosed, which is no small thing, especially if the fix involves retraining a machine-learning based system. 

Heavy Tail: Additionally, it would have to argue that a sufficiently high fraction of field failures have actually been encountered, resulting in a sufficiently low probability of encountering novel additional failures during deployment.  (Probably this would be an incorrect argument if, as expected, field failure types have a heavy tail distribution.)

Fault Reinjection: A fly-fix-fly argument must also address the fault reinjection problem. Fault reinjection occurs when a bug fix introduces a new bug as a side effect of that fix. Ensuring that this has not happened via field testing alone requires resetting the field testing clock to zero after every bug fix. (Hybrid arguments that include rigorous engineering analysis are possible, but simply assuming no fault reinjection without any supporting evidence is not credible.)

It Takes Too Long: It is difficult to believe an argument that claims that a fly-fix-fly process alone (without any rigorous engineering analysis to back it up) will identify and fix all safety-relevant bugs. If there is a large population of bugs that activate infrequently compared to the amount of field testing exposure, such a claim would clearly be incorrect. Generally speaking, fly-fix-fly requires an infeasible amount of operation to achieve the ultra-dependable results required for life-critical systems, and typically rests on unrealistic assumptions, such as assuming that no new faults are injected when fixing a fault identified in testing (Littlewood and Strigini 1993). A specific issue is the matter of edge cases, discussed in an upcoming blog post.
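
To see why the amount of operation is infeasible, consider the standard zero-failure demonstration bound: showing a failure rate no worse than λ with confidence C requires roughly T = -ln(1 - C)/λ hours of failure-free operation. The worked example below is my illustration with assumed target rates; it is not a calculation from the paper.

```python
import math

def hours_required(target_rate_per_hour: float, confidence: float = 0.95) -> float:
    """Failure-free exposure needed to demonstrate the target rate at the given confidence."""
    return -math.log(1.0 - confidence) / target_rate_per_hour

for lam in (1e-7, 1e-8, 1e-9):        # illustrative catastrophic-failure rate targets
    t = hours_required(lam)
    print(f"target {lam:.0e}/hr -> {t:.1e} failure-free hours (~{t / 8760:,.0f} vehicle-years)")
```

At a target of 1e-9 failures per hour this works out to roughly three billion failure-free hours, and because of fault reinjection that exposure cannot simply be carried over across bug fixes.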

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Littlewood, B., Strigini, L. (1993) "Validation of Ultra-High Dependability for Software-Based Systems," Communications of the ACM, 36(11):69-80, November 1993.




Wednesday, February 6, 2019

Edge Cases and Autonomous Vehicle Safety -- SSS 2019 Keynote

Here is my keynote talk for SSS 2019 in Bristol UK.

Edge Cases and Autonomous Vehicle Safety

Making self-driving cars safe will require a combination of techniques. ISO 26262 and the draft SOTIF standards will help with the vehicle control and trajectory stages of the autonomy pipeline. Planning might be made safe using a doer/checker architectural pattern that uses deterministic safety envelope enforcement of non-deterministic planning algorithms. Machine-learning based perception validation will be more problematic. We discuss the issue of perception edge cases, including the potentially heavy-tailed distribution of object types and brittleness to slight variations in images. Our Hologram tool injects modest amounts of noise to cause perception failures, identifying brittle aspects of perception algorithms. More importantly, in practice it is able to identify context-dependent perception failures (e.g., false negatives) in unlabeled video.
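
For readers who want a feel for the noise-injection idea, here is a generic sketch of the core loop (this is not the Hologram tool; the detector callable and noise level are assumed placeholders): perturb a frame slightly, re-run perception, and flag any change in the result.

```python
import numpy as np

def probe_brittleness(image: np.ndarray, detector, trials: int = 20, sigma: float = 2.0) -> float:
    """Return the fraction of modest noise perturbations that change the detector's output.

    `detector` stands in for a real perception stack: a callable mapping an image
    array to a comparable result (e.g., a frozenset of detected object labels).
    """
    rng = np.random.default_rng(0)
    baseline = detector(image)
    changed = 0
    for _ in range(trials):
        noisy = np.clip(image + rng.normal(0.0, sigma, image.shape), 0, 255)
        if detector(noisy.astype(image.dtype)) != baseline:
            changed += 1            # a small perturbation flipped the perception result
    return changed / trials
```

Frames whose results change under such modest perturbations are candidate brittle inputs worth adding to the validation set.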



Saturday, July 14, 2018

AVS 2018 Panel Session

It was great to have the opportunity to participate in a panel on autonomous vehicle validation and safety at AVS in San Francisco this past week.  Thanks especially to Steve Shladover for organizing such an excellent forum for discussion.

The discussion was the super-brief version. If you want to dig deeper, you can find much more complete slide decks attached to other blog posts.
The first question was to spend 5 minutes talking about the types of things we do for validation and safety.  Here are my slides from that very brief opening statement.



Thursday, June 28, 2018

Safety Validation and Edge Case Testing for Autonomous Vehicles (Slides)

Here is a slide deck that expands upon the idea that the heavy tail ceiling is a problem for AV validation. It also explains ways to augment image sensor inputs to improve robustness.



Safety Validation and Edge Case Testing for Autonomous Vehicles from Philip Koopman

(If slideshare is blocked for you, try this alternate download source)

Saturday, June 23, 2018

Heavy Tail Ceiling Problem for AV Testing

I enjoyed participating in the AV Benchmarking Panel hosted by Clemson ICAR last week.  Here are my slides and a preprint of my position paper on the Heavy Tail Ceiling problem for AV safety testing.

Abstract
Creating safe autonomous vehicles will require not only extensive training and testing against realistic operational scenarios, but also dealing with uncertainty. The real world can present many rare but dangerous events, suggesting that these systems will need to be robust when encountering novel, unforeseen situations. Generalizing from observed road data to hypothesize various classes of unusual situations will help. However, a heavy tail distribution of surprises from the real world could make it impossible to use a simplistic drive/fail/fix development process to achieve acceptable safety. Autonomous vehicles will need to be robust in handling novelty, and will additionally need a way to detect that they are encountering a surprise so that they can remain safe in the face of uncertainty.

Paper Preprint:
http://users.ece.cmu.edu/~koopman/pubs/koopman18_heavy_tail_ceiling.pdf

Presentation: