
Friday, December 4, 2020

Coverage Driven Metrics (Metrics Episode 5)

Coverage based metrics need to account for both the planning and the perception edge cases, possibly with two separate metrics.

It takes way too many road miles to be able to establish whether a self driving car is safe by brute force. Billions of miles of on-road testing are just not going to happen. 

Sometimes people say, “Well, that’s okay. We can do those billion miles in simulation.” While simulation surely can be helpful, there are two potential issues with this. The first is that simulation has to be shown to predict outcomes on real roads. That’s a topic for a different day, but the simple version is that you have to make sure that what the simulator says actually predicts what will happen on the road.

The second problem, which is what I’d like to talk about this time, is that you need to know what to feed the simulation. 

Consider that if you hypothetically drove a billion miles on the real road, you’re actually doing two things at the same time. The first thing is you’re testing to see how often the system fails. But the second thing, a little more subtle, is you’re exposing the self driving car test platform to a billion miles of situations and weird things. That means the safety claim you’d be making based on that hypothetical exercise is that your car is safe because it did a billion miles safely. 

But you’re tangling up two things with that testing. One is whether the system performs, and the other is what the system has been exposed to. If you do a billion miles of simulation, then sure, you’re checking whether the system does the right thing across a billion miles. But what you might be missing is that billion miles’ worth of weird stuff that happens in the real world.

Think about it. Simulating going around the same block a billion times with the same weather and the same objects doesn’t really prove very much at all. So, you really need a billion miles worth of exposure to the real world in representative conditions that span everything you would actually see if you were driving on the road. In other words, the edge cases are what matter. 

To make this more concrete, there is a story about a self driving car test platform that went to Australia. The first time they encountered kangaroos there was a big problem, because their distance estimation assumed an animal’s feet were on the ground, and that’s not how kangaroos work: they spend much of their time mid-hop. Even if they had simulated a billion miles, if they didn’t have kangaroos in their simulator, they would have never seen that problem coming. But it’s not just kangaroos. There are lots of things that happen every day but are not necessarily included in the self driving car test simulator, and that’s the issue.

A commonly discussed approach to get out of the “let’s do a billion miles” game is the alternative of identifying and taking care of the edge cases one at a time. This is the approach favored by the community that uses a Safety Of The Intended Functionality (SOTIF) methodology, for example, as described in the standard ISO 21448. The idea is to go out, find edge cases, figure out how to mitigate any risk they present, and continue until you’ve found enough of the edge cases that you think it’s okay to deploy. The good part of this approach is that it changes the metrics conversation from lots and lots of miles to instead talking about what percentage of the edge cases you’ve covered. If you think of a notional zoo of all the possible edge cases, well, once you’ve covered them all, then you should be good to go.

This works up to a point. The problem is you don’t actually know what all the edge cases are. You don’t know which edge cases happen only once in a while, ones you didn’t see during testing. This coverage approach works great for things where 90% or 99% is fine.

If there’s a driver in charge of a car and you’re designing a system that helps the driver recover after the driver has made a mistake, and you only do that 90% of the time (just to pick a number), that’s still a win. Nine times out of 10 you help the driver. As long as you’re not causing an accident on the 10th time, it’s all good. But for a self driving car, you’re not helping a driver. You’re actually in charge of getting everything done, so 90% isn’t nearly good enough. You need 99.99...lots of nines. Then you have a problem: if you’re missing even a few things from the edge case zoo that do happen in the real world, you could have a loss event when you hit one of them.
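To put rough numbers on the stakes, here is a back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption (fleet mileage, how many edge cases were missed, how often they arrive, how often an arrival turns into a loss), not data from any real deployment:

# Back-of-the-envelope: why a few missed edge cases matter at fleet scale.
# All numbers are illustrative assumptions, not measurements.

fleet_miles_per_year = 1_000_000_000  # assumed fleet-wide annual exposure
missed_edge_cases = 5                 # assumed cases absent from the zoo
miles_between_arrivals = 10_000_000   # assumed arrival interval per missed case
p_loss_given_arrival = 0.1            # assumed chance an arrival becomes a loss

arrivals_per_year = (fleet_miles_per_year / miles_between_arrivals) * missed_edge_cases
expected_losses = arrivals_per_year * p_loss_given_arrival
print(f"~{arrivals_per_year:.0f} arrivals/year -> ~{expected_losses:.0f} expected loss events/year")

With these made-up numbers, just five missed edge cases produce about 500 arrivals and 50 expected loss events per year across the fleet, even though each individual case is a once-per-10-million-mile rarity.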

That means the SOTIF approach is great when you know or can easily discover the edge cases. But it has a problem with unknown unknowns -- things you didn’t even know you didn’t know because you didn’t see them during testing. 

It’s important to realize there are actually two flavors of edge cases. Most of the discussions happen around scenario planning. Things like geometry: an unprotected left turn; somebody turning in front of you; a pedestrian at a crosswalk. Those sorts of planning type things are one class of edge cases.

But there’s a completely different class of edge case, which is object classification. "What’s that thing that’s yellow and blobby? I don’t know what that is. Is that a person or is that a tarp that’s gotten loose and blowing in the wind? I don’t know." Being able to handle the edge cases for geometry is important. Being able to handle the perception edge cases is also important, but it’s quite different.

If you’re doing coverage based metrics, then your metrics need to account for both the planning and the perception edge cases, possibly with two separate metrics.
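As a minimal sketch of what “two separate metrics” might look like, here is one possible bookkeeping structure in Python. The category names, the example edge cases, and the idea of a growing “zoo” per category are assumptions for illustration, not a prescribed implementation:

# Minimal sketch: tracking planning and perception edge case coverage
# separately. Case names and zoo contents are illustrative assumptions.

class EdgeCaseCoverage:
    def __init__(self, known_cases):
        self.known = set(known_cases)  # edge cases identified so far (the zoo)
        self.covered = set()           # cases with a validated mitigation

    def mark_covered(self, case):
        self.known.add(case)           # discovering a new case grows the zoo
        self.covered.add(case)

    @property
    def coverage(self):
        return len(self.covered) / len(self.known) if self.known else 0.0

planning = EdgeCaseCoverage({"unprotected_left", "cut_in", "crosswalk_pedestrian"})
perception = EdgeCaseCoverage({"kangaroo", "loose_tarp", "low_sun_glare"})
planning.mark_covered("unprotected_left")
perception.mark_covered("kangaroo")
print(f"planning: {planning.coverage:.0%}  perception: {perception.coverage:.0%}")

The point of keeping two separate structures is that a high planning coverage number can’t paper over a low perception coverage number, and vice versa.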

Okay, so the SOTIF coverage approach can certainly help, but it has the limit that you don’t know all the edge cases. Why is that? Well, the explanation is the 90/10 rule. In this case, 90% of the time you hit an edge case, it’s one of the 10% of edge cases that are common and happen every day. The other 90% of edge cases each happen very rarely, say once every 10 million miles, so together they account for only 10% of what you actually see.

The issue is there’s an essentially infinite number of edge cases, such that each one happens very rarely but, in aggregate, they happen often enough to be a problem. This is due to the heavy tail nature of edge cases and of generally weird things in the world. The practical implication is you can look as hard as you want for as long as you want, but you’ll never find all the edge cases. And yet they may be arriving so often that you can’t guarantee an appropriate level of safety even after fixing every single one you found, because it could take far too long to find enough of them if your strategy emphasizes only fixing things you’ve seen in data collection.
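One way to get intuition for this is a small simulation, under the assumption (a stand-in for reality, not a measured distribution) that edge case arrival frequencies follow a Zipf-like heavy tail:

# Sketch: fixing every edge case you observe still leaves residual risk when
# arrival frequencies are heavy-tailed. Pool size, Zipf exponent, and test
# mileage are illustrative assumptions.
import itertools
import random

random.seed(0)
NUM_CASES = 1_000_000  # assumed size of the edge case pool
weights = [1 / rank ** 1.1 for rank in range(1, NUM_CASES + 1)]  # Zipf-like tail
cum = list(itertools.accumulate(weights))

TEST_MILES = 200_000  # assume each test mile surfaces one edge case arrival
draws = random.choices(range(NUM_CASES), cum_weights=cum, k=TEST_MILES)
observed = set(draws)

# Suppose every observed case gets fixed. What fraction of future arrivals
# still comes from cases we never saw, and therefore never fixed?
residual = 1.0 - sum(weights[i] for i in observed) / cum[-1]
print(f"fixed {len(observed):,} cases; {residual:.1%} of future arrivals remain unfixed")

The exact residual percentage depends on the assumed tail, but the qualitative behavior is robust: doubling the test mileage only slowly chips away at the unfixed fraction.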

Going back to closing the loop with simulation, what this means is that if you want to simulate a billion miles worth of operation to prove you’re a billion miles worth of safe, you need a billion miles worth of actual real world data to know that you’ve seen enough of the rare edge cases that, statistically, things will probably work out. We’re back to needing a billion miles of data on the exact sensor suite you’re going to deploy, which is not such a simple thing. What might be able to help is ways to sift through data and identify generic versions of the various edge cases so you can put them in a simulation. Even then, if the rare edge cases for the second billion miles are substantially different, it still might not be enough (the heavy tail issue).
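One plausible ingredient of that sifting is automated novelty detection over scenario descriptions. Here is a deliberately tiny sketch: scenarios are reduced to feature vectors, and anything far from everything previously seen is flagged as a candidate for the edge case zoo. The features, the distance metric, and the threshold are all assumptions for illustration; a real pipeline would work on learned embeddings at much larger scale:

# Sketch: flag candidate novel edge cases by distance to previously seen
# scenario feature vectors. Features and threshold are illustrative.
import math

def looks_novel(candidate, seen, threshold=0.3):
    """True if the candidate is far from every scenario seen so far."""
    return min(math.dist(candidate, s) for s in seen) > threshold

seen = [(0.10, 0.20), (0.15, 0.25), (0.90, 0.80)]  # past scenario features
print(looks_novel((0.12, 0.22), seen))  # False: near a known scenario
print(looks_novel((0.50, 0.90), seen))  # True: candidate for the zoo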

The takeaway from all this is that doing simulation and analysis to make sure you’ve covered all the edge cases you know about is crucial to being able to build a self driving car, but it’s not quite enough. What you want is a metric that gives you the coverage of the perception edge cases and the coverage of the scenario and planning edge cases. When you’ve covered everything you know about, that’s great, but it’s not the only thing you need to think about when deciding if you’re safe enough to deploy.

If you have those coverage metrics, one way you can measure progress is by looking at how often surprises happen. How often do you discover a new edge case for perception? How often do you discover a new edge case for planning?

When you get to the point that edge cases are arriving very infrequently, or maybe you’ve gone a million miles and haven’t seen one, that means there’s probably not a lot of utility in accumulating more miles and more data, because you’re getting diminishing returns. The important thing is that this does not mean you’ve got them all. It means you’ve covered all the ones you know about and it’s becoming too expensive to discover new edge cases. When you hit that point, you need another plan to assure safety beyond just coverage of edge cases via a SOTIF approach.
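One hedged way to put a number on “how often do surprises happen” is the classical Good-Turing estimate: the probability that the next observation is something never seen before is approximately the fraction of observations so far that were seen exactly once. The log format below is an assumption for illustration:

# Sketch: Good-Turing estimate of the chance that the next observed scenario
# is a brand-new edge case. The observation log is an illustrative input.
from collections import Counter

def prob_next_is_new(case_log):
    """Good-Turing: P(next case unseen) ~= (# cases seen exactly once) / N."""
    counts = Counter(case_log)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(case_log)

log = ["cut_in", "kangaroo", "cut_in", "loose_tarp", "cut_in", "low_sun_glare"]
print(f"estimated chance the next case is new: {prob_next_is_new(log):.0%}")

When that estimate gets small and stays small, you are in the diminishing-returns regime described above, which is exactly when you need the additional plan beyond edge case coverage.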

Summing up, metrics that have to do with perception and planning edge cases are an important piece of self driving car safety, but you need to do something beyond that to handle the unknown unknowns.

Wednesday, March 27, 2019

Missing Rare Events in Autonomous Vehicle Simulation

Missing Rare Events in Simulation:
A highly accurate simulation and system model doesn't solve the problem of what scenarios to simulate. If you don't know what edge cases to simulate, your system won't be safe.

It is common, and generally desirable, to use vehicle-level simulation rather than on-road operation as a proxy field testing strategy. Simulation offers a number of potential advantages over field testing of a real vehicle, including lower marginal cost per mile, better scalability, and reduced risk to the public from testing. Ultimately, simulation is based upon data that generates scenarios used to exercise the system under test, commonly called the simulation workload. The validity of the simulation workload is just as relevant as the validity of the simulation models and software.

Simulation-based validation is often accomplished with a weighting of scenarios that is intentionally different than the expected operational profile. Such an approach has the virtue of being able to exercise corner cases and known rare events with less total exposure than would be required by waiting for such situations to happen by chance in real-world testing (Ding 2017). To the extent that corner cases and known rare events are intentionally induced in physical vehicle field testing or closed course testing, those amount to simulation in that the occurrence of those events is being simulated for the benefit of the test vehicle.
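The reweighting idea can be made concrete with a small sketch. A rare scenario is deliberately oversampled in the simulation workload, and each observed failure is then weighted by the ratio of its real-world probability to its simulated probability, recovering an estimate of the operational failure rate. All of the rates below are illustrative assumptions:

# Sketch: importance-weighted failure rate estimation. A rare scenario is
# oversampled in simulation, then failures are reweighted back to the
# operational profile. All rates are illustrative assumptions.
import random

random.seed(0)
P_REAL = 1e-4         # assumed real-world per-run probability of the rare scenario
P_SIM = 0.5           # deliberately inflated probability in the simulation workload
P_FAIL_RARE = 0.02    # assumed failure rate given the rare scenario
P_FAIL_COMMON = 1e-6  # assumed failure rate otherwise

RUNS = 100_000
weighted_failures = 0.0
for _ in range(RUNS):
    rare = random.random() < P_SIM
    if random.random() < (P_FAIL_RARE if rare else P_FAIL_COMMON):
        # importance weight = real-world probability / simulation probability
        weighted_failures += (P_REAL / P_SIM) if rare else (1 - P_REAL) / (1 - P_SIM)

print(f"estimated real-world failure rate: {weighted_failures / RUNS:.1e}")
# Analytic value for comparison:
# P_REAL * P_FAIL_RARE + (1 - P_REAL) * P_FAIL_COMMON ~= 3.0e-6

Note that this trick only works if the real-world scenario probabilities are actually known, which circles back to the workload validity issue above.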

A more sophisticated simulation approach should use a simulation “stack” with layered levels of abstraction. High-level, faster simulation can explore system-level issues, while more detailed but slower simulations, bench tests, and other higher fidelity validation approaches are used for subsystems and components (Koopman & Wagner 2018).

Regardless of the mix of simulation approaches, simulation fidelity and realism of the scenarios is generally recognized as a potential threat to validity. The simulation must be validated to ensure that it produces sufficiently accurate results for aspects that matter to the safety case. This might include requiring conformance of the simulation code and model data to a safety-critical software standard.

Even with a conceptually perfect simulation, the question remains as to what events to simulate. Even if simulation were to cover enough miles to statistically assure safety, the question would remain as to whether there are gaps in the types of situations simulated. This corresponds to the representativeness issue with field testing and proven in use arguments. However, representativeness is a more pressing matter if simulation scenarios are being designed as part of a test plan rather than being based solely on statistically significant amounts of collected field data.

Another way to look at this problem is that simulation can remove the need to do field testing for rare events, but it does not determine which rare events matter. All things being equal, simulation does not reduce the number of road miles needed for data collection to observe rare events. Rather, it permits a substantial fraction of data collection to be done with a non-autonomous vehicle. Thus, even if simulating billions of miles is feasible, there needs to be a way to ensure that the test plan and simulation workload exercise all the aspects of a vehicle that would have been exercised in field testing of the same magnitude.

As with the fly-fix-fly anti-pattern, fixing defects identified in simulation requires additional simulation input data to validate the design. Simply re-running the same simulation and fixing bugs until the simulation passes invokes the “pesticide paradox.” (Beizer 1990) This paradox holds that a system which has been debugged to the point that it passes a set of tests can’t be considered completely bug free. Rather, it is simply free of the bugs that the test suite knows how to find, leaving the system exposed to bugs that might involve only very subtle differences from the test suite.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

Wednesday, March 20, 2019

Dealing with Edge Cases in Autonomous Vehicle Validation

Dealing with Edge Cases:
Some failures are neither random nor independent. Moreover, safety is typically more about dealing with unusual cases than with routine operation. This means that brute force testing is likely to miss important edge case safety issues.


A significant limitation to a field testing argument is the assumption of random independent failures inherent in the statistical analysis. Arguing that software failures are random and independent is clearly questionable, since multiple instances of a system will have identical software defects. 

Moreover, arguing that the arrival of exceptional external events is random and independent across a fleet is clearly incorrect in the general case. A few simple examples of correlated events between vehicles in a fleet include:
  • Timekeeping events (e.g. daylight savings time, leap second)
  • Extreme weather (e.g. tornado, tsunami, flooding, blizzard white-out, wildfires) affecting multiple systems in the same geographic area
  • Appearance of novel-looking pedestrians occurring on holidays (e.g. Halloween, Mardi Gras)
  • Security vulnerabilities being attacked in a coordinated way

For life-critical systems, proper operation in typical situations needs to be validated. But this should be a given. Progressing from baseline functionality (a vehicle that can operate acceptably in normal situations) to a safe system (a vehicle that safely handles unusual situations and unexpected situations) requires dealing with unusual cases that will inevitably occur in the deployed fleet.

We define an edge case as a rare situation that will occur only occasionally, but still needs specific design attention to be dealt with in a reasonable and safe way. The quantification of “rare” is relative, and generally refers to situations or conditions that will occur often enough in a full-scale deployed fleet to be a problem but have not been captured in the design or requirements process. (It is understood that the process of identifying and handling edge cases makes them – by definition – no longer edge cases. So in practice the term applies to situations that would not have otherwise been handled had special attempts not been made to identify them during the design and validation process.)

It is useful to distinguish edge cases from corner cases. Corner cases are combinations of normal operational parameters. Not all corner cases are edge cases, and not all edge cases are corner cases. An example of a corner case could be a driving situation with an iced over road, low sun angle, heavy traffic, and a pedestrian in the roadway. This is a corner case since each item in that list ought to be an expected operational parameter, and it is the combination that might be rare. This would be an edge case only if there is some novelty to the combination that produces an emergent effect with system behavior. If the system can handle the combination of factors in a corner case without any special design work, then it’s not really an edge case by our definition. In practice, even difficult-to-handle corner cases that occur frequently will be identified during system design.

Only corner cases that are both infrequent and present novelty due to the combination of conditions are edge cases. It is worth noting that changing geographic location, season of year, or other factors can result in different corner cases being identified during design and test, and leave different sets of edge cases unresolved. Thus, in practice, edge cases that remain after normal system design procedures could differ depending upon the operational design domain of the vehicle, the test plan, and even random chance occurrences of which corner cases happened to appear in training data and field trials.
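Because corner cases are combinations of normal operational parameters, candidate combinations can at least be enumerated mechanically and checked against what testing has actually exercised. The parameter values and the tested set in this sketch are illustrative assumptions:

# Sketch: enumerate corner cases as combinations of ordinary operational
# parameters and flag combinations never exercised in testing.
# Parameter values and the tested set are illustrative assumptions.
from itertools import product

road = ["dry", "wet", "iced"]
sun = ["overcast", "high_sun", "low_sun_angle"]
traffic = ["light", "heavy"]
actors = ["none", "pedestrian_in_roadway"]

tested = {
    ("dry", "overcast", "light", "none"),
    ("iced", "low_sun_angle", "heavy", "pedestrian_in_roadway"),
}

untested = [combo for combo in product(road, sun, traffic, actors)
            if combo not in tested]
print(f"{len(untested)} of {len(road) * len(sun) * len(traffic) * len(actors)} "
      f"combinations not yet exercised")

Of course, such a grid only enumerates corner cases built from expected parameters; edge cases that involve genuinely novel situations or perception surprises will not appear in it, which is the residual problem discussed below.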

Classically an edge case refers to a type of boundary condition that affects inputs or reveals gaps in requirements. More generally, edge cases can be wholly unexpected events, such as the appearance of a unique road sign, or an unexpected animal type on a highway. They can be a corner case that was thought to be impossible, such as an icy road in a tropical climate. They can also be an unremarkable (to a human), non-corner case that somehow triggers an autonomy fault or stumbles upon a gap in training data, such as a light haze that results in perception failure. The thing that makes something an edge case is that it unexpectedly activates a requirements, design, or implementation defect in the system.

There are two implications of the occurrence of such edge cases in safety argumentation. One is that fixing edge cases as they arrive might not improve safety appreciably if the population of edge cases is large, due to the heavy tail distribution problem (Koopman 2018c). This is because removing even a large number of individual defects from an essentially infinite-size pool of rarely activated defects does not materially improve things. Another implication is that the arrival of edge cases might be correlated by date, time, weather, societal events, micro-location, or combinations of these triggers. Such a correlation can invalidate an assumption that activation of a safety defect will result in only small losses between the time the defect first activates and the time a fix can be produced. (Such correlated mishaps can be thought of as the safety equivalent of a “zero day attack” from the security world.)

It is helpful to identify edge cases to the degree possible within the constraints of the budget and resources available to a project. This can be partially accomplished via corner case testing (e.g. Ding 2017). The strategy here would be to test essentially all corner cases to flush out any that happen to present special problems that make them edge cases. However, some edge cases also require identifying likely novel situations beyond combinations of ordinary and expected scenario components. And other edge cases are exceptional to an autonomous system, but not obviously corner cases in the eyes of a human test designer.

Ultimately, it is unclear if it can ever be shown that all edge cases have been identified and corresponding mitigations designed into the system. (Formal methods could help here, but the question would be whether any assumptions that needed to be made to support proofs were themselves vulnerable to edge cases.) Therefore, for immature systems it is important to be able to argue that inevitable edge cases will be dealt with in a safe way frequently enough to achieve an appropriate level of safety. One potential argumentation approach is to aggressively monitor and report unusual operational scenarios and proactively respond to near misses and incidents before a similar edge case can trigger a loss event, arguing that the probability of a loss event from unhandled edge cases is sufficiently low. Such an argument would have to address potential issues from correlated activation of edge cases.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Koopman, P. (2018c), “The Heavy Tail Safety Ceiling,” Automated and Connected Vehicle Systems Testing Symposium, June 2018.
  • Ding, Z., “Accelerated evaluation of automated vehicles,” http://www-personal.umich.edu/~zhaoding/accelerated-evaluation.html, accessed 10/15/2017.