Friday, December 4, 2020

Coverage Driven Metrics (Metrics Episode 5)

Coverage based metrics need to account for both the planning and the perception edge cases, possibly with two separate metrics.

It takes way too many road miles to be able to establish whether a self driving car is safe by brute force. Billions of miles of on-road testing are just not going to happen. 

Sometimes people say, “Well, that’s okay. We can do those billion miles in simulation.” While simulation surely can be helpful, there are two potential issues with this. The first is that simulation has to be shown to predict outcomes on real roads. That’s a topic for a different day, but the simple version is you have to make sure that what the simulator says actually predicts what will happen on the road.

The second problem, which is what I’d like to talk about this time, is that you need to know what to feed the simulation. 

Consider that if you hypothetically drove a billion miles on the real road, you’re actually doing two things at the same time. The first thing is you’re testing to see how often the system fails. But the second thing, a little more subtle, is you’re exposing the self driving car test platform to a billion miles of situations and weird things. That means the safety claim you’d be making based on that hypothetical exercise is that your car is safe because it did a billion miles safely. 

But you’re tangling up two things with that testing. One is whether the system performs, and the other is what the system has been exposed to. If you do a billion miles of simulation, then sure, you’re exposing the system to a billion miles of whether it does the right thing. But what you might be missing is that billion miles of weird stuff that happens in the real world.

Think about it. Simulating going around the same block a billion times with the same weather and the same objects doesn’t really prove very much at all. So, you really need a billion miles worth of exposure to the real world in representative conditions that span everything you would actually see if you were driving on the road. In other words, the edge cases are what matter. 

To make this more concrete, there is a story about a self driving car test platform that went to Australia. The first time they encountered kangaroos there was a big problem, because their distance estimation assumed that an animal’s feet were on the ground, and that’s not how kangaroos work. Even if they had simulated a billion miles, if they didn’t have kangaroos in their simulator, they would have never seen that problem coming. But it’s not just kangaroos. There are lots of things that happen every day but are not necessarily included in the self driving car test simulator, and that’s the issue.
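
To see why that ground-contact assumption bites, here is a minimal sketch of flat-road, pinhole-camera distance estimation. This is not any particular company’s perception code; the geometry, numbers, and function name are all invented for illustration.

```python
def ground_plane_distance(bbox_bottom_row, camera_height_m,
                          focal_length_px, horizon_row):
    """Estimate distance to an object by assuming the bottom of its bounding
    box is where it touches the road (flat-road pinhole model). Rows are
    counted from the top of the image, so a grounded object appears below
    the horizon row."""
    pixels_below_horizon = bbox_bottom_row - horizon_row
    if pixels_below_horizon <= 0:
        raise ValueError("object bottom is at or above the horizon")
    # Similar triangles: distance = camera height * focal length / pixel offset
    return camera_height_m * focal_length_px / pixels_below_horizon

# Assumed camera geometry (all values invented for illustration).
horizon_row = 540      # horizon location in a 1080-row image
focal_px = 1400        # focal length in pixels
camera_height = 1.5    # camera height above the road, in meters

# The same animal standing on the road vs. in mid-hop: the airborne bounding
# box bottom sits higher in the image, so the flat-road assumption makes the
# animal look roughly twice as far away as it really is.
print(ground_plane_distance(740, camera_height, focal_px, horizon_row))  # ~10.5 m
print(ground_plane_distance(640, camera_height, focal_px, horizon_row))  # ~21.0 m
```

An animal in mid-hop reads as roughly twice as far away as it really is, which is exactly the kind of error that never shows up unless something kangaroo-like is in your simulator or your test data.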

A commonly discussed way to get out of the “let’s do a billion miles” game is to instead identify and take care of the edge cases one at a time. This is the approach favored by the community that uses a Safety Of The Intended Functionality (SOTIF) methodology, for example as described in the standard ISO 21448. The idea is to go out, find edge cases, figure out how to mitigate any risk they present, and continue until you’ve found enough of the edge cases that you think it’s okay to deploy. The good part of this approach is that it changes the metrics conversation from lots and lots of miles to instead talking about what percentage of the edge cases you’ve covered. If you think of a notional zoo of all the possible edge cases, well, once you’ve covered them all, then you should be good to go.
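
As a rough sketch of what that coverage conversation can look like in practice (the catalog entries and their statuses below are invented for illustration, not taken from ISO 21448):

```python
# A toy "edge case zoo": each known edge case is either mitigated or not.
# The catalog contents are hypothetical, not a real SOTIF work product.
edge_case_zoo = {
    "unprotected_left_turn":       True,   # mitigation in place and verified
    "pedestrian_at_crosswalk":     True,
    "vehicle_cutting_in":          True,
    "loose_tarp_in_roadway":       False,  # identified, mitigation still open
    "animal_with_feet_off_ground": False,
}

def known_edge_case_coverage(zoo):
    """Fraction of the *known* edge cases that have been mitigated.
    Note what this metric cannot see: edge cases missing from the catalog
    entirely -- the unknown unknowns."""
    return sum(zoo.values()) / len(zoo)

print(f"Known edge case coverage: {known_edge_case_coverage(edge_case_zoo):.0%}")  # 60%
```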

This works up to a point. The problem is you don’t actually know what all the edge cases are. You don’t know which edge cases happen only once in a while and so were never seen during your testing. This coverage approach works great for things where 90% or 99% is fine. 

If there’s a driver in charge of a car and you’re designing a system that helps the driver recover after the driver has made a mistake, and you only do that 90% of the time (just to pick a number), that’s still a win. Nine times out of 10 you help the driver. As long as you’re not causing an accident on the 10th time, it’s all good. But for a self driving car, you’re not helping a driver. You’re actually in charge of getting everything done, so 90% isn’t nearly good enough. You need 99.99...lots of nines. Then you have a problem: if you’re missing even a few things from the edge case zoo that do happen in the real world, you could have a loss event when you hit one of them.
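
To put rough numbers on why that matters, here is a back-of-the-envelope calculation. The arrival and handling rates are purely illustrative assumptions, not measured data.

```python
# How many unhandled edge cases per million miles slip through, depending on
# how good the system is. All rates are assumed for illustration only.

edge_cases_per_million_miles = 100.0    # assumed arrival rate of tricky situations

def unhandled_per_million_miles(arrival_rate, handled_fraction):
    """Edge cases per million miles that the system fails to handle."""
    return arrival_rate * (1.0 - handled_fraction)

# Driver assistance at 90%: the ten misses per million miles fall back on the
# human driver, so they are not automatically loss events.
print(unhandled_per_million_miles(edge_cases_per_million_miles, 0.90))    # 10.0

# A self driving car at 99.99%: nobody is there to catch the misses, so even
# "lots of nines" still leaves about one unhandled case per 100 million miles.
print(unhandled_per_million_miles(edge_cases_per_million_miles, 0.9999))  # 0.01
```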

That means the SOTIF approach is great when you know or can easily discover the edge cases. But it has a problem with unknown unknowns -- things you didn’t even know you didn’t know because you didn’t see them during testing. 

It’s important to realize there are actually two flavors of edge cases. Most of the discussion happens around scenario planning. Things like geometry: an unprotected left turn; somebody turning in front of you; a pedestrian at a crosswalk. Those sorts of planning-type things are one class of edge cases.

But there’s a completely different class of edge case, which is object classification. "What’s that thing that’s yellow and blobby? I don’t know what that is. Is that a person or is that a tarp that’s gotten loose and blowing in the wind? I don’t know." Being able to handle the edge cases for geometry is important. Being able to handle the perception edge cases is also important, but it’s quite different.

If you’re doing coverage based metrics, then your metrics need to account for both the planning and the perception edge cases, possibly with two separate metrics.

Okay, so the SOTIF coverage approach can certainly help, but it has a limit: you don’t know all the edge cases. Why is that? Well, the explanation is the 90/10 rule. The 90/10 rule in this case is that 90% of the time you have a problem, it’s caused by the 10% of edge cases that are very common and happen every day. When you get out to the stuff that happens very rarely, say once every 10 million miles, well, that’s 90% of the edge cases, but you only see them 10% of the time because each one happens so rarely.

The issue is there’s an essentially infinite number of edge cases such that each one happens very rarely, but in aggregate, they happen often enough to be a problem. This is due to the heavy tail nature of edge cases and generally weird things in the world. The practical implication is you can look as hard as you want for as long as you want, but you’ll never find all the edge cases. And yet they may be arriving so often that you can’t guarantee an appropriate level of safety, even though you fixed every single one you found. That's because it might take too long to find enough to get acceptable safety if you emphasize only fixing things you've seen in data collection.
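
A small calculation can make the heavy tail concrete. Here the k-th most common edge case type is assumed to arrive in proportion to 1/k^1.2, an exponent picked only so the split roughly matches the 90/10 description above; the real distribution is unknown, which is rather the point.

```python
# Heavy-tail illustration: a Zipf-like catalog where the k-th most common
# edge case type arrives in proportion to 1/k**1.2. All numbers are assumed
# to illustrate the 90/10 behavior, not field data.

num_types = 10_000
rates = [1.0 / k**1.2 for k in range(1, num_types + 1)]
total = sum(rates)

common_share = sum(rates[: num_types // 10]) / total  # 10% most common types
rare_share = 1.0 - common_share                       # the other 90% of types

print(f"Top 10% of edge case types: {common_share:.0%} of all encounters")
print(f"Other 90% of edge case types: {rare_share:.0%} of all encounters")
print(f"Share of the single rarest type: {rates[-1] / total:.2e}")
```

Fixing any single rare type changes essentially nothing, yet the rare types collectively still account for a meaningful share of what the vehicle will encounter.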

Going back to closing the loop with simulation, what this means is that if you want to simulate a billion miles worth of operation to prove you’re a billion miles worth of safe, you need a billion miles worth of actual real world data to know that you’ve seen enough of the rare edge cases that, statistically speaking, it probably works out. We’re back to the problem that a billion miles of data on the exact same sensor suite you’re going to deploy is not such a simple thing to get. What might help is a way to sift through data and identify generic versions of the various edge cases so you can put them in a simulation. Even then, if the rare edge cases for the second billion miles are substantially different, it still might not be enough (the heavy tail issue).
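
What that sifting looks like is an open question; here is one hedged sketch of the idea: flag logged events that either confused the perception stack or don’t match anything already in the scenario catalog, and queue them for generalization into simulator scenarios. The record format, labels, and threshold are hypothetical, and a real pipeline would be far more involved.

```python
# Sift logged drive data for candidate edge cases to generalize into
# simulation scenarios. Everything here (record format, labels, threshold)
# is hypothetical and for illustration only.

known_scenario_catalog = {
    "unprotected_left_turn",
    "pedestrian_at_crosswalk",
    "vehicle_cutting_in",
}

drive_log = [  # (scenario label from triage, perception confidence)
    ("pedestrian_at_crosswalk", 0.97),
    ("vehicle_cutting_in", 0.91),
    ("unknown_yellow_blob", 0.41),     # low confidence: person or loose tarp?
    ("animal_feet_off_ground", 0.88),  # confident, but not yet in the catalog
]

def candidate_edge_cases(log, catalog, confidence_floor=0.6):
    """Events worth turning into simulator scenarios: anything perception was
    unsure about, or anything labeled with a scenario not in the catalog."""
    return [(label, conf) for label, conf in log
            if conf < confidence_floor or label not in catalog]

for label, conf in candidate_edge_cases(drive_log, known_scenario_catalog):
    print(f"review for simulation library: {label} (confidence {conf:.2f})")
```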

The takeaway from all this is that doing simulation and analysis to make sure you’ve covered all the edge cases you know about is crucial to being able to build a self driving car, but it’s not quite enough. What you want is a metric that gives you the coverage of the perception edge cases and one that gives you the coverage of the scenario and planning edge cases. When you’ve covered everything you know about, that’s great, but it’s not the only thing you need to think about when deciding if you’re safe enough to deploy.

If you have those coverage metrics, one way you can measure progress is by looking at how often surprises happen. How often do you discover a new edge case for perception? How often do you discover a new edge case for planning? 

When you get to the point that new edge cases are arriving very infrequently, or maybe you’ve gone a million miles and haven’t seen one, there’s probably not a lot of utility in accumulating more miles and more data because you’re getting diminishing returns. The important thing is that this does not mean you’ve got them all. It means you’ve covered all the ones you know about and it’s becoming too expensive to discover new edge cases. When you hit that point, you need another plan to assure safety beyond just coverage of edge cases via a SOTIF approach.
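
A minimal sketch of that kind of surprise-rate tracking, with separate counts for perception and planning edge cases (the mileage bins and discovery counts are invented for illustration):

```python
# A "surprise rate" progress metric: new edge cases discovered per million
# test miles, tracked separately for perception and planning. The mileage
# bins and discovery counts below are invented for illustration.

# (million miles in this period, new perception edge cases, new planning edge cases)
discovery_history = [
    (1.0, 14, 9),
    (1.0,  6, 4),
    (1.0,  2, 2),
    (1.0,  1, 0),
]

for period, (miles, perception, planning) in enumerate(discovery_history, start=1):
    print(f"period {period}: {perception / miles:.1f} perception, "
          f"{planning / miles:.1f} planning new edge cases per million miles")

# A flattening surprise rate means discovery has gotten expensive -- not that
# the edge case zoo is complete.
```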

Summing up, metrics that have to do with perception and planning edge cases are an important piece of self driving car safety, but you need to do something beyond that to handle the unknown unknowns.

For the podcast version of this posting, see: https://archive.org/details/metrics-06-coverage-metrics

Thanks to Jackie Erickson for her support.

