Wednesday, November 25, 2020

Disengagements as a progress metric is a bad idea (Metrics Episode 2)

 We should be worried about road testing safety metrics, not disengagements.

A disengagement happens when the autonomy in a self driving car detects an internal problem, or when a human test driver takes over control of a self driving car test platform because of safety concerns. Self driving car developers have to report these disengagements, for example, to California. The apparent rationale for requiring these reports is that, all things being equal, disengagements per mile might decrease over time as the technology matures. Along those lines, when disengagements eventually reach zero, you might think it’s time to deploy the vehicle without a human test driver. The problem is that this model is much too simplistic and, more importantly, not all things are equal.

Let’s start with some basics. Not all miles are equal. If you wanted to game disengagements, you could do so by driving around an empty block in beautiful weather at 4:00 AM with no traffic, no pedestrians, nothing on the road, around and around in circles. You get a lot of miles. You wouldn’t learn much, but you get a lot of miles. That’s not at all the same as, for example, trying to drive across all 446 bridges in Pittsburgh during a blizzard. Those miles are just not the same. Another potential problem is that not all safety drivers are equal. Some safety drivers will be more prone to be cautious and others less cautious. Hopefully there is rigorous driver screening so that the safety drivers are the right amount of cautious, but in fact, this is still an area the industry is working on. So even with the best intentions, all disengagements might not be equal.
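To make the gaming problem concrete, here is a minimal sketch with made-up numbers (the mileage figures and the per-1,000-mile convention are purely illustrative) showing how a naive disengagements-per-mile rate rewards padding the denominator with easy miles:

```python
# Minimal sketch with made-up numbers: a naive disengagement rate is blind
# to how hard the miles were, so padding the denominator with easy miles
# makes the number look better without any real capability improvement.

def disengagements_per_1000_miles(disengagements, miles):
    """The naive metric: disengagements per 1,000 miles driven."""
    return 1000.0 * disengagements / miles

# Hypothetical month of testing in a challenging urban ODD.
hard_miles, hard_disengagements = 2_000, 40

# Same month, plus 8,000 "easy" miles looped around an empty block at
# 4:00 AM that contribute no new disengagements.
padded_miles = hard_miles + 8_000

print(disengagements_per_1000_miles(hard_disengagements, hard_miles))    # 20.0
print(disengagements_per_1000_miles(hard_disengagements, padded_miles))  # 4.0
```

Same vehicle, same software, and a five-fold “improvement” on paper just from circling the empty block.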

Now what happens next? After this disengagement data is collected, the metrics get published, and that leads the media to turn those published metrics into the great disengagement metric horse race. Pundits opine about which company is in the lead, and companies that are ahead say, "yeah, look at our low disengagement rate" and so on. Now it’s hard to blame people for doing this because the developers operate in such great secrecy; that’s really the only progress metric out there. But it’s not a good metric. In fact, it’s probably a harmful metric. A big concern is that using disengagements as a metric provides strong incentives for behavior that makes things worse instead of better, especially if you’re being judged on it for progress and maybe your next funding round depends on your disengagement metric.

Here’s a problem: the disengagement metric penalizes companies who tell their safety drivers to be extra safe by being extra quick to disengage. So, that means there’s an incentive to tell drivers to give the vehicles a little more slack, which might or might not be as safe as it should be. Now, I’m not saying that people necessarily do this intentionally, but in a very competitive environment, there’s going to be natural pressure to say, well, if it’s on the borderline, let it go to make our numbers look better and probably it’s safe enough. And people might convince themselves of that, even though they’re operating unsafely. 

Another problem is the metric penalizes companies who are working on difficult operational design domains and incentivizes them to chase easy miles. Now again, I’m not saying companies are doing this on purpose, but certainly the incentive is there. In fact, there are good reasons why a company making excellent progress would actually see their disengagements increase rather than decrease. Maybe the company’s expanded its operational design domain to handle more challenging situations. The week that they decide to start operating in rain, I’d imagine the disengagement rate would go up instead of down.

Another reason is maybe safety driver training has been improved and the policies have been changed to improve road test safety at the expense of increased disengagements. I’d love to see that kind of outcome, but it makes the metric look bad. 

Some companies filter the disengagements to say, okay, we’re only going to report the disengagements that count. The problem is that’s a two-edged sword. Sure, it makes sense not to report planned disengagements. If it’s the end of the testing day and you’re going to take the car back to the garage and you turn off autonomy, surely that disengagement should not count.

But because companies are being judged on disengagements, there’s also an incentive to game them a bit. For example, the driver might take over control, and maybe it was a dangerous situation, maybe it wasn’t, but because of the pressure of metrics, the company decides to round down and attribute it to something else when in fact the car probably should have been doing better and the disengagement should have at least partially counted. You might end up with under-reporting of disengagements, and that should be a cause for concern.

Let me give a couple of hypothetical examples of the kind of situations that could lead to this kind of bad outcome. For example, let’s say a company only reports a disengagement if an after-the-fact simulation says yes, the car would have hit something. Then the car goes by a pedestrian and the safety driver disengages because it looks like it’s going to be kind of close to the pedestrian and they don’t want to take a chance. So far, so good. Now let’s say the car would have missed the pedestrian by 10 to 20 feet. Okay, fine. That disengagement probably should not count. But what if it was only going to miss the pedestrian by one inch? Well, it didn’t hit the pedestrian, so you could say, well, that one doesn’t count because we didn’t hit anything. But I’m going to say missing a pedestrian by one inch, that one ought to count.
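To see how much hangs on the filtering rule, here is a small hypothetical sketch of that kind of after-the-fact filter. The miss-distance margin is something I am inventing for illustration; the point is that whoever picks it effectively decides which disengagements count:

```python
# Hypothetical after-the-fact filter: only count a disengagement as
# reportable if the simulated counterfactual shows the vehicle would have
# come closer to something than a chosen margin. The margin is the whole game.

FEET_PER_INCH = 1.0 / 12.0

def counts_as_reportable(simulated_miss_distance_ft, margin_ft):
    """Report the disengagement only if the counterfactual miss was inside the margin."""
    return simulated_miss_distance_ft < margin_ft

# Rule: "only report it if we would have hit something" (margin of zero feet).
print(counts_as_reportable(15.0, margin_ft=0.0))           # False -- missed by 15 feet, fine
print(counts_as_reportable(FEET_PER_INCH, margin_ft=0.0))  # False -- missed by one inch, still "doesn't count"

# Rule with a safety margin that treats a one-inch miss as a reportable near miss.
print(counts_as_reportable(FEET_PER_INCH, margin_ft=3.0))  # True
```

With a margin of zero, the one-inch near miss quietly disappears from the numbers.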

And so without details about how exactly this reporting has been done, we don’t really know what the numbers mean. 

Here’s another hypothetical example. Let’s say a test vehicle runs a red light, but it’s late at night and no one’s around. The driver looks around; the driver’s been told, unfortunately, to make sure the disengagement numbers look good. There’s no cross traffic. The driver says, you know what? I’m just going to let it run the red light because no harm will be done. That’s a situation where the disengagement doesn’t happen and the number looks good, but in that hypothetical scenario, the driver’s been incentivized to do unsafe things. That made-up example brings to a head the real issue here.

Disengagements might be useful input to some parts of an engineering process, but in a hyper-competitive market, they provide all the wrong incentives for road test safety. And really, is progress what we should be worried about?

Do we really want the Departments of Transportation measuring the progress of companies? Their job really has to be keeping people safe on the road. So if the publicly reported data actually provides incentives to do road testing unsafely to make progress look good, that’s a problem. Historically, those kinds of incentives lead to a story that ends badly for everyone.

Well, it’s interesting to know what progress might be made by the industry. But if you really care about safety, the thing you ought to be worried about right now is road testing safety. Now, some of the reported data actually does help with that. For example, crashes being reported -- sure, that actually directly measures road testing safety. But all the disengagement metrics and all the buzz really don’t help road testing safety -- and in fact might be hurting it. They might be putting pressure on the companies to undermine road testing safety just to make their numbers look better.

I would recommend that California and any other government that’s doing this should stop forcing disengagement reporting and instead encourage the reporting of more productive metrics that are about road testing safety, not about the horse race to get a self driving car deployed. This is not a simple ask. This is actually a hard thing to do. But the industry should step up and propose metrics that have to do with road testing safety. That’s going to help them build trust with the public and it’s going to help the government agencies fulfill their responsibility to ensure that the road testing and eventual deployment is done in a safe, responsible manner.

To learn more, we recommend a paper our team published for SAE World Congress titled “Safety Argument Considerations For Road Testing of Autonomous Vehicles.” This paper gives guidance on a safety case for human supervision of road testing.

Number of miles as a self-driving car progress and safety metric (Metrics Episode 1)

Even if you have the best possible safety drivers, every test mile adds some risk. Make sure every mile of road testing is actually doing something important.

You hear people saying that testing for lots and lots of miles must mean some self-driving car company is better than the rest, or at least in the lead in the so-called race to autonomy. But not so fast, there’s more to it than that.

Some companies have millions of miles of road testing experience, and to be sure that’s an impressive accomplishment. If they have that many miles, certainly they have an incentive to boast about it and say, “Look how many miles we have.” And the press often says, “All right, these guys have lots and lots of miles, so somehow they must be ahead.” But miles really doesn’t tell you who’s safer or even necessarily who’s ahead. Miles is mostly a reflection of the resources they have available to deploy a test fleet.

If you have lots of money, you can buy a lot of cars, hire lots of people, and put them out there to rack up the miles. Sure, having lots of resources makes it easier to make progress, but it doesn’t tell you who’s better. In fact, some companies are taking pride in reducing their road testing miles and instead putting those resources into simulation and other engineering activities besides road testing. It’s hard to believe that you’re really ready to deploy if you don’t have any miles or only have a few miles, I get that. And it’s hard to believe that a company without lots of data collection miles is actually seeing all the things it’s going to need to deal with when it picks an operational design domain. But nobody will ever get to the billions of representative miles necessary to say anything compelling about expected safety until after they actually deploy their fleet.
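One way to see why “billions of representative miles” is not hyperbole is the standard rule-of-three bound from statistics: if you observe zero events in n trials, the 95% upper confidence bound on the event rate is roughly 3/n. Here is a rough sketch applying that idea, using a round 1-in-100-million-miles figure for human-driven fatality rates purely for scale:

```python
# Rough statistical sketch using the rule of three: observing zero events
# in n miles gives a 95% upper confidence bound on the per-mile event rate
# of about 3/n. So demonstrating a rate at least as good as some target
# takes roughly 3 / target event-free miles.

human_fatality_rate_per_mile = 1.0 / 100_000_000  # round 1-per-100M-mile figure, for scale only

miles_needed = 3.0 / human_fatality_rate_per_mile
print(f"{miles_needed:,.0f} fatality-free miles")  # 300,000,000
```

And that only gets you to roughly matching the human rate; showing you are meaningfully better takes far more miles than any test fleet will accumulate before deployment.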

Even if you have a lot of miles, not all miles are created equal. Okay, so we have the billion miles -- are they in simulation around the same block? Are they all in sunny weather -- or do they include rain and hail and ice and all those other types of weather conditions you care about? Are they in a place with wide roads and no pedestrians or are they in a chaotic urban center? Oh, was that chaotic urban center at 5:00 AM? Was it during rush hour? Did you include things like construction zones or Halloween costumes or all sorts of things that don’t happen that often, but that you have to handle the right way? If somebody wants to talk about miles, they should also talk about how those miles show they’ve covered the entirety of their operational design domain.
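If a company wanted to talk about which miles rather than how many, the sketch below shows the flavor of analysis I have in mind: tally logged miles against the conditions in the declared operational design domain and flag what has not been covered. The condition names and the minimum-mile threshold are made up for illustration:

```python
from collections import Counter

# Made-up sketch: tally logged road test miles by condition and compare
# against the declared ODD to find coverage gaps. Condition names and the
# minimum-mile threshold are purely illustrative.

declared_odd = {"clear_day", "night", "rain", "construction_zone", "dense_pedestrians"}
min_miles_per_condition = 500  # hypothetical threshold

logged_miles = Counter({
    "clear_day": 95_000,
    "night": 4_000,
    "rain": 120,
    # no construction_zone or dense_pedestrians miles logged at all
})

for condition in sorted(declared_odd):
    miles = logged_miles.get(condition, 0)
    status = "ok" if miles >= min_miles_per_condition else "GAP"
    print(f"{condition:20s} {miles:>8,} miles  {status}")
```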

There’s a potential problem with using miles as a measure of progress because they can motivate the wrong behavior. If you’re judged solely on how many miles you’ve racked up, then perhaps you’re going to optimize for easy miles, but worse, every mile on public roads is a chance to make a mistake. Even if you have the best possible safety drivers, every test mile adds some risk. Hopefully that risk is no worse than the risk of human-driven vehicles, but that’s a different discussion. So every mile costs not only money, but also puts you at some risk of adverse news or some unfortunate event. So you should think carefully about racking up a lot of miles for the sake of miles; you should make every mile earn its keep. Make sure every mile of road testing is actually doing something important.
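To put a toy number on “every test mile adds some risk,” here is a quick calculation with an invented per-mile incident rate. The specific rate doesn’t matter; the shape of the curve is the point, because the chance of at least one adverse event keeps climbing as the miles pile up:

```python
# Toy calculation with an invented per-mile incident rate: the probability
# of at least one adverse event during a test campaign keeps climbing as
# miles accumulate, so miles that teach you nothing are pure downside.

p_incident_per_mile = 1e-6  # hypothetical rate, for illustration only

def prob_at_least_one_incident(miles):
    return 1.0 - (1.0 - p_incident_per_mile) ** miles

for miles in (10_000, 100_000, 1_000_000):
    print(f"{miles:>9,} miles -> {prob_at_least_one_incident(miles):.1%} chance of at least one incident")
```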

Now, you have to have miles. For sure you won’t know if your system is done until you’ve done some road testing. But the early miles don’t have to be about testing at all; they can be about data collection. You don’t need to run an autonomous system to collect sensor data. You can just collect sensor data with a human driver and have risk no different than someone just driving around normally. That means the road test miles should not be used for primary data collection, but rather as a way to confirm your design is solid. So don’t think of miles on the road as debugging problems with your system while you’re on public roads. Rather, think about road testing miles as a way of making sure you didn’t overlook anything in a system you think is just about ready to go. That means instead of going for road test miles, companies should be making most of their miles data collection miles and making sure they cover the ODD.

Then they can feed that information to a simulation, make sure the system handles all the miles properly, and then after they’re pretty sure they got it right, they should be doing road testing miles simply to confirm that all the engineering effort they put in resulted in a system that behaves the way they expected it to. In other words, road testing miles should be the boring part of tying a ribbon around and putting a bow on a great design. Road testing miles should not be the pointy end of the spear for development.

Going back up to why people talk about miles as a progress metric, sure, having no miles on public roads probably means you’re not ready to deploy because you haven’t checked to make sure it works. But having a ton of miles doesn’t mean you’re ahead, it just means you’re well-funded and you’re out there operating. If you really want to know how someone’s doing, it isn’t just the miles but rather it’s which miles, and it’s how those miles go together with all the other engineering activities to make sure that they’re taking a solid engineering approach to designing a safe self driving car.