Friday, December 4, 2020

Leading and Lagging Metrics (Metrics Episode 6)

You'll need to use leading metrics to decide when it's safe to deploy, including process quality metrics and product maturity metrics. Here are some examples of how leading and lagging metrics fit together.

Ultimately, the point of metrics is to have a measurement that tells us whether a self-driving car will be safe enough. For example, whether it will be safer than a human driver. The outcome we want is a measure of how things are going to turn out on public roads. Metrics that take direct measurements of those outcomes are called lagging metrics because they lag behind deployment: things like the number of crashes, the number of fatal crashes, and so on. To be sure, we should be tracking lagging metrics to identify problems in the fleet after we’ve deployed.

But that type of metric doesn’t really help with a decision about whether to deploy in the first place. You really want some assurance that self-driving cars will be appropriately safe before you start deploying them at scale. To predict that safety, we need leading metrics.

Leading metrics are things that predict the outcome before it actually happens. Sometimes operational leading metrics can predict other lagging metrics if appropriate ratios are known. For example, if you know the ratio between low and high severity crashes, you can monitor the number of low severity crashes and use it to get some sort of prediction of potential future high severity crashes. (I’ll note that that only works if you know the ratio. We know the ratio for human drivers, but it’s not clear that the ratio for self-driving cars will be the same.)
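The ratio-based prediction above can be sketched in a few lines. This is purely illustrative: the function name and the ratio value are made up for this example, and as noted, the human-driver ratio may not carry over to self-driving cars.

```python
# Hypothetical sketch: predicting high-severity crashes from observed
# low-severity crashes using an assumed severity ratio. The ratio is
# an assumption -- known for human drivers, unknown for self-driving cars.

def predict_high_severity(low_severity_count: int,
                          low_to_high_ratio: float) -> float:
    """Expected high-severity crashes, given that about
    low_to_high_ratio low-severity crashes occur per high-severity one."""
    return low_severity_count / low_to_high_ratio

# If we assume 100 low-severity crashes per high-severity crash,
# then observing 250 low-severity crashes predicts about 2.5
# high-severity crashes over the same exposure.
predicted = predict_high_severity(250, 100.0)
print(predicted)  # 2.5
```

The entire prediction hinges on the assumed ratio, which is exactly the caveat in the paragraph above.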

Another example is that the number of near misses or incidents might predict the number of crashes. The current most common example is a hope that performance with a human safety driver will predict performance once the safety driver is removed and the self-driving car becomes truly driverless.

Those examples show that some lagging metrics -- metrics collected after you deploy -- can be used to predict other longer term, less frequent lagging metrics if you know the right ratios. But that still doesn’t really support deployment decisions. You still need to know, before you deploy, whether the vehicles are safe enough that it’s a responsible decision to deploy. To get there, we need other leading metrics. Those are metrics that predict the future and, by their nature, are indirect or correlated measures rather than actual measures of on-road operational outcomes.

There are a large number of possible metrics, and I’ll list some different types that seem appealing. One type is conventional software quality leading metrics -- for example, topics discussed in safety standards such as ISO 26262. An example is code quality metrics: things like code complexity or static analysis defect rates. Another example is development process metrics. Common examples are things like what fraction of defects you’re finding in peer review and what your test coverage is.

A somewhat higher level metric would be the degree to which you’ve covered a safety standard. More specific to self-driving car technology, you could have a metric for what fraction of the ODD (operational design domain) you have covered with your testing, simulation, and analysis. During simulation and on-road testing, you might use operational risk metrics -- for example, the fraction of time a vehicle has an unsafe following distance under the assumption that the lead vehicle panic brakes. Maybe you’d look at the frequency at which a vehicle passes a little too close to pedestrians given the situation. You might have scenario and planning metrics that deal with whether your planner is safe across the entire ODD. Metrics there might include the degree to which your tested scenarios cover the whole ODD, or the completeness of your scenario catalog against the full span of the ODD.
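The unsafe-following-distance metric mentioned above can be sketched with a simple kinematic model. Everything here is an illustrative assumption -- the reaction time, the braking rates, and the log format are invented for this example, not taken from any real system.

```python
# Sketch of an operational risk metric: fraction of driving time spent
# with an unsafe following distance, assuming the lead vehicle might
# panic brake. Parameter values are illustrative assumptions.

def min_safe_gap(speed_mps: float, reaction_s: float = 1.5,
                 ego_brake: float = 6.0, lead_brake: float = 8.0) -> float:
    """Minimum gap (meters) so the ego vehicle can stop without hitting
    a panic-braking lead vehicle, using a simple kinematic model."""
    ego_stop = speed_mps * reaction_s + speed_mps**2 / (2 * ego_brake)
    lead_stop = speed_mps**2 / (2 * lead_brake)
    return max(ego_stop - lead_stop, 0.0)

def unsafe_time_fraction(samples):
    """samples: list of (speed_mps, gap_m) taken at a fixed sample rate."""
    unsafe = sum(1 for speed, gap in samples if gap < min_safe_gap(speed))
    return unsafe / len(samples)

# Illustrative log: three samples at 20 m/s with different gaps.
log = [(20.0, 50.0), (20.0, 15.0), (20.0, 40.0)]
print(unsafe_time_fraction(log))  # about 0.33 -- one of three samples unsafe
```

A real metric would use a far more careful model, but the shape is the same: a threshold derived from assumptions, applied over logged driving time.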

You could have perception metrics. Perception metrics would tend to deal with the accuracy of building the internal model across the entire ODD. Things like whether your perception has been tested across the full span of the ODD. Things like the completeness of the perception object catalog. Are all the objects in your ODD actually accounted for? Related metrics might include the accuracy of prediction compared to actual outcomes. There probably also should be safety case metrics. Somewhere, there’s an argument of why you believe you’re safe, and that argument is probably based on some assumptions. Some of the assumptions you want to track with metrics are whether or not your hazard list is actually complete and whether the safety case, that argument, actually spans the entire ODD or only covers some subset of the ODD.

Another important leading metric is the arrival rate of surprises or unknown unknowns. It’s hard to argue you’re safe enough to deploy if every single day in testing you’re suffering high severity problems, and the root cause diagnosis shows a gap in requirements, a gap in your design, missing tests, incorrect safety argument assumptions, and things like this. 
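Tracking the arrival rate of surprises is itself a simple metric to compute. This sketch assumes a hypothetical defect log where each record is tagged with whether its root cause revealed an unknown unknown (a requirements gap, design gap, missing test, or bad safety-case assumption); the record format is invented for illustration.

```python
# Hypothetical sketch: arrival rate of "surprises" -- defects whose root
# cause reveals a gap in requirements, design, tests, or safety-case
# assumptions. Record format is an illustrative assumption.

def surprises_per_day(defects, window_days: int) -> float:
    """defects: list of (test_day, is_surprise) records collected over
    the most recent window_days of testing."""
    count = sum(1 for day, is_surprise in defects if is_surprise)
    return count / window_days

# Five recent test days, three of which produced a surprise.
recent = [(1, True), (2, False), (3, True), (4, True), (5, False)]
print(surprises_per_day(recent, 5))  # 0.6 surprises per test day
```

A rate that stays high, or is not trending toward zero, is a strong argument that the system is not ready to deploy.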

Now those different types of metrics I listed have a couple of different flavors, and there are probably at least two major flavors that have to be treated somewhat differently.

One flavor is a progress metric: how close are you to a hundred percent of some sort of target? It’s important to mention that a hundred percent coverage doesn’t guarantee safety. For example, if you test every single line of code at least once, that doesn’t guarantee the code is perfect. But if you only test 50% of the code, the other 50% didn’t get tested at all, and clearly that’s a problem. So coverage metrics help you know that you’ve at least poked at everything, but they are not a guarantee of safety. Many of the metrics that deal with covering the entire ODD should be up in the high nineties, and arguably at a hundred percent, depending on the coverage metric. But that’s what gets you into the game. It doesn’t prove safety.
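A coverage-style progress metric is just a fraction of a target catalog that has been exercised. This minimal sketch uses invented scenario names to show the shape of such a metric.

```python
# Minimal sketch of a progress metric: fraction of an ODD scenario
# catalog exercised at least once in test. Scenario names are invented.

def coverage_fraction(catalog: set, tested: set) -> float:
    """Fraction of catalog items exercised at least once."""
    return len(catalog & tested) / len(catalog)

odd_scenarios = {"rain", "night", "unprotected_left", "school_zone"}
tested = {"rain", "night", "unprotected_left"}
print(coverage_fraction(odd_scenarios, tested))  # 0.75
```

The number 0.75 immediately tells you what it cannot tell you: nothing at all about the untested "school_zone" scenario. That is exactly the sense in which coverage gets you into the game without proving safety.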

Another flavor is a product quality metric. That isn’t coverage of testing or analysis, but rather a measure of how well your product is doing -- a measure of the maturity of your product. An example is how often you see unsafe maneuvers. Hopefully, for a stable ODD, that number goes down over time as you refine your product. The frequency of incorrect object classifications is another example. Yet another is the frequency of assumption violations in the safety case. For sure, these metrics can go up if you expand the ODD or make a change. But before you deploy, you would hope that these metrics are trending down and settling near zero by the time your product is mature enough to deploy.
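A maturity metric of this kind is typically normalized by exposure and watched as a trend. This sketch uses invented weekly numbers for a fixed ODD to show the downward trend you would want to see before deploying.

```python
# Sketch of a product maturity metric: unsafe-maneuver rate per 1000
# miles, which should trend toward zero for a stable ODD as the
# product matures. The weekly numbers are invented for illustration.

def rate_per_1000_miles(events: int, miles: float) -> float:
    return 1000.0 * events / miles

# Weekly (unsafe_maneuvers, miles_driven) logs, oldest first.
weekly = [(12, 4000.0), (7, 5000.0), (3, 6000.0), (1, 8000.0)]
rates = [rate_per_1000_miles(e, m) for e, m in weekly]
print(rates)  # [3.0, 1.4, 0.5, 0.125] -- trending down toward zero
```

The same framing works for misclassification rates or safety-case assumption violations: normalize by exposure, then watch whether the trend settles near zero for the current ODD.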

One of the big pitfalls for metrics is measuring the things that are easy to measure or that you can think of instead of things that actually predict outcomes for safety. Ultimately, there must be an argument linking the leading metrics to some expectation that they predict the lagging metrics. Often that link is indirect. For example, “Well, to be safe, you need good engineering and good code quality,” but the argument should be able to be made rather than just saying, “Well, this is easy to measure, so we’ll measure that.”

Summing up, after we deploy self-driving cars, we’ll be able to use lagging metrics to measure what the outcome was. However, to predict that self-driving cars will be appropriately safe, we’ll need a set of leading metrics that covers all the different types of considerations: the code, ODD coverage, whether your planner is robust, whether your perception is robust, and so on. We need to cover all those things in a way that the measures we’re taking are reasonably expected to predict good outcomes for the lagging metrics after we deploy. In other pieces, I’ll talk about these different types of metrics.

For the podcast version of this posting, see:

Thanks to podcast producer Jackie Erickson.
