Saturday, December 12, 2020

Surprise Metrics (Metrics Episode 12)

You can estimate how many unknown unknowns are left to deal with via a metric that measures the surprise arrival rate. Assuming you're really looking, infrequent surprises now predict that surprises will be infrequent in the near future as well.

Your first reaction to the idea of measuring unknown unknowns may be: how in the world can you do that? Well, it turns out the software engineering community has been doing this for decades: they call it software reliability growth modeling. That area’s quite complex with a lot of history, but for our purposes, I’ll boil it down to the basics.

Software reliability growth modeling deals with the problem of knowing whether your software is reliable enough, or in other words, whether or not you’ve taken out enough bugs that it’s time to ship the software. All things being equal, if the same complete system test reveals 10 times more defects in the current release than in the previous release, it’s a good bet your new release is not as reliable as your old one.

On the other hand, if you’re running a weekly test/debug cycle with a single release, so every week you test it, you remove some bugs, then you test it some more the next week, at some point you’d hope that the number of bugs found each week will be lower, and eventually you’ll stop finding bugs. When the number of bugs per week you find is low enough, maybe zero, or maybe some small number, you decide it’s time to ship. Now that doesn’t mean your software is perfect! But what it does mean is there’s no point testing anymore if you’re consistently not finding bugs. Alternately, if you have a limited testing budget, you can look at the curve over time of the number of bugs you’re discovering each week and get some sort of estimate about how many bugs you would find if you continued testing for additional cycles.

At some point, you may decide that the number of bugs you’ll find and the amount of time it will take simply isn’t worth the expense. And especially for a system that is not life critical, you may decide it’s just time to ship. A dizzying array of mathematical models has been proposed over the years for the shape of the curve of how many more bugs are left in the system based on your historical rate of how often you find bugs. Each one of those models comes with significant assumptions and limits to applicability. 
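To make the idea of projecting from a bug curve concrete, here is a minimal sketch of one very simple model. It assumes the weekly bug count shrinks by a roughly constant ratio each week (a geometric decay), which is my simplification for illustration and not any particular published reliability growth model. The function names and the weekly counts are hypothetical.

```python
import math

def fit_geometric_decay(weekly_bugs):
    """Fit count_k ~ a * r**k by least squares on log(count).

    Returns (a, r). Requires at least two weeks, all with positive counts,
    because log(0) is undefined.
    """
    if len(weekly_bugs) < 2 or any(c <= 0 for c in weekly_bugs):
        raise ValueError("need at least two weeks, each with a positive count")
    n = len(weekly_bugs)
    xs = list(range(n))
    ys = [math.log(c) for c in weekly_bugs]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), math.exp(slope)

def projected_bugs(a, r, weeks_done, extra_weeks):
    """Expected bugs found in the next `extra_weeks` weeks under the fit."""
    return sum(a * r ** (weeks_done + k) for k in range(extra_weeks))

# Made-up weekly bug counts, for illustration only.
weekly_bugs = [40, 20, 10, 5]
a, r = fit_geometric_decay(weekly_bugs)
print(f"fitted first-week count {a:.1f}, weekly ratio {r:.2f}")
print(f"projected bugs in 4 more weeks: "
      f"{projected_bugs(a, r, len(weekly_bugs), 4):.1f}")
```

If the projected number of additional bugs is small compared to the cost of the extra test cycles, that is the kind of evidence that feeds a decision to stop testing. A real project would want a model whose assumptions have been checked against its own data.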

But the point is that people have been thinking about this for more than 40 years in terms of how to project how many more bugs are left in a system even though you haven’t found them. And there’s no point trying to reinvent all those approaches yourself.

Okay, so what does this have to do with self-driving car metrics?

Well, it’s really the same problem. In software tests, the bugs are the unknowns, because if you knew where the bugs were, you’d fix them.  You’re trying to estimate how many unknowns there are or how often they’re going to arrive during a testing process. In self-driving cars, the unknown unknowns are the things you haven’t trained on or haven’t thought about in the design. You’re doing road testing, simulation and other types of validation to try and uncover these. But it ends up in the same place. You’re trying to look for latent defects or functionality gaps and you’re trying to get an idea of how many more there are left in the system that you haven’t found yet, or how many you can expect to find if you invest more resources in further testing.

For simplicity, let’s call the things in self-driving cars that you haven’t found yet surprises. 

The reason I put it this way is that there are two fundamentally different types of defects in these systems. One is you built the system the wrong way. It’s an actual software bug. You knew what you were supposed to do, and you didn’t get there. Traditional software testing and traditional software quality will help with those, but a surprise isn’t that. 

A surprise is a requirements gap or something in the environment you didn’t know was there. Or a surprise has to do with imperfect knowledge of the external world. But you can still treat surprises as a class that is similar to, although distinct from, software defects and go at it the same way. One way to look at this is that a surprise is something you didn’t realize should be in your ODD (operational design domain) and therefore is a defect in the ODD description. Or, you didn’t realize the surprise could kick your vehicle out of the ODD, which makes it a defect in the model of ODD violations that you have to detect. You’d expect that surprises that can lead to safety-critical failures are the ones that need the highest priority for remediation.

To create a metric for surprises, you need to track the number of surprises over time. You hope that over time, the arrival rate of surprises gets lower. In other words, they happen less often and that reflects that your product has gotten more mature, all things being equal. 
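As a rough sketch of what tracking this could look like, the snippet below normalizes surprise counts by exposure, so that periods with different amounts of testing can be compared. The record layout, the per-1000-hours scaling, and the numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ExposurePeriod:
    label: str       # e.g. a quarter or a software release (hypothetical)
    hours: float     # hours of testing or operation in the period
    surprises: int   # confirmed surprises found in the period

def arrival_rate(period, per_hours=1000.0):
    """Surprises per `per_hours` hours of exposure in one period."""
    if period.hours <= 0:
        raise ValueError("exposure hours must be positive")
    return period.surprises / period.hours * per_hours

def is_improving(periods):
    """True if the arrival rate never increases from one period to the next."""
    rates = [arrival_rate(p) for p in periods]
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))

# Made-up data, for illustration only.
periods = [
    ExposurePeriod("Q1", hours=2000, surprises=10),
    ExposurePeriod("Q2", hours=4000, surprises=8),
    ExposurePeriod("Q3", hours=5000, surprises=5),
]
for p in periods:
    print(f"{p.label}: {arrival_rate(p):.1f} surprises per 1000 hours")
print("improving:", is_improving(periods))
```

Dividing by hours matters: a raw count of surprises can go down simply because you tested less, which says nothing about maturity.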

If the number of surprises gets higher, that could be a sign that your system has gotten worse at dealing with unknowns, or it could be a sign that your operational domain has changed, and more novel things are happening than they used to because of some change in the outside world. That requires you to update your ODD to reflect the new real world situation. Either way, a higher arrival rate of surprises means you’re less mature or less reliable and a lower rate means you’re probably doing better.

This may sound a little bit like disengagements as a metric, but there’s a profound difference. That difference applies even if disengagements on road testing are one of the sources of data.

The idea is that how often you disengage, whether because a safety driver takes over or because the system gives up and says, “I don’t know what to do,” is a source of raw data. But disengagements can happen for many different reasons. What you really care about for surprises is only the disengagements that happened because of a defect in the ODD description or some other requirements gap.

Each incident that could be a surprise needs to be analyzed to see if it was a design defect, which isn’t really an unknown unknown. That’s just a mistake that needs to be fixed.

But some incidents will be true unknown unknown situations that require re-engineering or retraining your perception system or another remediation to handle something you didn’t realize until now was a requirement or operational condition that you need to deal with. Since even with a perfect design and perfect implementation, unknowns are going to continue to present risk, what you need to be tracking with a surprise metric is the arrival of actual surprises.
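Here is a small sketch of that triage step: every incident gets a cause assigned by a human analyst, and only the ones judged to be true surprises feed the metric. The three categories and the incident IDs are hypothetical, and a real triage process would likely need a finer-grained taxonomy.

```python
from collections import Counter
from enum import Enum

class Cause(Enum):
    DESIGN_DEFECT = "design defect"   # known requirement, built wrong
    SURPRISE = "surprise"             # ODD or requirements gap
    OTHER = "other"                   # e.g. precautionary takeover

def count_surprises(triaged_incidents):
    """Count incidents whose analyzed cause is a true surprise.

    `triaged_incidents` is an iterable of (incident_id, Cause) pairs.
    """
    counts = Counter(cause for _, cause in triaged_incidents)
    return counts[Cause.SURPRISE]

# Made-up triage results, for illustration only.
triaged = [
    ("inc-001", Cause.DESIGN_DEFECT),
    ("inc-002", Cause.SURPRISE),
    ("inc-003", Cause.OTHER),
    ("inc-004", Cause.SURPRISE),
]
print(f"{count_surprises(triaged)} of {len(triaged)} incidents were surprises")
```

The count that comes out of this step, not the raw disengagement count, is what should go into the surprise arrival rate.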

It should be obvious that you need to be looking for surprises to see them. That’s why things like monitoring near misses and investigating the occurrence of unexpected, but seemingly benign, behavior matters. Safety culture plays a role here. You have to be paying attention to surprises instead of dismissing them if they didn’t seem to do immediate harm. A deployment decision can use the surprise arrival rate metric to get an approximate answer of how much risk will be taken due to things missing from the system requirements and test plan. In other words, if you’re seeing surprises arrive every few minutes or every hour and you deploy, there’s every reason to believe that will continue to happen about that often during your initial deployment.

If you haven’t seen a surprise in thousands or hundreds of thousands of hours of testing, then you can reasonably assume that surprises are unlikely to happen every hour once you deploy. (You can always get unlucky, so this is playing the odds to be sure.)
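One way to put a number on that reasoning is a simple statistical bound. The sketch below assumes surprises arrive as a Poisson process with a constant rate, and that test conditions are representative of deployment. Both are strong assumptions that I am making for illustration. Under them, seeing zero surprises in T hours means the true rate is at most -ln(1 - confidence) / T at the chosen confidence level. At 95% confidence this is roughly 3 / T, often called the "rule of three."

```python
import math

def rate_upper_bound(hours_without_surprise, confidence=0.95):
    """Upper bound on surprises per hour after zero surprises observed.

    Assumes a constant-rate Poisson arrival process (an assumption, not
    something the data can confirm on its own).
    """
    if hours_without_surprise <= 0:
        raise ValueError("hours must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / hours_without_surprise

# Hypothetical test campaign: 100,000 surprise-free hours.
bound = rate_upper_bound(100_000)
print(f"95% upper bound: {bound:.2e} surprises per hour")
print(f"that is at most about one surprise per {1 / bound:,.0f} hours")
```

Note what the bound does not say: it does not promise that surprises will never arrive, and it only holds while the deployed environment looks like the tested one.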

To deploy, you want to see the surprise arrival rate reduced to something acceptably low. You’ll also want to know the system has a good track record so that when a surprise does happen, it’s pretty good at recognizing something has gone wrong and doing something safe in response.

To be clear, in the real world, the arrival rate of surprises will probably never be zero, but you need to measure that it’s acceptably low so you can make a responsible deployment decision.

For the podcast version of this posting, see:

Thanks to podcast producer Jackie Erickson.
