Sunday, December 13, 2020

Safety Performance Indicator (SPI) metrics (Metrics Episode 14)

SPIs help ensure that assumptions in the safety case are valid, that risks are being mitigated as effectively as you thought they would be, and that fault and failure responses are actually working the way you thought they would.

Safety Performance Indicators, or SPIs, are safety metrics defined in the Underwriters Laboratories 4600 standard. The 4600 SPI approach covers a number of different ways to measure safety for a self-driving car, divided into several categories.

One type of 4600 SPI safety metric is a system-level safety metric. Some of these are lagging metrics such as the number of collisions, injuries and fatalities. But others have some leading metric characteristics because while they’re taken during deployment, they’re intended to predict loss events. Examples of these are incidents for which no loss occurs, sometimes called near misses or near hits, and the number of traffic rule violations. While by definition, neither of these actually results in a loss, it’s a pretty good bet that if you have many, many near misses and many traffic-rule infractions, eventually something worse will happen.

Another type of 4600 metric is intended to deal with ineffective risk mitigation. An important type of SPI relates to measuring that hazards and faults are not occurring more frequently than expected in the field.

Here’s a narrow but concrete example. Let’s assume your design takes into account that you might lose one in a million network packets due to corrupted data being detected. But out in the field, you’re dropping every tenth network packet. Something’s clearly wrong, and there’s a pretty good chance that undetected errors are slipping through. You need to do something about that situation to maintain safety.
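
As a rough illustration of the idea (this sketch and its names and thresholds are hypothetical, not taken from UL 4600), an SPI of this kind boils down to comparing an observed field failure rate against the rate assumed in the safety case:

```python
# Hypothetical sketch of an SPI check: compare an observed failure rate in the
# field against the rate assumed in the safety case, with an alert margin.

def spi_rate_check(observed_events, observed_total, assumed_rate, margin=10.0):
    """Return (observed_rate, violated); violated is True when the observed
    rate exceeds the assumed rate by more than the margin factor."""
    observed_rate = observed_events / observed_total
    return observed_rate, observed_rate > assumed_rate * margin

# Safety case assumed roughly 1 corrupted packet per million; field data shows 1 in 10.
rate, violated = spi_rate_check(observed_events=1_000, observed_total=10_000,
                                assumed_rate=1e-6)
print(f"observed packet loss rate = {rate:.3f}, SPI violated = {violated}")
```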

A broader example is that a very rare hazard might be deemed not to be risky because it just essentially never happens. But just because you think it almost never happens doesn’t mean that’s what happens in the real world. You need to take data to make sure that something you thought would happen to one vehicle in the fleet every hundred years isn’t in fact happening every day to someone, because if that’s the case, you badly misestimated your risk.

Another type of SPI for field data is measuring how often components fail or behave badly. For example, you might have two redundant computers so that if one crashes, the other one will keep working. Now consider that one of those computers is failing every 10 minutes. You might drive around for an entire day and not really notice there’s a problem because there’s always a second computer there for you. But if your calculations assume a failure once a year and it’s failing every 10 minutes, you’re going to get unlucky and have both fail at the same time a lot sooner than you expected.

So it’s important to know that you have an underlying problem, even though it’s being masked by the fault tolerance strategy.
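
To make that concrete, here is a simplified back-of-the-envelope sketch (assuming independent random failures and a made-up recovery time) of how the assumed versus observed failure rates change the chance of losing both computers at once:

```python
# Rough illustration only, assuming independent random failures: if each of two
# redundant computers fails at rate lam (failures/hour) and a failed unit takes
# repair_hours to restart, the approximate rate of both being down at once is
# 2 * lam * (lam * repair_hours).

def dual_failure_rate(lam_per_hour, repair_hours):
    return 2.0 * lam_per_hour * (lam_per_hour * repair_hours)

repair = 0.05                      # assume ~3 minutes to reboot a failed unit
assumed = 1.0 / (365 * 24)         # safety case assumption: one failure per year
observed = 6.0                     # field reality: one failure every 10 minutes

print("assumed dual-failure rate/hour :", dual_failure_rate(assumed, repair))
print("observed dual-failure rate/hour:", dual_failure_rate(observed, repair))
```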

A related type of SPI has to do with classification algorithm performance for self-driving cars. When you’re doing your safety analysis, it’s likely you’re assuming certain false positive and false negative rates for your perception system. But just because you see those in testing doesn’t mean you’ll see those in the real world, especially if the operational design domain changes and new things pop up that you didn’t train on. So you need an SPI to monitor the false negative and false positive rates to make sure that they don’t change from what you expected.

Now, you might be asking, how do you figure out false negatives if you didn’t see the object in the first place? But in fact, there’s a way to approach this problem with automatic detection. Let’s say that you have three different types of sensors for redundancy and you vote the three sensors and go with the majority. Well, that means every once in a while, one of the sensors can be wrong and you still get safe behavior. But what you want to do is take a measurement of how often that one-sensor-wrong case happens, because if it happens frequently, or the faults on that sensor correlate with certain types of objects, those are important things to know to make sure your safety case is still valid.
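
A minimal sketch of that idea, with hypothetical sensor names and classifications: the voter produces the majority answer while logging which sensors were outvoted, so the disagreement rate can be tracked as an SPI.

```python
from collections import Counter

def vote_and_log(readings, disagreement_log):
    """Hypothetical 2-of-3 voter over discrete classifications. Returns the
    majority value and records which sensors disagreed with it, so the
    disagreement frequency can be tracked as an SPI. (Ties between three
    different answers are out of scope for this sketch.)"""
    counts = Counter(readings.values())
    winner, votes = counts.most_common(1)[0]
    if votes < len(readings):   # at least one sensor disagreed with the majority
        losers = [name for name, value in readings.items() if value != winner]
        disagreement_log.append(losers)
    return winner

log = []
result = vote_and_log({"camera": "pedestrian", "lidar": "pedestrian",
                       "radar": "unknown"}, log)
print(result, "minority disagreements so far:", log)
```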

A third type of 4600 metric is intended to measure how often surprises are encountered. There’s another segment on surprises, but examples are the frequency at which an object is classified with poor confidence, or a safety relevant object flickers between classifications. These give you a hint that your perception system is struggling with some type of object. If this happens constantly, that indicates a problem with the perception system, or it might indicate that the environment has changed and includes novel objects not accounted for by training data. Either way, monitoring for excessive perception issues is important to know that your perception performance is degraded, even if an underlying tracking system or other mechanism is keeping your system safe.

A fourth type of 4600 metric is related to recoveries from faults and failures. It is common to argue that safety-critical systems are in fact safe because they use fail-safes and fall-back operational modes. So if something bad happens, you argue that the system will do something safe. It’s good to have metrics that measure how often those mechanisms are in fact invoked, because if they’re invoked more often than you expected, you might be taking more risks than you thought. It’s also important to measure how often they actually work. Nothing’s going to be perfect. And if you’re assuming they work 99% of the time but they only work 90% of the time, that dramatically changes your safety calculations.

It’s useful to differentiate between two related concepts. One is safety performance indicators, SPIs, which is what I’ve been talking about. But another concept is key performance indicators, KPIs. KPIs are used in project management and are very useful to try and measure product performance and utility provided to the customer. KPIs are a great way of tracking whether you’re making progress on the intended functionality and the general product quality, but not every KPI is useful for safety. For example, a KPI for fuel economy is great stuff, but normally it doesn’t have that much to do with safety.

In contrast, an SPI is supposed to be something that’s directly traced to parts of the safety case and provides evidence for the safety case. Different types of SPIs include making sure the assumptions in the safety case are valid, that risks are being mitigated as effectively as you thought they would be, and that fault and failure responses are actually working the way you thought they would. Overall, SPIs have more to do with whether the safety case is valid and the rate of unknown surprise arrivals is tolerable. All these areas need to be addressed one way or another to deploy a safe self-driving car.

Saturday, December 12, 2020

Conformance Metrics (Metrics Episode 13)

Metrics that evaluate progress in conforming to an appropriate safety standard can help track safety during development. Beware of weak conformance claims, such as when only the hardware, but not the software, conforms to a safety standard.

Conformance metrics have to do with how extensively your system conforms to a safety standard. 

A typical software or systems safety standard has a large number of requirements to meet the standard, with each requirement often called a clause. An example of a clause might be something like "all hazards shall be identified" and another clause might be "all identified hazards shall be mitigated."  (Strictly speaking a clause is typically a numbered statement in the standard in the form of a "shall" requirement that usually has a lot more words in it than those simplified examples.)

There are often extensive tables of engineering techniques or technical mitigation measures that need to be done based on the risk presented by each hazard. For example, mitigating a low risk hazard might just need normal software quality practices, while a life critical hazard might need dozens or hundreds of very specific safety and software quality techniques to make sure the software is not going to fail in use. The higher the risk, the more table entries need to be performed in design validation and deployment.

The simplest metric related to a safety standard is a simple yes/no question: Do you actually conform to the standard?

However, there are nuances that matter. Conforming to a standard might mean a lot less than you might think for a number of reasons. So one way to measure the value of that conformance statement is to ask about the scope of the conformance and any assessment that was performed to confirm the conformance. For example, does the conformance cover just hardware components, or does it cover both hardware and software? It’s fairly common to see claims of conformance to an appropriate safety standard that only cover the hardware, and that’s a problem if a lot of the safety critical functionality is actually in the software.

If it does cover the software, what scope? Is it just the self test software that exercises the hardware (again, a common conformance claim that omits important aspects of the product)? Does it include the operating system? Does it include all the application software that’s relevant to safety? What exactly is the claim of conformance being made about? Is it just a single component within a very large system? Is it a subsystem? Is it the entire vehicle? Does it cover both the vehicle and its cloud infrastructure and the communications to the cloud? Does it cover the system used to collect training data that is assumed to be accurate to create a safety critical machine learning based system? And so on. So if you see a claim of conformance, be sure to ask exactly what the claim applies to, because it might not be everything that matters for safety.

Also, conformance can have different levels of credibility, ranging from "well, it’s in the spirit of the standard," to "we use an internal standard that we think is equivalent to this international standard," to "our engineering team decided we think we meet it," to "a team inside our company thinks we meet it, but they report to the engineering manager so there’s pressure on them to say yes," to "conformance evaluation is done by a robustly separated group inside our company," to "conformance evaluation is done via qualified external assessment with a solid track record for technical integrity."

Depending on the system, any one of these categories might be appropriate. But for life critical systems, you need as much independence and actual standards conformance as you can get. If you hear a claim of conformance it’s reasonable to ask: how do you know you conform to the extent that matters, and is the group assessing conformance independent enough and credible enough for this particular application?

Another dimension of conformance metrics is: how much of the standard is actually being conformed to? Is it only some chapters or all of the chapters? Sometimes we’re back to where only the hardware conformed, so they really only looked at one chapter of a system standard that would otherwise cover both hardware and software. Is it only the minimum basics? Some standards have a significant amount of text that some treat as optional (in the lingo: "non-normative clauses"). In some standards most of the text is not actually required to claim conformance. So did only the required text get addressed, or were the optional parts addressed as well?

Is the integrity level appropriate? It might conform to a lower ASIL than you really need for your application, but it still has the conformance stamp to the standard on it. That can be a problem if, for example, you take something assessed for noncritical functions and you want to use it in a life critical application. Is the scope of the claimed conformance appropriate? For example, you might have dozens of safety critical functions in a system, but only three or four were actually checked for conformance and the rest were not. You can say it conforms to a standard, but the problem is there are pieces that really matter that were never checked for conformance.

Has the standard been aggressively tailored in a way that weakens the value of the claimed conformance? Some standards permit skipping some clauses if they don’t matter to safety in that particular application, but with funding and deadline pressures, there might be some incentive to drop out clauses that really might matter. So it’s important to understand how tailored the standard was. Was the full standard applied, or were pieces left out that really do matter?

Now to be sure, sometimes limited conformance along any of these dimensions makes perfect sense. It’s okay to do that so long as, first of all, you don’t compromise safety, meaning you’re only leaving out things that don’t matter to safety. Second, you’re crystal clear about what you’re claiming, and you don’t ask more of the system than it can really deliver for safety.

Typically, signs of aggressive tailoring or conformance to only part of a standard are problematic for life critical systems. It’s common to see misunderstandings based on one or more of these issues. Somebody claims conformance to a standard but does not disclose the limitations, and somebody else gets confused and says, oh well, the safety box has been checked, so there’s nothing to worry about. But in fact safety is a problem because the conformance claim is much narrower than what is required for safety in that application.

During development (before the design is complete), partial conformance and measuring progress against partial conformance can actually be quite helpful. Ideally, there’s a safety case that documents the conformance plan and has a list of how you plan to conform to all the aspects of the standard you care about. Then you can measure progress against the completeness of the safety case. The progress is probably not linear, and not every clause takes the same amount of effort.  But still, just looking at what fraction of the standard you’ve achieved conformance to internally can be very helpful for managing the engineering process.
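
A sketch of what that internal progress metric might look like (the clause names and status values here are made up for illustration):

```python
# Hypothetical sketch: track conformance progress as the fraction of clauses
# in the conformance plan whose evidence is marked complete in the safety case.

clauses = {
    "5.1 hazard identification": "complete",
    "5.2 hazard mitigation":     "in progress",
    "6.3 software verification": "not started",
    "7.1 field monitoring":      "complete",
}

done = sum(1 for status in clauses.values() if status == "complete")
print(f"clause conformance progress: {done}/{len(clauses)} "
      f"({100.0 * done / len(clauses):.0f}%)")
```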

Near the end of the design validation process, you can do mock conformance checks. The metric there is the number of problems found with conformance, which basically amounts to bug reports against the safety case rather than against the software itself.

Summing up, conforming to relevant safety standards is an essential part of ensuring safety, especially in life critical products. There are a number of metrics, measures and ways to assess how well that conformance actually is going to help your safety. It’s important to make sure you’ve conformed to the right standards, you’ve conformed with the right scope and that you’ve done the right amount of tailoring so that you’re actually hitting all the things that you need to in the engineering validation and deployment process to ensure you’re appropriately safe.

Surprise Metrics (Metrics Episode 12)

You can estimate how many unknown unknowns are left to deal with via a metric that measures the surprise arrival rate.  Assuming you're really looking, infrequent surprises predict that they will be infrequent in the near future as well.

Your first reaction to thinking about measuring unknown unknowns may be how in the world can you do that? Well, it turns out the software engineering community has been doing this for decades: they call it software reliability growth modeling. That area’s quite complex with a lot of history, but for our purposes, I’ll boil it down to the basics.

Software reliability growth modeling deals with the problem of knowing whether your software is reliable enough, or in other words, whether or not you’ve taken out enough bugs that it’s time to ship the software. All things being equal, if the same complete system test reveals 10 times more defects in the current release than in the previous release, it’s a good bet your new release is not as reliable as your old one.

On the other hand, if you’re running a weekly test/debug cycle with a single release, so every week you test it, you remove some bugs, then you test it some more the next week, at some point you’d hope that the number of bugs found each week will be lower, and eventually you’ll stop finding bugs. When the number of bugs per week you find is low enough, maybe zero, or maybe some small number, you decide it’s time to ship. Now that doesn’t mean your software is perfect! But what it does mean is there’s no point testing anymore if you’re consistently not finding bugs. Alternately, if you have a limited testing budget, you can look at the curve over time of the number of bugs you’re discovering each week and get some sort of estimate about how many bugs you would find if you continued testing for additional cycles.

At some point, you may decide that the number of bugs you’ll find and the amount of time it will take simply isn’t worth the expense. And especially for a system that is not life critical, you may decide it’s just time to ship. A dizzying array of mathematical models has been proposed over the years for the shape of the curve of how many more bugs are left in the system based on your historical rate of how often you find bugs. Each one of those models comes with significant assumptions and limits to applicability. 

But the point is that people have been thinking about this for more than 40 years in terms of how to project how many more bugs are left in a system even though you haven’t found them. And there’s no point trying to reinvent all those approaches yourself.
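
Just to make the flavor of these models concrete, here is a toy sketch (not any particular published model, and with made-up weekly bug counts) that fits a simple exponential decay to the bug discovery history and projects roughly how many more bugs continued testing might find:

```python
import math

# Illustrative sketch only: fit a simple exponential decay to weekly bug counts
# (in the spirit of software reliability growth models, not a specific one) and
# project roughly how many bugs further testing might still find.

weekly_bugs = [50, 34, 21, 15, 9, 7, 4]          # made-up test/debug history
weeks = list(range(len(weekly_bugs)))

# Least-squares fit of log(bugs) = log(a) - b * week
n = len(weeks)
sx, sy = sum(weeks), sum(math.log(c) for c in weekly_bugs)
sxx = sum(w * w for w in weeks)
sxy = sum(w * math.log(c) for w, c in zip(weeks, weekly_bugs))
b = -(n * sxy - sx * sy) / (n * sxx - sx * sx)
a = math.exp((sy + b * sx) / n)

# Projected remaining bugs if testing continued indefinitely: the tail sum of
# the fitted weekly counts from next week onward.
remaining = a * math.exp(-b * n) / (1 - math.exp(-b))
print(f"fitted decay rate {b:.2f}/week, roughly {remaining:.0f} bugs still findable")
```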

Okay, so what does this have to do with self-driving car metrics?

Well, it’s really the same problem. In software tests, the bugs are the unknowns, because if you knew where the bugs were, you’d fix them.  You’re trying to estimate how many unknowns there are or how often they’re going to arrive during a testing process. In self-driving cars, the unknown unknowns are the things you haven’t trained on or haven’t thought about in the design. You’re doing road testing, simulation and other types of validation to try and uncover these. But it ends up in the same place. You’re trying to look for latent defects or functionality gaps and you’re trying to get an idea of how many more there are left in the system that you haven’t found yet, or how many you can expect to find if you invest more resources in further testing.

For simplicity, let’s call the things in self-driving cars that you haven’t found yet surprises. 

The reason I put it this way is that there are two fundamentally different types of defects in these systems. One is you built the system the wrong way. It’s an actual software bug. You knew what you were supposed to do, and you didn’t get there. Traditional software testing and traditional software quality will help with those, but a surprise isn’t that. 

A surprise is a requirements gap or something in the environment you didn’t know was there. Or a surprise has to do with imperfect knowledge of the external world. But you can still treat it as a similar, although different, class from software defects and go at it the same way. One way to look at this is that a surprise is something you didn’t realize should be in your ODD, and is therefore a defect in the ODD description. Or you didn’t realize the surprise could kick your vehicle out of the ODD, making it a defect in the model of ODD violations that you have to detect. You’d expect that surprises that can lead to safety-critical failures are the ones that need the highest priority for remediation.

To create a metric for surprises, you need to track the number of surprises over time. You hope that over time, the arrival rate of surprises gets lower. In other words, they happen less often and that reflects that your product has gotten more mature, all things being equal. 

If the number of surprises gets higher, that could be a sign that your system has gotten worse at dealing with unknowns, or could also be a sign that your operational domain has changed, and more novel things are happening than used to because of some change in the outside world. That requires you to update your ODD to reflect the new real world situation. Either way, a higher arrival rate of surprises means you’re less mature or less reliable and a lower rate means you’re probably doing better.

This may sound a little bit like disengagements as a metric, but there’s a profound difference. That difference applies even if disengagements on road testing are one of the sources of data.

The idea is that measuring how often you disengage -- a safety driver takes over, or the system gives up and says, “I don’t know what to do” -- is a source of raw data. But the disengagements could be for many different reasons. And what you really care about for surprises is only disengagements that happened because of a defect in the ODD description or some other requirements gap.

Each incident that could be a surprise needs to be analyzed to see if it was a design defect, which isn’t really an unknown unknown. That’s just a mistake that needs to be fixed.

But some incidents will be true unknown unknown situations that require re-engineering or retraining your perception system or another remediation to handle something you didn’t realize until now was a requirement or operational condition that you need to deal with. Since even with a perfect design and perfect implementation, unknowns are going to continue to present risk, what you need to be tracking with a surprise metric is the arrival of actual surprises.

It should be obvious that you need to be looking for surprises to see them. That’s why things like monitoring near misses and investigating the occurrence of unexpected, but seemingly benign, behavior matter. Safety culture plays a role here. You have to be paying attention to surprises instead of dismissing them if they didn’t seem to do immediate harm. A deployment decision can use the surprise arrival rate metric to get an approximate answer of how much risk will be taken due to things missing from the system requirements and test plan. In other words, if you’re seeing surprises arrive every few minutes or every hour and you deploy, there’s every reason to believe that will continue to happen about that often during your initial deployment.

If you haven’t seen a surprise in thousands or hundreds of thousands of hours of testing, then you can reasonably assume that surprises are unlikely to happen every hour once you deploy. (You can always get unlucky, so this is playing the odds to be sure.)
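
For a rough sense of what "haven’t seen a surprise" buys you statistically, the classic rule of three gives an approximate upper bound on an event rate when you’ve observed zero events (a general statistics rule of thumb, not something specific to self-driving cars or any standard):

```python
# Rule-of-three sketch: if zero surprises were seen in t hours of representative
# operation, an approximate 95% upper confidence bound on the surprise arrival
# rate is 3 / t. This only applies to the zero-event case.

def surprise_rate_upper_bound(hours_observed):
    """Approximate 95% upper bound on surprise arrival rate per hour after
    observing zero surprises in hours_observed of representative operation."""
    return 3.0 / hours_observed

print(surprise_rate_upper_bound(100_000))   # about 3e-05 surprises per hour, at best
```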

To deploy, you want to see the surprise arrival rate reduced to something acceptably low. You’ll also want to know the system has a good track record so that when a surprise does happen, it’s pretty good at recognizing something has gone wrong and doing something safe in response.

To be clear, in the real world, the arrival rate of surprises will probably never be zero, but you need to measure that it’s acceptably low so you can make a responsible deployment decision.

Operational Design Domain Metrics (Metrics Episode 11)

Operational Design Domain metrics (ODD metrics) deal with both how thoroughly the ODD has been validated as well as the completeness of the ODD description. How often the vehicle is forcibly ejected from its ODD also matters.

An ODD is the designer’s model of the types of things that the self-driving car is intended to deal with. The actual world, in general, is going to have things that are outside the ODD. As a simple example, the ODD might include fair weather and rain, but snow and ice might be outside the ODD because the vehicle is intended to be deployed in a place where snow is very infrequent.

Despite designers’ best efforts, it’s always possible for the ODD to be violated. For example, if the ODD is Las Vegas in the desert, the system might be designed for mostly dry weather or possibly light rain. But in fact, in Vegas, once in a while, it rains and sometimes it even snows. The day that it snows, the vehicle will be outside its ODD, even though it’s deployed in Las Vegas.

There are several types of ODD safety metrics that can be helpful. One is how well validation covers the ODD. What that means is whether the testing, analysis, simulation and other validation actually cover everything in the ODD, or have gaps in coverage.

When considering ODD coverage it’s important to realize that ODDs have many, many dimensions. There is much more to an ODD than just geo-fencing boundaries. Sure, there’s day and night, wet versus dry, and freeze versus thaw.  But you also have traffic rules, condition of road markings, the types of vehicles present, the types of pedestrians present, whether there are leaves on the trees that affect LIDAR localization, and so on.  All these things and more can affect perception, planning, and motion constraints.

While it’s true that a geo-fenced area can help limit some of the diversity in the ODD, simply specifying a geo-fence doesn’t tell you everything you need to know, nor does it mean you’ve covered all the things that are inside that geo-fenced area. Metrics for ODD validation can be based on a detailed model of what’s actually in the ODD -- basically an ODD taxonomy of all the different factors that have to be handled and how well testing, simulation, and other validation cover that taxonomy.
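
At its simplest, such a coverage metric is just the fraction of the taxonomy touched by validation evidence. Here is a minimal sketch with an invented, much-too-small taxonomy:

```python
# Hypothetical sketch: measure how well validation activities cover an ODD
# taxonomy. The taxonomy entries and validation tags below are made up.

odd_taxonomy = {"daylight", "night", "rain", "dry", "cyclists",
                "construction zones", "faded lane markings", "leaf-off trees"}

validated_conditions = {"daylight", "dry", "rain", "cyclists"}  # tags from tests/sim

covered = odd_taxonomy & validated_conditions
missing = odd_taxonomy - validated_conditions
print(f"ODD coverage: {len(covered)}/{len(odd_taxonomy)} "
      f"({100.0 * len(covered) / len(odd_taxonomy):.0f}%), missing: {sorted(missing)}")
```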

Another type of metric is how well the system detects ODD violations. At some point, a vehicle will be forcibly ejected from its ODD even though it didn’t do anything wrong, simply due to external events. For example, a freak snowstorm in the desert, a tornado, or the appearance of a completely unexpected new type of vehicle can force a vehicle out of its ODD with essentially no warning. The system has to recognize when it has exited its ODD and be safe. A metric related to this is how often ODD violations are happening during testing and on the road after deployment.

Another metric is what fraction of ODD violations are actually detected by the vehicle. This could be a crucial safety metric, because if an ODD violation occurs and the vehicle doesn’t know it, it might be operating unsafely. Now, it’s hard to build a detector for the ODD violations that the vehicle itself can’t detect (and such detection failures should be corrected when they are found). But this metric can be gathered by root cause analysis whenever there’s been some sort of system failure or incident. One of the root causes might simply be failure to detect an ODD violation.

Coverage of the ODD is important, but an equally important question is how good is the ODD description itself? If your ODD description is missing many things that happen every day in your actual operational domain (the real world), then you’re going to have some problems.

A higher-level metric to consider is ODD description quality. That is likely to be tied to other metrics already mentioned in this and other segments. Here are some examples. The frequency of ODD violations can help inform the coverage metric of the ODD against the operational domain. Frequency of motion failures could be related to motion system problems, but could also be due to missing environmental characteristics in your ODD. For example, cobblestone pavers are going to have significantly different surface dynamics than a smooth concrete surface and might come as a surprise when they are encountered.

Frequency of perception failures could be due to training issues, but could also be something missing from the ODD object taxonomy. For example, a new aggressive clothing style or new types of vehicles. The frequency of planning failures could be due to planning bugs, but could also be due to the ODD missing descriptions of informal local traffic conventions.

Frequency of prediction failures could be prediction issues, but could also be due to missing a specific class of actors. For example, groups of 10 and 20 runners in formation near a military base might present a challenge if formation runners aren't in training data. It might be okay to have an incomplete ODD so long as you can always tell when something is happening that forced you out of the ODD. But it’s important to consider that metric issues in various areas might be due to unintentionally restricted ODD versus being an actual failure of the system design itself.

Summing up, ODD metrics should address how well validation covers the whole ODD and how well the system detects ODD violations. It’s also useful to consider that a cause of poor metrics in other aspects of the design might in fact be that the ODD description is missing something important compared to what happens in the real world.

Prediction Metrics (Metrics Episode 10)

You need to drive not where the free space is, but where the free space is going to be when you get there. That means perception classification errors can affect not only the "what" but also the "future where" of an object.

Prediction metrics deal with how well a self driving car is able to take the results of perception data and predict what happens next so that it can create a safe plan. 

There are different levels of prediction sophistication required depending on operational conditions and desired own-vehicle capability. The first, simplest prediction capability is no prediction at all. If you have a low speed vehicle in an operational design domain in which everything is guaranteed to also be moving at low speeds and be relatively far away compared to the speeds, then a fast enough control loop might be able to handle things based simply on current object positions. The assumption there would be everything’s moving slowly, it’s far away, and you can stop your vehicle faster than things can get out of control.  (Note that if you move slowly but other vehicles move quickly, that violates the assumptions for this case.)

The prediction basically amounts to: nothing moves fast compared to its distance. But even here, a prediction metric can be helpful because there’s an assumption that everything is moving slowly compared to its distance away. That assumption might be violated by nearby objects moving slowly but still a little too fast because they’re so close, or by far away things moving fast, such as a high speed vehicle in an urban environment that is supposed to have a low speed limit. The frequency at which that assumption -- that things move slowly compared to how far away they are -- is violated will be an important safety metric.

For self driving cars that operate at more than a slow crawl, you’ll start to need some sort of prediction based on likely object movement. You often hear: "drive to where the free space is," with the free space being the open road space that’s safe for a car to maneuver in.

But that doesn’t actually work once you’re doing more than about walking speed, because it isn’t where the free space is now that matters. What you need to do is to drive to where the free space is going to be when you get there. Doing that requires prediction because many of the things on the road move over time, changing where the free space is one second from now, versus five seconds from now, versus 10 seconds from now.

A starting point for prediction is assuming that everything maintains the same speed and direction it currently has, and updating the speeds and directions periodically as you run your control loop. Doing this requires tracking so that you know not only where something is, but also what its direction and speed are. That means that with this type of prediction, metrics having to do with tracking accuracy become important, including distance, direction of travel and speed.

For safety it isn’t perfect position accuracy on an absolute coordinate frame that matters, but rather whether tracking is accurate enough to know if there’s a potential collision situation or other danger. It’s likely that better accuracy is required for things that are close and things that are moving quickly toward you and in general things that pose collision threats. 

For more sophisticated self driving cars, you’ll need to predict something more sophisticated than just tracking data. That’s because other vehicles, people, animals and so on will change direction or even change their mind about where they’re going or what they’re doing. 

From a physics point of view, one way to look at this is in terms of derivatives. The simplest prediction is the current position. A slightly more sophisticated prediction has to do with the first derivative: speed and direction. An even more sophisticated prediction would be to use the second derivative: acceleration and curvature. You can even use the third derivative: jerk or change in acceleration. To the degree you can predict these things, you’ll be able to have a better understanding of where the free space will be when you get there.

From an every day point of view, the way to look at it is that real things don’t stand still -- they move. But when they’re moving, they change direction, they change speed, and sometimes they completely change what they’re trying to do, maybe doubling back on themselves. 

An example of a critical scenario is a pedestrian standing on a curb waiting for a crossing light. Human drivers use the person’s body language to tell that the pedestrian is at risk of stepping off the curb even though they’re not supposed to be crossing. While that’s not perfect, most drivers will have stories of the time they didn’t hit someone because they noticed the person was distracted by looking at their cell phone or the person looked like they were about to jump into the road and so on. If you only look at speed and possibly acceleration, you won’t handle cases in which a human driver would say, “That looks dangerous. I’m going to slow down to give myself more reaction time in case behavior changes suddenly.”

It isn’t just the current trajectory that matters for a pedestrian. It’s what the pedestrian’s about to do, which might be a dramatic change from standing still to running across the street to catch a bus.

The same would hold true for a human driver of another vehicle for whom you have some telltale available that suggests they’re about to swerve or turn in front of you. For even more sophisticated predictions, you probably don’t end up with a single prediction, but rather with a probability cloud of possible positions and directions of travel over time, where keeping on the same path might be the most probable outcome. But maximum command authority maneuvers -- a hard right turn, left turn, acceleration, or deceleration -- might all be possible with lower, but not zero, probability. Given how complicated prediction can be, metrics might have to be more complicated than simply "did you guess exactly right?"  There’s always going to be some margin of error in any prediction, but you need to predict in a way that results in acceptable safety even in the face of surprises.

One way to handle the prediction is to take a snapshot of the current position and the predicted movement. Wait a few control loop cycles, some fraction of a second or a second. Then check to see how it turned out. In other words, you can just wait a little while, see how well your prediction turned out and keep score as to how good your prediction is. In terms of metrics, you need some sort of bounds on the worst case error of prediction. Every time that bound is violated, it is potentially a safety-related event and should be counted in a metric. Those bounds might be probabilistic in nature, but at some point there has to be a bound as to what is acceptable prediction error and what’s not.
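
A minimal sketch of that after-the-fact scoring (the positions, the error bound, and the flat 2D distance measure are all illustrative assumptions):

```python
import math

# Illustrative sketch: score a prediction after the fact. Take the position
# predicted a moment ago, compare it to where the object actually ended up,
# and count a metric event whenever the error exceeds an assumed bound.

def prediction_error_event(predicted_xy, actual_xy, bound_m=0.5):
    err = math.dist(predicted_xy, actual_xy)
    return err, err > bound_m      # True means a potential safety-related event

err, violated = prediction_error_event(predicted_xy=(12.0, 3.0),
                                       actual_xy=(12.4, 4.1))
print(f"prediction error {err:.2f} m, bound violated: {violated}")
```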

To the degree that prediction is based on object type, for example, you’re likely to assume a pedestrian typically cannot go as fast as a bicycle, but that a pedestrian can jump backwards and pivot turn. You might want to know if the type-specific prediction behavior is violated. For example, a pedestrian suddenly going from a stop to 20 miles per hour crossing right in front of your car might be a competitive sprinter that’s decided to run across the road, but more likely signals that electric rental scooters have arrived in your town and you need to include them in your operational design domain.

Prediction metrics might be related to the metrics for correct object classification if the prediction is based on the class of the object. 

Summing up, sophisticated prediction of behavior might be needed for highly permissive operation in complex dense environments. If you’re in a narrow city street with pedestrians close by and other things going on, you’re going to need really good prediction. Metrics for this topic should focus not only on motion measurement accuracy and position accuracy, but also on the ability to successfully predict what happens next, even if a particular object performs a sudden change in direction, speed, and so on. In the end, your metric should help you understand the likelihood that you’ll correctly interpret where the free space is going to be so that your path planner can plan a safe path.

Sunday, December 6, 2020

Perception Metrics (Metrics Episode 9)

Don’t forget that there will always be something in the world you’ve never seen before and have never trained on, but your self driving car is going to have to deal with it. A particular area of concern is correlated failures across sensing modes.

Perception safety metrics deal with how a self driving car takes sensor inputs and maps them into a real-time model of the world around it. 

Perception metrics should deal with a number of areas. One area is sensor performance. This is not about absolute performance, but rather performance with respect to safety requirements. Can a sensor see far enough ahead to give accurate perception in time for the planner to react? Does the accuracy remain sufficient given changes in environmental and operational conditions? Note that for the needs of the planner, further isn’t better without limit. At some point, you can see far enough ahead that you’ve reached the planning horizon, and sensor performance beyond that might help with ride comfort or efficiency but is not necessarily directly related to safety.

Another type of metric deals with sensor fusion. At a high level, success with sensor fusion is whether that fusion strategy can actually detect the types of things you need to see in the environment. But even if it seems like sensor fusion is seeing everything it needs to, there are some underlying safety issues to consider. 

One is measuring correlated failures. Suppose your sensor fusion algorithm assumes that multiple sensors have independent failures. So you’ve done some math and said, well, the chance of all the sensors failing at the same time is low enough to tolerate. That analysis assumes there’s some independence across the sensor failures.

For example, if you have three sensors and you’re assuming that they fail independently, knowing that two of those sensors failed at the same time on the same thing is really important because it provides counter-evidence to your independence assumption. But you need to be looking for this specifically, because your vehicle may have performed just fine thanks to the third sensor still being correct. So the important thing here is that the metric is not about whether your sensor fusion worked, but rather whether the independence assumption behind your analysis was valid or invalid.
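
One simple way to check the independence assumption (the counts below are invented) is to compare how often two sensors miss the same object against what independence would predict:

```python
# Illustrative sketch: compare the observed rate of two sensors failing on the
# same frame against the rate predicted by an independence assumption.
# All numbers are made up.

frames = 1_000_000
camera_misses = 2_000          # per-sensor missed detections observed
lidar_misses  = 1_500
joint_misses  = 600            # frames where both missed the same object

p_cam, p_lidar = camera_misses / frames, lidar_misses / frames
expected_joint = p_cam * p_lidar * frames     # what independence would predict
print(f"expected joint misses if independent: {expected_joint:.1f}, "
      f"observed: {joint_misses}")
# Observing far more joint misses than predicted is evidence that the
# independence assumption in the safety analysis does not hold.
```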

Another metric to consider related to the area of sensor fusion is whether or not detection ride-through based on tracking is covering up problems. It’s easy enough to rationalize that if you see something nine frames out of 10, then missing one frame isn’t a big deal because you can track through the dropout. If missed detections are infrequent and random, that might be a valid assumption. But it’s also possible you have clusters of missed detections based on some types of environments or some types of objects related to certain types of sensors, even if overall they are a small fraction. Keeping track of how often and for how long ride-through is actually required to cover missing detections is important to validate the underlying assumption of random dropouts rather than clustered or correlated dropouts.

A third type of metric is classification accuracy. It’s common to track false negatives, which are how often you miss something that matters. For example, if you miss a pedestrian, it’s hard to avoid hitting something you don’t see. But you should track false negatives not just based on the sensor fusion output, but also per sensor and per combinations of sensors. This goes back to making sure there aren’t systematic faults that undermine the independence of failure assumptions. 

There are also false positives, which are how often you see something that isn’t really there. For example, a pattern of cracks in the pavement might look like an obstacle and could cause a panic stop. Again, sensor fusion might be masking a lot of false positives. But you need to know whether your independence assumption for how the sensors fail as a system is valid.

Somewhere in between are misclassifications. For example, saying something is a bicycle versus a wheelchair versus a pedestrian is likely to matter for prediction, even though all three of those things are objects that shouldn’t be hit.

Just touching on the independence point one more time, all these metrics -- false negatives, false positives, and misclassifications -- should be tracked per sensor modality. That’s because if sensor fusion saves you, say because vision misclassifies something but another sensor modality still gets it right, you can’t count on that always working. You want to make sure that each of your sensor modalities works as well as it can without systematic defects, because maybe next time you won’t get lucky, and the sensor fusion algorithm will suffer a correlated fault that leads to a problem.

In all the different aspects of perception, edge cases matter. There are going to be things you haven’t seen before and you can’t train on something you’ve never seen. 

So how well does your sensing system generalize? There are very likely to be systematic biases in training and validation data that never occurred to anyone to notice. An example we’ve seen is that if you take data in cool weather, nobody’s wearing shorts outdoors in the Northeast US. Therefore, the system learns implicitly that tan or brown things sticking out of the ground with green blobs on top are bushes or trees.  But in the summer that might be someone in shorts wearing a green shirt.

You also have to think about unusual presentations of known objects. For example, a person carrying a bicycle is different than a bicycle carrying a person. Or maybe someone’s fallen down into the roadway. Or maybe you see very strange vehicle configurations or weird paint jobs on vehicles.

The thing to look for in all these is clusters or correlations in perception failures -- things that don’t support a random independent failure assumption between modes. Because those are the places where you’re going to have trouble with sensor fusion sorting out the mess and compensating for failures.

A big challenge in perception is that the world is an essentially infinite supply of edge cases. It’s advisable to have a robust taxonomy of objects you expect to see in your operational design domain, especially to the degree that prediction, which we’ll discuss later on, requires accurate classification of objects or maybe even object subtypes.

While it’s useful to have a metric that deals with coverage of the taxonomy in training and testing, it’s just as important to have a metric for how well the taxonomy actually represents the operational design domain. Along those lines, a metric that might be interesting is how often you encounter something that’s not in the taxonomy, because if that’s happening every minute or every hour, that tells you your taxonomy probably needs more maturity before you deploy.

Because the world is open-ended, a metric is also useful for how often your perception is saying: "I’m not sure what that is." Now, it’s okay to handle "I’m not sure" by doing a safety shutdown or doing something safe. But knowing how often your perception is confused or has a hole is an important way to measure your perception maturity.

Summing up, perception metrics, as we’ve discussed them, cover a broad swath from sensors through sensor fusion to object classification. In practice, these might be split out to different types of metrics, but they have to be covered somewhere. And during this discussion we’ve seen that they do interact a bit.

The most important outcome of these metrics is to get a feel for how well the system is able to build a model of the outside world, given that sensors are imperfect, operational conditions can compromise sensor capabilities, and the real world can present objects and environmental conditions that both have never been seen, and worse, might cause correlated sensor failures that compromise the ability of sensor fusion to actually come up with an accurate classification of specific types of objects. Don’t forget, there will always be something in the world you’ve never seen before and have never trained on, but your self driving car is going to have to deal with it.

Planning Metrics (Metrics Episode 8)

Planning metrics should cover whether the planned paths are actually safe. But just as important is whether plans work across the full ODD and account for safety when pushed out of the ODD by external events.

Planning metrics deal with how effectively a vehicle can plan a path through the environment, obstacles, and other actors. Often, planning metrics are tied to the concept of having various scenarios and actors that a vehicle might encounter when dealing with the various behaviors, maneuvers, and other considerations. In another segment, I’ll talk about how the system builds a model of the external world. For now, let’s assume that the self driving car knows exactly where all the objects are and what their predicted trajectories and behaviors are. The objective is typically to make progress in navigating through the scenario without hitting things.

Some path planning metrics are likely to be tied closely to the motion safety metrics. A self driving car that creates a path plan that involves a collision with a pedestrian clearly has an issue, but in practice, things come in shades of gray. Typically, it’s not okay to just barely miss something. Rather, you want to leave some sort of sufficient time, space, or combination buffer around objects and obstacles to provide a safety margin. You need to do better than just not hitting things and in fact, you want to give everything else in the environment an appropriate amount of leeway. From a planning point of view, this metric would cover how often and how severely object buffers are violated. This ties in with motion planning metrics, but instead of just asking, “What’s the worst case to avoid a collision?” you have to add in some sort of buffer as well.

For safety, it’s important to differentiate between safety boundaries and continuous performance metrics. Here’s an example. Let’s say you have a one-meter hard threshold from bicycles for a certain urban setting at a certain travel speed. Let’s say your vehicle leaves 1.1 meters to bicyclists. Well, that’s great, a little bit further than one meter sounds safe. Is two meters better? Well, all things being equal, leaving a bicycle a little more room is probably also a good idea. But on that metric, it’s tempting to say that 0.9 meters is only about 10% worse than one meter, when in fact it’s not. With a one-meter hard threshold, safe is one meter or better; as long as you leave at least one meter, you’re by definition safe. 0.99 meters is unsafe because you’ve violated a hard threshold.

There’s a potential big difference between safety thresholds and general performance indications. For general background risk, sure, leaving a little more room is a good thing. But as soon as you pass a hard safety threshold, that changes it from a little bit worse to a safety violation that requires some sort of response to fix the system’s behavior.

Now, for those who are thinking, “Well, it doesn’t always have to be one meter,” that’s right. What I assumed in this example was that for the particular circumstances, it was determined one meter was the hard deck; you couldn’t go any closer. It might well be the case that at slower speeds it’s closer and at higher speeds it’s further away. The point here is that in some cases you will have hard cutoffs of safety that are never supposed to be violated. Those are fundamentally different than continuous metrics where a little bit further or a little bit closer is a little bit better or a little bit worse.
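
A tiny sketch of that distinction, using the one-meter example from above as an assumed hard threshold (not a real-world rule):

```python
# Illustrative sketch of the distinction: a hard safety threshold is pass/fail,
# while extra clearance beyond it is only a continuous "nice to have" score.
# The one-meter value is the example threshold from the text, not a real rule.

HARD_MIN_CLEARANCE_M = 1.0

def clearance_metrics(clearance_m):
    violation = clearance_m < HARD_MIN_CLEARANCE_M          # binary safety event
    margin = max(0.0, clearance_m - HARD_MIN_CLEARANCE_M)   # continuous indicator
    return violation, margin

for c in (2.0, 1.1, 0.99):
    violation, margin = clearance_metrics(c)
    print(f"{c:4.2f} m -> safety violation: {violation}, extra margin: {margin:.2f} m")
```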

Other path planning metrics are likely to be based more on coverage. Some self driving car projects use the concept of a scenario, which is a specific combination of environment, objects, and own vehicle intended behavior over a relatively short time period. For example, a fairly generic scenario might be making an unprotected left turn in rush hour traffic on a sunny day. The idea is that you come up with a large set of scenarios to cover all the possibilities in your operational design domain, or ODD. If you do that and you can validate that each scenario has been built properly, then you can claim you’ve covered the whole ODD. In practice, development teams tend to build scenario catalogs with varying levels of abstraction from high level, to parameterized, to very concrete single settings of parameter scenarios that can be fed into a simulator or executed on a track. Then they test the concrete examples to see if they violate motion safety metrics or other bad things happen.

There are several different ways to look at coverage of a catalog of scenarios. One is how well the high level and parameterized scenarios cover the ODD. In principle, you would like the scenario catalog to cover all possibilities across all dimensions of the ODD. Now, ODDs are pretty complicated, which we’ll discuss another time, so that includes at least weather and types of actors and road geometries, but there are probably many other considerations. It’s going to be a big catalog, thousands, tens of thousands, maybe more scenarios to really cover a complicated ODD.

A different take on the same topic is how well the concrete scenarios, those are your test cases, actually sample the ODD. Sure, you have these high level and parameterized scenarios that are supposed to cover everything, but at some point you have to actually run tests on a specific set of geometries, and behaviors, and actors. If you don’t sample that properly, there’ll be corners of the ODD that you didn’t exercise. There may be edge cases where there’s some boundary between two things and you didn’t test at the boundary, even though in principle your more generic scenarios sweep across the entire ODD. You want to make sure when you’re sampling the ODD via these concrete scenarios that you cover both frequent scenarios, which is probably pretty obvious, as well as infrequent but very severe, very high consequence scenarios that have to be gotten right even though they may not happen often.

For scenarios in which there’s a specific response intended, another metric can be how well the system follows its script. For example, if you’ve designed a self driving car to always leave two meters of clearance to a bicycle, even though one meter is the hard deck for safety, and it’s consistently leaving 1.5 meters, that’s not an immediate safety violation, but there’s something wrong because it was supposed to do two meters and it’s consistently doing 1.5. Something isn’t quite right. The issue is that might be indicative of a deeper problem that at some other time could impact safety.

Another metric has to do with how well the system deals with scenarios getting more complex: more objects, unpredictable actors, incorrect sensor information, and no-win situations in which the severity of an unavoidable crash has to be minimized, and so on. In general, one of the metrics is how well the system reacts to stress.

Another metric that can be useful is the brittleness of the system when encountering novel concrete examples that aren’t used in system training or might be outside the ODD. Remember that even though the system is designed to operate inside the ODD, in the real world, every once in a while, something weird will happen that is outside the ODD. The system may not have to remain operable, but it should remain safe even if that means invoking some sort of safety shutdown procedure. That means it has to know that something weird has happened, even if that something is not part of its designed ODD.

Summing up, planning metrics should include at least two categories. First is whether the planned paths are actually safe. Second is how well the path planner design covers the full scope of the intended operational design domain, as well as what happens when you exit the ODD. All this depends on the system having an accurate model of the world it’s operating in, which we’ll cover in another segment.

Friday, December 4, 2020

Motion Metrics (Metrics Episode 7)

Approaches to safety metrics for motion ultimately boil down to a combination of Newton’s laws. Implementation in the real world also requires the ability to understand, predict, and measure both the actions of others and the environmental conditions that you’re in.

The general idea of a metric for motion safety is to determine how well a self-driving car is doing at not hitting things. One of the older metrics is called "time-to-collision." In its simplest form, this is how long it will take for two vehicles to collide if nothing changes. For example, if one car is following another and the trailing car is going faster than the leading car, eventually, if nothing changes, they’ll hit. How long that will take depends on the relative closing speed and gives you the time-to-collision. The general idea is that the shorter the time, the higher the risk, because there’s less time for a human driver to react and intervene.

There are more complicated formulations of this concept that, for example, take into account acceleration and braking. But for all the time-to-collision variants, the basic idea is: how much reaction time is available to avoid a collision? This metric was originally developed for traffic engineering, and was used to predict some aspects of road safety for human drivers. Time-to-collision gets pretty complicated in a hurry because, in the real world, vehicles accelerate and decelerate, they change lanes, they encounter crossing traffic and so on. So for safe highly automated vehicles we need something more sophisticated than a simple, “Here’s your reaction time for one vehicle following another.”
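
For reference, the simplest constant-speed car-following version of time-to-collision is just the gap divided by the closing speed. A minimal sketch (with made-up numbers):

```python
# Minimal time-to-collision sketch for the simple car-following case described
# above: constant speeds, trailing vehicle closing on the lead vehicle.

def time_to_collision(gap_m, trailing_speed_mps, lead_speed_mps):
    closing = trailing_speed_mps - lead_speed_mps
    if closing <= 0:
        return float("inf")      # not closing, so no collision if nothing changes
    return gap_m / closing

print(time_to_collision(gap_m=30.0, trailing_speed_mps=20.0, lead_speed_mps=15.0))
# 6.0 seconds of available reaction time if nothing changes
```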

There are a number of proposals that address this such as Mobileye RSS, NVIDIA Force Fields, and the NHTSA Instantaneous Safety Metric. These ideas and more are at play in the IEEE Draft Standard Working Group P2846. But rather than go into the details of each of these individual approaches, let’s just sketch out some of the high-level characteristics, ideas, and issues.

The first big idea is that this is just physics. No matter how you package it, Newton’s Laws Of Motion come into play. The rest of it is just about how to encode those laws, reason about them, and apply them to everyday traffic. Some of these approaches attempt to provide mathematically proven guarantees of no collisions. And that can work, provided the assumptions behind the guarantees are correct.

While all this seems pretty straightforward in principle, when you try and apply it to real cars in the real world, things get a little bit more complicated. One of the topics that you need to consider is the various geometries and situations. Sure, one car following another in the same lane of traffic is a good starting point. But you have to think about cross traffic at intersections, merging traffic, changing lanes, non-90 degree intersections, and the list goes on. That means you need various different cases to work the math on.

You also have to consider the worst-case actions of other objects. For example, some cars can brake very quickly, some only slowly, some can turn quickly, and some slowly. Characteristics tend to be different for different classes of objects such as trucks, cars, bicycles, and pedestrians. In general, one way or another, you have to consider all the possibilities of one of these objects exercising its maximum turning authority, maximum braking authority, maximum acceleration authority, or some combination, in all types of different physical positions so you can avoid a collision.
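
To give a flavor of what that worst-case reasoning looks like, here is a simplified sketch of a worst-case longitudinal following distance in the spirit of approaches like RSS. It is not the exact published formula, and the reaction time and acceleration values are invented placeholders:

```python
# Simplified worst-case following-distance sketch (RSS-flavored, not the exact
# published formula): assume the lead vehicle brakes as hard as possible while
# the following vehicle first accelerates for a reaction time, then brakes at
# its own guaranteed minimum rate. All parameter values are placeholders.

def safe_following_distance(v_follow, v_lead, reaction_s=0.5,
                            a_accel_max=2.0, a_brake_min=4.0, a_brake_max=8.0):
    v_after_reaction = v_follow + reaction_s * a_accel_max
    d_follow = (v_follow * reaction_s
                + 0.5 * a_accel_max * reaction_s ** 2
                + v_after_reaction ** 2 / (2.0 * a_brake_min))
    d_lead = v_lead ** 2 / (2.0 * a_brake_max)
    return max(0.0, d_follow - d_lead)

print(f"{safe_following_distance(v_follow=20.0, v_lead=20.0):.1f} m minimum gap")
```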

Some situations are notoriously difficult to handle. One example is the so-called cut-out maneuver. That’s where you’re following another car, and that car changes lanes, revealing right in front of you a boulder sitting in the road, a slow-moving vehicle, or some other surprise. If you look at the math, that’s worse than the car in front of you hitting the brakes, because the boulder is already at zero speed. Another difficult case is oncoming traffic: you have to worry about whether an oncoming vehicle will swerve into your lane going the wrong way. There’s not a lot of time and not a lot of room to react. Now, in an ideal world, none of these things would happen because everything would be well behaved and boulders wouldn’t appear in roadways. But the real world is a messy place, and so these types of things have to be considered.

Another thing to consider is environmental factors such as the coefficient of friction of the road surface, whether you’re on hills, and whether you’re on banked turns. The ability of other vehicles to maneuver, and your own ability to maneuver, are limited by how much friction you can generate against the road surface.

You also need to consider operational edge cases. For example, if you’re following a truck up a hill, you might do Newtonian physics math that says as long as you’re going slower than the truck, everything’s fine. But that may not work if the truck ahead of you hits ice and actually slides backwards down the hill.  (I've seen that happen -- up close and personal.) It’s actually possible to hit something from behind, even if you’re at a stop, because it’s going backwards into you. Now you can say, “Okay, that’s a special case out of scope.” But those cases will happen, and you need to do the analysis to decide what’s in scope and what’s out of scope for any assurances you’re making.

You’ll also need the ability to predict the capabilities of other road users and decide what assumptions you’re going to make. For example, you might assume that a sprinter is not going to suddenly cross a four lane road in the middle of the block in front of you at 25 miles an hour. That would be faster than you might want to assume a typical human can cross the street. A related assumption is that the car in front of you cannot brake at more than 1g, one times the acceleration of gravity, because if it can, your brakes may not be good enough to stop in time.

Now this one’s kind of interesting, because it’s about other vehicles, and you can’t control what they do. You might say, “Well, I’ll assume no one brakes at more than 1g,” or you might actually create regulations saying that when cars are being followed, they’re not allowed to brake at more than 1g. I’m not saying that’s a regulation that should be passed, but what I am saying is that limitations on vehicle motion might play a role in cooperatively ensuring that vehicles can move safely.

Some of the efforts in this area try to prove that you’re unconditionally safe. But you also need to consider what to do when it is not possible to guarantee you’re safe. A simple example: you’re following another vehicle in the same lane with just enough following distance so that if the one in front panic brakes you can stop in time, and someone else cuts in front of you. Well, you don’t have enough following distance anymore. You can say that’s not your fault, but you shouldn’t simply give up and say there is no point in trying to minimize the risk of a crash.

The question is, how do you behave when you’ve been placed in a situation that is provably unsafe in the worst case? To address that, you probably need not only rules for ensuring you’re perfectly safe given your assumptions, but also rules for reasonable behavior that restores safety or minimizes risk when you’re put in an unsafe situation and have to operate there for a while.

A bigger related issue is that in an extremely dense environment, you simply may not be able to unconditionally guarantee safety. If you’re driving in a dense urban area and there are a bunch of pedestrians standing on the curb ready to cross the street, but you have the green light and you’re going through the intersection, it simply may not be possible to prove that if one of the pedestrians jumps off the curb, you won’t hit them. Hopefully the higher levels of autonomy are doing things like assessing the risk that that will happen. But from a pure physics point of view, at some point it’s not possible to mathematically prove you’ll always be able to stop fast enough to avoid a collision, no matter what, if you actually want to navigate in a dense situation.

This brings us to the idea of the trade-off between permissiveness and safety. Permissiveness is how much freedom of movement you have. You often balance permissiveness against the amount of safety or amount of risk you want to take. Sure, you can be perfectly safe by leaving the car in the garage and never taking it out. But once you go out on the road, there’s always some non-zero risk something bad will happen. The question is: what’s the appropriate trade off in terms of the physics, and how much slack you leave to minimize the risk of collision?

Along these lines, you may come up with trade-offs. I’ll give some hypothetical examples, which might or might not be the right thing to do. For example, you might say, “It’s okay if the car in front of me brakes at 1g. But because that almost never happens, I can be a little bit closer, so long as by the time that car stops I would hit it at less than one mile an hour of relative closing speed.” In other words, a fender bender or a light tap on the bumper might be deemed acceptable if you think it will almost never happen. You might say, “I can get increased road throughput by having the cars a little bit closer together, and really not worry about the low-velocity impacts that are extremely unlikely to result in injuries or even serious property damage.”
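
Here’s a toy sketch of how that kind of trade-off might be checked: simulate the worst-case encounter and ask whether the closing speed at contact stays under a policy threshold such as one mile per hour. The braking rates, reaction time, and threshold are all made-up numbers, not a recommendation.

```python
# Toy check of an "acceptable low-speed tap" policy: compute worst-case
# closing speed at contact and compare it against a policy threshold.
# All parameter values are illustrative assumptions.

def impact_closing_speed(gap_m: float, v_follow: float, v_lead: float,
                         reaction_s: float = 0.5,
                         follow_brake: float = 6.0,   # m/s^2 follower decel
                         lead_brake: float = 9.8,     # m/s^2 lead panic decel
                         dt: float = 0.001) -> float:
    """Closing speed (m/s) at the moment of contact, or 0.0 if no contact.

    Simple forward simulation: the lead panic-brakes immediately; the
    follower coasts for reaction_s seconds, then brakes.
    """
    t = 0.0
    while gap_m > 0.0:
        if v_follow <= 0.0 and v_lead <= 0.0:
            return 0.0                      # both stopped without touching
        v_lead = max(v_lead - lead_brake * dt, 0.0)
        if t >= reaction_s:
            v_follow = max(v_follow - follow_brake * dt, 0.0)
        gap_m -= (v_follow - v_lead) * dt   # gap shrinks at the closing speed
        t += dt
    return max(v_follow - v_lead, 0.0)


if __name__ == "__main__":
    ONE_MPH = 0.447  # m/s, hypothetical policy threshold
    for gap in (30.0, 40.0, 44.0, 45.0):
        speed = impact_closing_speed(gap_m=gap, v_follow=30.0, v_lead=30.0)
        print(f"gap {gap:5.1f} m -> impact speed {speed:5.2f} m/s, "
              f"acceptable={speed <= ONE_MPH}")
```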

Now, whether you go down this path is a public policy decision, and I’m not saying that this one example is what you want to do. But the idea is that the world isn’t a perfect place and there is a nonzero chance of crashes. So you should think about which types of loss events are generally acceptable as long as they’re infrequent, and which types of loss events you want to absolutely guarantee never happen, to the degree that’s at all possible.

Wrapping up, approaches to safety metrics for motion ultimately boil down to a combination of Newton’s laws. Implementation in the real world also requires the ability to understand, predict, and measure both the actions of others and the environmental conditions that you’re in. At some point, both you and other actors will have limits on the ability to accelerate, decelerate, and make high speed turns. Those limits can be used in your favor to plan in a way that minimizes or avoids the risk of collision, but you have to know what they are. Given that, your own planned actions also come into play, and link with planning metrics and scenario coverage metrics, which we’ll talk about another time.

Anytime you’re considering safety on public roads, there will be pressure to increase permissiveness, which might justifiably come at the expense of slight amounts of theoretical safety capability. The question is how to make that trade-off, and where to responsibly place the line to make sure you’re as safe as you need to be while still actually being allowed to move around on the roads. For this trade-off, Newton’s laws provide the framework, but public policy provides the acceptable trade-off points.

Leading and Lagging Metrics (Metrics Episode 6)

You'll need to use leading metrics to decide when it's safe to deploy, including process quality metrics and product maturity metrics. Here are some examples of how leading and lagging metrics fit together.

Ultimately, the point of metrics is to have a measurement that tells us if a self-driving car will be safe enough. For example, whether it will be safer than a human driver. The outcome we want is a measure of how things are going to turn out on public roads. Metrics that take direct measurements of the outcomes are called lagging metrics because they lag after the deployment. That’s things like number of crashes that will happen, number of fatal crashes, and so on. To be sure, we should be tracking lagging metrics to identify problems in the fleet after we’ve deployed. 

But that type of metric doesn’t really help with a decision about whether to deploy in the first place. You really want some assurance that self-driving cars will be appropriately safe before you start deploying them at scale. To predict that safety, we need leading metrics.

Leading metrics are things that predict the outcome before it actually happens. Sometimes operational leading metrics can predict other lagging metrics if appropriate ratios are known. For example, if you know the ratio between low and high severity crashes, you can monitor the number of low severity crashes and use it to get some sort of prediction of potential future high severity crashes. (I’ll note that that only works if you know the ratio. We know the ratio for human drivers, but it’s not clear that the ratio for self-driving cars will be the same.)
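
The arithmetic behind that ratio idea is trivial, but it’s worth seeing how completely the answer depends on the assumed ratio. A toy example with an entirely made-up ratio:

```python
# Toy arithmetic for the ratio idea above. The ratio is a made-up placeholder;
# as noted, the human-driver ratio may not transfer to self-driving cars.

LOW_TO_HIGH_SEVERITY_RATIO = 100.0   # hypothetical: 100 minor crashes per severe crash

observed_low_severity_per_million_miles = 0.5

predicted_high_severity_per_million_miles = (
    observed_low_severity_per_million_miles / LOW_TO_HIGH_SEVERITY_RATIO
)
# 0.005 per million miles, i.e. roughly one severe crash per 200 million miles --
# but only if the assumed ratio actually holds for this system.
print(predicted_high_severity_per_million_miles)
```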

Another example is that the number of near misses or incidents might predict the number of crashes. The current most common example is a hope that performance with a human safety driver will predict performance once the safety driver is removed and the self-driving car becomes truly driverless.

Those examples show that some lagging metrics -- metrics collected after you deploy -- can be used to predict other longer term, less frequent lagging metrics if you know the right ratios. But that still doesn’t help with really supporting deployment decisions. You still need to know, before you deploy, whether the vehicles are safe enough that it’s a responsible decision to deploy. To get there, we need other leading metrics. Those are metrics that predict the future and, by their nature, are indirect or correlated measures rather than actual measures of on-the-road operation outcomes.

There are a large number of possible metrics, and I’ll list some different types that seem appealing. One type is conventional software quality leading metrics -- for example, topics discussed in safety standards such as ISO 26262. An example is code quality metrics: things like code complexity or static analysis defect rates. Another example is development process metrics. Common examples are things like what fraction of defects you are finding in peer review and what your test coverage is.

A somewhat higher level metric would be the degree to which you’ve covered a safety standard. More specific to self-driving car technology, you could have a metric for what fraction of the ODD -- that’s the operational design domain -- you have covered with your testing, simulation, and analysis. During simulation and on-road testing, you might use operational risk metrics. For example, the fraction of time a vehicle has an unsafe following distance under the assumption the lead vehicle panic brakes. Maybe you’d look at the frequency at which a vehicle passes a little too close to pedestrians given the situation. You might have scenario and planning metrics that deal with whether your planner is safe across the entire ODD. Metrics there might include the degree to which you’ve tested scenarios that cover the whole ODD, or the completeness of your catalog of scenarios against the full span of the ODD.
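
As a sketch of what one of those operational risk metrics might look like, here’s a toy calculation of the fraction of driving samples spent with an unsafe following gap under a lead-vehicle panic-brake assumption. The log format, gap formula, and parameter values are all illustrative assumptions.

```python
# Toy operational risk metric: fraction of logged samples with an unsafe
# following gap, assuming the lead vehicle might panic brake at any moment.

from dataclasses import dataclass
from typing import Iterable


@dataclass
class FollowSample:
    gap_m: float          # measured gap to lead vehicle
    v_follow_mps: float
    v_lead_mps: float


def required_gap(v_follow: float, v_lead: float,
                 reaction_s: float = 0.5,
                 follow_brake: float = 6.0,
                 lead_brake: float = 9.8) -> float:
    """Worst-case gap needed if the lead panic brakes (simple Newtonian form)."""
    follower = v_follow * reaction_s + v_follow ** 2 / (2.0 * follow_brake)
    lead = v_lead ** 2 / (2.0 * lead_brake)
    return max(follower - lead, 0.0)


def unsafe_gap_fraction(samples: Iterable[FollowSample]) -> float:
    """SPI-style leading metric: share of samples with an unsafe gap."""
    samples = list(samples)
    if not samples:
        return 0.0
    unsafe = sum(1 for s in samples
                 if s.gap_m < required_gap(s.v_follow_mps, s.v_lead_mps))
    return unsafe / len(samples)


if __name__ == "__main__":
    log = [FollowSample(60.0, 30.0, 30.0), FollowSample(20.0, 30.0, 28.0)]
    print(f"unsafe-gap fraction: {unsafe_gap_fraction(log):.2f}")
```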

You could have perception metrics. Perception metrics would tend to deal with the accuracy of building the internal model across the entire ODD. Things like whether your perception has been tested across the full span of the ODD. Things like the completeness of the perception object catalog. Are all the objects in your ODD actually accounted for? Related metrics might include the accuracy of prediction compared to actual outcomes. There probably also should be safety case metrics. Somewhere, there’s an argument of why you believe you’re safe, and that argument is probably based on some assumptions. Some of the assumptions you want to track with metrics are whether or not your hazard list is actually complete and whether the safety case, that argument, actually spans the entire ODD or only covers some subset of the ODD.

Another important leading metric is the arrival rate of surprises or unknown unknowns. It’s hard to argue you’re safe enough to deploy if every single day in testing you’re suffering high severity problems, and the root cause diagnosis shows a gap in requirements, a gap in your design, missing tests, incorrect safety argument assumptions, and things like this. 

Now those different types of metrics I listed have a couple of different flavors, and there are probably at least two major flavors that have to be treated somewhat differently.

One flavor is a progress metric: how close are you to a hundred percent of some sort of target? Now, it’s important to mention that a hundred percent coverage doesn’t guarantee safety. For example, if you’re testing code and you test every single line of code at least once, that doesn’t guarantee the code is perfect. But if you only test 50% of the code, the other 50% didn’t get tested at all, and clearly that’s a problem. So coverage metrics help you know that you’ve at least poked at everything, but they are not a guarantee of safety. A lot of these metrics that ask whether you covered the entire ODD should be up in the high nineties, and arguably at a hundred percent, depending on what the coverage metric is. But that’s what gets you into the game. It doesn’t prove safety.
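
A progress metric can be as simple as set arithmetic over whatever ODD elements you’ve enumerated. Here’s a minimal sketch with made-up element names; the hard part, of course, is enumerating the elements in the first place.

```python
# Minimal sketch of a progress-style coverage metric: what fraction of the
# enumerated ODD elements have at least one test, simulation run, or analysis
# behind them. Element names are purely illustrative.

odd_elements = {"daylight", "night", "rain", "snow", "4-way stop",
                "roundabout", "unprotected left", "school zone"}

covered = {"daylight", "night", "rain", "4-way stop", "unprotected left"}

coverage = len(covered & odd_elements) / len(odd_elements)
print(f"ODD coverage: {coverage:.1%}")   # reveals what hasn't been poked at yet
```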

Another flavor is a product quality metric. That isn’t coverage of testing or coverage of analysis, but rather a measure of how well your product is doing -- the things that measure the maturity of your product. An example is how often you see unsafe maneuvers. Hopefully, for a stable ODD, that number goes down over time as you refine your product. Frequency of incorrect object classification is another example. Yet another is frequency of assumption violations in the safety case. For sure, these metrics can go up if you expand the ODD or make a change. But before you deploy, you would hope that these metrics are going down and settling to somewhere near zero by the time your product is mature enough to deploy.
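
Here’s a toy sketch of a product maturity check along those lines: is the unsafe-maneuver rate settling toward some small value across recent releases? The numbers, window, and threshold are made up for illustration.

```python
# Toy product-maturity trend check: the unsafe-maneuver rate per 1,000 miles
# should be settling toward zero across releases. Numbers are made up.

unsafe_maneuvers_per_1k_miles = [4.2, 2.9, 1.7, 0.8, 0.3, 0.1]  # by release


def is_settling(series, window=3, threshold=0.2):
    """True if the metric is non-increasing over the last `window` releases
    and the latest value is at or below `threshold`."""
    recent = series[-window:]
    monotone = all(a >= b for a, b in zip(recent, recent[1:]))
    return monotone and recent[-1] <= threshold


print(is_settling(unsafe_maneuvers_per_1k_miles))  # True for this toy data
```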

One of the big pitfalls for metrics is measuring the things that are easy to measure, or that you happened to think of, instead of things that actually predict outcomes for safety. Ultimately, there must be an argument linking the leading metrics to some expectation that they predict the lagging metrics. Often that link is indirect -- for example, “Well, to be safe, you need good engineering and good code quality.” But you should be able to actually make that argument, rather than just saying, “Well, this is easy to measure, so we’ll measure that.”

Summing up, after we deploy self-driving cars, we’ll be able to use lagging metrics to measure what the outcome was. However, to predict that self-driving cars will be appropriately safe, we’ll need a set of leading metrics that covers all the different types of considerations, including the code, and the ODD coverage, and whether your planner is robust, and whether your perception is robust, and so on. We need to cover all those things in a way that the measures we’re taking are reasonably expected to predict good outcomes for the lagging metrics after we deploy. In other pieces, I’ll talk about these different types of metrics.

Coverage Driven Metrics (Metrics Episode 5)

Coverage based metrics need to account for both the planning and the perception edge cases, possibly with two separate metrics.

It takes way too many road miles to be able to establish whether a self driving car is safe by brute force. Billions of miles of on-road testing are just not going to happen. 

Sometimes people say, “Well, that’s okay. We can do those billion miles in simulation.” While simulation surely can be helpful, there are two potential issues with this. The first is that simulation has to be shown to predict outcomes on real roads. That’s a topic for a different day, but the simple version is you have to make sure that what the simulator says actually predicts what will happen on the road.

The second problem, which is what I’d like to talk about this time, is that you need to know what to feed the simulation. 

Consider that if you hypothetically drove a billion miles on the real road, you’re actually doing two things at the same time. The first thing is you’re testing to see how often the system fails. But the second thing, a little more subtle, is you’re exposing the self driving car test platform to a billion miles of situations and weird things. That means the safety claim you’d be making based on that hypothetical exercise is that your car is safe because it did a billion miles safely. 

But you’re tangling up two things with that testing. One is whether the system performs, and the other is what the system has been exposed to. If you do a billion miles of simulation, then sure, you’re exposing the system to a billion miles of whether it does the right thing. But what you might be missing is that billion miles of weird stuff that happens in the real world.

Think about it. Simulating going around the same block a billion times with the same weather and the same objects doesn’t really prove very much at all. So, you really need a billion miles worth of exposure to the real world in representative conditions that span everything you would actually see if you were driving on the road. In other words, the edge cases are what matter. 

To make this more concrete, there is a story about a self driving car test platform that went to Australia. The first time they encountered kangaroos there was a big problem, because their distance estimation assumed that an animal’s feet were on the ground, and that’s not how kangaroos work. Even if they had simulated a billion miles, if they didn’t have kangaroos in their simulator, they would have never seen that problem coming. But it’s not just kangaroos. There are lots of things that happen every day but are not necessarily included in the self driving car test simulator, and that’s the issue.

A commonly discussed way to get out of the “let’s do a billion miles” game is an alternative approach of identifying and taking care of the edge cases one at a time. This is the approach favored by the community that uses a Safety Of The Intended Function (SOTIF) methodology, for example as described in the standard ISO 21448. The idea is to go out, find edge cases, figure out how to mitigate any risk they present, and continue until you’ve found enough of the edge cases that you think it’s okay to deploy. The good part of this approach is that it changes the metrics conversation from lots and lots of miles to what percentage of the edge cases you’ve covered. If you think of a notional zoo of all the possible edge cases, well, once you’ve covered them all, then you should be good to go.

This works up to a point. The problem is you don’t actually know what all the edge cases are. You don’t know which edge cases happen only once in a while and never showed up during your testing. This coverage approach works great for things where 90% or 99% is fine.

If there’s a driver in charge of a car and you’re designing a system that helps the driver recover after the driver has made a mistake, and you only do that 90% of the time (just to pick a number), that’s still a win. Nine times out of 10 you help the driver. As long as you’re not causing an accident on the 10th time, it’s all good. But for a self driving car, you’re not helping a driver. You’re actually in charge of getting everything done, so 90% isn’t nearly good enough. You need 99.99...lots of nines. Then you have a problem: if you’re missing even a few things from the edge case zoo that will happen in the real world, you could have a loss event when you hit one of them.

That means the SOTIF approach is great when you know or can easily discover the edge cases. But it has a problem with unknown unknowns -- things you didn’t even know you didn’t know because you didn’t see them during testing. 

It’s important to realize there are actually two flavors of edge cases. Most of the discussion happens around scenario planning. Things like geometry: an unprotected left turn; somebody turning in front of you; a pedestrian at a crosswalk. Those sorts of planning things are one class of edge cases.

But there’s a completely different class of edge case, which is object classification. "What’s that thing that’s yellow and blobby? I don’t know what that is. Is that a person or is that a tarp that’s gotten loose and blowing in the wind? I don’t know." Being able to handle the edge cases for geometry is important. Being able to handle the perception edge cases is also important, but it’s quite different.

If you’re doing coverage based metrics, then your metrics need to account for both the planning and the perception edge cases, possibly with two separate metrics.

Okay, so the SOTIF coverage approach can certainly help, but it has a limit: you don’t know all the edge cases. Why is that? Well, the explanation is the 90/10 rule. The 90/10 rule in this case is that 90% of the time you have a problem, it’s caused by the 10% of edge cases that are very common and happen every day. The stuff that happens very rarely -- say, once every 10 million miles -- makes up 90% of the edge cases, but you only see it 10% of the time because each of those cases happens so rarely.

The issue is there’s an essentially infinite number of edge cases such that each one happens very rarely, but in aggregate, they happen often enough to be a problem. This is due to the heavy tail nature of edge cases and generally weird things in the world. The practical implication is you can look as hard as you want for as long as you want, but you’ll never find all the edge cases. And yet they may be arriving so often that you can’t guarantee an appropriate level of safety, even though you fixed every single one you found. That's because it might take too long to find enough to get acceptable safety if you emphasize only fixing things you've seen in data collection.
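
To see why fixing everything you’ve found still might not be enough, here’s a small toy Python simulation with assumed numbers: edge case encounters drawn from a Zipf-like heavy-tailed distribution. Even after a lot of exposure, previously unseen cases keep arriving.

```python
# Toy heavy-tail illustration: draw edge case encounters from a Zipf-like
# distribution and watch how many distinct case types have been seen versus
# miles of exposure. The zoo size and encounter rate are made-up numbers.

import random

random.seed(0)
NUM_CASE_TYPES = 100_000        # assumed size of the notional edge-case zoo
weights = [1.0 / rank for rank in range(1, NUM_CASE_TYPES + 1)]  # Zipf-ish

seen = set()
for million_miles in range(1, 11):
    # Assume roughly 1,000 edge-case encounters per million miles (made up).
    encounters = random.choices(range(NUM_CASE_TYPES), weights=weights, k=1000)
    new = [c for c in encounters if c not in seen]
    seen.update(encounters)
    print(f"after {million_miles}M miles: {len(seen)} distinct cases seen, "
          f"{len(new)} new this million miles")
```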

Going back to closing the loop with simulation, what this means is that if you want to simulate a billion miles worth of operation to prove you’re a billion miles worth of safe, you need a billion miles worth of actual real world data to know that you’ve seen enough of the rare edge cases that, statistically, it probably works out. And we’re back to the point that a billion miles of data on the exact sensor suite you’re going to deploy is not such a simple thing. What might help is ways to sift through data and identify generic versions of the various edge cases so you can put them in a simulation. Even then, if the rare edge cases for the second billion miles are substantially different, it still might not be enough (the heavy tail issue).

The takeaway from all this is that doing simulation and analysis to make sure you’ve covered all the edge cases you know about is crucial to being able to build a self driving car, but it’s not quite enough. What you want is a metric that gives you the coverage of the perception edge cases and the coverage of the scenario and planning edge cases. When you’ve covered everything you know about, that’s great, but it’s not the only thing you need to think about when deciding if you’re safe enough to deploy.

If you have those coverage metrics, one way you can measure progress is by looking at how often surprises happen. How often do you discover a new edge case for perception? How often do you discover a new edge case for planning?
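
One simple way to track that is a surprise arrival rate metric, such as miles driven per newly discovered edge case over a rolling window. Here’s a minimal sketch with hypothetical numbers; a rising value signals diminishing returns, not an empty edge case zoo.

```python
# Minimal sketch of a surprise-arrival-rate metric: miles driven per newly
# discovered edge case over a rolling window. Names and numbers are made up.

def miles_per_new_edge_case(log, window=5):
    """log: list of (miles_driven_in_period, new_edge_cases_found_in_period).
    Returns miles per new edge case over the last `window` periods,
    or float('inf') if none were found in the window."""
    recent = log[-window:]
    miles = sum(m for m, _ in recent)
    surprises = sum(n for _, n in recent)
    return miles / surprises if surprises else float("inf")


# Hypothetical weekly reports: (miles driven, new edge cases discovered)
weekly = [(10_000, 12), (12_000, 9), (15_000, 5), (20_000, 3), (25_000, 1),
          (30_000, 1), (32_000, 0), (35_000, 0), (40_000, 1), (42_000, 0)]

print(f"{miles_per_new_edge_case(weekly):,.0f} miles per new edge case")
```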

When you get to the point that new edge cases are arriving very infrequently, or maybe you’ve gone a million miles and haven’t seen one, there’s probably not a lot of utility in accumulating more miles and more data, because you’re getting diminishing returns. The important thing is that this does not mean you’ve got them all. It means you’ve covered all the ones you know about, and it’s becoming too expensive to discover new edge cases. When you hit that point, you need another plan to assure safety beyond just coverage of edge cases via a SOTIF approach.

Summing up, metrics that have to do with perception and planning edge cases are an important piece of self driving car safety, but you need to do something beyond that to handle the unknown unknowns.

Using a Driving Test as a Safety Metric (Metrics Episode 4)

The part of the driver test that is problematic for self-driving cars is deciding that the system has maturity of judgment. How do you check that it is old enough to drive?

At some point, companies are done testing and they need to make a decision about whether it’s okay to let their vehicles actually be completely self-driving, driverless cars. The important change here is that there is no longer a human responsible for continuously monitoring operational safety. That changes the complexion of safety, because now there’s no human driver to rely upon to take care of anything weird that might go wrong. That means you’ll need different metrics for safety when you deploy compared to those used during road testing.

One way people talk about knowing that it’s time to deploy is to give the car a road test, just like a human gets a driver’s test. Everyone wants something quick, cheap, and painless to decide whether their self-driving car is ready to go, and a driver test has some intuitive appeal. Indeed, if a car can’t pass a driver test, then truly that’s a problem. But is passing such a test enough? 

Let’s go down the list of things that happen in a driver test.

There’s a written test. Well, surely a self-driving car needs to know the rules of the road, just like a person. But for a self-driving car it’s a little more complicated, because it’s not just the rules of the road, but also what to do about conflicting rules or incomplete rules, or how you handle justifiable rule breaking.

For example, if there’s a big truck broken down in the travel lane on a two lane road, do you wait there until the truck is towed away, several hours perhaps? Or do you go around it if there’s no traffic? Well, going over the double yellow line is clearly breaking the usual rules, but in everyday driving people do that sort of thing all the time.

Another thing you need is a vision test. Surely self-driving cars need to be able to see things on the road. For a person it’s typically just whether or not they have the right glasses on. But for a self-driving car it’s more complicated, because it isn’t just seeing that something’s in the road, but also figuring out what’s on the road and what might happen next. It isn’t just about reading road signs.

The next thing, and the classic thing people have in mind, is a road skills test. Surely a self driving car needs to be able to maneuver the vehicle in traffic and account for all the things that happen. But again, knowing that a self-driving car is ready takes more than what you see on a typical road test. Sure, a typical road test covers things like parallel parking and using turn signals. But that’s the easy stuff. Did your driver test cover spin-outs? Did it cover handling a blown-out tire with an actual blown-out tire? Did you have to deal with loss of brakes? Did you have another car run a red light during the driver test to see how you’d respond? (OK, well, that last one actually did happen to me by chance on my own driver test. But I digress.)

Even if you were to address all those types of things, there’s another important piece of a human driver test that people don’t think of as a test. 

That’s actually proving you’re a human. You do that by showing your birth certificate. Oh, look: I’m a 16 year old human, and while I may not be the most mature person in the world, society has determined I’m good enough to be able to handle a driver’s license. Now, what comes with that? It isn’t just that you’re 16. It’s that being 16, or whatever the driving age is where you live, is a proxy for things like being able to handle ambiguity and reason about consequences well enough. It’s a proxy for knowing that a situation is unpredictable and becoming more cautious, because you’re not sure what happens next. It’s a proxy for understanding consequences and personal responsibility. Humans are moderately good at knowing when they’re not sure what’s going on, but things like machine learning are notoriously bad at knowing that they’re out of their element.

It’s also a proxy for handling unexpected events. While humans aren’t perfect, they do remarkably well when completely unexpected, unstructured events happen. They usually figure out something to do that’s reasonable. 

In general, being a 16 year old human is a proxy for some level of maturity and experience at handling the unexpected. What we do is we use a driver test to say: okay, this person has basic skills, and because they’re a reasonably mature human, they can probably handle all the weird stuff that happens and they’ll figure it out as they go. 

A big problem is it’s unclear how you figure out that a self-driving car has the level of judgment maturity of at least a 16 year old human. We’re not sure how to do that.

What we have known for decades is that you can’t prove a software based system is safe via testing alone. There are just too many complex situations and too many edge cases to be handled. No set of tests can cover everything. It’s just not possible. So a driver test alone is never going to be enough to prove that a self-driving car is safe. Sure, elements of a driving test are useful. You absolutely want the self-driving car to know the rules of the road, to be able to look down the right of way and see what’s there, and to be able to maneuver in traffic. But that is a minimal and insufficient set of things to prove it’s safe enough to responsibly deploy.

The point is you need more than a test. You need good engineering rigor. You need to know the car was designed properly, just like every other safety critical computer based system. You don’t fly airplanes around to see if they fall out of the sky and, if they don’t, declare them safe. In fact, you do a lot of good engineering to make sure the system is designed to do the right thing, and the testing is just there to make sure that the engineering process did what you thought it did.

So, for self-driving cars, sure, a road test is helpful, but you’re also going to need to know that a good engineering process was executed. You need to know that the system is going to handle all the things that have to go right, as well as all the things that can possibly go wrong, and that goes far beyond any reasonable driver test.