Sunday, December 13, 2020

Safety Performance Indicator (SPI) metrics (Metrics Episode 14)

SPIs help ensure that assumptions in the safety case are valid, that risks are being mitigated as effectively as you thought they would be, and that fault and failure responses are actually working the way you thought they would.

Safety Performance Indicators, or SPIs, are safety metrics defined in the Underwriters Laboratories 4600 standard. The 4600 SPI approach covers a number of different ways to approach safety metrics for a self-driving car, divided into several categories.

One type of 4600 SPI safety metric is a system-level safety metric. Some of these are lagging metrics such as the number of collisions, injuries and fatalities. But others have some leading metric characteristics because while they’re taken during deployment, they’re intended to predict loss events. Examples of these are incidents for which no loss occurs, sometimes called near misses or near hits, and the number of traffic rule violations. While by definition, neither of these actually results in a loss, it’s a pretty good bet that if you have many, many near misses and many traffic-rule infractions, eventually something worse will happen.

Another type of 4600 metric is intended to deal with ineffective risk mitigation. An important type of SPI relates to measuring that hazards and faults are not occurring more frequently than expected in the field.

Here’s a narrow but concrete example. Let’s assume your design takes into account that you might lose one in a million network packets due to corrupted data being detected. But out in the field, you’re dropping every tenth network packet. Something’s clearly wrong, and it’s a pretty good chance that undetected errors are slipping through. You need to do something about that situation to maintain safety.
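As a sketch of what such an SPI check might look like in code (the one-in-a-million assumption and the 10x tolerance margin below are illustrative values chosen for this example, not numbers from the 4600 standard):

```python
def packet_loss_spi(dropped: int, total: int,
                    assumed_rate: float = 1e-6,
                    margin: float = 10.0) -> bool:
    """Return True if the observed packet drop rate is within the
    safety case assumption (times a tolerance margin), False if the
    SPI is violated. Both parameters are hypothetical examples.
    """
    if total == 0:
        return True  # no data yet; nothing to flag
    observed = dropped / total
    return observed <= assumed_rate * margin

# Design assumption: roughly 1 in a million packets dropped.
assert packet_loss_spi(dropped=2, total=10_000_000)      # within assumption
assert not packet_loss_spi(dropped=1_000, total=10_000)  # 1 in 10: SPI violated
```

A real monitor would aggregate over time windows and trigger an engineering investigation rather than just returning a flag, but the core idea is comparing the field rate to the safety case assumption.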

A broader example is that a very rare hazard might be deemed not to be risky because it just essentially never happens. But just because you think it almost never happens doesn’t mean that’s what happens in the real world. You need to take data to make sure that something you thought would happen to one vehicle in the fleet every hundred years isn’t in fact happening every day to someone, because if that’s the case, you badly misestimated your risk.

Another type of SPI for field data is measuring how often components fail or behave badly. For example, you might have two redundant computers so that if one crashes, the other one will keep working. Now suppose one of those computers is failing every 10 minutes. You might drive around for an entire day and not really notice there’s a problem because there’s always a second computer there for you. But if your calculations assume a failure once a year and it’s failing every 10 minutes, you’re going to get unlucky and have both fail at the same time a lot sooner than you expected.

So it’s important to know that you have an underlying problem, even though it’s being masked by the fault tolerance strategy.
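A minimal sketch of such a masked-failure SPI might look like the following, where the once-a-year assumption and the 10x alarm margin are hypothetical safety case values made up for this example:

```python
from dataclasses import dataclass

@dataclass
class RedundancySPI:
    """Track failovers masked by a redundant pair of computers and
    compare the observed rate to a (hypothetical) safety case
    assumption of one failure per year."""
    assumed_failures_per_year: float = 1.0
    failovers: int = 0
    hours_observed: float = 0.0

    def record(self, hours: float, failovers: int) -> None:
        self.hours_observed += hours
        self.failovers += failovers

    def violated(self, margin: float = 10.0) -> bool:
        if self.hours_observed == 0:
            return False
        observed_per_year = self.failovers / self.hours_observed * 8760
        return observed_per_year > self.assumed_failures_per_year * margin

spi = RedundancySPI()
spi.record(hours=24, failovers=144)  # failing roughly every 10 minutes
assert spi.violated()                # vastly exceeds the once-a-year assumption
```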

A related type of SPI has to do with classification algorithm performance for self-driving cars. When you’re doing your safety analysis, it’s likely you’re assuming certain false positive and false negative rates for your perception system. But just because you see those in testing doesn’t mean you’ll see those in the real world, especially if the operational design domain changes and new things pop up that you didn’t train on. So you need a SPI to monitor the false negative and false positive rates to make sure that they don’t change from what you expected.

Now, you might be asking, how do you figure out false negatives if you didn’t see it? But in fact, there’s a way to approach this problem with automatic detection. Let’s say that you have three different types of sensors for redundancy and you vote three sensors and go with the majority. Well, that means every once in a while, one of the sensors can be wrong and you still get safe behavior. But what you want to do is take a measurement of how often the one wrong happens, because if it happens frequently, or the faults on that sensor correlate with certain types of objects, those are important things to know to make sure your safety case is still valid.
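The voting idea can be sketched as a simple 2-of-3 majority vote that also logs every outvoted sensor, so the single-fault rate can be monitored as an SPI. This is a toy illustration, not a production voter design:

```python
from collections import Counter

def vote_with_spi(readings, spi_log: Counter):
    """2-of-3 majority vote over classification labels. Any outvoted
    sensor is logged so the rate of masked single-sensor faults can
    be tracked as an SPI."""
    assert len(readings) == 3
    counts = Counter(readings)
    label, votes = counts.most_common(1)[0]
    if votes < 2:
        spi_log["no_majority"] += 1
        return None  # caller must treat this as a detection failure
    for i, reading in enumerate(readings):
        if reading != label:
            spi_log[f"sensor_{i}_outvoted"] += 1
    return label

log = Counter()
assert vote_with_spi(["car", "car", "truck"], log) == "car"
assert log["sensor_2_outvoted"] == 1   # the masked fault is still recorded
assert vote_with_spi(["car", "bike", "truck"], log) is None
```

Over time, the log reveals whether one sensor is wrong far more often than expected, or whether its faults correlate with particular object types.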

A third type of 4600 metric is intended to measure how often surprises are encountered. There’s another segment on surprises, but examples are the frequency at which an object is classified with poor confidence, or a safety relevant object flickers between classifications. These give you a hint that something is wrong with your perception system, and that it’s struggling with some type of object. If this happens constantly, then that indicates a problem with the perception system. It might indicate that the environment has changed and includes novel objects not accounted for by training data. Either way, monitoring for excessive perception issues is important to know that your perception performance is degraded, even if an underlying tracking system or other mechanism is keeping your system safe.

A fourth type of 4600 metric is related to recoveries from faults and failures. It is common to argue that safety-critical systems are in fact safe because they use fail-safes and fall-back operational modes. So if something bad happens, you argue that the system will do something safe. It’s good to have metrics that measure how often those mechanisms are in fact invoked, because if they’re invoked more often than you expected, you might be taking more risks than you thought. It’s also important to measure how often they actually work. Nothing’s going to be perfect. And if you’re assuming they work 99% of the time but they only work 90% of the time, that dramatically changes your safety calculations.
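A sketch of a fall-back SPI that tracks both how often the mechanism is invoked and how often it succeeds might look like this (the 99% assumed success rate is a hypothetical safety case figure used for illustration):

```python
class FallbackSPI:
    """Count fall-back invocations and successes to check whether the
    safety case assumption about fall-back effectiveness holds."""
    def __init__(self, assumed_success_rate: float = 0.99):
        self.assumed_success_rate = assumed_success_rate
        self.invocations = 0
        self.successes = 0

    def record(self, succeeded: bool) -> None:
        self.invocations += 1
        self.successes += succeeded

    def success_rate_ok(self) -> bool:
        if self.invocations == 0:
            return True  # no evidence either way yet
        return self.successes / self.invocations >= self.assumed_success_rate

spi = FallbackSPI()
for outcome in [True] * 90 + [False] * 10:  # fall-back works only 90% of the time
    spi.record(outcome)
assert not spi.success_rate_ok()            # 0.90 < assumed 0.99: safety case at risk
```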

It’s useful to differentiate between two related concepts. One is safety performance indicators, SPIs, which is what I’ve been talking about. But another concept is key performance indicators, KPIs. KPIs are used in project management and are very useful to try and measure product performance and utility provided to the customer. KPIs are a great way of tracking whether you’re making progress on the intended functionality and the general product quality, but not every KPI is useful for safety. For example, a KPI for fuel economy is great stuff, but normally it doesn’t have that much to do with safety.

In contrast, an SPI is supposed to be something that’s directly traced to parts of the safety case and provides evidence for the safety case. Different types of SPIs include making sure the assumptions in the safety case are valid, that risks are being mitigated as effectively as you thought they would be, and that fault and failure responses are actually working the way you thought they would. Overall, SPIs have more to do with whether the safety case is valid and the rate of unknown surprise arrivals is tolerable. All these areas need to be addressed one way or another to deploy a safe self-driving car.

Saturday, December 12, 2020

Conformance Metrics (Metrics Episode 13)

Metrics that evaluate progress in conforming to an appropriate safety standard can help track safety during development. Beware of weak conformance claims, such as when only hardware, but not software, conforms to a safety standard.

Conformance metrics have to do with how extensively your system conforms to a safety standard. 

A typical software or systems safety standard has a large number of requirements to meet the standard, with each requirement often called a clause. An example of a clause might be something like "all hazards shall be identified" and another clause might be "all identified hazards shall be mitigated." (Strictly speaking, a clause is typically a numbered statement in the standard in the form of a "shall" requirement that usually has a lot more words in it than those simplified examples.)

There are often extensive tables of engineering techniques or technical mitigation measures that need to be done based on the risk presented by each hazard. For example, mitigating a low risk hazard might just need normal software quality practices, while a life critical hazard might need dozens or hundreds of very specific safety and software quality techniques to make sure the software is not going to fail in use. The higher the risk, the more table entries need to be performed in design validation and deployment.

The simplest metric related to a safety standard is a simple yes/no question: Do you actually conform to the standard?

However, there are nuances that matter. Conforming to a standard might mean a lot less than you might think for a number of reasons. So one way to measure the value of a conformance statement is to ask about the scope of the conformance and any assessment that was performed to confirm it. For example, does the conformance cover just hardware components, or both hardware and software? It’s fairly common to see claims of conformance to an appropriate safety standard that only cover the hardware, and that’s a problem if a lot of the safety critical functionality is actually in the software.

If it does cover the software, what scope? Is it just the self test software that exercises the hardware (again, a common conformance claim that omits important aspects of the product)? Does it include the operating system? Does it include all the application software that’s relevant to safety? What is the claim of conformance actually made on? Is it just a single component within a very large system? Is it a subsystem? Is it the entire vehicle? Does it cover both the vehicle and its cloud infrastructure and the communications to the cloud? Does it cover the system used to collect training data that is assumed to be accurate to create a safety critical machine learning based system? And so on. So if you see a claim of conformance, be sure to ask what exactly the claim applies to, because it might not be everything that matters for safety.

Also, conformance can have different levels of credibility, ranging from "well, it’s in the spirit of the standard," to "we use an internal standard that we think is equivalent to this international standard," to "our engineering team decided we think we meet it," to "a team inside our company thinks we meet it, but they report to the engineering manager, so there’s pressure on them to say yes," to "conformance evaluation is done by a robustly separated group inside our company," to "conformance evaluation is done via qualified external assessment with a solid track record for technical integrity."

Depending on the system, any one of these categories might be appropriate. But for life critical systems, you need as much independence and actual standards conformance as you can get. If you hear a claim of conformance it’s reasonable to ask: how do you know you conform to the extent that matters, and is the group assessing conformance independent enough and credible enough for this particular application?

Another dimension of conformance metrics is: how much of the standard is actually being conformed to? Is it only some chapters or all of the chapters? Sometimes we’re back to where only the hardware conformed, so they really only looked at one chapter of a system standard that would otherwise cover hardware and software. Is it only the minimum basics? Some standards have a significant amount of text that some treat as optional (in the lingo: "non-normative clauses"). In some standards most of the text is not actually required to claim conformance. So did only the required text get addressed, or were the optional parts addressed as well?

Is the integrity level appropriate? A component might conform to a lower ASIL than you really need for your application, but it still has the conformance stamp of the standard on it. That can be a problem if, for example, you take something assessed for noncritical functions and want to use it in a life critical application. Is the scope of the claimed conformance appropriate? For example, you might have dozens of safety critical functions in a system, but only three or four were actually checked for conformance and the rest were not. You can say it conforms to a standard, but the problem is there are pieces that really matter that were never checked for conformance.

Has the standard been aggressively tailored so that it weakens the value of the claimed conformance? Some standards permit skipping some clauses if they don’t matter to safety in that particular application, but with funding and deadline pressures, there might be some incentive to drop out clauses that really do matter. So it’s important to understand how tailored the standard was. Was it the full standard, or were pieces left out that really should have been covered?

Now to be sure, sometimes limited conformance along all these dimensions makes perfect sense. It’s okay to do that so long as, first of all, you don’t compromise safety. So you’re only leaving out things that don’t matter to safety. Second, you’re crystal clear about what you’re claiming, and you don’t ask more of the system than it can really deliver for safety.

Typically, signs of aggressive tailoring or conformance to only part of a standard are problematic for life critical systems. It’s common to see misunderstandings based on one or more of these issues. Somebody claims conformance to a standard but does not disclose the limitations, and somebody else gets confused and says: oh, well, the safety box has been checked, so there’s nothing to worry about. But in fact, safety is a problem because the conformance claim is much narrower than what is required for safety in that application.

During development (before the design is complete), partial conformance and measuring progress against partial conformance can actually be quite helpful. Ideally, there’s a safety case that documents the conformance plan and has a list of how you plan to conform to all the aspects of the standard you care about. Then you can measure progress against the completeness of the safety case. The progress is probably not linear, and not every clause takes the same amount of effort. But still, just looking at what fraction of the standard you’ve achieved conformance to internally can be very helpful for managing the engineering process.
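The simplest version of such a progress metric is just the fraction of tracked clauses with completed conformance evidence. The clause IDs and status values below are made-up examples, not identifiers from any real standard:

```python
def conformance_progress(clause_status: dict) -> float:
    """Fraction of tracked clauses whose conformance evidence is
    complete. Remember: progress is rarely linear, since clauses
    vary widely in effort."""
    if not clause_status:
        return 0.0
    done = sum(1 for s in clause_status.values() if s == "complete")
    return done / len(clause_status)

status = {
    "7.4.1": "complete",     # hypothetical clause IDs for illustration
    "7.4.2": "in_progress",
    "8.1.3": "complete",
    "8.2.1": "not_started",
}
assert conformance_progress(status) == 0.5
```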

Near the end of the design validation process, you can do mock conformance checks. The metric there is the number of problems found with conformance, which basically amounts to bug reports against the safety case rather than against the software itself.

Summing up, conforming to relevant safety standards is an essential part of ensuring safety, especially in life critical products. There are a number of metrics, measures and ways to assess how well that conformance actually is going to help your safety. It’s important to make sure you’ve conformed to the right standards, you’ve conformed with the right scope and that you’ve done the right amount of tailoring so that you’re actually hitting all the things that you need to in the engineering validation and deployment process to ensure you’re appropriately safe.

Surprise Metrics (Metrics Episode 12)

You can estimate how many unknown unknowns are left to deal with via a metric that measures the surprise arrival rate.  Assuming you're really looking, infrequent surprises predict that they will be infrequent in the near future as well.

Your first reaction to thinking about measuring unknown unknowns may be how in the world can you do that? Well, it turns out the software engineering community has been doing this for decades: they call it software reliability growth modeling. That area’s quite complex with a lot of history, but for our purposes, I’ll boil it down to the basics.

Software reliability growth modeling deals with the problem of knowing whether your software is reliable enough, or in other words, whether or not you’ve taken out enough bugs that it’s time to ship the software. All things being equal, if your system test reveals 10 times more defects in the current release than in the previous release, it’s a good bet your new release is not as reliable as your old one.

On the other hand, if you’re running a weekly test debug cycle with a single release, so every week you test it, you remove some bugs, then you test it some more the next week, at some point you’d hope that the number of bugs found each week will be lower, and eventually you’ll stop finding bugs. When the number of bugs per week you find is low enough, maybe zero, or maybe some small number, you decide it’s time to ship. Now that doesn’t mean your software is perfect! But what it does mean is there’s no point testing anymore if you’re consistently not finding bugs. Alternately, if you have a limited testing budget, you can look at the curve over time of the number of bugs you’re discovering each week and get some sort of estimate about how many bugs you would find if you continued testing for additional cycles.

At some point, you may decide that the number of bugs you’ll find and the amount of time it will take simply isn’t worth the expense. And especially for a system that is not life critical, you may decide it’s just time to ship. A dizzying array of mathematical models has been proposed over the years for the shape of the curve of how many more bugs are left in the system based on your historical rate of how often you find bugs. Each one of those models comes with significant assumptions and limits to applicability. 
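To make this concrete, here is one deliberately simplistic projection, assuming the weekly bug discovery count decays geometrically. Real reliability growth models (and their assumptions) are far more sophisticated, so treat this as a sketch of the idea rather than a recommended model:

```python
def remaining_bug_estimate(weekly_bugs):
    """Rough projection of bugs still to be found, assuming weekly
    discovery counts decay geometrically. One toy model among the
    many proposed in the reliability growth literature."""
    if len(weekly_bugs) < 2 or weekly_bugs[-1] == 0:
        return 0.0
    # Average week-over-week decay ratio of the discovery rate.
    ratios = [b / a for a, b in zip(weekly_bugs, weekly_bugs[1:]) if a > 0]
    r = sum(ratios) / len(ratios)
    if r >= 1.0:
        return float("inf")  # discovery rate not shrinking: reliability not growing
    # Geometric series: remaining total = last_week * r / (1 - r)
    return weekly_bugs[-1] * r / (1 - r)

# Discovery counts halving each week: expect about as many bugs left
# as were found in the final week.
estimate = remaining_bug_estimate([40, 20, 10, 5])
assert 4.9 < estimate < 5.1
```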

But the point is that people have been thinking about this for more than 40 years in terms of how to project how many more bugs are left in a system even though you haven’t found them. And there’s no point trying to reinvent all those approaches yourself.

Okay, so what does this have to do with self-driving car metrics?

Well, it’s really the same problem. In software tests, the bugs are the unknowns, because if you knew where the bugs were, you’d fix them. You’re trying to estimate how many unknowns there are or how often they’re going to arrive during a testing process. In self-driving cars, the unknown unknowns are the things you haven’t trained on or haven’t thought about in the design. You’re doing road testing, simulation and other types of validation to try and uncover these. But it ends up in the same place. You’re trying to look for latent defects or functionality gaps and you’re trying to get an idea of how many more there are left in the system that you haven’t found yet, or how many you can expect to find if you invest more resources in further testing.

For simplicity, let’s call the things in self-driving cars that you haven’t found yet surprises. 

The reason I put it this way is that there are two fundamentally different types of defects in these systems. One is you built the system the wrong way. It’s an actual software bug. You knew what you were supposed to do, and you didn’t get there. Traditional software testing and traditional software quality will help with those, but a surprise isn’t that. 

A surprise is a requirements gap or something in the environment you didn’t know was there. Or a surprise has to do with imperfect knowledge of the external world. But you can still treat it as a similar, although different, class from software defects and go at it the same way. One way to look at this is a surprise is something you didn’t realize should be in your ODD and therefore is a defect in the ODD description. Or, you didn’t realize the surprise could kick your vehicle out of the ODD and is a defect in the model of ODD violations that you have to detect. You’d expect that surprises that can lead to safety-critical failures are the ones that need the highest priority for remediation.

To create a metric for surprises, you need to track the number of surprises over time. You hope that over time, the arrival rate of surprises gets lower. In other words, they happen less often and that reflects that your product has gotten more mature, all things being equal. 

If the number of surprises gets higher, that could be a sign that your system has gotten worse at dealing with unknowns, or could also be a sign that your operational domain has changed, and more weird things are happening than used to because of some change in the outside world. That requires you to update your ODD to reflect the new real world situation. Either way, a higher arrival rate of surprises means you’re less mature or less reliable, and a lower rate means you’re probably doing better.
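A crude way to track that trend is to compare the recent surprise arrival rate against an earlier baseline. The split-in-half windowing below is an arbitrary illustration; a real metric would use statistically justified windows:

```python
def arrival_rate_trend(surprises_per_week):
    """Compare the recent surprise arrival rate against the earlier
    half of the observation period. Returns 'improving', 'worsening',
    or 'flat'. Window choice here is arbitrary, for illustration."""
    half = len(surprises_per_week) // 2
    early = sum(surprises_per_week[:half]) / max(half, 1)
    late = sum(surprises_per_week[half:]) / max(len(surprises_per_week) - half, 1)
    if late < early:
        return "improving"
    if late > early:
        return "worsening"
    return "flat"

assert arrival_rate_trend([8, 6, 5, 3, 2, 1]) == "improving"
assert arrival_rate_trend([1, 1, 4, 6]) == "worsening"
```

A "worsening" result deserves investigation either way: it might mean the system regressed, or that the operational domain itself has changed.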

This may sound a little bit like disengagements as a metric, but there’s a profound difference. That difference applies even if disengagements on road testing are one of the sources of data.

The idea is that measuring how often you disengage -- a safety driver takes over, or the system gives up and says, “I don’t know what to do” -- is a source of raw data. But the disengagements could be for many different reasons. And what you really care about for surprises is only disengagements that happened because of a defect in the ODD description or some other requirements gap.

Each incident that could be a surprise needs to be analyzed to see if it was a design defect, which isn’t an unknown. That’s just a mistake that needs to be fixed.  

But an incident could be a true unknown unknown that requires re-engineering or retraining your perception system or another remediation to handle something you didn’t realize until now was a requirement or operational condition that you need to deal with. Since even with a perfect design and perfect implementation, unknowns are going to continue to present risk, what you need to be tracking with a surprise metric is the arrival of actual surprises.

It should be obvious that you need to be looking for surprises to see them. That’s why things like monitoring near misses and investigating the occurrence of weird, but seemingly benign, behavior matters. Safety culture plays a role here. You have to be paying attention to surprises instead of dismissing them if they didn’t seem to do immediate harm. A deployment decision can use the surprise arrival rate metric to get an approximate answer of how much risk will be taken due to things missing from the system requirements and test plan. In other words, if you’re seeing surprises arrive every few minutes or every hour and you deploy, there’s every reason to believe that will continue to happen.

If you haven’t seen a surprise in thousands or tens or hundreds of thousands of hours of testing, then you can reasonably assume that surprises are unlikely to happen every hour once you deploy. (You can always get unlucky, so this is playing the odds to be sure.)
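One common statistical rule of thumb for "playing the odds" here is the rule of three: after observing zero events in N independent trials (or hours), an approximate 95% upper confidence bound on the event rate is 3/N. This is a heuristic, not a guarantee, and it assumes you were genuinely looking for surprises the whole time:

```python
def surprise_rate_upper_bound(surprise_free_hours: float) -> float:
    """Rule-of-three: approximate 95% upper bound on the surprise
    arrival rate (per hour) after observing zero surprises.
    A statistical rule of thumb, valid only if surprises were being
    actively looked for during the observation period."""
    if surprise_free_hours <= 0:
        raise ValueError("need positive observation time")
    return 3.0 / surprise_free_hours

# 10,000 surprise-free test hours bounds the rate at about
# one surprise per 3,333 hours (not zero!).
bound = surprise_rate_upper_bound(10_000)
assert abs(bound - 0.0003) < 1e-12
```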

To deploy, you want to see the surprise arrival rate reduced to something acceptably low. You’ll also want to know the system has a good track record so that when a surprise does happen, it’s pretty good at recognizing something weird has happened and doing something safe in response.

To be clear, in the real world, the arrival rate of surprises will probably never be zero, but you need to measure that it’s acceptably low so you can make a responsible deployment decision.

Operational Design Domain Metrics (Metrics Episode 11)

Operational Design Domain metrics (ODD metrics) deal with both how thoroughly the ODD has been validated as well as the completeness of the ODD description. How often the vehicle is forcibly ejected from its ODD also matters.

An ODD is the designer’s model of the types of things that the self-driving car is intended to deal with. The actual world, in general, is going to have things that are outside the ODD. As a simple example, the ODD might include fair weather and rain, but snow and ice might be outside the ODD because the vehicle is intended to be deployed in a place where snow is very infrequent.

Despite designers’ best efforts, it’s always possible for the ODD to be violated. For example, if the ODD is Las Vegas in the desert, the system might be designed for mostly dry weather or possibly light rain. But in fact, in Vegas, once in a while, it rains and sometimes it even snows. The day that it snows, the vehicle will be outside its ODD, even though it’s deployed in Las Vegas.

There are several types of ODD safety metrics that can be helpful. One is how well validation covers the ODD. What that means is whether the testing, analysis, simulation and other validation actually cover everything in the ODD, or have gaps in coverage.

When considering ODD coverage it’s important to realize that ODDs have many, many dimensions. There is much more to an ODD than just geo-fencing boundaries. Sure, there’s day and night, wet versus dry, and freeze versus thaw. But you also have traffic rules, condition of road markings, the types of vehicles present, the types of pedestrians present, whether there are leaves on the trees that affect LIDAR localization, and so on. All these things and more can affect perception, planning, and motion constraints.

While it’s true that a geo-fenced area can help limit some of the diversity in the ODD, simply specifying a geo-fence doesn’t tell you everything you need to know, nor that you’ve covered all the things that are inside that geo-fenced area. Metrics for ODD validation can be based on a detailed model of what’s actually in the ODD -- basically an ODD taxonomy of all the different factors that have to be handled and how well testing, simulation, and other validation cover that taxonomy.
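As a toy sketch, an ODD coverage metric might treat the taxonomy as a grid of factor combinations and measure what fraction validation has exercised. Real ODD taxonomies have many more dimensions and values than the two shown here, and not all combinations are equally important:

```python
from itertools import product

def odd_coverage(taxonomy: dict, tested: set) -> float:
    """Fraction of ODD taxonomy combinations exercised by validation.
    Tested tuples must list values in the taxonomy's key order.
    A toy model: real taxonomies are much larger and weighted by risk."""
    cells = set(product(*taxonomy.values()))
    return len(cells & tested) / len(cells)

taxonomy = {
    "lighting": ("day", "night"),
    "surface": ("dry", "wet"),
}
tested = {("day", "dry"), ("day", "wet"), ("night", "dry")}
assert odd_coverage(taxonomy, tested) == 0.75  # night+wet never validated
```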

Another type of metric is how well the system detects ODD violations. At some point, a vehicle will be forcibly ejected from its ODD even though it didn’t do anything wrong, simply due to external events. For example, a freak snowstorm in the desert, a tornado, or the appearance of a completely unexpected new type of vehicle can force a vehicle out of its ODD with essentially no warning. The system has to recognize when it has exited its ODD and be safe. A metric related to this is how often ODD violations are happening during testing and on the road after deployment.

Another metric is what fraction of ODD violations are actually detected by the vehicle. This could be a crucial safety metric, because if an ODD violation occurs and the vehicle doesn’t know it, it might be operating unsafely. Now it’s hard to build a detector for ODD violations that the vehicle can’t detect (and such failures should be corrected). But this metric can be gathered by root cause analysis whenever there’s been some sort of system failure or incident. One of the root causes might simply be failure to detect an ODD violation.

Coverage of the ODD is important, but an equally important question is: how good is the ODD description itself? If your ODD description is missing many things that happen every day in your actual operational domain (the real world), then you’re going to have some problems.

A higher-level metric to talk about is ODD description quality. That is likely to be tied to other metrics already mentioned in this and other segments. Here are some examples. The frequency of ODD violations can help inform the coverage metric of the ODD against the operational domain. Frequency of motion failures could be related to motion system problems, but could also be due to missing environmental characteristics in your ODD. For example, cobblestone pavers are going to have significantly different surface dynamics than a smooth concrete surface and might come as a surprise when they are encountered.

Frequency of perception failures could be due to training issues, but could also be something missing from the ODD object taxonomy. For example, a new aggressive clothing style or new types of vehicles. The frequency of planning failures could be due to planning bugs, but could also be due to the ODD missing descriptions of informal local traffic conventions.

Frequency of prediction failures could be prediction issues, but could also be due to missing a specific class of actors. For example, groups of 10 and 20 runners in formation near a military base might present a challenge if formation runners aren't in training data. It might be okay to have an incomplete ODD so long as you can always tell when something is happening that forced you out of the ODD. But it’s important to consider that metric issues in various areas might be due to unintentionally restricted ODD versus being an actual failure of the system design itself.

Summing up, ODD metrics should address how well validation covers the whole ODD and how well the system detects ODD violations. It’s also useful to consider that a cause of poor metrics in other aspects of the design might in fact be that the ODD description is missing something important compared to what happens in the real world.

Prediction Metrics (Metrics Episode 10)

You need to drive not where the free space is, but where the free space is going to be when you get there. That means perception classification errors can affect not only the "what" but also the "future where" of an object.

Prediction metrics deal with how well a self driving car is able to take the results of perception data and predict what happens next so that it can create a safe plan. 

There are different levels of prediction sophistication required depending on operational conditions and desired own-vehicle capability. The first, simplest prediction capability is no prediction at all. If you have a low speed vehicle in an operational design domain in which everything is guaranteed to also be moving at low speeds and be relatively far away compared to the speeds, then a fast enough control loop might be able to handle things based simply on current object positions. The assumption there would be everything’s moving slowly, it’s far away, and you can stop your vehicle faster than things can get out of control.  (Note that if you move slowly but other vehicles move quickly, that violates the assumptions for this case.)

The prediction basically amounts to: nothing moves fast compared to its distance. But even here, a prediction metric can be helpful because there’s an assumption that everything is moving slowly compared to its distance away. That assumption might be violated by nearby objects moving slowly but still a little too fast because they’re so close, or by far away things moving fast, such as a high speed vehicle in an urban environment that is supposed to have a low speed limit. The frequency at which that assumption is violated will be an important safety metric.
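That assumption check can be sketched as a time-to-reach comparison against the vehicle's stopping time. The stopping time and safety margin below are illustrative numbers, not validated values; an SPI would count how often this check fails in the field:

```python
def motion_assumption_ok(distance_m: float, closing_speed_mps: float,
                         stop_time_s: float = 2.0, margin: float = 2.0) -> bool:
    """Check the 'everything moves slowly relative to its distance'
    assumption: the object must not be able to close the gap before
    the vehicle can stop, with a safety margin. Parameter values are
    illustrative placeholders, not validated figures."""
    if closing_speed_mps <= 0:
        return True  # object not approaching
    time_to_reach = distance_m / closing_speed_mps
    return time_to_reach > stop_time_s * margin

assert motion_assumption_ok(distance_m=50, closing_speed_mps=2)       # 25 s to close the gap
assert not motion_assumption_ok(distance_m=3, closing_speed_mps=1.5)  # only 2 s: too close
```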

For self-driving cars that operate at more than a slow crawl, you’ll start to need some sort of prediction based on likely object movement. You often hear: "drive to where the free space is," with the free space being the open road space that’s safe for a car to maneuver in.

But that doesn’t actually work once you’re doing more than about walking speed, because it isn’t where the free space is now that matters. What you need to do is to drive to where the free space is going to be when you get there. Doing that requires prediction because many of the things on the road move over time, changing where the free space is one second from now, versus five seconds from now, versus 10 seconds from now.

A starting point for prediction is assuming that everything maintains the same speed and direction it currently has, and updating the speeds and directions periodically as you run your control loop. Doing this requires tracking so that you know not only where something is, but also what its direction and speed are. That means that with this type of prediction, metrics having to do with tracking accuracy become important, including distance, direction of travel and speed.
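The constant-velocity model is simple enough to state in a few lines. This sketch works in a 2D plane using tracked velocity components; real systems work in richer state spaces with uncertainty estimates:

```python
def predict_position(x: float, y: float,
                     vx: float, vy: float,
                     horizon_s: float) -> tuple:
    """Constant-velocity prediction: the simplest 'where will the free
    space be when I get there' model, using only first-derivative
    (speed and direction) tracking data."""
    return (x + vx * horizon_s, y + vy * horizon_s)

# A pedestrian at (10, 0) walking 1.5 m/s in +y, predicted 3 s ahead.
assert predict_position(10, 0, 0, 1.5, 3.0) == (10.0, 4.5)
```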

For safety it isn’t perfect position accuracy on an absolute coordinate frame that matters, but rather whether tracking is accurate enough to know if there’s a potential collision situation or other danger. It’s likely that better accuracy is required for things that are close and things that are moving quickly toward you and in general things that pose collision threats. 

For more sophisticated self driving cars, you’ll need to predict something more sophisticated than just tracking data. That’s because other vehicles, people, animals and so on will change direction or even change their mind about where they’re going or what they’re doing. 

From a physics point of view, one way to look at this is in terms of derivatives. The simplest prediction is the current position. A slightly more sophisticated prediction has to do with the first derivative: speed and direction. An even more sophisticated prediction would be to use the second derivative: acceleration and curvature. You can even use the third derivative: jerk or change in acceleration. To the degree you can predict these things, you’ll be able to have a better understanding of where the free space will be when you get there.
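
In code, that derivative-based view is just a truncated Taylor series, with one term per derivative (this is an illustrative one-dimensional sketch, not a full motion model):

```python
def predict_kinematic(x, v, a, j, t):
    """Taylor-series position prediction using successively higher
    derivatives: velocity v, acceleration a, and jerk j.  Each extra
    term refines where the object will be at time t."""
    return x + v*t + 0.5*a*t**2 + (1.0/6.0)*j*t**3
```

Dropping the jerk and acceleration terms recovers the simpler constant-velocity prediction; adding them captures objects that are speeding up, braking, or curving.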

From an everyday point of view, the way to look at it is that real things don’t stand still -- they move. But when they’re moving, they change direction, they change speed, and sometimes they completely change what they’re trying to do, maybe doubling back on themselves.

An example of a critical scenario is a pedestrian standing on a curb waiting for a crossing light. Human drivers use the person’s body language to tell whether the pedestrian is at risk of stepping off the curb even though they’re not supposed to be crossing. While that’s not perfect, most drivers will have stories of the time they didn’t hit someone because they noticed the person was distracted looking at their cell phone, or the person looked like they were about to jump into the road, and so on. If you only look at speed and possibly acceleration, you won’t handle cases in which a human driver would say, “That looks dangerous. I’m going to slow down to give myself more reaction time in case behavior changes suddenly.”

It isn’t just the current trajectory that matters for a pedestrian. It’s what the pedestrian’s about to do, which might be a dramatic change from standing still to running across the street to catch a bus.

The same holds true for a human driver of another vehicle when you have some telltale available that suggests they’re about to swerve or turn in front of you. For even more sophisticated predictions, you probably don’t end up with a single prediction, but rather with a probability cloud of possible positions and directions of travel over time, where keeping on the same path might be the most probable. But maximum command authority maneuvers such as a hard right turn, left turn, acceleration, or deceleration might all be possible with lower, but not zero, probability. Given how complicated prediction can be, metrics might have to be more complicated than simply "did you guess exactly right?" There’s always going to be some margin of error in any prediction, but you need to predict in a way that results in acceptable safety even in the face of surprises.

One way to handle the prediction is to take a snapshot of the current position and the predicted movement, wait a few control loop cycles -- some fraction of a second or a second -- and then check to see how it turned out. In other words, you can just wait a little while, see how well your prediction turned out, and keep score as to how good your predictions are. In terms of metrics, you need some sort of bound on the worst-case prediction error. Every time that bound is violated, it is potentially a safety-related event and should be counted as a metric. Those bounds might be probabilistic in nature, but at some point there has to be a bound on what is acceptable prediction error and what’s not.
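
A sketch of that keep-score approach might look like the following, with a hypothetical fixed error bound standing in for whatever probabilistic bound you actually choose:

```python
import math

class PredictionScorer:
    """Keep score of prediction quality: snapshot a prediction, check it
    against the observed position a few cycles later, and count every
    time the error bound is violated as a potential safety-related event."""
    def __init__(self, error_bound_m):
        self.error_bound_m = error_bound_m
        self.checked = 0
        self.violations = 0

    def score(self, predicted_xy, actual_xy):
        """Compare one stored prediction against what actually happened."""
        err = math.hypot(predicted_xy[0] - actual_xy[0],
                         predicted_xy[1] - actual_xy[1])
        self.checked += 1
        if err > self.error_bound_m:
            self.violations += 1
        return err

    def violation_rate(self):
        return self.violations / self.checked if self.checked else 0.0
```

The violation rate, rather than the raw error, is what feeds the safety metric: each violation is a surprise the planner had to absorb.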

To the degree that prediction is based on object type, you’re likely to assume, for example, that a pedestrian typically cannot go as fast as a bicycle, but that a pedestrian can jump backwards and pivot turn. You might want to know if that type-specific prediction behavior is violated. For example, a pedestrian suddenly going from a stop to 20 miles per hour crossing right in front of your car might be a competitive sprinter who’s decided to run across the road, but more likely signals that electric rental scooters have arrived in your town and you need to include them in your operational design domain.
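
One illustrative way to check type-specific behavior is a per-class motion envelope. The classes and speed limits below are made-up assumptions for the sketch, not validated values:

```python
# Hypothetical per-class motion envelopes (max plausible speed, m/s);
# the numbers are illustrative assumptions, not validated values.
CLASS_MAX_SPEED_MPS = {
    "pedestrian": 4.0,   # brisk run; a sprinter or scooter would exceed this
    "bicycle":    12.0,
    "car":        60.0,
}

def envelope_violation(object_class, observed_speed_mps):
    """True when an object moves faster than its class should allow --
    a signal that either the classification or the taxonomy is wrong
    (e.g., rental scooters showing up labeled as pedestrians)."""
    limit = CLASS_MAX_SPEED_MPS.get(object_class)
    return limit is not None and observed_speed_mps > limit
```

A "pedestrian" observed at 9 m/s (about 20 mph) trips the check, which is exactly the scooter scenario above.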

Prediction metrics might be related to the metrics for correct object classification if the prediction is based on the class of the object. 

Summing up, sophisticated prediction of behavior might be needed for highly permissive operation in complex dense environments. If you’re in a narrow city street with pedestrians close by and other things going on, you’re going to need really good prediction. Metrics for this topic should focus not only on motion measurement accuracy and position accuracy, but also on the ability to successfully predict what happens next, even if a particular object performs a sudden change in direction, speed, and so on. In the end, your metric should help you understand the likelihood that you’ll correctly interpret where the free space is going to be so that your path planner can plan a safe path.

Sunday, December 6, 2020

Perception Metrics (Metrics Episode 9)

Don’t forget that there will always be something in the world you’ve never seen before and have never trained on, but your self driving car is going to have to deal with it. A particular area of concern is correlated failures across sensing modes.

Perception safety metrics deal with how a self driving car takes sensor inputs and maps them into a real-time model of the world around it. 

Perception metrics should deal with a number of areas. One area is sensor performance. This is not absolute performance, but rather with respect to safety requirements. Can a sensor see far enough ahead to give accurate perception in time for the planner to react? Does the accuracy remain sufficient given changes in environmental and operational conditions? Note that for the needs of the planner, further isn’t better without limit. At some point, you can see far enough ahead that you’ve reached the planning horizon, and sensor performance beyond that might help with ride comfort or efficiency but is not necessarily directly related to safety.

Another type of metric deals with sensor fusion. At a high level, success with sensor fusion is whether that fusion strategy can actually detect the types of things you need to see in the environment. But even if it seems like sensor fusion is seeing everything it needs to, there are some underlying safety issues to consider. 

One is measuring correlated failures. Suppose your sensor fusion algorithm assumes that multiple sensors have independent failures. So you’ve done some math and said, well, the chance of all the sensors failing at the same time is low enough to tolerate. That analysis assumes there’s some independence across the sensor failures.

For example, if you have three sensors and you’re assuming that they fail independently, knowing that two of those sensors failed at the same time on the same thing is really important, because it provides counter-evidence to your independence assumption. But you need to be looking for this specifically, because your vehicle may have performed just fine thanks to the third sensor being independent. So the important thing here is that the metric is not about whether your sensor fusion worked, but rather whether the independence assumption behind your analysis was valid or invalid.
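
A simple sketch of that kind of independence check compares the observed joint-failure rate of two sensors against what independence would predict (the margin factor is an arbitrary illustrative choice, not a statistically derived threshold):

```python
def independence_check(n_frames, fail_a, fail_b, joint_fails, margin=2.0):
    """Compare the observed joint-failure rate of two sensors against
    what independence predicts (the product of the marginal rates).
    A joint rate much higher than predicted is counter-evidence to the
    independence assumption behind the fusion safety analysis."""
    p_a = fail_a / n_frames
    p_b = fail_b / n_frames
    expected_joint = p_a * p_b        # what independence would give you
    observed_joint = joint_fails / n_frames
    return observed_joint > margin * expected_joint
```

With two sensors each failing 1% of the time, independence predicts roughly one joint failure per 10,000 frames; seeing 50 of them is a loud warning even though fusion may have masked every one.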

Another metric to consider in the area of sensor fusion is whether or not detection ride-through based on tracking is covering up problems. It’s easy enough to rationalize that if you see something nine frames out of ten, then missing one frame isn’t a big deal because you can track through the dropout. If missed detections are infrequent and random, that might be a valid assumption. But it’s also possible you have clusters of missed detections tied to some types of environments or some types of objects for certain types of sensors, even if overall they are a small fraction. Keeping track of how often and for how long ride-through is actually required to cover missing detections is important to validate the underlying assumption of random dropouts rather than clustered or correlated dropouts.
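
One illustrative way to measure this is a histogram of missed-detection run lengths, since random independent dropouts should produce mostly runs of length one:

```python
from collections import Counter

def dropout_run_lengths(detected_per_frame):
    """Histogram of consecutive missed-detection run lengths.  Random
    independent dropouts produce mostly runs of length 1; long runs
    suggest clustered or correlated dropouts that ride-through tracking
    may be papering over."""
    runs = Counter()
    current = 0
    for detected in detected_per_frame:
        if detected:
            if current:
                runs[current] += 1
            current = 0
        else:
            current += 1
    if current:                      # count a run still open at the end
        runs[current] += 1
    return runs
```

A tail of long runs in this histogram, even at a low overall miss rate, is the clustered-dropout signature that invalidates the ride-through assumption.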

A third type of metric is classification accuracy. It’s common to track false negatives, which are how often you miss something that matters. For example, if you miss a pedestrian, it’s hard to avoid hitting something you don’t see. But you should track false negatives not just based on the sensor fusion output, but also per sensor and per combinations of sensors. This goes back to making sure there aren’t systematic faults that undermine the independence of failure assumptions. 

There are also false positives, which are how often you see something that isn’t really there. For example, a pattern of cracks in the pavement might look like an obstacle and could cause a panic stop. Again, sensor fusion might be masking a lot of false positives. But you need to know whether or not your independence assumption for deciding how the sensors fail as a system is valid.

Somewhere in between are misclassifications. For example, saying something is a bicycle versus a wheelchair versus a pedestrian is likely to matter for prediction, even though all three of those things are objects that shouldn’t be hit.

Just touching on the independence point one more time, all of these metrics -- false negatives, false positives, and misclassifications -- should be tracked per sensor modality. That’s because if sensor fusion saves you when, say, vision misclassifies something but another modality still gets it right, you can’t count on that always working. You want to make sure that each of your sensor modalities works as well as it can without systematic defects, because maybe next time you won’t get lucky, and the sensor fusion algorithm will suffer a correlated fault that leads to a problem.

In all the different aspects of perception, edge cases matter. There are going to be things you haven’t seen before and you can’t train on something you’ve never seen. 

So how well does your sensing system generalize? There are very likely to be systematic biases in training and validation data that never occurred to anyone to notice. An example we’ve seen is that if you take data in cool weather, nobody’s wearing shorts outdoors in the Northeast US. Therefore, the system learns implicitly that tan or brown things sticking out of the ground with green blobs on top are bushes or trees.  But in the summer that might be someone in shorts wearing a green shirt.

You also have to think about unusual presentations of known objects. For example, a person carrying a bicycle is different than a bicycle carrying a person. Or maybe someone’s fallen down into the roadway. Or maybe you see very strange vehicle configurations or weird paint jobs on vehicles.

The thing to look for in all these is clusters or correlations in perception failures -- things that don’t support a random independent failure assumption between modes. Because those are the places where you’re going to have trouble with sensor fusion sorting out the mess and compensating for failures.

A big challenge in perception is that the world is an essentially infinite supply of edge cases. It’s advisable to have a robust taxonomy of objects you expect to see in your operational design domain, especially to the degree that prediction, which we’ll discuss later on, requires accurate classification of objects or maybe even object subtypes.

While it’s useful to have a metric that deals with coverage of the taxonomy in training and testing, it’s just as important to have a metric for how well the taxonomy actually represents the operational design domain. Along those lines, a metric that might be interesting is how often you encounter something that’s not in the taxonomy, because if that’s happening every minute or every hour, that tells you your taxonomy probably needs more maturity before you deploy.
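
Such a metric could be as simple as an out-of-taxonomy encounter rate; this sketch assumes you log a label (or some "unknown" marker) for each object encountered, which is a simplification of real perception output:

```python
def out_of_taxonomy_rate(observed_labels, taxonomy):
    """Fraction of encountered objects that fall outside the taxonomy.
    A high rate suggests the taxonomy needs more maturity before
    deployment."""
    if not observed_labels:
        return 0.0
    unknown = sum(1 for label in observed_labels if label not in taxonomy)
    return unknown / len(observed_labels)
```

Trending this rate over time and over operational design domain regions shows whether the taxonomy is converging or whether new object types keep arriving.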

Because the world is open-ended, a metric is also useful for how often your perception is saying: "I’m not sure what that is." Now, it’s okay to handle "I’m not sure" by doing a safety shutdown or doing something safe. But knowing how often your perception is confused or has a hole is an important way to measure your perception maturity.

Summing up, perception metrics, as we’ve discussed them, cover a broad swath from sensors through sensor fusion to object classification. In practice, these might be split out to different types of metrics, but they have to be covered somewhere. And during this discussion we’ve seen that they do interact a bit.

The most important outcome of these metrics is to get a feel for how well the system is able to build a model of the outside world, given that sensors are imperfect, operational conditions can compromise sensor capabilities, and the real world can present objects and environmental conditions that both have never been seen, and worse, might cause correlated sensor failures that compromise the ability of sensor fusion to actually come up with an accurate classification of specific types of objects. Don’t forget, there will always be something in the world you’ve never seen before and have never trained on, but your self driving car is going to have to deal with it.
