Showing posts with label perception validation.

Sunday, December 6, 2020

Perception Metrics (Metrics Episode 9)

Don’t forget that there will always be something in the world you’ve never seen before and have never trained on, but your self-driving car is going to have to deal with it. A particular area of concern is correlated failures across sensing modes.

Perception safety metrics deal with how a self-driving car takes sensor inputs and maps them into a real-time model of the world around it.

Perception metrics should deal with a number of areas. One area is sensor performance. This is not absolute performance, but rather performance with respect to safety requirements. Can a sensor see far enough ahead to give accurate perception in time for the planner to react? Does the accuracy remain sufficient given changes in environmental and operational conditions? Note that for the needs of the planner, seeing further isn’t better without limit. At some point you can see far enough ahead that you’ve reached the planning horizon, and sensor performance beyond that might help with ride comfort or efficiency but is not necessarily directly related to safety.
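
Here is a minimal sketch of what one version of that check could look like: compare the sensor’s effective detection range against the distance needed to perceive, react, and brake at a given speed. The speed, reaction time, deceleration, and margin values are hypothetical placeholders, not numbers from any real vehicle.

    # Minimal sketch: does the sensor's effective range cover the distance
    # needed to perceive, react, and brake at a given speed? All parameter
    # values are hypothetical placeholders, not numbers from a real vehicle.

    def required_detection_range(speed_mps, reaction_time_s, decel_mps2, margin_m=5.0):
        """Distance needed to react and brake to a stop, plus a safety margin."""
        reaction_dist = speed_mps * reaction_time_s
        braking_dist = speed_mps ** 2 / (2.0 * decel_mps2)
        return reaction_dist + braking_dist + margin_m

    def sensor_range_sufficient(sensor_range_m, speed_mps,
                                reaction_time_s=0.5, decel_mps2=3.0):
        return sensor_range_m >= required_detection_range(
            speed_mps, reaction_time_s, decel_mps2)

    # Example: a 100 m effective detection range at 25 m/s (about 56 mph)
    print(sensor_range_sufficient(sensor_range_m=100.0, speed_mps=25.0))  # False

Beyond the planning horizon, extra range stops adding anything to this particular check, which is the point made above.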

Another type of metric deals with sensor fusion. At a high level, success with sensor fusion is whether that fusion strategy can actually detect the types of things you need to see in the environment. But even if it seems like sensor fusion is seeing everything it needs to, there are some underlying safety issues to consider. 

One is measuring correlated failures. Suppose your sensor fusion algorithm assumes that multiple sensors have independent failures. You’ve done some math and concluded that the chance of all the sensors failing at the same time is low enough to tolerate -- but that analysis assumes there is some independence across the sensor failures.

For example, if you have three sensors and you’re assuming that they fail independently, knowing that two of those sensors failed at the same time on the same thing is really important because it provides counter-evidence to your independence assumption. But you need to be looking for this specifically, because your vehicle may have performed just fine thanks to the third, independent sensor. So the important point here is that the metric is not about whether your sensor fusion works, but rather whether the independence assumption behind your analysis is valid.
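
Here is a minimal sketch of the kind of bookkeeping that implies: log a miss flag per sensor per frame, then compare the observed joint miss rate for each sensor pair against what independence would predict. The log format and sensor names are placeholders for whatever the real pipeline records.

    # Minimal sketch: compare observed simultaneous misses against what an
    # independence assumption would predict. The log format (one dict of
    # per-sensor miss flags per frame) is a hypothetical placeholder.

    from itertools import combinations

    def independence_check(frames, sensors):
        """frames: list of dicts like {"camera": False, "lidar": True, ...},
        where True means that sensor missed the object in that frame."""
        n = len(frames)
        miss_rate = {s: sum(f[s] for f in frames) / n for s in sensors}
        report = {}
        for a, b in combinations(sensors, 2):
            observed = sum(f[a] and f[b] for f in frames) / n
            predicted = miss_rate[a] * miss_rate[b]  # what independence implies
            report[(a, b)] = (observed, predicted)
        return report

    frames = [{"camera": False, "lidar": False, "radar": False},
              {"camera": True,  "lidar": True,  "radar": False},
              {"camera": True,  "lidar": False, "radar": False},
              {"camera": False, "lidar": False, "radar": True}]
    print(independence_check(frames, ["camera", "lidar", "radar"]))

If the observed joint miss rate is much higher than the product of the individual miss rates, that is counter-evidence for the independence assumption.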

Another sensor fusion metric to consider is whether detection ride-through based on tracking is covering up problems. It’s easy enough to rationalize that if you see something nine frames out of ten, missing one frame isn’t a big deal because you can track through the dropout. If missed detections are infrequent and random, that might be a valid assumption. But it’s also possible that you have clusters of missed detections tied to certain types of environments, objects, or sensors, even if overall they are a small fraction. Keeping track of how often and for how long ride-through is actually required to bridge missing detections is important to validate the underlying assumption of random dropouts rather than clustered or correlated dropouts.
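
A minimal sketch of that bookkeeping might record, per track, the length of each run of consecutive missed frames; the input format is a placeholder for whatever the tracker actually logs.

    # Minimal sketch: measure how long ride-through actually has to bridge.
    # Input is a per-track list of booleans (True = detected this frame);
    # the format is a placeholder for real tracker logs.

    def dropout_run_lengths(detections):
        """Return the length of each run of consecutive missed frames."""
        runs, current = [], 0
        for detected in detections:
            if detected:
                if current:
                    runs.append(current)
                current = 0
            else:
                current += 1
        if current:
            runs.append(current)
        return runs

    track = [True, True, False, True, False, False, False, True, True]
    runs = dropout_run_lengths(track)
    print(runs)                  # [1, 3] -- one 1-frame gap, one 3-frame gap
    print(max(runs, default=0))  # worst-case gap the tracker had to ride through

A histogram of those run lengths, bucketed by environment and object type, is what tells you whether dropouts really are random or are clustered.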

A third type of metric is classification accuracy. It’s common to track false negatives, which are how often you miss something that matters. For example, if you miss a pedestrian, it’s hard to avoid hitting something you don’t see. But you should track false negatives not just based on the sensor fusion output, but also per sensor and per combination of sensors. This goes back to making sure there aren’t systematic faults that undermine the assumption of independent failures.

There are also false positives, which measure how often you see something that isn’t really there. For example, a pattern of cracks in the pavement might look like an obstacle and could cause a panic stop. Again, sensor fusion might be masking a lot of false positives, but you need to know whether the independence assumption behind how the sensors fail as a system is valid.

Somewhere in between are misclassifications. For example, whether something is a bicycle versus a wheelchair versus a pedestrian is likely to matter for prediction, even though all three of those things are objects that shouldn’t be hit.

Touching on the independence point one more time, all of these metrics -- false negatives, false positives, and misclassifications -- should be tracked per sensor modality. That’s because if sensor fusion saves you, say when vision misclassifies something but another modality still gets it right, you can’t count on that always working. You want to make sure that each of your sensor modalities works as well as it can without systematic defects, because maybe next time you won’t get lucky and the sensor fusion algorithm will suffer a correlated fault that leads to a problem.
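
A minimal sketch of keeping those tallies per modality might look like the following; the record format (a ground-truth label plus one predicted label per modality, with None meaning nothing was detected) is a hypothetical placeholder.

    # Minimal sketch: tally false negatives, false positives, and
    # misclassifications per sensor modality, not just at the fusion output.
    # The record format is a hypothetical placeholder.

    from collections import defaultdict

    def per_modality_tallies(records, modalities):
        tallies = {m: defaultdict(int) for m in modalities}
        for truth, preds in records:          # preds: {"camera": "pedestrian", ...}
            for m in modalities:
                pred = preds.get(m)
                if truth is not None and pred is None:
                    tallies[m]["false_negative"] += 1
                elif truth is None and pred is not None:
                    tallies[m]["false_positive"] += 1
                elif truth is not None and pred != truth:
                    tallies[m]["misclassification"] += 1
                else:
                    tallies[m]["correct"] += 1
        return tallies

    records = [("pedestrian", {"camera": "pedestrian", "lidar": None}),
               (None,         {"camera": "obstacle",   "lidar": None}),
               ("bicycle",    {"camera": "pedestrian", "lidar": "bicycle"})]
    print(per_modality_tallies(records, ["camera", "lidar"]))

Comparing these per-modality tallies against the fusion-level tallies is what reveals how often fusion is quietly papering over a systematic defect in one modality.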

In all the different aspects of perception, edge cases matter. There are going to be things you haven’t seen before and you can’t train on something you’ve never seen. 

So how well does your sensing system generalize? There are very likely to be systematic biases in training and validation data that it never occurred to anyone to notice. An example we’ve seen: if you collect data in cool weather, nobody is wearing shorts outdoors in the Northeast US. The system therefore implicitly learns that tan or brown things sticking out of the ground with green blobs on top are bushes or trees. But in the summer that might be someone in shorts wearing a green shirt.

You also have to think about unusual presentations of known objects. For example, a person carrying a bicycle is different from a bicycle carrying a person. Or maybe someone has fallen down in the roadway. Or maybe you see very strange vehicle configurations or weird paint jobs on vehicles.

The thing to look for in all these is clusters or correlations in perception failures -- things that don’t support a random independent failure assumption between modes. Because those are the places where you’re going to have trouble with sensor fusion sorting out the mess and compensating for failures.

A big challenge in perception is that the world is an essentially infinite supply of edge cases. It’s advisable to have a robust taxonomy of objects you expect to see in your operational design domain, especially to the degree that prediction, which we’ll discuss later on, requires accurate classification of objects or maybe even object subtypes.

While it’s useful to have a metric that deals with coverage of the taxonomy in training and testing, it’s just as important to have a metric for how well the taxonomy actually represents the operational design domain. Along those lines, a metric that might be interesting is how often you encounter something that’s not in the taxonomy, because if that’s happening every minute or every hour, that tells you your taxonomy probably needs more maturity before you deploy.

Because the world is open-ended, a metric is also useful for how often your perception is saying: "I’m not sure what that is." Now, it’s okay to handle "I’m not sure" by doing a safety shutdown or doing something safe. But knowing how often your perception is confused or has a hole is an important way to measure your perception maturity.
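
A minimal sketch of that kind of metric: count detections that fall outside the taxonomy or below a confidence threshold, normalized per hour of operation. The label names, the "unknown" tag, and the threshold value are all placeholders.

    # Minimal sketch: how often does perception report something outside the
    # taxonomy, or with low confidence, per hour of operation? Label names,
    # the "unknown" tag, and the threshold are hypothetical placeholders.

    def novelty_rate(detections, hours_of_operation, conf_threshold=0.5):
        """detections: list of (label, confidence) tuples from the perception log."""
        unknown = sum(1 for label, _ in detections if label == "unknown")
        low_conf = sum(1 for _, conf in detections if conf < conf_threshold)
        return {"unknown_per_hour": unknown / hours_of_operation,
                "low_confidence_per_hour": low_conf / hours_of_operation}

    log = [("pedestrian", 0.92), ("unknown", 0.30), ("vehicle", 0.44), ("unknown", 0.81)]
    print(novelty_rate(log, hours_of_operation=2.0))

If either rate is high -- every minute or every hour rather than every month -- that suggests the taxonomy and the perception stack need more maturity before deployment.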

Summing up, perception metrics, as we’ve discussed them, cover a broad swath from sensors through sensor fusion to object classification. In practice, these might be split out into different types of metrics, but they have to be covered somewhere. And during this discussion we’ve seen that they do interact a bit.

The most important outcome of these metrics is to get a feel for how well the system is able to build a model of the outside world, given that sensors are imperfect, operational conditions can compromise sensor capabilities, and the real world can present objects and environmental conditions that have never been seen before -- and, worse, might cause correlated sensor failures that compromise the ability of sensor fusion to come up with an accurate classification of specific types of objects. Don’t forget: there will always be something in the world you’ve never seen before and have never trained on, but your self-driving car is going to have to deal with it.

Wednesday, February 6, 2019

Edge Cases and Autonomous Vehicle Safety -- SSS 2019 Keynote

Here is my keynote talk for SSS 2019 in Bristol UK.

Edge Cases and Autonomous Vehicle Safety

Making self-driving cars safe will require a combination of techniques. ISO 26262 and the draft SOTIF standards will help with the vehicle control and trajectory stages of the autonomy pipeline. Planning might be made safe using a doer/checker architectural pattern that uses deterministic safety envelope enforcement of non-deterministic planning algorithms. Machine-learning-based perception validation will be more problematic. We discuss the issue of perception edge cases, including the potentially heavy-tail distribution of object types and brittleness to slight variations in images. Our Hologram tool injects modest amounts of noise to cause perception failures, identifying brittle aspects of perception algorithms. More importantly, in practice it is able to identify context-dependent perception failures (e.g., false negatives) in unlabeled video.



Monday, November 19, 2018

Webinar on Robustness Testing of Perception

Zachary Pezzementi and Trenton Tabor have done some great work on perception systems in general, and how image degradation affects things.  I'd previously posted information about their paper, but now there is a webinar available here:
    Webinar home page with details & links:  http://ieeeagra.com/events/webinar-november-4-2018/

This includes pointers to the slides, the recorded webinar, and the paper.

My robustness testing team at NREC worked with them on the perception stress testing portions, so here are quick links to the parts covering that work:


Friday, July 27, 2018

Putting image manipulations in context: robustness testing for safe perception

UPDATE 8/17 -- added presentation slides!

I'm very pleased to share a publication from our NREC autonomy validation team that explains how computationally cheap image perturbations and degradations can expose catastrophic perception brittleness issues.  You don't need adversarial attacks to foil machine learning-based perception -- straightforward image degradations such as blur or haze can cause problems too.

Our paper "Putting image manipulations in context: robustness testing for safe perception" will be presented at IEEE SSRR August 6-8.  Here's a submission preprint:

https://users.ece.cmu.edu/~koopman/pubs/pezzementi18_perception_robustness_testing.pdf

Abstract—We introduce a method to evaluate the robustness of perception systems to the wide variety of conditions that a deployed system will encounter. Using person detection as a sample safety-critical application, we evaluate the robustness of several state-of-the-art perception systems to a variety of common image perturbations and degradations. We introduce two novel image perturbations that use “contextual information” (in the form of stereo image data) to perform more physically-realistic simulation of haze and defocus effects. For both standard and contextual mutations, we show cases where performance drops catastrophically in response to barely perceptible changes. We also show how robustness to contextual mutators can be predicted without the associated contextual information in some cases.
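
As a rough illustration of the kind of degradation sweep the abstract describes (a sketch only, not the paper’s actual pipeline), one could apply increasing Gaussian blur with OpenCV and watch how a detector’s confidence changes; the detector interface here is a hypothetical placeholder.

    # Rough sketch of a blur-degradation sweep, in the spirit of the paper but
    # not its actual pipeline. `detector` is a hypothetical callable returning
    # a person-detection confidence in [0, 1] for an image.

    import cv2  # OpenCV

    def blur_sweep(image, detector, kernel_sizes=(1, 5, 9, 15, 21)):
        """Apply increasing Gaussian blur and record detector confidence."""
        results = []
        for k in kernel_sizes:
            blurred = image if k == 1 else cv2.GaussianBlur(image, (k, k), 0)
            results.append((k, detector(blurred)))
        return results

    # Usage (assumes a real image file and detector are available):
    # image = cv2.imread("frame.png")
    # for k, conf in blur_sweep(image, my_person_detector):
    #     print(f"kernel={k}  confidence={conf:.3f}")

A catastrophic drop in confidence at a barely perceptible blur level is exactly the kind of brittleness the paper reports.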

Fig. 6: Examples of images that show the largest change in detection performance for MS-CNN under moderate blur and haze. For all of them, the rate of FPs per image required to detect the person increases by three to five orders of magnitude. In each image, the green box shows the labeled location of the person. The blue and red boxes are the detection produced by the SUT before and after mutation respectively, and the white-on-blue text is the strength of that detection (ranged 0 to 1). Finally, the value in white-on-yellow text shows the average FP rate per image that a sensitivity threshold set at that value would yield, i.e., that is the required FP rate to still detect the person.




Alternate slide download link: https://users.ece.cmu.edu/~koopman/pubs/pezzementi18_perception_robustness_testing_slides.pdf

Citation:
Pezzementi, Z., Tabor, T., Yim, S., Chang, J., Drozd, B., Guttendorf, D., Wagner, M., & Koopman, P., "Putting image manipulations in context: robustness testing for safe perception," IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Aug. 2018.

Saturday, July 14, 2018

AVS 2018 Panel Session

It was great to have the opportunity to participate in a panel on autonomous vehicle validation and safety at AVS in San Francisco this past week.  Thanks especially to Steve Shladover for organizing such an excellent forum for discussion.

The discussion was the super-brief version. If you want to dig deeper, you can find much more complete slide decks attached to other blog posts.
The first question was to spend 5 minutes talking about the types of things we do for validation and safety.  Here are my slides from that very brief opening statement.



Thursday, June 28, 2018

Safety Validation and Edge Case Testing for Autonomous Vehicles (Slides)

Here is a slide deck that expands upon the idea that the heavy tail ceiling is a problem for AV validation. It also explains ways to augment image sensor inputs to improve robustness.



Safety Validation and Edge Case Testing for Autonomous Vehicles from Philip Koopman

(If slideshare is blocked for you, try this alternate download source)

Saturday, June 23, 2018

Heavy Tail Ceiling Problem for AV Testing

I enjoyed participating in the AV Benchmarking Panel hosted by Clemson ICAR last week.  Here are my slides and a preprint of my position paper on the Heavy Tail Ceiling problem for AV safety testing.

Abstract
Creating safe autonomous vehicles will require not only extensive training and testing against realistic operational scenarios, but also dealing with uncertainty. The real world can present many rare but dangerous events, suggesting that these systems will need to be robust when encountering novel, unforeseen situations. Generalizing from observed road data to hypothesize various classes of unusual situations will help. However, a heavy tail distribution of surprises from the real world could make it impossible to use a simplistic drive/fail/fix development process to achieve acceptable safety. Autonomous vehicles will need to be robust in handling novelty, and will additionally need a way to detect that they are encountering a surprise so that they can remain safe in the face of uncertainty.

Paper Preprint:
http://users.ece.cmu.edu/~koopman/pubs/koopman18_heavy_tail_ceiling.pdf

Presentation:




Friday, June 1, 2018

Can Mobileye Validate ‘True Redundancy’?

I'm quoted in this article by Junko Yoshida on Mobileye's approach to AV safety.

Can Mobileye Validate ‘True Redundancy’?
Intel/Mobileye’s robocars start running in Jerusalem
Junko Yoshida
5/22/2018 02:01 PM EDT

...
Issues include how to achieve “true redundancy” in perception, how to explain objectively what “safe” really means, and how to formulate “a consistent and complete set of safety rules” agreeable to the whole AV industry, according to Phil Koopman, professor at Carnegie Mellon University.
...

Read the story here:
  https://www.eetimes.com/document.asp?doc_id=1333308

Wednesday, May 16, 2018

AutoSens 2018 slides

I enjoyed presenting at AutoSens 2018 today.   The audience was very engaged and asked great questions.

Here are my slides. (If you've seen my other recent slide decks, there are probably not a lot of surprises, but I remixed things to emphasize perception validation.)