
Saturday, December 12, 2020

Surprise Metrics (Metrics Episode 12)

You can estimate how many unknown unknowns are left to deal with via a metric that measures the surprise arrival rate. Assuming you're really looking, an infrequent surprise arrival rate in recent testing predicts that surprises will remain infrequent in the near future as well.

Your first reaction to thinking about measuring unknown unknowns may be how in the world can you do that? Well, it turns out the software engineering community has been doing this for decades: they call it software reliability growth modeling. That area’s quite complex with a lot of history, but for our purposes, I’ll boil it down to the basics.

Software reliability growth modeling deals with the problem of knowing whether your software is reliable enough, or in other words, whether or not you’ve taken out enough bugs that it’s time to ship the software. All things being equal, if the same complete system test reveals 10 times more defects in the current release than in the previous release, it’s a good bet your new release is not as reliable as your old one.

On the other hand, suppose you’re running a weekly test/debug cycle on a single release: every week you test it, you remove some bugs, and then you test it some more the next week. At some point you’d hope that the number of bugs found each week gets lower, and eventually you’ll stop finding bugs. When the number of bugs you find per week is low enough, maybe zero, or maybe some small number, you decide it’s time to ship. Now that doesn’t mean your software is perfect! But it does mean there’s no point testing anymore if you’re consistently not finding bugs. Alternately, if you have a limited testing budget, you can look at the curve over time of the number of bugs you’re discovering each week and get some sort of estimate of how many bugs you would find if you continued testing for additional cycles.

At some point, you may decide that the number of bugs you’ll find and the amount of time it will take simply isn’t worth the expense. And especially for a system that is not life critical, you may decide it’s just time to ship. A dizzying array of mathematical models has been proposed over the years for the shape of the curve of how many more bugs are left in the system based on your historical rate of how often you find bugs. Each one of those models comes with significant assumptions and limits to applicability. 
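
To make the curve-fitting idea concrete, here is a minimal sketch (mine, for illustration; the weekly bug counts are made up) that fits one of the classic models, the Goel-Okumoto exponential model, and uses it to estimate how many latent defects remain and how many more another week of testing might find:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical bugs found in each weekly test cycle (illustrative numbers only).
weekly_bugs = np.array([30, 22, 15, 11, 7, 5, 3, 2])
weeks = np.arange(1, len(weekly_bugs) + 1)
cumulative = np.cumsum(weekly_bugs)

def goel_okumoto(t, a, b):
    # a = estimated total defects present; b = per-defect detection rate per week.
    return a * (1.0 - np.exp(-b * t))

(a_hat, b_hat), _ = curve_fit(goel_okumoto, weeks, cumulative,
                              p0=[cumulative[-1] * 1.5, 0.1])

remaining = a_hat - cumulative[-1]
next_week = goel_okumoto(weeks[-1] + 1, a_hat, b_hat) - cumulative[-1]
print(f"Estimated total defects: {a_hat:.0f}; still latent: {remaining:.0f}")
print(f"Expected finds in one more week of testing: {next_week:.1f}")
```

The specific model matters less here than the idea: fit a curve to your defect discovery history and read off what it implies about what's left.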

But the point is that people have been thinking about this for more than 40 years in terms of how to project how many more bugs are left in a system even though you haven’t found them. And there’s no point trying to reinvent all those approaches yourself.

Okay, so what does this have to do with self-driving car metrics?

Well, it’s really the same problem. In software testing, the bugs are the unknowns, because if you knew where the bugs were, you’d fix them. You’re trying to estimate how many unknowns there are, or how often they’re going to arrive during a testing process. In self-driving cars, the unknown unknowns are the things you haven’t trained on or haven’t thought about in the design. You’re doing road testing, simulation, and other types of validation to try to uncover these. But it ends up in the same place: you’re looking for latent defects or functionality gaps, and you’re trying to get an idea of how many more are left in the system that you haven’t found yet, or how many you can expect to find if you invest more resources in further testing.

For simplicity, let’s call the things in self-driving cars that you haven’t found yet surprises. 

The reason I put it this way is that there are two fundamentally different types of defects in these systems. One is that you built the system the wrong way: it’s an actual software bug. You knew what you were supposed to do, and you didn’t get there. Traditional software testing and traditional software quality will help with those, but a surprise isn’t that.

A surprise is a requirements gap, something in the environment you didn’t know was there, or something stemming from imperfect knowledge of the external world. But you can still treat surprises as a similar, although different, class from software defects and go at them the same way. One way to look at this is that a surprise is something you didn’t realize should be in your ODD, and therefore is a defect in the ODD description. Or you didn’t realize the surprise could kick your vehicle out of the ODD, which makes it a defect in the model of ODD violations that you have to detect. You’d expect that surprises that can lead to safety-critical failures are the ones that need the highest priority for remediation.

To create a metric for surprises, you need to track the number of surprises over time. You hope that over time the arrival rate of surprises gets lower. In other words, they happen less often, which reflects that your product has gotten more mature, all things being equal.

If the number of surprises gets higher, that could be a sign that your system has gotten worse at dealing with unknowns, or it could be a sign that your operational domain has changed and more novel things are happening than used to because of some change in the outside world. That requires you to update your ODD to reflect the new real-world situation. Either way, a higher arrival rate of surprises means you’re less mature or less reliable, and a lower rate means you’re probably doing better.
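
As a rough illustration (my sketch, not a prescribed method), the metric could be as simple as surprises per 1,000 operating hours over a rolling window, so the trend is visible in either direction:

```python
from collections import deque

class SurpriseRateTracker:
    """Rolling-window surprise arrival rate, in surprises per 1,000 operating hours."""

    def __init__(self, window_hours: float = 500.0):
        self.window_hours = window_hours
        self.events = deque()  # operating-hour timestamps of confirmed surprises

    def record_surprise(self, op_hours: float) -> None:
        self.events.append(op_hours)

    def rate_per_khr(self, now_hours: float) -> float:
        # Discard surprises that have aged out of the rolling window.
        while self.events and self.events[0] < now_hours - self.window_hours:
            self.events.popleft()
        window = min(self.window_hours, now_hours) or 1.0
        return 1000.0 * len(self.events) / window

tracker = SurpriseRateTracker(window_hours=500.0)
for t in (120.0, 180.0, 410.0):  # hypothetical surprise times, in operating hours
    tracker.record_surprise(t)
print(f"Surprise rate at 600 operating hours: {tracker.rate_per_khr(600.0):.1f} per 1,000 hr")
```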

This may sound a little bit like disengagements as a metric, but there’s a profound difference. That difference applies even if disengagements during on-road testing are one of the sources of data.

The idea is that measuring how often a disengagement happens, whether a safety driver takes over or the system gives up and says, “I don’t know what to do,” is a source of raw data. But the disengagements could be for many different reasons. What you really care about for surprises is only the disengagements that happened because of a defect in the ODD description or some other requirements gap.

Each incident that could be a surprise needs to be analyzed to see if it was a design defect, which isn’t really an unknown unknown. That’s just a mistake that needs to be fixed.

But some incidents will be true unknown unknown situations that require re-engineering or retraining your perception system or another remediation to handle something you didn’t realize until now was a requirement or operational condition that you need to deal with. Since even with a perfect design and perfect implementation, unknowns are going to continue to present risk, what you need to be tracking with a surprise metric is the arrival of actual surprises.
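
A minimal sketch of that triage step might look like the following; the field names and cause categories are assumptions for illustration, the key point being that only requirements-gap incidents feed the surprise metric:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    op_hours: float        # cumulative operating hours when the incident occurred
    cause: str             # assigned during incident analysis
    safety_relevant: bool

# Hypothetical analyzed incidents (illustrative only).
incidents = [
    Incident(12.5, "software_defect", False),      # known requirement, buggy code: not a surprise
    Incident(47.0, "odd_description_gap", True),   # unknown unknown: a true surprise
    Incident(81.3, "sensor_hardware_fault", False),
    Incident(96.8, "unmodeled_odd_exit", True),    # another true surprise
]

# Causes that count as surprises, i.e., requirements or ODD gaps.
SURPRISE_CAUSES = {"odd_description_gap", "requirements_gap", "unmodeled_odd_exit"}

surprises = [i for i in incidents if i.cause in SURPRISE_CAUSES]
print(f"{len(surprises)} of {len(incidents)} incidents were true surprises")
print("Safety-critical surprises:", sum(1 for i in surprises if i.safety_relevant))
```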

It should be obvious that you need to be looking for surprises to see them. That’s why things like monitoring near misses and investigating unexpected, but seemingly benign, behavior matter. Safety culture plays a role here: you have to be paying attention to surprises instead of dismissing them because they didn’t seem to do immediate harm. A deployment decision can use the surprise arrival rate metric to get an approximate answer to how much risk will be taken due to things missing from the system requirements and test plan. In other words, if you’re seeing surprises arrive every few minutes or every hour and you deploy, there’s every reason to believe that surprises will continue to happen about that often during your initial deployment.

If you haven’t seen a surprise in thousands or hundreds of thousands of hours of testing, then you can reasonably assume that surprises are unlikely to happen every hour once you deploy. (You can always get unlucky, so this is playing the odds to be sure.)
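
One hedged way to put a number on "no surprises in N hours" (my addition, not something from this episode) is the statistical rule of three: with zero events observed in N hours of operation, an approximate 95% upper confidence bound on the event rate is 3/N.

```python
def surprise_rate_upper_bound(n_hours: float, surprises_seen: int = 0) -> float:
    """Approximate 95% upper confidence bound on the surprise rate per hour."""
    if surprises_seen == 0:
        return 3.0 / n_hours  # rule-of-three approximation for zero observed events
    # Crude fallback for nonzero counts: point estimate plus roughly two Poisson sigma.
    return (surprises_seen + 2.0 * surprises_seen ** 0.5) / n_hours

for hours in (1_000, 100_000):
    bound = surprise_rate_upper_bound(hours)
    print(f"Zero surprises in {hours:,} hr -> rate below about {bound:.5f} per hr (95%)")
```

So zero surprises in 100,000 hours of testing only supports a claim on the order of "fewer than three surprises per 100,000 hours," which is the sense in which you're playing the odds.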

To deploy, you want to see the surprise arrival rate reduced to something acceptably low. You’ll also want to know the system has a good track record so that when a surprise does happen, it’s pretty good at recognizing something has gone wrong and doing something safe in response.

To be clear, in the real world, the arrival rate of surprises will probably never be zero, but you need to measure that it’s acceptably low so you can make a responsible deployment decision.

Monday, July 1, 2019

Autonomous Vehicle Fault and Failure Management

When you build an autonomous vehicle you can't count on a human driver to notice when something's wrong and "do the right thing." Here is a list of faults, system limitations, and fault responses AVs will need to get right. Did you think of these?



System Limitations:

Sometimes the issue isn't that something is broken, but rather simply that all vehicles have limitations. You have to know your system's limitations.
  • Current capabilities of sensors and actuators, which can depend upon the operational state space.
  • Detecting and handling a vehicle excursion outside the operational state space for which it was validated, including all aspects of {ODD, OEDR, Maneuver, Fault} tuples (see the sketch after this list).
  • Desired availability despite fault states, including any graceful degradation plan, and any limits placed upon the degraded operational state space.
  • Capability variation based on payload characteristics (e.g. passenger vehicle overloaded with cargo, uneven weight distribution, truck loaded with gravel, tanker half filled with liquid) and autonomous payload modification (e.g. trailer connect/disconnect).
  • Capability variation based on functional modes (e.g. pivot vs. Ackerman vs. crab steering, rear wheel steering, ABS or 4WD engaged/disengaged).
  • Capability variation based on ad-hoc teaming (e.g. V2V, V2I) and planned teaming (e.g. leader-follower or platooning vehicle pairing).
  • Incompleteness, incorrectness, corruption or unavailability of external information (V2V, V2I).
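
As a rough illustration of those {ODD, OEDR, Maneuver, Fault} tuples (a sketch of mine, not from the paper), each validated operating point can be recorded explicitly so a runtime monitor can flag an excursion outside the validated set:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPoint:
    odd: str       # e.g., "divided_highway_dry_daylight"
    oedr: str      # e.g., "vehicles_and_pedestrians"
    maneuver: str  # e.g., "lane_keeping"
    fault: str     # e.g., "no_fault" or "single_camera_degraded"

# Hypothetical set of operating points covered by validation.
validated = {
    OperatingPoint("divided_highway_dry_daylight", "vehicles_and_pedestrians",
                   "lane_keeping", "no_fault"),
    OperatingPoint("divided_highway_dry_daylight", "vehicles_and_pedestrians",
                   "lane_keeping", "single_camera_degraded"),
}

current = OperatingPoint("divided_highway_dry_daylight", "vehicles_and_pedestrians",
                         "lane_change", "no_fault")
if current not in validated:
    print("Excursion outside the validated operational state space: trigger degraded mode")
```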

System Faults:
  • Perception failure, including transient and permanent faults in classification and pose of objects.
  • Planning failures, including those leading to collision, unsafe trajectories (e.g., rollover risk), and dangerous paths (e.g., roadway departure).
  • Vehicle equipment operational faults (e.g., blown tire, engine stall, brake failure, steering failure, lighting system failure, transmission failure, uncommanded engine power, autonomy equipment failure, electrical system failure, vehicle diagnostic trouble codes).
  • Vehicle equipment maintenance faults (e.g., improper tire pressure, bald tires, misaligned wheels, empty sensor cleaning fluid reservoir, depleted fuel/battery).
  • Operational degradation of sensors and actuators including temporary (e.g., accumulation of mud, dirt, dust, heat, water, ice, salt spray, smashed insects) and permanent (e.g., manufacturing imperfections, scratches, scouring, aging, wear-out, blockage, impact damage).
  • Equipment damage including detecting and mitigating catastrophic loss (e.g., vehicle collisions, lighting strikes, roadway departure), minor losses (e.g., sensor knocked off, actuator failures), and temporary losses (e.g., misalignment due to bent support bracket, loss of calibration).
  • Incorrect, missing, stale, and inaccurate map data.
  • Training data incompleteness, incorrectness, known bias, or unknown bias.

Fault Responses:

Some of the faults and limitations fall within the purview of safety standards that apply to non-autonomous functions. However, a unified system-level view of fault detection and mitigation can be useful to ensure that no faults are left unaddressed. More importantly, to the degree that safety standards have taken credit for a human driver participating in fault mitigation, those fault mitigation obligations now fall upon the autonomy.
  • How the system behaves when encountering an exceptional operational state space, experiencing a fault, or reaching a system limitation.
  • Diagnostic gaps (e.g., latent faults, undetected faults, undetected faulty redundancy).
  • How the system re-integrates failed components, including recovery from transient faults and recovery from repaired permanent faults during operation and/or after maintenance.
  • Response and policies for prioritizing or otherwise determining actions in inherently risky or certain-loss situations.
  • Withstanding an attack (system security, compromised infrastructure, compromised other vehicles), and deterring inappropriate use (e.g., malicious commands, inappropriately dangerous cargo, dangerous passenger behavior).
  • How the system is updated to correct functional defects, security defects, safety defects, and addition of new or improved capabilities.
Is there anything we missed?

(This is an excerpt of Koopman, P. & Fratrik, F., "How many operational design domains, objects, and events?" SafeAI 2019, AAAI, Jan 27, 2019.)

Wednesday, June 26, 2019

Event Detection Considerations for Autonomous Vehicles (OEDR -- part 2)

Object and Event Detection and Recognition (OEDR) also involves making predictions about what might happen next. Is that pedestrian waiting for a bus, or about to walk out into the crosswalk right in front of my car? Did you think of all of these aspects?

https://www.post-gazette.com/local/neighborhood/2018/09/06/Pittsburgh-Left-question-jagoff-neighborhood-vote-hearken-tradition-culture/stories/201809030108
The infamous Pittsburgh Left. The first vehicle at a red light
will (sometimes) turn left when the light turns to green.

Some factors to consider when deciding what events and behaviors your system needs to predict include:

  • Determining expected behaviors of other objects, which might involve a probability distribution and is likely to be based on object classification (a sketch follows this list).
  • Normal or reasonably expected movements by objects in the environment.
  • Unexpected, incorrect, or exceptional movement of other vehicles, obstacles, people, or other objects in the environment.
  • Failure to move by other objects which are reasonably expected to move.
  • Operator interactions prior to, during, and post autonomy engagement including: supervising driver alertness monitoring, informing occupants, interaction with local or remote operator locations, mode selection and enablement, operator takeover, operator cancellation or redirect, operator status feedback, operator intervention latency, single operator supervision of multiple systems (multi-tasking), operator handoff, loss of operator ability to interact with vehicle.
  • Human interactions including: human commands (civilians performing traffic direction, police pull-over, passenger distress), normal human interactions (pedestrian crossing, passenger entry/egress), common human rule-breaking (crossing mid-block when far from an intersection, speeding, rubbernecking, use of parking chairs, distracted walking), abnormal human interactions (defiant jaywalking, attacks on vehicle, attempted carjacking), and humans who are not able to follow rules (children, impaired adults).
  • Non-human interactions including: animal interaction (flocks/herds, pets, dangerous wildlife, protected wildlife) and delivery robots.
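
As a toy illustration of the first bullet (my sketch, with made-up object classes and probabilities, not from the paper), behavior prediction can start from a prior distribution over expected behaviors conditioned on the object classification:

```python
# Prior probabilities of behaviors, conditioned on object classification.
# Classes and numbers are made up for illustration.
EXPECTED_BEHAVIORS = {
    "pedestrian_at_curb":    {"wait": 0.70, "enter_crosswalk": 0.25, "jaywalk": 0.05},
    "lead_vehicle_at_green": {"proceed": 0.90, "pittsburgh_left": 0.08, "stall": 0.02},
    "parked_car":            {"stay_parked": 0.95, "pull_out": 0.04, "door_opens": 0.01},
}

def behavior_distribution(object_class: str) -> dict:
    # Fall back to a maximally cautious "unknown" prior for unrecognized classes.
    return EXPECTED_BEHAVIORS.get(object_class, {"unknown": 1.0})

print(behavior_distribution("lead_vehicle_at_green"))
```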

Is there anything we missed?   (Previous post had the "objects" part of OEDR.)

(This is an excerpt of Koopman, P. & Fratrik, F., "How many operational design domains, objects, and events?" SafeAI 2019, AAAI, Jan 27, 2019.)

Wednesday, June 19, 2019

Object Detection Considerations for Autonomous Vehicles (OEDR -- part 1)

Object and Event Detection and Recognition (OEDR) involves having an autonomous vehicle detect and classify various types of objects so that it can plan a response. Detection is only the first step; you need to also be able to classify the obstacle to predict what might happen next. Pedestrians tend to walk into the roadway. Bushes, not so much. Did you think of all of these aspects?

https://imgur.com/gallery/VuMbwyR
Q: Why did the Mr. Rogers-saurus cross the street?
A: Trick question; he doesn't actually move because he is part of the Pittsburgh Dinosaur Parade.

Some factors to consider when deciding what objects your system needs to detect and recognize include:
  • Ability to detect and identify (e.g. classify) all relevant objects in the environment.
  • Processing and thresholding of sensor data to avoid both false positives (e.g., bouncing drink can, steel bridge joint, steel road construction cover plate, roadside sign, dust cloud, falling leaves) and false negatives (e.g., highly publicized partially automated vehicle collisions with stationary vehicles); a threshold trade-off sketch follows this list.
  • Characterizing the likely operational parameters of other road users (e.g., braking capability of leading and following vehicle, or whether another vehicle is behaving erratically enough that there is a likely control fault.)
  • Permanent obstacles such as structures, curbs, median dividers, guard rails, trees, bridges, tunnels, berms, ditches, roadside and overhanging signage.
  • Temporary obstacles such as transient keep-out zones, spills, floods, water-filled potholes, landslides, washed out bridges, overhanging vegetation, and downed power lines. (For practical purposes, “temporary” might mean obstacles not included on maps, with some vehicle having to be the first vehicle to detect an obstacle for placement even on a dynamic map.)
  • People, including cooperative people, uncooperative people, malicious behaviors, and people who are unaware of the operation of the autonomous system.
  • At-risk populations which might be unable, incapable, or exempt from following established rules and norms, such as children as well as injured, ability-impaired, or under-the-influence people.
  • Other cooperative and uncooperative human-driven and autonomous vehicles.
  • Other road users including special purpose vehicles, temporary structures, street dining, street festivals, parades, motorcades, funeral processions, farm equipment, construction crews, draft animals, farm animals, and endangered species.
  • Other non-stationary objects including uncontrolled moving objects, falling objects, wind-blown objects, in-traffic cargo spills, and low-flying aircraft.
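
To illustrate the false positive / false negative trade-off from the thresholding bullet above (a toy sketch with made-up detections, not from the paper):

```python
# Made-up scored detections: (confidence score, whether it was a real obstacle).
detections = [
    (0.95, True), (0.90, True), (0.80, False),  # e.g., a bouncing drink can scored 0.80
    (0.75, True), (0.60, False), (0.55, True),  # a low-confidence but real obstacle at 0.55
    (0.40, False), (0.30, False),
]

def fp_fn_at(threshold: float):
    fp = sum(1 for score, real in detections if score >= threshold and not real)
    fn = sum(1 for score, real in detections if score < threshold and real)
    return fp, fn

for t in (0.5, 0.7, 0.9):
    fp, fn = fp_fn_at(t)
    print(f"threshold={t:.1f}: false positives={fp}, false negatives={fn}")
```

Raising the threshold trades phantom stops for missed obstacles; the threshold alone cannot drive both error rates to zero.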
Is there anything we missed?   (Next post will have the "events" part of OEDR.)

(This is an excerpt of Koopman, P. & Fratrik, F., "How many operational design domains, objects, and events?" SafeAI 2019, AAAI, Jan 27, 2019.)

Wednesday, June 12, 2019

Operational Design Domain (ODD) for Autonomous Systems

The Operational Design Domain (ODD) is the set of environmental conditions that an autonomous system is designed to work in. Typically an ODD is thought of as some sort of geo-fencing plus obvious weather conditions (rain, snow, sun). But it's a lot more than that. Did you think of all of these?

Canton Avenue, the unofficial steepest street in the world, is less than 4 miles from downtown Pittsburgh.
Note cobblestone on the top half and the sidewalk stairs.  Cars slide (sometimes backwards) down the street in winter.
Geo-fencing is more complicated than drawing a circle around a city center.
[Wikipedia]
Characterizing the system operational environment should include at least the following:

  • Operational terrain, and associated location-dependent characteristics (e.g., slope, camber, curvature, banking, coefficient of friction, road roughness, air density) including immediate vehicle surroundings and projected vehicle path. It is important to note that dramatic changes can occur in relatively short distances.
  • Environmental and weather conditions such as surface temperature, air temperature, wind, visibility, precipitation, icing, lighting, glare, electromagnetic interference, clutter, vibration, and other types of sensor noise.
  • Operational infrastructure, such as availability and placement of operational surfacing, navigation aids (e.g., beacons, lane markings, augmented signage), traffic management devices (e.g., traffic lights, right of way signage, vehicle running lights), keep-out zones, special road use rules (e.g., time-dependent lane direction changes) and vehicle-to-infrastructure availability.
  • Rules of engagement and expectations for interaction with the environment and other aspects of the operational state space, including traffic laws, social norms, and customary signaling and negotiation procedures with other agents (both autonomous and human, including explicit signaling as well as implicit signaling via vehicle motion control).
  • Considerations for deployment to multiple regions/countries (e.g., blue stop signs, “right turn keep moving” stop sign modifiers, horizontal vs. vertical traffic signal orientation, side-of-road changes).
  • Communication modes, bandwidth, latency, stability, availability, reliability, including both machine-to-machine communications and human interaction.
  • Availability and freshness of infrastructure characterization data such as level of mapping detail and identification of temporary deviations from baseline data (e.g., construction zones, traffic jams, temporary traffic rules such as for hurricane evacuation).
  • Expected distributions of operational state space elements, including which elements are considered rare but in-scope (e.g. toll booths, police traffic stops), and which are considered outside the region of the state space in which the system is intended to operate.

Special attention should be paid to ODD aspects that are relevant to inherent equipment limitations, such as the minimum illumination required by cameras.
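
As a rough sketch of why an ODD needs to be machine-checkable rather than just a geo-fence (my illustration; the field names and limits are assumptions), an ODD description can be encoded so the vehicle can test whether current conditions are inside it, including equipment limits such as minimum camera illumination:

```python
from dataclasses import dataclass

@dataclass
class OddDescription:
    max_grade_percent: float = 15.0        # Canton Avenue is far steeper than this
    min_friction_coefficient: float = 0.4  # excludes icy cobblestones
    min_illumination_lux: float = 10.0     # camera hardware limitation
    max_wind_mps: float = 20.0
    v2i_required: bool = False

@dataclass
class CurrentConditions:
    grade_percent: float
    friction_coefficient: float
    illumination_lux: float
    wind_mps: float
    v2i_available: bool

def within_odd(odd: OddDescription, now: CurrentConditions) -> bool:
    return (now.grade_percent <= odd.max_grade_percent
            and now.friction_coefficient >= odd.min_friction_coefficient
            and now.illumination_lux >= odd.min_illumination_lux
            and now.wind_mps <= odd.max_wind_mps
            and (now.v2i_available or not odd.v2i_required))

# False: too steep and too slick for this (hypothetical) ODD.
print(within_odd(OddDescription(),
                 CurrentConditions(22.0, 0.2, 400.0, 5.0, False)))
```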

Are there any other aspects of ODD we missed?

(This is an excerpt of Koopman, P. & Fratrik, F., "How many operational design domains, objects, and events?" SafeAI 2019, AAAI, Jan 27, 2019.)