
Monday, December 4, 2023

Video: AV Safety Lessons To Be Learned from 2023 experiences

Here is a retrospective video of robotaxi lessons learned in 2023:

  • What happened to robotaxis in 2023 in San Francisco.
  • The Cruise crash and related events.
  • Lessons the industry needs to learn in order to take a more expansive view of safety/acceptability:
    • Not just statistically better than a human driver
    • Avoid negligent driving behavior
    • Avoid risk transfer to vulnerable populations
    • Fine-grain regulatory risk management
    • Conform to industry safety standards
    • Address ethical & equity concerns
    • Build sustainable trust.
Preprint with more detail about these lessons here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4634179

Archive.org alternate video source: https://archive.org/details/l-141-2023-12-av-safety



Saturday, September 30, 2023

Cruise publishes a baseline for their safety analysis

Summary: a Cruise study suggests their robotaxis perform better than young male ride hail drivers in leased vehicles. However, this result is only an estimate, because there is not yet enough data for a firm conclusion.


I am glad to see Cruise release a paper describing the methodology for computing the human driver baseline, which they had not previously done. The same goes for their "meaningful risk of injury" estimation method. And it is good to see a benchmark that is specific to a deployment rather than a US average.

Cruise has published a baseline study for their safety analysis here:
  • Blog post: https://getcruise.com/news/blog/2023/human-ridehail-crash-rate-benchmark/
  • Baseline study: https://deepblue.lib.umich.edu/handle/2027.42/178179
(Note that the baseline study is a white paper and not a peer-reviewed publication.)

The important take-aways from this in terms of their robotaxi safety analysis are:
  • The baseline is leased ride hail vehicles, not ordinary privately owned vehicles
  • The baseline drivers skew young and male (almost a third are below 30 years old)
  • A "meaningful risk of injury" threshold is defined, but it is somewhat arbitrary. They apparently do not have enough data to measure actual injury rates with statistical confidence. Given that we have seen two injuries to Cruise passengers so far (and at least one other injury crash), this is not a hypothetical concern.
It should be no surprise if young males driving leased vehicles as Uber/Lyft drivers have a higher crash rate than other vehicles. That is their baseline comparison. In fairness, if their business model is to put all the Uber and Lyft drivers out of work, perhaps that is a useful baseline. But it does not scale to the general driving population.

A conclusion that a Cruise robotaxi is safer (fewer injuries/fatalities) than an ordinary human driver is not quite supported by this study.
  • It is not an "average" human driver unless you only care about Uber/Lyft. If that is the concern, then OK, yes, that is a reasonable comparison baseline.
  • I did not see control for weather, time of day, congestion, and other conditions in the baseline. Road type and geo-fence were the aspects of ODD being used.
  • There is insufficient data to support a conclusion about injury rates, although that data will arrive fairly soon
  • We are a long way from insight into how fatality rates will turn out, since the study and Cruise have about 5 million miles and the San Francisco fatality rate is more like one per 100 million miles
  • The Cruise emphasis on "at fault" crashes is a distraction from crash outcomes, which must necessarily include the contribution of defensive driving behavior (avoiding not-at-fault crashes)
This study could support a Cruise statement that they are on track to being safe according to their selected criteria. But we still don't know how that will turn out. This is not the same as a claim of proven safety in terms of harm reduction.
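To make the statistics concrete, here is a minimal sketch of why roughly 5 million miles of exposure cannot support confident conclusions about injury or fatality rates. The mileage and event counts below are assumed round numbers for illustration, not figures or methodology from the Cruise study.

```python
# Illustrative only -- the mileage and event counts are assumed, and this
# is not the Cruise study methodology.  It shows why ~5 million miles
# cannot support confident claims about events that occur on the order of
# once per tens or hundreds of millions of miles.
from scipy.stats import chi2

def poisson_rate_ci(events, miles, confidence=0.95):
    """Exact two-sided Poisson confidence interval for events per mile."""
    alpha = 1.0 - confidence
    low = 0.0 if events == 0 else chi2.ppf(alpha / 2, 2 * events) / (2 * miles)
    high = chi2.ppf(1 - alpha / 2, 2 * (events + 1)) / (2 * miles)
    return low, high

miles = 5_000_000                 # approximate exposure discussed above (assumed)
for events in (0, 2, 3):          # hypothetical injury or fatality counts
    low, high = poisson_rate_ci(events, miles)
    per_event_low = 1 / high      # pessimistic: fewest miles per event consistent with the data
    per_event_high = "unbounded" if low == 0 else f"{1 / low:,.0f}"  # optimistic bound
    print(f"{events} events in {miles:,} mi: one event per "
          f"{per_event_low:,.0f} to {per_event_high} miles (95% CI)")
```

Under these assumed numbers, even zero observed fatalities in 5 million miles cannot rule out, at 95% confidence, a rate as bad as roughly one fatality per 1.4 million miles, which is nowhere near the roughly one per 100 million mile benchmark mentioned above.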

A different report does not build a model and estimate, but rather compares actual crash reports for robotaxis with crash reports for ride hail cars. It concludes that Cruise and Waymo had a crash rate 4 to 8 times that of average US drivers, but that their crash rate is comparable to ride hail vehicles in California.

https://www.researchgate.net/publication/373698259_Assessing_Readiness_of_Self-Driving_Vehicles

Friday, October 21, 2022

AV Safety with a Telepresent Driver or Remote Safety Operator

Some teams propose to test or even operate autonomous vehicles (AVs) with a telepresent driver or remote safety operator.  Making this safe is no easy thing.


Typically the remote human driver/supervisor is located at a remote operating base, although sometimes they will operate by closely following the AV test platform in a chase vehicle for cargo-only AV configurations.

Beyond the considerations for an in-vehicle safety driver, telepresent safety operators have to additionally contend with at least:

  • Restricted sensory information such as potentially limited visual coverage, lack of audio information, lack of road feel, and lack of other vehicle physical cues depending on the particular vehicle involved. This could cause problems with reacting to emergency vehicle sirens and reacting to physical vehicle damage that might be detected by a physically present driver such as a tire blow-out, unusual vibration, or strange vehicle noise. Lack of road feel might also degrade the driver’s ability to remotely drive the vehicle to perform a fallback operation in an extreme situation.

  • Delayed reaction time due to the round-trip transmission lag. In some situations, tenths or even hundredths of seconds of additional lag time in transmissions might make the difference between a crash and a recovery from a risky situation (see the sketch after this list).

  • The possibility of wireless connectivity loss. Radio frequency interference or loss of a cell tower might interrupt an otherwise reliable connection to the vehicle. Using two different cell phone providers can easily have redundancy limitations due to shared infrastructure such as cell phone towers,[1] cell tower machine rooms (for some providers), and disruption of shared backhaul fiber bundles.[2] A single infrastructure failure or localized interference can disrupt multiple different connectivity providers to one or multiple AVs.
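As a rough illustration of the lag issue, the sketch below shows how much additional distance a vehicle covers during the communication round trip alone, before any operator perception, reaction, or braking even begins. The speeds and lag values are assumed, representative numbers, not measurements of any particular remote operation system.

```python
# Illustrative arithmetic with assumed numbers -- not measurements from
# any particular remote operation system.  Extra distance a vehicle
# covers during network round-trip lag before a remote operator's
# command can even begin to take effect.

MPH_TO_M_PER_S = 0.44704

def lag_distance_m(speed_mph, round_trip_lag_s):
    """Distance traveled during the communications lag alone."""
    return speed_mph * MPH_TO_M_PER_S * round_trip_lag_s

for speed in (25, 45, 65):              # city street to highway speeds
    for lag in (0.1, 0.3, 0.5):         # plausible round-trip lags, seconds
        d = lag_distance_m(speed, lag)
        print(f"{speed} mph, {lag:.1f} s lag: {d:.1f} m of added travel")
```

That added travel comes on top of the operator's own perception-reaction time and the vehicle's braking distance, which is why even small lags matter at speed.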

Role of remote safety operator

Achieving acceptable safety with remote operators depends heavily on the duties of the remote operator. Having human operators provide high-level guidance with soft deadlines is one thing: “Vehicle: I think that flag holder at the construction site is telling me to go, but my confidence is too low; did I get that right? Operator: Yes, that is a correct interpretation.” However, depending on a person to take full control of remotely driving a vehicle in real time with a remote steering wheel at speed is quite another, and makes ensuring safety quite difficult.

A further challenge is the inexorable economic pressure to have remote operators monitoring more than one vehicle. Beyond being bad at boring automation supervision tasks, humans are also inefficient at multitasking. Expecting a human supervisor to notice when an AV is getting itself into a tricky situation is made harder by monitoring multiple vehicles. Additionally, there will inevitably be situations in which two vehicles under the control of a single supervisor need concurrent attention, even though the operator can only handle one AV crisis at a time.

There are additional legal issues to consider for remote operators. For example, how does an on-scene police officer give a field sobriety test to a remote operator after a crash if that operator is hundreds of miles away – possibly in a different country? These issues must be addressed to ensure that remote safety driver arrangements can be managed effectively.

Any claim of testing safety with a telepresent operator needs to address the issues of restricted sensory information, reaction time delays, and the inevitability of an eventual connectivity loss at the worst possible time. There are also hard questions to be asked about the accountability issues and law enforcement implications of such an approach.

Active vs. passive remote monitoring

A special remote monitoring concern is a safety argument that amounts to "the vehicle will notify a human operator when it needs help, so there is no need for any human remote operator to continuously monitor driving safety." Potentially the most difficult part of AV safety is ensuring that the AV actually knows when it is in trouble and needs help. Any argument that the AV will call for help is unpersuasive unless it squarely addresses the issue of how it will know it is in a situation it has not been trained to handle.

The source of this concern is that machine learning-based systems are notorious for false confidence. In other words, saying an ML-based system will ask for help when it needs it assumes that the most difficult part to get right – knowing the system is encountering an unknown unsafe condition –  is working perfectly during the testing being performed to see if, in fact, that most difficult part is working. That type of circular dependency is a problem for ensuring safety.

Even if such a system were completely reliable at asking for help when needed, the ability of a remote operator to acquire situational awareness and react to a crisis situation quickly is questionable. It is better for the AV to have a validated capability of performing Fallback operations entirely on its own rather than relying on a remote operator to jump in to save the day. Until autonomous Fallback capabilities are trustworthy, a human safety supervisor should continuously monitor and ensure safety.

Any remote operator road testing that claims the AV will inform the remote operator when attention is needed should be treated as an uncrewed road testing operation as discussed in book section 9.5.7. Any such AV should be fully capable of handling a Fallback operation completely on its own, and only ask a remote operator for help with recovery after the situation has been stabilized.


[1] For example, a cell tower fire video shows the collapse of a tower with three antenna rows, suggesting it was hosting three different providers. 
See: https://www.youtube.com/watch?v=0cT5cXuyiYY

[2] While it is difficult to get public admissions of the mistake of routing both a primary and backup critical telecom service in the same fiber bundle, it does happen.
See: 
https://www.postindependent.com/news/local/the-goof-behind-losing-911-service-in-mays-big-outage/

Friday, September 30, 2022

The NIEON Driver Benchmark and the Two-Sided Coin for AV safety

Waymo just published a study showing that they can outperform an unimpaired (NIEON) driver on a set of real-world crashes. That's promising, but it's only half the story. To be a safe driver, an Autonomous Vehicle (AV) must not only be good -- but also not be bad. Those are two different things. Let me explain...

Learning to drive includes learning how to drive well.
It also includes learning to avoid making avoidable mistakes. They're not the same thing.

A Sept. 29, 2022 blog posting by Waymo explains their next step in showing that their automated driver can perform better than people at avoiding crashes. The approach is to recreate crashes that happened in the real world and show that their automated driver would have avoided them. 

For this newest work they compare their driver to not just any human, but a Non-Impaired, with Eyes always ON the conflict (NIEON) driver. They correctly point out that no human is likely to be this good (you gotta blink, right?), but that's OK. The point is to have an ideal upper bound on human performance, which is a good idea.

Setting a goal that an AV should be at least as good as a NIEON driver to be acceptably safe makes sense. It cuts out all the debate about which human driver is being used as a reference (e.g., a tired 16-year-old vs. a well-rested 50-year-old professional limo driver will have very different driving risk exposure).

Unsurprisingly Waymo does better than a NIEON on scenarios they know they will be running, because computers can react more quickly than people when they correctly sense, detect, and model what is going on in the real world. Doing so is not trivial, to be sure. As Waymo describes, this involves not just good twitch reflexes, but also realizing when things are getting risky and slowing down to give other road users more space, etc.

That is the "be a good driver" part, and kudos to Waymo and other industry players who are making progress on this. This is the promise of AV safety in action.

But to be safe in the real world, that is not enough. You also have to not be a bad driver. Doing better at things humans mess up is half the picture. The other half is not making "stupid" mistakes that humans would likely avoid. 

AVs will surely have crashes that would be unlikely for a human driver to experience, or will sometimes fail to be cautious when they should, and so on. The infamous Waymo ride video showing construction zone cone confusion shows there is more work to be done in getting AVs to handle unusual situations. To its credit, in that scenario the Waymo vehicle did not experience a crash. But other uncrewed AVs are in fact having crashes due to misjudging traffic situations. And an automated test truck crashed into a safety barrier, despite having a human safety driver, due to an ill-considered initialization strategy that at least some would consider a rookie mistake (it repeats a type of mistake made in the Grand Challenge events, so the developers should have known better).

It is good to see Waymo showing their automated driver can do well when it correctly interprets the situation it is in. That shows it is a potentially capable driver. We still need to see that it also avoids making mistakes in novel situations that aren't part of the human driver crash dataset. If Waymo can show NIEON level safety in both the knowns and the unknowns, that will be an impressive achievement.

Friday, May 27, 2022

Cruise robotaxi struggles with real-world emergency vehicle situation

A Cruise robotaxi failed to yield effectively to a fire truck, delaying it.

Sub-headline: Garbage truck driver saves the day when Cruise autonomous vehicle proves itself to not be autonomous.

The referenced article explains the incident in detail, which involves a garbage truck blocking one lane and the Cruise vehicle pulling over into a position that did not leave enough room for the fire truck to pass. But it also argues that things like this should be excused because they are in the cause of developing life saving technology. I have to disagree. Real harm done now to real people should not be balanced against theoretical harm potentially saved in the future. Especially when there is no reason (other than business incentives) to be doing the harm today, and the deployment continues once it is obvious that near-term harm is likely.


I would say that if the car can't drive in the city like a human driver, it should have a human driver to take over when the car can't. Whatever remote system Cruise has is clearly inadequate, because we've seen three problems recently (the article mentions them; an important one is driving with headlights off at night and Cruise's reaction to that incident). The article attributes the root cause of this incident to Cruise not having worked through all interactions with emergency vehicles, which is a reasonable analysis as far as it goes. But why are they operating in a major city with half-baked emergency vehicle interaction?


This time no major harm was done because the garbage truck driver was able to move that vehicle instead. (Realize that garbage trucks stopped in-lane with no driver in the vehicle are business as usual, as a human driver would know.) The fire did in fact cause property damage and injuries, so things could have been a lot worse due to a longer delay if not for a quick-acting truck driver. (Who, by the way, has my admiration for acting quickly and decisively.) What if the garbage truck had been a disabled vehicle, or the truck had been in the middle of an operation during which it could not be moved? Then the fire truck would have been stuck. The article says the situation was complex, but driving in the real world is complex. I've personally been in situations where I needed to do something unconventional to let an emergency vehicle pass. A competent human driver understands the situation and acts. Yep, it's complex. If you can't handle complex, don't get on the road without a human backup driver.


The safety driver should not be removed until the vehicle can fully conform to safety-relevant traffic laws and achieve practical resolution of related situations such as this. "Testing" without a human safety driver when the vehicle isn't safe is not testing -- it's just plain irresponsible. There is no technical reason preventing Cruise from keeping a safety driver in their vehicles while they continue testing. Doing so wouldn't delay the technology development in the slightest -- if what they care about is safety. If the safety driver literally has to do nothing, you've still done your testing and your multi-billion dollar company is out a few bucks for safety driver wages. If safety is truly #1, why would you choose to cut costs and remove that safety driver when you know your system isn't 100% safe yet? Removing the safety driver is pure theater playing to public opinion and, one assumes, investors.


Cruise says that they apply a Safety Management System (SMS) "rigorously across the company." A main point of an SMS is to recognize operational hazards and alter operations in response to discovered hazards. In this case, it is clear that interaction with emergency vehicles requires more sophistication and presents a public safety hazard as currently implemented. Safety drivers should go back into the vehicles until they fix all such issues (not just this one particular interaction scenario) and their vehicle can really drive safely. Unless they simply decide that letting fire trucks on the way to a burning building pass is low priority for them.


Cruise is lucky the delayed fire truck arrival was not attributable to a death -- this time. This incident happened at 4 AM, which shows even in a nearly empty city you need to have a very sophisticated driver to avoid safety issues. At the very least they should halt no-human-driver operations until they can attest that they can handle every possible emergency vehicle interaction without causing more delay to the emergency vehicle than a proficient human driver, including situations in which a human driver would normally get creative to allow emergency vehicle progress. City officials wrote in a filing to the California Public Utilities Commission: "This incident slowed SFFD response to a fire that resulted in property damage and personal injuries,” and were concerned that frequent in-lane stops by Cruise vehicles could have a "negative impact" on fire department response times. Every safety related incident needs to be addressed. It is a golden opportunity to improve before you get unlucky. Cruise says they have a "rigorous" SMS, but I'm not seeing it. Will Cruise learn? Or will they keep rolling dice without safety drivers? Cruise shouldn't wait for something worse to happen before getting the message that they need to do better if they want to operate without a safety driver.


Wednesday, May 25, 2022

Tesla emergency door releases -- what a mess!

The Tesla manual door releases -- and lack thereof in some cases -- present unreasonable risk. What in the world were they thinking? Really bad human interface design. Cool design shouldn't come at the expense of life-critical safety. An article this week sums up the latest, but this has been going on for a long time.

Tesla fans seem to be saying that it is the driver's responsibility to know where the manual release latch is to escape in case of fire. Anyone who doesn't know has been ridiculed on-line (including after past fires) for not knowing where the manual release is hidden. Even if they died due to not successfully operating the control, or having to kick the window out, somehow they are the idiots and it is their fault, not Tesla's. (If someone you love has died or been injured in this way you have my sympathy, and it is the trolls who are idiots, not your loved one.)

On-line articles saying "here's how to operate the door release so you don't die in a Tesla fire" tell you there is a problem. This design is unreasonably risky for real world use. A "bet you didn't know -- so here is how to not die" article in social media means there is unacceptable risk. Example:  "Tesla Model Y fire incident: remember, there's a manual door release, here's how to use it in an emergency."

For the front doors, you have to lift up a not particularly obvious lever in front of the window switches that is easy to miss if you don't know it is there. Maybe you would find it if you have used it a few times -- but if you never realized it is there or you have rented/borrowed the car, good luck with that. I'd probably have trouble finding it even if I weren't suffocating from smoke from a battery fire. (Have you ever had to consult the owner manual to find your hood release? Imagine doing that to find out how to open the door when your car is literally on fire -- oh, but if it is an electronic manual and you've lost power, you can't do that on the center console, can you?)

And if you're a passenger and the driver is unconscious, you will have issues. Etc. Do you read all the safety instructions in the driver manual when you catch a quick ride as a passenger with a friend? Does your friend brief you on escape safety features before a 5 minute ride so you can exit in an emergency? Thought not.

But wait, there's more:

  • Model S rear door: "fold back the edge of the carpet" to find a pull cable
  • Model X falcon wing doors: "carefully remove the speaker grille from the door and pull the mechanical release cable..."
  • Model 3 rear door -- NOT EQUIPPED WITH MANUAL RELEASE (from manual: "Only the front doors are equipped with a manual door release")
So I guess the passengers in the back are kind of expendable. For many that will be the kids.

This is stunningly bad human interface design. It is entirely unreasonable to expect an ordinary car owner to know where a hidden/non-obvious emergency control is and activate it when they are trapped inside a burning car. Let alone passengers. Apparently without mandatory training and mandatory periodic refresher training.


Anyone who thinks it is reasonable to expect someone not trained in military/aviation/etc. to get this right probably has not served or been through that type of training. I have been through tons of training. Emergency drills that might give some nightmares (sealed inside a tank with broken pipes and told to plug the flooding is extra-special). And a few times the real thing. Not always with perfect execution, because there is compelling data showing humans suck at performing complicated, non-reflex-trained tasks under stress (and thus, more practice, more drills). After all that, I wouldn't want to risk my life on this hot mess of an egress system. 

Education and shaming won't prevent the next death from this unreasonable risk. 

I can't imagine why NHTSA wouldn't want to do a recall on this.

(To the extent this is true of other brands that is equally problematic. I don't have info on them.)

EDIT: a LinkedIn commenter pointed me to this story about a Corvette fatality related to a similar issue. From what I can tell from repair parts listings, Corvettes have a clearly marked egress pull on the floorboards. So not ideal, and possibly difficult to see if you are already in the seat. Worth reconsidering. But not literally hidden (or missing) as in Teslas, and certainly not in vehicles being sold as family cars. Perhaps now that Tesla has pushed the envelope past any reasonable limits it's time for standards on egress actuator visibility and accessibility.

This has been a known issue at least since a 2019 crash, summarized here: https://www.autoblog.com/2019/02/28/tesla-fiery-crash-closer-look-door-locks/     That fatality also had to do with door handles not popping up after a crash, so a rescuer was unable to open doors from the outside. It's time to pay attention before more people get trapped inside burning cars.

Saturday, November 27, 2021

Regulating Automated Vehicles with Human Drivers

Summary 

Regulatory oversight of automated vehicle operation on public roads is being gamed by the vehicle automation industry via two approaches: (1) promoting SAE J3016, which is explicitly not a safety standard, as the basis for safety regulation, and (2) using the "Level 2 loophole" to deploy autonomous test platforms while evading regulatory oversight. Regulators are coming to understand they need to do something to rein in the reckless driving and other safety issues that are putting their constituents at risk. We propose a regulatory approach to deal with this situation that draws a clear distinction between production "cruise control" style automation, which can be subject to conventional regulatory oversight, and test platforms, which should be regulated via use of SAE J3018 for testing operational safety.

Video showing Tesla FSD beta tester unsafely turning into oncoming traffic.

Do not use SAE J3016 in regulations

The SAE J3016 standards document has been promoted by the automotive industry for use in regulations, and in fact is the basis for regulations and policies at the US federal, state, and municipal levels. However, it is fundamentally unsuitable for the job. The issues with using SAE J3016 for regulations are many, so we provide a brief summary. (More detail can be found in Section V.A of our SSRN paper.)

SAE J3016 contains two different types of information. The first is a definition of terminology for automated vehicles, which is not really the problem, and in general could be suitable for regulatory use. The second is a definition of the infamous SAE Levels, which are highly problematic for at least the following reasons:

In practice, one of two big issues is the "Level 2 Loophole" in which a company might claim that the fact there is a safety driver makes its system Level 2, while insisting it does not intend to ever release that same automated driving feature as a higher level feature. This could be readily gamed by, for example, saying that Feature X, which is in fact a prototype fully automated driving system, is Level 2 at first. When the company feels that the prototype is fully mature, it could simply rebrand it Feature Y, slap on a Level 4 designation, and proceed to sell that feature without ever having applied for a Level 4 testing permit. We argue that this is essentially what Tesla is doing with its FSD "beta" program that has, among other things, yielded numerous social media videos of reckless driving despite claims that its elite "beta test" drivers are selected to be safe (e.g., failure to stop at stop signs, failure to stop at red traffic signals, driving in opposing direction traffic lanes).

The second big practical issue is that J3016 is not intended to be a safety standard, but is being used as such in regulations. This is making regulations more complex than they need to be, stretching the limits of the esoteric AV technical expertise required of regulatory agencies, especially municipalities. This is combined with the AV industry promoting a series of myths as part of a campaign to deter regulator effectiveness at protecting constituents from potential safety issues. The net result is that most regulations do not actually address the core safety issues related to on-road testing of this immature technology, in large part because regulators aren't really sure how to do that.

For road testing safety purposes, regulators should focus on both the operational concept and technology maturity of the vehicle being operated rather than on what might eventually be built as a product. In other words "design intent" isn't relevant to the risk being presented to road users when a test vehicle veers into opposing traffic. Avoiding crashes is the goal, not parsing overly-complex engineering taxonomies.

The solution is to reject SAE J3016 levels as a basis for regulation, instead favoring other industry standards that are actually intended to be relevant to safety. (Again, using J3016 for terminology is OK if the terms are relevant, but not the level definitions.)

Four Regulatory Categories

We propose four regulatory categories, with details to follow:

  • Non-automated vehicles: These are vehicles that DO NOT control steering on a sustained basis in any operational mode. They might have adaptive speed control, automatic emergency braking, and active safety features that temporarily control steering (e.g., an emergency swerve around obstacles capability, or bumping the steering wheel at lane boundaries to alert the driver).
  • Low automation vehicles: These are vehicles with automation that CAN control steering on a sustained basis (and, in practice, also vehicle speed). They are vehicles that ordinary drivers can operate safely and intuitively along the lines of a "cruise control" system that performs lane keeping in addition to speed control. In particular, they have these characteristics:
    • Can be driven with acceptable safety by an ordinary licensed driver with no special training beyond that required for a non-automated version of the same vehicle type.
    • Includes an effective driver monitoring system (DMS) to ensure adequate driver alertness despite inevitable automation complacency
    • Deters reasonably foreseeable misuse and abuse, especially with regard to DMS and its operational design domain (ODD)
    • Safety-relevant behavioral inadequacies consist of omissive behaviors rather than actively dangerous behavior
    • Safety-relevant issues are both intuitively understood and readily mitigated by driver intervention with conventional vehicle controls (steering wheel, brake pedal)
    • Automation is not capable of executing turns at intersections.
    • Field data monitoring indicates that vehicles remain at least as safe as non-automated vehicles that incorporate comparable active safety features over the vehicle life.
  • Highly automated vehicles: These are vehicles in which a human driver has no responsibility for safe driving. If any person inside the vehicle (or a tele-operator) can be blamed for a driving mishap, it is not a highly automated vehicle. Put simply, it's safe for anyone to go to sleep in these vehicles (including no requirement for a continuous remote safety driver) when in automated operation.
  • Automation test platforms: These are vehicles that have automated steering capability and have a person responsible for driving safety, but do not meet one or more of the listed requirements for low automation vehicles. In practical terms, such vehicles tend to be test platforms for capabilities that might someday be highly automated vehicles, but require a human test driver -- either in vehicle or remote -- for operational safety.
Non-automated vehicles can be subject to regulatory requirements for conventional vehicles, and correspond to SAE Levels 0 and 1. We discuss each of the remaining three categories in turn.

Low automation vehicles

The idea of the low automation vehicle is that it is a tame enough version of automation that any licensed driver should be able to handle it. Think of it as "cruise control" that works for both steering and speed. It keeps the car moving down the road, but is quite stupid about what is going on around the car. DMS and ODD enforcement along with mitigation of misuse and abuse are required for operational safety. Required driver training should be no more than trivial familiarization with controls that one would expect, for example, during a car rental transaction at an airport rental lot.

Safety relevant issues should be omissive (vehicle fails to do something) rather than errors of commission (vehicle does the wrong thing). For example, a vehicle might gradually drift out of lane while warning the driver it has lost lane lock, but it should not aggressively turn across a centerline into oncoming traffic. With very low capability automation this should be straightforward (although still technically challenging), because the vehicle isn't trying to do more than drive within its lane. As capabilities increase, this becomes more difficult to design, but dealing with that is up to the companies who want to increase capabilities. We draw a hard line at capability to execute turns at intersections, which is clearly an attempt at high automation capabilities, and is well beyond the spirit of a "cruise control" type system.

An important principle is that human drivers of a production low automation vehicle should not serve as Moral Crumple Zones by being asked to perform beyond civilian driver capabilities to compensate for system shortcomings and work-in-progress system defects. If human drivers are being blamed for failure to compensate for behavior that would be considered defective in a non-automated vehicle (such as attempting to turn across opposing traffic for no reason), this is a sign that the vehicle is really a test platform in disguise.

Low automation vehicles could be regulated by holding the vehicles accountable to the same regulations as non-automated vehicles as is done today for Level 2 vehicles. However, the regulatory change would be excluding some vehicles currently called "Level 2" from this category if they don't meet all the listed requirements. In other words, any vehicle not meeting all the listed requirements would require special regulatory handling.

Highly automated vehicles

These are highly automated vehicles for which the driver is not responsible for safety, generally corresponding to SAE Levels 4 and 5.  (As a practical matter, some vehicles that are advertised as Level 3 will end up in this category in practice if they do not hold the driver accountable for crashes when automation is engaged.)

Highly automated vehicles should be regulated by requiring conformance to industry safety standards such as ISO 26262, ISO 21448, and ANSI/UL 4600. This is an approach NHTSA has already proposed, so we recommend states and municipalities simply track that topic for the time being. 

There is a separate issue of how to regulate vehicle testing of these vehicles without a safety driver, but that issue is beyond the scope of this essay. 

Automation test platforms

These are vehicles that need skilled test drivers or remote safety monitoring to operate safely on public roads.  Operation of such vehicles should be done in accordance with SAE J3018, which covers safety driver skills and operational safety procedures, and should also be done under the oversight of a suitable Safety Management System (SMS) such as one based on the AVSC SMS guidelines.

Crashes while automation is turned on are generally attributed to a failure of the safety driver to cope with dangerous vehicle behavior, with dangerous behavior being an expectation for any test platform. (The point of a test platform is to see if there are any defects, which means defects must be expected to manifest during testing.)

In other words, with an automation test platform, safety responsibility primarily rests with the safety driver and test support team, not the automation. Test organizations should convince regulators that testing will overall present an acceptably low risk to other road users. Among other things, this will require that safety drivers be specifically trained to handle the risks of testing, which differ significantly from the risks of normal driving. For example, use of retail car customers who have had no special training per the requirements of SAE J3018 and who are conducting testing without the benefit of an appropriate SMS framework should be considered unreasonably risky.

This category covers all vehicles currently said to be Level 4/5 test vehicles, and also any other Level 2 or Level 3 vehicles that make demands on driver attention and reaction capabilities that are excessive for drivers without special tester training.

Regulating automated test platforms should concentrate on driver safety, per my State/Municipal DOT regulatory playbook. This includes specifically requiring compliance with practices in SAE J3018 and having an SMS that is at least as strong as the one discussed in the AVSC SMS guidelines.

Wrap-up

Automated vehicle regulatory data reporting at the municipal and state levels should concentrate on collecting mishap data to ensure that the driver+vehicle combination is acceptably safe. A high rate of crashes indicates that either the drivers aren't trained well enough, or the vehicle is defective. Which way you look at it depends on whether you're a state/municipal government or the US government, and whether the vehicle is a test platform or not. But the reality is that if drivers have trouble driving the vehicles, you need to do something to fix that situation before there is a severe injury or fatality on your watch.

The content in this essay is an informal summary of the content in Section V of: Widen, W. & Koopman, P., "Autonomous Vehicle Regulation and Trust" SSRN, Nov. 22, 2021. In case of doubt or ambiguity, that SSRN publication should be consulted for more comprehensive treatment.

-----
Philip Koopman is an associate professor at Carnegie Mellon University specializing in autonomous vehicle safety. He is on the voting committees for the industry standards mentioned. Regulators are welcome to contact him for support.


Sunday, June 20, 2021

A More Precise Definition for ANSI/UL 4600 Safety Performance Indicators (SPIs)

Safety Performance Indicators (SPIs) are defined by chapter 16 of ANSI/UL 4600 in the context of autonomous vehicles as performance metrics that are specifically related to safety (4600 at 16.1.1.6.1).


This is a fairly general definition that is intended to encompass both leading metrics (e.g., number of failed detections of pedestrians for a single sensor channel) and lagging metrics (e.g., number of collisions in real world operation).  

However, it is so general that there can be a tendency to try to call metrics that are not related to safety SPIs when, more properly, they are really KPIs. As an example, ride quality smoothness when cornering is a Key Performance Indicator (KPI) that is highly desirable for passenger comfort. But it might have little or nothing to do with the crash rate for a particular vehicle. (It might be correlated -- sloppy control might be associated with crashes, but it might not be.)

So we've come up with a more precise definition of SPI (with special thanks to Dr. Aaron Kane for long discussions and crystallizing the concept).

An SPI is a metric supported by evidence that uses a
threshold comparison to condition a claim in a safety case.

Let's break that down:

  • SPI - Safety Performance Indicator - a {metric, threshold} pair that measures some aspect of safety in an autonomous vehicle.
  • Metric - a value, typically related to one or more of product performance, design quality, process quality, or adherence to operational procedures. Often metrics are related to time (e.g., incidents per million km, maintenance mistakes per thousand repairs) but can also be related to particular versions (e.g., significant defects per thousand lines of code; unit test coverage; peer review effectiveness)
  • Evidence - the metric values are derived from measurement rather than theoretical calculations or other non-measurement sources
  • Threshold - a metric on its own is not an SPI because context within the safety case matters. For example, a raw count of false negative detections on a sensor is not an SPI because it misses the part about how good the sensor has to be to provide acceptable safety when fused with other sensor data in a particular vehicle's operational context. ("We have 1% false negatives on camera #1. Is that good enough? Well, it depends...") There is no limit to the complexity of the threshold, which might be, for example, whether a very complicated state space is either inside or outside a safety envelope. But in the end the answer is some sort of comparison between the metric and the threshold that results in "true" or "false." (Analogous multi-valued operations and outputs are OK if you are using multi-valued logic in your safety case.) We call the state of an SPI output being "false" an SPI Violation.
  • Condition a claim - each SPI is associated with a claim in a safety case. If the SPI is true the claim is supported by the SPI. If the SPI is false then the associated claim has been falsified. (SPIs based on time series data could be true for a long time before going false, so this is a time and state dependent outcome in many cases.)
  • Safety case - Per ANSI/UL 4600 a safety case is "a structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment." In the context of that standard, anything that is related to safety is in the safety case. If it's not in the safety case, it is by definition not related to safety.
A direct conclusion of the above is that if a metric does not have a threshold, or does not condition a claim in a safety case, then it can't be an SPI.
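To make the definition concrete, here is a minimal sketch of an SPI as described above: a metric plus a threshold comparison that conditions a named claim in a safety case. The structure, names, and numbers are illustrative assumptions, not anything specified by ANSI/UL 4600.

```python
# A minimal sketch of the SPI definition above.  The class structure,
# claim names, and numbers are illustrative assumptions, not anything
# prescribed by ANSI/UL 4600.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SPI:
    claim_id: str        # which safety case claim this SPI conditions
    description: str
    threshold: float
    comparison: Callable[[float, float], bool]  # (metric, threshold) -> True/False

    def evaluate(self, measured_value: float) -> bool:
        """True means the claim remains supported; False is an SPI violation."""
        return self.comparison(measured_value, self.threshold)

# Example: a leading metric conditioning a hypothetical pedestrian-detection claim.
ped_miss_rate_spi = SPI(
    claim_id="G2.3-pedestrian-detection-adequate",   # hypothetical claim identifier
    description="Single-channel pedestrian false negative rate",
    threshold=0.01,                                  # made-up acceptability threshold
    comparison=lambda metric, limit: metric <= limit,
)

measured_false_negative_rate = 0.015                 # hypothetical field measurement
if not ped_miss_rate_spi.evaluate(measured_false_negative_rate):
    print("SPI violation: revisit safety case claim", ped_miss_rate_spi.claim_id)
```

The point the sketch captures is that the comparison output is interpreted against a specific claim, so a "false" result is a signal to revisit the safety case rather than just another number on a dashboard.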

Less formally, the point of an SPI is that you've built up a safety case, but there is always the chance you missed something in the safety case argument (forgot a relevant reason why a claim might not be true), or made an assumption that isn't as true as you thought it was in the real world, or otherwise have some sort of a problem with your safety case. An SPI violation amounts to: "Well, you thought you had everything covered and this thing (claim) was always true. And yet, here we are with the claim being false when we encountered a particular unforeseen situation in validation or real world operation. Better update your safety argument!"

In other words, an SPI is a measurement you take to make sure that if your safety case is invalidated you'll detect it and notice that your safety case has a problem so that you can fix it.

An important point of all this is that not every metric is an SPI. SPI is a very specific term. The rest are all KPIs.

KPIs can be very useful, for example in measuring progress toward a functional system. But they are not SPIs unless they meet the definition given above.

NOTES:

The ideas in this posting are due in large part to efforts of Dr. Aaron Kane. He should be cited as a co-author of this work.

(1) Aviation uses SPI for metrics related to the operational phase and SMS activities. The definition given here is rooted in ANSI/UL 4600 and is a superset of the aviation use, including technical metrics and design cycle metrics as well as operational metrics.

(2) In this formulation an SPI is not quite the same as a safety monitor. It might well be that some SPI violations also happen to trigger a vehicle system shutdown. But for many SPI violations there might not be anything actionable at the individual vehicle level. Indeed, some SPI violations might only be detectable at the fleet level in retrospect. For example, if you have a budget of 1 incident per 100 million km of a particular type, an individual vehicle having such an incident does not necessarily mean the safety case has been invalidated. Rather, you need to look across the fleet data history to see if such an incident just happens to be that budgeted one in 100 million based on operational exposure, or is part of a trend of too many such incidents.
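As a rough illustration of that fleet-level check, the sketch below asks how surprising an observed incident count would be if the fleet truly met a budget of 1 incident per 100 million km. The fleet exposure and incident counts are assumed, illustrative numbers.

```python
# Illustrative sketch of note (2): deciding at the fleet level whether
# observed incidents are consistent with a budgeted rate of 1 incident
# per 100 million km.  Exposure and counts are assumed for illustration.
from scipy.stats import poisson

BUDGET_RATE = 1 / 100_000_000        # incidents per km allowed by the safety case

def budget_exceedance_p(observed_incidents, fleet_km):
    """P(seeing this many or more incidents if the fleet truly meets the budget)."""
    expected = BUDGET_RATE * fleet_km
    return poisson.sf(observed_incidents - 1, expected)   # P(X >= observed)

fleet_km = 30_000_000                # hypothetical accumulated fleet exposure
for observed in (1, 2, 3):
    p = budget_exceedance_p(observed, fleet_km)
    print(f"{observed} incident(s) in {fleet_km:,} km: p = {p:.3f} "
          "(small p suggests the budgeted claim may be invalidated)")
```

Under these assumed numbers, a single incident is quite plausible within the budget, while three incidents would be strong evidence of a trend that invalidates the budgeted claim.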

(3) We pronounce "SPI" as "S-P-I" rather than "spy" after a very confusing conversation in which we realized we needed to explain to a government official that we were not actually proposing that the CIA become involved with validating autonomous vehicle safety.


Sunday, December 8, 2019

The Lesson Learned from the Tempe Arizona Autonomous Driving System Testing Fatality NTSB Report




Now that the press flurry over the NTSB's report on the Autonomous Driving System (ADS) fatality in Tempe has subsided, it's important to reflect on the lessons to be learned. Hats off to the NTSB for absolutely nailing this. Cheers to the Press who got the messaging right. But not everyone did. The goal of this essay is to help focus on the right lessons to learn, clarify publicly stated misconceptions, and emphasize the most important take-aways.

I encourage everyone in the AV industry to watch the first 5 and a half minutes of the NTSB board meeting video ( Youtube: Link // NTSB Link). Safety leadership should watch the whole thing. Probably twice. Then present a summary at your company's lunch & learn.

Pay particular attention to this part from Chairman Sumwalt: "If your company tests automated driving systems on public roads, this crash -- it was about you.  If you use roads where automated driving systems were being tested, this crash -- it was about you."

I live in Pittsburgh and these public road tests happen near my place of work and my home. I take the lessons from this crash personally. In principle, every time I cross a street I'm potentially placed at risk by any company that might be cutting corners on safety. (I hope that's none. All the companies testing here have voluntarily submitted compliance reports for the well-drafted PennDOT testing guidelines. But not every state has those, and those guidelines were developed largely in response to the fatality we’re discussing.)

I also have long time friends who have invested their careers in this technology. They have brought a vibrant and promising industry to Pittsburgh and other cities.  Negative publicity resulting from a major mishap can threaten the jobs of those employed by those companies.

So it is essential for all of us to get safety right.

The first step: for anyone in charge of testing who doesn't know what a Safety Management System (SMS) is: (A) Watch that NTSB hearing intro. (B) Pause testing on public roads until your company makes a good start down that path. (Again, the PennDOT guidelines are a reasonable first step accepted by a number of companies. LINK)  You’ll sleep better having dramatically improved your company’s safety culture before anyone gets hurt unnecessarily.


Clearing up some misconceptions
I’ve seen some articles and commentary that missed the point of all of this. Large segments of coverage emphasized technical shortcomings of the system -- that's not the point. Other coverage highlighted test driver distraction -- that's not the point either. Some blame the darkness, which is simply untrue -- NTSB estimates the pedestrian was visible at the full line of sight range of 637 feet. The fatal mishap involved technical shortcomings, and the test driver was not paying adequate attention. Both contributed to the mishap, and both were bad things.

But the lesson to learn is that solid safety culture is without a doubt necessary to prevent avoidable fatalities like these. That is the Point.

To make the most of this teachable moment let's break things down further. These discussions are not really about the particular test platform that was involved. The NTSB report gave that company credit for significant improvement. Rather, the objective is to make sure everyone is focused on ensuring they have learned the most important lesson so we don’t suffer another avoidable ADS testing fatality.

A self-driving car killed someone - NOT THE POINT
This was not a self-driving car. It was a test platform for Automated Driving System (ADS) technology. The difference is night and day.  Any argument that this vehicle was safe to operate on public roads hinged on a human driver not only taking complete responsibility for operational safety, but also being able to intervene when the test vehicle inevitably made a mistake. It's not a fully automated self-driving car if a driver is required to hover with hands above the steering wheel and foot above the brake pedal the entire time the vehicle is operating.

It's a test vehicle. The correct statement is: a test vehicle for developing ADS technology killed someone.

The pedestrian was initially said to jump out of the dark in front of the car - NOT THE POINT
I still hear this sometimes based on the initial video clip that was released immediately after the mishap. The pedestrian walked across almost 4 lanes of road in view of the test vehicle before being struck, with sufficient lighting to have been seen by the driver. The test vehicle detected the pedestrian 5.6 seconds before the crash. That was plenty of time to avoid the crash, and plenty of time to track the pedestrian crossing the street to predict that a crash would occur. Attempting to claim that this crash was unavoidable is incorrect, and won't prevent the next ADS testing fatality.
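To see why 5.6 seconds is plenty of time, here is a back-of-envelope sketch. The speed and deceleration are assumed, illustrative values, not figures from the NTSB report.

```python
# Rough arithmetic only -- the speed and deceleration are assumed,
# illustrative values, not figures from the NTSB report.
speed_mph = 40
decel_g = 0.7                      # hard but achievable braking on dry pavement
detect_to_impact_s = 5.6           # detection lead time cited above

speed_mps = speed_mph * 0.44704
distance_available = speed_mps * detect_to_impact_s          # ~100 m
braking_distance = speed_mps ** 2 / (2 * decel_g * 9.81)     # ~23 m

print(f"Distance covered in {detect_to_impact_s} s: {distance_available:.0f} m")
print(f"Distance needed to brake to a stop:        {braking_distance:.0f} m")
```

Under these assumptions the vehicle had several times the distance needed to stop, before even considering a simple lane change or slowdown.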

It's the pedestrian's fault for jaywalking - NOT THE POINT
Jaywalking is what people do when it is 125 yards to the nearest intersection and there is a paved walkway on the median. Even if there is a sign saying not to cross.  Tearing up the paved walkway might help a little on this particular stretch of road, but that's not going to prevent jaywalking as a potential cause of the next ADS testing fatality.

Victim's apparent drug use - NOT THE POINT
It was unlikely that the victim was a fully functional, alert pedestrian. But much of the population isn't in this category for other reasons. Children, distracted walkers, and others with less than perfect capabilities and attention cross the street every day, and we expect drivers to do their best to avoid hitting them.

There is no indication that the victim’s medical condition substantively caused the fatality. (We're back to the fact that the pedestrian did not jump in front of the car.) It would be unreasonable to insist that the public has the responsibility to be fully alert and ready to jump out of the way of an errant ADS test platform at all times they are outside their homes.

Tracking and classification failure - NOT THE POINT
The ADS system on the test vehicle suffered some technical issues that prevented predicting where the pedestrian would be when the test vehicle got there, or even recognizing the object it was sensing was a pedestrian walking a bicycle. However, the point of operating the test vehicle was to find and fix defects.

Defects were expected, and should be expected on other ADS test vehicles. That's why there is a human safety driver. Forbidding public road testing of imperfect ADS systems basically outlaws road testing at this stage. Blaming the technology won't prevent the next ADS testing fatality, but it could hurt the industry for no reason.

It's the technology's fault for ignoring jaywalkers - NOT THE POINT
This idea has been circulating, but apparently it isn't quite true. Jaywalkers aren't ignored; rather, according to the information presented by the NTSB, a pedestrian is not initially expected to cross the street. Once the pedestrian moves for a while, a track is built up that could indicate street crossing, but until then movement into the street is considered unexpected if the pedestrian is not at a designated crossing location. A deployment-ready ADS could potentially use a more sophisticated approach to predict when a pedestrian would enter the roadway.

Regardless of implementation, this did not contribute to the fatality because the system never actually classified the victim as a pedestrian. Again, improving this or other ADS technical features won't prevent the next ADS testing fatality. That’s because testing safety is about the safety driver, not which ADS prototype functions happen to be active on any particular test run.

ADS emergency braking behavior - NOT THE POINT
The ADS emergency braking function had behaviors that could hinder its ability to provide backup support to the safety driver. Perhaps another design could have done better for this particular mishap. However, it wasn't the job of the ADS emergency braking to avoid hitting a pedestrian. That was the safety driver's job. Improving ADS emergency braking capabilities might reduce the probability of an ADS testing fatality, but won't entirely prevent the next fatality from happening sooner than it should.

Native emergency braking disabled - NOT THE POINT
It looks bad to have disabled the built-in emergency braking system on the passenger vehicle used as the test platform. The purpose of such systems is to help out after the driver makes a mistake. In this case there is a good, but not perfect, chance it would have helped. But as with the ADS emergency braking function, this simply improves the odds. Any safety expert is going to say your odds are better with both belt and suspenders, but enabling this function alone won't entirely prevent the next ADS testing fatality from happening before it should.

Inattentive safety driver - NOT THE POINT
There is no doubt that an inattentive safety driver is dangerous when supervising an ADS test vehicle. And yet, driver complacency is the expected outcome of asking a human to supervise an automated system that works most of the time. That’s why it’s important to ensure that driver monitoring is done continually and used to provide feedback. (In this case a form of driver monitoring equipment was installed, but data was apparently not used in a way that assured effective driver alertness.)

While enhanced training and stringent driver selection can help, effective analysis and action taken upon monitoring data is required to ensure that drivers are actually paying attention in practice. Simply firing this driver without changing anything else won't prevent the next ADS testing fatality from happening to some other driver who has slipped into bad operational habits.

A fatality is regrettable, but human drivers killed about 100 people that same day with minimal news attention - NOT THE POINT
Some commentators point out the ratio of fatalities caused by test vehicles vs. general automotive fatality rates. They then generally argue that a few deaths in comparison to the ongoing carnage of regular cars is a necessary and appropriate price to pay for progress. However, this argument is not statistically valid.

Consider a reasonable goal that ADS testing (with highly qualified, alert drivers presumed) should be no more dangerous than the risk presented by normal cars. For normal US cars that's ballpark 500 million road miles per pedestrian fatality. This includes mishaps caused by drunk, distracted, and speeding drivers. Due to the far smaller number of miles being driven by current test platform fleet sizes, the "budget" for fatal accidents during the ADS road testing phase should, at this early stage, still be zero.
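To put numbers on that, here is the back-of-envelope arithmetic. The cumulative test fleet mileage is an assumed round number for illustration, not a reported figure.

```python
# Back-of-envelope arithmetic for the statistical point above.  The test
# fleet mileage is an assumed, illustrative number, not a reported figure.
human_rate = 1 / 500_000_000     # ~1 pedestrian fatality per 500M miles (ballpark above)
test_fleet_miles = 20_000_000    # hypothetical cumulative ADS test mileage

expected_fatalities = human_rate * test_fleet_miles
print(f"Expected pedestrian fatalities at human-driver rates: {expected_fatalities:.3f}")
# Roughly 0.04: at this exposure, even a single testing fatality is far more
# likely to indicate a broken testing safety process than plain bad luck.
```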

The fatality somehow “proves” that self-driving car technology isn't viable - NOT THE POINT
Some have tried to draw conclusions about the viability of ADS technology from the fact that there was a testing fatality. However, the issues with ADS technical performance only prove what we already knew. The technology is still maturing, and a human needs to intervene to keep things safe. This crash wasn't about the maturity of the technology; it was about whether the ADS public road testing itself was safe.

Concentrating on technology maturity (for example, via disclosing disengagement rates) serves to focus attention on a long term future of system performance without a safety driver. But the long term isn’t what’s at issue.

The more pressing issue is ensuring that the road testing going on right now is sufficiently safe. At worst, continued use of disengagement rates as the primary metric of ADS performance could hurt safety rather than help. This is because disengagement metrics, if gamed, could incentivize safety drivers to take chances by avoiding disengagements in uncertain situations to make the numbers look better. (Some companies no doubt have strategies to mitigate this risk. But those are probably the companies with an SMS, which is back to the point that matters.)

THE POINT: The safety culture was broken
Safety culture issues were the enabler for this particular crash. Given the limited number of miles that can be accumulated by any current test fleet, we should see no fatalities occur during ADS testing. (Perhaps a truly unavoidable fatality will occur. This is possible, but given the numbers it is unlikely if ADS testing is reasonably safe. So our goal should be set to zero.) Safety culture is critical to ensure this.

The NTSB rightly pushes hard for a safety management system (SMS). But be careful to note that they simply say that this is a part of safety culture, not all of it. Safety culture means, among other things, a company taking responsibility for ensuring that its safety drivers are actually safe despite the considerable difficulty of accomplishing that. Human safety drivers will make mistakes, but a strong safety culture accounts for such mistakes in ensuring overall safety.

It is important to note that the urgent point here is not regulating self-driving car safety, but rather achieving safe ADS road testing. They are (almost) two entirely different things. Testing safety is about whether the company can consistently put an alert, able-to-react safety driver on the road. On the other hand, ADS safety is about the technology. We need to get to the technology safety part over time, but ADS road testing is the main risk to manage right now.

Perhaps dealing with ADS safety would be easier if the discussions of testing safety and deployment safety were more cleanly separated.

THE TAKEAWAYS:

Chairman Sumwalt summed it up nicely in the intro. (You did watch that 5-and-a-half minute intro, right?)  But to make sure it hits home, this is my take:

One company's crash is every company's crash.  You'll note I didn't name the company involved, because really that's irrelevant to preventing there from being a next fatality and the potential damage it could do to the industry’s reputation.

The bigger point is every company can and should institute good safety culture before further fatalities take place if they have not done so already. The NTSB credited the company at issue with significant change in the right direction.  But it only takes one company who hasn’t gotten the message to be a problem for everyone. We can reasonably expect fatalities involving ADS technology in the future even if these systems are many times safer than human drivers. But there simply aren’t that many vehicles on the road yet for a truly unavoidable mishap to be likely to occur. It’s far too early.

If your company is testing (or plans to test) autonomous vehicles, get a Safety Management System in place before you do public road testing. At least conform to the details in the PennDOT testing guidelines, even if you’re not testing in Pennsylvania. If you are already testing on public roads without an SMS, you should stand down until you get one in place.

Once you have an SMS, consider it a down-payment on a continuing safety culture journey.



Prof. Philip Koopman, Carnegie Mellon University

Author Note: The author and his company work with a variety of customers helping to improve safety. He has been involved with self-driving car safety since the late 1990s. These opinions are his own, and this piece was not sponsored.