Safe Autonomy: Autonomous Vehicle Testing Guidance for State & City DOTs

Once in a while I'm contacted by a city or state Department of Transportation (DOT) to provide advice on safety for "self-driving" car testing. (Generally that means public road testing of SAE Level 3-5 vehicles that are intended for eventual deployment as automated or autonomous capable vehicles.)

The good news is that industry standards are maturing. Rather than having to create their own guidelines and requirements as they have in the past, DOTs now have the option of primarily relying upon having AV testers conform to industry-created guidelines and consensus standards.

And ... in September 2021 NYC DOT blazed a trail by requiring the self-driving car industry to conform to their own industry consensus testing safety standard (J3018). Kudos to NYC DOT! (Check it out here (link); more on that in the details below.)

The most important thing to keep in mind is that testing safety is not about the automation technology -- it is about the ability of the human safety driver to monitor and intervene when needed to ensure safety. The technology is going to fail, because the point of testing is to find surprise failures. If a failure of technology causes a fatality, then most likely the testing wasn't being done safely enough. It is essential that human safety drivers be skilled and attentive enough to prevent loss events when such failures inevitably occur.

The short version is that DOTs should:

Follow the AAMVA road testing guidelines plus some additional key practices.
Define how safe testing should be when considering the safety driver + vehicle system as a whole.
Ask testers for conformance to SAE J3018 for road testing.
Ask testers to have a credible Safety Management System (SMS) approach, including a testing plan.
Ask testers to provide metrics that show that their testing is safe (not just a promise up front, but also periodic testing safety metrics as they operate). Don't get distracted by measuring the maturity of the technology they are testing -- it's all about the safety driver's ability to intervene when something goes wrong.
If testing takes place with a safety driver in a chase vehicle or remote, ask for conformance to safety standards for the mechanisms required to ensure safety (e.g., per ISO 26262), while otherwise conforming to SAE J3018 for the training and protocols.
If testing takes place without a continuously monitoring safety driver, ask for conformance to industry-consensus safety standards for the autonomous vehicle itself. If there is no person continuously monitoring and capable of assuring safety, then the safety aspects of the technology have to be complete. You shouldn't let vehicles without fully mature safety technology operate without a human safety driver.

The long version (below) gets pretty detailed, but this is a complicated and nuanced issue. So, here we go...

The AAMVA Guidelines as a starting point:

DOTs should follow applicable AAMVA guidelines with a few additional points.

The American Association of Motor Vehicle Administrators (AAMVA) released the 2nd edition of Safe Testing and Deployment of Vehicles Equipped with Automated Driving Systems Guidelines in September 2020. There is plenty of good information here. However, there are a few areas that require going beyond these guidelines to ensure what might be considered acceptable safety.

My additional recommendations within the scope of these guidelines include:

Vehicle manufacturer or testing organization should be required to publish a Voluntary Safety Self Assessment (VSSA) report (see AAMVA Guidelines 3.1.5). That VSSA should address all relevant topics in the NHTSA Automated Driving Safety documents (2.0, 3.0, 4.0). A VSSA does not provide all information required for technical evaluation of safety, but complete lack of a VSSA suggests an unwillingness to provide public transparency.
Require statement of areas of intended operation in a manner that does not compromise any claimed secrets as to detailed specifics of tests being conducted.

For example, require reporting of zip codes of where testing is to be conducted.
Report speeds at which testing will be conducted (e.g., 25 mph speed limit street testing is much different than Interstate System highway testing).
Report other relevant Operational Design Domain factors that will limit testing (e.g., daytime only, in rain, in snow) so that any particularly hazardous environmental testing situations can be considered with regard to public safety.
Discuss any unique characteristics of the test area to ensure the tester understands what unique challenges might be presented that someone not from the location might find unusual (e.g., The Pittsburgh Left, parking chairs, cable cars, cattle grids, gator crossings).

Require a tester statement that a defined level of technology quality will be confirmed before it is used for public road testing (along with the definition of what that might be). This should include at least:

A comprehensive simulation and closed course testing plan should be completed before testing on public roads.
All software updates should be subjected to confirmatory closed course testing to ensure no new defects have been introduced before being used in road testing.
Any vehicle feature that does not pass closed course testing should not be active during road testing. (In other words, if a feature fails closed course track testing, it shouldn't be operated on public roads.) Public roads should be used to confirm that the vehicle works as expected, not for debugging of known-faulty features.

Require an explanation of why the tester thinks that safety driver training and performance will be sufficient to ensure that test vehicles do not present increased risk to other road users.

The AAMVA guideline scope, while quite useful, is primarily administrative in nature rather than technical. To go beyond this we need to look at engineering standards. (Some of the above points also appear in the following standards and guidance.)

Define how safe is safe enough:

DOTs should define the desired safety outcome, but not how to measure it.

This is perhaps the trickiest point. It's important for the DOT to set the bar for how safe is safe enough. Testers likely have overwhelming financial incentive to get their testing done. Even with the best intentions, the threat of losing funding for lack of progress can loom larger than a possibility of a problem with a testing crash that might (or might not) happen in the future. It seems insufficient in such an environment to simply assume that for-profit organizations will set a safety target that reflects local societal norms.

However, it would be irresponsible for a testing organization to do public road testing without regard for public safety. This means that any (responsible) testing organization will have: a safety goal, an analysis done before testing starts to predict whether they are likely to reach that safety goal, and metrics collected during testing to ensure that they are meeting their safety goal.

DOTs might not have the technical sophistication to tell testers how to predict safety during testing, nor to know exactly which metrics and associated metric thresholds would be appropriate for a particular test plan. However, the DOT should take responsibility (absent legislation) for making it clear what the testing safety goal should be.

An example might be: road testing operations shall be at least as safe as unimpaired human drivers, taking into account local driving safety statistics and testing environmental conditions. For example, if testing in Pittsburgh only in daytime and dry weather, testers should have a goal of being at least as safe as other Pittsburgh drivers operating in daytime and dry weather (subtracting out drunk and impaired human driver collisions). That "at least as safe as human" comparison should consider at least fatalities and major injury crashes. Records must be kept of all safety-related metrics, incidents, and loss events.

An important consideration is that the policy in the preceding paragraph does not tell testers how to predict such safety nor how to measure it on a technical basis. Rather, it is up to the testers to figure this out in their own individual situation. As mentioned earlier, if they don't know how to measure their own safety, they shouldn't be out on public roads doing the testing in the first place.
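As a rough illustration of the kind of bookkeeping this implies, the sketch below compares a tester's observed crash rate against a local human-driver baseline for matching conditions. Every name and number in it is a hypothetical placeholder, not data from any real test program or city, and a real analysis would need far more statistical care (the zero-crash probability shown assumes crashes follow a simple Poisson process).

```python
import math
from dataclasses import dataclass


@dataclass
class SafetyGoalCheck:
    """Hypothetical record comparing test operations to a human-driver baseline."""
    baseline_crashes_per_million_miles: float  # unimpaired local drivers, matching conditions
    test_miles: float                          # miles driven under test so far
    observed_crashes: int                      # fatal or major injury crashes during testing

    def expected_crashes_at_baseline(self) -> float:
        # Crashes a baseline human-driven fleet would be expected to have over the same miles.
        return self.baseline_crashes_per_million_miles * self.test_miles / 1_000_000

    def observed_rate_per_million_miles(self) -> float:
        return self.observed_crashes / self.test_miles * 1_000_000

    def meets_goal_so_far(self) -> bool:
        # "At least as safe as" means the observed rate does not exceed the baseline rate.
        return self.observed_rate_per_million_miles() <= self.baseline_crashes_per_million_miles

    def prob_zero_crashes_at_baseline(self) -> float:
        # Poisson assumption: chance that a baseline-level fleet has zero crashes over these miles.
        return math.exp(-self.expected_crashes_at_baseline())


# Placeholder numbers for illustration only.
check = SafetyGoalCheck(
    baseline_crashes_per_million_miles=2.0,
    test_miles=50_000,
    observed_crashes=0,
)
print(check.expected_crashes_at_baseline())
print(check.meets_goal_so_far())
print(round(check.prob_zero_crashes_at_baseline(), 3))
```

With these placeholder values, a fleet driving at exactly the baseline rate would still see zero crashes about 90% of the time. A crash-free record over limited mileage therefore says little on its own, which is one reason testers need a prediction up front and ongoing metrics rather than crash counts alone.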

Could this approach be gamed? Of course it can (as can any approach). However, if the tester goes on record committing to a particular level of safety, it will become evident whether that level of safety has been reached sooner or later based on police reports, if nothing else. Once that happens, historical metrics will show whether the tester was operating in good faith or not.

SAE J3018 for operational safety:

DOTs should ask testers to conform to the industry standard for road testing safety: SAE J3018.

When AV road testing first started, it was common for testers to claim that they were safe because they had a "safety driver." However, as was tragically demonstrated in the Tempe AZ testing fatality in 2018, not all approaches to safety driving are created equal. Much more is required. Fortunately, there is an SAE standard that addresses this topic.

SAE J3018_202012 "Safety-Relevant Guidance for On-Road Testing of Prototype Automated Driving System (ADS)-Operated Vehicles" (https://www.sae.org/standards/content/j3018_202012/ -- be sure to get the 2020 revision) provides safety relevant guidance for road testing. It concentrates on guidance for the "in-vehicle fallback test driver" (also known informally as the safety driver).

AV testers should conform to J3018 to ensure that they are following identified best practices for safety driver training and effectiveness. Endorsing this standard will avoid a DOT having to create their own driver qualification and training requirements.

Taking a deeper look at J3018, it seems a bit light on measuring whether the safety driver is actually providing effective risk mitigation. Rather, it seems to implicitly assume that training will necessarily result in acceptable road testing safety. While training and qualification of safety drivers is essential, it is prudent to also monitor safety driver effectiveness, and testers should be asked to address this issue. Nonetheless, J3018 is an excellent starting point for testing safety. Testers should be doing at least what is in J3018, and probably more.

J3018 does cost money to read, and the free preview is not particularly informative. However, there is a free copy of a precursor document available here: https://avsc.sae-itc.org/principles-01-5471WV-42925L3.html that will give a flavor of what is involved. That having been said, any DOT guidance or requirement should follow J3018, and not the AVSC precursor document.

In addition to following J3018, the safety-critical mechanisms for testing should be designed to conform to the widely used ISO 26262 functional safety standard. (This is not to say that the entire test vehicle -- which is still a work in progress -- needs to conform to 26262 during testing. Rather, that the "Big Red Button" and any driver takeover functions need to conform to 26262 to make sure that the safety driver can really take over when necessary.)

For cargo vehicles that will deploy without drivers, J3018 can still be used by installing a temporary safety driver seat in the vehicle. Or the autonomy equipment can be mounted on a conventional vehicle in a geometry that mimics the cargo vehicle geometry. When the time comes to deploy without a driver physically in the system, you are really testing an autonomous vehicle with a chase car or remote safety supervisor, covered in a following section on testing without a driver.

Safety Management System (SMS):

DOTs should ask testers to have a Safety Management System in place before testing.

A Safety Management System is a systematic way to manage safety risk for an organization. The roots of SMS approaches come from the aviation industry. The short version is that an SMS helps make sure that you are operationally safe. An important aspect of an SMS is that traditionally it is more about how the people in the company perform tasks and the safety culture rather than the technology itself.

Perhaps the most important overarching finding of the NTSB investigation of the Tempe AV testing fatality was that the lack of an SMS increased the risk of such a bad outcome. To paraphrase the NTSB hearing opening remarks: "you don't have to wait to have a fatal crash before you decide to implement an SMS." (If you have made it this far in reading this essay, you absolutely must listen to the first 6 minutes of this NTSB hearing https://youtu.be/mSC4Fr3wf0k if you have not already done so.)

The AVSC, a closed-membership industry group, has recently released guidelines for AV testing SMS: https://avsc.sae-itc.org/principle-7-5896VG-46559OG.html

While these are not at the same level of consensus and review as an SAE-issued standard (for example, public comments are not solicited), they do provide industry guidance that is applicable to road testing safety. (I personally have not reviewed these to the degree I have J3018, but expect to do so over time if they are submitted to the SAE ORAD standards committee as J3018 was. So this is not a specific endorsement, but rather an identification of industry-created content that looks likely to be useful.)

DOTs should ask any AV testing organization to describe their SMS and accompanying safety plan. The tester should explain how such an SMS is comparable to or better than the AVSC guidelines.

Metrics:

DOTs should ask for metrics related to public safety during testing rather than autonomy performance.

It is common to want a standard set of metrics for both test and deployed vehicles. That area is still maturing. While metrics such as number of crashes of various severity classes are fairly straightforward, other predictive metrics such as "disengagements" are problematic for a number of reasons. In particular, each vehicle and each test program has different objectives and different safety architectures. So it will be a while before one-size-fits-all metrics are standardized.

I recommend that any metrics defined be tied to safety procedures and policies rather than the maturity of the technology. Most importantly, it is desirable to find metrics that AV testers cannot claim reveal proprietary information. That means that metrics that measure "how good is the AV" or "how soon to deployment" are likely to be problematic -- and not necessarily that relevant to the crucial question of whether the testing itself that's going to happen right now (and not the AV that might be deployed sometime in the future) presents elevated risk to the public.

I'd argue that the public has a legitimate right to understand whether road users in the test area are put at increased risk due to AV testing. One way to approach this is to ask the AV tester to respond to the following questions:

What basis do you have for claiming that your testing will not present increased risk to other road users, including vulnerable road users?
What metrics do you plan to collect to ensure that your system is in fact not presenting any such increased risk?
What periodic (e.g., monthly) quantitative report can you give us to show that indeed your testing has not increased the risk to other road users?

In general, the strategy should be to ask the AV tester: "Why do you think you're safe" and "How do you plan to measure safety," followed by "How will you know if you're not as safe as you promised you would be?"

If the AV tester can't promise that they will not increase risk to other road users (especially vulnerable road users), then should they be testing on your roads? If they don't plan to measure and track their actual on-road risk, then do you find their safe testing promise credible? And if they claim that their testing road risk data is proprietary, does that even make sense?

Some example metrics for testing safety (although applicability depends on the specifics of the situation):

How often does the built-in driver monitor signal a driver attention issue? (It won't be zero, but there should be a defined acceptable threshold set by the AV tester which, if exceeded, should cause a process intervention of some sort.)
How often does the safety driver make an erroneous intervention, even though there is no crash or other loss event? (In other words, how many near hits are occurring?)
How does the AV tester track skill degradation to determine when it is time for a shift change or even refresher training for a safety driver?
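As one possible shape for the first of these metrics, the sketch below flags shifts where the driver monitor's alert rate exceeds a threshold chosen by the tester. The record fields, driver IDs, counts, and the threshold value are all hypothetical; the appropriate threshold and the resulting process intervention are for the tester to define and justify.

```python
from dataclasses import dataclass


@dataclass
class ShiftLog:
    """Hypothetical per-shift record for one safety driver."""
    driver_id: str
    hours_driven: float
    attention_alerts: int  # times the driver monitor flagged an attention issue


def alerts_per_hour(log: ShiftLog) -> float:
    if log.hours_driven <= 0:
        raise ValueError("hours_driven must be positive")
    return log.attention_alerts / log.hours_driven


def shifts_needing_review(logs: list[ShiftLog], max_alerts_per_hour: float) -> list[str]:
    """Return driver IDs whose alert rate exceeds the tester-defined threshold."""
    return [log.driver_id for log in logs if alerts_per_hour(log) > max_alerts_per_hour]


# Placeholder data for illustration only.
logs = [
    ShiftLog("driver-A", hours_driven=4.0, attention_alerts=2),   # 0.5 alerts/hour
    ShiftLog("driver-B", hours_driven=4.0, attention_alerts=10),  # 2.5 alerts/hour
]
print(shifts_needing_review(logs, max_alerts_per_hour=1.0))  # ['driver-B']
```

A periodic report to the DOT could then summarize how often the threshold was exceeded and what was done about it, without saying anything about the automation technology under test.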

Keep in mind that, because most companies accomplish testing safety via test driver supervision, road safety has much more to do with the reliability of the safety drivers than with the automation technology itself. So the above metrics have nothing to do with the automation technology, and everything to do with test driver safety -- which is the part that matters for most AV testing safety.

In the end, the metrics should show that the required level of safety is being achieved. They should also be predictive enough that they are likely to indicate any potential problems BEFORE there is a crash.

Testing Without A Driver:

DOTs should ask about safety during communication loss for remote safety driver testing.
DOTs should ask testers to conform to industry automotive safety standards if there is no supervising test driver.

Eventually, organizations will want to test on public roads without a driver. Indeed, California has already issued permits for this. In terms of safety, a primary question to ask is how safety is being assured.

If there is a remote operator involved, then it is important to ensure that any real time data connectivity and sensor information is sufficient to ensure safety. This is a controversial area, and any company promising that, for example, a remote operator can instantly take over operation in the event of a malfunction should be prepared to offer hard data metrics on control latency (delay introduced by the remote communication system), effectiveness of the vehicle detecting its own malfunctions (very difficult to ensure if, for example, the system doesn't know it doesn't see a pedestrian), and communication link reliability. It is challenging (some would say implausible) to control high speed vehicle operation remotely due to the latencies involved, so a line of sight radio link with a chase car might be required. J3018 practices for the remote operator would still apply. Additionally, the equipment used to perform the remote operation should conform to ISO 26262 or another comparable safety standard, which is not typically true for telecommunication equipment. (If loss of signal triggers a vehicle shutdown, then that loss of signal equipment and shutdown mechanism should conform to ISO 26262.)
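To make the latency concern concrete, the arithmetic below estimates how far a vehicle travels during a given control delay. The 0.5 second delay and the two speeds are hypothetical placeholders, not measurements of any real system, and the calculation covers only the communication delay, not the remote operator's own reaction time.

```python
MPH_TO_METERS_PER_SECOND = 0.44704  # exact: 1 mile = 1609.344 m, 1 hour = 3600 s


def distance_during_latency_m(speed_mph: float, latency_s: float) -> float:
    """Distance in meters covered at constant speed while a remote command is in transit."""
    return speed_mph * MPH_TO_METERS_PER_SECOND * latency_s


# Placeholder values for illustration only.
for speed_mph in (25, 65):
    d = distance_during_latency_m(speed_mph, latency_s=0.5)
    print(f"{speed_mph} mph, 0.5 s delay: {d:.1f} m traveled before any remote input arrives")
```

Under these assumptions the vehicle covers roughly 5.6 m at 25 mph and 14.5 m at 65 mph before a remote command can have any effect, which is why measured latency data matters more than a general assurance of instant takeover.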

If there is no remote operator involved, then either the tester should be following issued safety standards or have a safety case to explain why what they are doing is at least as rigorous as what is in those standards. Currently issued and applicable safety standards include: ISO 26262 (functional safety), ISO 21448 (safety of the intended function), and ANSI/UL 4600 (system level safety for autonomous vehicles).

It is worth noting that misinformation has been provided to at least one state DOT regarding ANSI/UL 4600 by industry advocacy groups. (Short version: there is no requirement whatsoever for external assessment in 4600, despite multiple statements to the contrary in a letter sent to a state DOT. Other negative statements tend to be similarly misleading or just plain incorrect.) Any DOT who wants the full story in response to any information they receive criticizing ANSI/UL 4600 is welcome to contact the author of this essay.

Some testers may say they have reasons for not following industry consensus safety standards. If that is the case, ask them what quantitative data they have to demonstrate they are safer than a human driver. If they can't prove to themselves that they are at least as safe as a human driver, why are they operating on public roads? If they say they have the data but it is proprietary, ask what road testing safety data has to do with the secret sauce behind their autonomy. (Short answer -- it has nothing to do with the secret sauce, but might have to do with concerns that they can't promise safe testing.)

Transparency:

It is common for testers to claim that any attempt to require data reporting, metrics, or other transparency will somehow give away incredibly valuable trade secrets and inhibit innovation. This is utter nonsense. Yet, it seems to be the industry playbook. For example, during a NYC DOT hearing "about a half-dozen autonomous car makers and their advocates said the proposed rules would turn New York City from an engine of innovation into a backwater that would set back the evolution of the potentially life-saving technology of computer-controlled cars and trucks that can move around without inferior human beings messing everything up." (https://nyc.streetsblog.org/2021/09/01/self-driving-car-industry-promising-safety-pushes-back-on-dot-plan-to-regulate-testing/)

Often this conversation boils down to testers saying "trust us, we're smart." They may be smart, but decades of experience with safety in other domains has shown that there is no safety without transparency. If they are smart enough to be able to build a car that can drive itself safely on your roads -- without even needing to follow industry standards -- they should also be smart enough to figure out a way to show you data to prove they are safe without revealing major secrets.

For situations in which a safety driver is in the vehicle, let's look at what is required for transparency, which NYC DOT did a good job with (here: https://rules.cityofnewyork.us/wp-content/uploads/2021/08/DOT-Notice-of-Adoption-AV-Rule-FINAL-with-Finding.pdf). The elements they require are:

Self-certification that the testing will be safer than a human driver. This is just asking the tester to claim (without producing any proof) that they will test safely. If they're not willing to sign up to that, probably they should not be on public roads.
Conform to SAE J3018 and AVSC 00001201911. In other words, this is asking them to follow industry standards and practices for their test driver qualification and testing protocols. This involves ONLY the human test driver and does not place constraints on the automation technology being tested. If they're not willing to sign up to have trained safety drivers and safe testing protocols, probably they should not be on public roads.
Submission of a safety plan. This has nothing to do with the automation technology -- it is all about making sure the safety driver can keep the vehicle safe. If they can't explain to the DOT what their plan is to be safe in testing, probably they should not be on public roads.

The key is: you don't need to disclose any autonomous vehicle secret sauce to explain why testing will be safe, because the safety hinges on the human safety driver, not the automation technology.

Other Resources:

Here are some resources that might be useful. While SAE J3016 is widely used for terminology, it is essential to note that it is not (and is not intended to be) a safety standard. Conformance to J3016 Levels has to do with whether you're using an appropriate name for your vehicles, and not whether those vehicles are safe.

NYC DOT Testing regulations
SAE J3016 terminology explanation (slides | video)
SAE J3016 unofficial User Guide
Metrics podcasts (especially the episodes on Disengagements and Road Testing Safety)
Dirty Dozen AV industry myths to foil regulators
How the AV industry degrades trust and its relationship to regulations
More in-depth lecture series on AV Safety

Prof. Philip Koopman is an internationally recognized expert on Autonomous Vehicle (AV) safety whose work in that area spans 25 years. He is also actively involved with AV policy and standards as well as more general embedded system design and software quality. His pioneering research work includes software robustness testing and run time monitoring of autonomous systems to identify how they break and how to fix them. He has extensive experience in software safety and software quality across numerous transportation, industrial, and defense application domains including conventional automotive software and hardware systems. He was the principal technical contributor to the UL 4600 standard for autonomous system safety issued in 2020. He is a faculty member of the Carnegie Mellon University ECE department where he teaches software skills for mission-critical systems. In 2018 he was awarded the highly selective IEEE-SSIT Carl Barus Award for outstanding service in the public interest for his work in promoting automotive computer-based system safety.

Any city or state DOT representative addressing this topic is welcome to contact him via: koopman@cmu.edu

Updated Sept. 12, 2021.


Sunday, September 12, 2021