Wednesday, March 27, 2019

Missing Rare Events in Autonomous Vehicle Simulation

Missing Rare Events in Simulation:
A highly accurate simulation and system model doesn't solve the problem of what scenarios to simulate. If you don't know what edge cases to simulate, your system won't be safe.




It is common, and generally desirable, to use vehicle-level simulation rather than on-road operation as a proxy field testing strategy. Simulation offers a number of potential advantages over field testing of a real vehicle, including lower marginal cost per mile, better scalability, and reduced risk to the public during testing. Ultimately, simulation is driven by data that generates the scenarios used to exercise the system under test, commonly called the simulation workload. The validity of the simulation workload is just as relevant as the validity of the simulation models and software.

Simulation-based validation is often accomplished with a weighting of scenarios that is intentionally different from the expected operational profile. Such an approach has the virtue of being able to exercise corner cases and known rare events with less total exposure than would be required by waiting for such situations to happen by chance in real-world testing (Ding 2017). To the extent that corner cases and known rare events are intentionally induced in physical vehicle field testing or closed-course testing, those tests amount to simulation in that the occurrence of those events is being simulated for the benefit of the test vehicle.
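As a rough illustration of this idea (the scenario categories, rates, and weights below are invented for the example, not taken from any real program), a simulation workload can deliberately over-weight rare scenario classes and then re-weight the per-scenario results back toward the expected operational profile when estimating overall behavior:

    import random

    # Hypothetical scenario categories with made-up real-world rates and the
    # intentionally skewed weights used to build the simulation workload.
    operational_profile = {"nominal": 0.980, "cut_in": 0.015, "pedestrian_dart_out": 0.005}
    simulation_weights = {"nominal": 0.40, "cut_in": 0.30, "pedestrian_dart_out": 0.30}

    def sample_workload(n):
        """Draw n scenario categories using the skewed simulation weights."""
        categories = list(simulation_weights)
        weights = [simulation_weights[c] for c in categories]
        return random.choices(categories, weights=weights, k=n)

    def estimated_failure_rate(results):
        """Re-weight pass/fail results back to the operational profile.

        results: list of (category, failed) tuples from the simulator.  Each
        outcome gets weight p_operational / p_simulated, so rare events are
        exercised heavily without biasing the overall estimate.
        """
        total, failures = 0.0, 0.0
        for category, failed in results:
            w = operational_profile[category] / simulation_weights[category]
            total += w
            if failed:
                failures += w
        return failures / total if total else 0.0

The point of the sketch is only that both the skewed workload and the assumed operational profile have to be known and defensible before the results can be interpreted.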

A more sophisticated simulation approach should use a simulation “stack” with layered levels of abstraction. High level, faster simulation can explore system-level issues while more detailed but slower simulations, bench tests, and other higher fidelity validation approaches are used for subsystems and components (Koopman & Wagner 2018).
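One way to picture such a stack (the layer names, thresholds, and promotion rule here are assumptions made for the sake of the sketch) is a fast, abstract simulator that sweeps a large scenario set and promotes only the scenarios it flags as interesting to a slower, higher-fidelity layer:

    # Minimal sketch of a layered simulation stack; the layer behaviors are
    # placeholders, not models of any particular simulator.

    def fast_system_sim(scenario):
        """Cheap, abstract vehicle-level simulation.  Returns True when the
        scenario looks risky enough to deserve detailed analysis."""
        return scenario.get("min_gap_m", 99.0) < 2.0

    def high_fidelity_sim(scenario):
        """Expensive sensor-physics / vehicle-dynamics simulation (placeholder)."""
        return {"scenario": scenario["name"], "result": "detailed metrics here"}

    def run_stack(scenarios):
        """Sweep everything at the fast layer; promote flagged cases downward."""
        flagged = [s for s in scenarios if fast_system_sim(s)]
        return [high_fidelity_sim(s) for s in flagged]

    workload = [{"name": "easy merge", "min_gap_m": 8.0},
                {"name": "tight cut-in", "min_gap_m": 1.2}]
    detailed_results = run_stack(workload)   # only the tight cut-in is promoted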

Regardless of the mix of simulation approaches, the fidelity of the simulation and the realism of the scenarios are generally recognized as potential threats to validity. The simulation must be validated to ensure that it produces sufficiently accurate results for the aspects that matter to the safety case. This might include requiring conformance of the simulation code and model data to a safety-critical software standard.


Even with a conceptually perfect simulation, the question remains as to what events to simulate. Even if simulation were to cover enough miles to statistically assure safety, the question would remain whether there are gaps in the types of situations simulated. This corresponds to the representativeness issue with field testing and proven-in-use arguments. However, representativeness is a more pressing matter if simulation scenarios are being designed as part of a test plan rather than being based solely on statistically significant amounts of collected field data.

Another way to look at this problem is that simulation can remove the need to do field testing for rare events, but it does not remove the need to determine which rare events matter. All things being equal, simulation does not reduce the number of road miles needed for data collection to observe rare events. Rather, it permits a substantial fraction of that data collection to be done with a non-autonomous vehicle. Thus, even if simulating billions of miles is feasible, there needs to be a way to ensure that the test plan and simulation workload exercise all the aspects of the vehicle that would have been exercised in field testing of the same magnitude.

As with the fly-fix-fly anti-pattern, fixing defects identified in simulation requires additional simulation input data to validate the design. Simply re-running the same simulation and fixing bugs until the simulation passes invokes the "pesticide paradox" (Beizer 1990). This paradox holds that a system which has been debugged to the point that it passes a set of tests cannot be considered completely bug-free. Rather, it is simply free of the bugs that the test suite knows how to find, leaving the system exposed to bugs that might involve only very subtle differences from the test suite.
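A sketch of one way to push back against this (mutate_scenario and its jitter parameter are made up for illustration) is to grow the workload every time a defect is fixed, adding perturbed variants of the failing scenario rather than only re-running the original suite until it passes:

    import random

    def mutate_scenario(scenario, n_variants=5, jitter=0.2):
        """Generate perturbed copies of a scenario by jittering numeric fields,
        probing for nearby failures that differ only subtly from the known one."""
        variants = []
        for _ in range(n_variants):
            variant = dict(scenario)
            for key, value in scenario.items():
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    variant[key] = value * (1.0 + random.uniform(-jitter, jitter))
            variants.append(variant)
        return variants

    def extend_workload_after_fix(workload, failing_scenario):
        """Add the failing scenario plus perturbed variants to the workload,
        instead of just fixing bugs until the existing suite passes."""
        return workload + [failing_scenario] + mutate_scenario(failing_scenario)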

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

1 comment:

  1. The problem of finding a limited set of representative scenarios - including rare cases - out of a practically unlimited set of possible test scenarios is apparent in many cases. My answer in such projects is: bring together the most experienced validators available and apply a structured brainstorming technique. It is not only logical (requirements-driven) doubts but also epistemic doubts about the correctness of system behavior that drive the resulting set of scenarios. That is where experience counts.

    Another technique I know of is to brainstorm the dimensions along which test scenarios can be organized, then analyze the interdependencies of these dimensions and find representative scenarios for each independent n-cube, which are taken to test the whole n-cube (a rough sketch of this idea appears at the end of this comment).

    Either way, do not just list the scenarios but define the equivalence class each one stands for.

    The fly-fix-fly anti-pattern, in my opinion, needs more attention than it usually gets: if the designer follows a safety-related quality process, ideally it should not be possible to find an incorrect behavior. If we can find a problem, just fixing it is like pretending there was none. Instead, we have to ask questions about the root cause of the failure and about variants of that root cause. Each failure found should induce ideas for new representative test scenarios or scenario dimensions.

    At review time of the test scenarios and their positive test results, with their experience extended by what was newly learned during testing, the experienced validators and other stakeholders should not be able to find a single dangerous scenario that is not covered by one or more of the performed test scenarios.

    Even with all these test techniques in place, you cannot be sure to have avoided every dangerous race condition or concurrency-caused inconsistency that - very seldom - may lead to dangerous behavior. This can only be shown to be excluded by a rigorous design quality inspection against formal rules.
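    As a rough sketch of the dimension-based idea mentioned above (the dimensions and values are invented examples, and each value stands for an equivalence class rather than a single concrete test):

        from itertools import product

        # Invented scenario dimensions; each value represents an equivalence
        # class of concrete situations, not one specific test case.
        dimensions = {
            "weather": ["clear", "rain", "snow"],
            "lighting": ["day", "night"],
            "actor": ["none", "vehicle_cut_in", "pedestrian_crossing"],
        }

        # If interdependency analysis shows the dimensions are independent, one
        # representative scenario per combination is taken to cover the n-cube.
        scenarios = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]
        # e.g. scenarios[0] == {'weather': 'clear', 'lighting': 'day', 'actor': 'none'}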

