Monday, March 11, 2019

The Fly-Fix-Fly Antipattern for Autonomous Vehicle Validation

Fly-Fix-Fly Antipattern:
Brute force mileage accumulation to fix all the problems you see on the road won't get you all the way to being safe. For so very many reasons...

This A400M crash was caused by corrupted software calibration parameters.
In practice, autonomous vehicle field testing has typically not been a deployment of a finished product to demonstrate a suitable integrity level. Rather, it is an iterative development approach that can amount to debugging followed by attempting to achieve improved system dependability over time based on field experience. In the aerospace domain this is called a “fly-fix-fly” strategy. The system is operated and each problem that crops up in operations is fixed in an attempt to remove all defects. While this can help improve safety, it is most effective for a rigorously engineered system that is nearly perfect to begin with, and for which the fixes do not substantially alter the vehicle’s design in ways that could produce new safety defects.

It's a moving target: A significant problem with argumentation based on fly-fix-fly occurs when designers try to take credit for previous field testing despite a major change. When arguing field testing safety, it is generally inappropriate to take credit for field experience accumulated before the last change to the system that can affect safety-relevant behaviour. (The “small change” pitfall has already been discussed in another blog post.)

Discounting Field Failures: There is no denying the intuitive appeal to an argument that the system is being tested until all bugs have been fixed. However, this is a fundamentally flawed argument. That is because this amounts to saying that no matter how many failures are seen during field testing, none of them “count” if a story can be concocted as to how they were non-reproducible, fixed via bug patches, and so on.

To the degree such an argument could be credible, it would have to find and fix the root cause of essentially all field failures. It would further have to demonstrate (somehow) that the field failure root cause had been correctly diagnosed, which is no small thing, especially if the fix involves retraining a machine-learning based system. 

Heavy Tail: Additionally, it would have to argue that a sufficiently high fraction of field failures have actually been encountered, resulting in a sufficiently low probability of encountering novel additional failures during deployment.  (Probably this would be an incorrect argument if, as expected, field failure types have a heavy tail distribution.)
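The heavy-tail concern can be made concrete with a toy simulation. All of the numbers below are invented for illustration (a Zipf-like population of 100,000 failure types); the point is only that under such a distribution, substantial probability mass remains in never-before-seen failure types even after a large amount of field experience:

```python
# Toy illustration (assumed Zipf-distributed failure types, not real data):
# even after many observed field failures, the probability that the NEXT
# failure is a previously unseen type stays non-negligible.
import random

random.seed(0)
NUM_TYPES = 100_000
# Zipf-like weights: failure type i occurs with probability proportional to 1/i
weights = [1.0 / i for i in range(1, NUM_TYPES + 1)]
total = sum(weights)

seen = set()
novel_mass_after = {}
draws = random.choices(range(NUM_TYPES), weights=weights, k=100_000)
for n, t in enumerate(draws, start=1):
    seen.add(t)
    if n in (1_000, 10_000, 100_000):
        # P(next failure is a new type) = probability mass of unseen types
        unseen = sum(w for i, w in enumerate(weights) if i not in seen)
        novel_mass_after[n] = unseen / total

for n, p in novel_mass_after.items():
    print(f"after {n:>6} observed failures, P(next failure is novel) ~ {p:.2f}")
```

With these made-up parameters the novel-failure probability shrinks only logarithmically with exposure, which is exactly why "we fixed everything we saw" does not bound what remains unseen.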

Fault Reinjection: A fly-fix-fly argument must also address the fault reinjection problem. Fault reinjection occurs when a bug fix introduces a new bug as a side effect of that fix. Ensuring that this has not happened via field testing alone requires resetting the field testing clock to zero after every bug fix. (Hybrid arguments that include rigorous engineering analysis are possible, but simply assuming no fault reinjection without any supporting evidence is not credible.)

It Takes Too Long: It is difficult to believe an argument that claims that a fly-fix-fly process alone (without any rigorous engineering analysis to back it up) will identify and fix all safety-relevant bugs. If there is a large population of bugs that activate infrequently compared to the amount of field testing exposure, such a claim would clearly be incorrect. Generally speaking, fly-fix-fly requires an infeasible amount of operation to achieve the ultra-dependable results required for life critical systems, and typically makes unrealistic assumptions such as that no new faults are injected by fixing a fault identified in testing (Littlewood and Strigini 1993). A specific issue is the matter of edge cases, discussed in an upcoming blog post.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Littlewood, B., Strigini, L. (1993) “Validation of Ultra-High Dependability for Software-Based Systems,” Communications of the ACM, 36(11):69-80, November 1993.

Thursday, March 7, 2019

The Insufficient Testing Pitfall for autonomous system safety

The Insufficient Testing Pitfall:
Testing for less time than the mean time between failures implied by your target failure rate doesn't prove you are safe.  In fact you probably need to test for about 10x that duration to be reasonably sure you've met the target. For life critical systems this means too much testing to be feasible.

In a field testing argumentation approach a fleet of systems is tested in real-world conditions to build up confidence. At a certain point the testers declare that the system has been demonstrated to be safe and proceed with production deployment.

An appropriate argumentation approach for field testing is that a sufficiently large number of exposure hours have been attained in a highly representative real-world environment. In other words, this is a variant of a proven in use argument in which the “use” was via testing rather than a production deployment. As such, the same proven in use argumentation issues apply, including especially the needs for representativeness and statistical significance. However, there are additional pitfalls encountered in field testing that must also be dealt with.

Accumulating an appropriate amount of field testing data for high dependability systems is challenging. In general, real-world testing needs to last approximately 3 to 10 times the acceptable mean time between hazardous failures to provide statistical significance (Kalra and Paddock 2016). For life-critical testing this can be an infeasible amount of testing (Butler and Finelli 1993, Littlewood and Strigini 1993).

Even if a sufficient amount of testing has been performed, it must also be argued that the testing is representative of the intended operational environment. To be representative, the field testing must at least have an operational profile (Musa et al. 1996) that matches the types of operational scenarios, actions, and other attributes of what the system will experience when deployed. For autonomous vehicles this includes a host of factors such as geography, roadway infrastructure, weather, expected obstacle types, and so on.

While the need to perform a statistically significant amount of field testing should be obvious, it is common to see plans to build public confidence in a system via comparatively small amounts of public field testing. (Whether there is additional non-public-facing argumentation in place is unclear in many of these cases.)

To illustrate the magnitude of the problem, in 2016 there were 1.18 fatalities per 100 million vehicle miles traveled in the United States, making the Mean Time Between Failures (MTBF) with respect to fatalities caused by a human driver about 85 million miles (NHTSA 2017). Assuming an exponential failure distribution, for a given MTBF the required test time in which r failures occur can be computed with confidence α using a chi square distribution (Morris 2018):

Required Test Time = χ²(α, 2r + 2) × MTBF / 2

(If you want to try out some example numbers, see the interactive MTBF test time calculator from Reliability Analytics (Morris 2018).)

Based on this, a single-occupant system needs to accumulate 255 million test miles with no fatalities to be 95% sure that the mean time between fatalities is at least 85 million miles. If there is a mishap during that time, more testing is required to distinguish normal statistical fluctuations from a lower MTBF: a total of 403 million miles to reach 95% confidence. If a second mishap occurs, 535 million miles of testing are needed, and so on. Additional testing might also be needed if the system is changed, as discussed in a subsequent section. Significantly more testing would also be required to ensure a comparable per-occupant fatality rate for multi-occupant vehicle configurations.
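The mileage figures above come straight from the chi-square formula. As a sketch (stdlib-only Python; the helper names are mine, and the closed-form CDF identity applies because 2r + 2 is always even), they can be reproduced like this:

```python
# Required field-test mileage for a target MTBF, per the relation
# Required Test Time = chi2(alpha, 2r + 2) * MTBF / 2.
import math

def chi2_cdf_even(x: float, k: int) -> float:
    """CDF of a chi-square distribution with 2k degrees of freedom,
    via the closed-form Poisson-sum identity (valid for even df)."""
    s = sum((x / 2) ** i / math.factorial(i) for i in range(k))
    return 1.0 - math.exp(-x / 2) * s

def chi2_quantile_even(alpha: float, k: int) -> float:
    """Invert the CDF by bisection to get the chi-square quantile."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, k) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def required_test_miles(mtbf_miles: float, failures: int,
                        alpha: float = 0.95) -> float:
    """Test mileage needed, with `failures` mishaps observed, to show the
    MTBF at confidence alpha: df = 2r + 2, so k = r + 1."""
    return chi2_quantile_even(alpha, failures + 1) * mtbf_miles / 2

for r in (0, 1, 2):
    millions = required_test_miles(85e6, r) / 1e6
    print(f"{r} fatalities observed: {millions:.0f} million test miles needed")
```

This prints 255, 403, and 535 million miles for zero, one, and two observed fatalities, matching the figures in the text.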

Attempting to use a proxy metric such as rate of non-fatal crashes and extrapolate to fatal mishaps requires also substantiating the assumption that the mishap profiles will be the same for autonomous systems as they are for human drivers. (See the human filter pitfall previously discussed.)

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Kalra, N., Paddock, S., (2016) Driving to Safety: how many miles of driving would it take to demonstrate autonomous vehicle reliability? Rand Corporation, RR-1479-RC, 2016.
  • Butler, Finelli (1993) “The infeasibility of experimental quantification of life-critical software reliability,” IEEE Trans. SW Engr. 19(1):3-12, Jan 1993.
  • Littlewood, B., Strigini, L. (1993) “Validation of Ultra-High Dependability for Software-Based Systems,” Communications of the ACM, 36(11):69-80, November 1993.
  • Musa, J., Fuoco, G., Irving, N., Kropfl, D., Juhlin, B., (1996) “The Operational Profile,” Handbook of Software Reliability Engineering, pp. 167-216, 1996.
  • NHTSA (2017) Traffic Safety Facts Research Note: 2016 Fatal Motor Vehicle Crashes: Overview. U.S. Department of Transportation National Highway Traffic Safety Administration. DOT HS-812-456.
  • Morris, S. (2018) Reliability Analytics Toolkit: MTBF Test Time Calculator. Reliability Analytics. (accessed December 12, 2018)

Tuesday, March 5, 2019

The Human Filter Pitfall for autonomous system safety

The Human Filter Pitfall:
Using data from human-driven vehicles has gaps corresponding to situations a human knows to avoid getting into in the first place, but which an autonomous system might experience.

The operational history (and thus the failure history) of many systems is filtered by human control actions, preemptive incident avoidance actions and exposure to operator-specific errors. This history might not cover the entirety of the required functionality, might primarily cover the system in a comparatively low risk environment, and might under-represent failures that manifest infrequently with human operators present.

From a proven in use perspective, trying to use data from human-operated vehicles, such as historical crash data, might be insufficient to establish safety requirements or a basis for autonomous vehicle safety metrics. The situations in which human-operated vehicles have trouble may not be the same situations that autonomous systems find difficult. When looking at existing data about situations that humans get wrong, it is easy to overlook the situations humans are good at navigating but which may cause problems for autonomous systems. An autonomous system cannot be validated only against problematic human driving scenarios (e.g. NHTSA (2007) pre-crash typology). The autonomy might handle these hazardous situations perfectly yet fail often in otherwise common situations that humans regularly perform safely. Thus, an argument that a system is safe solely because it has been checked to properly handle situations that have high rates of human mishaps is incomplete in that it does not address the possibility of new types of mishaps.

This pitfall can also occur when arguing safety for existing components being used in a new system. For example, consider a safety shutdown system used as a backup to a human operator, such as an Automated Emergency Braking (AEB) system. It might be that human operators tend to systematically avoid putting an AEB system in some particular situation that it has trouble handling. As a hypothetical example, consider an AEB system that has trouble operating effectively when encountering obstacles in tight curves. If human drivers habitually slow down on such curves there might be no significant historical data indicating this is a potential problem, and autonomous vehicles that operate at the vehicle dynamics limit rather than a sight distance limit on such curves will be exposed to collisions due to this bias in historical operational data that hides an AEB weakness. A proven-in-use argument in such a situation has a systematic bias and is based on incomplete evidence. It could be unsafe, for example, to base a safety argument primarily upon that AEB system for a fully autonomous vehicle, since it would be exposed to situations that would normally be pre-emptively handled by a human driver, even though field data does not directly reveal this as a problem.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • NHTSA (2007) Pre-Crash Scenario Typology for Crash Avoidance Research, National Highway Traffic Safety Administration, DOT HS-810-767, April 2007.

Thursday, February 28, 2019

The Discounted Failure Pitfall for autonomous system safety

The Discounted Failure Pitfall:
Arguing that something is safe because it has never failed before doesn't work if you keep ignoring its failures based on that reasoning.

A particularly tricky pitfall occurs when a proven in use argument is based upon a lack of observed field failures when field failures have been systematically under-reported or even not reported at all. In this case, the argument is based upon faulty evidence.

One way that this pitfall manifests in practice is that faults that result in low consequence failures tend to go unreported, with system redundancy tending to reduce the consequences of a typical incident. It can take time and effort to report failures, and there is little incentive to report each incident if the consequence is small, the system can readily continue service, and the subjective impression is that the system is perceived as overall safe. Perversely, reporting numerous recovered malfunctions or warnings can actually increase user confidence in a system even while events go unreported. This can be a significant issue when removing backup systems such as mechanical interlocks based on a lack of reported loss events. If the interlocks were keeping the system safe but interlock or other failsafe engagements go unreported, removing those interlocks can be deadly. Various aspects of this pitfall came into play in the Therac 25 loss events (Leveson 1993).

An alternate way that this pitfall manifests is when there is a significant economic or other incentive for suppressing or mischaracterizing the cause of field failures. For example, there can be significant pressure and even systematic approaches to failure reporting and analysis that emphasize human error over equipment failure in mishap reporting (Koopman 2018b). Similarly, if technological maturity is being measured by a trend of reduced safety mechanism engagements (e.g. autonomy disengagements during road testing (Banerjee et al. 2018)), there can be significant pressure to artificially reduce the number of events reported. Claiming proven in use integrity for a component subject to artificially reduced or suppressed error reports is again basing an argument on faulty evidence.

Finally, arguing that a component cannot be unsafe because it has never caused a mishap before can result in systems that are unsafe by induction. More specifically, if each time a mishap occurs you argue it wasn't that component's fault, and therefore never attribute cause to that component, then you can continue that reasoning forever, all the while insisting that an unsafe component is fine.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

  • Leveson, N. (1993) An investigation of the Therac-25 Accidents, IEEE Computer, July 1993, pp. 18-41.
  • Koopman, P. (2018b), "Practical Experience Report: Automotive Safety Practices vs. Accepted Principles," SAFECOMP, Sept. 2018.
  • Banerjee, S., Jha, S., Cyriac, J., Kalbarczyk, Z., Iyer, R., (2018) “Hands Off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data,” 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018.

Tuesday, February 26, 2019

The “Small” Change Fallacy for autonomous system safety

The “small” change fallacy:
In software, even a single-character change to source code can cause catastrophic failure.  Absent a very rigorous change analysis process, there is no such thing as a "small" change to software.

Headline: July 22, 1962: Mariner 1 Done In By A Typo

Another threat to validity for a proven in use strategy is permitting “small” changes without re-validation (e.g. as implicitly permitted by (NHTSA 2016), which requires an updated safety assessment for significant changes). Change analysis can be difficult in general for software, and ineffective for software that has high coupling between modules and resultant complex inter-dependencies.

Because even one line of bad code can result in a catastrophic system failure, it is highly questionable to argue that a change is “small” because it only affects a small fraction of the source code base. Rigorous analysis should be performed on the change and its possible effects, which generally requires analysis of design artifacts beyond just source code.
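As a purely hypothetical illustration (invented for this post, not drawn from any real system), a single-character edit to a comparison operator is enough to invert a safety check while leaving the code looking almost identical:

```python
# Hypothetical example: braking should engage when the gap to an
# obstacle falls below a safe threshold (all names and values invented).
SAFE_GAP_M = 10.0

def should_brake(gap_m: float) -> bool:
    """Intended behavior: brake when closer than the safe threshold."""
    return gap_m < SAFE_GAP_M

def should_brake_after_small_change(gap_m: float) -> bool:
    """The same check after a one-character 'small change' (< typed as >):
    it now brakes only when already far away, and never when close."""
    return gap_m > SAFE_GAP_M

print(should_brake(5.0), should_brake_after_small_change(5.0))  # True False
```

A diff tool would report this as a one-character change to one line, yet the patched version fails in exactly the situation the check exists to handle.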

A variant of this pitfall is arguing that a particular type of technology is generally “well understood” and implying based upon this that no special attention is required to ensure safety. There is no reason to expect that a completely new implementation – potentially by a different development team with a different engineering process – has sufficient safety integrity simply because it is not a novel function to the application domain.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • NHTSA (2016) Federal Automated Vehicles Policy, National Highway Traffic Safety Administration, U.S. Department of Transportation, September 2016.

Thursday, February 21, 2019

Assurance Pitfalls When Using COTS Components

Assurance Pitfalls When Using COTS Components:
Using a name-brand, familiar component doesn't automatically ensure safety.

It is common to repurpose Commercial Off-The-Shelf (COTS) software or components for use in critical autonomous vehicle applications. These include components originally developed for other domains such as mine safety, low volume research components such as LIDAR units, and automotive components such as radars previously used in non-critical or less critical ADAS applications.

Generally such COTS components are being used in a somewhat different way than the original non-critical commercial purpose, and are often modified for use as well. Moreover, even field proven automotive components are typically customized for each vehicle manufacturer to conform to customer-specific design requirements. When arguing that a COTS item is proven in use, it is important to account for at least whether there is in fact sufficient field experience, whether the field experience is for a previous or modified version of the component, and other factors such as potential supply-chain changes, manufacturing quality fade, and the possibility of counterfeit goods.

In some cases we have seen proven in use arguments attempted for which the primary evidence relied upon is the reputation of a manufacturer based on historical performance on other components. While purchasing from a reputable manufacturer is often a good start, a brand name label by itself does not necessarily demonstrate that a particular component is fit for purpose, especially if a complex supply chain is involved.

COTS components can be problematic if they don't come with the information needed for safety assessment. (Hint: source code is only a starting point. But often even that isn't provided.) While third party certification certainly is not a panacea, looking for independent evaluation that relevant technical, development process, and assurance process activities have been performed is a good start to making sure COTS components are fit for purpose.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
