You are listening to the Edge Case Research Self Driving Car Safety Series, and in this episode, Phil Koopman discusses metrics for unknown unknowns or the things that you have not trained on or thought about in the design. Hear Phil explain how using metrics to track the unknowns, or surprises, can help determine if the self driving car is good at recognizing something weird has happened and if it will respond safely. Now over to Phil.
This is Phil Koopman from Edge Case Research with a series on self-driving car safety. This time, I’ll be talking about approaches to metrics for unknown unknowns. Your first reaction to measuring unknowns may be how in the world can you do that? Well, it turns out the software engineering community has been doing this for decades, and they call it software reliability growth modeling. That area’s quite complex with a lot of history, but for our purposes, I’ll boil it down to the basics.
Software reliability growth modeling deals with the problem of knowing whether your software is reliable enough, or in other words, whether or not you’ve taken out enough bugs that it’s time to ship the software. All things being equal, if your system test reveals 10 times more defects in the current release than in the previous release, it’s a good bet your new release is not as reliable as your old one. On the other hand, if you’re running a weekly test-debug cycle with a single release, so every week you test it, you remove some bugs, then you test it some more the next week, at some point you’d hope that the number of bugs found each week will be lower, and eventually you’ll stop finding bugs. When the number of bugs per week you find is low enough, maybe zero, or maybe some small number, you decide it’s time to ship. Now that doesn’t mean your software is perfect, but what it does mean is there’s no point testing anymore if you’re not finding bugs. Alternatively, if you have a limited testing budget, you can look at the curve over time of the number of bugs you’re discovering each week and get some sort of estimate about how many bugs you would find if you continued testing all the way down to zero.
At some point, you may decide that the number of bugs you’ll find and the amount of time it will take simply isn’t worth the expense. And especially for a system that is not life critical, you may decide it’s just time to ship. A dizzying array of mathematical models have been proposed over the years for the shape of the curve of how many more bugs are left in the system based on your historical rate of how often you find bugs. Each one of those models comes with significant assumptions and limits to applicability. But the point is that people have been thinking about this for more than 40 years in terms of how to project how many more bugs are left in a system even though you haven’t found them. And there’s no point trying to reinvent all those approaches yourself.
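As a rough illustration of the curve-fitting idea, here is a minimal sketch that assumes the weekly bug count decays geometrically from week to week. That assumption, the function names, and the sample numbers are all hypothetical and chosen only for illustration; the published reliability growth models are more varied and each carries its own assumptions.

```python
import math

def fit_geometric_decay(weekly_bugs):
    """Least-squares fit of log(count) = log(a) + week * log(r).

    Assumes at least two weeks of data and every weekly count > 0.
    Returns (a, r): the fitted week-0 count and the week-to-week ratio.
    """
    n = len(weekly_bugs)
    weeks = range(n)
    logs = [math.log(c) for c in weekly_bugs]
    mean_x = sum(weeks) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), math.exp(slope)

def estimated_remaining(weekly_bugs):
    """Sum the fitted curve over all future weeks (a geometric series)."""
    a, r = fit_geometric_decay(weekly_bugs)
    if r >= 1.0:
        return math.inf  # discovery rate is not decreasing, so no finite estimate
    next_week = len(weekly_bugs)
    return a * r ** next_week / (1.0 - r)

# Hypothetical data: bugs found in each of five weekly test cycles.
weekly_bugs = [32, 16, 8, 4, 2]
a, r = fit_geometric_decay(weekly_bugs)
remaining = estimated_remaining(weekly_bugs)
print(f"fitted ratio per week: {r:.2f}, estimated bugs still to find: {remaining:.1f}")
```

With this made-up data the count halves each week, so the sketch projects about two more bugs if testing continued indefinitely. Real data is noisier, and the estimate is only as good as the assumed curve shape.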
Okay, so what does this have to do with self-driving car metrics?
Well, it’s really the same problem. In software tests, the bugs are the unknowns, because if you knew where the bugs were, you’d fix them, and you’re trying to estimate how many unknowns there are or how often they’re going to arrive during a testing process. In self-driving cars, the unknown unknowns are the things you haven’t trained on or haven’t thought about in the design, and you’re doing road testing, simulation and other types of validation to try and uncover these. But it’s really the same problem. You’re trying to look for latent defects or functionality gaps and you’re trying to get an idea of how many more there are left in the system that you haven’t found yet, or how many you can expect to find if you invest more resources in further testing.
For simplicity, let’s call the things in self-driving cars that you haven’t found yet surprises. And the reason I put it this way is that there are two fundamentally different types of defects in these systems. One is you built the system the wrong way. It’s an actual software bug. You knew what you were supposed to do, and you didn’t get there. Traditional software testing and traditional software quality will help with those, but a surprise isn’t that. A surprise is a requirements gap or something in the environment you didn’t know was there, or a surprise has to do with imperfect knowledge of the external world. But you can still treat it as a similar, although different, class from software defects and go at it the same way. One way to look at this is a surprise is something you didn’t realize should be in your ODD and therefore is a defect in the ODD description. Or, you didn’t realize it could kick your vehicle out of the ODD and is a defect in the model of ODD violations that you have to detect. And you’d expect that surprises that can lead to safety-critical failures are the ones that need the highest priority for remediation.
Now to create a metric for surprises, you need to track the number of surprises over time. You hope that over time, the arrival rate of surprises gets lower. In other words, they happen less often, and that reflects that your product has gotten more mature, all things being equal. If the number of surprises gets higher, that could be a sign that your system has gotten worse at dealing with unknowns, or it could be a sign that your ODD has changed and more weird things are happening than used to because of some change in the outside world. Either way, a higher arrival rate of surprises means you’re less mature or less reliable and a lower rate means you’re probably doing better.
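To make the arrival rate concrete, here is a minimal sketch that normalizes surprise counts by exposure so that periods with different amounts of testing can be compared. The per-1,000-hours unit, the function names, and the numbers are assumptions for illustration, not something prescribed in the talk.

```python
def surprise_rates(periods, per_hours=1000.0):
    """Convert (surprise_count, test_hours) pairs into surprises per `per_hours`."""
    return [count * per_hours / hours for count, hours in periods]

def is_improving(rates):
    """True when the rate never rises from one reporting period to the next."""
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))

# Hypothetical reporting periods: (surprises observed, hours of testing).
periods = [(12, 400.0), (9, 600.0), (4, 800.0)]
rates = surprise_rates(periods)
print(rates, "improving" if is_improving(rates) else "not improving")
```

Dividing by hours matters here: the raw count only drops from 12 to 9 between the first two periods, but the rate halves because more testing was done.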
This may sound a little bit like disengagements as a metric, but there’s a profound difference, and that difference applies even if disengagements on road testing are one of the sources of data. The idea is that measuring how often you disengage, whether a safety driver takes over or the system gives up and says, “I don’t know what to do,” is a source of raw data, but the disengagements could be for many different reasons. And what you really care about for surprises is only disengagements that happened because of a defect in the ODD description or some other requirements gap. Each incident that could be one of those things needs to be analyzed to see if it was a design defect, which isn’t an unknown, that’s just a mistake that needs to be fixed, or a true unknown unknown that requires re-engineering or retraining your perception system or another remediation to handle something you didn’t realize until now was a requirement or operational condition that you need to deal with. Since even with a perfect design and perfect implementation, unknowns are going to continue to present risk, what you need to be tracking with a surprise metric is the arrival of actual surprises.
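The distinction between raw disengagements and actual surprises can be sketched as a triage step. The categories, record fields, and example events below are hypothetical; in practice the cause is assigned by engineers analyzing each incident, not by code.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Cause(Enum):
    DESIGN_DEFECT = "design defect"  # known requirement, implemented wrong
    SURPRISE = "surprise"            # ODD or requirements gap (unknown unknown)
    OTHER = "other"                  # e.g. a precautionary takeover

@dataclass
class Disengagement:
    description: str
    cause: Cause  # assigned after human analysis of the incident

def count_by_cause(events):
    """Tally analyzed disengagements by their assigned root cause."""
    return Counter(event.cause for event in events)

# Hypothetical analyzed incidents from road testing.
events = [
    Disengagement("planner missed a mapped stop line", Cause.DESIGN_DEFECT),
    Disengagement("unfamiliar object type in roadway", Cause.SURPRISE),
    Disengagement("safety driver took over as a precaution", Cause.OTHER),
    Disengagement("timing bug in a control loop", Cause.DESIGN_DEFECT),
]
counts = count_by_cause(events)
print(f"{len(events)} disengagements, {counts[Cause.SURPRISE]} counted as surprises")
```

Only the surprise tally feeds the surprise metric; the design defects still need fixing, but they are tracked as ordinary bugs.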
Now, it should be obvious that you need to be looking for surprises to see them, and that’s why things like monitoring near misses and investigating the occurrence of weird, but seemingly benign, behavior matters. Safety culture plays a role here. You have to be paying attention to surprises instead of dismissing them if they didn’t seem to do immediate harm. A deployment decision can use the surprise arrival rate metric to get an approximate answer of how much risk will be taken due to things missing from the system requirements and test plan. In other words, if you’re seeing surprises arrive every few minutes or every hour and you deploy, there’s every reason to believe that will continue to happen.
If you haven’t seen a surprise in thousands or tens or hundreds of thousands of hours of testing, then you can reasonably assume that surprises are not going to be happening every hour once you deploy. To deploy, you want to see the surprise arrival rate reduced to something acceptably low, and you’ll also want to know the system has a good track record that when a surprise does happen, it’s pretty good at recognizing something weird has happened and doing something safe in response. To be clear, in the real world, the arrival rate of surprises will probably never be zero, but you need to measure that it’s acceptably low so you can make a responsible deployment decision.
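One way to put a number on "haven't seen a surprise in many hours" is the standard zero-event bound, sketched below. It assumes surprises arrive as a Poisson process with a constant rate, which is an assumption added here and one that may not hold if the ODD or the outside world changes. The function name and the 10,000-hour figure are illustrative.

```python
import math

def upper_bound_rate(hours_without_surprise, confidence=0.95):
    """Upper confidence bound on surprises per hour after zero observed surprises.

    Under a constant-rate Poisson assumption, the chance of seeing zero events
    in T hours is exp(-rate * T); solving exp(-rate * T) = 1 - confidence
    gives the bound. At 95% confidence this is roughly 3 / T.
    """
    return -math.log(1.0 - confidence) / hours_without_surprise

# Hypothetical: 10,000 hours of testing with no surprises observed.
bound = upper_bound_rate(10_000.0)
print(f"95% upper bound: {bound:.2e} surprises per hour "
      f"(about one per {1.0 / bound:,.0f} hours)")
```

So surprise-free testing supports a claim about the rate that is roughly a third as strong as the hours driven, which is one reason the amount of testing needed grows quickly as the acceptable rate shrinks.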
You’ve just heard Phil Koopman discuss approaches to metrics for unknown unknowns. A key takeaway from this episode: even with a perfect design and perfect implementation, unknowns are going to continue to present risk. For responsible deployment, it is critical for companies to pay attention to the surprises and ensure that the self driving car has an acceptably low surprise rate and can react safely when it does encounter an unknown unknown.
At Edge Case Research, to help companies track surprises, we developed Hologram, a continuous risk analysis tool to test perception software that identifies risks that are difficult to find with other types of testing and analysis. Hologram helps reveal the unknown unknowns and is helping companies today prepare for safe deployment. To get more information on Hologram, please visit our website at www.ecr.ai or email us at firstname.lastname@example.org.
We thank you for listening and we look forward to working with you on delivering the promise of autonomy.