Root Cause Analysis: Not Out of the Mountains, Yet
As you drive east on I-70 coming from the Rockies, there is a point where you seem to have stopped descending, but a sign says, "Trucks: Don't be fooled. Four more miles of steep grades and sharp curves." The message is that it would be premature to relax at this point, and vigilant driving is still required to safely reach flat ground.
Figure 1: Sign on I-70. Credit: Xnatedawgx, own work, GFDL
In the same way, when you are performing root cause analysis, you can reach points where everything seems to be OK. For example:
• You have found a root cause that is almost certainly contributing to the problem.
• The root cause you have found is actually a symptom and not the underlying cause.
• Your customer is no longer concerned about the problem.
• The problem seems to have gone away by itself, and nothing is happening.
These are dangerous "flat spots" for any of us. When you reach these points in an investigation, you must be committed to going further to truly reach "safe ground."
Many, if not most, problems have more than one contributing factor, and they often interact in complex ways. Celebrating your success after finding and resolving just one contributing factor puts you at great risk of the problem recurring, along with an embarrassing and demoralizing need to reconstitute your team and your investigation. Such a situation will also result in losing your customer's trust the next time you claim to have found the root cause of a problem.
Always ask, "Why did it happen?" once a root cause has been identified. In his article, "The Art of Root Cause Analysis," Vidyasagar Appana tells us that asking why five times is used for "drilling down to the real root cause." Often, the first root cause identified is the result of another failure, and correcting only the superficial issue will result in a reoccurrence of the issue. Asking why five times helps to pinpoint the underlying, ultimate cause of the failure by identifying the causes of the more obvious proximate causes and following them to the ultimate cause.
Suppose a root cause investigation is launched to determine why an assembly broke. The first 5 Whys question would be, "Why did the assembly break?" An investigation might show that the assembly broke because a bolt sheared during operation. Replacing the bolt would not be sufficient to prevent a reoccurrence, so this is only a proximate cause of the failure. We must ask, "Why did the bolt shear?" Perhaps the bolt sheared because it was the wrong size for the intended use. So, "Why was the bolt the wrong size?" Further investigation may show the wrong bolt size was specified on the drawing. It may be tempting to update the drawing with the correct bolt and close the issue, but this is still at the level of a proximate cause because this issue may reoccur elsewhere.
The team must then answer the question, "Why was the wrong bolt in the drawing?" Suppose the team discovers the bolt was selected by an inexperienced engineer. It may be tempting to simply retrain the engineer, but the team should ask, "Why did the inexperienced engineer select the wrong bolt?" Perhaps the inexperienced engineer selected the wrong bolt because there was no fastener specification. Now we have the ultimate cause of the failure and can look for corrective actions that will ensure there is not a reoccurrence. A reoccurrence elsewhere could only be prevented by digging down to the underlying cause, which in this case is the lack of a fastener specification.
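For readers who track investigations in software, the bolt example can be written down as a simple chain of questions and answers. The sketch below is only one way to record it: the `Why` class and the `proximate_causes` and `ultimate_cause` helpers are hypothetical names invented for illustration, and the code assumes that the last answer in the chain is the deepest cause found so far, which is only true if the team has kept asking.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a 5 Whys chain (hypothetical structure for illustration)."""
    question: str
    answer: str


def proximate_causes(chain):
    """Every answer except the last: causes that are themselves results of a deeper failure."""
    return [step.answer for step in chain[:-1]]


def ultimate_cause(chain):
    """The last answer recorded: the deepest cause the team has reached so far."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


# The bolt example from the text, one entry per "why".
bolt_chain = [
    Why("Why did the assembly break?",
        "A bolt sheared during operation"),
    Why("Why did the bolt shear?",
        "The bolt was the wrong size for the intended use"),
    Why("Why was the bolt the wrong size?",
        "The wrong bolt size was specified on the drawing"),
    Why("Why was the wrong bolt in the drawing?",
        "The bolt was selected by an inexperienced engineer"),
    Why("Why did the inexperienced engineer select the wrong bolt?",
        "There was no fastener specification"),
]

if __name__ == "__main__":
    for cause in proximate_causes(bolt_chain):
        print("Proximate:", cause)
    print("Ultimate: ", ultimate_cause(bolt_chain))
```

Writing the chain out this way makes it obvious when an investigation has stopped early: a chain that ends at "a bolt sheared" or "the drawing was wrong" still has a tempting quick fix attached, and no answer yet to the next why.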
Your customer, whether internal or external, might not be concerned any longer about the problem, but that does not mean that you are OK and can stop. If you don't find the key root cause factor or factors, then you have no assurance that you're under control, and you have not learned anything. We have often seen customers in these cases "re-remember" these problems, and often chastise the supplier for dropping the original investigation prematurely, or for failing to make progress.
When the problem "goes away" by itself, you can be thankful that the immediate time pressure has been relieved, but you must not conclude that you can stop investigating. If you don't know why the problem stopped, then you don't know why it started. In that case you are also not under control, and you are merely traveling obliviously until the problem reoccurs.
In all of these cases, it is imperative that you get back firmly in the driver's seat and steer your investigation through to the ultimate root cause. You must find all of the significant contributing factors so that you can fully turn the problem on and off. You must also understand the underlying physics of the failure mode, and you must steer toward validated corrective action. Only at that point can you be satisfied that you have successfully navigated yourself back to "safe ground."