By Ron L. Hughes
FAILURE SCENE INVESTIGATION
Understanding the cause(s) of failure can seem a daunting task to the Failure Scene Investigator. This is understandable given the apparently chaotic circumstances that usually surround the incident under investigation. Too often the reaction to a failure is to put everything back into a known “acceptable condition” as fast as possible, without any real consideration given to solving the incident through a well-thought-out investigative process. Symptoms are not noted, or are ignored, and the evidence is either cleaned up or destroyed. When this occurs the failure will manifest itself again, typically at an unexpected time. The good news is that when this happens often enough we become more efficient at reacting to the problem and therefore seemingly better at correcting the situation rapidly, thus decreasing downtime or mitigating the consequences of the incident. This mindset, or restraining paradigm, is a failure unto itself.
Analysts must realize that the life of any component is not infinite but is predetermined by the stresses to which the component is subjected. Engineering designs therefore not only transform a need into a description of a product, but also take into account the design’s compatibility with the physical stresses the component is expected to see in meeting its functional requirements. This includes the life of the product (as measured by its performance over time), its reliability, and its maintainability.
As T.S. Eliot observed, “Failure is relative – it is what we can make of the mess we have made of things.” The key to success when analyzing failure, then, is not to react to a problem but to be proactive by treating the failure as an opportunity to learn.
Equally important to the analyst is the realization that failure seldom occurs for a single reason or comes from a single force or input. This quickly becomes evident when chasing all of the possibilities that the evidence leads the investigator to explore. Most failures are the result of multiple inputs and errors, and they are depicted on the logic tree as multiple legs that illustrate all of the cause-and-effect relationships of the failure.
READING THE CLUES
Every incident analyzed occurs within a specific timeline, running from when the anomalous conditions of the failure first manifested themselves to when the failure was safely isolated. The failure data found within this timeline provides the clues or evidence needed to uncover the cause of any incident or failure, be it sporadic or chronic. Every piece of data raises the question of “how can” this data be in the condition or position in which it was found. When the investigator can answer the “how can” questions and tie the anomalies to a specific point within the timeline, he or she has successfully followed the path to failure for the incident under investigation. In short, the investigator has found the root cause(s) of the failure.
Equally as important as understanding that the clues exist within a specific timeline is understanding the principles of how failure occurs within that timeline. The three principles of failure analysis are (1) order and pattern, (2) determinism, and (3) discoverability; they can be used during the investigation to follow the path that led to the incident.
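As a minimal sketch of tying clues to the timeline, the Python below keeps only the evidence that falls between the first anomaly and the point of safe isolation, then orders it oldest first. The `Clue` class, the `build_timeline` function, and all of the sample data are hypothetical and are here for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Clue:
    observed_at: datetime  # when the anomaly was recorded
    description: str       # what was found, and in what condition or position


def build_timeline(clues, first_anomaly, isolated_at):
    """Return the clues inside the incident window, oldest first."""
    in_window = [c for c in clues if first_anomaly <= c.observed_at <= isolated_at]
    return sorted(in_window, key=lambda c: c.observed_at)


# Hypothetical evidence log, deliberately out of order.
clues = [
    Clue(datetime(2024, 1, 5, 14, 40), "bearing housing discolored"),
    Clue(datetime(2024, 1, 5, 14, 5), "vibration alarm on pump"),
    Clue(datetime(2024, 1, 4, 9, 0), "routine inspection, nothing abnormal"),
]

timeline = build_timeline(
    clues,
    first_anomaly=datetime(2024, 1, 5, 14, 0),
    isolated_at=datetime(2024, 1, 5, 15, 0),
)

for clue in timeline:
    # Each clue inside the window prompts its own "how can" question.
    print(clue.observed_at.isoformat(), "| how can:", clue.description)
```

The inspection record from the previous day falls outside the window and is dropped from the timeline, although an analyst may still want it as background.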
Order and Pattern
There is order and pattern to everything in the universe: the sun comes up and the sun goes down, the tides go in and the tides go out, there are four seasons in a year, and so on. There is also order and pattern to failure. Given this simple principle, it is only logical to conclude that an order and pattern of failure exists within the timeline of the failure under investigation. The key is to read the clues to uncover the order and pattern that led to failure.
Determinism
Just as an order and pattern exists within the timeline of any failure, determinable effects exist within that order and pattern. Stated simply, every input produces a set of known outputs, and every output came from a known input; these are the determinable effects. The key to determinism is to make sure that the inputs and outputs are in the correct order and pattern (the cause sits below the effect in the logic tree). For example, does misalignment cause high vibration, or does high vibration cause misalignment? Both are possible, so the analyst must determine which cause-and-effect relationship actually occurred. In this scenario, either the equipment was misaligned initially, or it was aligned correctly and later became misaligned. Once this is determined, the cause-and-effect relationship is known: if the equipment was misaligned initially, misalignment caused the high vibration; if it was aligned correctly and became misaligned, high vibration caused the misalignment.
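The misalignment example reduces to a single decision, sketched below in Python. The function name and the `aligned_at_installation` flag are hypothetical; in practice the analyst would set that flag from evidence such as installation alignment records.

```python
def cause_and_effect(aligned_at_installation):
    """Return (cause, effect) for the misalignment / high-vibration pair."""
    if aligned_at_installation:
        # Aligned correctly, later found misaligned: vibration came first.
        return ("high vibration", "misalignment")
    # Misaligned from the start: misalignment came first.
    return ("misalignment", "high vibration")


cause, effect = cause_and_effect(aligned_at_installation=False)
# The cause is drawn below the effect in the logic tree.
print(f"{cause} -> {effect}")
```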
Discoverability
Seldom is there a single cause or a single path to failure. Discoverability, when applied in the investigative process, helps the analyst ensure that all the possible causes have been explored and accounted for by the analysis. The key is to start as broad and all-inclusive as possible and then work down to the specifics. By asking the question “how can” over and over again, and systematically working through the cause-and-effect relationships of the failure, the analyst explores and accounts for all the possible root cause scenarios.
Following the concepts of order and pattern, determinism, and discoverability makes it easy for the analyst to illustrate the investigation graphically on the logic tree and to document the analysis.
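The repeated “how can” questioning can be sketched as a small tree in Python, with each answer hanging below the effect it explains. The `Node` class, its `how_can` method, and the pump example are hypothetical. A leaf here only marks where the questioning has stopped so far; it becomes a root cause only once the evidence supports it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    statement: str
    causes: list = field(default_factory=list)  # answers to "how can this occur?"

    def how_can(self, statement):
        """Record one possible cause beneath this node and return it."""
        child = Node(statement)
        self.causes.append(child)
        return child


def paths_to_leaves(node, trail=()):
    """Yield every leg of the tree, from the top event down to a leaf."""
    trail = trail + (node.statement,)
    if not node.causes:
        yield trail
    for cause in node.causes:
        yield from paths_to_leaves(cause, trail)


# Hypothetical incident: start broad, then work down to specifics.
event = Node("pump bearing failed")
vibration = event.how_can("high vibration")
vibration.how_can("misalignment at installation")
lubrication = event.how_can("loss of lubrication")
lubrication.how_can("contaminated oil")
lubrication.how_can("missed lubrication route")

paths = list(paths_to_leaves(event))
for path in paths:
    print(" <- ".join(path))
```

Listing every leg this way gives the analyst a checklist: each one must be either verified or ruled out by the evidence before the analysis is complete.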
The Science of the Clues
The clues uncovered during the investigation can always be accounted for by a scientific explanation of the anomaly. For example, you can’t have electricity outside the realm of Ohm’s law, and you can’t have a fire without a heat or ignition source, a fuel source, and an oxygen source.
Even something as simple as color provides scientific evidence for the analyst. Color changes in materials indicate different exposures to temperature or corrosive products. The color of smoke changes with different fuel sources. The color of lubrication products changes with the loss of additives, contamination, temperature, or pressures that overcome the film barrier.
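Both examples can be turned into simple consistency checks on the evidence. The Python sketch below tests whether a set of electrical readings agrees with Ohm’s law (V = I × R) within a tolerance, and reports which element of the fire triangle has not yet been accounted for. The function names, the 5% tolerance, and the sample readings are assumptions for illustration.

```python
import math

FIRE_TRIANGLE = {"heat", "fuel", "oxygen"}


def consistent_with_ohms_law(volts, amps, ohms, rel_tol=0.05):
    """True if the readings satisfy V = I * R within the relative tolerance."""
    return math.isclose(volts, amps * ohms, rel_tol=rel_tol)


def missing_fire_elements(found):
    """Return the fire-triangle elements the evidence has not yet explained."""
    return FIRE_TRIANGLE - set(found)


# Hypothetical readings: 10 A through 12 ohms should give about 120 V.
print(consistent_with_ohms_law(volts=120, amps=10, ohms=12))
print(consistent_with_ohms_law(volts=120, amps=10, ohms=20))

# Fuel and oxygen are identified; the ignition source is still unexplained.
print(missing_fire_elements({"fuel", "oxygen"}))
```

A failed check does not mean the physics is wrong. It means a reading, an assumption, or a piece of evidence is, and that is where the next “how can” question belongs.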
Another key principle for the analyst is the mechanics of fractures. Fracture mechanics is the study of the propagation of cracks in materials; the related examination of fracture surfaces is properly called fractography (sometimes loosely “fractology”). Fracture mechanics uses analytical solid mechanics to calculate the driving force on a crack and experimental solid mechanics to characterize the material’s resistance to fracture. It is therefore an important tool for determining the expected mechanical performance of materials and components. By applying the physics of stress and strain (in particular the theories of elasticity and plasticity) to the microscopic crystallographic defects found in materials, the analyst can predict and understand the macroscopic signatures of mechanical failure seen in the morphology of the component’s fracture surface. This technique is used to understand the theoretical causes of failures and also to verify the forces that must have been present, based on the pattern of the fracture surface. In essence, the fracture face tells what kind of force caused the failure. If the type of force necessary to cause the failure can be found, it can be eliminated or mitigated, reducing the likelihood of recurrence.
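The comparison of driving force with material resistance can be illustrated with the standard linear-elastic relation K = Y·σ·√(πa), where σ is the applied stress, a is the crack size, and Y is a geometry factor. Fracture is expected when K reaches the material’s fracture toughness. The Python below is a sketch under stated assumptions, not a design calculation: Y = 1 (a through crack of half-length a in a wide plate), and the stress and toughness values are made up for illustration.

```python
import math


def stress_intensity(stress_mpa, crack_size_m, geometry_factor=1.0):
    """Driving force on the crack, K = Y * sigma * sqrt(pi * a), in MPa*sqrt(m)."""
    return geometry_factor * stress_mpa * math.sqrt(math.pi * crack_size_m)


def critical_crack_size(stress_mpa, toughness, geometry_factor=1.0):
    """Crack size (m) at which K equals the fracture toughness (MPa*sqrt(m))."""
    return (toughness / (geometry_factor * stress_mpa)) ** 2 / math.pi


# Assumed values for illustration only.
stress = 100.0     # applied stress, MPa
toughness = 50.0   # material fracture toughness, MPa*sqrt(m)
crack = 0.01       # crack size a, m

k = stress_intensity(stress, crack)
a_crit = critical_crack_size(stress, toughness)

print(f"K = {k:.2f} MPa*sqrt(m)")
print(f"critical crack size = {a_crit * 1000:.1f} mm")
print("fracture expected" if k >= toughness else "crack is below critical size")
```

Run in reverse, the same relation supports the verification described above: given a measured crack size and a known toughness, the analyst can estimate the stress that must have been present at fracture.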
When an analyst fails to achieve success, the tendency is often to simply redefine success at a level that can be more easily attained. Although this allows the investigator to move on quickly, it obviously limits the payback from the Root Cause Analysis effort. By changing this restraining paradigm to one that seeks the maximum return, accepting only true success and proactively conducting a fact-based analysis driven by the evidence, the paybacks become exponential rather than incremental. In summary, the quality of the clues of the failure, and the correct interpretation of what those clues tell the analyst, determine the degree of success for the incident under investigation.
Mr. Hughes, a Mechanical Engineer, is a member of the American Society of Mechanical Engineers (ASME) and the American Society for Training and Development (ASTD). He is currently a Senior Consultant for Reliability Center, Inc. His expertise encompasses all areas of Human and Plant Reliability, including the training, mentoring, and facilitation of Root Cause Analysis and the performance of Reliability Assessments worldwide for client companies.