Sometimes, the evidence confronts you with a reality that is completely different from what you want to see.
We recently wrote an article about getting to systemic (root) causes in your failure investigations. In this case study, we’ll look at an investigation where the evidence pointed to a conclusion very different from what the team was hoping to find, showing how identifying root causes can be very confronting for the people involved.
A few years ago, we were asked to investigate the structural failure of a conveyor. The conveyor was down for a couple of weeks, impacting production the whole time, so it was obviously a major event that the client wanted to understand.
When we arrived at site, several of the maintenance team members informed us that the conveyor failed from fatigue due to overloading. However, there had been no failure analysis done on the structure – it was simply repaired, and the conveyor put back to work. We had no evidence other than photographs taken at the time, which appeared to show a significant amount of corrosion.
Without knowing the failure mode for certain, we had to run the investigation a little differently. After collecting all the information we could, including discussions with maintenance team members, maintenance history, inspection sheets and operating data, two facts emerged:
- The conveyor was operating at half its rated load capacity, so overloading was not a factor;
- While the conveyor was being repaired, the team inspected the structure thoroughly, finding and repairing another dozen cracks in the structure.
These facts suggested that, although we didn’t know the failure mode for certain (corrosion fatigue was a good bet but not provable), it should have been possible for the team to detect the cracks before failure. The next question was why they hadn’t detected them.
The investigation found the following indirect causes:
- Inadequate Strategy/Plan – There was a structural inspection strategy in the system, and it had been conducted, but it contained insufficient detail on what to look for and how to conduct the inspection (the previous inspector focused on handrails and grid mesh, not the structure itself).
- Inadequate Maintenance Execution – The structural inspections weren’t conducted thoroughly: if part of the structure was covered by spillage, it wasn’t cleaned and inspected, it was simply skipped. (Guess where the cracks occurred?)
- Inadequate Maintenance Execution – Defects from the inspections weren’t entered in the CMMS, so they weren’t getting fixed.
All of these causes were fixable, and fixing them would have prevented this failure from happening again – but what about all the other structures? This is an example of where digging deeper to find systemic causes and turbo-charging your learning is essential.
The investigation found the following systemic causes for the failure:
- System of Work – The site lacked a consistent approach to the operation and maintenance of conveyors (the failed conveyor didn’t have scrapers, which was contributing to the spillage);
- System of Work – The site only had a generic structural inspection task sheet, and did not tailor it for specific assets to provide direction to maintainers;
- System of Work – The site wasn’t including structural inspections in their maintenance plans (in this instance, they should have been included in the shutdown plans so operators could clean and prepare the structure for the inspection, and ensure there was enough time to access the structure and complete the inspections);
- System of Work – The inspections were carried out by the engineering team rather than the maintenance team, and so the completed inspections were stored in a separate location and not in the CMMS.
Behind these causes was an over-arching organizational factor: responsibility for managing and maintaining structures was not aligned. The engineering team were charged with carrying out the inspections, whilst the maintenance team were charged with completing the repairs. The engineering team weren’t included in the work management process, so their inspections weren’t being planned or executed properly, nor were the resulting defects raised. If the maintenance team did find defects themselves, they wouldn’t engage with the engineering team to get advice and input on designing and planning the repairs.
Each team’s goals were also misaligned – the engineering team didn’t have availability or reliability in their KPIs, so they had no incentive to ensure repairs were completed. Conversely, the maintenance team’s primary metrics were schedule completion and backlog, so they were happy not to have the structural work orders in the CMMS.
When we presented these findings to the site team, they were obviously taken aback, since the findings had nothing to do with overloading the conveyor. Our findings, particularly the root causes, went into areas they weren’t prepared for, and some team members were unwilling to accept them, or otherwise reacted badly. In this instance, we’d compiled a fairly extensive amount of evidence, so we were able to take them through each piece until they were willing (although not exactly happy) to accept the recommendations for change.
Had we accepted the initial information at face value, we would have completely missed the true cause of the failure. And if we’d stopped at the indirect causes, we would not have helped the site to bring about the organizational and process changes that they needed to improve all their structures, not just the one in question.
By Matthew Grant