Author: Dan DeGrendel | 19 November, 2018
When a critical asset fails in service and costs over $10 million to fix, it’s in your best interests to work out why it broke and what you can do to stop it from happening again, right? Yet, despite best intentions, many companies struggle to get to the bottom of a problem. Their investigation process lets them down.
Yes, Root Cause Analysis (RCA) is a useful strategy for investigating why something went wrong. But often, you need to take it further. Here, using an example from one of our customers, we explore how the Apollo Root Cause Analysis methodology adds immense value to RCA.
A postal processing and distribution center uses complex cross-belt sortation (CBS) systems to move packages between areas of the facility. Think of the CBS as a high-speed continuous loop made up of trays and feed conveyor belts, with transfer stations to move products between the two. These transfer stations can be activated up to 600,000 times per month, so maintenance crews regularly inspect the metal plates and rubber wheels at each transfer station for wear or cracks.
A crack in the system
The CBS system is highly efficient … until it breaks. Here’s what happened at the processing center: a metal plate at one of the transfer stations broke in half, with one half falling off and getting in the path of the CBS trays. This triggered a chain reaction, with trays piling up in the 10 seconds it took for the system to come to a complete stop. In this time, over 200 trays were damaged.
It took three days to procure spare parts and then another three to repair the CBS system. Lost revenue and repair costs exceeded $10 million.
What an internal problem-solving process discovered
An event investigation team was assigned to identify the root cause of the event and recommend solutions to prevent a recurrence.
Per standard protocol, the team used the company’s event investigation process, which defined the problem as the sequence of events as they happened. That is, a crack formed in the metal plate from millions of cycles over ten years of service. The crack grew until the metal plate broke in half, leaving one half of the plate in the path of the trays. The first tray struck the metal plate and started a chain reaction of tray-to-tray collisions. It took 10 seconds for the CBS system to come to a full stop.
The root cause was defined as “repeated flexing of the metal plate during use, which caused the metal to crack and eventually fail.”
From this, the team recommended three solutions:
- Increase the inspection frequency of the metal plates
- Provide refresher training on how to inspect the metal plates for cracks
- Increase the spare parts held on site to reduce repair delays
The approach taken by the investigative team didn’t explore the bigger picture, and the team realized that more needed to be done. So they asked ARMS Reliability to facilitate an event investigation using the Apollo Root Cause Analysis methodology. This deeper analysis revealed some interesting findings.
What the Apollo Root Cause Analysis methodology discovered
Using the Apollo method, ARMS Reliability spent 1.5 days working with the company’s investigation team to develop a detailed cause and effect chart and identify solutions to prevent a costly recurrence.
From the start, this approach differed from the company’s own internal investigation. Instead of defining the problem as it happened, we identified the problem as being “extended plant downtime.” From this broad definition of the problem, more specific issues emerged on the cause and effect chart. There were many “aha moments”.
The first was that the system did not, in fact, shut down because the controls detected a problem with the CBS system (the team’s initial assumption). Rather, it shut down because a package blocked a sensor on a feed conveyor. This meant that the system came to a controlled stop after all feed conveyors were cleared of product, rather than an emergency stop. The slower stop prolonged the shutdown and increased the damage to the system.
Other findings included:
- While the metal plates are inspected regularly, the physical configuration of the transfer stations, along with limitations of a visual inspection, minimizes the probability of detecting a crack.
- The transfer stations are 15 feet above ground and behind metal guards. Each guard is four feet wide by eight feet long and fastened in place with 48 bolts and lock washers. Over 300 guards were removed/replaced during the repair using hand ratchets – a tedious and time-consuming task.
- Personnel needed to go to the stock room to replace bolts and washers that were damaged or lost under/behind equipment, which added to the repair time and effort.
- Metal plate failure rate information is collected by each distribution center, but the information is not shared between sites.
- Although the CBS system components are common between the distribution centers, there is no central supply of spare parts to draw on in the event of a catastrophic failure like this one.
- CBS system vulnerabilities had not been identified because neither a Failure Mode and Effects Analysis nor a Vulnerability Assessment and Analysis had been performed.
- Wear of other transfer station components increased metal plate flexing and thereby increased the metal plate failure rate. However, scheduled maintenance was not in place for these transfer station components.
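To put the guard finding in perspective, the fastener workload can be estimated from the figures above. The short Python sketch below counts bolt operations for removing and refitting the guards. The guard count (300, a lower bound, since "over 300" were handled) and the 48 bolts per guard come from the investigation. The time per bolt is a purely illustrative assumption and was not part of the findings.

```python
def fastener_operations(guards: int, bolts_per_guard: int) -> int:
    """Count bolt operations when every guard is removed and then refitted."""
    return 2 * guards * bolts_per_guard  # one removal plus one reinstall per bolt


def labor_hours(operations: int, seconds_per_operation: float) -> float:
    """Convert a number of bolt operations into labor hours."""
    return operations * seconds_per_operation / 3600


GUARDS = 300           # from the findings ("over 300"), so a lower bound
BOLTS_PER_GUARD = 48   # from the findings

# ASSUMPTION: illustrative value only, not measured during the repair.
ASSUMED_SECONDS_PER_BOLT = 20.0

operations = fastener_operations(GUARDS, BOLTS_PER_GUARD)
hours = labor_hours(operations, ASSUMED_SECONDS_PER_BOLT)
print(f"{operations} bolt operations, about {hours:.0f} labor hours")
```

Even with a different time per bolt, the operation count alone (at least 28,800) suggests why quicker access and power tools could matter to the repair time.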
From this comprehensive list of findings, many more solutions were recommended to help safeguard against a similar failure in the future. Here’s what ARMS Reliability proposed:
- Add logic to CBS system controls to detect abnormal events and generate an emergency stop
- Add motor brakes to the drive system to minimize emergency stop time
- Add supports to the metal plates to reduce flexing and the subsequent failure
- Provide quick access points through the guards for inspections
- Provide inspection cameras and crack inspection dye for metal plate inspections
- Schedule metal plate inspections based on metal plate use rate instead of calendar time
- Perform a Vulnerability Assessment and Analysis on the CBS system and address high vulnerabilities
- Develop optimized maintenance strategies for the critical components of the CBS system
- Provide parts carts and power tools to reduce the time required to remove and replace a guard
- Share component failure rates between sites to better predict failure rates and the required scheduled maintenance
- Re-build the transfer stations on a schedule to address component wear before it becomes an issue
- Create contingency plans for catastrophic failures of the distribution center’s mission-critical systems
- Establish a central stock of CBS system spare parts
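One of these recommendations, scheduling inspections by use rate rather than calendar time, can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the class name, the field names, the station IDs, the activation counts and the inspection interval are all hypothetical, and the approach assumes the controls can report an activation count for each transfer station.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TransferStation:
    station_id: str                      # hypothetical identifier
    total_activations: int               # lifetime count, assumed available from the controls
    activations_at_last_inspection: int  # count recorded at the previous inspection


def stations_due(stations: List[TransferStation], interval: int) -> List[str]:
    """Return IDs of stations whose use since the last inspection has reached the interval."""
    return [
        s.station_id
        for s in stations
        if s.total_activations - s.activations_at_last_inspection >= interval
    ]


# ASSUMPTION: illustrative interval, not a value from the investigation.
INSPECTION_INTERVAL = 500_000

stations = [
    TransferStation("TS-01", 2_400_000, 1_850_000),  # 550,000 since last inspection
    TransferStation("TS-02", 900_000, 700_000),      # 200,000 since last inspection
]
print(stations_due(stations, INSPECTION_INTERVAL))
```

With a rule like this, a heavily used station comes up for inspection sooner than a lightly used one, whereas a calendar schedule would treat both the same.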
Comprehensive analysis an “eye opener”
By using the Apollo Root Cause Analysis method, along with a skilled facilitator, the team reached a point within just 1.5 days where they clearly understood the cause of the failure and, more importantly, could identify effective solutions to prevent a recurrence.
As is typical when using the Apollo method, there were several eye-opening moments when the team discovered how various flaws and deficiencies in their operation lined up to cause the event. By digging a little deeper, the team uncovered some serious issues, leading to a greater set of effective solutions to prevent the problem from happening again.