I was a young trainee engineer on a night shift of a chemical production unit 30 years ago when an experienced operator overrode a safety system and sent chemical feedstock into the wrong reactor. The consequences could have been catastrophic; it was only through luck that they weren’t.
Following the investigation, the operator was disciplined…
Are you already ahead of me? Surely there were human factors at play here that may have influenced the individual’s decision-making process. Very rarely do people come to work with the intention of violating safety procedures.
Process Safety Frameworks
Process Safety is a widely used term in high-hazard industries, but finding a commonly understood definition isn’t straightforward; it can mean different things to different people. A frequently cited definition is “A disciplined framework for managing the integrity of operating systems and processes handling hazardous substances by applying good design principles, engineering, and operating practices”. Most engineers know it as the rigorous methods and processes for reducing the risk of a major fire, explosion, or other catastrophic event.
But what does that really mean in practice? And if you implement such a framework, is it enough?
I have spent 30 years working for operating companies in high-hazard environments, in both engineering and leadership roles, including heading HSEQ and Production teams. I have worked for some fantastic operating companies with “disciplined frameworks” that apply “good design principles, engineering, and operating practices”, but events still happen.
Humans aren’t machines
In any framework, decision making, management, implementation, and verification are all done by people. The phrase “to err is human” reminds us that we all make mistakes. Hardware design for high-hazard processes takes this into account, but it is too easily overlooked when designing softer systems such as operating procedures.
Yes, this event was clearly a “human factors” issue, so let’s stop for a second and think about what was happening.
It had become accepted practice to “apply an override” when valves or instruments were not working as they should. Valves would stick, switches designed to show that valves were closed wouldn’t work properly, or instruments would intermittently fail. The operator was following a well-used process and even had a permit for it. But he then made an error and opened the wrong valve. It was a lapse at the end of a 12-hour night shift, and he was nearing the end of his 14-day shift pattern. Maybe the conditions for the lapse had been put in place long before that night shift?
Why was this common practice? Why hadn’t the supervisor simply stopped the work until the equipment was repaired?
So, we need to look at the supervisors then?
Again, there were many human factors at play: the supervisors felt under huge pressure to keep the units running. The sticking valves were a known issue that had become common practice to “manage”, as the valves were old and needed replacing. Redundancies were around the corner, and everyone was fearful of losing their jobs. The supervisor on this night shift was also dealing with significant family issues that his colleagues didn’t know about.
So, we need to look at the production management team then?
Again, they were doing a tough job in difficult circumstances. The production team had put in numerous requests to have the faulty equipment replaced and had raised the issue many times with the finance team. The financial controller hadn’t approved the requests.
Then it’s the financial controller?
The financial controller wasn’t an engineer and had no knowledge of the production unit, but he was tasked with keeping the site’s finances within budget. He was doing the job to the best of his ability.
Ultimately, the site leadership team were accountable for the safe and reliable operation of the units, and we now understand that the culture a leadership team promotes can determine the frequency of events like this.
There were clearly many factors at play in this example. Maybe you recognise some of them from your own experience. Maybe the questions should have been “why did it happen?” and “what needs to change?” rather than “who was to blame?”
Well-designed equipment is only half the story
The unit was well designed, with quality equipment and well-trained, diligent people, but creeping change, ageing equipment, and changing practices driven by financial pressure all contributed to the event. Poor management of change, perhaps?
The question that should have been asked is why the operator, supervisor, unit leader, and financial controller made the decisions they did, rather than simply disciplining those closest to the event.
Reducing the stress on the operations teams, giving them equipment that works, treating abnormal operations as exactly that (“abnormal”), and making sure the team carrying the risk also has the ability to influence the budget and to assess change (including creeping change) as a risk would all be a start.
I learned a lot from this incident. The course of action taken at the time seemed unfair, and because nothing about the underlying system had changed, a similar incident was very likely to occur again.
Leadership questions
Maybe the leadership team at the time should have been asking:
- How are we ensuring that teams are not fatigued?
- Are we doing all we can to reduce distractions?
- Is there more we could be doing to recognise and help colleagues who have non-work-related problems?
- Have we put the budget in the hands of the teams who manage the risk?
- Have we trained our non-operational staff in process safety and shown them the effect that their decisions can have?
- Have we set a culture where recognising and stopping unsafe operation is rewarded?
But how do you really know?
These questions are not easy to answer. Many process safety performance indicators used in high-hazard environments attempt to distil difficult-to-answer human factors questions into easily understandable metrics. This relies heavily on trusting the data being analysed and presented.
Controls around human factors, such as monitoring shift patterns and process safety performance, have improved since the days of this event, but there is always more that can be done.
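To make the idea of a human factors metric concrete, here is a minimal sketch of one: flagging long runs of consecutive shifts in a roster. Everything in it, the names, the data, and the threshold, is hypothetical, chosen to echo the 14-day pattern in the story above; it illustrates the shape of such an indicator, not any particular product or method.

```python
from datetime import date, timedelta

# Illustrative threshold only: real fatigue models consider far more
# than consecutive calendar days worked.
MAX_CONSECUTIVE_SHIFTS = 13

def longest_run(dates):
    """Longest run of consecutive calendar days in a sorted list of dates."""
    longest = run = 1
    for prev, curr in zip(dates, dates[1:]):
        run = run + 1 if curr - prev == timedelta(days=1) else 1
        longest = max(longest, run)
    return longest

def fatigue_flags(roster):
    """Flag people whose consecutive-shift run exceeds the threshold."""
    return {
        name: run
        for name, dates in roster.items()
        if (run := longest_run(sorted(dates))) > MAX_CONSECUTIVE_SHIFTS
    }

# Hypothetical roster: name -> dates worked (each a 12-hour shift).
roster = {
    "operator_a": [date(2024, 3, d) for d in range(1, 15)],   # 14 days straight
    "operator_b": [date(2024, 3, d) for d in (1, 2, 3, 5, 6)],
}
print(fatigue_flags(roster))  # {'operator_a': 14}
```

Even a toy indicator like this shows why trust in the underlying data matters: if the roster is incomplete or shift swaps go unrecorded, the metric quietly reports that everyone is fine.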
Empirisys are leading advances in data collection and machine learning to help asset management teams spot these human factors patterns proactively, rather than finding them after an event.
Maybe if techniques like these, together with the progressive thinking that many organisations now show, had been widely available 30 years ago, this event might not have happened.
If you found this interesting, please let us know by getting in touch, giving us a clap or a follow. You can find more about us at empirisys.io, on Twitter at @empirisys, or on LinkedIn. And you can drop us an email at info@empirisys.io, or directly to the author of this blog at Andrew.gibson@empirisys.io