Boeing, FAA, Space Shuttle Challenger, Richard Feynman, and Safety Culture

Maybe it’s universal: In any organization, safety culture is the first thing to rot, because it pays no dividends.  Safety rot can cut future dividends, but paying it forward has always been a weak motivation.

Early witness reports indicated a catastrophic mechanical failure, evidenced by fire, smoke, and a hole in the airplane. The software-based pitch control problem, which flies a Max nose-first into the ground, has no direct connection with these symptoms. A loss of pitch control could just as easily have resulted from severed hydraulic lines.

So the FAA concluded that two hull losses within 6 months were probably not related. In a classic wrong decision, the FAA initially chose not to ground the Max. When vertical flight profile data became available, it provided a smoking gun: the pitch control anomaly was active in the Ethiopian crash.

Yet it may not be THE smoking gun. The first cause of the Ethiopian tragedy seems to have been an engine failure, which the crew might have handled successfully had there not been the fatal distraction of a pitch control problem. Failure to anticipate such a combination means a deficient safety culture. At a minimum, that culture must include:

  • Exposure of pilots without specialized training to every conceivable scenario.
  • Supervision by a very antagonistic tiger team.
  • Prodding by a very antagonistic team of statisticians.
  • Glasnost from everybody. We’re not Russians!

Boeing doesn’t have this culture. Riding herd on Boeing is the FAA, staffed through the revolving door, with a deep love for companies, airplanes, and legend. It isn’t enough. It may even be a negative: passing FAA scrutiny may have assured the weak-minded at Boeing that they were doing the right thing. With safety culture, the only effective adversary is yourself. Question, question, never stop!

Only 6 years ago, Boeing experienced a similar lapse of safety culture with the 787 batteries. The 787 uses lithium-ion batteries, which cannot be made fail-safe. (Refer to the previous post.) Quoting (Wikipedia) Boeing 787 Dreamliner battery problems:

The National Transportation Safety Board (NTSB) released a report on December 1, 2014, and assigned blame to several groups:[3]

    • GS Yuasa of Japan, for battery manufacturing methods that could introduce defects not caught by inspection
    • Boeing’s engineers, who failed to consider and test for worst-case battery failures
    • The Federal Aviation Administration, that failed to recognize the potential hazard and did not require proper tests as part of its certification process

When I learned Boeing was using lithium batteries, long before the fires, I got a chilly feeling. Boeing was relying on the same judgment Sony had made about the ancestral 18650 lithium cell. In the lab, these batteries are resistant to trauma. But with the right kind of manufacturing defect, they become little bombs. So here’s another safety culture rule:

  • Assume every part is potentially defective, and see what happens.

I wondered why Boeing chose not to put the batteries in fireproof stainless-steel cases, vented to the outside. (Eventually, after 5 battery fires in one week, they did.)

The perverse reason: an airplane cannot be made fail-safe. To install the 787 batteries in an effectively fail-safe enclosure would be overkill, because there are too many other things that can bring a plane down. The logic: if you can make 1% of the single-point-failure gadgets in an airplane fail-safe, while the other 99% rely on redundancy, there is no measurable benefit to passenger safety.
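To see the arithmetic behind that logic, here is a minimal sketch in Python; the failure probability and item count are made-up illustrative numbers, not Boeing data. If one item accounts for about 1% of the total single-point-failure risk, making it fail-safe trims only about 1% off the overall hull-loss probability.

```python
# A minimal sketch of the "no measurable benefit" arithmetic.
# The per-flight failure probability and item count are illustrative assumptions.

p = 1e-7        # hypothetical per-flight failure probability of one single-point item
n_items = 100   # hypothetical number of such items on the airplane

def hull_loss_probability(n_failable):
    """Probability that at least one of n_failable independent items fails on a flight."""
    return 1 - (1 - p) ** n_failable

before = hull_loss_probability(n_items)      # all 100 items can fail
after = hull_loss_probability(n_items - 1)   # one item made perfectly fail-safe

print(f"before: {before:.3e}  after: {after:.3e}  reduction: {1 - after / before:.1%}")
# Prints a relative reduction of about 1% -- "not measurable" by this logic.
```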

The pattern:

  • Reliance on lab tests of parts that do not reflect manufacturing defects and real-world consequences.
  • The belief that, since an airplane cannot be made fail-safe, a part can be made too safe.
  • A safety culture heavily based on badly applied statistics, without effective oversight and challenge.

The buck stops at the Boeing executive suite, which means sales. Sales are the lifeblood of a plane company, contingent on geopolitics, backslapping, trade deals, technology transfer, and maybe even greased palms. Safety does not make the list.

Senile safety culture is a disease of for-profits. But how about non-profits? NASA holds a special place in our culture, achievement for the sheer joy of it. A lot of people would take a pay cut to work for NASA. And yet it happened at NASA, on January 28, 1986, with the Space Shuttle Challenger disaster. Nobel Prize winner Richard Feynman, already fatally ill, was appointed to the Rogers Commission. He found the cause, which he demonstrated in a simple yet dramatic way, with a C-clamp, a glass of ice water, and a piece of O-ring rubber.

America needed a hero to investigate the heroic. Feynman filled that role, but his own account differs from the myth. According to Feynman, individuals volunteered the necessary information, organized in a way that led him to the conclusion. Feynman said he never would have found it on his own.

They knew the answer before Feynman. (Wikipedia) Space Shuttle Challenger disaster shows that, at the engineering level, there was a vibrant safety culture, cognizant of what Feynman eventually discovered, which nonetheless failed to influence management.

Even within enlightened NASA, there was no one to tell. Even though NASA occupies a special place in the American mind, something prevented Feynman’s helpers from making known what they knew. Feynman filled the gap with his fame. With his ingenuity, on camera with a glass of ice water, a C-clamp, and a piece of rubber, Feynman gave us common sense.

Feynman discovered that in calculating the safety of the Space Shuttle, NASA had misused or ignored basic statistics, the kind that would flunk you out of a 2nd-year stat course. He showed that the chance of a Space Shuttle disaster was more like 1 in 100, not 1 in 100,000. With similar faulty reasoning, the FAA chose not to ground the Max.
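In the same back-of-the-envelope spirit, a quick Python check shows why two hull losses in a young fleet should have shattered confidence in the optimistic estimate. The flight count below is a round hypothetical, not actual 737 Max fleet data.

```python
# How likely are two or more hull losses in n flights if the per-flight
# crash probability is really p? The flight count is a hypothetical round number.

def prob_two_or_more(p, n):
    """Binomial probability of at least two crashes in n independent flights."""
    p_zero = (1 - p) ** n
    p_one = n * p * (1 - p) ** (n - 1)
    return 1 - p_zero - p_one

n_flights = 500_000                       # hypothetical flights flown by the fleet
print(prob_two_or_more(1e-7, n_flights))  # ~0.001: at the claimed-safe level, a wild fluke
print(prob_two_or_more(1e-5, n_flights))  # ~0.96: at a 100x worse level, all but expected
```

Two losses are hundreds of times more likely under the pessimistic estimate than under the optimistic one, which is the kind of elementary reasoning Feynman brought to the Shuttle numbers.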

Common sense said ground the plane. The FAA said no to common sense, because they don’t understand this simple equation from Decision Theory:

Average Cost of a Decision =
    (Chance of Getting it Right) × (Cost of Getting it Right)
  + (Chance of Getting it Wrong) × (Cost of Getting it Wrong)

The cost of getting it wrong is 300 lives.
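To make the equation concrete, here is a minimal worked example in Python. The grounding cost and the dollar value per life are placeholder assumptions; only the structure of the comparison matters.

```python
# Average cost of a decision = P(right) * Cost(right) + P(wrong) * Cost(wrong)
# All figures are placeholders chosen to illustrate the structure, not real estimates.

COST_OF_GROUNDING = 1.0e9      # hypothetical direct cost of grounding the fleet (dollars)
COST_OF_A_CRASH = 300 * 1.0e7  # ~300 lives at a commonly cited ~$10M per statistical life

def average_cost(p_wrong, cost_right, cost_wrong):
    """Expected cost of a decision, per the equation above."""
    return (1 - p_wrong) * cost_right + p_wrong * cost_wrong

# Decision A: keep the Max flying -- getting it right costs nothing,
#             getting it wrong costs another crash.
# Decision B: ground the Max -- the grounding cost is paid either way.
for p_wrong in (0.05, 0.25, 0.50):
    keep_flying = average_cost(p_wrong, cost_right=0.0, cost_wrong=COST_OF_A_CRASH)
    print(f"P(another crash) = {p_wrong:.0%}: "
          f"keep flying ≈ ${keep_flying:,.0f}, ground = ${COST_OF_GROUNDING:,.0f}")
```

With one anomaly-driven crash already on the books, the chance of getting it wrong is not small, and with these placeholder numbers the average cost of keeping the Max in the air climbs quickly toward, and then past, the cost of grounding it.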

It’s time to ground the FAA.