Normalization of Deviance
March 9, 2019 10:55 AM

Engineer Foone Turing looks at how pushing limits leads to failure.
The Challenger disaster wasn’t a single mistake or flaw or random chance that resulted in the death of 7 people and the loss of a 2 billion dollar spaceship. It was a whole series of mistakes and flaws and coincidences over a long time and at each step they figured they could get away with it because they figured the risks were minimal and they had plenty of engineering overhead. And they were right, most of the time…

On design:
And that’s an element everyone building anything should consider: Your system not breaking doesn’t mean it works and is a solid design. It might just mean you’ve gotten lucky, a lot, in a row.
And on a personal level:
So think about your workload (and by “work” I don’t just mean the 9-5 money-making sort of work). You have limits. And it’s not a bad thing when you have to cut back, when you have to relax, when you have to take time to heal. Because it often seems to be the nature of how we normalize what we’re successfully doing to keep pushing ourselves and not realize how close we are to being overloaded.

There’s nothing wrong with trying to avoid that point, and there’s especially nothing wrong with having to cut back on what you can do once you do hit that point. If you try to load 9 boxes in your car and only 7 will fit, you don’t get mad at the car for not “toughing it out.” You’re a machine with limits too. Those limits are different because you’re conscious and biological rather than computers and mechanical, but you’ve still got limits. Keep that in mind.
Original Twitter thread.
posted by Margalo Epps (40 comments total) 109 users marked this as a favorite
 
This is a fantastic article and I need to process it before having more specific thoughts.
posted by Navelgazer at 12:10 PM on March 9 [6 favorites]


The article is excellent, and it's definitely true that when systems go badly wrong it's often because a lot of different things failed, often over a span of time, and were ignored as just the new normal - or fixes were delayed to focus on more visible issues, since the existing redundancy and safety margins were enough to keep things rolling. Until one last thing means they aren't, and it all comes falling down, sometimes literally.

This is definitely a chronic problem in IT - ok, THAT backup system has a bad disk, but it's substituted the spare so it can manage without it while we wait for the replacement to be shipped, because who can afford to keep expensive spares on the shelf gathering dust, and we have this other system that'll cover us anyway. But it turns out that other system failed weeks ago due to a simple software error and just needed a reboot - the error was buried among log entries noting every time a file couldn't be backed up because it was in use, which was normal and not a cause for concern, so it hadn't been spotted yet.

Then one day somebody important deletes that critical data from the finance server because 'they're not good at IT', and when you get to the backup server you find another drive failed before the spare finished rebuilding overnight (because they're all the same age and make*), so it's now useless. You go to the disaster recovery system and get that horrible sinking feeling when you realise that's also busted, and you've just got that offline copy that hasn't been updated for months because you have two live backup systems. You hope it's still good, because although it's old, a bunch of manual updating will still get you back to where you should be after a few days - otherwise nobody is gonna get paid for god knows how long (this didn't happen to me personally, but I know a guy...)

* and the backup server was overheating in the first place, which probably caused the disks to fail, because there was a leak in the air conditioner, which couldn't be fixed without taking down the server room and all the gear for a few hours - and there was live stuff in there That Could Not Be Shut Down for Maintenance any time soon. I've had several things like that happen to me, but so far, thankfully, not all at once.

There was also a fascinating article linked in the comments about how the Titanic had a fire in the coal bunker before it even left the dock, but they basically left anyway and tried to fight it in transit, and that was possibly one reason why they were burning so much coal and thus sailing so fast - and that it may have weakened a critical watertight bulkhead.

The events that happened could have been avoided and all the people aboard needlessly died horrible deaths. In the end, the only thing the new evidence indicates is a new level of negligence and risk-taking by those in charge. A ship that was ill-equipped with lifeboats was sailed, while on fire, through an iceberg-infested area at full speed. That the British inquiry buried the testimony about the fire only adds to the tragedy.
posted by Absolutely No You-Know-What at 12:38 PM on March 9 [36 favorites]


This was a great twitter thread and I’m pleased to see that it’s been written up into an article.

Prior to reading the thread, I’d never heard of the term “normalising deviance”, but I used to work in occupational safety and health, and this is a frequent concern. In almost every major accident, you’ll see systematic failures that meant that systems which were supposed to provide redundancy - both physical systems and management systems - were undermined, because they broke, their breaking didn’t have any immediate obvious negative consequence, and therefore it was assumed that they were unnecessary.

“Oh, that alarm always goes off! It’s short-circuited or something!”

“Yeah we skipped the usual procedure, because of cost and time pressure, but we’ve been able to maintain production!”

It’s a fallacy that you see time and time again - the “got away with it once; therefore it’s safe” fallacy. And it’s really hardwired into us - I’ve lost count of the number of times in conversation that someone has dismissed the threat of global warming by comparison to other bullets that we’ve dodged as a species: from nuclear war, to H5N1, to the Millennium Bug and CFCs.

We’re just... bad at risk analysis, on a fundamental level.

Anyhow, I said at the beginning of my comment that I was interested in the term “normalisation of deviance”, and I’m slightly disappointed to report that it was coined by an American sociologist, Diane Vaughan, who isn’t mentioned in the article - despite the fact that she developed the term while, er, writing a famous book about the Challenger disaster. (Vaughan is quoted in one of the links, in fairness.)

Here’s an interview with her. A lot of truth to be found:
For example, common concepts explain misconduct by organizations, the deterioration of intimate relationships, and accidents and disasters. In each of my three books, there was a common pattern: a long incubation period filled with early warning signs that were either missed or misinterpreted or ignored. Concepts common to all are structural secrecy, the normalization of deviance, signals – missed signals, weak signals, routine signals. All of these are common in failures of all sorts.
[...]
When the US was attacked on September 11, 2001, the entire country was shocked and surprised. But the 9/11 investigating commission discovered that the terrorist attack too, was preceded by a history of early warning signs that were misinterpreted or ignored. Whether talking about failed relationships or terrorist attacks, the ignorance of what is going on is organizational and prevents any attempt to stop the unfolding harm. An important distinction between relationships failure and shuttles exploding, corporate misconduct, and terrorist attacks is politics and the decisions of elites who set problems in organizations in motion.
[...]
The basic lesson that sociologists bring is that the organization matters. If there are problems, the tendency of corporate or public agency administrators is to blame individuals. However, organization characteristics – cultures, structures, politics, economic resources, their presence or absence, their allocation, put pressure on individuals to behave in deviant ways to achieve organization goals. If you want to fix a problem, you can’t just fire the responsible person. You have to fix the organization, or else the next person to take the job will just experience the same pressures.
[...]
The people in the US as well as the courts of justice see the world as the result of individual failures. They think that if you find the individual responsible, you would solve the problems. But the problem is that if you take the responsible people out, new ones will take the job and social factors will just reproduce the behaviours and replicate the problems. America is a culture where individual achievement is everything. We consider that if you don’t achieve, you are responsible for your own failure. So looking beyond individual responsibility, to seeing what’s going on in the social context, how it works, what are the beliefs, the common culture, the political economy, etc., is something we sociologists believe explains human behavior. So it is important to target the real root causes when things go wrong, whether we are talking about relationships, shuttle accidents, or terrorist attacks. We want to know why people make the decisions that they do. When it comes to organization mistake, misconduct, or disaster, the blame usually goes to low level workers and middle managers. Blaming them works for the organization. It deflects attention from top administrators who make major decisions about goals and resources that affect organization cultures, and falls upon workers, affecting their actions.
posted by chappell, ambrose at 1:00 PM on March 9 [54 favorites]


I just read Comm Check, a book about the Columbia disaster, which got me interested in learning more about the Challenger disaster, and so some of this stuff is fresh in my head.

The twitter thread/blog post reminds me of the Swiss cheese model of accidents. This metaphor treats accident-causation factors as holes in a piece of Swiss cheese. Stack up enough pieces of Swiss cheese enough times, and factors will eventually line up in such a way that you've got a hole that runs all the way through your cheese stack. To avoid accidents, remove holes or add more slices to your stack. Yes, this is an imperfect metaphor, but it's useful.
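The multiplying effect of adding slices can be sketched in a toy Monte Carlo simulation (the 10% per-layer hole probability and the independence assumption are mine, just for illustration - real accident factors are rarely independent):

```python
import random

random.seed(0)

def accident_probability(layers, p_hole, trials=100_000):
    """Estimate how often holes line up through every defense layer.

    Each layer is an independent barrier with a hole in the relevant
    spot with probability p_hole. An accident requires every layer
    to fail at once.
    """
    accidents = sum(
        all(random.random() < p_hole for _ in range(layers))
        for _ in range(trials)
    )
    return accidents / trials

# Each extra slice multiplies the odds of full alignment by p_hole.
for layers in (1, 2, 3):
    print(layers, "layers:", accident_probability(layers, p_hole=0.1))
```

With independent 1-in-10 holes, each added slice cuts the lined-up-holes probability by roughly a factor of ten - which is also why quietly losing a redundant layer matters so much more than it appears to at the time.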

There was so much that had to go wrong for Challenger to fail. They launched at temperatures in which the vehicle was not certified to fly. The O-ring design in the solid rocket booster (SRB) itself was flawed. The O-ring material was brittle at cold temperatures.

But Alan McDonald, one of the engineers for the solid rocket booster manufacturer, provides additional information about contributing causes that I don't believe was included in the final report. (This is recounted in either this video or this video.)

Before launch, temperature measurements were taken at various parts of the vehicle. (I think this was done by an ice-check team, that was looking for ice buildup on the external tank (ET); any ice buildup could fall off after launch and impact the orbiter.) A temperature reading near the right SRB aft field joint measured at something like 10°F I want to say. I believe this reading was dismissed as an anomaly. McDonald states that this was caused by cryogenic fuels (which are fed into the ET at the top of the tank) cooling passing air, causing it to become dense and sink. This cold air would flow down the side of the vehicle and pool up beneath it. After the disaster, this situation was computationally modeled and the model accounted for some extremely light winds the night before the launch. The model results corresponded with observed cooling at the right SRB aft field joint.

That joint was the one that failed.

There was so much that had to happen for things to go wrong the way they did. The joint design, the inadequate material performance at cold temperatures, the decision to launch outside of the certified temperature range. But also the extremely light winds the night before the launch, which caused that specific joint to fail.

But then they got lucky on launch. Engineers concerned about this type of failure thought that the vehicle would be destroyed more or less instantly, consumed by a giant fireball on the pad. And there was a puff of smoke from the SRB moments after ignition when the joint breached. But the products produced by burning propellant fortuitously re-sealed the joint. About a minute after the launch, one of the engineers, thinking that they'd avoided disaster, was actually praying to god, thanking him for proving his concerns wrong. And then the disaster happened.

On top of everything else, Challenger also encountered the worst wind shear ever seen on a shuttle mission. The wind shear loading, which happened at the maximum point of aerodynamic pressure on the vehicle, caused the temporary seal to fail.

A plume of flame emerged from the SRB, and the SRB/ET rear attachment point was weakened. The SRB separated from the ET at the rear attachment point, the ET failed, and the vehicle was destroyed by aerodynamic forces.

So here's what went wrong:

The decision to launch outside of certified temperatures.
The decision to overrule the concerns voiced by engineers.
The joint design itself and the properties of the O-ring material at cold temperatures.
The cold weather on the night before and morning of the launch.
The extremely light winds that caused cold airflow over the right SRB aft field joint.
The unusually intense wind shear.

That's a lot of holes in a lot of pieces of Swiss cheese that had to line up in order to destroy Challenger and kill seven people.

It makes me wonder how many near-disasters the shuttle program had.

On STS-112, there was a foam strike on an SRB. Comm Check states that this strike was within inches of an electronic component that, if hit, could have resulted in the destruction of the orbiter. This was months before a foam strike (on a totally different place) resulted in the loss of Columbia.

(Also, I just want to say that I'm sure I'm getting some details wrong here, and I sincerely appreciate any corrections.)

But, yeah, what really went wrong was the normalization of deviance. Challenger never should have launched outside of the conditions certified for flight. Foam-shedding should not have been treated as a normal event.

I read a lot about the Columbia disaster after Trump was elected. There was so much data that was recovered from Columbia. It was comforting to me that you could look at this utter catastrophe, and see down to the millisecond how it unfolded: when "off-nominal trends" began to appear on re-entry; when things started to fail; the first moment that the entire crew would have known that something was seriously, majorly wrong. We can know exactly what goes technically wrong in a catastrophe. That's comforting.

But looking at this moment-by-moment timeline is a kind of false comfort; the root cause was an institutional failure that allowed this disaster to happen in the first place.

And yet, with the Columbia and Challenger disasters, you can see these machines trying to save themselves after humans made all the fuck-ups necessary to seal their fates. On re-entry, Columbia is adjusting its elevon trim and firing its RCS yaw jets to maintain aerodynamic control for as long as possible. Even after the entire left wing comes off, coolant is flowing through the payload bay doors until the fuselage comes apart. In its final moments Challenger is swiveling its main engines as much as possible to compensate for the asymmetric thrust caused by the SRB attachment point failure. People have compared the fueled shuttle stack to a giant living creature; it vents these gases and it's big and ready to just leap into space; you can feel the potential energy and the complexity and the sense of purpose. But the vehicle also does whatever it can to ensure its survival for moments longer, even when its destruction is unavoidable. That, to me, is what makes these machines most like living beings.

The thing that makes me most upset about Columbia was the explicit decision not to request images from spy satellites that could have determined the extent of damage. (And of course, there was more Swiss cheese in this accident — there was another tracking camera that could have provided better ground-based imaging of the foam strike on launch, but was either out of commission or the footage was out of focus; I forget which.)

Atlantis was being prepped for launch after Columbia. With better imaging of Columbia, a decision could have been made to launch a rescue mission. I mean, I can understand why one might think, like, "What are you going to do about this, really, if the shuttle is doomed to break up on re-entry? What if a rescue can't be launched and they linger up there in orbit, knowing they will die? What if Columbia really is fatally damaged, and a rescue mission is launched, and the same problem destroys Atlantis too? You now have nine dead astronauts instead of seven." And it would likely end the shuttle program, which would, practically speaking, mean the end of construction on the ISS.

But there is something about that decision not to know — the decision not to gather data, not to learn more — it goes beyond the human factors involved in the normalization of deviance. It is the normalization of the normalization of deviance. It is a system that is not only broken, but knows and accepts its brokenness.

I understand there are other reasons for not wanting to know. I can understand why someone wouldn't want to request time on an expensive intelligence asset, especially at that time, weeks before the invasion of Iraq.

I don't know. Foone's post is really good. It's a lot to think about. I live with depression; it uses up spoons; it throws more holes into more slices of Swiss cheese; whatever metaphor you want to use. It's easy to forget how many spoons you're using up every day; it's easy to forget how many holes you have lined up in all those pieces of Swiss cheese hidden under the top of the stack. We're human beings, and our jobs are to make things, and sometimes those things remind us of ourselves; we take care of ourselves for as long as possible, no matter what it looks like might happen.

Thanks for reading this; I know it's long. Apologies for any typos or weird wordings. Wish I could edit it, but my son is waking up from his nap now.
posted by compartment at 1:36 PM on March 9 [131 favorites]


I really related to the personal coping part of this. I do pretty well on a day to day level, keeping in mind what I generally have spoons for and what I need to turn down, to the point where most of the time I think of myself as normal and capable of doing what I want. Things often go a bit wrong on vacations, to the point that I'm much more strict about what I eat (IBS and prediabetes) and try to be careful getting snacks at appropriate times and sleeping enough and everything.

And then on a recent trip to Universal in Orlando I managed to hit a cascade of problems and just couldn't cope. (A panic attack curled up on the floor of The Three Broomsticks is about as unfun as it sounds.) I eventually figured out that getting up early (night owl), on Florida time (I don't adjust to time zones well), in heat and humidity, meant that I couldn't do things I normally can do some of, like stand in line or be in crowds. By the third day I discovered that anyone can rent a wheelchair and was able to go through lines again, with emergency anxiety medication for crowd coping.

So yeah, I can cope with one difficult thing or two or, oh no, there I am curled up in a ball, get me out of here.
posted by Margalo Epps at 2:35 PM on March 9 [7 favorites]


On a personal level, a great description of what happened in my last relationship: careless in maintaining a "credit" balance, taking risks on hot button issues, ignoring warning signs, above all taking for granted. (Yes, it's actually a lot more complex than that and it was possible to kind of patch things up, but a helpful reminder in What Not To Do.)
posted by blue shadows at 3:42 PM on March 9 [1 favorite]


As soon as I read "normalisation of deviance" my mind said: 'boys will be boys'. So much social inequality and social violence happens because incremental moves in those directions are not considered outside of acceptable norms.
posted by Thella at 3:49 PM on March 9 [11 favorites]


This is all extremely fascinating and I have nothing to add except this very long and thorough article (pdf) about how fire in the coal bunker did not contribute to the sinking of the Titanic. In short, the photographic evidence provided in support of this hypothesis does not fit the facts.
posted by hat_eater at 3:51 PM on March 9 [1 favorite]


Gawande's The Checklist Manifesto is worth reading.

When he develops a checklist for surgery, he doesn't just pull together a list of what might be a good idea to check. Instead he thinks carefully about what a surgical team has the time and attention to check on.

It's plausible that those big disasters aren't necessarily a matter of people not bothering to follow the rules, though that can be part of what's happening. Sometimes the people who are supposed to follow the rules don't have the resources to do so.
posted by Nancy Lebovitz at 4:22 PM on March 9 [15 favorites]


My first encounter with the expression "normalization of deviance" was in a disturbing 2015 piece, Bedford and the Normalization of Deviance (previously), written by Ron Rapp, an aviation journalist and charter and aerobatic pilot. In it he discusses the findings of an NTSB accident report about—and the circumstances surrounding—the 2014 crash of a Gulfstream IV in Bedford, MA: caused, to start, by the failure of the seasoned crew to release the gust-lock system because they had stopped using checklists, and then compounded horribly by an additional series of really bad decisions made by the crew despite blatant warning signs.
posted by bz at 4:27 PM on March 9 [11 favorites]


I wanna see a systems analysis like this for the New Orleans levees that failed, or police departments that have a big problem with killer cops
posted by eustatic at 5:38 PM on March 9 [7 favorites]


It's interesting to think that the remedy/opposite of this is Kaizen - continuous small improvements and a culture of everyone having equal responsibility for it rather than waiting for big ideas to come down from the top. The top is never going to have the bandwidth to care properly about the little details that are being threaded by the engineers on the ground and yet we allow them to call those shots over and over and over.
posted by bleep at 6:27 PM on March 9 [10 favorites]


Eustatic, I don't know what's been written about the New Orleans levees, but systemic issues may be discussed in the DOJ reports that precede consent decrees with police departments. (Here's one for Portland, Oregon.)

This is not my area of expertise and probably not exactly what you're looking for, but it's the closest thing I know of to what you describe. I believe the consent decrees are intended, at least in part, to address systemic issues identified by the DOJ investigations. There has been some work done to analyze the effectiveness of the consent decrees and they seem to be at least somewhat effective. Consent decrees don't fix all the problems, and I don't think they're preceded by reports that identify all the problems, but I believe that they can save lives.

As for social issues outside of the US, there was a national commission on truth and reconciliation in Chile that addressed the 1973 coup and associated murders and other human rights violations. It issued a fairly exhaustive report on who was killed by whom and for what reasons, and included background information and proposals for reparations and preventing similar tragedies in the future.

Searching for that report led me to the Wikipedia page that lists truth and reconciliation commissions; there have not been many in the United States. A commission in Maine issued a report "to investigate whether or not the removal of Wabanaki children from their communities has continued to be disproportionate to non-Native children and to make recommendations ... that 'promote individual, relational, systemic and cultural reconciliation.'" It issued this report, which asserted that the conditions it described "can be held within the context of continued cultural genocide, as defined by the Convention on the Prevention and Punishment of the Crime of Genocide, adopted by the United Nations General Assembly in 1948."

I am aware of Monroe Work and his exhaustive studies on lynchings in America because of a previous post on MetaFilter.

You are right that there are big, important issues that have received too little scrutiny. More work needs to be done to study, understand, and address these issues. And more money needs to be spent on it. NASA spent $18.7 million directly on the Columbia Accident Investigation Board, and another $100 million supporting that investigation. FEMA spent over $200 million related to the accident. The Challenger disaster was investigated by a Presidential Commission.

There are long-standing problems deserving of far greater levels of attention than a space shuttle accident. But it's not like no one is working on these problems already. Find and elevate the voices of those who are.
posted by compartment at 7:10 PM on March 9 [6 favorites]


Not being a scientist, I was very pleased to see, after reading all that science stuff, that this piece actually applied to humans and their complex lives. We are, after all, more complicated than rockets, but analogies to simple machines are helpful to our disordered minds.
posted by kozad at 8:02 PM on March 9 [2 favorites]


I honestly can't think of a better phrase than "Normalization of Deviance" to sum up the last half century of U.S. politics, with its ultimate apotheosis as Trump.
posted by longdaysjourney at 8:22 PM on March 9 [7 favorites]


One of my favorite tech blog posts by Dan Luu is titled (and is about) Normalization of Deviance. (Published in 2015 despite the retro styling.)
posted by rivenwanderer at 8:37 PM on March 9 [4 favorites]


Reminds me of all of the perennially broken, will never be fixed, always have to do more and more complicated workarounds technology issues at work. And when everyone explodes on you at work too.
posted by jenfullmoon at 8:40 PM on March 9 [3 favorites]


Your system not breaking doesn’t mean it works and is a solid design. It might just mean you’ve gotten lucky, a lot, in a row.

as a motorsports fan, this hits home. A certain weekend in May 1994 comes immediately to mind.
posted by philip-random at 9:15 PM on March 9 [1 favorite]


In motorsport, Group B rallying in the '80s would be the ultimate example, I reckon. Basically Formula 1 with the crowd on the track - it's a wonder it lasted as long as it did.
posted by deadwax at 5:14 AM on March 10


This is a really good article and timely for me, as Tuesday is my daughter's 15th birthday, or would be, had the following not occurred:

- she had the cord around her neck twice, which pinched and cut off bloodflow
- she got stuck in a position where she 'turtled' although I have never found out whether it was her shoulders or something else
- the hospital where I was delivering closed the L&D ward for being at capacity after I arrived, so I was the last one in
- the hospital had a "natural" (ugh) delivery focus, was proud of its low c-section rate; they had had several c-sections already that 24 hr period
- the nurse in charge of monitoring was not a trained RN - she was unable to interpret the tracings correctly
- the nurse also had a bad relationship with her team because of that and didn't want to bring in the charge nurse
- there was not a culture of checking in on patients proactively
- the obstetrician I saw at 2 pm said I had to be done at 4pm or he had to be informed, neither he nor anyone did that
- at the time that the distress was finally understood, all OBs were in surgery, one with a situation that had come through emergency, not L&D
- after my daughter was intubated she was extubated too early and had to be on a manual bag for almost an hr until they could get the tube back in

So...yeah.
posted by warriorqueen at 7:19 AM on March 10 [20 favorites]


Charles Perrow's Normal Accidents is an excellent, readable introduction to this concept. Published in 1984, but still all too relevant.
posted by Weftage at 8:23 AM on March 10 [1 favorite]


another thing that comes to mind here is teenagers and how they calculate risk. Can't find a direct link right now, but I recall reading the results of some research that suggested that, far from being blind to risk, teens were often quite aware of it, but rather than focus on potentially disastrous outcomes were more likely to calculate the odds of such happening.

In other words (and I definitely remember thinking this way myself with regard to driving), if the apparent odds of something bad happening are a hundred to one, then what's the big deal? Problem is, if you take that sort of chance every day, something bad's going to happen at least three times a year. And so on. My driving record was blemish free for my first almost three years, but suddenly, in a period of less than three months, I had three accidents ... and all just bad luck (or so I told myself at the time).
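The arithmetic behind "at least three times a year" is just repeated exposure to a small daily risk - here's the back-of-the-envelope version (the 1-in-100 figure is the hypothetical from above, not real accident data):

```python
# A 1-in-100 daily risk, taken every day for a year.
p = 1 / 100
days = 365

# Expected number of bad outcomes per year (linearity of expectation).
expected_incidents = days * p

# Chance of at least one bad outcome, assuming independent days.
p_at_least_one = 1 - (1 - p) ** days

print(f"Expected incidents per year: {expected_incidents:.2f}")  # 3.65
print(f"Chance of at least one:      {p_at_least_one:.1%}")      # 97.4%
```

So a risk that feels negligible on any given day becomes a near-certainty over a year of repetition - which is exactly the trap.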

Worth noting, I'm now almost sixty and haven't had any kind of incident in almost forty years. Apparently I learned something.
posted by philip-random at 8:56 AM on March 10 [4 favorites]


adding to compartment's comment:
The space agency spent $18.7 million on direct costs of the Columbia Accident Investigation Board, the 13-member panel of outside experts who analyzed the Feb. 1 tragedy. That spending included transportation, the hiring of consultants and administrative costs such as the printing of the 248-page report.

I'm sure they made a lot of copies of this report. Probably in color! How does EIGHTEEN MILLION dollars bleed away? Government waste (and terrible men) that make staples $149 each? Consultants that cost like, $6500/hour? I mean, can someone help me understand where that much money goes? Maybe tests that can only be run in labs that cost tens of thousands of dollars to use?
posted by Glinn at 9:01 AM on March 10


what if you’ve been doing that for a while? You’ve been going 110% all the time. It’s worked out just fine. You’re doing great, no problems. You start to think of 110% as the new normal, and you think of it as just 100%.
(blank look)

These go to eleven.
posted by flabdablet at 9:33 AM on March 10


I would guess that each of those 13 members indirectly kept 10 - 100 people busy researching, conducting tests, consulting the law, etc., etc.
posted by maxwelton at 10:31 AM on March 10 [1 favorite]


Glinn: I mean, can someone help me understand where that much money goes?

Sure, read this. (There are two pages).

Here's a highlight:

Wells said the plan is to search by foot a corridor 240 miles long and four miles wide along the center line of the shuttle's flight path. Standing 10 feet apart, the investigators are walking and visually searching about 1,000 square miles of Texas.

I mean, how expensive could that be, like, what, a few bucks a day? Damn grifting government employees and their color copies...

Speaking as a software and test engineer who's worked on things that stick into actual humans: Doing things right - like, really right, like, people actually-could-and-did-fucking-die kind of right - is incredibly difficult and expensive and slow. Trying to make it fast, as in the case of the investigation, makes it even more expensive.
posted by cowcowgrasstree at 10:39 AM on March 10 [10 favorites]


How does EIGHTEEN MILLION dollars bleed away?

My point wasn't that too much effort had been spent on the shuttle accident investigations, but that too little time, effort, and money are expended to understand other issues of importance. I think this is in part because investigations with technical components offer the allure of that false hope I described above: the idea that you can understand why things happen simply because you can understand the proximate technical causes.

As someone who likely will never in my lifetime come close to being worth eighteen million dollars, I understand your feeling that that's a lot of money. However, it's a small fraction of the cost of a typical, successful shuttle mission, and "bled away" is not the term I would use to describe that spending.

I want to ask you, in good faith, what you think should have been spent on the Columbia accident investigation, and how you would arrive at that figure? What should it cost to investigate a spaceship crash? How much should it cost to investigate an "ordinary" aviation disaster? The TWA 800 investigation cost $100 million according to some estimates. How does the Columbia investigation compare to that, in terms of difficulty, necessity, and degree of specialization?

Or, more importantly, I guess the real question is: How should we place a value on the costs of acquiring different types of knowledge?

In addition to the actual cause of the event, which needed to be proved, there were a variety of hypothesized causes that the CAIB needed to rule out. I want to say there were like six or eight hypothesized causes, all of which presumably would have required consulting from experts. This was everything from terrorism/sabotage (an especially heightened concern less than a year and a half after 9/11) to micrometeorites and space debris.

A master timeline of re-entry was established and it went through many iterations as additional information was added and errors were corrected. The timeline of events, including various debris-shedding events, was established with the help of amateur video recorded across the country, all of which needed to be precisely synced to that timeline.

Columbia, as the first shuttle to fly, was a "test mule" of sorts and was wired with all sorts of instrumentation that was recovered. It was late 70s/early 80s technology that fell to earth from space. I would presume that expensive consultants were required to recover data from old equipment that fell from space.

How much should it cost to bring in metallurgists to analyze the droplets of re-solidified molten metal, and then consult with engineers on where that metal came from on the vehicle? What corners could have been cut on the Columbia investigation? What work could have been skipped?

In addition to investigating the cause of the accident, the board was also charged with issuing return-to-flight recommendations. How much does it cost to develop safety standards for the automotive industry, and how much should it cost to develop safety standards for human spaceflight?

Wells said the plan is to search by foot a corridor 240 miles long and four miles wide along the center line of the shuttle's flight path. Standing 10 feet apart, the investigators are walking and visually searching about 1,000 square miles of Texas.

A friendly note: Search-and-recovery costs were separate from the expenses incurred by CAIB; it is astonishing to me that workers recovered as much of the vehicle as they did. There is a recently published book called Bringing Columbia Home that details the recovery; I haven't yet read it but hope to do so soon.

Government waste (and terrible men) that make staples $149 each?

Fun fastener-related-cost-overrun fact: NASA's OIG investigated CAIB expenses and found that, with a few exceptions, everything was generally on the up-and-up. However, one of the areas of overspending identified by the OIG was, in fact, related to fasteners. It wasn't staples but some fasteners on the shuttle itself. An unnecessary $30,000 was spent on a subcontractor that examined and analyzed some failed fasteners on the shuttle. The OIG didn't find that the work itself was problematic, but rather that too much was paid, and it should have been paid for differently. NASA's contracting officer was instructed to ask the contractor for a voluntary refund, but I have no idea what came of this.

I'm sure they made a lot of copies of this report. Probably in color!

2,600 copies (probably in color) were printed at a cost of $60.87 per unit. A separate investigation by the Government Publishing Office (referenced by the later OIG report) found that the CAIB wrongly decided to publish the report through a private printer, and not the GPO. Total costs for publishing outside of the GPO were $158,253. The GPO estimated cost savings of somewhere between $14,253.00 and $112,253.00 had CAIB used the GPO. The added expense here was a small fraction of CAIB's total budget. It shouldn't have happened, but it seems to have been an honest error, and is nothing close to being on the scale of the many billions of dollars in unaccounted-for cash that the US flew into Iraq.
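
For what it's worth, the quoted printing figures are internally consistent; a quick check (all figures from the GPO investigation cited above):

```python
# Checking the report-printing figures quoted above against each other.
total_cost = 158_253.00   # cost of publishing outside the GPO
copies = 2_600

per_unit = total_cost / copies
print(f"{per_unit:.2f}")  # 60.87, matching the quoted per-unit cost

# The GPO's estimated-savings range implies it would have charged between:
gpo_low = total_cost - 112_253.00   # 46,000.00
gpo_high = total_cost - 14_253.00   # 144,000.00
print(gpo_low, gpo_high)
```

The implied GPO estimates come out to suspiciously round numbers ($46,000 and $144,000), which is about what you'd expect from a ballpark quote.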

At any rate, NASA's OIG found that a total of 2.4% of CAIB's expenditures were questionable, but that "they occurred for unique reasons and did not represent systemic weaknesses in controls." The OIG found that, given the timeframe and circumstances under which CAIB operated, this was actually pretty good performance, but also recommended that NASA "revise the Agency Contingency Action Plan for Space Flight Operations to identify an administrative structure and staff that will establish the necessary financial and procurement controls when a major mishap board is initiated."

Circling back to the TWA 800 comparison: Obviously the cause of a jetliner accident affects potentially millions of people; many, many people travel on any particular model of jetliner. A single spaceship accident affects relatively few people. But I would argue that human spaceflight, as the farthest limit of human exploration ever, is one of the things that defines us as a species. I view the societal cost as worth it because exploration is part of who we are, as much as we are artists and performers and storytellers and everything else that makes us human.

Frequently, this history of exploration has cast a long shadow of misery and oppression. Space travel is a form of exploration that shines light into ourselves and into that shadow. It is an opportunity to reorient a fundamental human tendency toward goodness and light. By stepping away from the planet, I think that humanity sees itself as it is.

As Apollo astronaut Ed Mitchell said on seeing the Earth from afar: "You develop an instant global consciousness, a people orientation, an intense dissatisfaction with the state of the world, and a compulsion to do something about it. From out there on the moon, international politics look so petty. You want to grab a politician by the scruff of the neck and drag him a quarter of a million miles out and say, 'Look at that, you son of a bitch.'"

To use the spoon analogy: It takes a lot of spoons to function as a civilization. I think that the spoons that we spend on space exploration are worth it. I think that the spoons we spend on wars of choice are not. As a society, as a civilization, as a human race, we are overextended. I think it is good to spend our spoons on something where we can leave home, and return again to say "Look at that; that's us." Only by seeing ourselves as we are can the human race take better care of itself and its future.
posted by compartment at 12:37 PM on March 10 [25 favorites]


I posted about the Failure Knowledge Database once before on the blue - a lot drier than this post but might be of interest all the same.
posted by Rumple at 1:32 PM on March 10 [3 favorites]


I want to ask you, in good faith, what you think should have been spent on the Columbia accident investigation

It appears I have miscommunicated. I wasn't arguing it was too much, because I wouldn't have any idea what sorts of processes are available or what costs so much (many of which you articulated, and thank you for answering my question). I do understand there may be entirely valid reasons (most of which you have described), but sometimes bad players get involved, which, I don't know, was probably some dumb conjecture on my part. (I feel like my comment got taken personally, and it wasn't meant to accuse anyone, for goodness sake.)
posted by Glinn at 1:49 PM on March 10 [2 favorites]


bz has already linked to the previous post on this topic, in which I left a comment discussing the Haddon-Cave report into the loss of an RAF Nimrod patrol aircraft over Afghanistan. The author (then an experienced aviation lawyer, now a Court of Appeal judge) specifically referred to both the Challenger and Columbia investigation board reports and the 'uncanny' similarities with the failings that led to the Nimrod crash.
posted by Major Clanger at 2:32 PM on March 10 [1 favorite]


As I said here back in 2016:

The Teacher in Space Project (TISP) was one of Reagan's flashier and more cynical election promises during the 1984 Presidential campaign, and the glorious fulfillment of that promise with the inclusion of teacher Christa McAuliffe among the crew of the Challenger was meant to be the emotional high point of Reagan's 1986 State of the Union Address, originally scheduled for that night of Jan. 28.

I doubt concern over a cold-stiffened O-ring ever had much of a chance of derailing that particular launch.
posted by jamjam at 4:14 PM on January 28, 2016


I think there is very good reason to suspect that launch took place against sound engineering advice for political reasons, and that the headlong rush to blame it on a failure of safety systems is essentially a coverup.

Interesting but depressing to see how well that's succeeded.
posted by jamjam at 3:29 PM on March 10 [3 favorites]


I think the launch being pushed through for political reasons IS a failure of safety systems.
posted by bleep at 3:41 PM on March 10 [12 favorites]


Nice to see systems thinking cast in terms of individual behavior as well as multi-faceted technical issues.
posted by ZeusHumms at 8:21 AM on March 11


I work in space and aviation and this is what keeps me up at night. There’s always the pressure that if you raise issues or push back, you’re not being a team player or costing the customer money or generally being annoying. I raised many issues at my last job, which turned into CYA exercises for me because the response was always, “well, we’re already over budget and late so we’ll do a surface-level risk analysis and call it a day.”

I read that Rapp article about the Bedford crash and my original point about it still stands (I think it’s in that previous thread). There was an engineering failure allowing multiple revisions to the design of the throttle quadrant to be “qualified by similarity” to a point where what the pilots had in their hands was not what the FAA originally signed off on. But it’s cheaper to argue, “you already approved this, and this new revision isn’t too much different...” and you do that for the subsequent five or ten revisions. Similarity qualifications as they work now are clear deviations from proper process so that folks can save a few bucks in testing.
posted by backseatpilot at 8:39 AM on March 11 [1 favorite]


And now that I think about it for more than a couple minutes, I guarantee we’re going to see some similar findings about deviations come out of the recent crashes of the 737 MAX. New envelope protection systems, poorly understood by flight crews and barely covered in training, qualified by similarity to older systems that behaved just differently enough to get the pilots in trouble.
posted by backseatpilot at 8:46 AM on March 11 [4 favorites]


I have a hard time with this because if you are doing something difficult you are not going to know the limits until you cross them - you can only test what you know about and maybe what you can imagine and afford.

Just because we can technically identify a failure after the fact doesn't mean that it was a solvable failure case. People here are scoffing at the numbers spent on these cases, and rightly so, because nobody would ever spend that much on testing up front. The road case is a perfect example - the road was designed for extreme failure in the first place.
posted by The_Vegetables at 12:40 PM on March 11


Where I work, the normalization of deviance theory, and fighting it, is taken as a way of life. We use not only the "swiss-cheese" theory described above (we call it the 'switch' theory, as in, when all the switches close, the bad thing happens) but other tools as well when analyzing our more serious problems.
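
The switch framing has a nice quantitative side. A toy model (all probabilities invented for illustration) shows both why layered defenses work and why quietly normalizing away one layer is so dangerous:

```python
# Toy "switch"/swiss-cheese model: an accident requires every independent
# safety layer to fail on the same demand. All probabilities are invented.
layer_failure_probs = [0.05, 0.10, 0.02, 0.08]

p_accident = 1.0
for p in layer_failure_probs:
    p_accident *= p
print(f"{p_accident:.1e}")  # 8.0e-06: sloppy layers still rarely line up

# Normalization of deviance quietly removes a layer ("we always get away
# with skipping that check"). Day to day, nothing visibly changes:
p_degraded = 1.0
for p in layer_failure_probs[:-1]:
    p_degraded *= p
print(f"{p_degraded:.1e}")  # 1.0e-04: 12.5x riskier, with zero feedback
```

The punchline is the last line: the degraded system produces no extra bad days for a long time, which is exactly the "you've just gotten lucky, a lot, in a row" problem from the article.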

Another useful tool we use is Heinrich's pyramid (which some people call the Kinsey pyramid, though I don't know why). We level-ize all our problems to determine the scope of the investigation we need to do - bottom-of-the-pyramid types might be a simple fix-and-move-on, while mid-level ones might get more of a "how did this happen and how can we prevent it" root cause analysis. A key feature is we track and make sure every low-level one gets fixed in a timely manner, because (per the switch theory) the low-level ones cause the high-level ones later.

An often-overlooked feature of the pyramid model is that the ratios let you see, statistically, how your self-criticism on small problems is going. There should be few or no top-level ones, some proportion of mid-level ones, and bunches of low-level ones. If you're having a lot of mid-level ones but not a whole lot of low-level ones, you aren't looking hard enough for the low-level ones, because they're there.
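
That ratio check is easy to sketch. This is my own illustration, not ctmf's actual tooling, and the threshold is an arbitrary assumption:

```python
from collections import Counter

# Hypothetical incident log; 1 = low-level, 2 = mid-level, 3 = top-level.
incidents = [1] * 120 + [2] * 14 + [3] * 1

counts = Counter(incidents)
low, mid = counts[1], counts[2]

# Heinrich-style expectation: many more low-level reports than mid-level.
# If the ratio collapses, the usual cause is under-reporting of small
# problems, not their absence. The floor here is an invented threshold.
LOW_TO_MID_FLOOR = 5.0

if mid and low / mid < LOW_TO_MID_FLOOR:
    print("warning: low-level problems look under-reported")
else:
    print("reporting ratios look plausible")
```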

We also try very hard to make a culture of self-reporting, and not taking your stuff being reported by others personally. It needs to be 'no-fault, just fix it' for low-level ones. People who get penalized for low-level deficiencies don't have fewer deficiencies. They just report them at much lower rates.

All that said, it's very expensive and a pain in the ass, and very hard not to take the "nitpicking" as "all criticism, all the time". But it's totally worth it in a field where the top-of-the-pyramid problem could be life, health, environmental, AND political disaster.
posted by ctmf at 5:14 PM on March 11 [6 favorites]


I guarantee we’re going to see some similar findings about deviations come out of the recent crashes of the 737Max.

Backseatpilot, here's an unrolled thread on cascading issues: "Bottom line don’t blame software that’s the band aid for many other engineering and economic forces in effect."
posted by MonkeyToes at 4:57 AM on March 18 [3 favorites]


As to the economic factors... the FAA handed some of its work to Boeing to cut costs and failed to notice some critical changes in the 737 MAX design. After the first crash, Boeing started to work on a fix, but its certification was delayed because of the government shutdown.
posted by hat_eater at 9:41 AM on March 18 [1 favorite]


More of a legal analysis than an engineering one, of course, but on today's Democracy Now! (full episode, direct .mp4, alt link, torrents 1, 2, m) the lawyer for the family of Ralph Nader's grand-niece, Samya Stumo, who died on Ethiopian Airlines Flight 302, announcing their first of undoubtedly many lawsuits against Boeing, had a concise summary of root causes:
The history is that Boeing, 10 years ago, was facing competition, and it was facing competition from Airbus. There’s no secret. So, around 2010, Airbus was coming out with more fuel-efficient engines. Boeing saw that as a threat to their international competition for the sale of aircraft. They knew that they were behind, and they needed to get their planes out in the marketplace that could compete with Airbus quickly; ergo, the motivation.

What they decided to do is they decided that they couldn’t wait the amount of time it would take to fully redesign an aircraft, so they took a shortcut. They used the existing airframe, and what they did is they decided to put larger engines and more fuel-efficient engines on that plane—except for a couple of problems. When you put larger engines on a plane that was that old and vintage—the plane was designed where the wings are very low to the ground. So when you put those larger engines on, you need more clearance. So what happens is, you have to move those engines forward. You’ll also have to move the landing gears forward. And when you change the position of the engines, you change the landing gear, you change the aerodynamics of the aircraft.

Now, when that happens, you do those kinds of changes, you have to retrain pilots, because the plane behaves differently. And in this case, the plane, because of the larger engines, has a tendency to thrust upward faster and more powerfully than the originally designed 737 model.

Well, rather than spend the time and force air carriers to take time to train their pilots and to go through more costly training, the decision was that Boeing would come up with its own software that would help have the plane behave the same way the older 737 behaved. And they did that with the design of the MCAS system. The MCAS system is an automated system that would control the tendency of the airplane to buck upward, that when it sensed the plane and the nose of the aircraft was moving up, the automatic signals would be sent to the MCAS system and the horizontal stabilizer to push the nose down. Now, that would all be done automatically without the knowledge of the pilots. And that’s critical to avoid retraining. The pilots could operate the plane the same way they operated prior iterations. The system would all take care of adjustments to the aircraft and its behavior tendencies automatically, without any need to retrain pilots.
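
The MCAS behavior described in that quote is essentially a threshold controller acting on a sensed angle of attack. A toy sketch (entirely illustrative; the threshold, increment, and single-sensor assumption are mine, not Boeing's actual numbers) shows why a stuck sensor compounds:

```python
# Toy MCAS-like controller: if the sensed angle of attack (AoA) exceeds a
# threshold, command a nose-down trim increment. All numbers are invented.
def trim_command(sensed_aoa_deg, threshold_deg=10.0, increment_deg=0.5):
    """Nose-down trim (degrees) commanded for one control cycle."""
    if sensed_aoa_deg > threshold_deg:
        return increment_deg
    return 0.0

# A healthy flight: AoA stays below the threshold, no trim commanded.
assert trim_command(5.0) == 0.0

# A stuck AoA sensor reporting nose-high, cycle after cycle: the small
# per-cycle corrections accumulate into a large nose-down trim.
faulty_readings = [15.0] * 5
total_trim = sum(trim_command(r) for r in faulty_readings)
print(total_trim)  # 2.5 degrees of nose-down trim from one bad sensor
```

One small, reasonable-looking correction, applied repeatedly on the strength of a single sensor, is a micro version of the thread's theme: each step looks fine, and the sum is not.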
Quite a bit of other good coverage in the rest of the episode.

A show last month (full episode, direct .mp4, alt link, torrents 1, 2, m) immediately after the Flight 302 crash interviewed Nader and William McGee, aviation journalist for Consumer Reports and author of a book about lax aviation safety standards.
posted by XMLicious at 10:27 AM on April 5 [1 favorite]

