ATCSCC ADVZY 020 DCC/ZAU 09/26/2014 ZAU GROUND STOP
September 29, 2014 12:30 PM

On Friday, ATCSCC Advisory 20 of 26-Sep-2014 went out. When operators, controllers and airport managers saw the title, a gasp of disbelief was heard. The problem was simple enough to state in three words, and complex enough to cancel thousands of flights and cost hundreds of millions of dollars: ZAU ATC ZERO.

ZAU is the call sign of the Chicago Air Route Traffic Control Center (ARTCC), which covers northern Illinois and Indiana, southern Wisconsin, eastern Iowa, and southwestern Michigan. There are two "sides" at an ARTCC: ZAU-LO handled traffic destined for airports in the covered area, while ZAU-HI handled traffic overflying it. Both were amongst the busiest in the country. ZAU-HI was busy with traffic from the east to the west, as well as European traffic heading to Houston and Dallas-FW. ZAU-LO had to feed in traffic from airports like GYY, MKE, RFD, PIA, and the two busiest airports in the area: Chicago Midway International and O'Hare International, one of the busiest airports in the world.

On Friday morning, Brian Howard, a contract employee of the FAA and holding full credentials to the ZAU datacenters, set a fire in the telecom room, destroying 23 of the 29 racks and disconnecting all the controller stations from the associated radars and radio transmitters needed to watch and guide traffic through the busy sector. As the consoles dropped offline, the ZAU duty manager had no choice -- they called ZZZ, the FAA command center and reported ATC ZERO -- no controllers available, control center offline.

The result of the fire was dramatic. On Friday, almost 1500 flights were cancelled from O'Hare, and Midway cancelled all flights for the day, some 550. By Saturday, better workarounds were in place, with neighboring ARTCCs taking over most of ZAU-HI, and set routes from Peoria, Milwaukee and South Bend used to funnel aircraft into Chicago, and routes to Rockford, Green Bay, and Vandalia used to funnel flights out. However, these routes, watched by towers and Terminal Radar Approach Control facilities (TRACONs), had to be flown much lower than normal, and with much more spacing, cutting the allowed traffic in half.

By Sunday, improvements in the lower routes and ZAU controllers getting to the other ARTCCs removed most overflight restrictions and allowed ORD to reach 65% capacity and Midway 75% today -- roughly 75 flights an hour from O'Hare and 25 from Midway.

The FAA expects it may be October 13th before ZAU is fully back online. The FAA, under scrutiny from lawmakers, is reviewing security procedures. Brian Howard, now facing charges of destruction of federal property, left a confusing (and now pulled) post on Facebook blaming either his upcoming transfer to Hawaii or US policy in the Middle East for why he attempted to kill himself and successfully shut down ZAU.

Total cost, so far -- approximately 5000 cancelled flights, 15000 delayed flights, and hundreds of millions of dollars, primarily from the three airlines most affected: American Airlines, United Airlines, and Southwest Airlines.

The FAA has repeatedly said they do not consider this "an act of terrorism." Cancellations and delays continue to affect the Chicago area.
posted by eriko (106 comments total) 74 users marked this as a favorite
 
Yeowch. Did you get caught up in this, Erik?
posted by cstross at 12:38 PM on September 29, 2014


Flying into ORD in a couple of days.
Not a word about possible delays on either the AA or O'Hare websites.

Can't decide if they are very optimistic or just hoping no one notices.

You will be happy to know that the annual Airports Going Green conference is happening as scheduled at O'Hare...
posted by madajb at 12:40 PM on September 29, 2014


Nice write up.
posted by benito.strauss at 12:41 PM on September 29, 2014 [4 favorites]


Given that Howard created an emergency involving hundreds of flights and thousands of people, "destruction of federal property" seems pretty inadequate.
posted by evidenceofabsence at 12:44 PM on September 29, 2014


The FAA has repeatedly said they do not consider this "an act of terrorism."

Wild guess here: he's white?
posted by basicchannel at 12:46 PM on September 29, 2014 [102 favorites]


On Friday morning, Brian Howard, a contract employee of the FAA and holding full credentials to the ZAU datacenters, set a fire in the telecom room, destroying 23 of the 29 racks and disconnecting all the controller stations from the associated radars and radio transmitters needed to watch and guide traffic through the busy sector.

Christ, what an asshole.
posted by ZenMasterThis at 12:48 PM on September 29, 2014 [18 favorites]


What kind of psych evaluations do ATC staff undergo, and how available is treatment/how much care is taken with reasonable work hours etc? Maybe we don't know enough about this guy yet to say what his deal was, but ATC work seems like it must be one of the most high-pressure/high-stakes jobs around.
posted by LobsterMitten at 12:50 PM on September 29, 2014 [2 favorites]


Who gets upset at being transferred from Chicago to Hawaii?
posted by dirigibleman at 12:51 PM on September 29, 2014 [27 favorites]


What kind of psych evaluations do ATC staff undergo, and how available is treatment/how much care is taken with reasonable work hours etc? Maybe we don't know enough about this guy yet to say what his deal was, but ATC work seems like it must be one of the most high-pressure/high-stakes jobs around.

He was a contractor. The answer is "none."

Here's a thought. Maybe stop contracting out skilled positions, and hire and promote employees in-house while offering them seniority, pensions and job stability? Might that be a way to avoid bringing in sick people who literally burn down the building by a lowest-bid-wins "privatizer"?
posted by Slap*Happy at 12:53 PM on September 29, 2014 [102 favorites]


dirigibleman: presumably anyone with family in Chicago, anyone who doesn't like hordes of incredibly creepy insects anywhere, or anyone who's not looking forward to a cost of living that compares unfavorably to the SF Bay Area.
posted by You Can't Tip a Buick at 12:54 PM on September 29, 2014 [10 favorites]


Slap*Happy: THAT SMELLS LIKE SOCIALISM, COM-RADE! OORAH AMERICA!
posted by basicchannel at 12:55 PM on September 29, 2014 [7 favorites]


Once again, and this is said as the son of a PATCO striker, thanks, Saint Ronnie.
posted by NoxAeternum at 12:55 PM on September 29, 2014 [32 favorites]


Maybe stop contracting out skilled positions

Amen to this.
posted by LobsterMitten at 12:56 PM on September 29, 2014 [8 favorites]


wow this is pretty cool. I always like when things go bad for airports because it generates such nifty interactive graphics. I look forward to the graph that shows all the planes re-routing around a "dead zone"
posted by rebent at 12:56 PM on September 29, 2014


The FAA has repeatedly said they do not consider this "an act of terrorism."

Wild guess here: he's white?


He may well be, but is there some reason you feel sort of invested in authorities calling something "terrorism" that is clearly not "terrorism" in any meaningful sense of the word?
posted by Naberius at 12:56 PM on September 29, 2014 [19 favorites]


Here's a thought. Maybe stop contracting out skilled positions, and hire and promote

But then the company would need to raise prices, the stock price would drop a couple points (a huge 0.003% off the value of the company) and senior management bonuses would be cut from a few hundred thousand to just a couple hundred thousand.
posted by sammyo at 12:57 PM on September 29, 2014 [11 favorites]


What kind of psych evaluations do ATC staff undergo, and how available is treatment/how much care is taken with reasonable work hours etc? Maybe we don't know enough about this guy yet to say what his deal was, but ATC work seems like it must be one of the most high-pressure/high-stakes jobs around.

Guess why the ATCs struck back in the 80s?
posted by NoxAeternum at 12:57 PM on September 29, 2014 [10 favorites]


Here's a thought. Maybe stop contracting out skilled positions, and hire and promote employees in-house while offering them seniority, pensions and job stability? Might that be a way to avoid bringing in sick people who literally burn down the building by a lowest-bid-wins "privatizer"?

But here's the thing: who'll foot the bill for the rebuild of the command center? Not the agency that provided the contractor. There's still 0.0% incentive for the contracting agency to bother. Hence, regulations, but good luck with that.
posted by GilloD at 12:57 PM on September 29, 2014 [3 favorites]


I'm kind of sort of surprised that there's no significant DR plan - or if there was one, that it took days to hand over control to the sister ATC zones. I mean, an ETTR of 10/13 sounds like they're ordering 23 racks of gear to replace the dead ones and pulling back from the last backup.

Oof.
posted by Kyol at 1:00 PM on September 29, 2014 [6 favorites]


Yeowch. Did you get caught up in this, Erik?

Yes. I was in STL when my brother said "are you going home tonight?" I said something like "that's the plan" and he said check the news.

In fact, I think I burned up three years of travel karma, because exactly 0 of the 9 AA ORD-STL flights flew, and exactly 1 of the 9 AA STL-ORD flights flew, and that flight was AA1422, and I was on it.
posted by eriko at 1:01 PM on September 29, 2014 [22 favorites]


It seems there is a frankly terrifying lack of redundancy in the ATC setup for the Chicago area. That should be embarrassing to some FAA muckity mucks (assuming they are capable of that emotion).

Haven't these people seen Die Hard 2?
posted by srboisvert at 1:02 PM on September 29, 2014 [3 favorites]


He may well be, but is there some reason you feel sort of invested in authorities calling something "terrorism" that is clearly not "terrorism" in any meaningful sense of the word?

Man severely disrupts national infrastructure while alluding to foreign policy as a reason for it, not terrorism in any meaningful sense of the word? So he was clearly mentally ill too - you know, unlike the underpants bomber, for example. Oh wait...
posted by Dysk at 1:03 PM on September 29, 2014 [28 favorites]


From the first article:

“Brian served his country honorably on a nuclear submarine”

I'd say we got out of this one pretty well :-)
posted by scolbath at 1:03 PM on September 29, 2014 [41 favorites]


Rebent you mean like this map, where the Chicago area looks more like North Dakota's airspace? (I couldn't find a .gif timelapse, sorry.)
posted by Wretch729 at 1:04 PM on September 29, 2014 [2 favorites]


Article says there is only one person on duty at a time in the position he was in? (Sounds like he wasn't a controller himself, but a guy who keeps the systems running?) What happens if that guy has a heart attack or something?
posted by LobsterMitten at 1:05 PM on September 29, 2014 [1 favorite]


Can't decide if they are very optimistic or just hoping no one notices.

As they get more controllers into the outlying TRACONs and over to the neighboring ARTCCs (ZID, ZOB, ZKC and ZMP) they're gradually able to handle more flights into ORD/MDW. MDW's just a few flights an hour short of normal capacity; ORD is running about 65%, which is close to foggy weather conditions. In good weather on West Flow, ORD can handle 112 an hour; East Flow is limited to 98. This becomes 125+ both ways after the new runway opens next year, giving ORD 5 parallel runways, four of them with large separations.

They're bringing in equipment now. If they can even get part of ZAU-LO back online it'll make a huge difference; if they can get all of ZAU-LO, they could probably run at 100%, with the neighbors handling overflight.
posted by eriko at 1:05 PM on September 29, 2014 [1 favorite]


Who gets upset at being transferred from Chicago to Hawaii?

Someone who likes real cities, good food, a lively drinking scene, a culture of theatre and architecture and gigs, daytime baseball, friendly people, blue skies, seasons, cycling from summer heat through fall colours to a proper winter of dirty snow and a welcome spring.

Someone who wants to be in the city that produced people like Studs Terkel, Mike Royko, Harry Caray, Bill Murray, Saul Bellow, Muddy Waters, and Roger Ebert.

I guess none of that equals the worst traffic in the US (Honolulu), expensive costs of living, and loads of tourists, but for me....
posted by C.A.S. at 1:06 PM on September 29, 2014 [15 favorites]


He clearly flipped - torching a data centre and then trying to stab yourself to death are not the actions of someone who's in control of themselves. It'll take a while to find out what pushed him over the edge, but in some ways this individual and his acts are the least part of the story.

That the system was - one assumes, still is - so amazingly vulnerable, that's the story. There are technical things that need fixing, because single points of failure like this are problems that can be fixed, and there are a very large number of questions about systemic oversight that lets such design decisions through. There are questions about how much else of the ATC system is technically underspecified. There are questions about how many other important systems are similarly at risk: if you can't get it right for ATC, can we trust that others are doing it better?

But the big question is - what sort of organisation is unable to spot someone going wrong this badly in time to prevent disaster? What people said above - once you contract stuff out, your oversight is outsourced - is probably at the heart of this: you cannot abdicate your responsibility for dealing with your people as people, and not have stuff like this happen.
posted by Devonian at 1:07 PM on September 29, 2014 [24 favorites]


It seems there is a frankly terrifying lack of redundancy in the ATC setup for the Chicago area. That should be embarrassing to some FAA muckity mucks (assuming they are capable of that emotion).

Well, I would bet that there was redundancy, but it was never planned that someone would burn the place to the ground.
posted by NoxAeternum at 1:08 PM on September 29, 2014


Well, I would bet that there was redundancy, but it was never planned that someone would burn the place to the ground.

If your redundancy involves everything in one location so that one fire can wipe it all out, you need better redundancy.
posted by Dysk at 1:11 PM on September 29, 2014 [64 favorites]


To all of the people who are asking "how could the FAA be caught flat-footed?", I really recommend reading up on the PATCO strike. All of these problems were there 30+ years ago.
posted by NoxAeternum at 1:11 PM on September 29, 2014 [3 favorites]


ATC work seems like it must be one of the most high-pressure/high-stakes jobs around.

He wasn't a controller, he appears to have been a contract telecom engineer or system admin.

I'm kind of sort of surprised that there's no significant DR plan - or if there was one, that it took days to hand over control to the sister ATC zones.

One problem: He knew *exactly* where to hit the building to do the most capacity damage, and that part was in the basement in the center. He hit the datacenter and comm vault. The control consoles are just fine. It wasn't even fire that did the most damage, it was water.* He had credentials to enter the building and be in that very room.

All the lines were redundant, the computers are redundant but the place where those lines feed into the computer is very hard to be redundant in terms of space. At a certain point, everything has to come together.

* Those of you wondering about sprinklers in a DC? You often don't get a choice on that. If your local fire codes say you have sprinklers, you have sprinklers. Usually, they have a "pre-action" valve which keeps the pipes dry until an alarm sounds, which opens the valve and loads the pipes.

There's also supposed to be a dry system to suppress fires, but I would not be surprised to find out that he deactivated it before he set the fire. It's not hard to do.
posted by eriko at 1:14 PM on September 29, 2014 [15 favorites]


What happens if that guy has a heart attack or something?

Presumably ZAU ATC ZERO but for like, 20 minutes instead of hours to weeks.
posted by maryr at 1:14 PM on September 29, 2014


Even if they had a completely redundant system in a building staffed with clones I would assume a recovery plan would involve a lot of taking your time and double-checking what with them being air traffic controllers and everything.
posted by fullerine at 1:17 PM on September 29, 2014


Rebent you mean like this map, where the Chicago area looks more like North Dakota's airspace?

That was Friday, when ZAU was basically closed. Now it looks better. They're still routing most of the overflights to the edges into the neighboring ARTCCs, but you can see how they're routing flights into/out of ORD. This is a different site's view of ORD, note that thanks to ADS-B you can even see planes taxiing at ORD.

(Hmm. Am I thread sitting? I'll wait on commenting further to see if the mods are waving me off.)
posted by eriko at 1:18 PM on September 29, 2014


Well, I would bet that there was redundancy, but it was never planned that someone would burn the place to the ground.

Datacenters have some interesting safety measures because electrical fires are things that happen by themselves. And you can believe me that we test both the alternate DR datacenter location and bus some techies somewhere else to test that we can work remotely.
posted by sukeban at 1:19 PM on September 29, 2014 [4 favorites]


Yeah, and while other friends of mine in IT are all aghast that they allowed a contractor to be alone in the datacenter, I don't know of any physical controls that would've prohibited that in any of the DCs I've worked in, just policy controls. So if someone snaps and goes in alone and starts playin' telephone operator with the wiring, the first you're going to notice of it is when the incident command goes all red and angry.

(Which is, of course, all kinds of condescending to contract employees versus non-contract. What, full time employees can't snap? Shyeah, right.)
posted by Kyol at 1:21 PM on September 29, 2014 [6 favorites]


Ignoring all the other questions of ethical working conditions--- does it make sense that one person can cause this amount of damage to our critical infrastructure? Shouldn't there be redundancies in place so that even if one person goes Full Crazy- even if ten people go Full Crazy- there's backup? This isn't like the coffee machine at your office that can wait a week before someone repairs it.
posted by BuddhaInABucket at 1:21 PM on September 29, 2014


It's not just about this individual though. Contracting out these skilled positions is a symptom, not a cause. The games Congress has played with the FAA's budget (coupled with FAA mismanagement) have left the agency strained and unable to keep up. In such an environment, redundancy and contingency planning are among the first things to go.

The idea that an ARTCC could become unusable is not remotely inconceivable. That is, after all, why the FAA has ATC ZERO plans, and thanks to these plans, nothing terrible happened outside of the fire. But if it wasn't Brian Howard, it would have been a burst water pipe, a sparking electrical connection, an earthquake, a ceiling collapse, whatever. The immediate disaster plans appear to have worked, at least insofar as everyone landed safely (though there needs to be a lot more analysis on this point), but it's clear there were no acceptable recovery plans.

If we're going to expect an air traffic system that doesn't fail when a single building becomes damaged, and I think we expect such a thing, then we need to pay for the redundancy needed to make that happen. The millions of dollars it would cost to prepare for a disaster would pale in comparison to the losses from this event.
posted by zachlipton at 1:22 PM on September 29, 2014 [19 favorites]


What's with the focus on a contractor, what, employees never go crazy? His employment contract form is probably not the core problem.
posted by Bovine Love at 1:22 PM on September 29, 2014 [2 favorites]


Disclaimer: I'm a contractor.
posted by Bovine Love at 1:23 PM on September 29, 2014


Shouldn't there be redundancies in place so that even if one person goes Full Crazy- even if ten people go Full Crazy- there's backup? This isn't like the coffee machine at your office that can wait a week before someone repairs it.

You'd be really, really surprised at how fragile many huge company systems are to these kinds of events.
posted by odinsdream at 1:26 PM on September 29, 2014 [9 favorites]


Here's a thought. Maybe stop contracting out skilled positions, and hire and promote employees in-house while offering them seniority, pensions and job stability? Might that be a way to avoid bringing in sick people who literally burn down the building by a lowest-bid-wins "privatizer"?

This is the libertarian privatization dream, a post-PATCO work force of short-term contract temps.
posted by charlie don't surf at 1:30 PM on September 29, 2014 [3 favorites]


then we need to pay for the redundancy needed to make that happen.

This Will Not Happen.

What may, only may, happen instead is that a panel will be convened to consider whether anything should be done. That panel will issue recommendations saying that something should be done, but it will not say what exactly should be done. A second panel will then review those recommendations, and issue a list of things that should probably be done at some point. Several different legislative subcommittees will then separately look at the issue and come up with conflicting plans as to how to proceed. At least one, and probably more, panel(s) will then convene in an attempt to figure out which of the subcommittee plans is vaguely reasonable and the super-set of recommendations will then be codified into a single plan, and a request for funding will be made.

This is where everything goes completely off the rails and becomes so complex it beggars belief.

Practical upshot: redundancy will not be paid for, but quite a lot will be spent anyway making at minimum several million dollars for well-connected contracting companies, and eventual redundancy will be promised at some point in the future (note: this is not a promise!).

Of the funding that is eventually arranged, the majority will be spent to support tomacco farmers in Random State due to the unforeseen consequences of a major drought and several decades of pumping everything they could willy-nilly out of the ground such that the aquifers have collapsed, probably due to godlessness and commies. A portion will eventually be discovered to have been sent to prop up a nickel-and-dime religious movement somewhere else in the world, and a congressional staffer will be reprimanded accordingly (given a sweet gig at a think tank).
posted by aramaic at 1:39 PM on September 29, 2014 [39 favorites]


I'm kind of sort of surprised that there's no significant DR plan - or if there was one, that it took days to hand over control to the sister ATC zones.

Cutbacks. Fucking cutbacks. Creating plans costs time and money. Taxpayer money. And we don't want to spend taxpayer money on something "just in case," do we? Just tighten that belt another notch...feel that? That's the tight squeeze of freedomz!
posted by Thorzdad at 1:43 PM on September 29, 2014 [2 favorites]


Redundancy is another way of saying over-built and inefficient, particularly when the budget pencils come out.

The most efficient system in the world is one that can barely cope with normal operations.
posted by bonehead at 1:45 PM on September 29, 2014 [36 favorites]


Shouldn't there be redundancies in place

In certain circles, 'redundancy' is spelled 'inefficiency'.

Edit: what bonehead said
posted by echo target at 1:48 PM on September 29, 2014 [3 favorites]


The FAA has repeatedly said they do not consider this "an act of terrorism."

Wild guess here: he's white?


And had mental issues. From NPR coverage:
The affidavit says Howard entered the control center shortly after 5 a.m. Friday, pulling a hard-sided rolling suitcase behind him. Thirty minutes later, he sent a note via Facebook that caused a relative to report its contents to the police.

The affidavit, posted online by The Chicago Tribune, quotes the message:
"Take a hard look in the mirror, I have. And this is why I am about to take out ZAU [the center's radio call sign] and my life. April, Pop, love you guys and I am sorry. Leaving you with this big mess. Do your best to move on quickly from me please. Feel like I give a [expletive] for the first time in a long time again ... but not for too long (haha!) So I'm gonna smoke this blunt and move on, take care everyone."
Roughly six minutes after that note was sent, an employee at the control center called 911 to report a fire. Paramedics who arrived shortly afterwards say they saw smoke and found a panel that had been removed to expose cables and wires. They also saw a gasoline can, towels and a suitcase.

Seeing a trail of blood on the ground, the emergency crew followed it and found Howard, shirtless, under a table. He "was in the process of actively slicing his throat," the FBI filing says, citing a paramedic. The team began to treat Howard, who told them to "leave me alone."

So, there's that.
posted by filthy light thief at 1:50 PM on September 29, 2014


Clearly another victim of reefer madness.
posted by jferg at 1:52 PM on September 29, 2014 [8 favorites]


what we should do is fully privatize air traffic control. With multiple competing air traffic control groups, the market will build redundancy right into the system. plus, with the internet and smart phones, there's no reason to have a centralized control room at all. just like Uber or Lyft, air traffic controllers could work freelance from home offices or even from their phone "on the go" thus optimizing the price to safely direct each route.
posted by ennui.bz at 1:52 PM on September 29, 2014 [26 favorites]


It seems there is a frankly terrifying lack of redundancy in the ATC setup for the Chicago area.

Our whole National Airspace System is pretty fragile. I remember doing a school report in the mid-90s about the antiquated systems ATC was forced to work with, and I'm sure it hasn't gotten significantly better since then. A large, complex, busy airspace system costs a lot of money to set up and maintain, and I don't think anyone really has the stomach to pay for it unfortunately.

The plan for the future is called NextGen and it incorporates a lot of really good improvements to things like instrument approaches and sense-and-avoid, but it's been slow going because the FAA is passing the cost of these upgrades on to aircraft owners through mandates of additional equipage. NextGen is supposed to decentralize a lot of the information gathering aspect of traffic control, relying on sensors on aircraft to give ATC a picture of the skies instead of large radar installations (although they'll never completely phase those out). Not sure how a fully NextGen-operational airspace would handle an interruption like this, though, since it does still require aircraft funneling data to a central location for processing and rebroadcast out.
posted by backseatpilot at 1:53 PM on September 29, 2014 [4 favorites]


Devonian: He clearly flipped - torching a data centre and then trying to stab yourself to death are not the actions of someone who's in control of themselves. It'll take a while to find out what pushed him over the edge, but in some ways this individual and his acts are the least part of the story.

The obvious solution is to ban cigarette lighters, matches, and flammable paper from airports.

And instigate mandatory pocket searches, preferably with new high-tech pocket screening devices from companies key senators have invested in.
posted by IAmBroom at 1:58 PM on September 29, 2014 [1 favorite]


What's with the focus on a contractor, what, employees never go crazy? His employment contract form is probably not the core problem.

It's not that contractors are evil or unreliable, it's that the contract form suggests a lesser degree of oversight, and a lack of centralised psych evals, etc. that you would normally expect for this sort of critical infrastructure. It's easier to put systems in place to catch this sort of shit going wrong for a person, mentally, before they actually go postal if they're an employee rather than a contractor who is part of an entirely separate bureaucracy.
posted by Dysk at 1:59 PM on September 29, 2014 [4 favorites]


To everyone demanding "redundancy": How much redundancy do you have for your car? If your house burns to the ground, how long will it take to get your spare house up and running? If your workplace catches on fire, how much further will it be to your alternate office?

If you have a positive answer to all of those, are you prepared to double your income taxes to pay for the government to catch up to your level of "preparedness"?

It is simply not possible, nor realistic, to design redundancy into our lives for every conceivable problem. Life isn't the NASA Apollo Program. Sure, in retrospect it would have been great if there had been a 24/7 guard on that room, or some other preventative measure, but frankly there's probably ten other ways he could have done as much or more damage.

If some asshole boats out to a bridge in any major US city and straps a serious charge onto just one support, blowing it outwards, there's really no "redundancy": that bridge is shut down for A Long Time, even if it only shows cracks afterwards. No city is prepared to stop any boater going near any bridge.

How about a semi intentionally jackknifing in a tunnel? How about a truck dumping unexploded ordnance on any major highway exit (where the traditional countermeasure of exploding won't work)? How about ...

There's an endless list of things that can go wrong or be done maliciously, and we have to prepare for the most serious that are also most likely. Everything else is security theater, and now we're Monday-morning quarterbacking.

Of course, now that he's done this, this idea is ripe for copycat assholes, and must be added to the list of things we prepare for.
posted by IAmBroom at 2:13 PM on September 29, 2014 [8 favorites]


Many of those things you suggested are unlikely at best. The possibility of a data center meltdown is within the realm of probability just by accident alone, never mind a malicious employee. Throw in that we're talking about one of the primary air control centers, versus a data system that would be considered an inconvenience if it went down but wouldn't need to be restored asap. It's not unreasonable for people to say that the FAA dropped the ball here.
posted by kokaku at 2:19 PM on September 29, 2014 [7 favorites]


I can't believe this is really a thing. I've come up with disaster recovery and backup plans for small offices with redundancy. On tiny budgets, too. Literally, I've worked at places that have fewer than 10 workstations where someone could burn down the whole room with the NASes, servers, and workstations and we'd be getting going within a day or two. And most of the delay there would be calling companies to transfer licenses of quirky software that ties itself to hardware as part of the license.

Either someone needs to get fired for not doing their job and being a clueless noob, or whoever told them what to do or ignored their pleas needs to be fired and brought up on charges.

I'm just fucking flabbergasted. Seriously, friends'/colleagues' tiny companies that have at most 2-3 servers have more redundancy than this. What's their goddamn excuse?

Another point is, does every major public safety system like this suck this much? People would get fired from a fucking video game studio if this lack of redundancy shut them down for a couple weeks.

eriko: There's also supposed to be a dry system to suppress fires, but I would not be surprised to find out that he deactivated it before he set the fire. It's not hard to do.

I've helped install gear and config stuff at one of those super redundant, bomb-drop telco data centers. It was one built to run an actual telco's equipment and they sold off the facility to a private company after the dot com boom that now does various levels of super reliable colo. They had i think, two different dry systems.

Although there was a whole "tag out" system to disarm that system (which could be overridden, and there were light up evac paths if someone in the control room decided to just fire the system and people were in there)... if you were someone who regularly worked in there, getting in and getting that system disarmed would basically consist of "oh hay steve" "oh whats up bob" *BZZZT* door opens and now you're in there with the system off.

As for why they didn't have the sort of system where someone can make the call to fire the dry system and just kill whoever is in there if they don't leave, i have no idea, but forcing it to fall back to the wet system by shutting it off and going in, yea, as you said, not that hard.

dirigibleman: Who gets upset at being transferred from Chicago to Hawaii?

Hawaii is one of those places like disneyland. People want to live there, and don't realize the reality of it actually kind of does or would suck ass. It's on a similar place on my list to alaska, actually. In that i know a bunch of people who grew up there and got the fuck out as soon as they turned 18 if they could, or if not then not long after.

...And they all constantly get people asking them "why the hell did you move away from there to here?"

Traffic, bugs, and cost of living have already been brought up... but there's also just a lot of isolated, boring, smalltown bullshit. And every company who brings you in from out of town tries to get away with paying you less because "omg you'll be in hawaii!".

Maybe i'm just a bemused cynical asshole knowing the realities of it, but i can completely understand why someone wouldn't be stoked on moving there for work. "Yay, now i can spend all my money on a shitty place so that... people can be jealous of me for "living the dream"!"
posted by emptythought at 2:21 PM on September 29, 2014 [9 favorites]


I can't speak for everyone, of course, but the redundancy I'm after would involve having a better plan for when a control center gets taken out.

...thereby avoiding, for example, the traffic snarl that occurred when this exact same center had an electrical malfunction that created smoke, which caused it to be evacuated.
posted by aramaic at 2:21 PM on September 29, 2014


Having a bad day, emptythought?
posted by Melismata at 2:28 PM on September 29, 2014


what we should do is fully privatize air traffic control. With multiple competing air traffic control groups, the market will build redundancy right into the system. plus, with the internet and smart phones, there's no reason to have a centralized control room at all. just like Uber or Lyft, air traffic controllers could work freelance from home offices or even from their phone "on the go" thus optimizing the price to safely direct each route.


I beg the gods of MetaFilter that no "disruptive entrepreneur" reads that comment.
posted by GenjiandProust at 2:31 PM on September 29, 2014 [5 favorites]


IAmBroom: I don't have a primary car, let alone a spare one, but I have public transit, taxis, Uber, Lyft, Zipcar, and ordinary rental cars available. I don't have a spare house, but I do have insurance to ensure I have someplace to live and to replace my things. So thanks to the magic of hotels, I do have spare shelter available if my home burnt down. If my workplace catches on fire, I could work from home or an office in New York and the company could rent temporary space pretty quickly.

Bridges are quite expensive, so we don't build backup bridges in the name of redundancy, but we do consider the contingencies if a bridge collapses or is otherwise unusable. There can certainly be a logical cost/benefit analysis to determine reasonable levels of redundancy for these facilities.

That's all I am asking for here. It's not remotely unreasonable to expect someone to have looked at the likelihood of various events and weighed them against the cost of mitigating measures. There are 22 ARTCC facilities (and a far greater number of terminal and tower facilities). None of them are a secret. It's really quite conceivable that one of them will become unusable. Even if you somehow failed to think about the possibility, the numerous instances where these sites have become temporarily unusable should have been a clue to you.

Nobody is saying they should have built 22 backup facilities that stand idle 364 days a year. The costs of that most probably outweigh the benefits. But you're taking unreasonable risks with critical national infrastructure if you don't have an actual disaster recovery plan with appropriate redundancy to get back up and running more quickly.

Or, to put it another way, if I knew that it would cost me hundreds of millions of dollars if my car broke down, I'd spend a small fraction of that on a backup transportation plan.
posted by zachlipton at 2:35 PM on September 29, 2014 [10 favorites]


Redundant physically dispersed telecom demarcs within a datacenter aren't all that unusual, and would have helped. But surviving this would have required fully-redundant equipment all the way to the user workstations (in this case, the controllers' consoles). Exterior communications, internal LAN, and servers (or other specialist equipment) would have had to be duplicated and physically dispersed, and that would still have left single points of failure in the control room (the warm bodies, if nothing else). That level of on-site redundancy and high-availability is... rarer.

The far more common solution is to provide alternate processing facilities and an alternate work location. In many cases it's decided to accept one with less than full steady-state capacity. And that's essentially what they had here: the existing Chicago controllers have been dispersed to other control centers with functioning hardware, where load shed from the failed location is being picked up as well as possible.

Given that even this well-informed and targeted attack doesn't appear to have resulted in a life-safety emergency (though certainly there were massive inconveniences and economic impacts) it's hard to say that there were obvious flaws in their risk and impact analysis. An event that does just this much damage, without resulting in complete loss of the site, was probably considered a fairly low probability.

On the other hand, given that dropping a single site is now a demonstrated vulnerability, that the difficulties with load transfer and the economic impact are now publicly-known, and the air traffic system is a known target for aggressive bad actors, it might well be a good time to increase the level of protection. Remediation of the specific single points of failure that were targeted here could help, but the more obvious solutions would be to provide a full, in-region warm spare for the entire site, or a LOT more excess capacity at the existing alternate locations (other centers), along with (it looks like from the outside) faster and more repeatable processes for transferring workload from a failed location.

Both of those would provide protection against this scenario and a number of worse ones, including a "hand of god" loss of the full site including staff, but would cost a LOT. Very few organizations would even consider ensuring that level of continuity-of-operations, and even the FAA would have to weigh the costs and benefits carefully. With budget constraints and the previously-demonstrated lack of respect for the criticality of the ATC system, I'll be surprised if this is fully fixed, but I'd sure love to be wrong.
posted by CHoldredge at 2:35 PM on September 29, 2014 [12 favorites]


if you were someone who regularly worked in there, getting in and getting that system disarmed would basically consist of "oh hay steve" "oh whats up bob" *BZZZT* door opens and now you're in there with the system off.

The simplest way is to lean something moderately heavy against the big yellow "Don't Dump The Firebottle" button that's required in every room with that sort of fire protection. The advanced method involves the complex tools known as "a screwdriver" and "a jumper with alligator clips on each end."

To me, the one big mistake is that this wasn't a No Lone Zone, or there wasn't anybody making sure that you weren't alone. Defending against a hostile insider is always going to be harder than defending against a hostile outsider. There were plenty of fences and locked doors between the world and that room. The problem was that Howard had the keys to all the fences and doors.

If I were the FAA, I'd be going on the offense. "Well, Mr. Congresshole, the reason we didn't have two separate data centers at each ARTCC is you refused to pay for them. The reason we had contractors doing this work was you refused to let us hire our own people to do them. The reason we couldn't have guards on the room is your budget cuts. We could have stopped this, but you decided that causing a budget crisis was more important. So, above all, I hope *your* ass was stuck in Timbuktu for the weekend. Lord knows enough innocent people were."

But, I'm not the FAA.
posted by eriko at 2:39 PM on September 29, 2014 [24 favorites]


It really isn't that we have to start worrying about people burning down data centres. Putting gasoline sniffers on the doors, now that would be security theatre.

It's recognising that data centres can go wrong for a lot of reasons, and that it's good to be able to fail over to something else when that happens. Fortunately, there is an entire speciality devoted to this sort of problem - disaster recovery, or DR - and that includes not only the engineering to make it happen but the methodology to do proper risk analysis to find out how much DR you need (Disclaimer: I am not a DR practitioner, but I have hung out with them in bars). In practice, if you take this sort of thing seriously you end up with different levels of DR covering different aspects of your business, from the stuff that will kill you if you can't do it in the next five minutes to the archival stuff for long-gone projects. Stuff like your dev programme and your marcomms tends to sit somewhere in the middle, but YMWV.

Indeed, the FAA has DR - the US airspace didn't collapse in a dangerous manner, everyone who was up at the time got down OK because procedures were followed and the remaining working infrastructure reconfigured itself appropriately. A metric shitload of money was lost, certainly more than it would have cost to engineer in stronger DR to make Chicago ATC failover in minutes rather than days, but... guess what, the FAA didn't lose that money, and it's actually a good question whether building enough redundancy across the system wouldn't in fact cost more compared to the number of times we can reasonably expect this level of failure to hit. (The FAA should really be comfortable with assessing probabilities of failure and associated consequences - it's a large part of its actual job.)

Personally, and with the proviso above that I don't actually do this for a living, I'd expect that a properly engineered system would be substantially more robust and not unaffordably so. What I don't know (but fear I do) is that since ATC isn't a properly engineered system to start with, it may well not be sensibly retrofittable until the next major architectural revision.
posted by Devonian at 2:43 PM on September 29, 2014 [7 favorites]


Redundancy makes complete sense when the loss of a single data center costs more in time, income, and effort than the money needed to pay for a duplicate data center.

I've talked to people who have worked on systems -- just individual systems, mind you, not complete network or server infrastructures -- used by private companies that have been amortized by the amount of money per hour that's lost from the system being down. Tens of thousands, hundreds of thousands of dollars. Having a large segment of the US air traffic grid down for a full day? We're easily into tens if not hundreds of millions.

Comparing that to an individual's car or any single point of infrastructure that either has built-in redundancies or is relied upon by few others is not even swinging in the same ballpark.
posted by mikeh at 2:45 PM on September 29, 2014 [2 favorites]


I'm just fucking flabbergasted. Seriously, friends'/colleagues' tiny companies that have at most 2-3 servers have more redundancy than this. What's their goddamn excuse?

Likely that the worst case, planes being grounded for a day or two, isn't the worst thing in the world. There is reserve capacity in the other neighbouring airspaces, and planes have limited ways of dealing with loss of ATC. There were enough redundancies to prevent loss of life (first) and high-value assets (second).

Redundancy doesn't mean that the system has to have completely duplicate capacity. For big, expensive systems like ATC, it means they should degrade gracefully, without major incidents. A bad system would have risked accidents. This may not have been flexible enough to prevent economic loss and customer inconvenience; I'm not one to judge. But it was flexible enough to be safe in this case.
posted by bonehead at 2:48 PM on September 29, 2014 [2 favorites]


The ATC infrastructure is pretty robust, overall. There are protocols in place for all sorts of communication failures because 1. they happen and 2. failure is often spectacularly bad. The big military overlap in development and the cold war paranoia of the second half of the 20th century have pushed for good failure handling, too. I suspect that their checklists have checklists.

My guess is there are a bunch of folks inside the ATC organizations who've been pushing for better DR with respect to data center spaces and who have been getting shut down for a variety of administrative reasons that mostly boil down to a lack of budget and a great organizational inertia that says that the known devil is better than the unknown one.
posted by rmd1023 at 2:58 PM on September 29, 2014


I don't think it's fair to assume that they didn't have a functional DR or continuity-of-ops plan. All that we know so far is that invoking or improvising that plan took longer, and was less effective, than those of us on the outside would have expected. Whether that means there's a mismatch between our expectations of their preparations and what they actually decided to implement, or they had a failure either of planning or execution, isn't yet clear. And frankly, the people who could tell us are much better utilized fixing the existing problem at the moment. The root cause analysis should wait until AFTER the incident is closed.

This isn't just another datacenter. The FAA can't go to IBM BCRS or Sungard and take out a hotsite contract for a high-altitude tracking radar feed to go with their three racks of HP blades. The system failing has costs, but providing redundancy isn't cheap, either. When any SMB can hire out for a decent risk analysis and BIA, I really doubt that the FAA completely neglected these things. They may have made risk acceptance trade-offs that now look bad, or they may have fallen short of their own expectations, but we really won't know that until long after the problem is fixed.

And bluntly, something this bad happened, and nobody died? They succeeded at job one. Dollars and hours are secondary.
posted by CHoldredge at 2:58 PM on September 29, 2014 [11 favorites]


something this bad happened, and nobody died? They succeeded at job one.

Hell yeah.
posted by rmd1023 at 3:03 PM on September 29, 2014 [3 favorites]


It's really quite conceivable that one of them will become unusable.

They, in fact, did have plans for just this contingency, and they have them for *all* the ARTCCs. They're running them now. It was *known* that a multi day outage at an ARTCC would be a bad thing, but they did the best that they could to protect against a multi day outage, and they made plans in case it happened. You don't think they just came up with the ZAU-HI reroutes and ZAU-LO tower routes off the cuff, do you?

All of them are part of the playbooks. There are many cases where a given route or terminal is closed because of weather. To redirect traffic around ZAU, they just pulled the routes out of the playbook and announced them. They had low level routes in place to get planes to/from MDW/ORD, but they were expected to be needed for a couple of hours. There were better plans in place, but they involved moving controllers to different facilities. That's why Saturday was better than Friday, and Sunday better still -- getting the offline ZAU controllers to distant facilities and getting them into those flows. (Note: Those controllers couldn't fly out. It's an overnight drive to Kansas City and Minneapolis, shorter to Indy and Cleveland.) It also helps that the airlines now understand that they must file and fly the *exact* routes published to get to MDW/ORD; on Friday evening and Saturday morning, they were still trying to file standard approaches and expecting a mid-flight change.

One of the big problems was that, until about 4PM, the FAA didn't know how bad this was, because they couldn't get into the room -- the police had them locked out. So, while they set up the first level plans, they were still hoping to get some of ZAU online that day. It wasn't until late that they realized that ZAU was going to be down for days and they went to the second and third level drills -- and they specifically did not send controllers on the road until they knew they couldn't use them there. It would suck if you started them rolling, it was a simple fix, and then you're down for hours while they all drive back.

Now, you'd like to think that we could just drop ZAU or ZNY or ZLA and pick everything up, but to do that requires money, and the government refuses to spend the money. In fact, if you told me that you could knock out ZAU completely and be over 50% traffic capacity in 24 hours, I'd have laughed at you. The FAA has done spectacularly well with this, and people demanding MTRT 100% in 4 hours or less after a complete ARTCC loss simply have no idea how much that would cost.

This bastard basically picked the second or third worst center to hit, and hit the exact weakest spot in that center. From the sense of "how would I do it if I was going to be a bastard", he's done very well. As a very frequent flier who's stuck dealing with this mess, I want to kick him in the nuts so hard that he has to contact ZKC for overflight clearance.
posted by eriko at 3:06 PM on September 29, 2014 [36 favorites]


Redundancy makes complete sense when the loss of a single data center costs more in time, income, and effort than the money needed to pay for a duplicate data center.

dingdingding

while everyone going "well no one died!" has a good point, the fact that this caused something like 200 million in losses proves they still fucked that up. there's no way 23 racks of equipment cost 200 million dollars, this isn't the LHC or over-inflated bits of gear on an aircraft carrier or anything.

the crappy thing here is those costs were incurred by the airlines, not by the FAA, so the FAA and congress/etc have no real reason to actually care.

I actually think that's a pretty myopic view for a government agency to take too. If you're in charge of a system that can cause massive economic harm in domino ways if it fails (it's not just the airlines who lost money, what about people who needed to be at X location to do Y job and were delayed?) and your link in the chain can break the entire system, even the parts run by private companies (i.e., the planes, airports, etc.), then i think you have a greater responsibility than "stop people from dying".

I mean yea, that's really important, but i feel like there should be some push for something to the effect of "no event including someone bombing a datacenter or office can take this system down for more than 12 hours".

It just blows my mind that it's still running at a degraded level now. Hell, 9/11 didn't fuck up any systems for that long as i remember. internet and phone was totally functional within a day or two. And this isn't really infrastructure, it's infrastructure on top of infrastructure. And yet it's arguably, only an order of magnitude less important than say the power grid or the internet.
posted by emptythought at 3:14 PM on September 29, 2014 [4 favorites]


Do you remember when we didn't have any planes in the air for a couple of days after 9/11? And that was with no physical damage to the air traffic control system. Yeah, there are a lot of delays right now and it's a pain in the ass, but business is continuing.

* note: Maybe there were planes elsewhere? I was in Boston where there was no physical damage and September 12th was a fucking eerie bright sunny silent day in the skies.
posted by maryr at 3:17 PM on September 29, 2014 [2 favorites]


something this bad happened, and nobody died? They succeeded at job one.

There's a good chance that someone did. A number of people who couldn't fly rented cars and drove. There's a significant chance that a number of these people were involved in accidents, and a not insignificant chance that someone died from a car wreck in a car they never wanted to be in. Illinois has a felony murder rule, and arguably, this could apply if someone did die in a car wreck because they were driving rather than flying -- the question becomes whether the charges currently made or going to be made rise to a "forcible felony." Given that arson is in play, yes.*

The very impressive thing was the way they were able to handle the traffic that was already in the sector -- I've heard they were on cellphones to towers to call instructions up, since they lost their link to their transmitters, and they were either able to divert or route everyone in the sector while DCC got the word out and rerouted the rest of the world around.

They also knew they had about 50 planes incoming from Europe that were expecting landing slots, and they had plans in place for them by the time the first ones showed up. They also knew they couldn't hold those planes very long, and that they wouldn't have fueled up for extended low level flying, so they built a corridor down Lake Michigan and had an ORD tower position handling them via miles-in-train.


* Let's not go into the issues with the felony murder rule here. Accept that it is in play in this case if someone died because of actions caused by this fire. As an aside, Illinois formally abolished the death penalty in 2011, so that's not in play for a state charge. Federal charges are a different matter.
posted by eriko at 3:18 PM on September 29, 2014 [1 favorite]


Hell, 9/11 didn't fuck up any systems for that long as i remember.

Oh, it certainly did. Ask PATH commuters. Ask NYC cell phone customers, TV and radio listeners, etc. There was a huge knock-on effect from losing 1/2/7 WTC.

The airspace was shut down intentionally, after 11-Sep-2001, but the air traffic control systems were unharmed. Once US airspace opened, they were ready to go.
posted by eriko at 3:21 PM on September 29, 2014 [4 favorites]


And that was WITH emergency plans built into the system, thanks to Y2K planning.
posted by maryr at 3:24 PM on September 29, 2014 [1 favorite]


There are 22 ARTCC facilities (and a far greater number of terminal and tower facilities). None of them are a secret.
My God... a single person with a grudge and a month to spare could take out the entire system.
posted by MikeWarot at 3:28 PM on September 29, 2014


Naberius, pretending that people wouldn't be screaming "terrorism" if this guy was a brown Muslim is either garden-variety cluelessness or willful denial.
posted by truex at 3:30 PM on September 29, 2014 [4 favorites]


Redundancy makes complete sense when the loss of a single data center costs more in time, income, and effort than the money needed to pay for a duplicate data center.

That's not obviously correct in all circumstances. That calculation completely leaves out the estimated risk that the data center will fail.

Fundamentally, this is an underwriting exercise. The usual model looks like this: Threat A will cost $X/year to completely mitigate. If A takes place, expected losses (economic and intangible both) are expected to be $Y. But A is expected to take place only every N years. If X is less than Y divided by N, you spend the money and mitigate the risk.

The problem is, for most of these threats, N is hard to determine except looking backwards. And for unusual events, if A never takes place, it will always be hard to convince your management (in this case the American people) that you didn't waste the money you spent to be safe. And if it does, you'll never convince them that you spent enough.
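
For what it's worth, here's a minimal sketch of that rule of thumb in Python. The function names are made up for illustration, the $200M loss is just the rough figure floated upthread, and the $10M/year mitigation cost is a pure assumption, not an FAA number.

```python
def should_mitigate(annual_cost_x, loss_y, years_between_events_n):
    """Return True when X < Y / N, i.e. mitigation costs less per year
    than the loss you expect per year from threat A."""
    if years_between_events_n <= 0:
        raise ValueError("N must be a positive number of years")
    expected_annual_loss = loss_y / years_between_events_n
    return annual_cost_x < expected_annual_loss


def break_even_years(annual_cost_x, loss_y):
    """The N at which X equals Y / N; a rarer event fails the test."""
    return loss_y / annual_cost_x


# Hypothetical inputs: $10M/year to mitigate, $200M lost per event.
X = 10_000_000
Y = 200_000_000

for n in (5, 20, 50):
    print(f"N = {n:>2} years -> mitigate? {should_mitigate(X, Y, n)}")
print("break-even N:", break_even_years(X, Y))
```

With those made-up numbers the break-even sits at N = 20 years, so the whole decision flips on a value nobody can measure in advance.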
posted by CHoldredge at 3:31 PM on September 29, 2014 [3 favorites]


In unrelated news, Boston (BOS) had to ground stop. Why?

AT BOS, TWO TALL SHIPS ARE IN THE HARBOR IN ADDITION TO LOW CEILINGS. A SHORT TERM GROUND STOP IS IN PLACE WHILE THE SHIPS CLEAR THE HARBOR.

And ORD is backing up, because we're at Happy Hour.

AT ORD, THE 23-00Z ARRIVAL DEMAND HAS SPIKED AND DUE TO THE HEAVY DEPARTURE DEMAND AT 23Z, A GROUND STOP IS POSSIBLE. A REVISION TO THE AFPs IS EXPECTED. EAST WINDS ARE POSSIBLE AFTER 2230Z WITH AN ARRIVAL RATE OF 68.

The east winds possible would be the lake breeze making it out to ORD. In more news, DEN got hit by a thunderstorm, so it's backing up, and Florida has apparently lit up with storms as well, so flight routing is changing by the minute dodging them.

ATC is *hard*. But that tall ships delay is funny. WHAT DO YOU MEAN A GODDAMN BOAT?
posted by eriko at 3:34 PM on September 29, 2014 [8 favorites]


To be fair, the landing vectors at BOS were laid out over the original cow paths, which themselves were laid out back when ships had more lanes through Boston Harbor, which was larger because there was less waterfront landfill.
posted by benito.strauss at 3:46 PM on September 29, 2014 [7 favorites]


Why is everyone assuming that these equipment rooms were data centers that primarily contain Telco and IP equipment?

There are certain kinds of communication equipment for which it's prohibitively difficult to design full redundancy (or doing so leads to a vastly more complex system, which has its own issues). I don't have personal experience with ATC, but I do have experience working in broadcasting, and I'd imagine that the principles (and equipment) are largely similar: Maximize redundancy, and have a "catastrophic fallback" that can quickly be implemented to operate at reduced capacity until a full recovery can be orchestrated. This sounds exactly like what the FAA are doing.

Similarly, while contracting culture has completely gone off the rails in the government, there are certainly specialists who don't need to be full time employees. Most government offices don't need a full-time electrician....

Let's also not forget that modern ATC systems are still designed with the assumption that towers can go dark (this actually happens somewhat frequently in tornado-prone areas). Even though ZAU went offline, planes were still able to land safely.

As for mental health screenings, I don't think there are any good options. The FAA is the most paranoid organization that I know of when it comes to the mental health of pilots and controllers. Unless you want to prevent anybody who's ever seen a psychologist from going anywhere near an airport, there aren't any good options that don't massively trample the rights of the workers.

There are certainly lessons to be learned here (which is coincidentally a thing that the FAA is rather good at). This is an unacceptable level of downtime, and changes need to be made. However, I see a lot of people jumping to very hasty conclusions in this thread. Let's hope that saner minds prevail at the FAA.
posted by schmod at 4:03 PM on September 29, 2014 [5 favorites]


rebent: wow this is pretty cool. I always like when things go bad for airports because it generates such nifty interactive graphics. I look forward to the graph that shows all the planes re-routing around a "dead zone"
Flight-aware maps of large #MSP & #DTW reroutes due to #ZAU ATC ZERO #Chicago #traffic #delay #FAA @Delta - Mike Robinson, @WX_ATM, 26 September 2014
posted by ob1quixote at 4:05 PM on September 29, 2014


ms scruss was supposed to head back to YYZ from MCI via ORD yesterday, but United silently cancelled her flight and moved it to 0550 this morning. I can't imagine how much fun Kansas City International is in the wee hours.

> what we should do is fully privatize air traffic control.

I know you jest, but we have this in Canada. Most of the PSRs are over 25 years old, and can barely handle new clutter built around cities as their whole kilobytes of object masking RAM is maxed out. We also, apparently, have some of the most expensive ATC charges in the world - but clearly none of that makes it to buying new radar.
posted by scruss at 4:05 PM on September 29, 2014 [1 favorite]


Getting stuck at MCI overnight is a nightmare. There's nothing there, damn few hotels nearby, and you are far closer to Kansas than you are to either Kansas City.

MCI was a brilliant airport until we needed security. The doors of the gates are 75 feet from the outside sidewalk. Then they had to add security, which means pens by the gates. Hate MCI. Hate it. Grrrrrrr.
posted by eriko at 4:30 PM on September 29, 2014 [2 favorites]


Oh, I was wrong. That's not a lake breeze, that's a backdoor cold front. It's going to fog out MDW and ORD. ORD has cat III ILS, so it will be fine.
posted by eriko at 4:32 PM on September 29, 2014


while everyone going "well no one died!" has a good point, the fact that this caused something like 200 million in losses proves they still fucked that up. there's no way 23 racks of equipment cost 200 million dollars, this isn't the LHC or over-inflated bits of gear on an aircraft carrier or anything.

Ehhhhh... I've got to join the people who are saying the system worked. If you're going to have a backup for one center, you're going to back up all 22, and I'm guessing by the time you do all that you'd be lucky to set it up, let alone maintain it, for $200M.
posted by booooooze at 4:41 PM on September 29, 2014 [1 favorite]


To everyone demanding "redundancy": How much redundancy do you have for your car? If your house burns to the ground, how long will it take to get your spare house up and running? If your workplace catches on fire, how much further will it be to your alternate office?

Let's say I walked out the door of my work right now and it exploded behind me, a hunk of debris landed on my car and set it aflame, and by random chance a meteor landed on my apartment. What would happen to me? After I reconsidered my way of life, I'd call my insurance company. I'd get my renter's insurance and car insurance paid out. I'd have a rental car within a couple hours. I'd have a hotel to stay at tonight. My work would find a new office within a couple days. My work's systems would be switched over to their backup systems located elsewhere. Our data systems are backed up daily, and we have redundant off-site servers for all of the critical business applications. Of course, all of my personal data, pictures, financial records, etc. would be perfectly fine, as I have them backed up to multiple hard drives, in multiple locations, and also in the cloud on auto-backup. It's almost as if the really important things, the things I can't replace easily, I keep redundant. The total cost for all of these measures is fairly low, and doesn't affect my way of life.

More importantly, here's what wouldn't happen: 1500 flights wouldn't get cancelled. Hundreds of thousands of travelers wouldn't be stranded. Airlines wouldn't lose hundreds of millions of dollars. If my car, home and work burning down were going to cost that kind of money and affect so many people, I'd probably go ahead and get spares for all three.

IF you have a positive answer to all of those, are you prepared to double your income taxes to pay for the government to catch up to your level of "preparedness"?

This is hyperbole of the highest level. It's the kind of thing I'd expect to see on a Fox News teleprompter. I don't find it unreasonable for the FAA to have redundant systems. To outright declare that the entire US Government's budget would have to be doubled would imply that they also need double the facilities, staff, and administration. Simply put, that's not true.

Just about every major business has all of their data backed up multiple times. My company, with thousands of employees spread across the globe and full manufacturing and shipping systems in place, manages to have multiple backups of everything. The FAA has a $15.4B budget. How is there not room in that budget to have full redundancy of the computer systems for their biggest 22 centers?
posted by Mister Fabulous at 5:07 PM on September 29, 2014 [1 favorite]


Your company is nowhere near the size or complexity of the USATC, and you are making lovely assumptions about SATA reliability.

I know this because you're talking about backups. Backups mean nothing. Restores mean everything, and if you haven't slapped a new HDD into your machines and restored your data, I flat out do not believe you will survive a hardware failure or an idiot-with-delete-privs incident.

Sorry. I've heard this too many times. If you haven't restored data in the last 30 days, your backups are probably useless. If you haven't run full production on your DR site in the last 30 days, it probably won't run at all.
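
For what it's worth, a restore test doesn't have to be elaborate. Here's a minimal sketch in Python, assuming a "backup" is nothing more than a gzipped tar archive of one directory. The function names (`make_backup`, `restore_and_verify`) and the archive layout are made up for illustration, not anybody's real tooling. It restores into a scratch directory and compares SHA-256 checksums against the live files.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of everything under source.

    Assumption: archive lives outside source, so it does not back up itself.
    """
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_and_verify(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(Path(scratch)) == checksums(source)
```

A `True` result only shows that the archive restores to something matching the live files right now. It says nothing about whether you can run production off the restored copy, which is the harder test.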

And when your interconnect isn't a pair of big pipes to a couple of tier 1 providers, but literally dozens of connections to radars -- many 20 years old, none digital -- and dozens of transceivers, again not digital, you quickly start to realize that this costs real time and real money for an organization which half the government is actively trying to destroy.

So, yeah, get on with your bad self and your "of course," and I deeply and truly wish you luck when you need those restores.

You have restored, right? Recently?
posted by eriko at 5:25 PM on September 29, 2014 [16 favorites]


ATC zero is second only to SCATANA.

(Although my favorite ATC acronym is SODPROPS.)
posted by kiltedtaco at 6:52 PM on September 29, 2014 [1 favorite]


eriko, Boston changes flight patterns when LNG tankers enter the main channel on their way to the Everett LNG terminal. And while they are probably faster than tall ships, they are certainly not that fast!
posted by scolbath at 7:37 PM on September 29, 2014


You have restored, right? Recently?

Yes, I have. Had my HDD crash (spindle) and had to restore a month ago. Went well. My work also had the central ordering system crash, too, due to a server that caught fire. Downtime was a few hours, full restores within a day, but production never ceased. That wasn't my point.

My point was that IAmBroom was making asinine comparisons using an individual person and USATC. Individuals can have downtime without major repercussions, USATC can't. When you can't have downtime, then the systems need to be designed for it. In the case of USATC, it seems that there wasn't redundancy in place. One server room goes down and it knocked out ATC over one of the busiest airspaces in the country. That's insane.

I realize that having redundancy in place for so many systems costs enormous amounts of time and money. My question is how much? Is it in the Billions? Tens of Billions? We already spend that on the FAA every year. The answer is that whatever the number, it's not going to double the income tax rate (IAmBroom's other statement) to have critical systems built with redundancy.
posted by Mister Fabulous at 7:44 PM on September 29, 2014 [2 favorites]


DR is pretty well known, I think. The dreaded Return Home plan is often harder than failing over to the DR site -- and isn't that what's taking so long here?
posted by wenestvedt at 7:52 PM on September 29, 2014


It was a telco room where the wires from the radars come in. Wiring each radar to a main site and to a backup site is not feasible. The FAA is not the same as Uncle Jim's online store.
posted by monotreme at 9:04 PM on September 29, 2014 [3 favorites]


Do you remember where we didn't have any planes in the air for a couple of days after 9/11?

I do. I remember deep blue skies -- a sky that I'd seen only once in my (then) 35 years. It was amazing. It was also, when you realized why there was nothing but blue, horrifying.

Worse: The experiment we will never recreate, so this data point is unprovable and probably invalid, but: The average temperature over the continental United States rose by 3°C between 11-Sep-2001 and 15-Sep-2001.

The only way to repeat this experiment -- the only way to possibly confirm it -- is to shut down all flights for four days and see if it happens again. But we have this issue. Turbojets and turbofans pump hot air into the sky. This creates contrails. They block the sun.

We have a theoretical reason why no flying would raise the average temperature. But the cost of the repeat experiment is literally millions, possibly billions of dollars.

And yet, if flying does cool the sky, and thus the planet? Can we afford to stop?

We don't know.
posted by eriko at 9:54 PM on September 29, 2014 [3 favorites]


Oh, but in good news…..

AT BOS, THE SHIPS ARE GONE AND METERING IS IN PROGRESS. THE AIRCRAFT INCIDENT AT EWR CLEARED QUICKLY AND THE AIRPORT HAS RETURNED TO NORMAL.

Well, good.
posted by eriko at 10:48 PM on September 29, 2014 [1 favorite]


> Hate MCI. Hate it. Grrrrrrr.

But, but it's so beige ...
posted by scruss at 1:23 AM on September 30, 2014


ATC is *hard*. But that tall ships delay is funny. WHAT DO YOU MEAN A GODDAMN BOAT?

I've had to give my ship's air draught to Sound VTS because I was going past Copenhagen airport. I think anything over 45m tall is significant to them - which, funnily enough, probably means a lot more cruise ships than actual Tall Ships count as ships-that-are-tall. The Drogden Channel by Copenhagen airport is pretty shallow (and not going to be made deeper because there is a road tunnel underneath it) so most of the really big cargo ships go the other way around Denmark (and under a bridge instead), but cruise ships particularly face a lot of commercial pressure to try and squeeze through the Drogden because it's a lot shorter than going all the way round (particularly if you've just called at Copenhagen), and also it's pretty so the passengers like it.

London City airport need to know about tall vessels as well, but I'm pretty sure their significant height is mostly a lot taller than 45m.
posted by Lebannen at 2:16 AM on September 30, 2014 [1 favorite]


23 racks is not a big installation, these days, even figuring in the data links. Cost for a decent geo-diverse DR install of this size is tiddly-winky bupkiss compared to the costs associated with this downtime. It was probably a metric crapton more money coming up with the protocols for a ZAU ATC ZERO situation than it would be to replicate the site.

The problem is that the ATC system is so old, notions of a geo-diverse disaster-recovery datacenter didn't figure into its original plans for redundancy, and money that should have gone to rebuilding it went up in flames in Iraq instead. (As recently as the late '90s, a major investment and trading firm had its DR facilities - in the same building. Both were destroyed by sprinklers going off during the first WTC attacks, wiping out its backups as well.)

8 years of misrule in the executive, and another 6 of foolish austerity in the face of an economic and infrastructure catastrophe, and we're 12 years behind where our transport infrastructure needs to be.
posted by Slap*Happy at 4:27 AM on September 30, 2014 [1 favorite]


The problem is that the ATC system is so old, notions of a geo-diverse disaster-recovery datacenter didn't figure into its original plans for redundancy, and money that should have gone to rebuilding it went up in flames in Iraq instead.
[citation-needed]*

Based on some back-of-the-napkin calculations and my experience in the IT and broadcasting worlds, the cost of upgrading ZAU alone to be "fully redundant" could easily exceed $200 million, and might not yield a system that's more reliable overall.

Even though $200 million is a staggeringly large sum of money, it's probably still cheaper for everybody to eat the losses in this situation, compared to making massive (and questionable) investments to build 2+ copies of everything.
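
To make that trade-off concrete, here is a rough back-of-the-napkin sketch in Python. The only figures taken from this thread are the roughly $200 million loss, the $200 million per-center upgrade guess, and the 22 centers. The 20-year equipment lifetime and the function name are assumptions for illustration, and upkeep and staffing costs are ignored entirely.

```python
def break_even_outages(upgrade_cost_per_center: float,
                       centers: int,
                       loss_per_outage: float) -> float:
    """Number of outages of this size the upgrade must prevent to pay for itself."""
    total_upgrade_cost = upgrade_cost_per_center * centers
    return total_upgrade_cost / loss_per_outage


# Figures from the thread: ~$200M per center, 22 centers, ~$200M lost per outage.
outages_needed = break_even_outages(200e6, 22, 200e6)

# Hypothetical assumption: the redundant equipment lasts 20 years.
lifetime_years = 20
outages_per_year = outages_needed / lifetime_years

print(f"Outages needed to break even: {outages_needed:.0f}")
print(f"That is {outages_per_year:.1f} per year over {lifetime_years} years")
```

With those inputs the upgrade has to prevent 22 outages of this size, or about one a year somewhere in the country, before it pays for itself. The answer is only as good as the per-center estimate: plug in a few million per center instead and the break-even drops below a single outage.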

Let's not forget that "redundancy" for ATC extends far beyond keeping the data links online. An earthquake or other natural disaster could very easily wipe out ATC for an entire region in an instant, no matter how much "redundancy" is put in place -- pilots are trained to deal with this scenario, and did so admirably in this case. The Cold War mentality that influenced the design of our present-day system still exists, and for good reason.

The manager was absolutely correct to broadcast ATC ZERO as soon as it became evident that something was wrong, and the FAA is absolutely correct to slowly reintroduce service as the equipment is gradually restored. The system is working as designed. It's much better to (safely) go offline than it is to take risks with a damaged system, given that the system is explicitly designed to be able to safely go dark in an emergency.

*This applies to much more of what's being said in this thread too.
posted by schmod at 9:13 AM on September 30, 2014 [4 favorites]


tl;dr; The modern ATC system was designed at the height of the Cold War. It was absolutely and unequivocally designed with geo-diverse disaster recovery in mind.

That being said, this design goal does not translate to "100% availability of every airport nationwide to maximize airline revenue." This was never a requirement, nor should it be, as explicit "0% downtime" requirements can very easily undermine safety.

Accidents and natural disasters happen that can affect local and regional availability -- this is an unavoidable reality, and our airways are designed to handle this.
posted by schmod at 9:21 AM on September 30, 2014 [4 favorites]


He may well be, but is there some reason you feel sort of invested in authorities calling something "terrorism" that is clearly not "terrorism" in any meaningful sense of the word?
posted by Naberius at 12:56 PM on September 29


Had this man been anything other than lily white, this would've been deemed terrorism. Can't believe I had to spell it out for you.
posted by basicchannel at 10:06 AM on September 30, 2014


That being said, this design goal does not translate to "100% availability of every airport nationwide to maximize airline revenue."

It was also originally designed when O'Hare International was by every measure the busiest airport in the world, handling almost 200 flights a day. Nowadays? ORD hit that in two hours. Well, three right now.
posted by eriko at 10:19 AM on September 30, 2014


It was absolutely and unequivocally designed with geo-diverse disaster recovery in mind.

Data center wasn't.

You're referring to geo-diverse disaster recovery in a bizarre way - it's like insisting morse code is digital communication - technically true, but irrelevant to anything. Also, this isn't even technically true, as there was a giant gaping hole in ATC coverage that caused massive disruptions in air-travel. Graceful degradation is not the same thing as redundant.

Meanwhile, in the way modern information systems are actually geo-diverse, housing 23 racks worth of gear and redundant data links for same don't cost nobody anywhere near a quarter bil, even if they have to buy it from Lockheed Martin. Maybe a few mil/year, and the lights stay on in the control tower, everyone's flight makes it on time, this is a "wow that was weird" situation rather than a "wow that was massively expensive and potentially life threatening" situation.
posted by Slap*Happy at 10:43 AM on September 30, 2014


Who says that these 23 racks are the only equipment that isn't multiply-redundant?

I agree that the racks of equipment are comparatively cheap. However, it's not quite as cheap to design truly redundant signal paths to the tower, coupled with the costs of designing, implementing, and testing/maintaining the failover mechanism.
posted by schmod at 10:53 AM on September 30, 2014


However, it's not quite as cheap to design truly redundant signal paths to the tower, coupled with the costs of designing, implementing, and testing/maintaining the failover mechanism.

Yes, it is. This is all COTS* stuff, order it from Black Box or your local politically-connected systems integrator, get a dozen or so bids from local contractors to lay the fiber. Hell, you can probably just house it off-site and use redundant ISP's and metro-E to talk to the tower (and in fact you should). It's not like standing up an exchange server, granted, but it's all well traveled ground at this point, to where vendors are tossing in freebies like monitoring and infosec services.

(* Commercial Off-The-Shelf: mass market equipment you can buy boxed and ready to go.)
posted by Slap*Happy at 11:05 AM on September 30, 2014


Slap*Happy,

I spent most of my morning overhearing the armchair aviation consultants next to me in a cafe, so let me try not to take out my frustration with them on you. That being said:

Most of this equipment, for better or worse, isn't COTS. First off, the sort of reliability desired when the system was built wasn't available with COTS stuff, and it's probably more expensive to redesign the system to use widely-available COTS equipment and still maintain the same level of safety. (When NextGen gets built, I do hope more COTS equipment is used.)

That being said, eliminating an ARTCC as a SPOF entirely (which is probably what would be necessary to avoid downtime altogether--a fire in the ARTCC could easily incapacitate the crew working there, at least) would probably require what amounts to a hot (online) spare, including keeping it staffed 24/7, which is probably more expensive than the amount of disruption caused here.
posted by thegears at 8:30 AM on October 1, 2014


Currently sitting at the Portland airport waiting for my flight to get the ATC signoff from O'Hare so we can board and take off. Already an hour behind, and I don't know if the delay will only increase as this circle gets more red.
posted by mrzarquon at 10:41 AM on October 5, 2014

