Fate: 1 Internets: 0
July 24, 2007 5:03 PM

Newsfilter: 30,000 customers in the San Francisco area lost power today at about 1:50pm PDT, in a series of power failures which knocked out a major datacenter hub: 365 Main. The hub hosts servers for many social media sites, including Technorati, Netflix, Yelp, Craigslist and all Six Apart properties, including TypePad, LiveJournal and Vox. (6A's twitter stream has updates.) More here and here. Amusingly enough, 365 Main tempted fate and issued a press release today patting themselves on the back for "two years of 100-percent uptime".
posted by zarq (80 comments total)
 
Both 6A and LiveJournal have status pages on separate servers.

Those were down for most of the afternoon.
posted by zarq at 5:05 PM on July 24, 2007


Oh, and 365 Main's website is here. ("The World's Finest Data Centers")
posted by zarq at 5:07 PM on July 24, 2007


I miss my ONTD!
posted by ThePinkSuperhero at 5:07 PM on July 24, 2007 [1 favorite]


I was wondering what was up with craigslist. I figured my boss had finally gotten around to blocking it.
posted by lekvar at 5:08 PM on July 24, 2007


Underground explosions which neatly sever the electricity without harming anyone in the blast?

My money is on some convoluted heist where the power needed to be off to bypass the alarms and cameras.

Or someone has a Dead Pool going on different companies' uptime and needed to step in for the win.

I mean, if you look at the facts, these are really the only two scenarios that make any sense.
posted by quin at 5:13 PM on July 24, 2007 [6 favorites]


So that's why ONTD has been down all day. I guess i'll stop hitting refresh then.
posted by puke & cry at 5:13 PM on July 24, 2007


At least one site is reporting that the outage was caused by a lone drunk on a rampage.
posted by zarq at 5:15 PM on July 24, 2007


I mean, if you look at the facts, these are really the only two scenarios that make any sense.

Umm, hello-- Gojira!
posted by dersins at 5:18 PM on July 24, 2007


365 Main is where our servers are. We didn't have problems.

I don't THINK we had problems. At least I didn't get an email about any problems.

Interesting.
posted by strontiumdog at 5:19 PM on July 24, 2007


It's a big gossip day- between Lindsay's arrest and Britney's meltdown with OK!, I need my ONTD.
posted by ThePinkSuperhero at 5:19 PM on July 24, 2007 [2 favorites]


I got caught in it. Had to take the stairs! It was AWFUL!
posted by wemayfreeze at 5:20 PM on July 24, 2007


That man should receive a sound thrashing, that would recompense the users that missed out on their Livejournaling.
posted by chlorus at 5:22 PM on July 24, 2007


No UPS?
posted by four panels at 5:23 PM on July 24, 2007 [1 favorite]


To ensure uptime for key tenants such as RedEnvelope, 365 Main provides modern power and cooling infrastructure. The company’s San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.

Unless of course it's Tuesday :(

I watched 365's generators appear to kick on and off during this from our office down the block; I was wondering if they'd be able to deal with the flips up and down.

It's entirely possible that 365 Main stayed online and the ISPs out of the local facility don't have their shop in order for the transport piece.
posted by iamabot at 5:26 PM on July 24, 2007


One might question the logic of building a datacenter in San Francisco 250 feet from the bay on what is entirely landfill, but then again, Level 3 has had their China Basin facility in the same general situation going for 10+ years.
posted by iamabot at 5:29 PM on July 24, 2007


An angry mob lining up outside a datacenter. (Well, the guy on the cellphone looks a little miffed.)


What will they burn effigies of? Servers? Power cords?

I really hope the drunken employee theory is right. That is too hilarious.
posted by Salmonberry at 5:33 PM on July 24, 2007


Level 3 has had their China Basin facility in the same general situation going for 10+ years.

Level 3 isn't exactly the gold standard in internet service, either.
posted by spiderwire at 5:35 PM on July 24, 2007


These datacenters go to eleven.
posted by iamabot at 5:40 PM on July 24, 2007 [6 favorites]




This may be the most currently dead links deliberately linked on Metafilter (and not deleted) ever!
posted by Joey Michaels at 5:41 PM on July 24, 2007


These datacenters go to eleven.

percent uptime?
posted by spiderwire at 5:44 PM on July 24, 2007


Breaking News: All Online Data Lost After Internet Crash! tubular, man!
posted by ZachsMind at 5:46 PM on July 24, 2007


I can't believe craigslist has been down. You just get this notice:


Error

Craigslist and many other sites are having issues at the colo facility.

Please sit tight, and try again later.

We are aware of the situation, and the happy craigslist elves are scurrying to make it better, even now.


I'm guessing the elves aren't too happy...
posted by vacapinta at 5:47 PM on July 24, 2007


four panels writes "No UPS?"

Imagine the fun if they discover their idea of backup isn't the one you have!
posted by elpapacito at 5:53 PM on July 24, 2007


"...the happy craigslist elves are scurrying to make it better..."

I've yet to ascertain a significant difference between elves and gremlins. Is there any authoritative work on this subject?
posted by ZachsMind at 5:54 PM on July 24, 2007


LiveJournal has been down all day, and I've had no idea how to talk about it.


Current Mood: ☺ confused
Current Music: David Lee Roth - Just Like Paradise

posted by Uther Bentrazor at 5:58 PM on July 24, 2007 [3 favorites]


ThePinkSuperhero: "It's a big gossip day- between Lindsay's arrest and Britney's meltdown with OK!, I need my ONTD."

puke & cry: "So that's why ONTD has been down all day. I guess i'll stop hitting refresh then."

I thought it was the convergence of these two awesome gossip events that killed it like when ANS died or when Britney shaved her head.
posted by macadamiaranch at 5:58 PM on July 24, 2007


Ah... that explains it. I was sure it was something that I was doing wrong.
posted by blaneyphoto at 5:59 PM on July 24, 2007


Yeah, I'm quite surprised they don't have power backups. How could they not?

Morons.
posted by delmoi at 6:00 PM on July 24, 2007


ZachsMind, elves create. Gremlins, as seen in this seminal documentary (warning: video, hilarity), live only to destroy.
posted by lekvar at 6:03 PM on July 24, 2007 [1 favorite]


I'm curious as to what the explanation is going to be from 365 Main, particularly given their claims of multiple redundant power, backup generators, etc. etc.

The outage occurring on the same day as their self-congratulatory press release, though ... that's just too beautiful for words.
posted by Kadin2048 at 6:03 PM on July 24, 2007


What most people who didn't experience it aren't getting is that this wasn't a simple power outage. The power went off and on some five times in an hour, I think it was. I'm not too familiar with UPS technology (though ours didn't have an issue) I can imagine this might be taxing on the equipment. Possibly the UPS wasn't properly surge protected? I dunno.
posted by matt_od at 6:07 PM on July 24, 2007


I thought it was the convergence of these two awesome gossip events that killed it like when ANS died or when Britney shaved her head.

Heh, yeah I thought about that earlier.
posted by puke & cry at 6:08 PM on July 24, 2007


Mysteriously, the only data lost was from a bunch of fan fiction sites.
posted by Joey Michaels at 6:17 PM on July 24, 2007


These datacenters go to eleven.

percent uptime?

Somewhere in a corner, Christopher Guest is crying.
posted by katillathehun at 6:19 PM on July 24, 2007


IT'S COMING FROM INSIDE THE HOUSEDATACENTER
posted by null terminated at 6:25 PM on July 24, 2007


Generally speaking, for most systems, once the generators come on there are procedures for taking them back offline. Facilities such as 365 Main condition power as it is taken from the utility, in this case PG&E. This conditioned power is what feeds the customer and facility power in the datacenter. As a benefit of conditioning power, most systems provide some amount of local battery backup, providing ample time for the backup diesel generators (I believe located on the roof at 365, based on the blue smoke coming off of it during the outages) to kick in and take the place of the mains power delivered from the utility.

It is possible, although unlikely, that 365's design was unable to handle the numerous transitions between utility provided and facility generated power. As far as UPS systems go it's not uncommon for utility power to drop off repeatedly when the utility is having issues.
posted by iamabot at 6:26 PM on July 24, 2007 [2 favorites]


Somewhere in a corner, Christopher Guest is crying.

if bad "this [x] goes to eleven" jokes had any effect on Christopher Guest, he wouldn't even have made the next reunion tour

in my defense i didn't start it
posted by spiderwire at 6:29 PM on July 24, 2007


It is possible, although unlikely, that 365's design was unable to handle the numerous transitions between utility provided and facility generated power.

Also could have been karmic overload due to that well-timed press release and their freaking corporate name. Older UPS systems reportedly have problems with hubris spikes.
posted by spiderwire at 6:32 PM on July 24, 2007 [1 favorite]


Well, that would explain why I couldn't get onto LiveJournal just a moment ago. How on Earth am I supposed to express my opinions about Deathly Hallows now?
posted by Faint of Butt at 6:32 PM on July 24, 2007


My power for the last two weeks has gone off and on at least once a day, and yesterday it went off and on 4 times. When you live in hurricane country, you just yell (insert profanity of your choice here) and get started all over again. Plus, we have a back up generator for emergencies.

Wonder who backs up the backups?
posted by misha at 6:34 PM on July 24, 2007 [1 favorite]


I'm not too familiar with UPS technology (though ours didn't have an issue) I can imagine this might be taxing on the equipment.

On real datacenter power systems, it doesn't matter. The load isn't carried directly by the incoming electrical feed, because most generators don't put out very clean power. The load is carried by an inverter, which is driven by the rectifier, the battery packs, or a combination of the two. A static switch connects the two; there's also a bypass that hooks the input power to the outputs.

I've seen several power hits on my small unit (80 kVA), and known them only by seeing the input voltage drop. Output voltage remained dead on. They can handle this -- indeed, the input voltage varies constantly.

However, it's clear that 365 screwed up. The datacenter's data sheet claimed *ten* 2.1MW CPS units, rigged N+1, so that one failing wouldn't result in a section of racks dying. 365 doesn't have batteries; they claimed that the CPS systems would hold the load, and would switch quickly enough that they didn't need the float time that a small battery bank would give them. The CPS had 60,000 gallons of fuel, in multiple tanks, so it is unlikely that fuel exhaustion was the problem.

They bet on CPS, and they lost. I'm guessing enough units failed that the output voltage dropped enough to down every machine in the data center. The lovely thing about N+1 is that you can have any one unit fail with no issues. The *bitch* of N+1 is that two failures can kill everything, when your load is high enough that you can't afford to be at N-2.

However, 1+1 is expensive. Assume they needed 18MW of power. N+1 means you need 10 2MW generators. 1+1 means you need twenty of them, rigged into ten sections, each section backed by two units. 1+1 is far more reliable, though -- to down a section, you need to kill two *specific* generators, and even if that happens, the other nine sections are still running under redundant power.

N+1 is RAID-5 -- and 365 lost two disks.
posted by eriko at 6:34 PM on July 24, 2007 [3 favorites]
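
A rough sketch of the arithmetic in eriko's comment above, for anyone who wants to poke at it. It borrows the comment's illustrative figures (ten pooled units where nine are needed, versus twenty units in ten 1+1 sections). The function names are made up for this example, and the assumption that failures hit units uniformly at random is mine, not anything from 365 Main's data sheet.

```python
from fractions import Fraction
from itertools import combinations


def n_plus_one_outage(units: int, needed: int, failed: int) -> bool:
    """Shared pool: the whole load is lost once fewer than `needed` units survive."""
    return units - failed < needed


def one_plus_one_section_loss(sections: int, failed: int) -> Fraction:
    """Chance that `failed` random unit failures leave some section with no generator.

    Each section has exactly two units; a section goes dark only if both fail.
    """
    units = [(section, side) for section in range(sections) for side in (0, 1)]
    outcomes = 0
    dark = 0
    for combo in combinations(units, failed):
        outcomes += 1
        failed_sections = [section for section, _ in combo]
        if any(failed_sections.count(s) == 2 for s in set(failed_sections)):
            dark += 1
    return Fraction(dark, outcomes)


print(n_plus_one_outage(units=10, needed=9, failed=1))   # False: one spare absorbs it
print(n_plus_one_outage(units=10, needed=9, failed=2))   # True: the pool is short
print(one_plus_one_section_loss(sections=10, failed=2))  # 1/19
```

Under those assumptions, two failures always sink the N+1 pool, while in the 1+1 layout they darken a section only about one time in nineteen, and even then the other nine sections keep running.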


As far as UPS systems go it's not uncommon for utility power to drop off repeatedly when the utility is having issues.

Yes, which is why real systems have hysteresis built in. You don't just fall back and forth, you fall over for a given period of time and then, if power is good, switch back, esp. when you're dealing with generator units, which take time to spin up and stop (and often really hate a restart while stopping.)
posted by eriko at 6:37 PM on July 24, 2007 [1 favorite]
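
The hold-off behaviour eriko describes above can be sketched in a few lines. This is a toy model, not how any particular transfer switch is programmed: the class name, the `retransfer_delay_s` parameter, and the 300-second figure in the usage lines are all assumptions made for illustration.

```python
class TransferSwitch:
    """Toy model: drop to generator at once, return to utility only after
    the utility has been continuously good for `retransfer_delay_s` seconds."""

    def __init__(self, retransfer_delay_s: float):
        self.retransfer_delay_s = retransfer_delay_s
        self.source = "utility"
        self._utility_good_since = None  # time utility power last became good

    def update(self, now_s: float, utility_ok: bool) -> str:
        if not utility_ok:
            # Any dip sends the load to the generator and restarts the clock.
            self.source = "generator"
            self._utility_good_since = None
        elif self.source == "generator":
            if self._utility_good_since is None:
                self._utility_good_since = now_s
            if now_s - self._utility_good_since >= self.retransfer_delay_s:
                self.source = "utility"
                self._utility_good_since = None
        return self.source


switch = TransferSwitch(retransfer_delay_s=300)  # hypothetical 5-minute hold-off
print(switch.update(0, True))     # utility
print(switch.update(10, False))   # generator
print(switch.update(20, True))    # generator (power is back, but not trusted yet)
print(switch.update(100, False))  # generator (second dip resets the clock)
print(switch.update(110, True))   # generator
print(switch.update(410, True))   # utility (300 s of clean power since t=110)
```

The point of the delay is that a flapping utility feed, like the five drops in an hour reported upthread, produces one transfer to generator rather than five round trips.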


... power failures were reported throughout wide swaths of the east side of San Francisco, including downtown and at PG&E's own office on Beale Street near the Ferry Building.

Oh man. Were there any Bar Exam testing sites on the east side of the city?? Nevermind; any mefites in a position to know would still be recovering from heart attacks at this point...

(I will stop mentioning the bar exam really soon now! No later than Thursday.)
posted by rkent at 6:39 PM on July 24, 2007


Apparently their system uses a flywheel to keep things going until the diesel can kick in. No batteries required, but not much safety either if your diesel fails.
posted by smackfu at 6:40 PM on July 24, 2007


Netflix had (unrelated) problems of its own today:

Netflix Reeling from Customer Losses, Site Outage
"Netflix Inc. frustrated investors and customers alike Tuesday as its stock price plunged to its lowest point in more than two years while its Web site was inaccessible most of the day because of unexplained technical problems.

...But the Web site outage was a surprising — and embarrassing — setback.

The online hub of Netflix’s rental system went down Monday evening and remained unavailable until Tuesday afternoon, locking out subscribers for more than 18 hours. Spokesman Steve Swasey attributed the outage to an unanticipated problem that he declined to describe.

The breakdown didn’t appear to be related to San Francisco power outages that were blamed for temporarily knocking out several popular Web sites, including Craigslist, Technorati, Typepad and Livejournal.

Service to Netflix’s site was finally restored around 3 p.m. PDT after Netflix’s engineers had missed several earlier estimated times for fixing the trouble.

Netflix had been in the process of updating its computers to reflect price reductions that took effect Tuesday."
posted by ericb at 6:48 PM on July 24, 2007 [2 favorites]


Also could have been karmic overload due to that well-timed press release and their freaking corporate name.
Well, with a name like "365", you'd expect them to be down for about one day every four years. Except for the beginning of a century unless the year is divisible by 400.
posted by hattifattener at 6:51 PM on July 24, 2007 [5 favorites]


Uhm... Yeah.. Like I said Lekvar, I haven't been able to ascertain a significant difference in these two species. Are gremlins really elves that went bad? In fact I'd be willing to wager that these same self-styled "craigslist" elves who are allegedly fixing the problem may have been the very ones to create it in the first place.

Are elves just gremlins that have turned over a new leaf? They don't seem to be different species of fae. I think they're either the same species with differing cultures, or they're closely related on a genetic level.

You'd think after all these generations since before the Dark Ages, we'd have more information by now about the fae. For example, there seems to be some disagreement among various resources as to whether or not pookahs are in fact six feet tall, or six inches tall. That's quite a wide window for error, don't you think? Is the Easter Bunny a pookah or not? Some reports say they're invisible yet anecdotal evidence indicates under certain conditions they can in fact be seen. And how can they look like rabbits if no one's ever seen one empirically?

Honestly, there's an entire avenue of research that's been left open for cranks and crackpots which is not being addressed by the legitimate investigative community of humanity. If we continue to ignore gremlins and pookahs and trolls and the occasional drunk leprechaun, events such as this will continue to happen unabated. I really wish someone with the proper resources would investigate this and get to the bottom of it.
posted by ZachsMind at 6:53 PM on July 24, 2007 [1 favorite]


Are elves just gremlins that have turned over a new leaf?

WTF are you yammering on about? Elves bake our cookies.
posted by ericb at 6:59 PM on July 24, 2007


And of course the press release has been yanked....
posted by phliar at 7:07 PM on July 24, 2007


ericb, thanks for that article. I hadn't realized the Netflix-Blockbuster wars had heated up so much. The article seems to paint a bit too negative a picture of Netflix's fortunes; Blockbuster is apparently gaining enough to force Netflix to lower prices, sure, but Netflix is at least operating in the black, which is more than you can say for Blockbuster.
posted by mediareport at 7:23 PM on July 24, 2007


like when ANS died or when Britney shaved her

... Thanks to the magic of monitor resolution and line breaks, for a quick second my mind went to a completely different place there.
posted by Cyrano at 7:24 PM on July 24, 2007


my mind went to a completely different place there

Five o'clock shadow and razor burn is a bad look anywhere.
posted by maxwelton at 7:35 PM on July 24, 2007


Apparently their system uses a flywheel hamster wheel

Fixed!
posted by spiderwire at 7:49 PM on July 24, 2007


Apparently their system uses a flywheel hamster wheel training wheels.

FIXED!
posted by ZachsMind at 8:08 PM on July 24, 2007


training wheels

that doesn't make any sense: you can't power a server with training wheels. those are just there to keep it from falling over.

365 main spent a lot of money on their green server farm and they'll thank you not to insult their poor hamster like that, especially after the hard day he just had.
posted by spiderwire at 8:25 PM on July 24, 2007


Apparently their system uses a flywheel hamster wheel Big Wheel.


There ya go.

posted by pupdog at 8:27 PM on July 24, 2007


Apparently their system uses a flywheel hamster wheel training wheels a wheel of fortune.

FIXED the FIX!
posted by ZachsMind at 8:28 PM on July 24, 2007


BIG WHEEL??? That doesn't make any sense either! How is the hamster supposed to reach the pedals?
posted by ZachsMind at 8:30 PM on July 24, 2007


Apparently their system uses a flywheel hamster wheel Big Wheel in the Sky which, apparently, does not keep on turnin'...
posted by QuestionableSwami at 9:00 PM on July 24, 2007


ZachsMind : Apparently their system uses a flywheel hamster wheel training wheels a wheel of fortune.

Right or wrong, we had a deal... and the law says, bust a deal, face the wheel.

FIXED OBFUSCATED the FIX!
posted by quin at 9:06 PM on July 24, 2007


Having people on Metafilter refer to ONTD breaks my brain. I mean sure, the law of averages says there have to be plenty of mefites on Livejournal. But still.
posted by pinky at 9:15 PM on July 24, 2007


I'm responsible for a television broadcast facility that requires constant power and cooling.

The outage at 365 Main is due to a failure of 365 Main's systems, not any power outage.

To provide power in a shore power outage, a system must be capable of 3 things: 1) transferring the load to an alternate power source, typically a UPS, 2) sustaining the load for a short time on an energy reserve, typically a UPS, and 3) generating on-going electricity to power the energy reserve, typically a diesel generator.

After losing shore power, all three systems need to operate, otherwise you have a power failure to the critical systems, which has happened to 365 Main.

You can take precautions like back-up generators and A & B power systems to be able to endure failure in parts of these systems, but I doubt that's what was in place.

From what I read on the data sheet, 365 Main was likely using flywheels as their energy reserve, not traditional battery powered UPS. In my professional opinion, that is a serious risk. With flywheels, you get seconds of power in which everything has to go right. Gen-set start, stable power, and then transfer switch. A traditional UPS will likely give you several minutes.

My guess is that either the flywheels ran down before the generators came on line, or some of their transfer switches/breakers blew out.
posted by Argyle at 9:35 PM on July 24, 2007 [1 favorite]
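
Argyle's point about seconds versus minutes can be put as a simple timing budget. This is a back-of-the-envelope sketch only: the class, its field names, and every number in the usage lines are hypothetical, picked to show the shape of the problem rather than taken from 365 Main's equipment.

```python
from dataclasses import dataclass


@dataclass
class FailoverTiming:
    reserve_s: float       # how long the energy reserve can hold the load
    genset_start_s: float  # time for the generator to start
    stabilize_s: float     # time for its output to become stable
    transfer_s: float      # time for the transfer switch to move the load

    def time_to_generator_s(self) -> float:
        return self.genset_start_s + self.stabilize_s + self.transfer_s

    def margin_s(self) -> float:
        """Seconds of reserve left over once the generator carries the load."""
        return self.reserve_s - self.time_to_generator_s()

    def rides_through(self) -> bool:
        return self.margin_s() >= 0


# Illustrative numbers only.
flywheel = FailoverTiming(reserve_s=15, genset_start_s=10, stabilize_s=3, transfer_s=1)
flywheel_retry = FailoverTiming(reserve_s=15, genset_start_s=20, stabilize_s=3, transfer_s=1)
battery = FailoverTiming(reserve_s=300, genset_start_s=20, stabilize_s=3, transfer_s=1)

print(flywheel.margin_s(), flywheel.rides_through())              # 1 True
print(flywheel_retry.margin_s(), flywheel_retry.rides_through())  # -9 False
print(battery.margin_s(), battery.rides_through())                # 276 True
```

With a reserve measured in seconds, a single failed start attempt eats the whole margin; with minutes of reserve, the same hiccup barely registers.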


ONTD transcends LJ, nowadays. Hell, it got a mention in Oprah Winfrey's magazine last month. It's about a nanosecond away from jumping the shark Perez Hilton.
posted by Dreama at 9:35 PM on July 24, 2007


This just reminded me that I had a livejournal.
posted by Esoquo at 9:43 PM on July 24, 2007


This just reminded me that I had a livejournal.

You left it all alone and now it's taken to writing its own emo poetry and generating artificial ennui. Bastard.
posted by spiderwire at 9:51 PM on July 24, 2007


I work in a restored 100 year old, four story building. In its previous life, it was a power station, and it is quite beautiful. Unfortunately, I am also on the team that is responsible for its evacuation in the event of a disaster.

Unlike most of the others on the team, who got pushed into it because every department needed someone to volunteer, I actually have a bit of background in risk assessment. So I wandered around and looked for the things that would fail and cause people to die.

For the most part the evac plan is pretty solid. I only found a couple of minor faults. And one big one.

The two stairwells are lit by fluorescent lights. More than enough to provide good visibility in most disaster situations. However, as I have an unpleasantly active imagination, I came up with a couple of situations where it wouldn't be enough, and I asked about them.

"So I noticed that we don't have any battery powered light backups between floors four and ground. I saw a couple of light strobes, but I don't really think that would be conducive to helping people see where they were going..."

It was explained to me that we have a generator system (We do. It's a big one.) and that would kick in if the lights went out.

To which my natural response as a cynical, pragmatic, pessimist was "And what if the disaster is someone blowing up the diesel filled generators which cause the building to be set on fire?"

It should be noted that I work in a field that seems to generate a lot of hostility, and bomb threats are not completely uncommon. My scenario is not totally unrealistic.

They completely ignored me. And as a third floor dweller, I now make it my business to have a flashlight around.

Just in case.

Big companies with lots and lots of money sometimes do really stupid things.
posted by quin at 10:04 PM on July 24, 2007


quin, I may be completely off base (and please say so if I am) but I'm under the impression that diesel doesn't combust without the application of high pressure, likely more than your storage tanks are capable of generating. So, points to you for paranoid imaginings (high praise in my book) but I think you'd probably be better off finding something else to worry about. Like clowns.
posted by lekvar at 10:28 PM on July 24, 2007 [1 favorite]


Weird, those battery-backed emergency lights are so common in commercial buildings that I'm surprised that someone that bothers having an on-site generator wouldn't have emergency lights in all the stairwells and corridors.
posted by hattifattener at 11:13 PM on July 24, 2007


It was definitely not a global failure at the data center. Our company's servers were not affected, and we have two separate cages on two different floors at 365. Just FYI.
posted by dopeypanda at 12:19 AM on July 25, 2007


Zachsmind, you are ignoring the possibility that these Craigslist elves may not be of the Faerie sort at all, but Tolkienian elves which are of a completely different sort and not at all related to gremlins. My guess is that Craigslist elves are Noldor, since they are "accounted the greatest of the Elves and all the peoples in Middle-earth in lore, warfare and crafts." [Also whom Tolkien referred to as Gnomes in some of his writings, to confuse matters even further.]

Given your LOMUN (LOw Metafilter User Number, thus indicating vast Internet experience) and the high TFN (Tolkien Fanboys to Normal) ratio among the Internet user population, I would have thought you might have considered this possibility.

For shame.
posted by moonbiter at 12:45 AM on July 25, 2007


when the power went out this afternoon, i got stuck in the elevator at my SOMA temp job for a few minutes. i was almost at ground floor so it was not scary. but i did enjoy ringing the emergency bell.
posted by lapolla at 3:07 AM on July 25, 2007


At least I'm not the only MeFite that hangs out on ONTD. I'm not sure this makes me feel better about myself, but it's certainly damn funny.
posted by saturnine at 4:08 AM on July 25, 2007


quin, I may be completely off base (and please say so if I am) but I'm under the impression that diesel doesn't combust without the application of high pressure

Bombs generate pressure. That's what makes them bombs.
posted by delmoi at 6:35 AM on July 25, 2007


365 Main has issued a press release confirming that some of their generators didn't start automatically, that they were able to start them all manually, and that they switched to generator power when the outage first hit and still remain on it until they figure out what went wrong.

They also mention that their generator config is N+2 and imply that they do load shedding when they lose generators (which is why not all the racks went down).
posted by azazello at 9:03 AM on July 25, 2007


Still sounds dubious; why were all those major sites (6A, Typepad, etc) down all day yesterday if they were only w/o power for 45 minutes? Can't imagine they don't have remote reboot switches. I could see some of them being down b/c their IT teams weren't prepared for a full hard reset, but not all of them. Most of those companies have some pretty smart cookies working for them, and it doesn't take a rocket scientist to do a remote reboot.

If it was a power surge and not an outage (as the press release implies), why 45 minutes in the first place? Why not shunt in power from the grid to let the affected customers at least attempt to restart, then switch them over to the generators as those were manually started? Doesn't scan. Any of the experts want to chime in?
posted by spiderwire at 11:21 AM on July 25, 2007


Valleywag responds here. Skip to the end for a good comment from Soma.fm about what happened.

The Valleywag article itself is kinda ridiculous. They make a half-assed attempt to defend the "drunk-assed" employee stuff, and float another story about the power outage taking out the authentication system -- causing the line outside. Of course, Data Center Knowledge cites CNet (which has their servers in 365) and says rather that the problem was that the outage screwed up the automated parking gate (strikes me as much more plausible) so everyone had to come in through the front door and get authenticated there, which caused a line.

365 Main screwed up on a monumental level, but VW are being genuine dicks.
posted by spiderwire at 11:45 AM on July 25, 2007


why were all those major sites (6A, Typepad, etc) down all day yesterday if they were only w/o power for 45 minutes?

There are a lot of boxes that must come up for a big website to work after a whole datacenter room goes down, and not all of them are under the customer's control. There may have been connectivity problems for a while afterwards. The PDUs ("remote reboot switches") don't do much if they're behind a router that hasn't come up. The big sites that stayed down all belonged to 6A, which means that perhaps no, as a company they weren't prepared for a full hard reset.

As for your last question, it's answered in the release. The SOP is to stay off the grid after the first outage or surge until PG&E says the power is stable. This prevents all kinds of problems with multiple outages. The problem yesterday was that some generators - which must come up - didn't come up, and their power wasn't available until 45 minutes later. There were at least 3 of those, too, since the redundancy is N+2.

I'm not sure how plausible soma.fm's version is - they say the flywheels may have been exhausted from a brownout by the time they had to start the generators - but this really doesn't sound believable, since those systems are engineered thoroughly to prevent that happening.
posted by azazello at 12:16 PM on July 25, 2007


There are a lot of boxes that must come up for a big website to work after a whole datacenter room goes down, and not all of them are under the customer's control.

Yeah, but 365 claims that the support infrastructure was up and running quickly -- DNS and so forth (although, I agree, that's the dubious part of the claim -- and if untrue, makes the "45 minutes" thing highly misleading).

And while you might need to get multiple boxes running to get the entire website up, it ain't hard to toss up your DNS and apache daemons to serve a page saying "sorry, we're screwed at the moment." Furthermore, even a remote reboot shouldn't be necessary. For sites that big, esp. ones based in SF like Craigslist, they had to have had techs down there within 45 minutes. Hell, one of those guys has to have a computer that could serve one page that he could have just hauled down there himself.

Also yet another reason for those companies to have redundancy -- difficult, perhaps, for data-driven sites like Technorati, Typepad, and Craigslist, but not impossible. (And in fact, kind of silly for Craigslist since they're city-by-city.)

But Gamespot, e.g., should at least have a plan whereby they can get their main content up and running somewhere.

(On that note, are all these companies keeping all their nameservers in the same place as their central infrastructure? If so, why??)

The SOP is to stay off the grid after the first outage or surge until PG&E says the power is stable. This prevents all kinds of problems with multiple outages. The problem yesterday was that some generators - which must come up - didn't come up, and their power wasn't available until 45 minutes later. There were at least 3 of those, too, since the redundancy is N+2.

Right, but I meant that they should have been able to bring the affected sites up in stages at the very least as they restarted the generators. If multiple generators failed, the only logical inference is that those generators failed to kick on as the flywheels were spinning down and didn't spin them back up to full speed before the next brownout -- which makes sense, as you wouldn't want to start your generators every time there's a slight dip in the incoming utility flow. What soma.fm is suggesting is that the threshold was set too low, and by the time the flywheels had spun down too far the generators didn't kick on to spin them back up in time.

However, I agree that this explanation isn't totally consistent. If multiple generators didn't kick in (which had to be the case, as you point out), and all the customers came up at the same time, that means all the failed generators were restarted manually -- at the same time. That, in turn, means that it apparently took 45 minutes for 365 to get up on the roof and restart those generators. And that is a problem.
posted by spiderwire at 12:49 PM on July 25, 2007


Interesting story that illustrates one vulnerability of partially-automated power failover plans: last September Verio's big Sterling VA data center lost outside power after a backhoe cut a line nearby. Their UPS cut in as planned, and they brought up their generators in time - but someone forgot to flip the big switch that isolates the datacenter's power from the grid, so when they cut in the generators, the power flowed right back up the main line and into the ground where the cable was cut. With (almost) no resistance, the generators immediately tripped their breakers, and before they could be restarted the UPS died - and our servers along with thousands of others lost power. Bringing the whole system up from scratch took hours.
posted by nicwolff at 4:56 PM on July 25, 2007 [2 favorites]




This thread has been archived and is closed to new comments