The wreck of the web.
April 22, 2011 4:01 PM

It wasn't supposed to be like this. Amazon.com's Elastic Compute Cloud (EC2) crashed yesterday, taking with it popular sites like Reddit, Quora, Foursquare, Hootsuite, Act.ly, and about 70 other sites. Amazon.com was affected, as was some functionality of the New York Times. Amazon Web Services' Health Dashboard indicates that there are still major operating disturbances.

The crash has raised serious concerns about the viability of cloud based computing, and comes on the heels of Amazon's previous announcement of a cloud-based music service, Amazon Cloud Player (previously). Still, companies have been taking the outage in stride. Quora's error message is emblematic of the relationship between start-ups and the cloud: “We’d point fingers, but we wouldn’t be where we are today without EC2.” If you're wondering what they mean by that, Booz Allen's The Economics of Cloud Computing is a good place to start.

posted by 2bucksplus (135 comments total) 20 users marked this as a favorite

There is also some speculation that the continued outage of Sony's PlayStation Network is possibly a result of the EC2 crash. While unconfirmed, it would suggest that even large companies are drawn in by the siren song of the cloud. And this also means I can't play anything on my $300 paperweight or even watch Netflix at the moment.
posted by 2bucksplus at 4:06 PM on April 22, 2011 [2 favorites]

This was a triumph. I'm making a note here: HUGE SUCCESS.
posted by crunchland at 4:06 PM on April 22, 2011 [61 favorites]

The forecast is cloudy with a 100% chance of crashing thunder.
posted by It's Raining Florence Henderson at 4:07 PM on April 22, 2011 [4 favorites]

To quote someone from twitter, if you have a hostname, it's not cloud computing. EC2 is just a very well-run virtualized hosting platform.
posted by GuyZero at 4:07 PM on April 22, 2011 [3 favorites]

GuyZero, is this a view that you agree with?
posted by kuatto at 4:09 PM on April 22, 2011

Amazon crash reveals 'cloud' computing actually based on data centers (not the onion)
posted by Skorgu at 4:10 PM on April 22, 2011 [14 favorites]

Sucks, but isn't Amazon's position that deployment to a single zone is 99.95% available? Netflix is deployed to all 3 zones, so they were not affected.

Guess deployment to 1 zone does not equal failover capability. Just like if your company runs a single data center you can get wiped out if there is a natural disaster.
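
For a rough sense of the numbers, here is a minimal Python sketch. It takes the 99.95% figure at face value as a per-zone number and, more importantly, assumes zones fail independently of each other, which is an idealization and not something Amazon's SLA states. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, zones: int) -> float:
    """Availability when the service stays up as long as at least one zone is up.

    Assumes zone failures are independent (an idealization).
    """
    return 1.0 - (1.0 - availability) ** zones


if __name__ == "__main__":
    single = 0.9995
    print(f"1 zone:  {downtime_hours_per_year(single):.2f} hours/year")
    three = combined_availability(single, 3)
    print(f"3 zones: {downtime_hours_per_year(three):.8f} hours/year")
```

At 99.95%, one zone works out to about 4.4 hours of downtime a year. The multi-zone number only looks that good on paper if the zones really do fail independently.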
posted by Ad hominem at 4:11 PM on April 22, 2011 [3 favorites]

Hey. You. Get off of my cloud. And by "you," I mean everybody.
posted by It's Raining Florence Henderson at 4:12 PM on April 22, 2011 [7 favorites]

It's unfortunate that this outage was so severe, so far-reaching, and has taken so long to resolve. A day is an eternity for plenty of businesses to be knocked offline. But too many companies will simply shake it off and keep on using EC2. Because the economics make too much sense. No one offers quite such competitive pricing for instantly scaled commodity computing, let alone with such a powerful software stack. The other providers are more expensive and generally less robust.

Outages will happen. AWS is supposed to be engineered to prevent single points of failure. This hasn't always worked as we'd like; a corrupt bit propagated through the "gossip protocol" they use to keep servers up to date took down S3 a few years ago. But we're still miles and miles ahead of having to manage your own infrastructure, getting crushed by lots of traffic and being unable to scale, and naturally STILL being prone to serious hardware failures. At least with commodity hardware, your hardware is completely covered. You pay nothing for it beyond the time you utilize it. The economics are real, far-reaching, and can take into account very occasional downtime and still win every single time for nearly all the businesses utilizing a system like EC2.

The postmortem will be interesting, and the biggest question will be "what are you doing to ensure this never happens again". But cloud computing at this (or any) scale is uncharted territory with new programming and technology and failures will still occur. We'll just get better and better at it and reduce their potential impact as best as possible. And thousands of companies will continue to gladly use an amazing system that provides them with bandwidth and capacity in an instant, and charges them only down to the hour.
posted by disillusioned at 4:12 PM on April 22, 2011 [17 favorites]

yes...cloud....um...it's called hosting...

tech novelty speak and marketing is odd groupthink for supposedly independent thinkers.
posted by lslelel at 4:15 PM on April 22, 2011 [6 favorites]

The crash has raised serious concerns about the viability of cloud based computing

With whom? I haven't seen anyone do better with their own setup. The very fact that it's only had one major issue in four or five years is pretty good evidence that it works just fine, and they have a much better record than Rackspace.

Shipping multiple machines to multiple data centers on multiple backbones and then syncing them with your own managed solution is a hell of a lot more expensive and a lot less reliable than getting a slice of virtualized data center.
posted by notion at 4:15 PM on April 22, 2011 [2 favorites]

I've been thinking of moving a project to S3/EC2, but this does make me pause. I don't think our current server has been down for anything other than a reboot in the past 10 years. The big thing you gain from Amazon is bandwidth and scalability. I guess you lose out on reliability.
posted by bhnyc at 4:16 PM on April 22, 2011

As someone who has taken a certain amount of shit for not wholeheartedly gulping the Cloud Kool-Aid (metaphor soup, anybody?) in the last couple of years, I acknowledge that there's a measure of schadenfreude to be had here.

That said, anyone who is terribly surprised by this has not been paying much attention, or at least has probably not actually used the systems in question. As a friend of mine observed this morning, sooner or later you're still talking about physical servers in racks in buildings. Things will break because things always, always, always break.

I'm not convinced this really says much about the viability of "cloud based computing" that thoughtful nerds didn't already understand, but I guess it does serve to highlight the risks of relying on a single organization or a too-tightly-interwoven infrastructure. It might give some people good, concrete reasons to restructure their reliance on service providers like Amazon, but I doubt it's going to slow the rush to commodified, easily scaled hosting all that much. It just makes too much economic sense.
posted by brennen at 4:17 PM on April 22, 2011 [8 favorites]

Someone at my office was poking fun at Amazon's outage. The only response I could offer was, dude, at least Amazon's uptime is good enough that when they go down, it's news.
posted by mullingitover at 4:18 PM on April 22, 2011 [27 favorites]

Oh goody. As a semi-techy-but-nowhere-near-the-techiest-folks-here member, I'm really looking forward to what I'm gonna learn here.
posted by nevercalm at 4:20 PM on April 22, 2011

I'm really looking forward to what I'm gonna learn here

The secret to the cloud pun is to try to think of phrases with the word "cloud" in them.
posted by It's Raining Florence Henderson at 4:22 PM on April 22, 2011 [7 favorites]

To quote someone from twitter, if you have a hostname, it's not cloud computing. EC2 is just a very well-run virtualized hosting platform.

Uh, that's totally backwards. When cloud computing was first coined it meant virtualized hosting. Then later it got 'degraded' to mean, basically "any software on the web". Essentially google docs, salesforce.com and lots of other services that already existed before EC2.

The thing that makes EC2 different from previous virtualization services was that you could request computer time automatically via an API, rather than going in manually and setting it up. That's a huge difference in the way we use computers. And on top of that Amazon.com sells by the hour. Previously you needed to pay for a whole month, or so, to get a virtual machine. With Cloud Computing (per the original definition) you can burst to as many computers as you need on demand, automatically.

Not only has this 'new' (and totally meaningless) definition become popular, people are now saying that people using the correct definition are wrong? Really? Ugh.
posted by delmoi at 4:23 PM on April 22, 2011 [18 favorites]

MetaFilter: the cloud.
posted by localhuman at 4:24 PM on April 22, 2011

Reminds me a little of when Katrina took out New Orleans and knocked out Something Awful's connection to its servers.
posted by ZeusHumms at 4:24 PM on April 22, 2011

> The big thing you gain from Amazon is bandwidth and scalability. I guess you lose out on reliability.

Well, it's luck that your current server hasn't had any major problems happen to it. Just as it was luck for Amazon that their services didn't have any major disruptions until just now.

I guess the problem goes to: do you want to be responsible for your mistakes or the mistakes other people may make?
posted by mrzarquon at 4:28 PM on April 22, 2011

at least Amazon's uptime is good enough that when they go down, it's news

Gotta hand it to Amazon's developers. IBM tried to make computing a utility, but Amazon actually did it.
posted by Blazecock Pileon at 4:29 PM on April 22, 2011 [14 favorites]

It's interesting how different companies even handled moving to AWS and seeming instability. Netflix adopted a "fail constantly" strategy, and has developed technology to recover from instability in EBS. Whereas Reddit has been plagued with problems for a long time.
posted by Ad hominem at 4:30 PM on April 22, 2011 [3 favorites]

I don't think our current server has been down for anything other than a reboot in the past 10 years. The big thing you gain from Amazon is bandwidth and scalability. I guess you lose out on reliability.

I have an EC2 'micro' instance that I just leave running and it wasn't affected. I'm just using it as a development machine, though, so there's no load and I don't think I used it during the outage. It was running in the affected service area.

The problem Amazon was having was with its disk storage system EBS. Performance was seriously degraded, as far as I know. It wasn't even totally offline.

The other thing is that this only affected one data center. Amazon lets you run instances in multiple data centers around the world (Called 'availability zones') but you have to pay for bandwidth between them. If you're really worried about uptime, you can run instances in multiple locations. The problem is if you have a virtual hard disk, you have to access that disk from a machine in the same zone.
posted by delmoi at 4:30 PM on April 22, 2011

disturbances

A communications disruption can mean only one thing.
posted by obiwanwasabi at 4:31 PM on April 22, 2011 [2 favorites]

Uh, that's totally backwards. When cloud computing was first coined it meant virtualized hosting. Then later it got 'degraded' to mean, basically "any software on the web". Essentially google docs, salesforce.com and lots of other services that already existed before EC2.

So when you store a doc in google docs it goes to more than one data center and is stored on several actual drives within each DC. All you have for a google doc is a URL. That's cloud computing.

And no, cloud computing was not coined for automated virtualized hosting. The term existed before both SaaS and what EC2/S3 does as a generic network computing term.

But it's a matter of hair-splitting, true. Everyone thinks their cloud is the only true cloud.
posted by GuyZero at 4:31 PM on April 22, 2011 [2 favorites]

Netflix has the right idea.

The difference between 'cloud' and 'virtual hosting' is twofold: capacity and api. If you can scale up from one instance to a hundred with one API call in minutes it's (IMNSHO) 'cloud'. If a human has to interact with a billing system and enter credit card details or call a CSR it's just hosting.

I know of several very high profile companies that scaled their way past very large DDoS attacks simply by being based in EC2, something that's very difficult to do with standard hosting, datacenter ops, etc. You can do white-label or private cloud services of course but then you're taking on the risk of all that spare capacity.

Joyent has some commentary (note they are a direct competitor to AWS).
posted by Skorgu at 4:32 PM on April 22, 2011

There really isn't that much experience in running systems at this scale. Engineer all you want, test the crap out of it, but until we have a decade or two of operational experience, there will be periods of downtime.
posted by humanfont at 4:33 PM on April 22, 2011 [6 favorites]

Amazon lets you run instances in multiple data centers around the world (Called 'availability zones') but you have to pay for bandwidth between them.

That's not my understanding. Availability zones are still within a physical DC. It's meant to keep you insulated from operators taking things like transformers offline, etc. where you have to take down a huge block of machines.
posted by GuyZero at 4:33 PM on April 22, 2011 [1 favorite]

Putting aside all the schadenfreude for a moment, I have some real concerns from a business perspective.

According to TheNextWeb (and others), the outage took down Foursquare, Quora, Reddit, Hootsuite, Cotweet, Paper.li, Mobypicture, Flavors.me, About.me and more.

While I understand that the free versions of those services would be out, I'm quite curious why many of the companies involved didn't plan ahead with redundancies and backups for the Pro or Enterprise versions. For example:

Professionally, my company is evaluating CoTweet Enterprise and Hootsuite Pro, and this outage has me asking serious questions about reliability.
In my personal life, I use Flavors.me and actually pay for the service.

Please don't misunderstand me. I'm not asking for 100% reliability. After all, such a thing doesn't exist. When you're dealing with a free service, you get what you pay for. Outages happen.

But when you're paying for an Enterprise version of a software tool -- some of which cost upward of $1500/seat -- is it unreasonable to expect there to be a Plan B to keep service flowing?
posted by zooropa at 4:34 PM on April 22, 2011

Gotta hand it to Amazon's developers. IBM tried to make computing a utility, but Amazon actually did it.

In fairness, Sun also failed at doing this. Credit where credit's due.
posted by GuyZero at 4:34 PM on April 22, 2011 [1 favorite]

Everyone thinks their cloud is the only true cloud

The cloud is nebulas?
posted by It's Raining Florence Henderson at 4:35 PM on April 22, 2011 [1 favorite]

Yeah people are confusing Amazon's specific availability zones with a more generic concept of being able to lose a given chunk of your operation in its entirety. Being in us-east-1, us-east-2, and us-east-3 didn't seem to have helped in this instance, at least according to what I've read.

Netflix (and others) likely split ops between us east and us west at the least. That costs lots (and lots) of money but you get what you pay for. The bigger a single point of failure you engineer into your system the harder it will be to recover when it fails. If Amazon's VA datacenter is your SPOF you're discovering that now.
posted by Skorgu at 4:37 PM on April 22, 2011 [2 favorites]

As a number of other writers have noted, the shaky issue here is that Amazon's infrastructure has always been divided in a number of ways: regions like the northeast and the southwest and so on, and availability zones. The plan recommended by Amazon has always been to spread sites that need redundancy across multiple 'Availability zones', as those should never go down simultaneously.

Data transfer between different physical regions is really expensive, so very few people set up their systems to do that: instead they spread their instances across availability zones, where in-region traffic is cheap. Amazon's implied promise (not a legal one, but definitely what they tell you when they explain how to "do it right") is that the Availability Zones are sufficient to protect you from Amazon's own downtime. They're isolated physically from each other, they're on separate networks, and so on.

One of the problematic things about this outage is that it has spanned many availability zones, and has completely knocked out sites that did what Amazon said they should to protect themselves. Obviously, this happens with all hosting companies sometimes. The problem is that use of cloud infrastructure as primary hosting is a relatively new practice, and the people most vulnerable to it (startups relying on EC2 for scaling, say) are the ones that have the least ability to invest in a "trust no one" system with an additional layer of physical redundancy.

The actual downtime is not really much worse than when other major infrastructure providers have been hit by natural disasters. But it does mean that "Put it in the cloud!" cannot be treated as a sure solution to any kinds of availability concerns. Just as traditional infrastructure needs redundancy, cloud infrastructure needs complex (and expensive) cross-region, and sometimes cross-provider redundancy. That's no shock for the greybeards, but it's definitely a disappointment for those who'd hoped that shoestring bootstraps could cloud their way to perfect uptime.
posted by verb at 4:37 PM on April 22, 2011 [11 favorites]

the outage took down Foursquare, Quora, Reddit, Hootsuite, Cotweet, Paper.li, Mobypicture, Flavors.me, About.me...

the companies involved didn't plan ahead with redundancies and backups for the Pro or Enterprise versions

No offense, but the notion of "enterprise" versions of any of these products is pretty funny.

An "enterprise" product is one where a lot of money is lost when it goes down. Like a billing database.

The outage of all these sites probably made most companies a lot of money in terms of improved worker productivity. The net effect of this outage was economically positive on the balance.
posted by GuyZero at 4:38 PM on April 22, 2011 [9 favorites]

Can we all stop fucking saying "cloud" now?

I work in IT (security) and I swear to god I will punch the next project manager that says "IN THE CLOUD."
posted by Threeway Handshake at 4:39 PM on April 22, 2011 [23 favorites]

Cue IBM: "The cloud is secure AND dependable. Get 5 9s with IBM cloud services."

Yeah. Right.
posted by clvrmnky at 4:40 PM on April 22, 2011

But it does mean that "Put it in the cloud!" cannot be treated as a sure solution to any kinds of availability concerns.

If you outsource systems administration and operations to the lowest bidder, guess what happens.
posted by GuyZero at 4:40 PM on April 22, 2011 [3 favorites]

cloud their way to perfect uptime

posted by verb

Heh. eponysterical.
posted by GuyZero at 4:41 PM on April 22, 2011 [7 favorites]

An "enterprise" product is one where a lot of money is lost when it goes down. Like a billing database.

Stop trolling.
An enterprise system is one that is scalable and reliable. One that has failure redundancy so that you can put important applications or data that are mission critical on it.
posted by Threeway Handshake at 4:42 PM on April 22, 2011

This is a big deal and a large failure of the EC2 infrastructure to be sure, but what happened here is very different from EC2 going down entirely. There has been a problem in one of four availability zones (independent portions of the infrastructure within a single region) within one of five regions. For a much shorter time yesterday, there were some less major problems in the other availability zones of that region (us-east-1, hosted in data centers in Northern Virginia). All the other regions continued to work fine, and the rest of us-east-1 has been fine except for that several-hour period yesterday.

The point here is that cloud computing is not some kind of magic bullet that isolates you from the real world realities of system failures. Stuff breaks whether it's in the cloud or your own leased cage in a colo somewhere. One way or another, if your business can't handle downtime, you have to design your system to be fault-tolerant and have sufficient redundancy to handle these types of failures. That's often a fairly difficult engineering task because you have to deal with keeping everything in sync and working together. In other words, your application can't just know to connect to mysql on 10.0.2.56; it has to know to connect to a mysql cluster that is distributed across multiple data centers (or some other kind of failover system). This is all doable, but it generally takes a lot more effort to design systems this way because you have to consider your architecture as a set of independent services and then build redundancy into the design of each system. Ad hominem's comment about the Netflix blog post is exactly right: you have to design these systems with the assumption that anything and everything can fail at any moment and then aggressively test your infrastructure to make sure it is as redundant as you think it is.
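
A minimal Python sketch of the "don't hard-code 10.0.2.56" idea: the application holds a list of endpoints and tries each one until a connection succeeds. The name `connect_with_failover`, the second address, and the `connect` callable are all hypothetical stand-ins for whatever your database driver provides. This only covers picking a live endpoint; keeping the data in sync across machines is the hard part and is not shown.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Conn = TypeVar("Conn")
Endpoint = Tuple[str, int]  # (host, port)


def connect_with_failover(
    endpoints: Iterable[Endpoint],
    connect: Callable[[str, int], Conn],
) -> Conn:
    """Try each endpoint in order and return the first successful connection.

    Raises ConnectionError if every endpoint fails.
    """
    errors = []
    for host, port in endpoints:
        try:
            return connect(host, port)
        except OSError as exc:  # connection errors and socket timeouts are OSErrors
            errors.append(f"{host}:{port} -> {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Fake driver for demonstration: the primary is "down", the replica answers.
    def fake_connect(host: str, port: int) -> str:
        if host == "10.0.2.56":
            raise ConnectionRefusedError("primary is down")
        return f"connection to {host}:{port}"

    # Hypothetical addresses: a primary and a replica in another data center.
    endpoints = [("10.0.2.56", 3306), ("10.1.7.12", 3306)]
    print(connect_with_failover(endpoints, fake_connect))
```

Passing the connect function in as a parameter keeps the failover logic independent of any particular driver, and makes it easy to test the "primary is gone" case without actually taking anything down.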

The catch is that this is often a difficult task; it's not all that obvious how to spread a database across multiple machines, especially when some of those machines are in another country and data transfer is comparably slow. It's even more difficult when you can't sacrifice performance for reliability. Companies can invest more resources in this kind of hardening, but that comes at the expense of adding new features and simply keeping up with demand, which is what most startups are struggling to do as fast as possible.

The irony here of sorts is that one of the main advantages of cloud computing is that it actually makes it easier to build fault tolerant systems. Want to set up five (virtual) web servers behind a load balancer? Need a file server in the US with a backup machine in Singapore? You can do these things in a couple of commands with Amazon Web Services and pay very little for the privilege. That kind of thing would have cost serious money before AWS (Amazon Web Services). If something is overloaded, you can add more capacity by starting more instances or moving your services to a more powerful instance in less than five minutes. If an entire data center is destroyed in an earthquake, you can start up machines on the other side of the world at the drop of a hat. Amazon also has tools like Simple Queue Service and Relational Database Service that developers can use to more easily develop reliable systems.
posted by zachlipton at 4:43 PM on April 22, 2011 [9 favorites]

An enterprise system is one that is scalable and reliable

[citation needed]

Sound engineering can make all but the most pathological technologies scalable and reliable if enough time, effort and money is pumped into it.

Blowing your engineering budget on systems sold as 'enterprise' without understanding how to use them is a great way to, well, blow your engineering budget I suppose.
posted by Skorgu at 4:47 PM on April 22, 2011 [2 favorites]

Thank goodness for MetaFilter running its own system and not relying on a third party for its hosting and online presence.
posted by hippybear at 4:50 PM on April 22, 2011

The point here is that cloud computing is not some kind of magic bullet that isolates you from the real world realities of system failures. Stuff breaks whether it's in the cloud or your own leased cage in a colo somewhere.

Bingo.

One way or another, if your business can't handle downtime, you have to design your system to be fault-tolerant and have sufficient redundancy to handle these types of failures.

Well, it is very important to note that many developers who have been following Amazon's recommendations on how to make their systems reliable on EC2 were still crippled. This isn't to say that they couldn't have gone farther -- spreading their systems over multiple regions, say -- but the multi-availability-zone problems yesterday were a real wakeup call.
posted by verb at 4:50 PM on April 22, 2011 [3 favorites]

As others have pointed out, everything fails at some point. It's just a question of how much failure you can accept, and to design around that.

I work with a team that looks after server and application infrastructure for the management of an essential utility service. The systems are massively redundant and spread over multiple sites, and we target no more than 2.6 hours of outages a year. And that requires a huge amount of effort in design and administration.

Hosting / cloud solutions, to reach extreme levels of availability, will have to change their charging models in order to recoup the considerable effort it takes to maintain availability. And that's not going to be attractive to the service providers or their target market.
posted by KirkpatrickMac at 4:53 PM on April 22, 2011 [1 favorite]

Stop trolling. An enterprise system is one that is scalable and reliable. One that has failure redundancy so that you can put important applications or data that are mission critical on it.

Threeway Handshake > Thanks for that. There are valid reasons some of these services can have an Enterprise or Pro version. Hootsuite and CoTweet, for example, are charging a lot of money promising that their services enable nervous IT departments to rest easy on the idea they don't have to let loose a ton of login credentials into the wild. Instead of giving out usernames and passwords to Twitter, Facebook, et al., they give out usernames and passwords to CoTweet or Hootsuite and can keep the others secure.

That's a valid business model, and many companies are paying for it. But I'm still left wondering how these companies get away promising 99.99% uptime and reliability when they haven't planned for redundancy and backups for their Pro versions.
posted by zooropa at 4:53 PM on April 22, 2011

So when you store a doc in google docs it goes to more than one data center and is stored on several actual drives within each DC. All you have for a google doc is a URL. That's cloud computing.

No, that's just web based software. It's no different than it was 10 years ago. The fact that google docs runs on a massive cluster isn't really relevant. The same setup (for far fewer users) could be done on a single server. It's also not really computing in the sense of running your own software. When you use google docs or another web app, you're just using software other people wrote.

If you run your own software, obviously you are going to want to have your own hostname to put it on, even if it's just an alias.

That's not my understanding. Availability zones are still within a physical DC. It's meant to keep you insulated from operators taking things like transformers offline, etc. where you have to take down a huge block of machines.

Here's what Amazon says:

Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries. The Amazon EC2 Service Level Agreement commitment is 99.95% availability for each Amazon EC2 Region. Amazon EC2 is currently available in five regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo).

So I guess the highest level in the hierarchy is the region but the Availability Zones should also be isolated from each other as well. But I guess they weren't in this case, which is weird. I wonder how that happened.
posted by delmoi at 4:57 PM on April 22, 2011 [2 favorites]

That kind of thing would have cost serious money before AWS (Amazon Web Services)

Yeah I meant regions, I think netflix is in Virginia, Ireland and Singapore?

All in all even with the downtime I would trust Amazon over myself.
posted by Ad hominem at 4:58 PM on April 22, 2011

Can we all stop fucking saying "cloud" now?

Remember: Clouds are made of vapour.
posted by eriko at 5:00 PM on April 22, 2011 [12 favorites]

[citation needed]

http://tinyurl.com/438e2bw

Sound engineering can make all but the most pathological technologies scalable and reliable if enough time, effort and money is pumped into it.

Right. And if I were to go out, buy a Ferrari F1 racecar, grind out its gearbox, then immediately drive it on the street and into a telephone pole, it doesn't mean that the F1 racecar isn't a professional racecar. It means that I was doing it wrong.

Enterprise-grade products, while not exactly a thing that one can precisely define in a dictionary, represent what I said.
posted by Threeway Handshake at 5:00 PM on April 22, 2011

Based on the ratio of how fascinated I am by this thread, to how much I understand of it, this is the best thread in the history of Metafilter.
posted by Horace Rumpole at 5:01 PM on April 22, 2011 [7 favorites]

What you gain with these services is convenience, functionality, cost savings, and (usually) scalability and reliability. What you lose is transparency and control.

It's not always bad. It's kind of like how you get stuck in highway traffic and decide to take the backroads, and it's just as slow, but you'll never know if you should have just stuck with the highway. AWS is the highway.

I don't see a lot changing except PHBs demanding more justification for using this cloud-thingy and engineers doing a bit more homework on their failover strategy.
posted by RobotVoodooPower at 5:03 PM on April 22, 2011

For those who are interested in this sort of thing, Amazon has a whitepaper (PDF) on Building Fault Tolerant Applications for AWS that might be a good read.

The real question here is going to be what happened to cause this failure across availability zones in the us-east-1 region. Amazon makes it far easier and cheaper to build systems that are redundant across availability zones than systems that are redundant across multiple regions. Their promise was basically that a single availability zone might fail, but multiple availability zones in the same region shouldn't fail absent a major natural disaster or similar calamity. A lot of businesses could live with that level of redundancy. Clearly, the four availability zones weren't fully isolated to survive whatever networking problem occurred here, which suggests that multi-region redundancy is more important than many previously thought.

Finally, it's important to note that Amazon has been highly aggressive at moving their own systems over to Amazon Web Services. (disclaimer: I was an intern there a couple of years ago.) Amazon has always been really big on building fault-tolerant systems internally so their development model has been more suited to this, but the fact is that Amazon.com is the biggest online retailer in the world and they are hosting major portions of their site on the exact same infrastructure that they are selling as AWS. It's in their interest to get this stuff right, not just because customers may lose trust in their web service business if things go down, but because they lose enormous sums of money in lost sales every single minute that their website isn't fully available.
posted by zachlipton at 5:08 PM on April 22, 2011 [3 favorites]

The important question: will this event have any impact on the frequency of those annoying Microsoft "To the cloud" commercials?
posted by fredludd at 5:08 PM on April 22, 2011 [2 favorites]

I have two RDS database server instances in Amazon's us-east-1a zone that have been "rebooting" for more than 28 hours. Fortunately, I didn't have many problems because I don't trust anything with an on/off switch (virtual or physical) and my needs are small enough that I can afford to be paranoid. That's the key. Reddit, which is currently being starved to death by Conde Nast overlords, can't afford multi-region redundancy and they're still struggling to get back online.
posted by foggy out there now at 5:09 PM on April 22, 2011

Yeah, Jedberg is still struggling with EBS.
posted by Ad hominem at 5:25 PM on April 22, 2011

Amazon denies Skynet involvement in cloud outage: Claims genocidal AI not to blame for website downtime
posted by homunculus at 5:25 PM on April 22, 2011 [1 favorite]

http://tinyurl.com/438e2bw

Oh snap, an obfuscated link to LMGTFY! You showed him!
posted by inigo2 at 5:25 PM on April 22, 2011 [1 favorite]

ColdFusion PTSD!
posted by stratastar at 5:31 PM on April 22, 2011 [1 favorite]

From my perspective (I work for a company that has deployed the two largest applications of their kind to AWS) this is less a failure of technology and more a failure of marketing. The fundamental problem here is that in order to fully take advantage of a computing platform such as EC2, you cannot just take your existing application, install it, and call it a day. Some above mentioned that Netflix suffered degradation rather than outage because of their "fail constantly" approach, and this is illustrative of my point. Applications which are designed along traditional lines, such as reddit (which experienced ~36 hours of downtime) or even those which are designed along traditional architectural patterns such as n-tier, are simply not suited to take advantage of what the platform has to offer.

(In fairness, one of our applications on AWS experienced downtime, however, this was due to a misconfiguration in the disaster recovery system. The other application degraded but did not go offline. You've heard of both of them, but I can't name names.)
posted by feloniousmonk at 5:35 PM on April 22, 2011 [2 favorites]

Um, your lmgtfy (really?) kind of proves my point. Number one paid hit for "enterprise hosting" is Rackspace who had some rather public downtime a while back. Splurging on "Enterprise" hosting didn't help Yelp. Danger was brought down by the failure of their "Enterprise" SAN. Oracle (as Enterprise as it gets) took down Chase.

I also don't think the F1 analogy does your argument any credit either. There are some circumstances where absolute speed is what counts but the vast majority of real world IT is more concerned with throughput. Latency is important but once it's below a certain key value making it faster is not worth much if there are cheaper and easier ways of increasing throughput.

An F1 car may very well be the fastest thing on four wheels. But it requires three trailers of specialized equipment, hours of care and feeding and ludicrously expensive parts and consumables. And a ten year old station wagon full of backup tapes is going to comprehensively kick its ass in throughput.

The word 'Enterprise' on a piece of software, ISV, SaaS, PaaS, IaaS, or any other aaS provider means precisely nothing technical.
posted by Skorgu at 5:38 PM on April 22, 2011 [1 favorite]

i guess it's a good thing that this crash has raised concerns about the viability of cloud based computing, since the constant warnings of those who see the cloud as something other than just the newest misunderstood tech buzzword have been roundly unheeded.
posted by radiosilents at 5:38 PM on April 22, 2011

You've heard of both of them, but I can't name names.)

"Which web site do you work for?"
"A major one."

Sorry.

Yes -- the really interesting thing here is what the hell happened to span availability zones. The little press pieces keep talking about an RDBS issue -- I'd hate to think at some point all of those zones share the same synced hard drive, because, err, well, I know they have to share data, but wouldn't that enable bad data to just leapfrog around the zones?

Can't wait for the autopsy on this...
posted by cavalier at 5:40 PM on April 22, 2011 [2 favorites]

Yeah, kind of dickish, but I don't want to get in trouble. I personally don't give a shit but some people really do, much to my consternation.
posted by feloniousmonk at 5:42 PM on April 22, 2011 [1 favorite]

An enterprise system is one that is scalable and reliable

No, "enterprise" means it costs more. It's a marketing term, not a technical term.
posted by Zetetics at 5:45 PM on April 22, 2011 [6 favorites]

At any rate, I think what actually happened boils down to EBS being an over sold product in general. Of course, we won't really understand the whole scenario until/if AWS decides to tell us, but I think that Joyent blog way upthread is probably pretty close with their suggestion that it is a congestive failure.

In case that Joyent blog was too dense, basically the problem is analogous to what happens in grocery stores in a big city that's unused to, say, snow, that gets a major snowfall forecast. Normally these grocery stores can serve the population without a problem, but when everyone shows up at once wanting to buy as much as possible it gets crazy fast. In this case, instead of blizzards and gallon jugs of water, what happened is EBS disks started tanking and that caused automated systems to try and request more of them, and that caused further failures.
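
A toy model of that kind of congestive failure, as a sketch only: the capacity, demand, and `retry_factor` numbers are invented for illustration, and nothing here reflects how EBS actually works internally. Each request that cannot be served comes back next tick, possibly multiplied by automated recovery.

```python
def simulate(capacity: int, base_demand: int, spike: int,
             retry_factor: int, ticks: int) -> list[int]:
    """Return the number of failed requests at each tick.

    A one-off spike arrives at tick 0. Every failed request generates
    retry_factor new requests on the following tick (hypothetical model).
    """
    failed = 0
    history = []
    for t in range(ticks):
        demand = spike if t == 0 else base_demand
        requests = demand + retry_factor * failed
        failed = max(0, requests - capacity)
        history.append(failed)
    return history


# Plain retries: the backlog drains because base demand is under capacity.
print(simulate(capacity=10, base_demand=8, spike=20, retry_factor=1, ticks=8))
# Each failure triggers two new requests: the backlog never recovers.
print(simulate(capacity=10, base_demand=8, spike=20, retry_factor=2, ticks=5))
```

The same spike is harmless in the first run and runaway in the second. The only difference is how much extra work each failure creates.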

You might also think about it in the same way as a "flash crash" like the one that took out the stock market awhile back.
posted by feloniousmonk at 5:46 PM on April 22, 2011

No, no, no. "Enterprise" does mean something when applied to a piece of software. It means consultants who are paid $450 an hour can champion the installation of a product that is promised to be up and running in nine months and is entering alpha in twenty four at six times the original estimated cost all while shouting down local IT who has said all along that the open source solution would be just fine, we'll just need to sponsor a few coders and we'll get what we need but OF COURSE no one listens to local IT who don't have the broad-market experience that the consultants do, and besides those consultants absolutely need more hookers and blow and trips to Cancun for "training retreats" hosted by the enterprise software vendor. GET IT RIGHT.
posted by seanmpuckett at 5:52 PM on April 22, 2011 [15 favorites]

Well I'm not going to beat this horse much more, but I'm going to stick with my own definitions of "cloud" and "enterprise".

Many companies simply use "enterprise" to signify that you need to pay more for that particular version of whatever it is they sell, but I'll stick to using it for systems that actually matter to their users. I'm sure HootSuite is a fine product but $1500 a month is not a "lot of money" and when it goes down no one else's business shuts down. If salesforce goes down, your sales reps are pretty much sitting on their hands.

As for "cloud", the difference between me running apache and "the cloud" is one of scale and reliability. You can't fill up google docs. Upload as many docs as you want, you won't fill it. You may hit your personal upload limit, but it won't be full. No one needs to even ask youtube to spin up an additional instance - when Warner uploads videos to Vevo they just do it. When 12 million people watch Rebecca Black it just works. Additional capacity happens transparently. You still need to ask for more EC2 capacity. SQS is closer to what I'd consider a cloud service vs EBS or EC2.

At any rate I'll reiterate that this is hairsplitting. It will be interesting to see what the final postmortem is.
posted by GuyZero at 5:52 PM on April 22, 2011

Yes, "enterprise" is a term totally devoid of meaning. As far as I can tell it's used for two purposes: to belittle people who don't work "at the enterprise scale" and to charge obscene amounts of money for things that are typically available in OSS projects but are difficult to configure (c.f. database clusters). Not that the "enterprise" solution is easy to configure, but hey, you can pay someone, which is awesome, right?
posted by feloniousmonk at 5:53 PM on April 22, 2011

I think you're conflating "enterprise" with "mission critical" there.
posted by feloniousmonk at 5:54 PM on April 22, 2011 [2 favorites]

Enterprise means developers used the factory pattern.
posted by Ad hominem at 5:58 PM on April 22, 2011

To be fair, I think you have to have at least one AbstractFactoryFactory and a corresponding ConcreteFactoryFactoryImplementation before you're truly "enterprise ready."
posted by feloniousmonk at 5:59 PM on April 22, 2011 [3 favorites]

5000 BC: Wheat crop fails, raising serious questions about the viability of agriculture.
posted by A dead Quaker at 6:01 PM on April 22, 2011 [22 favorites]

As a friend of mine observed this morning, sooner or later you're still talking about physical servers in racks in buildings.

Mebbe someday (ie "later") there will be a return to those massive 5' hard disks of the past, old-school Gigantism in the server infrastructure that is hosting all the virtual instancing stuff.

Not that COTS JBOD will not always be a valid approach.
posted by mokuba at 6:01 PM on April 22, 2011

Not that COTS JBOD will not always be a valid approach.

Pretty much every server architecture boils down to this these days. Is there really a non-COTS JBOD system out there somewhere?
posted by GuyZero at 6:06 PM on April 22, 2011

I know this is out of their hands for the most part, but this is kind of a shame for Reddit. The site's been pretty flakey for a while due to growth, a lack of money for upgrades, and probably not enough scalability in code, but over the last few weeks, they hired more programmers and were working on optimizing the site (oddly, Amazon failed Reddit in March slightly after the new employees came in, which caused users to joke about the new people breaking Reddit). It looked like it was working, as I got much less of the "Reddit is under a heavy load" type errors (I think they're gussified 500 errors), which used to show up 25-50% of the time I'd hit a link. Then this happens.

Of course, since Netflix stayed up, through both redundancy and a contingency plan (AFAICT), it sounds like Reddit could have prepared better. But between all the other problems with Reddit that cause it to go down and have a bunch of errors, I can understand them not preparing for a problem with a service that is usually rock-solid.
posted by mccarty.tim at 6:08 PM on April 22, 2011

The cloud is composed primarily of hype and misinformation, intended to dupe small and medium businesses into wasting money on an impossible dream.

About a year ago, I had the misfortune to take a job with a graphics company that had some really amateurish IT systems spread across 5 sites, I was hired to clean them up. The boss was sold on this cloud idea so much, that he'd bought $75,000 (annual maintenance fee) worth of SalesForce CRM, and then spent a year failing to get it operational, and then spent two more years paying consultants $100k/year and still it wasn't running. Hell, for the $425k they wasted, they could have hired a whole staff of human accountants, or even some people who knew how to operate SalesForce.

The owner was obviously sold on the cloud idea, he wanted to do a total backup of all their storage systems to the cloud. I checked out their data requirements, they passed around 1TB USB drives like they were floppies. They only had 1500/864 ADSL, so it would be impossible to get data up to the cloud fast enough. IIRC I estimated the first total backup would take about 14 weeks, and each week we generated enough data to take at least 10 days to upload.
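
A back-of-the-envelope check of that estimate, as a sketch: it assumes "1500/864" means 1500 kbit/s down and 864 kbit/s up, that the uplink runs flat out with no protocol overhead, and that the data set is about 1 TB (10^12 bytes). The actual data size is not stated, so that last figure is only a guess.

```python
def upload_days(data_bytes: float, uplink_kbps: float) -> float:
    """Days needed to push data_bytes through an uplink of uplink_kbps kilobits/s."""
    bytes_per_second = uplink_kbps * 1000 / 8
    return data_bytes / bytes_per_second / 86_400  # 86,400 seconds per day


TB = 10**12  # decimal terabyte, as drive makers count it

weeks_for_one_tb = upload_days(1 * TB, 864) / 7
print(f"1 TB over an 864 kbit/s uplink: {weeks_for_one_tb:.1f} weeks")

# How much data fits through the same uplink in 10 days of continuous upload?
gb_in_ten_days = 864 * 1000 / 8 * 86_400 * 10 / 10**9
print(f"10 days of upload carries about {gb_in_ten_days:.0f} GB")
```

Under those assumptions a single terabyte takes a little over 15 weeks, which is in the same range as the 14-week figure, and a week that produces more than roughly 65 GB of new data can never be uploaded within that week.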

I spent a lot of my time on this job trying to inform the owners that their schemes were impractical. This didn't make me very popular, and I didn't last very long in the job.
posted by charlie don't surf at 6:22 PM on April 22, 2011 [9 favorites]

The boss was sold on this cloud idea so much, that he'd bought $75,000 (annual maintenance fee) worth of SalesForce CRM, and then spent a year failing to get it operational, and then spent two more years paying consultants $100k/year and still it wasn't running.

So obviously I wasn't there, but I have seen projects like this. I don't think it had much to do with the cloud - there are companies that can't get their data clean in ACT!.

The owner was obviously sold on the cloud idea, he wanted to do a total backup of all their storage systems to the cloud.

yeah, this isn't practical except for low to medium usage home users.
posted by GuyZero at 6:26 PM on April 22, 2011

I definitely feel for reddit. They've been getting it from all angles. First, a lot of the first hand knowledge about the whys and wherefores of their system is basically gone now that most of the original crew has left, second, they were the red headed step children of Conde Nast until recently, and now they're suffering because the combination of the two has left them totally exposed to AWS outages.

To top it off, I don't think their application was really designed to take advantage of all AWS has to offer anyway. If it had been, they wouldn't have gone down.
posted by feloniousmonk at 6:31 PM on April 22, 2011

One interesting social repercussion of this is playing out right now on reddit. Allegedly the site's premium users (called Reddit Gold) were allowed full access to the site at some point today while the free users were kept in read-only mode for a while. Once full access was restored to all, free users started posting screeds calling for a mass downvoting of the Gold members. Now, of course, reddit karma score is worthless, but it is symbolic on the site itself, and just struck me as another case of reddit's through-the-looking-glass attitude of bad is good, like we saw in cortex's sockpuppet symposium, where the non-paying masses want to exact revenge on the people who actually help pay for the site.

But then it might just be a big joke I'm not getting. The days go by and I find myself growing further and further afield from the 'ol reddit base.
posted by BeerFilter at 6:31 PM on April 22, 2011 [4 favorites]

I definitely feel for reddit. They've been getting it from all angles.

There is no such thing as reddit. What you see on the reddit website is a search engine that sorts through stored texts and links and presents them in an HTML page. The storage and servers are outsourced to Amazon, the search engine is outsourced to IndexTank. That's all reddit is, outsourced crap stuck together with baling wire and chewing gum. You could fire the entire staff of reddit's so-called "programmers" and nothing would change, except you wouldn't get a new reddit alien logo every day.

Let me reiterate that: reddit is a division of Conde Nast with only one actual product: a daily 120x40 .gif file. And those don't even ship every day.
posted by charlie don't surf at 6:57 PM on April 22, 2011 [1 favorite]

Have you ever used reddit? Whether or not its features justify the complexity of the software is a reasonable question, but it is not a trivial application (it's OSS, go ahead and take a look). Highly dynamic threaded conversations are a difficult problem to solve well, particularly at their scale.
posted by feloniousmonk at 7:00 PM on April 22, 2011 [10 favorites]

You still need to ask for more EC2 capacity. – GuyZero

What? No, you don't: if you have configured auto-scaling, then when your instances get loaded over whatever limit you set, Amazon will start more of them; when load is low it will shut down the instances no longer needed.
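
The rule described above can be sketched as a simple threshold policy. This is not the AWS Auto Scaling API: the function name, thresholds, and limits are all hypothetical, chosen only to show the shape of the decision.

```python
def desired_instances(current: int, avg_load: float,
                      high: float = 0.75, low: float = 0.25,
                      min_n: int = 1, max_n: int = 10) -> int:
    """Return how many instances should run after one evaluation period.

    avg_load is average utilisation across the group, from 0.0 to 1.0.
    Above `high` one instance is added; below `low` one is removed.
    """
    if avg_load > high:
        return min(current + 1, max_n)   # scale out, but respect the cap
    if avg_load < low:
        return max(current - 1, min_n)   # scale in, but keep a floor
    return current                       # inside the band: leave it alone


print(desired_instances(current=2, avg_load=0.90))  # busy: one more instance
print(desired_instances(current=2, avg_load=0.10))  # idle: one fewer
```

The gap between the two thresholds keeps the group from flapping between sizes when load hovers near a single limit.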

With this additional information, can you agree that EC2 is the cloud?
posted by nicwolff at 7:16 PM on April 22, 2011

One that failure redundancy so that you can haz put important applications or data that are mission critical on it.

I are not kin fizx that for you.
posted by pressF1 at 7:18 PM on April 22, 2011

The advantage of EC2 is you can shift all the work and pain of managing a reliable datacenter off to Amazon. The disadvantage is if Amazon fails to provide a reliable datacenter, you're screwed.

I think it's fascinating (and prescient) that Netflix had redundant hosting within Amazon. Frankly I suspect most users of EC2 were hoping it wasn't necessary to do that much work.
posted by Nelson at 7:23 PM on April 22, 2011 [2 favorites]

There is no such thing as reddit. What you see on the reddit website is a search engine that sorts through stored texts and links and presents them in an HTML page. The storage and servers are outsourced to Amazon, the search engine is outsourced to IndexTank. That's all reddit is, outsourced crap stuck together with baling wire and chewing gum. You could fire the entire staff of reddit's so-called "programmers" and nothing would change, except you wouldn't get a new reddit alien logo every day.

Nonsense. That's how essentially every website works nowadays: you write some software that combines various components, wrap it all in your own templates, and there's your site. If you'd like, you can make some of those components be separate services ("web services" to throw out a previous trendy buzzword) or you can make some of the components be client-side JavaScript and have the server send and receive various bits of data to the JS ("AJAX" to throw out another trendy buzzword of yesteryear), but the point is still basically the same.

Using third-party components and outsourcing hosting is a good thing when it comes to making a website because other people are a lot better at doing most things than you are. If every website wrote everything from scratch, nothing would ever get released and the web would be a buggy mess. Reddit uses a third-party search engine because writing search engines is a really hard thing to do well and it makes more sense to have someone do that once for lots of sites than for each site to write their own. Reddit outsources hosting to Amazon because, even with this outage, Amazon is a heck of a lot better at running a data center than a couple of programmers at reddit. I use jQuery in a lot of my projects because John Resig and friends have spent a ton of time working to build a reliable cross-platform library, and I'd be nuts not to take advantage of all that effort because I had some aversion to using someone else's code. Web developers should focus on what makes their product special, not wasting their time producing crappy implementations of solved problems.

Basically everything you said above could be applied equally well to Metafilter, and we don't even get a new logo every day (not to take away from the superb work the mods do, nor pb's great efforts to maintain and improve the site). What exactly do you expect a website to be? Rooms of custom built hardware?
posted by zachlipton at 7:30 PM on April 22, 2011 [3 favorites]

Personally, I would hope that there is at least one aspect of metafilter that involves vacuum tubes.
posted by feloniousmonk at 7:34 PM on April 22, 2011 [2 favorites]

Personally, I would hope that there is at least one aspect of metafilter that involves vacuum tubes.

Cortex's amps?
posted by kenko at 7:45 PM on April 22, 2011 [3 favorites]

There is no such thing as reddit. What you see on the reddit website is a search engine that sorts through stored texts and links and presents them in an HTML page. [...] You could fire the entire staff of reddit's so-called "programmers" and nothing would change, except you wouldn't get a new reddit alien logo every day.

At least 40,000 lines of code would beg to differ.
posted by teraflop at 7:56 PM on April 22, 2011

Twilio has posted a few thoughts on developing for AWS. Reading what Twilio has to say, seems like jedberg is on the right track moving storage out of EBS.
posted by Ad hominem at 7:59 PM on April 22, 2011

At least 40,000 lines of code would beg to differ.

Reddit was written twice. First in Lisp, then re-written in Python.
posted by Ad hominem at 8:04 PM on April 22, 2011

The cloud is composed primarily of hype and misinformation, intended to dupe small and medium businesses into wasting money on an impossible dream.

What the fuck? The cloud is way fucking cheaper than buying physical servers. I've used EC2 and S3 for lots of different things and it basically costs almost nothing. I've been running a 'micro' instance 24/7 and it costs maybe $14 a month. If I wanted to, I could turn it off when I'm not using it and drop my costs to like less than a dollar a month. I could also pre-purchase the instance for $50 or so for a whole year.

The idea that "Cloud Computing" is expensive is just insane. It's so cheap it's barely noticeable.

checked out their data requirements, they passed around 1TB USB drives like they were floppies. They only had 1500/864 ADSL, so it would be impossible to get data up to the cloud fast enough.

Amazon lets you mail in hard drives for them to load in or unload data on too. It's called AWS Import/Export and it costs $80 per drive plus $2.50 per hour of load time.
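
Using only the prices quoted above ($80 per drive plus $2.50 per hour of load time), the cost of a shipment is simple to estimate. The 10-hour load time in this sketch is an invented placeholder, since actual load time depends on the drive and the data.

```python
def import_export_cost(drives: int, load_hours: float,
                       per_drive: float = 80.0, per_hour: float = 2.50) -> float:
    """Estimated charge for mailing in drives, using the figures quoted above."""
    return drives * per_drive + load_hours * per_hour


# One drive that takes a hypothetical 10 hours to load.
print(f"${import_export_cost(drives=1, load_hours=10):.2f}")
```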

The reason people like cloud computing is because it's cheap. At least the PaaS style cloud computing rather than this nebulous web service B.S. Anyway if you run your own hardware you still need to hire consultants.

Once full access was restored to all, free users started posting screeds calling for a mass downvoting of the Gold members. Now, of course, reddit karma score is worthless

This was on the top of politics. I lol'd. I don't think anyone is serious...
posted by delmoi at 8:09 PM on April 22, 2011 [3 favorites]

Reddit was written twice. First in Lisp, then re-written in Python.

Heh, that's what the influence of Paul Graham gets you...
posted by delmoi at 8:11 PM on April 22, 2011 [2 favorites]

EC2 is really only cheap when you are talking about the need to have excess capacity for load spikes. If you have a typical HA environment, it's going to cost about the same, if not more, depending on how you account for the expenses, as if you hosted it on your own machines in a datacenter. I think selling services like EC2 as a cost-saving measure is problematic, particularly when things like this happen and we find out that essentially the only way you stay up during this kind of outage is if your application crosses EC2 region boundaries.

Of course, for one off stuff, it's definitely cheaper. If you're hosting something mission critical, it isn't.
posted by feloniousmonk at 8:16 PM on April 22, 2011 [1 favorite]

Wonder if the OP could give a citation on the "about 70 other services" line. Most every web service that I use or interact with on a daily basis has been impacted.

Now, I might just be highly tuned to AWS services, but I have a feeling we're talking about hundreds or even thousands of startups who have been affected.
posted by bpm140 at 8:22 PM on April 22, 2011

I am typing this on an EC2 instance with 185 days uptime. It's in US East too.

EC2 is already better than almost anything you could do yourself (short of the kind of bulletproof equipment banks and telcos use, which starts at ten times the price). They have economies of scale that no one else in the industry does. I'm confident that this will only spur them to improve.
posted by miyabo at 8:24 PM on April 22, 2011

Once full access was restored to all, free users started posting screeds calling for a mass downvoting of the Gold members. Now, of course, reddit karma score is worthless
This was on the top of politics. I lol'd. I don't think anyone is serious...

Most are joking, but there seem to be a few who are pissed that the admins lied when they said Gold members wouldn't get preferential treatment, and then lied again (or at least dissembled) when they said the users who were re-activated were random (they weren't -- they were Gold + 13% random non-Gold).
posted by dirigibleman at 8:28 PM on April 22, 2011

What will it take for everyone to admit that nothing is more foolish than relying so heavily on technology and, essentially, 'putting all your eggs in one basket?' Every individual and business who is looking to put everything online and make their entire life/business all digital is just asking for trouble. My theory is these crises will only become more commonplace, and they will become less of an inconvenience and much more of a real serious issue because the world won't be able to function in the downtime.. and everything that we've attempted to archive and save in our lives will be wiped out and we'll become nothing. Technology will be our downfall. Sleep tight.
posted by Mael Oui at 8:30 PM on April 22, 2011

'putting all your eggs in one basket?' -- Actually, if implemented correctly, this sort of technology is supposed to be exactly the opposite of that. It's supposed to use distributed networks so that if any one part of it fails, the other parts are supposed to pick up the pace and let you get by.

And if they do become more commonplace, don't you imagine that people will adapt? If someone got burned yesterday for relying on something that wasn't supposed to fail, then they've probably learned a most valuable lesson.

And I don't know about you, but until they archive my brain, no computer failure will turn me into nothing.
posted by crunchland at 8:36 PM on April 22, 2011 [1 favorite]

What will it take for everyone to admit that nothing is more foolish than relying so heavily on technology and, essentially, 'putting all your eggs in one basket?'

What I would really like to see is open source, standardized ways to request services. EC2 has a 'spot' market but that's obviously totally controlled by Amazon anyway. There are other cloud services out there. Why not set things up so that you can transfer disk images between them, or even stage things on virtual servers on your own machine or whatever.

That way you could make the reliability/price trade off for yourself.

Anyway, I personally wouldn't be too concerned about this if I was a startup. Everyone has downtime, it's not that big of a deal. Reddit just isn't "Web scale." They are down or have problems all the time, not just when this stuff was going on. Comments are always failing to post, or more annoying, getting posted while falsely claiming to fail (resulting in multiple postings).
posted by delmoi at 8:51 PM on April 22, 2011 [1 favorite]

Remember: Clouds are made of vapour.

Indeed, but are clouds electric?

Cloud! I disconnect from you.
posted by Sys Rq at 9:26 PM on April 22, 2011 [1 favorite]

What will it take for everyone to admit that nothing is more foolish than relying so heavily on technology and, essentially, 'putting all your eggs in one basket?'

Yeah, those idiots at Reddit never should have stopped publishing their dead-tree version.
posted by grouse at 9:35 PM on April 22, 2011

I miss Reddit's "carve on a rock and whip it at you" service.
posted by dirigibleman at 10:12 PM on April 22, 2011 [2 favorites]

EC2 is really only cheap when you are talking about the need to have excess capacity for load spikes.

Well, yeah, but let's not downplay that as an advantage. The ability to easily scale up your app/site/service to meet demand is a very neat trick. Since most people buying hardware buy a certain amount of excess capacity in order to deal with demand swings, not to mention just anticipated growth before they want to have to add more, that's where you can legitimately get a cost savings.

But it's basically timesharing writ large, and there are certainly tradeoffs. I'd probably raise my eyebrows a bit if I found out my bank had moved its core processing out to a virtual cluster somewhere on somebody else's hardware, because that just seems like something where rapid scalability ought not be the dominant concern. But if they put their external-facing website there, it might be more reasonable (cf. the MasterCard DDOS).
posted by Kadin2048 at 10:21 PM on April 22, 2011

EC2 is really only cheap when you are talking about the need to have excess capacity for load spikes.

EC2 is also cheap when your annual hosting is in the $30-80k range, and therefore it's too big to run half-heartedly by programmers doubling as sysadmins, but not big enough to allow you to hire a full time sysadmin and ALSO pay for the hardware to live somewhere.
posted by chimaera at 10:27 PM on April 22, 2011

EC2 is really only cheap when you are talking about the need to have excess capacity for load spikes.

Which is the point of the entire exercise. EC2 and related services let you scale your entire system on demand, something nearly impossible to do if you're running your own infrastructure. Also, it's one thing to compare hardware costs, but EC2 saves you from having to deal with things like "darnnit, now I need to spend a gazillion dollars for a network engineering consultant to reconfigure my Cisco router" and "I'll buy this line from ISP A with this much bandwidth but burstable to that much bandwidth." All that cost for sysadmins and network engineers and datacenter techs adds up.

True story time. At a certain large technology company I once worked for, the company would lease entire racks of servers for their data centers. I think they were three year lease terms. This worked great, except that when the leases came due, everyone had to migrate their systems over to the new servers (which would have modern hardware and an up to date Linux image to facilitate security patching). If a single server wasn't migrated in time, the company had to pay huge costs to keep the entire rack in place until it was no longer in use. My team was struggling with a fairly fragile NFS setup at the time and we couldn't endure the endless series of cascading failures that would surely result from migrating to a new server, so with the support of senior management, the company had to eat the penalties to keep the rack around for another six months or so. Certainly this whole system was poorly designed from the beginning, but these kinds of things happen all the time with larger datacenters. With something like EC2, this would have been a non-issue.
posted by zachlipton at 10:40 PM on April 22, 2011

I'm not saying that it's a bad thing at all, just pointing out that selling it as a cost saving measure is maybe not the best way to approach it.
posted by feloniousmonk at 11:20 PM on April 22, 2011

Thank goodness for MetaFilter running its own system and not relying on a third party for its hosting and online presence.

I can't tell if you're joking, but if AWS had been around a few years ago we would've been begging mathowie to use it.

Sure, the constant JRun errors and the fact that the MeFi server was a wonky beige box that lived in a cupboard were endearing. But I didn't think so when desperately hitting refresh like a crackhead combing the carpet for imaginary rocks, and would've killed for a MetaFilter that only died as often as EC2.
posted by jack_mo at 2:10 AM on April 23, 2011 [4 favorites]

Reddit was written twice. First in Lisp, then re-written in Python.

Once as tragedy, a second time as farce.
posted by atrazine at 5:15 AM on April 23, 2011 [10 favorites]

Remember: Clouds are made of vapour.

Indeed, but are clouds electric?

Cloud! I disconnect from you.

http://gigaom.com/cloud/aws-launched-read-replicas-for-its-database-service/
posted by Devonian at 5:20 AM on April 23, 2011

08:07:22 up 774 days, 23:14, 2 users, load average: 0.17, 0.10, 0.04

Just had to share my backup DNS' statistics, on a 10-year-old Compaq in a rack a few miles from where I sit. Eat that, Cloud!

I haven't heard an explanation for the Amazon failure (sorry if it was in the links), but the first thing I thought wasn't to blame hardware, or Amazon missing a critical software flaw, or a DDOS of any kind, but rather that some guys in hard hats are looking down into a deep hole at a partially-severed cable just across the street from Amazon's data center. The mightiest networks are brought down this way.

Like when my colo place's power got cut. Sure, I trusted the diesel generator and the UPS' batteries to keep things going so I didn't provide my own. "We don't know why the power didn't switch over to UPS or start the generator" was all the answer I got :( As you can tell, that was over two years ago.
posted by AzraelBrown at 6:14 AM on April 23, 2011

AzraelBrown, still no answer on what happened. Gah. I read all the comments here and it's a debate about what cloud services are and whether they make sense to buy or not, not a 'what happened + analysis' discussion.

I'm very interested in knowing what happened to their Northern VA facility. Did the guys building the metro rail extension to Dulles cut another important fiber cable or something? Fire in the data center? A power event? Why such a massive failure?
posted by k5.user at 6:19 AM on April 23, 2011

As a non-techy ignoramus, I find this whole thread reads so much better when I substitute SkyNet for "cloud."

It all makes sense now.
posted by misha at 6:48 AM on April 23, 2011 [1 favorite]

I just moved our site to Heroku, which is EC2 backed, a couple of days ago. We've been down for nearly 48 hours at this point. Moving back to the old infrastructure would've taken around 8 hours or so (plus more to move back), so I figured - let them sort it out, it won't be that long.

Bad call.
posted by bashos_frog at 6:55 AM on April 23, 2011 [1 favorite]

I can't play anything on my $300 paperweight or even watch Netflix at the moment.

I too had the message that the PS Network was down when signing into Netflix but I was still able to use Netflix on the PS3.
posted by juiceCake at 7:01 AM on April 23, 2011

The secret to the cloud pun is to try to think of phrases with the word "cloud" in them.
posted by It's Raining Florence Henderson

What kind of clouds rain forgotten TV actresses?
posted by fourcheesemac at 8:07 AM on April 23, 2011

We have a roll-your-own mail and web server in our closet, and this kind of thing is part of why I doubt my husband will ever give it up. We get outages when our power and internet links go down (or when we move the server) but at least we have some idea of why.

I agree with whoever said upthread that planning for graceful failure is the most important thing, though. If my PBEM game goes down, that's one thing. Mission-critical business applications are entirely different.
posted by immlass at 8:24 AM on April 23, 2011

I am currently asking for my job title to be changed to The Chaos Monkey.
posted by fullerine at 8:28 AM on April 23, 2011 [2 favorites]

cardiac patient monitoring on ec2

aaand it's gun-in-mouth time for these clowns
posted by crayz at 1:00 PM on April 23, 2011 [6 favorites]

yeah, this [large-scale data migration] isn't practical except for low to medium usage home users.

This isn't true, although it's certainly easier to migrate small amounts of data. Delmoi mentioned migrating data to Amazon via hard drives. I've worked with mail migrations for large organizations, and similar options exist there - for example, you can ship a big box of backup tapes to Postini for import into their mail archiving solution.
posted by me & my monkey at 1:17 PM on April 23, 2011

aaand it's gun-in-mouth time for these clowns

Like that carnival game with the water and the balloons?
posted by Sys Rq at 1:17 PM on April 23, 2011 [3 favorites]

cardiac patient monitoring on ec2

Yeah, probably not a smart idea.
posted by Blazecock Pileon at 7:25 PM on April 23, 2011

Why is this thread quiet? I'm with k5.user. I come to MetaFilter looking for answers and I know somebody out there has them. Forget this cloud debate, unless it's really the villain here. But Sony's PlayStation Network being down (though there are workarounds) has folks asking, "Is this the work of Anonymous?" ("Anonymous?") Seriously, is this just some overloaded cloud thingie that can be remedied by more whatever, or does this event reveal some critical bug in the system that will doom whatever tech these various affected outfits share, or is it all a sinister plot?
posted by CCBC at 2:05 AM on April 25, 2011

CCBC, regarding PSN, the reason is almost certainly: Sony developed an ostensibly redundant but never actually battle-tested system, and when there was a fault, it ate itself spectacularly in the form of desynchronized fine-grained data. In this scenario, even a warm restart won't put Humpty together again. And if they've never tested a global checkpoint restore, they may not be able to recover the latest data even by applying journals.

I guess in non-technical terms, that would be "they built a shitty building and it fell down in the earthquake."
posted by seanmpuckett at 5:16 AM on April 25, 2011 [1 favorite]

Okay, seanmpuckett, so I am not to believe PSN when they say that this is all the result of sabotage? (They say "external intrusion" and claim to have shut down the system themselves.) Further, is there no connection between the Sony and Amazon breakdowns? I get that Amazon's N. Virginia server system went down somehow but that it has other operating servers. What I don't know is whether Sony was part of that cloud or some other one or this is just a coincidence.
posted by CCBC at 12:09 PM on April 25, 2011

Why is this thread quiet?

Because everyone who wanted to say something already said it?
posted by grouse at 12:37 PM on April 25, 2011

This guy claims to know what happened with PSN. New software was released that allowed folks to pirate some PSN content, so PSN shut down to eliminate the problem. Meanwhile, Amazon seems to be fixed and promises a full report. This guy says he knows what happened: a "networking event" that "triggered a large amount of re-mirroring" of storage volume, creating a capacity shortage.

I am still baffled.
posted by CCBC at 1:38 PM on April 25, 2011 [1 favorite]

I'm kind of surprised Sony would take down the PSN servers entirely if the trick behind these hacked consoles was that they found a way onto the developers' network. You'd think they could just shut down the developers' network until they resolve the issue, instead of upsetting all the PS3 owners. Yes, it would be annoying to game-dev teams, but that's still better than kicking everyone out.
posted by mccarty.tim at 3:20 PM on April 25, 2011

Okay, by now everyone knows a little more about PSN and there is a thread about it. So far nobody knows about Amazon. I guess this thread is dead.
posted by CCBC at 2:33 AM on April 27, 2011

Okay, by now everyone knows a little more about PSN and there is a thread about it. So far nobody knows about Amazon. I guess this thread is dead.

We could make stuff up, if you'd feel better?
posted by verb at 6:55 AM on April 27, 2011 [1 favorite]

A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.

Amazon Cloud Crash Destroyed Many Customers' Data.
posted by Horace Rumpole at 1:19 PM on April 28, 2011

What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility...

If you have no need for this snapshot, please delete it to avoid incurring storage charges.

Insult to injury!
posted by grouse at 1:22 PM on April 28, 2011

Well the hard drive in my desktop at home pooped out today and has bad sectors in some Windows boot files... ugh. So I guess Amazon and I are about even at this point.
posted by GuyZero at 1:26 PM on April 28, 2011

Here's the official post-mortem from Amazon: http://aws.amazon.com/message/65648/
posted by askmehow at 11:38 AM on April 29, 2011

Oh holy crap, askmehow, that document reveals that this outage was caused by human error.

The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future.

Well I suppose I shouldn't be too surprised. Even today, I often think of the sign I saw on the wall of the tech support office of our university's IBM/360 shop, probably back around 1972:

Computers don't make mistakes. All "computer errors" are human errors.
posted by charlie don't surf at 3:42 AM on April 30, 2011

Heh. You'd be shocked to see the root causes of errors I've seen at similar scale.

What's really amazing is that this config change only got pushed in a single DC. Imagine the shitstorm if they made the same change uniformly across DCs. (And this is the reason that most companies stagger such changes across DCs.)
posted by GuyZero at 10:54 AM on May 2, 2011
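
For anyone wondering what "staggering changes across DCs" might look like in practice, here is a minimal sketch. It is not Amazon's process or anyone's real tooling: the function name `staggered_rollout`, the `apply_change` and `is_healthy` callbacks, and the datacenter names are all hypothetical, made up for illustration. The idea is just to push a change to one datacenter at a time and stop as soon as one looks unhealthy, so a bad change stays contained.

```python
from typing import Callable, Iterable, List


def staggered_rollout(
    datacenters: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change to one datacenter at a time, halting at the first
    datacenter that fails its health check.

    Returns the datacenters that received the change, including the one
    that failed (if any). All names here are hypothetical.
    """
    changed: List[str] = []
    for dc in datacenters:
        apply_change(dc)          # push the change to this datacenter only
        changed.append(dc)
        if not is_healthy(dc):    # stop before touching the remaining ones
            break
    return changed


if __name__ == "__main__":
    # Hypothetical scenario: the change breaks the first datacenter.
    broken = {"dc-east"}
    touched = staggered_rollout(
        ["dc-east", "dc-west", "dc-eu"],
        apply_change=lambda dc: None,            # stand-in for a real push
        is_healthy=lambda dc: dc not in broken,  # stand-in for real checks
    )
    print(touched)
```

A real rollout would presumably also wait between stages and roll back the failed datacenter; this sketch only shows the containment part.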
