Petabytes on a budget
September 2, 2009 9:39 PM

Petabytes on a budget: How to build cheap cloud storage. Disclaimer: I use Backblaze myself, but I thought this article was cool enough to share without being fanboyish.
posted by device55 (70 comments total) 14 users marked this as a favorite
 
I started off thinking this was a bit on the pepsi-blue side of things, but it was an interesting and well-put-together article on how to do this kind of monster-storage thing.

What I couldn't help thinking as I read it, though, was how quaint this will look -- like those torso-sized 1MB drives of a few decades ago -- in a decade or so.

And from there, to wondering what kind of data we'll be storing when we've got petabyte-storage earrings and tie clips. And hovercars.

Probably holographic direct-brain-connection pr0n, if history is any guide.
posted by stavrosthewonderchicken at 9:53 PM on September 2, 2009 [2 favorites]


Interesting, but those drives are packed so tightly that they'll all cook and die off pretty quickly. Once they start to die from the heat, you'll tend to lose many drives close together. With no hot-swap and no front panel access to them, you'll have to take the whole thing out of the rack to replace a drive, which means plenty of downtime.

I've seen drives packed a lot less tightly than that have high drive failure rates from heat.

The picture of the racks full of those pods makes me glad I don't work there.
posted by DrumsIntheDeep at 9:55 PM on September 2, 2009 [5 favorites]


I love it when some asshole bitches that they saw the link somewhere else. Really adds to the thread.

Looking at those, I had the same thought DrumsIntheDeep did. That's a lot of running drives for the space, and at 67TB for less than $8k there has to be a catch. Even so, I can't help but salivate over the thought of all that space in that little package.
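
(For scale, the price works out like this. The pod figures are the ones from this thread; the commercial-SAN comparison price is my own rough 2009-era assumption, not anything from the article.)

```python
# Back-of-the-envelope on the "67TB for under $8k" figure above.
POD_COST_USD = 8000.0
POD_RAW_TB = 67.0
SAN_COST_PER_GB = 3.00  # assumed price for a hypothetical commercial system

pod_cost_per_gb = POD_COST_USD / (POD_RAW_TB * 1000)  # decimal TB -> GB
print(f"Pod:  ${pod_cost_per_gb:.3f}/GB raw")          # ~$0.12/GB
print(f"SAN:  ${SAN_COST_PER_GB:.2f}/GB (assumed)")
print(f"Gap:  ~{SAN_COST_PER_GB / pod_cost_per_gb:.0f}x")  # ~25x
```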
posted by eyeballkid at 10:01 PM on September 2, 2009 [2 favorites]


I've seen drives packed a lot less tightly than that have high drive failure rates from heat.

In those situations did they have as many fans? What do they have, 6 fans per pod?

I am pretty naive in such matters, but it looks like they packed a ton of airflow in there. I have no idea if that's adequate or not.
posted by device55 at 10:04 PM on September 2, 2009


In those situations did they have as many fans? What do they have, 6 fans per pod?

6 fans are good (and not uncommon), but what worries me is that the disks are 3 layers deep. Those fans are most likely pushing the air in one direction, so the frontmost row gets cool air and the two layers behind get increasingly warm air.

I haven't touched a dedicated storage system in the last 5 years, so perhaps things have changed. The stuff I worked with had one layer of drives in an enclosure with fans and the supporting hardware behind them. They were all 10k RPM drives, so perhaps they put off more heat than whatever drives these guys are using. We had to be careful about the air conditioning in rows with a lot of storage to keep everything cool enough. Besides drives dying, nearby servers would get too hot and automatically shut themselves down.

Another factor in how long they'll live is how hard they're working them. Constant activity will drastically reduce their lives.

Lastly, hard drives die more than any other computer component. Get a lot of them in one place and some (hopefully) small percentage will die on any given week, so being able to hot-swap them is important. These pods look like they require some poor schmuck to go into the datacenter (possibly miles away from their home and office), shut the box down, unplug it, take it out of the rack (it's probably heavy with that many drives in it), replace the failed drives, then put it all back together. Usually this sort of thing is done in the middle of the night to avoid impacting users. Makes it hard to have a life.

With hot swap drives you just pop a bad drive out and slide a new one in at any time.
posted by DrumsIntheDeep at 10:31 PM on September 2, 2009


I was just reading I, Robot and thinking about how storage and computing has shifted. I imagine some smug neo-me in 50 years being like, OMG canz u believ racks?
posted by klangklangston at 10:53 PM on September 2, 2009


How much porn can this hold?
posted by Saxon Kane at 11:00 PM on September 2, 2009


How much porn can this hold?

All of it.
posted by mrnutty at 11:06 PM on September 2, 2009 [14 favorites]


I think the more interesting angle than the hardware itself is the company's decision to open up the design and parts list for the hardware. A lot of companies would regard a piece of custom hardware like that as valuable IP, and protect it viciously. Backblaze apparently thinks that the potential improvements as a result of opening up the design outweigh the risk of someone taking it and replicating their service on the cheap.

Really, the only thing that's stopping you from running right out and building one of these yourself is the custom enclosure. Assuming you could find a fabricator willing to bend and punch some sheet metal for about the same price, you'd be all set.

Evidently they think that their value and uniqueness as a company lie elsewhere, and are not encapsulated in the hardware.
posted by Kadin2048 at 11:40 PM on September 2, 2009


Another factor in how long they'll live is how hard they're working them.
Given that it's a backup service, my guess is, not very hard. This is the kind of storage that would have been on a magtape jukebox not that long ago. They may even be able to spin drives down.
posted by hattifattener at 11:45 PM on September 2, 2009


They've built a couple of layers of redundancy, so I think three drives or so have to die in order to lose any data. (I'm basing that on the Hacker News discussion.) Also, they probably have copies of data on multiple machines in multiple geographic locations.

I think this is a technically sweet piece of engineering on the cheap. Want! Oh, & if very many people follow their lead on specific parts, the larger scale on those parts will probably make it even cheaper.
posted by Pronoiac at 11:45 PM on September 2, 2009


> I haven't touched a dedicated storage system in the last 5 years, so perhaps things have changed.

Consider the scale of these systems for a moment. Each one of these pods is equivalent to a single drive in a standard RAID system. That is the Backblaze goo that holds their systems together, which is why they don't care about people knowing their storage design: it's the software they run on top of it that matters.

I doubt they bother replacing a single dead drive in a pod. In fact, they probably wait until they have numerous pods with 2-3 dead drives each, then pull those out of the active storage pool for whatever clustering file-abstraction layer lives on top of the individual pods. Then, when they have a good 4 or 5 pods offline like that, they take a road trip, pull the dead drives, put good ones back in, and just reformat and add the pods back into their system as fresh nodes.

The systems themselves appear to be on rails, and the backplane the drives connect to is on the bottom of the case, so they can slide the case out of the rack, lift a drive vertically out of the chassis, and drop a new one down in its place. And again, why bother implementing hot swap on the drives and buying different controller boards (really, depending on the boards, they may be able to power off the backplanes already) when the pod itself is redundant?

Remember, these guys aren't building a general-purpose storage box for resale; they have very specific needs and have defined what they can sacrifice for acceptable performance. These pods aren't meant to act as render nodes in a production high-definition video editing shop; they are just meant to be a big collection of platters that can accept data fast enough from users uploading over the internet.

Google does the same crazy hacks; they have even gone so far as to engineer their own 10Gb Ethernet switches, since they want a specific subset of features and can totally drop all the others because it makes their systems 0.5% faster or what have you.
posted by mrzarquon at 11:48 PM on September 2, 2009 [4 favorites]


Lastly, hard drives die more than any other computer component. Get a lot of them in one place and some (hopefully) small percentage will die on any given week, so being able to hot-swap them is important. These pods look like they require some poor schmuck to go into the datacenter (possibly miles away from their home and office), shut the box down, unplug it, take it out of the rack (it's probably heavy with that many drives in it), replace the failed drives, then put it all back together. Usually this sort of thing is done in the middle of the night to avoid impacting users. Makes it hard to have a life.

Given they're SATA, if they've built a decent backplane, hot-swapping should be possible: roll the unit out on its rails, pull the drive, drop a new one in.

The thing that looked really questionable to me was the choice of JFS as a file system. On one level, sure, JFS is cool -- I worked with it on AIX -- but... it's barely even maintained in the mainline kernel any more. IBM have pretty much dropped any real maintenance on it; the kernel code has been untouched for years, and the utils are updated every year or so. I haven't heard any actual bad things about it, but still -- how heavily is it used?

(Although I guess if these guys prove it just works the answer is "more than you'd think")
posted by rodgerd at 11:52 PM on September 2, 2009 [1 favorite]


(I should clarify: it's barely touched in the Linux kernel. I assume IBM do more work on it in AIX)
posted by rodgerd at 11:57 PM on September 2, 2009


Mod note: Cut the whinging "I saw this on x" thread-shitting. Congratulations, you read other websites. Keep it to yourself unless it's deeply substantial to the thread.
posted by cortex (staff) at 12:13 AM on September 3, 2009 [8 favorites]


On a box-to-box comparison it is much cheaper than some of the packaged commercial alternatives, but that is because it is designed for massive pools of pods and not business storage. There are single points of failure throughout the box: if it loses a power supply, a controller, anything on the CPU, a fan, the boot drive, or a port multiplier, the box is tanked. Add to that the cooling design problems, and they are pretty fast and loose individually. Sure they can give away the plans, because if you have one and something happens, you are dead in the water.
posted by arruns at 12:32 AM on September 3, 2009


Keeping that much gear cool isn't really that great a challenge from an environmental standpoint; what matters is how many watts a rack throws off. As long as you can exhaust the heat out the back and aren't allowing it to rise up through stacked chassis, you're probably just fine.

With a solid cold aisle and their fans (those push 125CFM each), my guess is they don't really have that big of a heat problem. Ever stack two fully loaded IBM P570s on top of each other -- 60 amps @ 208V? Or four BladeCenters in a rack -- try 200 amps @ 208V? Now that's some serious heat.

They are dealing with 4RU per pod and 42 RUs in a standard rack, so figure 10 pods per rack.

Their average draw is about 5A per pod, so you're looking at an average run load of 50 amps, with a peak probably double that if you're bringing up a whole rack in stages.

That's... well, that's nothing really extraordinary; your average 4RU HP DL580 pulls down more than that. A well-engineered datacenter can cool upwards of 10kW per footprint. You can make arrangements to do 20kW per footprint, but no one really does this unless you have some stupid ridiculous needs, because it's prohibitive in terms of cost.
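
(Sanity-checking the rack math, using only the figures above -- 4RU pods, 42RU racks, ~5A average per pod at 208V:)

```python
RACK_RU = 42
POD_RU = 4
POD_AMPS_AVG = 5.0
VOLTS = 208

pods_per_rack = RACK_RU // POD_RU             # 10 pods
rack_amps = pods_per_rack * POD_AMPS_AVG      # 50 A average run load
rack_kw = rack_amps * VOLTS / 1000            # ~10.4 kW

print(f"{pods_per_rack} pods/rack, {rack_amps:.0f}A @ {VOLTS}V = {rack_kw:.1f} kW")
# ~10.4 kW -- right at the ~10 kW/footprint envelope of a
# well-engineered datacenter, per the figures above.
```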

So will the drives die? Sure. Will they likely have problems with heat? Maybe -- probably the top of the stack will get warmer, but I'm guessing the foam they put in actually helps quite a bit in keeping the heat isolated to a specific chassis.

Nevertheless, nice job of engineering. I wouldn't want to see it hooked up to anything that needed to push out stupendous IOPS, but it's not designed for that.
posted by iamabot at 12:34 AM on September 3, 2009 [3 favorites]


I'm kind of surprised there doesn't appear to be any thought given to N+1 redundancy on the power supplies & fans. I'd think even with the software allowing the pods to fail, a little extra redundancy at the pod level would make maintenance less costly over the life of the pod.
posted by BrotherCaine at 1:26 AM on September 3, 2009


As to why there is no redundancy in the pods, I think it is because at heart these are software guys. They give away the design because they see their business value in the software layer. If you do all the redundancy through software, N+1 is just a hassle. It means you have paid for all the +1 hardware that is only used in a failure mode.
Why not just manage it all through software, throw hundreds of identical drive pods at the problem and let it run?
posted by bystander at 1:57 AM on September 3, 2009


OK, but I want to see them add it to a weatherized box connected to a rooftop kit greenhouse warmed by the excess heat, thus reducing the carbon footprint of the scheme, or at least making it easy to grow tasty tomatoes on the roof at work.
posted by pracowity at 2:05 AM on September 3, 2009 [5 favorites]


As someone who has approx. 3TB of data & a mirrored 3TB of backups, this looks like something that could interest me, esp. as for $50 a year I'd gain an extra 3TB of storage. I guess the main problem is the upload speed of my internet connection (470k this morning = 290+ days of uploading).
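
(The math swings a lot depending on whether that "470k" is kilobits or kilobytes. A quick sketch -- my arithmetic, assuming continuous uploading with no protocol overhead:)

```python
DATA_TB = 3.0
bytes_total = DATA_TB * 1000**4  # 3e12 bytes, decimal units

for label, rate_bytes_per_s in [("470 kbit/s", 470_000 / 8),
                                ("470 kB/s", 470_000.0)]:
    days = bytes_total / rate_bytes_per_s / 86_400
    print(f"{label}: ~{days:.0f} days")
# ~591 days on the kilobit reading, ~74 on the kilobyte one;
# real-world overhead pushes either figure higher.
```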
posted by i_cola at 2:20 AM on September 3, 2009


Heat isn't as much of an issue for drives as you'd expect. I learned this in early testing of drives wrapped in neoprene and steel at TiVo (the "full metal jacket" noise reduction option, which we never shipped) which ran pretty happily at boiling temps, and later reinforced that observation at Google.

What IS an issue is that with large enough drives, even in a RAID 6, the first drive going bad causes a flurry of activity as the array rebuilds onto the replacement; that flurry lasts longer on larger drives and can kill additional drives that were already on the edge. So needing only 3 drives out of 15 to die to lose your data seems a bit risky to me, although if they stash your data in multiple physical locations, it's fine.
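
(A toy model of that rebuild-window risk, since the "3 drives out of 15" point is easy to put numbers on. Every figure here is an illustrative assumption, not Backblaze's.)

```python
# After the first failure in a 15-drive RAID-6 set, data is lost if
# two of the remaining 14 drives die before the rebuild completes.
from math import comb

N = 14          # surviving drives during the rebuild
P_FAIL = 0.01   # assumed per-drive chance of dying mid-rebuild,
                # inflated to reflect the rebuild "flurry" above

p_ok = sum(comb(N, k) * P_FAIL**k * (1 - P_FAIL)**(N - k) for k in (0, 1))
print(f"P(stripe loss per rebuild) ~ {1 - p_ok:.2%}")  # ~0.85%
# Small per event, but it compounds across many pods and many
# rebuilds -- hence copies in multiple physical locations.
```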

I'd go with ZFS over JFS, I think, not that that would help the failed-drive reconstruction problem. These guys really seem to have gotten a bunch of stuff right*, though: Commodity drives, commodity power supplies, cheapest possible everything else and use redundancy to get reliability.
----
*at least if you think Google's approach is right
posted by Hello Dad, I'm in Jail at 2:56 AM on September 3, 2009 [4 favorites]


> > How much porn can this hold?
>
> All of it.

Are you asking for a challeeeeeeenge?
posted by Mr. Bad Example at 3:07 AM on September 3, 2009


It's an interesting idea. I'm not sure how well it will be adopted (or even examined) by current EMC et al. customers. Their solutions are ridiculously expensive (god, are they ever...) but there's something to be said for both the support and the name recognition.

I can't count the number of times my VP has been interested in a product advertised in airports or on the AM news stations.
posted by purephase at 4:09 AM on September 3, 2009


As someone who has approx. 3TB of data

Honest, non-snarky question: What kind of stuff are people storing that they have this much data? I shoot tons of videos and photos and rip lots of CDs and my hard drive isn't even half full.
posted by jbickers at 4:16 AM on September 3, 2009


What kind of stuff are people storing that they have this much data?

See the "All of it." comment above. Porn may be pretty much all the same, but there is not, as far as I know, a compression algorithm that takes sufficient advantage of this fact.
posted by pracowity at 5:09 AM on September 3, 2009 [2 favorites]


Until last year I worked for a company that made similar, although much more complicated and sophisticated, systems and sold them for orders of magnitude more money than these things cost. I really can't see how that business plan can hold up against much, much cheaper commodity solutions like this.
posted by octothorpe at 5:18 AM on September 3, 2009


What kind of stuff are people storing that they have this much data?

Well, I store all my DVDs on a network drive (rapidly approaching capacity) that feeds a media center. I'm at about 2TB now. I store full DVD images, which is inefficient, but easy. A DVD image runs between 4 and 8GB, and my collection is not large by the standards of some aficionados. At some point I'm going to build myself a NAS, and I'm going to have to work out all this stuff with cases, power supplies, cooling, and software management. The Backblaze pod seems like overkill for my needs, but it's interesting anyway.
posted by Ritchie at 5:25 AM on September 3, 2009


Doesn't count unless it develops some form of consciousness.
posted by Brandon Blatcher at 5:48 AM on September 3, 2009 [1 favorite]


Ha. I just realized that my previous comment makes it sound like I have terabytes of porn. I wish. As a matter of fact, I just have lots of music files that I filch from torrents. What I meant to say was that a lot of people probably do have big collections of porn -- didn't people used to say that porn drove internet innovation? Was that ever true? Is it true now? Or was it just blather?
posted by pracowity at 5:56 AM on September 3, 2009


So it's a PC with a bunch of hard drives?

Um, that's interesting? I would assume the 1PB figure is before mirroring. I wouldn't want to store data on single hard drives in a data center like that. In terms of their actual storage, you would need to double the costs at least.
posted by delmoi at 6:02 AM on September 3, 2009


jbickers: Honest, non-snarky question: What kind of stuff are people storing that they have this much data? I shoot tons of videos and photos and rip lots of CDs and my hard drive isn't even half full.

I think it depends on how you define "tons" and "lots". Of course, it also depends how you define "hard drive" ;-)

Media production requires a great deal more space than simply storing the compressed end products - for example, I shoot still photos on a 10 MP camera in RAW format - the resulting images are ~ 10MB each, and I usually shoot about 400 pics in a session. A couple of sessions in a day, and you're up to 8-10GB. Start editing those in Photoshop, build up multiple layers, and the PSD files can easily be 50-100MB each. Inevitably you will keep multiple versions of some of these, and you'll want to archive the PSDs so you can go back and make modifications later ...

Similar situation with video - I shoot plain old DV, not even HD format, and that chews up (IIRC) 47GB/hr. Keep a few of your raw video files archived on disk, along with all the intermediate files, clips, etc, and pretty soon you're looking at a LOT of space.

Even if you're just working with the compressed end products, if you're building a media center, you might want to rip some DVDs, and if you keep them in the original MPEG-2, you're looking at 5-8GB per disk. You have 100 DVDs? Let's call that half a terabyte.

And then let's say you put your 500-CD collection online with lossless compression - FLAC gives you about 2x compression, so ... let's say 300MB per disk, so you can easily get into the hundreds of GB, just with compressed audio.
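
(Totting those up in one place: the per-item sizes are the ones above; the collection counts are assumptions for illustration.)

```python
# One day of RAW shooting, a PSD archive, some DV footage, ripped
# DVDs, and a FLAC'd CD collection, all in GB.
raw_photos  = 2 * 400 * 0.010   # 2 sessions @ 400 shots @ ~10MB
psd_archive = 200 * 0.075       # say 200 layered PSDs @ ~75MB
dv_video    = 10 * 47           # 10 hours of DV @ ~47GB/hr (as above)
dvd_rips    = 100 * 6           # 100 DVDs @ ~6GB MPEG-2
flac_audio  = 500 * 0.3         # 500 CDs @ ~300MB FLAC

total_gb = raw_photos + psd_archive + dv_video + dvd_rips + flac_audio
print(f"~{total_gb:,.0f} GB (~{total_gb/1000:.1f} TB), before any backups")
```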
posted by kcds at 6:12 AM on September 3, 2009


Another thing that can take up a ton of HD space: virtual machines. Now, obviously if these VMs are using the same OS there's a lot of redundancy, but it can be easier to just make them separate than to try to take advantage of that redundancy.
posted by delmoi at 6:16 AM on September 3, 2009


Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
posted by Rat Spatula at 6:20 AM on September 3, 2009 [2 favorites]


OMG pracowity, stop downloading, storing and looking at all that porn
posted by Rat Spatula at 6:21 AM on September 3, 2009


So it's a PC with a bunch of hard drives? Um, that's interesting?

What's interesting is that it's part of a custom engineered solution for tremendous amounts of expandable, modular storage and that the company decided to share it with the internets instead of locking it behind 15 layers of trademarks, copyright, and patents.
posted by device55 at 6:53 AM on September 3, 2009



Sun's Thumper and its bigger brother can be had for as cheap as $30k and offer 48TB of storage in a 4U tray.

And those are full-fledged servers -- the 4540 has 8 cores and (up to) 128GB RAM. The drives are packed 4x12, and heat isn't an issue -- but you do only have a minute to replace a drive before the alarms go off.

Sun does offer 48TB 4U drive arrays -- but there are similar products from other vendors.

If I might go a little pepsi blue, I've got several LeftHand Networks iSCSI disk arrays and have been very happy with them. They aren't quite as dense -- 48TB in 8U, I think -- but the management software is much better than the competitors' and very cost effective.

I'm a big fan of iSCSI SANs. That said, it's worth mentioning that OS X does not offer native iSCSI support and the third-party options are hit or miss on reliability and usefulness.
posted by Pogo_Fuzzybutt at 6:59 AM on September 3, 2009


What kind of stuff are people storing that they have this much data?

Increasingly, regular old crappy consumer video cameras (like the Flip Ultra HD, which is so cheap I got one as a giveaway recently) shoot high-def video. This eats hard drive space like there's no tomorrow. We shot about 100 gigs of it in a one-week vacation with the kids. If you're a pack rat and never get rid of anything, that adds up incredibly fast -- and storage is cheap, so why not? There's a reason the new iMacs come with a TB of hard drive.
posted by The Bellman at 7:04 AM on September 3, 2009 [1 favorite]


This is a very cool article. I always enjoy reading about technology taken to a massive scale.

It is also highly effective advertising. After reading the article, I immediately went to their homepage and started evaluating their service for my own needs and those of my consulting clients. I really should do this kind of blogging for my own company. Writing clear and coherent technical articles (with pretty pictures) about the cool technology you work with is probably the best possible advertising for a technical customer base.

Unfortunately, it didn't take me very long to realize that Backblaze won't work for me or the majority of my consulting clients. Windows only (yeah, ok, Mac client in beta). Sigh. I don't need a GUI app for my Linux desktop. Provide an open interface, and I'll do it myself. I just want unlimited online storage at a flat rate. I do realize that I could store about 30GB on Amazon S3 for the same $5/month. My music folder is bigger than that, however.
posted by Thalience at 7:05 AM on September 3, 2009


Thalience, I've been using the Mac client for over a month now with no issues -- they even updated it for Snow Leopard the day of the release. (Now, if you're supporting clients on Linux or Unix... yeah, no.)
posted by device55 at 7:09 AM on September 3, 2009


This article is great. I love how specific it is, to the point that we can all sit here and argue details about their filesystem choice, drive-swapping options, and airflow problems. I've seen a lot of systems folks build a lot of interesting components in the past few years, but it's all trade-secret, do-not-discuss. Nice to see something open so we can all learn.

I remember seeing a talk, oh, 10 years ago about Blue Gene, IBM's successor to Deep Blue, built to push experimental supercomputer capabilities. The part that blew me away in that talk was how they built these processors knowing that parts of the chip or modules in the system would fail. But they were all packed so tight there was no hope of field replacement. Instead you just lived with the failures and designed the rest of the system to work around them. "Autonomic computing" was the buzzword, if I remember right. At the time that idea seemed so radical, although now I imagine it's pretty common. A computer designed to get slower as it ages.
posted by Nelson at 7:16 AM on September 3, 2009


They've built a couple of layers of redundancy, so I think three drives or so have to die in order to lose any data

Or one SATA controller, which would eat an entire row.

See, EMC SANs are redundant end to end. I have two HBAs, each has two ports. Each connects to a pair of FC switches. Each switch connects to both storage processors. Each storage processor is connected to every disk. The drives themselves are protected by RAID and hot spares. All are hooked to redundant power supplies.

I can lose any part and not lose access to the disks. That's redundancy, and EMC gets it. Hell, on the DMX-4, the Big Blue Glowing Bar on the cabinet has redundant power.

This box protects the drives, but not the controllers. Lose one of those, and a bunch of customer data just melted. A 15-drive RAID-6 means a very painful rebuild cycle if one fails; until you rebuild, performance will suffer. If you lose two of them, then performance is really going to suck, and, of course, you are one drive away from losing the entire stripe set.

Since there apparently aren't any hot spares, you lose that performance *until* you get to the system and swap the drive; you then lose more performance during the RAID rebuild. So, as these boxes fail drives -- and they will, all storage systems do -- performance will suck until someone walks in and swaps the drives.

Those 120cfm fans aren't going to pull anywhere near 120cfm through that box -- not with that drive density. That entire box is going to cook, and while that won't really affect the drives, the electrolytic caps in the power supplies are going to hate it.

I've done storage for a long, long, time. I wouldn't trust my mp3s on this thing for any length of time, even if I had two of these boxes mirrored.
posted by eriko at 7:31 AM on September 3, 2009


Aside from the questions about the file system chosen (and yeah, ZFS or maybe Gluster or something would make more sense, but that's almost a separate post entirely) -- the one big weakness I see with these is the power supplies.

They aren't redundant or hot-swappable. Now, this isn't a problem really -- if you think in terms of pods and not in terms of disks, making the pods redundant solves that dilemma. But a dead power supply is a much bigger issue than a dead drive, and it doesn't look like an easy replacement.

That said, power supplies don't go out that often, so it's possible to overstate how much of a weakness in the design this actually is.
posted by Pogo_Fuzzybutt at 7:34 AM on September 3, 2009


What's interesting is that it's part of a custom engineered solution for tremendous amounts of expandable, modular storage and that the company decided to share it with the internets instead of locking it behind 15 layers of trademarks, copyright, and patents.

What part of it is custom? The case? It appears to be completely commodity hardware. As other people mentioned, the specialness is in the software.
posted by delmoi at 7:41 AM on September 3, 2009


> Remember though that they don't use one of these, they use hundreds or thousands - I imagine if one of these pods has a failure they just yank it out, fix it, and add it back to the pool. I think the idea is that these pods are treated like you might treat a single drive.

All of these pods sit behind an application layer which segments and encrypts user data across the system - so one of these pods failing doesn't mean that Customer #23456 lost his stuff
posted by device55 at 7:41 AM on September 3, 2009


What part of it is custom?

Sorry, I was referring more to the system as a whole - the pods are components of a custom engineered solution - but, yes, beyond the case they are built from commodity bits.
posted by device55 at 7:46 AM on September 3, 2009


> This box protects the drives, but not the controllers. Lose one of those, and a bunch of customer data just melted. A 15-drive RAID-6 means a very painful rebuild cycle if one fails; until you rebuild, performance will suffer. If you lose two of them, then performance is really going to suck, and, of course, you are one drive away from losing the entire stripe set.

I doubt they isolate their customer data to a single pod; I am sure their system simply isn't "Eriko's data is on pod 37." These guys specialize in compression, deduplication, and backup. As they've said, they have their whole software layer built on top of this; they have decided that it is cheaper to implement redundancy (since they are the ones writing the implementation anyway) and compensate for a 5% failure rate in software than to put more money into individual boxes and build a system around a 0.001% failure rate.

For other operations where you just want storage you don't have to think about, or obsess about, a Thumper or similar is exactly what you buy. Usually those systems are the single storage entity for a business, a division, or a department, so you want everything redundant in the box, as *that box can't fail* -- you usually have very specific business apps and systems that don't deal well with clustered filesystems, multipathing, or what have you, so you pay lots of money for Suns or Isilons or whoever's magic sauce that makes that NFS/iSCSI/Fibre mount never go down and never lose any data.

These guys are engineering to their own specification; they have end-to-end control at the source level of how their applications deal with data and storage, so they don't need to use someone else's magic sauce when they are in the business of writing their own, right?
posted by mrzarquon at 7:49 AM on September 3, 2009


I've been wondering for a while how much storage one would need to hold all the important data in one's life.

Obviously, data accumulation differs from person to person, but has anyone attempted a rough estimate?

When I was younger, I collected lots of new music pretty quickly. As with most older people, it's rarer for me to find new music that I like. I probably find one "sticky" album (a recording I'm likely to want to keep for years) every two or three months. If I add that to the hoard of music I collected in my youth, I can extrapolate that 250GB should be fine to hold all the music I'm ever likely to collect in my life.

I would be curious about a ballpark estimate if I added digitized movies, TV shows, and books. I am not talking about saving everything I ever watch, read, or listen to. I am talking about the amount of storage I'd need to save, say, 60 years of collected media.
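
(One hedged way to put a ballpark on that. Every figure below except the music number is an assumption, chosen for 2009-era formats, so treat it as an illustration rather than an answer.)

```python
# Lifetime "keeper" media over 60 years, in GB.
YEARS = 60

music  = 250                  # grumblebee's own lifetime estimate
movies = YEARS * 6 * 5        # assume ~6 keeper films/yr @ ~5GB
tv     = YEARS * 20 * 1.5     # assume ~20 keeper episodes/yr @ ~1.5GB
books  = YEARS * 50 * 0.002   # assume ~50 books/yr @ ~2MB each

total_tb = (music + movies + tv + books) / 1000
print(f"~{total_tb:.1f} TB over {YEARS} years")
# ~3.9 TB with these assumptions; books barely register, and
# video dominates everything else.
```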
posted by grumblebee at 7:53 AM on September 3, 2009


More why I don't like it: IOPS. That's input/output operations per second, and it is often the biggest problem we face in production computing. These are 7.2K SATA drives; you'd be lucky to get 100 R/W IOPS out of a single spindle -- call it 75 sustained -- so you're looking at 75*15, or 1125 IOPS per stripe set, or just under 3500 per box.

Ideally. The problem with a 15-drive stripe is that it becomes very difficult to get full-stripe writes and reads. Anything less than a full-stripe write means that the spindles that aren't involved basically "throw away" the IO opportunity while the others are busy.

Now, look at the workload they're building. Cloud computing, which means multiple connections with wildly varying I/O workloads. They're going to spend a lot of time just moving heads around. If they get 500 IOPS sustained per stripe, they're going to be lucky.

Now, if you had one box talking to this? That's not bad. When you have a few dozen VMs fighting over it, well, if they all want data, you're looking at, in ideal cases, 23 IOPS per VM. More likely, it'll be closer to 12 IOPS. Then you realize that you're not reading or writing nice neat stripes, that each VM wants different parts of the disks, and, and, and.
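
(Rerunning that arithmetic: the per-spindle and per-stripe figures are the ones above; the VM count of 48 is my assumption -- it's the "few dozen" that makes the 23-IOPS figure fall out.)

```python
IOPS_PER_SPINDLE = 75        # effective, per the 75*15 figure above
DRIVES_PER_STRIPE = 15
STRIPES_PER_BOX = 3
SUSTAINED_PER_STRIPE = 500   # the "lucky" real-world estimate above
VMS = 48                     # assumed: "a few dozen VMs"

stripe_ideal = IOPS_PER_SPINDLE * DRIVES_PER_STRIPE   # 1125
box_ideal = stripe_ideal * STRIPES_PER_BOX            # 3375, "just under 3500"

print(f"ideal:     {stripe_ideal / VMS:.0f} IOPS per VM")          # ~23
print(f"sustained: {SUSTAINED_PER_STRIPE / VMS:.0f} IOPS per VM")  # ~10
```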

And that, kids, is why EMC, IBM, and Sun storage systems cost so much. First, they use drives with twice the IOPS -- 150 for 15K FC drives. Secondly, they rewrite the firmware on these drives so that the drives never lie about writing to disk. Thirdly, they back them with large caches and very smart storage processors to coalesce I/O as much as possible to limit head movement and maximize full-stripe access. Fourth, they'll sell you EFD disks that give you 20,000 IOPS, and boy, are they not cheap*, and I want some badly, because I have this one 75GB datastore that I've plaided** across 18 spindles -- 2700 peak IOPS. Why? Because everything in the site touches this store constantly, and despite having 18 spindles serving up that data, it still spends roughly 60% of the time simply moving heads to get to the right blocks. Cutting my access time from milliseconds to microseconds is a huge win, and it's the number one item on next year's capital budget.

What this box would work well for is archival storage -- data you lay down once and read rarely. Point VMs at this, and ugh. The thing that annoys me is that they could seriously improve both the performance and reliability of the setup by using RAID-10 with a hot spare rather than RAID-6. It would cut the capacity in half, but 36TB for $8K is still pretty damn good.


*Tray of 15 400GB EMC EFD drives is about a half a million.

** Plaids are stripes of stripes -- it's how you get lots of spindles on a workload without paying punitive RAID penalties.
posted by eriko at 8:01 AM on September 3, 2009


On a somewhat different note, does this company mind if you store information that perhaps, ahrem, is of somewhat questionable nature?

I mean, personally, all I have stored on my hard drives are copies of the Good Book and hymns I've sung myself, but I have this friend, nay, an acquaintance, nay, a fellow I read about in the papers, who has at times stored information whose copyright has been a matter of contention.
posted by dearsina at 8:10 AM on September 3, 2009 [2 favorites]


Also, I've actually had a chance to talk to the developers of a similar high-speed data-sending technology about how they designed their system to use cheap storage.

In that instance, they weren't doing backup, but they had similar challenges storing and resending various files. They made sure the storage backend was entirely uniform: they weren't storing data as the customer had it on their desktop. All customer data was compressed, encrypted, and split into uniform data segments they found to be optimal for their algorithms (like 128K or something) on the client side before it was pushed to their servers. The servers would tag and log each segment in their super-fast database, and then just write each segment a few times to wherever they could find space in the very specific pools of storage at their disposal. Since all the data being written to the drives was uniform in size, they knew exactly how to tweak their RAIDs for optimal performance, their network backend for best throughput, etc.
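
(A minimal sketch of that client-side pipeline -- compress, split into uniform segments, tag each one for the segment database. The segment size is the ~128K figure above; the hashing and padding details are my assumptions, and a real client would also encrypt each segment before upload.)

```python
import hashlib
import zlib

SEGMENT_SIZE = 128 * 1024  # the ~128K figure from the comment

def segment_file(path):
    """Compress a file and yield uniform, tagged segments.

    Encryption is deliberately omitted here; a real client would
    encrypt each segment before it leaves the machine.
    """
    with open(path, "rb") as f:
        data = zlib.compress(f.read())
    for offset in range(0, len(data), SEGMENT_SIZE):
        # Pad the final segment so every stored chunk is uniform.
        chunk = data[offset:offset + SEGMENT_SIZE].ljust(SEGMENT_SIZE, b"\0")
        tag = hashlib.sha1(chunk).hexdigest()  # key for the segment DB
        yield tag, chunk  # server just stores (tag, chunk) wherever fits
```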

So from a storage perspective, these drives may have been perfect for them: hundreds of places to write 128K files thousands of times a second, awesome (although in their case they were using a cheap SAN solution, so with the added overhead/latency of something like HTTP, the segment size may be much bigger).

The IO controllers wouldn't even know if the data was coming from Bob or Suzy or whoever; their job was just to put redundant copies of these data chunks flying at them onto whichever disks were green and still had space, and then another system would check against the database and pull the content off when someone requested it.

Remember, these services may have been laggy from a disk-IO-nerd perspective compared to a SAN or something, but they were servicing customers over the internet. So they had 5-60 seconds to collect the packets and start sending them to the client, during which their little UI on the client would just show a dialog saying "preparing to send" or whatever.
posted by mrzarquon at 8:11 AM on September 3, 2009


To me the oddest choice is that one of the four SATA controllers is PCI rather than PCIe, and that one carries 50% more drives than each of the PCIe controllers. The PCI bus can only move 133 MB/s total for the entire system -- rather less in practice -- which is adequate for home use of a single drive, but a bottleneck for a striped multi-drive system with far more disk bandwidth than that. Perhaps not such a big deal when it comes to serving data over a GigE connection, but it'd be a real performance chokepoint when rebuilding arrays.
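
(The bottleneck in numbers. The 133 MB/s ceiling is standard 32-bit/33MHz PCI; the drive count and the per-drive streaming rate are my assumptions, the former based on the 50%-more figure above.)

```python
PCI_BUS_MBPS = 133        # theoretical ceiling, shared by the whole bus
DRIVES_ON_PCI = 15        # assumed: 50% more than 10 per PCIe card
DRIVE_STREAM_MBPS = 80    # assumed sustained rate, 2009-era 7200rpm SATA

aggregate = DRIVES_ON_PCI * DRIVE_STREAM_MBPS   # 1200 MB/s of platters
per_drive = PCI_BUS_MBPS / DRIVES_ON_PCI        # ~9 MB/s each

print(f"{aggregate} MB/s of disk behind a {PCI_BUS_MBPS} MB/s bus")
print(f"~{per_drive:.0f} MB/s per drive if all {DRIVES_ON_PCI} stream at once")
```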
posted by George_Spiggott at 8:11 AM on September 3, 2009


> More why I don't like it.

Dude, you are missing the point entirely about the functionality of these systems. They are not meant for you. They are not general-purpose storage, or meant to be fast-as-fuck storage. I mean, you are free to continue not liking it, but realize that they aren't trying to compete with EMC or whatever; those features you've listed mean nothing to them, as their requirements are:

Can it store our uniform blocks of data?
Can we pull from them fast enough that our customers (who are on the internet) don't get upset with the process taking too long?
Is it cheap?
Is it redundant enough for our needs (since it is cheap, can we get away with just two or three boxes holding the data)?

And this is just the final back-end storage place for the data; for all we know, they could actually have a production EMC or two for their fast-as-fuck caching storage, metadata, and database repository. They are probably designing their system around the assumption that they will never have every customer try to do a full restore at the same time. So yeah, they have low IOPS for the number of spindles and controllers and all that, but they are probably only doing one restore for every 1,000 backups, and each of those jobs is distributed across multiple pods. So yeah, pod 32 is totally swamped right now and they can't restore customer 32153's file chunk 89 from it, so they just pull it from the next pod that has it and can respond, etc.
posted by mrzarquon at 8:22 AM on September 3, 2009 [3 favorites]


And actually, if they design their system for load balancing, then it doesn't even matter if a pod is not responding due to being saturated, crashed, on fire, or what have you; the IO system that is actually in charge of allocating those encrypted blocks of customer data just finds other systems to read from and write to.

> On a somewhat different note, does this company mind if you store information that perhaps, ahrem, is of somewhat questionable nature?

These guys are just bit shovelers: you send them bits, they find a place to keep them safe, and that is about it. They probably go out of their way to ensure they don't have a clue what the content of those bits is, except that they belong to your user account.

The devs I know who have done similar work actually designed their system specifically so they couldn't tell anyone what data belonged to who (when you pushed files to them, you got a time limited random code you used to retrieve it, which they made sure they couldn't store).
posted by mrzarquon at 8:30 AM on September 3, 2009


eriko: What this box would work well for is archival storage. It's almost like they were designing a disk storage system for archival storage!

And that, kids, is why EMC, IBM and SUN storage systems costs so much. And that price, kid, is why EMC and IBM and SUN storage systems weren't appropriate for Backblaze.

They're not claiming they've developed some amazing storage system that solves all needs. They're not selling the hardware. They're sharing a system design that they say works for them. Yeah, the blog post is a little light on details about exactly what this system is good for compared to other options, but it's not like they're writing an academic paper.

I realize you are Mr. Storage Expert and know all about everything disk related. But maybe you could learn a little bit by looking at what Backblaze published with an open mind and seeing how their special solution serves their need? Or not, it's up to you.
posted by Nelson at 8:47 AM on September 3, 2009


And this is just the final back-end storage place for the data; for all we know, they could actually have a production EMC or two for their fast-as-fuck caching storage, metadata, and database repository.

Actually, I think the correct approach would be to store this in memory. I have a feeling this is what Google does, and why they require the 10 Gbps. They route their file system through their routers; this is possible (through GlusterFS) and is actually feasible if you have enough bandwidth.

The problem with this scenario is that you have redundancy built into the system, and that same redundancy means writes are slower (as they have to propagate through the system), and deletes are the same way. Financial reports and ERP-type software need a degree of auditing and exactness. This sort of thing only works when you write infrequently and read a lot. You'd be surprised at how hot data centers can run. I have it on good authority that Google runs rather warm. When you're running commodity hardware like this, it's spec'd to high temperature tolerances.

Of course, for most companies, from a financial perspective, it is far cheaper to buy EMC and commercial hardware at higher prices than to rebuild everything from scratch. What these guys are doing is a whole lot simpler than the 10+ years of business and application logic most places face.

I could be talking out of my ass but this is what people who sign off on data centers have been jabbering about lately.
posted by geoff. at 10:35 AM on September 3, 2009


> Of course, for most companies, from a financial perspective, it is far cheaper to buy EMC and commercial hardware at higher prices than to rebuild everything from scratch. What these guys are doing is a whole lot simpler than the 10+ years of business and application logic most places face.

Exactly. These guys are building a new system from scratch, and get to engineer everything end to end, they don't have any legacy support to fold into. So they needed a bunching of spinning disks they could access over HTTP, so they built a system that did exactly that, nothing more, nothing less.

I run into this mental breakdown with companies that are really excited about going to the cloud. I know of a few companies (MeFi being one of them) where it appears to work really well, because they were always small; they started in the 'cloud' and they stayed there. But when I talk to 30-40 person businesses with 20 years of documents and business systems who are excited about putting everything on EC2 and S3, they somehow think their processes won't change at all -- just that their data will magically be elsewhere and they will be saving money (though they don't know how much they are paying for their current systems anyway, besides "too much").
posted by mrzarquon at 10:45 AM on September 3, 2009 [1 favorite]


I can understand why some may compare this Backblaze hardware to others like EMC and Sun and whatnot, seeing as the article had a graphic comparing their prices to EMC and Sun and whatnot.
posted by Tacodog at 11:08 AM on September 3, 2009


I agree with Nelson that the haters are missing the point. This isn't going to fill every single role. It's a new tool that could be useful, & it is pretty nifty for its niche.

mrzarquon, I thought "cloud" implied high availability clusters. Metafilter doesn't work like that.

Even the Amazon offerings can be great for avoiding, say, having to make a business case for every new server or hard disk & the ensuing support contracts & so forth. And Dropbox, for example, can just overhaul a workflow handily.
posted by Pronoiac at 1:10 PM on September 3, 2009


I've learned a new acronym: JBOD (just a bunch of disks).
posted by exogenous at 1:24 PM on September 3, 2009


Oh, and last I heard, Google uses Oracle for the business side of things. RDBMSes and traditional database systems are going to predominate where auditing is needed, not because they are cheaper or better, but because people have learned to trust the results they get. I forget how delete requests are handled in Gluster or Hadoop or a "cloud" platform, but I'm assuming that if you wanted to audit data you'd see funny things like "8 out of 10 clusters have a delete request on that data set." Imagine bringing a payroll to a CFO and seeing what you have to pay this week and what you're pretty sure you're not going to have to pay this week.
posted by geoff. at 1:58 PM on September 3, 2009


> I thought "cloud" implied high availability clusters. Metafilter doesn't work like that.

Metafilter.com does not work that way.

Metafilter Network LLC probably does. From what I recall, everyone uses Google; Matt and pb may have a dev server in their office, or their office may just be desks and workstations, and their entire business workflow lives 'in the cloud.' Granted, it's not a fancy cloud (Google Apps/Docs/whatever), but it's not a local storage system either.

Maybe they use GitHub for versioning their code as well, etc., etc.
posted by mrzarquon at 3:40 PM on September 3, 2009


I just mentioned Metafilter as an example of a small company that really grew up around the web, so moving the business-management side of things to the cloud was easy, since everyone was already using email and web services anyway. It was really just an anecdote about how not all technology ideals or standards (IOPS, virtualization, cloud computing, clustering) work for all business cases: the cloud works for Metafilter's business back end, but it wouldn't (currently) work for a 10-person law firm with 20 years of legal documents and access to only an 8Mbit internet connection.
posted by mrzarquon at 3:45 PM on September 3, 2009


I'm pretty sure that pb has mentioned using MS SQL 2005/8, which I wouldn't consider "in the cloud." In fact, I'm nearly positive that Metafilter is rather conventional -- just a server or two (very beefy servers).
posted by geoff. at 5:24 PM on September 3, 2009


Cite.
posted by geoff. at 5:25 PM on September 3, 2009


geoff -- yes, I wasn't talking about the servers and systems hosting this website; I meant the systems that Matt and co. use for business stuff, i.e. mail (Gmail) or maybe document collaboration, etc., which would qualify as running 'in the cloud.'
posted by mrzarquon at 9:07 PM on September 3, 2009


Ah yes, I'm sorry, I misunderstood you.
posted by geoff. at 5:54 AM on September 4, 2009


"Sure they can give away the plans, because if you have one and something happens, you are dead in the water they aren't in the business of selling $8,000 backup units."


FTFY, arruns. Seriously, though - this in no way threatens their business model, which as others have pointed out, is based on the software layer. This is just a cheap commodity sourcing for them; one that they don't consider a critical advantage.

And they just got a load of free blog-world marketing from it.
posted by IAmBroom at 1:25 PM on September 4, 2009


Cloud computing is a funny thing. A lot of large players are scrambling to sell it to the enterprise market as a REALLY BIG DEAL, and you know what...I think they are right, but the business model for a lot of this is just funny, and figuring out how someone competes with Amazon or Google (should they put their minds to it) is a fun intellectual exercise.

Here is why I think it works and makes sense for some companies to dive in to the cloud business and others not so much.

When I look at an Amazon or a Google, or say a Yahoo or eBay, and their model for offering a cloud computing solution, I see that they are offering a service that is directly in line and complementary with their core business and with the expenses associated with maintaining that business from an infrastructure and technology perspective.

Taking the biggest one (and most convenient for this example), Amazon: their EC2 platform makes great business sense to me. Amazon is essentially an e-commerce retailer at its core. They participate in a cyclical business with defined periods of peak strain on their technology footprint. They are obligated to make regular technology and capacity investments to keep their core business running and to keep capacity on hand to support peak periods. This, generally speaking, leads to excess capacity sitting unused outside of those peak periods. Amazon offering EC2 strikes me as a great way to somewhat offset the cost of keeping that excess non-peak capacity going, mitigate the hardware refresh costs to some extent, and, well... eat their own tech dog food. Turning that excess capacity into a revenue engine is brilliant, and you get to take advantage of ever larger economies of scale on an asset base that generally matures to EOL after 5 years.

The cloud computing space is a race to the bottom; the competition to offer processing power for the lowest cost is a place where only a few folks will survive, and it will be a mean, bitter, ugly, small-margin place when all is said and done. The companies that make it in that market will be the ones who use their cloud solution everywhere they can and keep their service offerings cutting edge with features and enhancements... but at the end of the day, features and enhancements will just move the customer base around a bit between the survivors. It's going to come down to who can afford to offer the most processing power for the lowest absolute cost. Long term, I think costs will be driven more by the energy efficiency of the computing than by anything else. Networking used to be one of the biggest absolute technical roadblocks, and I don't want to overly trivialize it -- it's important -- but it's by no means the barrier to entry it once was.

I think that if you want to be truly successful in the cloud space, from either a storage or a processing perspective, you have to have a business or services offering that is aligned directly with your cloud offering *beyond the cloud offering itself*; you have to aggressively move all business units and service offerings into the cloud platform in order to reduce overall overhead; and finally, I think that legacy computing infrastructure and facilities are a liability rather than an asset. I'd rather have 40 modern facilities with power-efficient, cloud-capable systems and applications than 400 disparate systems running on a mix of platforms with different power footprints in widely varying facilities.
And that ends my rant of the week.
posted by iamabot at 2:18 PM on September 4, 2009


Ugh, 'the cloud', seriously? (disclaimer: I write lots of software that works 'in the cloud').

Dear everyone: if I take my household PC and make a REST interface for you to put files on it, is that 'the cloud'? Seriously, hardware abstraction != the cloud.

Also, 'the cloud' as a rejuvenated term for 'hosted services' that carries none of the security/privacy/ownership concerns is pure BS.

Note: this blog post is frickin' awesome, and makes me think that maybe scalability is a game that more than Google can play.

Everyone knows that 'the cloud' sells big for some reason, and it is a breath of fresh air for allocation, but what the f it is and what it isn't (like... this) needs to be nailed down or it'll be yet-another-crappy-term-that-techies-hate.

It seems like it's already there, actually.
posted by tmcw at 12:49 PM on September 6, 2009




This thread has been archived and is closed to new comments