
The Data Deluge
February 28, 2010 12:28 PM   Subscribe

According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Data data everywhere and possibly too much to drink?
posted by Glibpaxman (21 comments total) 7 users marked this as a favorite

So your solution is to post to Metafilter? You, sir, are part of the problem.
posted by Horace Rumpole at 12:36 PM on February 28, 2010 [9 favorites]

After exabyte comes zettabyte followed by yottabyte. After that is anyone's guess but some people are trying to lobby for hellabyte.
posted by Rhomboid at 12:45 PM on February 28, 2010 [6 favorites]

What gets me is not the amount of data produced, but the amount ignored and forgotten. As Washington Irving observed in "The Mutability of Literature"
How much, thought I, has each of these volumes, now thrust aside with such indifference, cost some aching head! how many weary days! how many sleepless nights! How have their authors buried themselves in the solitude of cells and cloisters; shut themselves up from the face of man, and the still more blessed face of nature; and devoted themselves to painful research and intense reflection! And all for what? to occupy an inch of dusty shelf—to have the title of their works read now and then in a future age, by some drowsy churchman or casual straggler like myself; and in another age to be lost, even to remembrance. Such is the amount of this boasted immortality. A mere temporary rumor, a local sound; like the tone of that bell which has just tolled among these towers, filling the ear for a moment—lingering transiently in echo—and then passing away like a thing that was not.
posted by stbalbach at 12:46 PM on February 28, 2010 [8 favorites]

"An assurance of unfading laurels, and immortal reputation, is the settled reciprocation of civility between amicable writers. To raise monuments more durable than brass, and more conspicuous than pyramids, has been long the common boast of literature; but among the innumerable architects that erect columns to themselves, far the greater part, either for want of durable materials, or of art to dispose them, see their edifices perish as they are towering to completion; and those few that for a while attract the eye of mankind are generally weak in the foundation, and soon sink by the saps of time."

Samuel Johnson, Rambler 106.
posted by Horace Rumpole at 12:53 PM on February 28, 2010 [1 favorite]

A Fistful Of Datas?

I tried to find the recut one where it's just this clip over and over with LaForge looking on in dismay, but I forgot which one that was.
posted by DecemberBoy at 1:06 PM on February 28, 2010

Yeah, but take out all the copy-pasted Treaties of Westphalia in Metafilter threads, and that number's easily 5% lower...
posted by l33tpolicywonk at 1:07 PM on February 28, 2010 [1 favorite]

Yes, definitely too much to drink.

To give you some perspective here from Molecular Biology Land, we can now take a tissue sample, shoot it into the mass spec, and end up with a long list of proteins that are up- or down-regulated for a particular disease state. Awesome. Now what? How do we take that data and make sense of it?

Right now we have two options: one, painstakingly go through the data and try to identify one or two protein changes (out of thousands) that might shed some light on the disease in question, or two, use a very expensive piece of database software that basically does the same thing, only with less skill and no common sense. We do both, and neither method is very good.
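To make that first option concrete, here's a minimal sketch of the kind of filtering involved; the column names, thresholds, and protein list are all illustrative, not from any real proteomics pipeline:

```python
# Hypothetical sketch: flag proteins whose abundance changed markedly
# between disease and control samples. Field names and cutoffs are
# illustrative assumptions, not a real pipeline.

def flag_candidates(proteins, min_log2_fold_change=2.0, max_p=0.05):
    """Return names of proteins that are strongly up- or down-regulated."""
    hits = []
    for p in proteins:
        if abs(p["log2_fold_change"]) >= min_log2_fold_change and p["p_value"] <= max_p:
            hits.append(p["name"])
    return hits

sample = [
    {"name": "HSP90", "log2_fold_change": 2.4,  "p_value": 0.010},
    {"name": "ACTB",  "log2_fold_change": 0.1,  "p_value": 0.800},
    {"name": "TP53",  "log2_fold_change": -3.1, "p_value": 0.002},
]
print(flag_candidates(sample))  # ['HSP90', 'TP53']
```

The hard part, of course, isn't the filter — it's deciding which of the thousands of hits that survive it actually mean anything for the disease.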

What the scientific community needs is A.I. capable of finding connections, noticing trends, and picking out the bits of signal in the noise. Unfortunately, the tech just isn't there yet.
posted by dephlogisticated at 1:47 PM on February 28, 2010 [3 favorites]

I'm not sure that the AI you describe, dephlogisticated, is going to make your dreams come true. I've seen good scientists look no farther than what they expect to find, dismissing other relationships in the data set as a waste of time.

I think your hypothetical AI is doomed either to give you the same answers you'd have gotten to begin with, or to cry wolf so often that no one trusts it.
posted by Kid Charlemagne at 2:16 PM on February 28, 2010

[...] mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes.

"It"? We, surely. Unless... !
posted by Sys Rq at 2:31 PM on February 28, 2010 [2 favorites]

It would be interesting to see a comparison of more identifiable benchmarks like the number of words or pictures. A book's worth of text today might be tens or even hundreds of times larger than the same number of words a decade ago, and a simple diagram might be tens of thousands of times larger, thanks to changes in formatting, resolution, etc.

It would also be interesting to know if there is a way of quantifying the quality of certain types of data, like books. I understand that's an almost impossible and highly subjective proposition, but as an example, what percentage of books published today make it onto the NY Times bestsellers list (or some other measure that could be used as a constant from decade to decade)? Dropping barriers to publication has the upside of allowing some books to percolate to the top that might otherwise never have been picked up by a major publisher 30 years ago... but it also probably implies, conversely, that the percentage of available crap has increased dramatically too [with full understanding that one man's trash is another man's treasure].

Or what percentage of the data is comprised of say, scientific, business, or private data. The articles gave examples of (IIRC) Walmart pushing many times more data in a day on business transactions than the amount of books contained in the Library of Congress, and data collection in the field of Astronomy outstripping itself at, well, astronomical rates. But it would be interesting to know how this explosion of data is being driven by, say, the bulk of the world going digital with things like music and photos and backups.

For example, by far and away the single biggest category of data I have is digital photos I've taken, about 1TB duplicated on two additional backup disks (3TB). By comparison, I'd be surprised if I had even 10GB of text documents.

Just wondering out loud :)
posted by Davenhill at 2:46 PM on February 28, 2010

It is important to realize that most of this data is both trivial and ephemeral. I mean, how many of you store more than a fraction of your emails? text messages? data used to get your text messages from your sender's device to your own? Someone may have interest in that data you created, but it's probably not you.
posted by GenjiandProust at 2:55 PM on February 28, 2010

Kid Charlemagne: I think your hypothetical AI is doomed either to give you the same answers you'd have gotten to begin with, or to cry wolf so often that no one trusts it.
I'm much more optimistic about AI. Human brains can be fairly good at pattern recognition, even complicated and nuanced pattern recognition, but much/most of it is done subconsciously. When something does or does not fit a pattern, you're initially having to rely on a gut feeling without necessarily consciously understanding what is causing it.

Computers should be better at looking for known patterns, and AI should improve a computer's ability to look for unanticipated patterns... and more importantly, for any (potential) pattern, a computer program would be able to spit out the data for closer scrutiny.
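As a toy illustration of "looking for known patterns" (nothing like the AI under discussion — just a simple z-score test with an arbitrary threshold), a computer can mechanically flag points that deviate from the rest and spit them out for closer scrutiny:

```python
# Illustrative sketch: mechanically flag readings that deviate from the
# bulk of the data (simple z-score test; the threshold is arbitrary).
import statistics

def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 25.0, 10.1]
print(flag_outliers(readings))  # [25.0]
```

Notice even this trivial example has the failure modes Kid Charlemagne worries about: set the threshold too low and it cries wolf, too high and it only confirms what you already knew.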
posted by Davenhill at 3:15 PM on February 28, 2010 [1 favorite]

In the interest of being the one dissenting voice, I say that AI is not the solution. What we really need to manage the data flood is a worldwide effort to destroy the economy so that all humans revert to subsistence farming. The computers that remain will have time to catch up while humanity gets itself out of the dark ages.
posted by mccarty.tim at 3:24 PM on February 28, 2010 [1 favorite]

You need philosophy to work with science in order to interpret data.
posted by ovvl at 3:33 PM on February 28, 2010

Another benchmark I'd like to see is some sort of automagical indexing of what counts as data if you pay enough attention or something is important enough to you. While I strongly suspect that the modern Homo sapiens ingests more raw data per capita than the whole of the species did for millennia, there seems no good way of knowing if, for example, people 15,000 years ago paid as much attention to types of grass as we do to types of youtube videos.
posted by digitalprimate at 4:18 PM on February 28, 2010

I think your hypothetical AI is doomed either to give you the same answers you'd have gotten to begin with, or to cry wolf so often that no one trusts it.

It'd be fantastic to have A.I. that could pick out interesting data points/patterns with the same skill, or even half as much, as a human (though I'm not holding my breath). The point is that an automated system could handle 10^10 more data in 1/10^10 the time. That's really the issue here—the pure volume of data, rather than our ability to make sense of it.

When I talk about better A.I., I'm envisioning something much closer to Google than Hal 9000. An algorithm doesn't have to understand what it's doing in order to be good at doing it.
posted by dephlogisticated at 4:47 PM on February 28, 2010 [2 favorites]

Yeah, sorry about that. I've needed to turn off my random USENET post generator for about 15 years now. I'll get to it soon.
posted by chairface at 4:55 PM on February 28, 2010 [1 favorite]

it'd be fantastic to have A.I. that could pick out interesting data points/patterns with the same skill, or even half as much, as a human

The flip side of that is that people tend to overfit. (paper links from here; some behind a pay wall but the abstracts are free)

As an example, I saw a really frustrating talk recently where someone went through the trouble of doing a broad-scale study, did a bunch of statistics on their data and then proceeded to disregard those results in favor of following up on a statistically insignificant lead they had a prior hunch about. At that point, why even bother to generate the data? Why not just pick your candidates ahead of time and just study those?

Not to say that AI-driven approaches to data interpretation don't have biases of their own, sometimes covert ones - but that's on a whole different scale from the intuitive capacity of the human brain. If we want to know what's driving the prediction in a formal model, we can usually find out (as Davenhill mentioned). We can also try to assess how predictive the model's rules are in general and get a sense for how much to trust them. You can't really do any of that with a mental model.
posted by en forme de poire at 11:01 PM on February 28, 2010

Prediction: The study, research, development and application of hard and soft AI, as well as general informatics, will generate even more data per capita than ever before.

AI would only be another step or stop-gap measure. It's no end point or holy grail. Real AI won't be the *waves hands magically* singularity. We probably won't even recognize real AI when it first shows up. Heck, I'll put some money on it appearing accidentally in the form of a complicated botnet that doesn't want to take orders or be switched off. As far as I can tell, it's already here. AI programs pass the Turing Test daily in the delivery of spam and malware. Other very sophisticated AI programs try to weed it out so the much smaller amount of legitimate mail makes it to John and Jane Doe's inbox.

Informatics is complicated. Computational anything is always going to be an abstraction and mediation of real things - and it requires data, lots of it. The more computational anything we do the more data there will be, the more complexity, the more information and knowledge that will be required, the more blind trust we have to place in these complex systems and the experts who actually know their specialized fields to deal with these complex systems. Teaching and educating these specialists takes even more data. Equipping them to do research even more. Implementing their findings and bringing them to market -- even more.

In short, there's really nothing much we can do. That snowball rolled right out of Pandora's magic box a long time ago. Barring catastrophe or a technological reset, it's going to keep on rolling and self-generating more and more data. That is its function. It needs input.

Think about it. Every keystroke I make here adds to it. Every time I save a favorite, or store a local bookmark. Every time I visit a page that throws a cookie, beacon or tracker. Every photo. Every graphics project. Every log file. Every altered or remixed mp3. Every random voice recording, phone call, dialed number, grocery store purchase, lent book...

It all adds up, and I'm nobody.

I'm not a biologist, or - heh - a "business intelligence" agent. I'm not a credit card agency. A short ten years ago I never thought I would have a use for a mere terabyte of data, but now I do. I remember trying to figure out what to do with a couple of gigs of HDD space not long ago. Now I could theoretically lose a few dozen gigs up my nose if I sneezed the wrong way near a microSD card.

And behind every bit of organized, indexed data is a whole lot of entropy in the form of spent energy. The finer-grained the media, the more data that is required to engineer and apply it, the more complicated the tools, etc.

Data data everywhere, and not a spot to think. And we're just getting started.

It's worse than turtles all the way down. It's like grey goo is here already, except in the form of user manuals, NAND gates and optical pits on discs.

If you watch carefully you can see the universe crystallizing with the inherent entropy of informational order, the heat death of the universe in action.
posted by loquacious at 1:26 AM on March 1, 2010 [2 favorites]

Data (at least personal data) needs a Best By date to keep your personal information from becoming unmanageable in the flood.

If you submit your name to a database with an agreement that it will be kept for no longer than X months, your name is stored in a record that includes a corresponding expiration date. On that date, unless you actively intervene, your name is purged from that database or, depending on the type of information, it is rendered inaccessible until you reactivate it or (perhaps) you die.

And those expiration dates should go with the data. If a third party buys year-old data that is supposed to expire when it is two years old, the purchased data should expire after a year.

Audit = submit dummy identities to data gatherers and then check back to make sure those records are purged on schedule.
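A minimal sketch of the idea, assuming hypothetical field names — each record carries its own expiration date, which travels with the data to any buyer, and a purge pass drops anything past due (the "Dummy" record plays the audit canary):

```python
# Hedged sketch of expiring personal data: the expiration date is part
# of the record itself, so it survives a sale to a third party. Field
# names and the dummy-identity audit record are hypothetical.
from datetime import date, timedelta

def purge_expired(records, today):
    """Keep only records whose expiration date has not yet passed."""
    return [r for r in records if r["expires"] >= today]

today = date(2010, 3, 1)
records = [
    {"name": "Jane Doe", "expires": today + timedelta(days=365)},
    {"name": "Dummy #1", "expires": today - timedelta(days=1)},  # audit canary, past due
]
kept = purge_expired(records, today)
print([r["name"] for r in kept])  # ['Jane Doe']
```

An auditor would submit records like "Dummy #1" and later verify they no longer appear in the database after their expiration date.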
posted by pracowity at 6:04 AM on March 1, 2010 [1 favorite]

I mean, how many of you store more than a fraction of your emails?

Why in God's name would you ever throw away any of your (non-Spam) emails?! I haven't thrown away an email since 2004 - and I probably have 95% of my emails before that...

Storing email costs me nothing (even if it didn't live on Google's servers, it's a few gigs, which is pocket change), and because at least some of my email has value to me, I'd have to think a little before throwing out any given piece of it. I have neither time nor interest in assuming this chore, nor do I see its value. I scan my emails and then just archive them all, trash and treasure alike; it's not like I ever have to see these archived emails unless I search for them, and it means that I have a completely empty inbox a lot of the time (like now...), concealing some 204598 emails I have in this account.
posted by lupus_yonderboy at 9:36 AM on March 1, 2010

