LOC amassing tweets at breakneck pace, needs help to make it accessible
January 29, 2013 10:53 PM

The Library of Congress posted a Jan 2013 update on its mission to archive public tweets, announced back in April 2010 (previously). 170 billion tweets so far, adding more than .5 billion per day. Search for a term? Prepare to wait ~24 hours.

The pdf report includes details of the agreement with Twitter such as, "The Library cannot provide a substantial portion of the collection on its web site in a form that can be easily downloaded." They have met 2 out of 3 mission goals for the Twitter archive: "to acquire, preserve and provide access to a universal collection of knowledge and the record of America’s creativity for Congress and the American people."

In the past 2+ years, the LOC and Gnip still have not been able to implement a solution to allow researchers to search Twitter's gift in a usable manner. Storing 133TB of compressed data on redundant tapes helps with archiving, but certainly contributes to search response times measured in hours (if not days).

Entrepreneurs, big data scientists: Contact LOC Director of Communications @GayleOsterberg via (what else?) Twitter.
posted by sundog (20 comments total) 10 users marked this as a favorite
 
A couple years ago I wrote an exam question for an undergraduate algorithms class premised on being responsible for indexing the LOC tweet archive. In the exam question I led students through learning how to use the Burrows-Wheeler transform to perform the query in time proportional only to the length of the query, which works out to about 2 disk accesses per character in the search term.

Of course, it wasn't until this year that an algorithm came out that can construct the Burrows-Wheeler transform outside of memory, and still, suffix-sorting 133TB of data is going to be a bit, well, slow. But break it up into 100GB chunks (perhaps by date), farm it out to Amazon Web Services, and it's doable for anybody.
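
To make that concrete, here's a toy Python sketch of FM-index-style backward search over the BWT -- uncompressed rank tables standing in for the compressed structures you'd actually use, and a made-up one-line corpus instead of 133TB of tweets:

    from collections import Counter

    def bwt_via_suffix_array(text):
        text += "\0"                      # unique sentinel, lexicographically smallest
        sa = sorted(range(len(text)), key=lambda i: text[i:])   # toy suffix sort
        return "".join(text[i - 1] for i in sa)

    def build_tables(bwt):
        counts = Counter(bwt)
        C, total = {}, 0                  # C[c] = number of chars strictly smaller than c
        for c in sorted(counts):
            C[c] = total
            total += counts[c]
        Occ = {c: [0] * (len(bwt) + 1) for c in counts}   # Occ[c][i] = count of c in bwt[:i]
        for i, ch in enumerate(bwt):
            for c in Occ:
                Occ[c][i + 1] = Occ[c][i] + (1 if ch == c else 0)
        return C, Occ

    def count_occurrences(query, C, Occ, n):
        lo, hi = 0, n                     # current suffix-array interval
        for ch in reversed(query):        # one pair of rank lookups per character
            if ch not in C:
                return 0
            lo = C[ch] + Occ[ch][lo]
            hi = C[ch] + Occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo

    corpus = "the quick brown fox tweets about the quick brown fox"
    bwt = bwt_via_suffix_array(corpus)
    C, Occ = build_tables(bwt)
    print(count_occurrences("quick", C, Occ, len(bwt)))   # -> 2

The point is just that the per-query work depends on the query length, not the corpus size; the hard (and slow) part is building the transform over 133TB in the first place.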

Somebody from Google will have a far better solution, I'm sure.
posted by Llama-Lime at 11:19 PM on January 29, 2013 [4 favorites]


As much as I appreciate the notion of "big data," even as an occasional wannabe linguist, I think we're focusing more and more on the "big" at the expense of the "data." Twitter is an astronomical collection of casual written English (mostly). A sandy beach is an astronomical collection of individual grains of sand. What is the question it answers?
posted by Nomyte at 11:23 PM on January 29, 2013 [4 favorites]


Some broad topics of interest expressed by researchers run from patterns in the rise of citizen journalism and elected officials’ communications to tracking vaccination rates and predicting stock market activity.
posted by empath at 11:28 PM on January 29, 2013 [1 favorite]


"The Library cannot provide a substantial portion of the collection on its web site in a form that can be easily downloaded."

I'm really annoyed with Twitter and Facebook locking up a huge portion of publicly created data behind walls in order to 'monetize' it. The web worked perfectly well before they showed up - true, they made it easier for people to use. Especially Twitter: with Facebook, the original idea was to create a version of the web you could use to communicate with your actual, real-life friends - not everyone in the world.

But with Twitter, it was all really about posting things in public to begin with. It's just this weird coincidence that this arbitrary restriction they put in, 140 characters, happened to actually make the platform more inviting, because people could tweet what was in their head at whatever moment, rather than feeling intimidated by the idea of writing and editing a whole blog post.

But the thing with Twitter is - and what's worse - that when they started you could do things like get access to a decent chunk of the feed, or subscribe to people's Twitter streams with RSS.

What's really annoying about the whole thing is that we never even needed something like Diaspora or another open social network. The open protocols on the web were more than enough to enable something like FB or Twitter on their own - it's just that they were never made accessible enough for the ordinary user.
A sandy beach is an astronomical collection of individual grains of sand. What is the question it answers?
If there's a specific term that's used in 1/1,000,000 tweets, having a billion tweets means you see a thousand samples of its use.

And the thing is, if 9999 out of 10,000 tweets were totally useless, how would you be able to figure out which ones are the good ones without having access to all of them?
posted by delmoi at 11:30 PM on January 29, 2013


The open protocols on the web were more than enough to enable something like FB or Twitter on their own - it's just that they were never made accessible enough for the ordinary user.

I'm not sure what this means.
posted by empath at 11:34 PM on January 29, 2013


I keep misreading LOC as LOL. Guess this is why I don't get invited to speak at library conferences much nowadays.
posted by Wordshore at 11:38 PM on January 29, 2013


Yes, this really is about the data. And before the LoC got their hands on it for research purposes, others have been exploiting this data for money. Everything on Twitter is being extensively mined, right now, by for-profit companies to decide how to sell you more Lemon-Fresh Tide, or to see how that product is working out in the test market, or frankly I don't even know precisely because it bores me to tears. But there's a cottage industry of Twitter-related startups working on changing your short tweets into money for somebody else, a team of sharecroppers in the fields of Twitter's prodigious slough, and you can be certain that it is not difficult for them to index a mere 133TB. As far as other pedestrian uses of the data, government agencies are buying the services of these Twitter-orbiting companies, and using the same tools that marketers are using, to find out how social movements start and spread, and who the leaders are. The Iranian Green Revolution was a bit of a wake up call and even if Twitter's role has been overstated none of these guys want to be called out for not attempting to use a tool that they could have. So all you activists out there, be aware that the same social media tools that you find so effective for organizing can and will be turned against you.

But back to the data, this is a linguist's gold mine too, more so than a marketer's goldmine. It's an entirely new written form, evolving quickly over time, and every utterance has been stored! This is an absolutely huge corpus of a very unique type. Can't you imagine all the possibilities for learning how language changes and adapts to new forms? Imagine having a record of all the walls of Rome and mining the graffiti! What a treasure! I'm not in linguistics anymore, but this has got to be one of the most exciting datasets that's out there, even more so than Google's book corpus, because it's an entirely new form of expression and should enable all sorts of new linguistic discoveries that we could not have found out before. (And mock the tweet as a form of expression all you want, it's still human, and it's still language, and it's still very amenable to study.)
posted by Llama-Lime at 12:10 AM on January 30, 2013 [1 favorite]


I'm not sure what this means.
It's easy to set up a Facebook or Twitter account and start posting stuff and having your friends see it. It's not so easy to set up your own server and blog engine, and if you do, what are the chances that you can get all your friends and family set up with an RSS reader and subscribed to your feed?

Facebook and Twitter handle all of that for you - and make it pretty easy. As a result they've become de facto standards. Google+ doesn't offer RSS/Atom feeds either.

Tumblr is getting more popular, though. And so far they still allow RSS feeds, which is nice.

By accessible, I simply mean the ability of an "ordinary" user to set up and use something without a major time investment or skill requirement.
posted by delmoi at 12:15 AM on January 30, 2013


Nomyte: A sandy beach is an astronomical collection of individual grains of sand. What is the question it answers?

Well, there's plenty you can learn from a beach, but typically, what people are interested in is what the sand contains.

It may be that only when you collect 170 billion tweets can you really start to tell what's the sand, and what's the seashells.
posted by Malor at 12:29 AM on January 30, 2013


It's easy to set up a Facebook or Twitter account and start posting stuff and having your friends see it. It's not so easy to set up your own server and blog engine, and if you do, what are the chances that you can get all your friends and family set up with an RSS reader and subscribed to your feed?

Blogs were easy enough to set up before the social networks. FOAF wasn't sufficient for the social graph, though.
posted by empath at 12:36 AM on January 30, 2013


But back to the data, this is a linguist's gold mine too, more so than a marketer's goldmine. It's an entirely new written form, evolving quickly over time, and every utterance has been stored! This is an absolutely huge corpus of a very unique type.

While I generally find Twitter to be a lot more noise than signal, this is a good point to make, and one that LOC recognizes:
Twitter is a new kind of collection for the Library of Congress but an important one to its mission. As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.
One question: what are they doing about all the shortened URLs? With some URL-shortening services already out of business, and others unreliable, is the LOC looking to "lengthen" shortened URLs, in the hope of maintaining some of the context that is hinted at in these mini-messages? I didn't see anything on a quick scan of the LOC pages and white paper linked in the OP.

Also, you can search Twitter via Google, and that is instant access, not with a 6-month delay. But I don't think it provides the same options as having direct access to the whole archive. Still, it's strange that the LOC only gets access after 6 months. Maybe it's a liability thing, where any damning/damaging tweets are likely to be noticed and deleted in that time period, and thus kept out of the archive.
posted by filthy light thief at 1:26 AM on January 30, 2013 [1 favorite]


I don't have any thrilling solutions for the LoC, I'd probably give elasticsearch a try and if not do the normal solr/hadoop thang. It would be neat to put up old data as an AWS volume so people could take a crack at it themselves.

Or do a metadata-only presentation: you'd run some arbitrary metadata/fulltext query but only get back a frequency chart, like Google's n-gram viewer.
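
Something like this, say (a rough sketch; the JSON-lines layout and the created_at/text field names are just my guess at the dump format, not what Gnip actually delivers):

    import json
    import sys
    from collections import Counter

    MONTHS = {m: i + 1 for i, m in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

    def term_frequency_by_month(path, term):
        """Count tweets per month that mention `term` in a JSON-lines dump."""
        term = term.lower()
        per_month = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    continue                              # skip malformed lines
                # assumes the old REST-API style date, e.g. "Tue Jan 29 22:53:00 +0000 2013"
                created = tweet.get("created_at", "").split()
                if len(created) < 6 or created[1] not in MONTHS:
                    continue
                if term in tweet.get("text", "").lower():
                    per_month["%s-%02d" % (created[5], MONTHS[created[1]])] += 1
        return per_month

    if __name__ == "__main__":
        counts = term_frequency_by_month(sys.argv[1], sys.argv[2])
        for month in sorted(counts):
            print(month, counts[month], "#" * min(counts[month], 60))   # crude bar chart

You'd precompute the charts (or run this as a batch job over the slices) rather than let anyone touch the raw archive.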

I'm totally going to use this as an interview question though.
posted by Skorgu at 3:43 AM on January 30, 2013


Blogs were easy enough to set up before the social networks. FOAF wasn't sufficient for the social graph, though.
Blogs were pretty trendy for hipsters back in the day, but there were nowhere near as many people who had them as have Facebook or Twitter. What I'm thinking of is basically: if someone has a friend who gets a blog, would it have been immediately obvious to that person what they had to do to get one of their own? Even if there was something that would be easy for them to use, they'd have to google around or ask people for suggestions to find out about it.

(And then you still don't have the social graph stuff like in FB. With FB, you were writing stuff for your actual friends, not the world at large - there was never anything like that with blogs in general. It would be possible to set it up with stuff like OpenID and AtomPub, but no one's ever put it in a simple package.)

Twitter is also interesting: you don't have the social graph stuff, but the limited form makes composing posts easier. Not in the sense of interacting with the computer, but in the sense of how intimidating it is to actually sit down and write a post.

Marketing is an issue as well. These days you hear famous people talk about their Instagram or Google+. Obviously someone convinced them it was a good idea (or paid them, probably, when it comes to G+), and famous people using the product probably get a lot of other people to check it out. Twitter did a lot to court celebrities and promote itself (creating PR as opposed to paid advertising).

Part of the problem is that even if the pure hosting and bandwidth costs are low, it costs money to make things user-friendly and easy to use, and it costs money to market something. So free software, or services that aren't predicated on keeping your content locked up so they can profit off of it (along with any other personal data they get ahold of), are at a serious disadvantage compared to those that are.
even more so than Google's book corpus, because it's an entirely new form of expression and should enable all sorts of new linguistic discoveries that we could not have found out before. (And mock the tweet as a form of expression all you want, it's still human, and it's still language, and it's still very amenable to study.)
That said, think about what Google has access to: everyone's emails and non-off-the-record IMs - and the text messages of Google Voice users.
I don't have any thrilling solutions for the LoC, I'd probably give elasticsearch a try and if not do the normal solr/hadoop thang.
Heh, 133TB sounds like a lot, but if you think about it for a sec, at today's prices it wouldn't really cost much to store on regular hard drives. At the current price of about 30 GB/$ on magnetic hard drives ($99 for a 3TB drive), it would only cost about $4,500 to store (just 45 of those $99 drives). Doesn't seem like a totally unmanageable problem. (And at 10 cents per GB a month, it would cost about $13k a month to store on EC2.)
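
The back-of-the-envelope version, using the prices quoted above (not current rates, and ignoring redundancy, servers, and power):

    import math

    ARCHIVE_TB = 133
    DRIVE_TB, DRIVE_PRICE = 3.0, 99.0      # the $99 3TB consumer drive quoted above
    CLOUD_PER_GB_MONTH = 0.10              # the 10 cents/GB figure quoted above

    drives = math.ceil(ARCHIVE_TB / DRIVE_TB)                # -> 45 drives
    bare_drive_cost = drives * DRIVE_PRICE                   # -> ~$4,455 one-time
    cloud_monthly = ARCHIVE_TB * 1000 * CLOUD_PER_GB_MONTH   # -> ~$13,300 per month

    print("bare drives: %d drives, about $%.0f one-time (no redundancy, no servers)"
          % (drives, bare_drive_cost))
    print("cloud block storage: about $%.0f per month" % cloud_monthly)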

If you put it on EC2, you could just charge people for the compute time to do a map reduce or something against it. I think if they made it a public image, you'd still have to pay to store the volume, right? I'm not sure.
posted by delmoi at 4:20 AM on January 30, 2013


I've been following this for a while; it's an interesting story. To my mind, there are actually two contrasting (not necessarily conflicting) views of data storage and access here. On the one hand, the LOC comes from a world of more tightly controlled metadata schemes and indexing via descriptive metadata, very granular in terms of what is retrieved. On the other hand, the data mining/big data/hadoop approaches work with abstracted patterns from and across large data sets, which can depend on the way you structure your queries in the first place.

They're both useful in context, but also very different approaches. It would be useful to see if these can be combined in a productive way, but at the moment, looking at this from the existing LOC point of view, you can see why it looks very intimidating.
posted by carter at 4:40 AM on January 30, 2013


As long as I eventually have access to all ~17,000 of Horse_ebooks' tweets I can die happy.
posted by six-or-six-thirty at 5:52 AM on January 30, 2013


a mere 133TB.

Okay, right there, you're already talking 67 2TB drives. Which is, of course, certainly the worst way to store it -- as a matter of fact, you would almost assuredly lose it when one fails. Assuming you want to grow the archive, you then have *just* a few inserts you need to do per second. How many? I'd budget at least 5K per second.

Let's build this out. No, you cannot use $150 2TB USB drives. Not if you actually want reasonable access to this data. So, 134 2TB drives in a set of RAID-10 stripes, 15K FC... wait. There are no 2TB 15K FC drives. So, you're back to 7.2K SATA. 75 IOPS per drive, if you're lucky, or just over 10,000 IOPS total. Assuming a 9-1 write load (lots of incoming tweets, fewer readers), you're looking at about 4,500 effective write IOPS (2 physical writes per block, one to each drive in the mirror) and 1,000 read IOPS.

Let's assume there are 1,000 tweets a second. That's 4.5 write I/Os available per tweet -- fine. Oops, wait, we assumed 5K per second. That's 0.9 write I/Os per tweet. OOPS. You've already lost. I won't even bother with RAID-5 (write penalty? 4) or RAID-6 (penalty? 6) because I can't even afford the write penalty for RAID-10.

OK. Fuck SATA. Let's get SAS/FC, on 15K spindles. 170 IOPS per drive. Of course, we can't get 2TB 15K spindles. We can get 900GB, so we now need ~160 spindles for just the raw data. While enterprise-class FC drives are more reliable, you also have more than twice as many disks to fail, so it's time to mirror those: 320 900GB 15K FC drives.

Total array theoretical IOPS -- 54.4K. Nice! Which means we now have, at that 9-1 ratio, about 24.5K effective write IOPS and 5.4K read IOPS -- or, at 5K tweets per second, about 4.9 write I/Os per tweet. Given a 512-byte block size, you can write a tweet in 1 block, so we can handle roughly 24K tweets a second.
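
If you want to fiddle with the spindle math yourself, here's the same budget as a quick sketch -- the per-spindle IOPS figures and the 9-1 split are the assumptions above, not measurements:

    def raid10_budget(spindles, iops_per_spindle, write_fraction=0.9):
        total = spindles * iops_per_spindle
        effective_writes = total * write_fraction / 2   # RAID-10: 2 physical writes per logical write
        reads = total * (1 - write_fraction)
        return total, effective_writes, reads

    for label, spindles, per_spindle in [("7.2K SATA", 134, 75),
                                         ("15K FC", 320, 170)]:
        total, writes, reads = raid10_budget(spindles, per_spindle)
        for tweets_per_sec in (1000, 5000):
            print("%s: %d total IOPS, %.0f effective write IOPS, %.0f read IOPS,"
                  " %.1f write I/Os per tweet at %d tweets/s"
                  % (label, total, writes, reads, writes / tweets_per_sec, tweets_per_sec))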

If everything runs perfectly. BTW, you also need the systems to control these disks, the racks to mount them, the power to run them all. Oh, the power went out? Sorry, should have bought two of them. And the network to connect them. Oh, and then figure out how to keep them in sync.

Oh, yeah, the indices!!! Wow, those are fun. The good news is that they're smaller. The bad news is that each one, while dramatically reducing your cost to read, increases your cost to write -- you have to write the data, read the index, then write the index into the right place. There are tricks to help with that, but you're still, always, doing a read+write. Per index, mind you. Now, since the index should be much smaller, you can throw SSDs at it.

What do you index by? Word? That means an index insert for every word! Hash tags? Better -- most tweets don't have them, so you've already cut your index write load by a bunch. But you're still looking at a lot of them. So, several SSDs. Oh, those fail too.

So -- let's give up on on-the-fly updates. We'll write them to a heap on some *very* fast disk, probably SSD, and use a process later that will spin each day's heap into a database slice and attach that slice to our main datastore. Of course, this means you can't search *today*, but you know, I've already blown away the budget from when you said we could just "slice it into 100GB sections and let Amazon handle it..."

Because you know what cloud services optimize for? Density. They're expecting you to access files on the order of hours, at minimum, not milliseconds. In multiple data centers, without write consistency. Yes, your data will get to the other datacenter eventually, but when your system gets a "write commit" from storage, it will *not* yet be at the other storage centers. They don't have the IO or the bandwidth to do that. It'll get there, but it won't be everywhere when you write it.

Now, we could give up on inserts. Basically, just write the 133TB we have. If a few people use it, then maybe it becomes a cloud-service-compatible app -- if they're willing to rent you 133TB of disk. That's a lot of disk, and it would be very smart to ask exactly how they're going to store that 133TB of data -- because if they store it on RAID-5 stripe sets, you are very likely going to lose part of it.

133TB split up into 100GB slices is a *much* different set of problems. You have a few inserts and a few reads per second, and you slice it into small groups, so that while you may only have 600 IOPS (8x2TB SATA) in your RAID-10 or RAID-6, you only have 80-160 clients connecting to it, and most of them aren't connected at all.

133TB+ trying to keep up with Twitter? That is a completely different ballgame. There's a reason that Twitter has a couple of hundred developers and engineers basically working on this problem every day.
posted by eriko at 6:17 AM on January 30, 2013 [8 favorites]


What is the question it answers?

What the hell are we doing?
posted by mrgrimm at 7:34 AM on January 30, 2013


delmoi: "Marketing is an issue as well. These days you hear famous people talk about their Instagram or Google+. Obviously someone convinced them it was a good idea (or paid them, probably, when it comes to G+), and famous people using the product probably get a lot of other people to check it out. Twitter did a lot to court celebrities and promote itself (creating PR as opposed to paid advertising)."

Not only that, but it makes those famous people JUST people. You know, like the rest of the lumpen proles.
posted by Samizdata at 11:11 AM on January 30, 2013


Oh, yeah, the indices!!! Wow, those are fun. The good news is that they're smaller. The bad news is that each one, while dramatically reducing your cost to read, increases your cost to write -- you have to write the data, read the index, then write the index into the right place. There are tricks to help with that, but you're still, always, doing a read+write. Per index, mind you. Now, since the index should be much smaller, you can throw SSDs at it.

With a data structure like a suffix array you can compute the index in linear time, and compressed self-index variants like the FM-index let you store the data together with the index, further reducing write time and space requirements. I think these are the algorithms and data structures Llama-Lime was talking about. And there are several parallel versions of these algorithms.

Also, I think your numbers assume a real-time indexing operation. The January 2013 report above indicates they have a fairly labor-intensive import process right now, receiving archives of tweets in hour-long segments from Gnip throughout the day. And for most of the research purposes, real-time access isn't needed. They're looking for historical trends.

In that case, building an index every day, or even less often isn't a problem. Hell, once a year could be fine, as having a stable corpus version you can reference in journals will help with reproducing studies and comparing different results between studies.
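
Even a dumb per-day inverted index per slice would cover the historical-trends case. A toy sketch (the file layout and the id/text fields are invented for illustration; a real build would put something like an FM-index on each slice instead of a Python dict):

    import json
    import pickle
    from collections import defaultdict

    def build_daily_index(jsonl_path, index_path):
        index = defaultdict(list)                    # term -> list of tweet ids
        with open(jsonl_path, encoding="utf-8") as f:
            for line in f:
                tweet = json.loads(line)
                for term in set(tweet["text"].lower().split()):
                    index[term].append(tweet["id"])
        with open(index_path, "wb") as out:
            pickle.dump(dict(index), out)            # freeze the day's slice

    def search_slices(term, index_paths):
        term = term.lower()
        hits = []
        for path in index_paths:                     # one frozen slice per day
            with open(path, "rb") as f:
                hits.extend(pickle.load(f).get(term, []))
        return hits

    # e.g. build once after each day's Gnip delivery lands:
    #   build_daily_index("tweets-2013-01-29.jsonl", "index-2013-01-29.pkl")
    # then query across whichever slices a researcher cares about:
    #   ids = search_slices("vaccination", ["index-2013-01-29.pkl"])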

Honestly, I find some of the comments in their report a little disingenuous, like the quote from Dick Costolo near the bottom of page 4: "It’s two different search problems. It’s a different way of architecting search, going through all tweets of all time. You can’t just put three engineers on it."

Well, yes, it is a different problem. But it's the only one they need to solve, unlike Twitter and Google, who need to provide many different browsing and search methods, in addition to handling a very large stream of writes.

The LOC understands archiving and preservation. Their discussion of multiple offsite copies of tape archives indicates this. And that's great, because it's a critical issue often lost in our digital ecosystem.

But they should be able to handle this with the resources they have, or a relatively minor increase. Is it currently under Digital Initiatives, which has about $50 million, of which $19 million is for staff, according to the 2013 LOC Fiscal Budget Justification? That's a lot of money, and with 134 full-time staff in that department, I would assume they have some sharp Information Retrieval experts. Their talk of public-private partnerships and leveraging private sector investment is a red flag, and I hope there are no strings attached if they make those deals.

Somebody from Google will have a far better solution, I'm sure.

Google's Developer Relations Lead made a related snarky comment on that post:

"I think we can get the query times down quite a bit further than 24 hours :)"
posted by formless at 7:59 PM on January 30, 2013 [1 favorite]


133TB+ trying to keep up with Twitter? That is a completely different ballgame. There's a reason that Twitter has a couple of hundred developers and engineers basically working on this problem every day.


They don't really need to keep up with twitter. They can just do an update every 6 months or every year.
posted by empath at 8:11 PM on January 30, 2013




This thread has been archived and is closed to new comments