A sandy beach is an astronomical collection of individual grains of sand. What is the question it answers?
If there's a specific term that's used in 1 of every 1,000,000 tweets, having a billion tweets means you see a thousand samples of its use.
I'm not sure what this means.
It's easy to set up a Facebook or Twitter account and start posting stuff and having your friends see it. It's not so easy to set up your own server and blog engine, and if you do, what are the chances that you can get all your friends and family set up with an RSS reader and subscribed to your feed?
Twitter is a new kind of collection for the Library of Congress but an important one to its mission. As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.
One question: what are they doing about all the shortened URLs? With some companies already out of business, and others already unreliable, is LOC looking to "lengthen" shortened URLs, in the hope of maintaining some context that is hinted at in these mini-messages? I didn't see anything on a quick scan of the LOC pages and white paper linked in the OP.
Blogs were easy enough to set up before the social networks. FOAF wasn't sufficient for the social graph, though.
Blogs were pretty trendy for hipsters back in the day, but there were nowhere near as many people who had them as have Facebook or Twitter. What I'm thinking of is basically this: if someone has a friend who gets a blog, would it have been immediately obvious to that person what they had to do to get one of their own? Even if there was something that would be easy for them to use, they'd have to google around or ask people for suggestions to find out about it.
even more so than Google's book corpus, because it's an entirely new form of expression and should enable all sorts of new linguistic discoveries that we could not have made before. (And mock the tweet as a form of expression all you want, it's still human, and it's still language, and it's still very amenable to study.)
That said, think about what Google has access to: everyone's emails and non-off-the-record IMs, and the text messages of Google Voice users.
I don't have any thrilling solutions for the LoC. I'd probably give Elasticsearch a try and, if not, do the normal Solr/Hadoop thang.
Heh, 133TB sounds like a lot, but if you think about it for a sec, at today's prices it wouldn't really cost much to store on regular hard drives. At the current price of about 30 GB/$ on magnetic hard drives ($99 for a 3TB drive), it would only cost about $4,500 to store (just 45 of those $99 drives). Doesn't seem like a totally unmanageable problem. (And at 10 cents/GB it would cost just $13k to store on EC2.)
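For anyone who wants to check that back-of-the-envelope math, here is a rough Python sketch. The 133TB, 3TB, $99 and 10 cents/GB figures are the ones quoted above; the function names and the 1TB = 1000GB conversion are my own assumptions, and it ignores redundancy, enclosures, power and everything else a real archive would need.

```python
import math

# Figures quoted in the comment above.
CORPUS_TB = 133
DRIVE_TB = 3
DRIVE_PRICE_USD = 99
CLOUD_CENTS_PER_GB = 10

# Assumption: decimal units, as drive vendors use.
GB_PER_TB = 1000


def drive_estimate(corpus_tb=CORPUS_TB, drive_tb=DRIVE_TB, price=DRIVE_PRICE_USD):
    """Return (number of whole drives needed, total cost in USD)."""
    drives = math.ceil(corpus_tb / drive_tb)
    return drives, drives * price


def cloud_estimate(corpus_tb=CORPUS_TB, cents_per_gb=CLOUD_CENTS_PER_GB):
    """Return the cost in whole USD at a flat per-GB price."""
    # Integer cents avoid floating-point rounding surprises.
    return corpus_tb * GB_PER_TB * cents_per_gb // 100


if __name__ == "__main__":
    drives, cost = drive_estimate()
    print(f"{drives} drives, ${cost:,}")
    print(f"${cloud_estimate():,} at {CLOUD_CENTS_PER_GB} cents/GB")
```

That works out to 45 drives for $4,455, and $13,300 at the flat per-GB rate, which lines up with the rounded numbers in the comment.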
Of course, it wasn't until this year that an algorithm came out that can construct the Burrows-Wheeler transform outside of memory, and still, suffix-sorting 133TB of data is going to be a bit, well, slow. But break it up into 100GB chunks (perhaps by date), farm it out to Amazon Web Services, and it's doable for anybody.
Somebody from Google will have a far better solution, I'm sure.
posted by Llama-Lime at 11:19 PM on January 29 [4 favorites]
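To make the chunking idea in the comment above a bit more concrete, here is a minimal Python sketch. It is not the external-memory algorithm the comment refers to: it uses the naive sorted-rotations construction of the Burrows-Wheeler transform, which only works for text that fits comfortably in memory. The function names, the `(day, text)` input shape and the NUL sentinel are all assumptions made for illustration.

```python
from collections import defaultdict


def bwt(text, sentinel="\x00"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column.

    Quadratic in memory, so it is only suitable for small inputs.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_per_day(tweets):
    """Group (day, text) pairs by day and transform each day's chunk on its own.

    Each chunk is independent, so in principle the chunks could be farmed
    out to separate machines.
    """
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(text)
    return {day: bwt("\n".join(texts)) for day, texts in by_day.items()}


if __name__ == "__main__":
    print(bwt("banana", sentinel="$"))  # prints annb$aa
    # Hypothetical sample data, not real tweets.
    sample = [("day1", "hello world"), ("day1", "hello again"), ("day2", "goodbye")]
    for day, transformed in sorted(bwt_per_day(sample).items()):
        print(day, repr(transformed))
```

The point of splitting by day is only that each chunk gets its own independent index; searching the whole archive would then mean querying every chunk and merging the results.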