

Practical information for wannabe Glenn Greenwalds
March 17, 2014 10:45 AM

"The first journalist to attempt reporting on the Wikileaks cables was David Leigh of The Guardian. The material arrived as a single 1.7GB CSV file containing 251,287 U.S. diplomatic cables from 1966 to 2010. If you’ve ever tried to open a 1.7GB file, you know you probably can’t. Microsoft Word and Excel will plain refuse. Windows Notepad and Mac TextEdit will try, but slow to a crawl." At Opennews Source, Jonathan Stray has written a helpful beginners' guide to dealing with large amounts of documents for journalists and interested lay people.
posted by MartinWisse (18 comments total) 52 users marked this as a favorite

 
Journalists should really look into the tools the legal industry uses for electronic discovery. There is a lot of great information in the article, but it seems a bit amateurish compared to the kind of data processing and analytics you can perform with modern ediscovery processing and review platforms. I've had to deal with data sets in the billions of emails, software repositories, unstructured data, common office files...you name it. You can search conceptually, use human, temporal, or geographic analytics, train AI to identify relevant documents, create concept graphs...

Anyway, it's definitely expensive, but the software and services have dropped in price tenfold over the past 5 years and will continue to do so. If you ever wonder what kind of analysis intelligence agencies can do, all you have to do is look at commercially available software and imagine what will be possible in 5-10 years. That's what they got.
posted by pleem at 11:14 AM on March 17 [6 favorites]


It's not for the faint of heart but what used to be called Google Refine (now OpenRefine) is allegedly pretty good at hashing through a ton of semi-structured data. Free, too.
posted by Skorgu at 11:18 AM on March 17 [5 favorites]


The future is a weird place.
posted by PMdixon at 11:24 AM on March 17 [2 favorites]


And I can't even find a solution where I can scan a couple of hundred receipts and invoices per year, meta data 'em and find 'em. Did I mention it should be super fast and cheap?
posted by Foci for Analysis at 11:26 AM on March 17


Why spend lots of money on specialized tools to open up very large and unwieldy files? Why not just split the file with open-sourced UNIX tools and work with it in pieces? And, if anything, that facilitates distributed, "parallelized" journalism for which the Internet is ready-made.
posted by Blazecock Pileon at 11:48 AM on March 17 [2 favorites]
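
For anyone who wants to try the split-it-into-pieces approach on a CSV, here is a minimal sketch in Python rather than the UNIX tools themselves. It assumes the file has a header row and is UTF-8; the function name `split_csv`, the `part_0000.csv` naming, and the default chunk size are all hypothetical choices for illustration, not anything from the article. It splits on parsed records rather than raw lines and repeats the header in every piece, so each piece can be opened on its own.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, rows_per_part=50_000):
    """Split a CSV into numbered parts, repeating the header in each part.

    The input is streamed row by row, so memory use does not grow with
    the size of the file. Returns the list of part paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    writer = None
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return parts  # empty input, nothing to write
        for i, row in enumerate(reader):
            if i % rows_per_part == 0:
                # Start a new part: close the previous one, open the next.
                if out is not None:
                    out.close()
                path = out_dir / f"part_{len(parts):04d}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                parts.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return parts
```

Something like `split_csv("cables.csv", "pieces", rows_per_part=10_000)` would then give pieces small enough for a spreadsheet (the file name here is made up). If individual fields are very long, the `csv` module's default field size limit may need raising with `csv.field_size_limit`.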


(May as well get used to the command-line, anyway, given the need to deal with advanced encryption to protect journalists and sources...)
posted by Blazecock Pileon at 11:50 AM on March 17 [1 favorite]


Journalists do use command-line tools - e.g. IIRC csvkit was written when Chris worked for the Chicago Tribune – but there are a lot of skills to get there. Some people are working on better tools (or making things like OpenRefine less user-hostile) and others are running courses at conferences or colleges helping journalists get up to speed but there's still a ton of work to demystify the shell to a group which largely hasn't needed to use it before.

The OpenNews project is really interesting to follow in this regard – here's another neat recent story talking about some nice assessment tools and an important result:

https://source.opennews.org/learning/human-assisted-reporting/
posted by adamsc at 12:07 PM on March 17 [1 favorite]


> If you’ve ever tried to open a 1.7GB file, you know you probably can’t

This is one of a handful of reasons I keep Visual Studio on my machine even though I don't do much .NET work anymore. There's a free version as well (but no x64 version as yet so this work-around would only work up to about 2GB of file). That said, better to do what Blazecock Pileon suggests and learn how to take slices of the file to get a feel for what you're working on. But I do appreciate the need to get a good look at just what you have sometimes. VS is also the one tool I've found that will make a serious effort to handle large files on a single line.
posted by yerfatma at 12:20 PM on March 17


> Why not just split the file with open-sourced UNIX tools and work with it in pieces?

Why not read the article?
> Two commands you must absolutely know: head shows the first few lines of a file no matter how big it is, while grep searches through files. Both come installed on Mac and Linux. For Windows you will want to install a Unix-compatible command line such as Cygwin or Git Bash.
posted by benito.strauss at 12:22 PM on March 17 [1 favorite]
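
For readers without a Unix shell handy, here is a rough sketch of what `head` and `grep` do, written in Python. These are simplified stand-ins, not the real tools: the function names and defaults are assumptions made for illustration, and the real commands have many more options. Both read the file one line at a time, which is why they stay fast no matter how big the file is.

```python
import re
from itertools import islice


def head(path, n=10, encoding="utf-8"):
    """Return the first n lines of a file without reading the rest."""
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def grep(path, pattern, encoding="utf-8"):
    """Yield (line_number, line) for each line matching the regex pattern."""
    regex = re.compile(pattern)
    with open(path, encoding=encoding, errors="replace") as f:
        for number, line in enumerate(f, start=1):
            if regex.search(line):
                yield number, line.rstrip("\n")
```

`head(path, 5)` gives a quick look at the column layout, and `list(grep(path, "some name"))` pulls out only the lines worth reading.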


> Why not read the article?

Yeah. Very seriously, if you're a journalist handling this stuff and you're not learning Linux and using Tails, then for the love of God start doing that.
posted by mhoye at 12:30 PM on March 17


Just recently it was discovered that a supposed MtGox database and document dump was actually malware that stole your Bitcoin wallet from your local computer.

This falls under the category of "don't open random executables downloaded off of the internet" but one wonders how many journalists were too excited to take proper precautions before opening it.
posted by RobotVoodooPower at 2:29 PM on March 17 [2 favorites]


> Why not read the article?

There's nothing in there about what I mentioned. Assume good faith.
posted by Blazecock Pileon at 3:16 PM on March 17


> one wonders how many journalists were too excited to take proper precautions before opening it.

One wonders how many journalists actually have bitcoin wallets.

Actually, one doesn't.
posted by dersins at 4:11 PM on March 17 [2 favorites]


12 years ago I'd happily sling around 4× 3 GB (CMYK separations gone awry, usually) buffers in emacs on ~300 MHz Sun Ultra 10s with 256 MB of RAM: that's far below the spec of any modern netbook. It would be a bit faster than a Raspberry Pi, but not wildly.

Similarly, we were slinging all of the news (every news website [pretty much] in Europe) into a corpus that was searchable by dates, publications, regexes and collocates in the late 1990s. Billions of words, and all on a SparcStation 10, serving to up to 64 concurrent users. A Sparc10 has far less processing power and memory than a first-gen iPhone.

Why have these tools taken so long for people to understand? I mean, some of it's good advice given in the article, but it wanders off into numpty pseudo-closed software within seconds.

(But that Army appropriations report linked from the article? My hat's off to the team of dedicated obfuscators who came up with that template. Sheer bloody-minded jobsworth genius: meets the requirement to publish, which doesn't say anywhere it has to be readable.)
posted by scruss at 4:16 PM on March 17


> Why not just split the file with open-sourced UNIX tools and work with it in pieces?

Fragment 1 "We are confident that during House hearings our leadership totally killed…"

Fragment 2 "… Sen. Feinstein's Senior Intelligence Staffer. Also, our Adopt-A-Kitten program is going swimmingly."
posted by ChurchHatesTucker at 6:48 PM on March 17 [2 favorites]
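
The joke points at a real hazard: a CSV field can contain line breaks, so cutting a file at an arbitrary line can land in the middle of a record. A tiny sketch of the difference, using a made-up two-record sample (the data is hypothetical, not from the cables):

```python
import csv
import io

# Hypothetical sample: the body of record 1 contains a line break.
sample = 'id,body\n1,"first line\nsecond line"\n2,"single line"\n'

# Splitting on raw lines treats the quoted line break as a record boundary.
naive_lines = sample.splitlines()

# A CSV parser honours the quotes and keeps the record in one piece.
records = list(csv.reader(io.StringIO(sample, newline="")))

print(len(naive_lines))  # 4 lines: header plus three fragments
print(len(records))      # 3 rows: header plus two whole records
```

Any splitting tool that counts lines rather than parsed records can produce fragments like the ones above.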


The real problem, of course, is that most journalists, like most other humans, are not techies. They get into journalism to report news, not muck around with computers (I am, btw, a j-school graduate who is also a techie...but I wasn't a techie until long after I was a practicing journalist).

A craftsman must know his or her tools, of course, and that's the real answer. I've never been patient with people who need to do something and complain that they don't know how or that it's too hard. Learn. Find someone who can help you get the job done and teach you at the same time. This is a great article with links to the basics of doing this kind of work. If they're not teaching this stuff in j-school, they should be. I know they taught us state-of-the-art stuff for investigation when I was in school, but the state of the art has changed a little since 1978.
posted by lhauser at 6:01 AM on March 18


I would use MS Access for the CSV file but that's probably not on every journalist's PC.
posted by tommasz at 11:20 AM on March 18 [1 favorite]


Opening a multi-gigabyte file in Access feels like a good way for the NSA to destroy a journalist's computer.
posted by yerfatma at 12:36 PM on March 18 [3 favorites]



