History as data science
History Lab has "focused on digitizing, structuring and visualizing large sets of declassified US government documents. This is a starting point for showcasing how computational techniques can aid historical research." Can big-data analysis show what kinds of information the government is keeping classified?

Declassification is inevitably a political process: when the government decides what evidence to release, that decision is necessarily determined in large part by political considerations. Quite often, material is withheld because it is considered politically sensitive. But this is generally the sort of material the historian would most like to see. There is thus a built-in conflict between the consumer and the supplier of historical evidence: we historians want to see the "dirt," but those responsible for the release of documents want to make sure that the material released does not damage the political interests they are responsible for protecting.
Declassification Engine provides solution to processing declassified documents
“That has a pretty obvious explanation — it’s during the Yom Kippur war. But what about these spikes, here in 1975, or these in 1976? They could be worth looking into.”

75-76? gosh, Vietnam, Cambodia, Chile, Africa, Argentina, East Timor...
posted by clavdivs at 9:42 PM on April 27, 2015

Very cool! If we had FOIA stats then similar methods could be applied to local governments. And companies expose interesting dirty laundry through their activities too, like reports, tax filings, etc., especially banks.
posted by jeffburdges at 10:18 PM on April 27, 2015

That's a fascinating and potentially very important project. I'd imagine it must be very frustrating for the researchers to uncover so many tantalising hints of patterns and priorities (like the change in declassification rates of Operation Boulder documents), but know that they'll probably never get the evidence necessary to confirm their hypotheses. It'd drive me crazy.
a set of 117,000 records produced by various US departments and agencies from the 1940s to the present... is on loan to the Columbia researchers from the publishing company Gale.
This and a previous mention of getting access to commercial archives surprised me, as I thought works produced by the US government are not entitled to copyright protection within the US. How have these ended up in paywalled archives?
posted by metaBugs at 5:29 AM on April 28, 2015

This (broad-scale computer analysis of declassified documents) is the subject of my (long running and much interrupted) PhD thesis. I've been working with ideas similar to these since about early 2001, and I presented an early version in 2004.

Since then, however, I've had real difficulty persuading academic colleagues that this is a fruitful way to go. Intelligence professionals immediately 'get it', but intelligence historians tend to see this kind of work as methodologically unproven for several reasons, some of which have been raised in this thread already.

I really hope these guys take off in their project. I mean, obviously for selfish reasons this legitimates my research to a certain extent. But more deeply, this is a powerful tool for looking at questions of the utmost importance.

I started working with DDRS data (that's the Gale dataset), but I found it was just too patchy. My main problem was that I didn't have automatic indexing systems, so I had to hand-search and hand-code all my documents. This gave me a really small set-space of documents to work with (my thesis dataset is only 14 K pages), so I had to find the highest quality database I could. I primarily used CIA-FOIA and looked at a restricted set of top-level intelligence estimates.

Unfortunately, I had to vet my results with CIA and... well they gave me permission to publish my thesis, but then they 'closed' access to most of my documents. So this archive is no longer available for this kind of research. Um... sorry about that, guys!

75-76? gosh, Vietnam, Cambodia, Chile, Africa, Argentina, East Timor...

This is, indeed, a serious problem with big-scale declassified data. When you find something really big, there are always many possible explanations... and there's always the possibility that there's no single explanation, but a group of things coming together.

If you think about it a bit, you'll quickly see that any 'released' data ('declassified' in US terms) is the top layer of four different layers of structure:

1. Released Documents (the stuff we get to see)
2. Underlying data and sources (the data/ideas that went in to the documents)
3. Institutional structures (the organisations/people who wrote the documents)
4. Events in the world

It's really tempting, when starting out in this work, to spot a pattern in the released documents and go "oh, a big thing in the document data (layer 1)! something must have changed in the world (layer 4)!". But, when you investigate further, you often find that it was a change in the organisations that produced the documents (layer 3), or the way they handled data (layer 2), or whatever. This means that you spend most of your time using the document data to work out what happened in the underlying data and the institutional history. Eventually, of course, the goal is to kind of filter out those effects and learn about the Important Secret World Events in layer 4, but this is really tough.

Once you've spent a while looking at layers 2 and 3, though, you start to realise that those layers are really interesting in their own right. I've become fascinated by the way CIA analysts use evidence in their writing. They don't know it, but many intelligence debates of recent years have been based on the intelligence environment way back in the founding years of the CIA, c. 1947-53.

And companies expose interesting dirty laundry through their activities too, like reports, tax filings, etc., especially banks.

There was a proposal, a few years ago, to use similar methods to go after war criminals in Guatemala. Unfortunately, that proposal went nowhere for prosaic reasons (e.g. bureaucracy, I was really ill at the time, etc.).

they'll probably never get the evidence necessary to confirm their hypotheses

On the one hand, this is not so bad: lots of historians (especially those working in the medieval and ancient world) never get the evidence to confirm their hypotheses. In history, you take your findings to a place where you can definitely say "this possibly happened". On the other hand, it's a huge problem. While people accept that kind of ambiguity in medieval history, in modern history they'll often say "wait, you can't say for sure? Then you have no findings!!" This is an unproven methodology in the eyes of the historical community, and they've proved slow to accept it as legitimate.

Why do I keep on then? Ultimately, this kind of research will direct the work of other researchers. If we find something Big and Inexplicable, then that invites investigation through conventional historical research (targeted FOIA requests, oral history efforts, and so on). Furthermore, as this work is inherently longitudinal, it gives us a chance to create a framework upon which to hang detailed historical narratives. You need to know where your work fits into a big picture? Look at the big data story!

This and a previous mention of getting access to commercial archives surprised me

You're quite correct that the documents themselves are not subject to copyright. However, the commercial archives are perfectly within their rights to collect and republish the documents. They provide images of the text, rough textual transcriptions, and search functions that allow you to find the documents you want. DDRS also provides rough abstracts of the documents, but these are pretty unreliable, and I don't use them. If you want to, I suppose, you could put those same document images up on your own website, as they're not copyright-protected. It seems like rather a lot of work, though.
posted by Dreadnought at 8:12 AM on April 28, 2015

