Mining the Mother of all Data Dumps
We now have a relatively massive haul of digital data from the OBL strike. Several forensic toolkits are available, both commercial and open-source. Best practices include inventorying all the sources, cloning the sources so as not to damage pristine data, recovering any partial or damaged content, making the cloned sources read-only, adhering to legally admissible tool standards, and documenting everything. There is an excellent source titled Digital Forensics and Born-Digital Content from the Council on Library and Information Resources [pdf, Resource Shelf]. But what to do next*?
I’d immediately parse the data, looking for anything that resembles encrypted text strings, URLs, logins, or passwords, and then access, subpoena, compromise, and archive any sites or services mentioned, adding that information to the digital warehouse. The Anonymous hack of HBGary is a well-documented narrative of the process.
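A minimal sketch of that first parsing pass, assuming the recovered content has already been converted to plain text. The regular expressions are deliberately loose illustrations of my own, not validated forensic patterns, and the sample string is made up; spotting encrypted strings or passwords would need additional heuristics not shown here.

```python
import re
from collections import defaultdict

# Loose, illustrative patterns (assumptions): for triage, a false positive
# is cheaper than a miss, so none of these validate what they match.
PATTERNS = {
    "url": re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def extract_indicators(text):
    """Return a dict mapping indicator type to a sorted list of unique matches."""
    found = defaultdict(set)
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            # Trim punctuation that commonly trails a match in running prose.
            found[kind].add(match.rstrip(".,;:)"))
    return {kind: sorted(values) for kind, values in found.items()}

# Made-up sample text using reserved example domains and addresses.
sample = "Login at http://example.com/mail, contact user@example.org from 192.0.2.10."
indicators = extract_indicators(sample)
```

Each hit would then be queued for follow-up and archived alongside the file it came from.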
I would then index the data contextually and semantically, looking for date and time stamps, languages used, file types, bank accounts, email addresses, IP addresses, place names, person names, indices from the 9/11 Commission (both published and unpublished), known keywords (targets, weapons systems, known methods, etc.), and certainly others. It would also be useful to examine machine-generated data, such as access and activity logs, as well as the registry for machine- and user-specific data.
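To make the indexing step concrete, here is a small sketch of a keyword index, assuming each recovered file has been reduced to a text string keyed by a document id. The file names and contents are hypothetical, and a real system would add the contextual fields listed above (dates, languages, file types) instead of bare tokens.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical documents, for illustration only.
docs = {
    "note1.txt": "Wire transfer to account 12345 on 2001-08-14",
    "note2.txt": "Meeting moved; transfer postponed",
}
index = build_index(docs)
hits = search(index, "wire transfer")
```

Here `hits` contains only `note1.txt`, since that is the one document with both words.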
My suggestion would be to centrally locate the source data, then index it and slap a front end on it (see AOL data dump, previously). I’d also apply analytics to the front end to see what the crowd was looking for, and optionally aggregate and share that data (with some careful thought as to designing a system to avoid a Private Manning-type scenario), creating an internal honeypot for capturing analysts’ interests and ideas. The dataset is likely not large enough for true data mining (previously), but Social Network Analysis (previously) could still be employed beyond searching for keywords. I’d also look for patterns of activity (and gaps), and compare that with known plots to identify patterns. Most importantly, I’d work backwards, as old data is likely stale as far as actionable intelligence. I’d further suspect that any data from the 9/11 period would be beyond priceless.
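Two of those ideas can be sketched in a few lines, assuming message metadata has already been extracted as (sender, recipient) pairs and activity as calendar dates. The function names, addresses, and dates are invented for illustration; counting distinct correspondents is only the simplest Social Network Analysis measure, not a full analysis.

```python
from collections import defaultdict
from datetime import date

def contact_degrees(messages):
    """Count distinct correspondents per address from (sender, recipient) pairs."""
    neighbors = defaultdict(set)
    for sender, recipient in messages:
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)
    return {person: len(others) for person, others in neighbors.items()}

def activity_gaps(dates, min_days):
    """Return (start, end) pairs where consecutive activity dates are at least min_days apart."""
    ordered = sorted(set(dates))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if (b - a).days >= min_days]

# Made-up metadata using reserved example addresses.
messages = [
    ("a@example.org", "b@example.org"),
    ("a@example.org", "c@example.org"),
    ("b@example.org", "a@example.org"),
]
degrees = contact_degrees(messages)

activity = [date(2011, 1, 1), date(2011, 1, 3), date(2011, 2, 20)]
gaps = activity_gaps(activity, min_days=30)
```

The most connected addresses and the quiet periods are the places I’d compare first against the timelines of known plots.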
Most technology enthusiasts are protective of their privacy and skeptical of data mining. This appears to be a situation where this technology can be used for good.
*Disclaimer – everything after this point assumes access is limited to secured machines, accessed by authorized users of the United States military, law enforcement, and employees (and contractors).