Archive Team
September 15, 2009 10:45 PM

Archive Team: We are going to rescue your shit. (previously)
posted by stbalbach (43 comments total) 21 users marked this as a favorite

It's perhaps worth noting that one of the minds behind Archiveteam is Jason Scott, who has a pretty awesome blog and an arguably even awesomer collection of BBS-era textfiles. (If you spent any significant amount of time on BBSes and haven't been to textfiles.com before, be prepared to lose the next few hours. It's a real trip down memory lane.)

I suspect, although I don't know for sure, that his interest in textfiles and the preservation of pre-Internet netculture led to Archiveteam's current activities.

Their pages on the basics of backups—both why and a little bit of how—are pretty good, and I've referred people to them occasionally. (Their suggestion for a first step: get a USB keychain drive, copy your documents, financial data, and photos onto it immediately, then store it away from the computer. If every computer user took 5 minutes to do this a few times a year, I'd imagine a whole lot of stuff wouldn't get lost in all-too-frequent "crashes".)
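For anyone who'd rather script that first step than drag folders around, here is a minimal sketch in Python. The function names (`sha256_of`, `backup`) and the idea of verifying each copy with a checksum are my own assumptions, not something from the Archiveteam pages; the paths you pass in would be your documents folder and wherever the USB drive is mounted.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(source: Path, destination: Path) -> list[Path]:
    """Copy the source tree into destination and verify every file.

    Returns the source files whose copy is missing or differs
    (an empty list means the backup verified cleanly).
    """
    # dirs_exist_ok requires Python 3.8+; it lets you re-run the backup
    # onto the same drive a few months later.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    mismatched = []
    for src_file in source.rglob("*"):
        if src_file.is_file():
            copy = destination / src_file.relative_to(source)
            if not copy.is_file() or sha256_of(copy) != sha256_of(src_file):
                mismatched.append(src_file)
    return mismatched
```

Calling `backup(Path("Documents"), Path("/media/usb/Documents"))` (both paths hypothetical) and checking that the result is empty is about as much ceremony as the five-minute version needs.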
posted by Kadin2048 at 11:17 PM on September 15, 2009 [3 favorites]

I poked around and there's not much on the website right now, but I like this website's practical perspective, particularly with formats. I have a few password-protected WordPerfect 6 documents from the mid 1990s that I had a hell of a time trying to open last year even though I knew the password, and recently I had to write software from scratch to rescue all my Commodore 64 and 128 word processing stuff into TXT and even convert my numerous C64 BASIC programs into TXT source code (surprisingly I couldn't find any software to do this, even the latter; maybe I didn't Google it right).

Lesson learned: I now write strictly in RTF for all my letters, writings, and so forth that aren't dependent on a page layout (and if so, I do PDF). I am not going to allow myself to find in 2034 that I'm locked out of a Microsoft Word 2003 document. Sounds like a piece of cake now, but we're all going to churn through so much software and technology over the next 20 years that I'm not going to bank on any support for cryptic or closed-source format. The newer stuff will be supported, sure, but the older stuff will gradually get forgotten.

I guess I'm kind of compulsive about saving stuff, but I like to see where I've been and where I'm going, and I think my kid might like having all that stuff someday.

Speaking about lost websites, I wish someone had ripped JumpTheShark.com before TV Guide wrecked it. There was some great stuff in those TV show user comments. Archive.org was sensible enough to archive it though much of it has been flagged as blocked by the site owner, but last spring I found enough holes to reconstitute much of the old site. It seems like such a travesty for the old stuff to be deleted and I've got a mind to put it online myself, but I bet TV Guide would be all over it like white on rice.
posted by crapmatic at 11:24 PM on September 15, 2009 [4 favorites]

crapmatic: basic to ascii for c64. The functions / statements (GOSUB, PRINT, IF) are stored as tokens, as you've probably found out. I did this a while ago, and I completely forgot I had to do this step. Crazy. I think I used Star Commander, but that migration was ages ago. It's in safe, safe, ascii land now. BTW, I had to use an LPT1 printer port and a paper clip to transfer everything from an ancient C64 disk drive! AWESOME.
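Since the token thing trips people up, here is a rough Python sketch of the detokenizing step. It assumes a standard C64 .PRG dump (a 2-byte load address, then lines made of a 2-byte next-line pointer, a 2-byte line number, and tokenized text ending in a zero byte). The `TOKENS` table is deliberately partial, just common BASIC V2 keywords, and unknown tokens are printed as `{$XX}` placeholders; the function name `detokenize` is made up for illustration. PETSCII is treated as if it were plain ASCII, which is only roughly true.

```python
import struct

# Partial C64 BASIC V2 token table (assumption: common keywords only).
TOKENS = {
    0x80: "END", 0x81: "FOR", 0x82: "NEXT", 0x89: "GOTO", 0x8B: "IF",
    0x8D: "GOSUB", 0x8E: "RETURN", 0x8F: "REM", 0x99: "PRINT",
    0xA4: "TO", 0xA7: "THEN", 0xB2: "=",
}
REM_TOKEN = 0x8F
QUOTE = 0x22


def detokenize(prg: bytes) -> list[str]:
    """Turn a tokenized BASIC .PRG image into a list of text lines."""
    pos = 2  # skip the 2-byte load address
    lines = []
    while pos + 4 <= len(prg):
        next_ptr, line_no = struct.unpack_from("<HH", prg, pos)
        if next_ptr == 0:  # a zero pointer marks the end of the program
            break
        pos += 4
        text = []
        in_quotes = False
        in_rem = False
        while pos < len(prg) and prg[pos] != 0:
            byte = prg[pos]
            if byte == QUOTE:
                in_quotes = not in_quotes
            # Bytes >= 0x80 are keywords only outside strings and REMs.
            if byte >= 0x80 and not in_quotes and not in_rem:
                text.append(TOKENS.get(byte, f"{{${byte:02X}}}"))
                if byte == REM_TOKEN:
                    in_rem = True
            else:
                text.append(chr(byte))
            pos += 1
        pos += 1  # skip the zero terminator of this line
        lines.append(f"{line_no} {''.join(text)}")
    return lines
```

Feed it the raw bytes of a .PRG file (`detokenize(open(name, "rb").read())`) and join the result with newlines to get something you can keep as TXT.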
posted by sleslie at 11:39 PM on September 15, 2009

Also if anyone out there has an archived version(s) of word.com, CALL ME or EMAIL ME. That site needs to be out there again.
posted by sleslie at 11:42 PM on September 15, 2009

Kadin2048: I suspect, although I don't know for sure, that [Jason Scott's] interest in textfiles and the preservation of pre-Internet netculture led to Archiveteam's current activities.

There's a couple blog posts where Jason Scott talks directly about the need for something like the Archive Team.

Eviction, or the Coming Datapocalypse. From December 2008. Describes how AOL Hometown was shut down. He quotes some desperate commenters, e.g.:

My question is like those above. Is there anyway still to retrieve my journals and homepages? I tried before the deadline but nothing happened. These are my memories. Things I wanted to remember about my kids. And when I tried to access them before the deadline I was unable to. Otherwise I would have printed it all out. Please help.

And goes on to say:

We’re talking about terabytes, terabytes of data, of hundreds of thousands of man-hours of work, crafted by people, an anthropological bonanza and a critical part of online history, wiped out because someone had to show that they were cutting costs this quarter.

It’s an eviction; a mass eviction that happened under our noses and we let it happen.

I’ve been evicted before – I was kicked out of a boarding house I used to live in between 1992 and 1997. Eviction laws were in place and I was sent notification after notification, shoved into my mailbox, left under my door, explaining my rights and how to appeal and how much time I had. It was done during the summer months, because winter would be a hardship. It was handled coldly, nastily, but it was done according to law, and luckily, I had a place to move onto. (They were closing up the building to turn it into professional space, which it is to this day.)

When we evict people from their webpages, fuck all is required.

... I’m saying that, like a real eviction, there should be practices in place. When you open your doors to hosting user content, you should have rules in action that, unless it’s a complete and total fire sale and you have no hope of even staying open that long, then you should be required, yes by law, assholes, to make the data available to customers for an extended period of time.

Followup: Datacalypso!

So what do we do?

Well, let me give a personal example.

Through one of the weblogs I browse, I found out a website called podango.com (a podcast hosting site) was going down. The word had gone out to subscribers of the service that the company was going to be going through some rough times, much as a hedgehog being thrown into a blender was in for some tough times, and maybe you should get your shit off our servers immediately. In line with what I’ve been talking about, they gave everyone five days at the end of December 2008 to do it. Five days. Five days versus four weeks; what’s the goddamn difference? Technically savvy people given less than a week, over Christmas, to figure out how their data was going to be transferred, to figure out how to get RSS feeds transferred. Some people came back from holidays and found all their shit gone. Didn’t check e-mail during Christmas? Sorry, podcaster!

So what did I do?

I fucking downloaded it.

... What you’re looking at is about 70 gigabytes of data from podango.com, lock stock and barrel. Over 4000 distinct episodes of podcasts. It took my machine five solid days to do it, but I downloaded all of that lame site. Do I have a favorite podcast on there? No. Did I know someone with a podcast on there? No.

I did it because I had the means (disk space), the motive (the sense of history and the recognition that this was historically relevant work representing thousands of hours) and the opportunity (a fast connection and five days before they were to die). A back-of-envelope calculation tells me I just rescued 41 days of podcast, along with all relevantly hosted images, show descriptions and XML data.

This one will pay back immediately; people are already contacting me, profusely thanking me.

So what am I saying here?

We need the A-Team.

And finally:

Stand back, we're archivists.
posted by russilwvong at 12:06 AM on September 16, 2009 [11 favorites]

This is where open source is a savior. As long as someone somewhere still has a copy of the openoffice.org source, and it can open your format (chances are fairly good that it can), openoffice can be ported to your current hardware and OS and used to access your documents. No matter how many years have passed, if the source code is readable and your document is readable, it will be possible to access your document.

I am amazed anyone with any option ever trusted a closed source program and its proprietary format for their personal documents (I am probably only surprised because "use open source and open formats or you don't own your data" was one of the first computer usage lessons I ever got from the Linux nerds that taught me to use a computer).
posted by idiopath at 12:09 AM on September 16, 2009 [1 favorite]

From their "deathwatch:" Microsoft Encarta is going away? I mean, I guess that day was inevitable, but I was with about 15 to 20 other people in the cafe of building 10 (I think) when they announced to our team that they were calling it "Encarta" and not "Micropedia" or whatever internal name it was at the time, to underwhelming response from us. Man, I must be getting old.
posted by maxwelton at 12:24 AM on September 16, 2009 [1 favorite]

Vaguely reminds me of The Memory Hole, except not so specialized.
posted by knave at 12:46 AM on September 16, 2009

Metafilter: "I fucking downloaded it."
posted by Mitheral at 12:54 AM on September 16, 2009 [1 favorite]

EVERYONE IS ALL HOT AND BOTHERED BY GEOCITIES CLOSING.

Friggin' necrophiliacs.
posted by twoleftfeet at 1:13 AM on September 16, 2009

I know someone who was one of the anonymous "nominators" for the MacArthur Foundation Prize for a number of years. Every year since 2005 (the year I got his excellent BBS documentary) I have urged him to nominate Jason Scott. The whole process is deliberately non-transparent so I have no idea if he even submitted the nomination, but hey, maybe this year.
posted by atrazine at 2:08 AM on September 16, 2009 [1 favorite]

I wonder how much paper it would take to document the Microsoft Word formats in enough detail to be able to read them?
You know, print out the specs on acid-free paper; even in a normal library (i.e. without a dry, cold vault with a neutral atmosphere) that could last hundreds of years at least.

To answer my own question, apparently office 2007 formats are zipped xml. If you have office 2007, you can see this by renaming an office document to .zip and opening it. You can then browse the file structure. Neat!
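You don't even need to rename anything if you have Python handy. A tiny sketch, where `list_members` is just a name I picked: the standard `zipfile` module will open the document directly and show you the file structure inside.

```python
import zipfile


def list_members(path_or_file) -> list[str]:
    """Return the names of the files stored inside an Office 2007 document."""
    # The container is an ordinary ZIP archive, so no renaming is required.
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.namelist()
```

For a Word document the main text typically lives in an entry called `word/document.xml`.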
posted by atrazine at 2:29 AM on September 16, 2009

The Deathwatch page makes for nostalgic, melancholy reading.
posted by JHarris at 2:29 AM on September 16, 2009

Plus, he's Metafilter's Own™ jscott.
posted by armage at 2:44 AM on September 16, 2009

The responsibility of handling people's data scares me away from setting up all sorts of projects. I wish it scared large companies like Yahoo! a bit more.
posted by malevolent at 2:45 AM on September 16, 2009

To answer my own question, apparently office 2007 formats are zipped xml

Earlier formats are essentially dumps of Word's in-memory data structures, which means the only comprehensive documentation for them is the Microsoft Word source code.
posted by cillit bang at 3:01 AM on September 16, 2009

A couple of hours after reading this thread, I revisited the interactive business planner at Business Canada. I've used this to write three or four plans over the last decade, but visit very rarely. Today I see they are closing the application at the end of the year.
They have my current email address but have yet to notify me that they will be deleting those plans. It was a not so nice illustration of this post.
posted by bystander at 4:00 AM on September 16, 2009

Followup: Datacalypso!

The response to Slashdot critics at the beginning of that entry makes for delicious reading. Bravo, jscott!
posted by rory at 4:18 AM on September 16, 2009

Kinda surprised that they have no opinion on Mozy.

Also, this: Archive.org seems stable at the moment but its 2 petabytes of data aren't mirrored anywhere else, the code for their system isn't open source and generally they're a single point of failure for a large amount of the web's history.
posted by jbickers at 4:28 AM on September 16, 2009

I hadn't heard Encarta was going under, either. I played a shit-ton of MindMaze during my formative years, because it was one of the only "games" (along with Oregon Trail) we were allowed to play at the school library. If you solve the riddle, I'll strum you a tune!
posted by danb at 5:25 AM on September 16, 2009

I am surprised nobody brought this up yet: what about this site? Is metafilter mirrored? Is it backed up anywhere? If mathowie gives up on the rat race and wanders the world like Jules from Pulp Fiction did, will all the content we have posted here just be gone for good?
posted by idiopath at 5:37 AM on September 16, 2009 [1 favorite]

I always wondered what happened to the Electric Sheep comics (Apocamon was particularly funny) but that probably went bust before the A-team came on the scene... anybody who knows please memail me... same as with word.com, that was awesome...
posted by yoHighness at 5:50 AM on September 16, 2009 [1 favorite]

The "use a $20 USB drive for backups, 'cause that will be enough unless you're a pro" is a bit naive given the abundance of 10+ megapixel cameras these days. I like the idea in theory but I'd need a 50 gb drive just to archive the pictures I have right now. And that's just the photos, it doesn't include scanned copies of negatives or older photos or generated graphics. Throw in the need to archive written docs and the files that go with them and I need upwards of 100 gb storage. Today. Tomorrow I'll need more because I'll likely take another photo of the kid tonight and perhaps build one more graph for the publication I'm working on... but since I generate written work for a living I suppose I'm a pro in that respect.

HOWEVER it's a damn good idea to have some workable method of backup, and having an external drive (available in multi-terabyte sizes these days) is a damn good idea. Sadly most people don't learn until they've lost something important, when it's too late. My neighbor is a freelance photographer, and she lost nearly all of the first few months of baby pictures of her 2-year old son thanks to a drive failure. Somehow she managed to save photos for her clients but never backed up her own. I helped her set up an older generic Intel box as a network file server just to give her one more chance at saving things in the future.

As for proprietary formats... ha. Working in science it's basically an expectation. The number of data collection systems I've used that collect data in an open format can be counted easily: 1. In 10+ years of research I have used one system that saved raw data as a plain text file by default. Everything else is designed to save data in a compressed, undocumented format that can (often painfully) be exported to a different format if you have the right software on hand. My old lab kept alive a creaking 486 simply because it was the one system old enough to run the export commands for one data format conversion (and even after that, it took manual conversion in Excel to rebuild the resulting file into a format we could export to something the new system could handle).

I do have an active interest in maintaining my things in a format that can be read in the future. But I am not in a branch of sciences that regularly uses anything except Word to generate written documents. (Yes, you engineers and mathematicians are so cool with your TeX, you can smile smugly now, secure in the future-proofness of your data formats.) So what do the minds of MeFi recommend? How does one dump years worth of old Word docs into a more future-proof format? Should I just import them into OpenOffice and then save them as ooxml instead, given that I know fuck-all about TeX and have no real interest in learning it unless I have to?

(And to go completely off-topic - I know it goes against the general principle of TeX but if it's such a good format, why the hell doesn't OOo act as a WYSIWYG editor that builds the TeX code behind the scenes without me having to understand what all the flags mean? I know in theory I am not supposed to care about anything except the contents when I build the document, but christ it has to be typeset eventually if I'm going to print it, so shouldn't I be able to tell the damn thing what I mean when I flag something as a heading or whatnot? Some of us don't mind being able to specify the formatting as we work. It isn't as evil a concept as most TeX proponents seem to think it is. And at least 95% of my colleagues would look at a document containing lots of backslashes indicating markup, say "WTF" and delete it all, or just ask me to re-send the thing in Word format without all the "weird stuff" in there to distract them. I understand the concept - I've struggled to fix some pretty fucked-up Word doc formatting from colleagues, so I know how bad WYSIWYG can be, and I prefer using a text editor for web pages so I'm sure I could learn the TeX code if I needed to - but when it comes to writing I just don't think that way and don't know many other people who do.)
posted by caution live frogs at 6:04 AM on September 16, 2009

yoHighness: http://www.electricsheepcomix.com/apocamon/. It reopened last month.
posted by mediareport at 6:10 AM on September 16, 2009 [3 favorites]

caution live frogs: "As for proprietary formats... ha. Working in science it's basically an expectation."

To paraphrase Woody Guthrie, if one scientist asks for an easy way to export portable longterm usable data, they will figure you are a whiny user, and ignore it; if two of you do it, they'll figure you're some kind of fringe kooks, but if 5,000 people do it, and mention they may just switch to another vendor if they make it easier to archive and recover data, then you become a business plan for the product's future.

You don't own data in a proprietary format, you are leasing it from the vendor who makes software that reads that format. Your boss, and your boss's boss need to understand this, and the full implications of this. A vendor who will not make a convenient way to export data to a common and easy to use format is trying to blackmail you, to make it as hard as possible for you to switch to another vendor using manipulative and unethical means.

Modern versions of Word save files in an XML format that includes plain text. You can go through and manually rename copies of files and make sure the data is there and somewhat usable in notepad.
posted by idiopath at 6:25 AM on September 16, 2009 [1 favorite]

idiopath: "You can go through and manually rename copies of files and make sure the data is there and somewhat usable in notepad."

Self-correction: you can rename the file to a .zip and extract and read the contents in notepad.
posted by idiopath at 6:31 AM on September 16, 2009

I'm saddened to learn from the "Deathwatch" page that totse.com is gone. It was not only one of the oldest sites on the web, it in fact predated the web as a dialup BBS named "Temple of the Screaming Electron" (hence totse). The reason given, though, is after 20 years the guy behind it is sick of the project. Totally understandable.
posted by DecemberBoy at 6:40 AM on September 16, 2009

cillit bang: "Earlier formats are essentially dumps of Word's in-memory data structures, which means the only comprehensive documentation for them is the Microsoft Word source code."

The XML format is arguably somewhat better than the old binary format, but reading it is still not a lot of fun. I guess there's an advantage in that you can use standard XML processing tools, but if all you have is a Word .docx file and you want to extract the content from it, you're going to end up doing a significant amount of reverse-engineering of their (very complex) XML structure in order to get data out in a way that preserves even relatively basic formatting.
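To make that concrete, here's about the least you can do with standard tools, sketched in Python. It assumes the main text sits in `word/document.xml` and uses the WordprocessingML namespace for paragraphs (`w:p`) and text runs (`w:t`); the function name `docx_paragraphs` is mine. Note what it throws away: bold, headings, tables, footnotes, all of it. Getting those back is where the real work starts.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file) -> list[str]:
    """Return the plain text of each paragraph in a .docx, formatting dropped."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across runs; join the pieces back up.
        pieces = [node.text or "" for node in paragraph.iter(W + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```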

Ultimately, what will make Word XML files usable into the future isn't their use of XML but just how widely used and common Word is, and thus how many libraries (especially OSS ones) exist to read it. Using XML may make writing those libraries somewhat easier, but I'm not really sure that Word XML is going to be any more or less readable than the old binary .doc format, because it had already been reverse-engineered to death by the time the XML format was developed.

I'm not totally crapping on it—it does have some advantages, particularly if you want to be able to incorporate Word documents in a workflow or process them with batch tools, strip out hidden data, etc.—but just trying to underline that not all XML formats are created equal. Even though Office Open XML and TEI (just as an example of a simple XML-based standard) are both "XML formats," they are miles apart in terms of ease of processing and appropriateness for long-term archiving. (And frankly TEI is fairly complex in its own right, although there I don't get the impression that it's much more complex than it needs to be to get the job done.)

A lot of people just see "XML" and immediately assume that everything is OK; it's not. You can build a really horrible, obscure format on top of XML, in much the same way that you can build a really horrible, obscure format on top of ASCII, or lots of other open standards. Once you get beyond simple formats that are obviously structured and can honestly be called 'self-documenting', a lot boils down to documentation. An XML format that requires a thousand pages to definitively document isn't much better than a binary format that requires the same number of pages, IMO.
posted by Kadin2048 at 6:47 AM on September 16, 2009 [2 favorites]

Archive.org seems stable at the moment but its 2 petabytes of data aren't mirrored anywhere else, the code for their system isn't open source and generally they're a single point of failure for a large amount of the web's history.

Is this true? I thought they have a number of mirrors around the world including at the Library of Alexandria in Egypt. Also the software is built with open source software (like Apache Lucene) and the data is extremely open - one could download the entire Internet Archive, and they even provide instructions how to do it: bulk download (note: this script would need mods to download the entire archive).
posted by stbalbach at 6:56 AM on September 16, 2009 [1 favorite]

Kinda surprised that they have no opinion on Mozy.

Seeing as Mozy is a personal backup service, I'm not sure they need to have one. It's a bit outside the scope of Archive Team, since the files on Mozy are not publicly accessible, in contrast with AOL Hometown and Geocities.

At best, they could give advice: "Backups can go bad. So can backup providers. Always keep two sets of backups in case the worst happens."
posted by ymgve at 7:00 AM on September 16, 2009

mediareport, thanks for making my day!
posted by yoHighness at 7:29 AM on September 16, 2009

They're also scraping URL shortening sites in some effort of staving off link rot, though I'm not sure what will happen to these lists of shortened URLs and destination maps.
posted by filthy light thief at 7:33 AM on September 16, 2009

"I am surprised nobody brought this up yet: what about this site? Is metafilter mirrored? Is it backed up anywhere? If mathowie gives up on the rat race and wanders the world like Jules from Pulp Fiction did, will all the content we have posted here just be gone for good?"

It's a serious problem all right. Matt has previously said he's got a transition plan if he gets hit by a bus (which is good, think of the epic obit thread) but just as worrisome is MetaFilter being shut down because he can't afford it or decides one day it isn't worth the bother. Like the closing of the Home of the Underdogs, that would be a pretty sad day.

"The 'use a $20 USB drive for backups, 'cause that will be enough unless you're a pro' is a bit naive given the abundance of 10+ megapixel cameras these days. I like the idea in theory but I'd need a 50 gb drive just to archive the pictures I have right now."

You might be surprised at the number of people, even with a computer with built in card reader, whose only copy of their pictures is on the memory card for their camera. They only take a few dozen pictures a year and can get them printed off the card so why bother backing them up. Gives me the willies every time I encounter it. Years of pictures all crammed onto a single card.

That being said, while $20 will get you 8GB of flash $65 will get you 320GB of disk in an USB enclosure. And 8GB is probably more than 95% of people need to back up all their generated text documents.
posted by Mitheral at 8:00 AM on September 16, 2009

It's a serious problem all right. Matt has previously said he's got a transition plan if he gets hit by a bus (which is good, think of the epic obit thread) but just as worrisome is MetaFilter being shut down because he can't afford it or decides one day it isn't worth the bother. Like the closing of the Home of the Underdogs, that would be a pretty sad day.

At least the robots.txt file of MetaFilter allows archive.org to crawl all the posts. Apparently user pages, favorited info and some other things are barred from crawling, which, while sad from an archival standpoint, is understandable.
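If you want to check this sort of thing yourself, Python's standard library has a robots.txt parser. A small sketch: the rules in `SAMPLE_RULES` are hypothetical and only illustrate the idea (they are not MetaFilter's actual file), and the crawler name and URLs in the usage line are placeholders too.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not any real site's robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /user/
Disallow: /favorites/
"""


def allowed_paths(rules: str, agent: str, urls: list[str]) -> dict[str, bool]:
    """Report, per URL, whether the given crawler may fetch it under the rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}
```

Something like `allowed_paths(SAMPLE_RULES, "example-archiver", ["https://example.com/85000/some-thread", "https://example.com/user/1"])` tells you which pages an archiving crawler is being asked to skip.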
posted by ymgve at 8:51 AM on September 16, 2009

jscott, get the hell on here and comment so we can shower you with + favorites and praise.
posted by cavalier at 8:59 AM on September 16, 2009

The archiveteam.org site seems to be down. Anyone have a backup?
posted by cillit bang at 9:16 AM on September 16, 2009 [1 favorite]

including at the Library of Alexandria in Egypt

So we're going to protect the knowledge of civilization from destruction by collecting it at the Library of Alexandria? Where are the other backups? Encased in Buddhist statues in Afghanistan?
posted by roystgnr at 9:31 AM on September 16, 2009 [2 favorites]

idiopath - good point, but unless I'm totally off-base here you're paraphrasing Arlo, not his dad. My archived vinyl copy of Alice's Restaurant can be used to verify this claim if you happen to have a "phonograph player" (I think that's the term, I don't know this old technology is all just bulky and confusing to me I guess, why didn't they just record it on mp3?)
posted by caution live frogs at 10:03 AM on September 16, 2009

It's perhaps worth noting that one of the minds behind Archiveteam is Jason Scott, who has a pretty awesome blog

It's also perhaps worth noting that one of the minds behind Archiveteam is Jason Scott, who has three pretty goddamn cute cats, one of whom has the cutest of fake cat twitter feeds. Which is to say I just put it together that this is that Jason Scott. HUZZAH SOCKS ARMY, cat haters plz scroll moar.
posted by clavicle at 10:18 AM on September 16, 2009

We are going to rescue those who are going to rescue your shit!

Yo dawg, I heard you like rescuing shit...
posted by Ogre Lawless at 10:32 AM on September 16, 2009

Uhm ... some days this is a broken record for me, but PLEASE:

THE BACK UP IS NOT AN ARCHIVE!

Rinse. Repeat.

The backup is just that: a backup. And you should have multiple copies of them in physically disparate locations, but they are NOT AN ARCHIVE!

BTW, I applaud the attention to the problem in this thread, but I do hope that some others of us who work with this (digital archives) on a regular basis will chime in here ...
posted by aldus_manutius at 11:23 AM on September 16, 2009

Salutations, maniacs.

I'm up here in Toronto doing a presentation on Sockington (just finished a little while ago), and then saw my name pop up on one of my google rss feeds. Fun discussion!

Most of the usual arguments, questions and then answers have shown up in here, not really needing me.

Yes, the point of the "what do I do" page immediately responding with "get your own house in order, buddy" is that a lot of people don't even look at their own data with the gravity it deserves. And yes, there's folks for whom a USB stick wouldn't begin to approach their amount of precious shit, but for the majority, they have so few. So thanks for having the discussion.

Aldus, sit down, please, next to the guy who points out it's pronounced "trekkers". What matters is crap is saved - we'll work on properly curating it later. The single hardest part of history is to be there when it happens. I'd rather someone save stuff away so others can come later and do the niceties, but at this point we're losing incredible amounts of stuff every week from shutdowns and poor engineering. Let's focus on that, and "uhm" later.

Thanks again, everyone.
posted by jscott at 1:23 PM on September 16, 2009 [4 favorites]

This thread has been archived and is closed to new comments
