
Why Your Digital Data Could One Day Disappear
February 17, 2002 5:54 PM

Why Your Digital Data Could One Day Disappear. HBS Working Knowledge has a story (actually an excerpt of Dark Ages II: When the Digital Data Die, by Bryan Bergeron) that says data stored on discs and other forms of computer storage is anything but permanent. Not only are the disks themselves a problem (they last 5-20 years), but the computers that read and write them are an added one. Tried opening a Commodore 64 file lately, or a 5 ¼ inch disc?
posted by Blake (39 comments total)

 
You mean one day I'll be burning my CDs to a single DVD, and one day burn my DVDs to a single [whatever comes next]? Huh, that might take a whole afternoon.
posted by fleener at 6:34 PM on February 17, 2002


That's how people make money. Vinyl - Tapes - CDs - MDs - CDRs - DVDs - DVDRs.

I don't expect my PC to be able to load old ZX Spectrum programs from audio tape, although I am sure someone out there has tried and probably succeeded.
posted by riffola at 6:37 PM on February 17, 2002


I don't see where this guy is coming from. The average lifespan of a computer before it gets replaced is much, much lower than how long drives last. If you have important data you should be making backups; if you aren't, you only have yourself to blame.

As far as reading from vintage computer disks goes, are we going to start blaming our parents for not storing our baby pictures at 0 degrees kelvin in some vault? If you care about the data you would have migrated it by now to another medium.
posted by skallas at 6:37 PM on February 17, 2002


Emulation is the answer.
posted by inpHilltr8r at 6:48 PM on February 17, 2002


0 degrees kelvin is also impossible, n'est-ce pas?
posted by Dark Messiah at 6:55 PM on February 17, 2002


Um. If you want to get data off a 5.25" disk, just go dig up a 5.25" disk drive. Not that hard.

Similarly, just because the programs are gone doesn't mean the data is. You can still read it, it just takes extra work.

The whole "digital data disappears" is BS from people who don't understand it.
posted by delmoi at 7:05 PM on February 17, 2002


I'm beginning to wish I hadn't built my house out of digital bricks.
posted by Neale at 7:28 PM on February 17, 2002


Um. If you want to get data off a 5.25" disk, just go dig up a 5.25" disk drive. Not that hard.

Au contraire. Maybe a 5.25" disk drive is around now, but what about five years from now, or ten? What about those media formats that nobody supports anymore? What happens when the outmoded technology breaks and no one can repair it?

This is an institutional problem more than it is a personal one. Yeah, you can probably convert all your data to new media in a day or two. But how about your local library? Or a university? It's expensive and time consuming, which tends to chase off institutions, which need stable and manageable archiving the most.

Don't give this short shrift. It's a major problem for every library and university in the country, not to mention various government agencies.
posted by briank at 7:35 PM on February 17, 2002


Seconding briank.

My librarian friends can't stop talking about this problem. Don't think 10 or 20 years, think 50 or 100 or 300 years. How do you store data reliably? Technology breaks, and the rate at which most digital storage mediums degrade is astonishingly bad. Microfiches are still considered reliable, definitely more so than storing potentially irreplaceable data on disk drives (would you?). Paper itself degrades at a high enough rate to be an alarming problem, and no clear (and cheap) solution exists as to how to store data, books, photos in a way that their existence is guaranteed hundreds of years from now and people in the future will know how to read them.
Clay tablets are starting to look pretty good.
posted by vacapinta at 7:52 PM on February 17, 2002


At least we aren't using magnetic tape anymore. Not only does it erase itself, by its very nature, but it stretches and curls, changing the timing and rendering it unusable. Untold quantities of NASA data have died a silent death this way.

Arguably, though, caching of internet data is a much bigger problem. Pre-97 net content is, historically, probably the most significant stuff the internet will ever produce. And yet very little remains even now, only five years in the future. The most complete, most open, most incredible datacloud ever created by humanity, and we're erasing it as fast as we can.
posted by Ptrin at 8:35 PM on February 17, 2002


vacapinta said: Microfiches are still considered reliable, definitely more so than storing potentially irreplaceable data on disk drives (would you?)

I would. One advantage of digital tech is that it can monitor and repair itself. I have had a hard drive fail over time, and regularly scheduled scandisks let me know that there was a problem and moved data from the sectors that were questionable. I replaced the drive when the errors grew too frequent. In order to know that paper or microfilm is degrading you have to examine it.
posted by srboisvert at 8:42 PM on February 17, 2002
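
As a rough illustration of the self-monitoring srboisvert describes: unlike paper, a digital archive can be checked for silent damage automatically. The sketch below is not what scandisk does (scandisk works at the filesystem level). It assumes a simpler file-level approach: record a SHA-256 digest for each file, then re-check the digests later. The function names file_digest, build_manifest, and find_damage are made up for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damage(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents no longer match."""
    damaged = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != expected:
            damaged.append(rel)
    return damaged
```

The idea would be to build the manifest when the archive is written, keep a copy of it with each backup, and run find_damage on a schedule. Any file it reports would then be restored from another copy, which only works if more than one copy exists.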


>Maybe a 5.25" disk drive is around now, but what about five years from now, or ten?

I have four GCR 5 1/4" drives and three MFM 5 1/4" drives that haven't been thrown out for a decade. Probably won't be thrown out till I need the drawer space.

I can read any 5 1/4" disk for a long time to come. All these drives work on a PC, and with three of each I'm not too worried about one or two breaking.

That and if a lowly hardware hoarder such as myself has the drives kicking, imagine what must be still sitting in forgotten warehouses...

BTW: My C64 disks from 1984 still read fine last year when I copied what I wanted from them and put it on CDR.

>Think 50 or 100 or 300 years. How do you store data reliably?

CD-R, of course. With over 200 years of lifetime, if the future generations can't holocube the information, or at least make another backup of it (digital data won't degrade through copies) then they don't deserve it anyhow.

I think the author of the article is way off base anyways. I quote:

"this scenario, dedicating a 30-cent floppy backup for the data, labeled and stored in a shoebox in an office desk drawer with other archives, seems reasonable. Devoting a $10 Zip disk or taking the time to burn a CD-ROM of the data seems a little unreasonable."

It takes longer to write out a full floppy on some computers than to burn a CD at 24x. The floppy costs more. The Zip Disk was dead from the start, and costs more than $10 anyways.

"Given these facts, the data deserve at least to be archived to several floppies, one stored at her home, one at her office, and another sent to a friend or relative. Assuming that the data could not be replaced, the cost and time involved in making several CD-ROMs seems very reasonable."

Uhhhh, several CDROMs for several floppies? Do you mean 5,000 floppies by several? I suppose it fits the definition, but it is just a little bit unreasonable.

Also, a decade old laptop doesn't have a drive bigger than a CD. Period. Copy the laptop hard drive to a CD. Cost? $0.50 and about 10 minutes of your time (not including the time it takes to transfer the information while you're in bed).

And trust me, Jane could simply re-write the thesis if she was this screwed. If she did it once, she'll be able to do it in a quarter of the time next time. And she'll buy a new laptop and learn from her mistake of using an unreliable POS. A lesson more valuable to the world than her thesis, I would say.

BTW: I archive everything. My entire life of non-downloaded (ie: "irreplaceable") items fits on a CD. Maybe two. I even have email from 1996-1997, and I found (and saved one or two of) old usenet posts from me from the same time. I love scaring people asking me for a copy of something they emailed me, searching it, and giving it to them.

So, how often do I have to backup? Well, I simply keep a directory on my fileserver under 650 MB and back it up when I feel like it (once a month or so). Once a year I delete it and start fresh.

Works for me, and if you think that's too hard to do, well, you probably haven't lost any data yet.

" For example, properly archiving 1,000 digital images to CD-ROM, including creating a standalone database indexed to image number with two keywords per image, might take $4 per image for digitizing, CD-ROM blanks, database software and about ten minutes per image for indexing. Thus, the project would entail $4,000 and four 40-hour weeks from someone who is familiar enough with the content to index it properly."

Speaking as someone who recently scanned in an entire 700 page book to take with me on vacation, if you skip all but basic indexing (by page, for example) the job will take 6 hours, cost $0.01 per image, and 1E-06 seconds per image for indexing. WTF do you want a database for? Are you crazy? If you are so desperate to get at the photos, you will put the effort in. Right now what you need is the data digitally, and a date and sequence number stamped on it. In the future you can worry about databases and such, since in the future (as in starting 2 years ago) you should be doing it all digital anyways.

Oh, and I never even read the book before I scanned it.

Automating the scanning process (since photographic prints are often of similar sizes) would probably cut the costs (in volume), and I don't see why you couldn't scan the photos in at a speed of at least 30 per minute. If a digital photocopier can do it, then it follows scanners can.

Heck, I was scanning by hand pages many times bigger than a photo on a 5 year old $80 scanner at a rate of 2 per minute.

That article is proof that just because you teach something doesn't mean you can outwit common sense.

[Sorry for the ultra long post :) ]
posted by shepd at 8:54 PM on February 17, 2002
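
The backup routine shepd describes (keep one directory under 650 MB and burn it to CD-R when convenient) needs only a size check before burning. A minimal sketch, assuming a nominal capacity counted as 650 × 1024 × 1024 bytes and ignoring the disc filesystem's own overhead; directory_size and fits_on_cd are hypothetical names, not part of any burning tool.

```python
import os

# Assumed nominal CD-R capacity; real usable space is somewhat lower
# once the disc's filesystem structures are written.
CD_CAPACITY_BYTES = 650 * 1024 * 1024


def directory_size(root: str) -> int:
    """Total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks rather than follow them
                total += os.path.getsize(path)
    return total


def fits_on_cd(root: str, capacity: int = CD_CAPACITY_BYTES) -> bool:
    """True if everything under root would fit within the given capacity."""
    return directory_size(root) <= capacity
```

Leaving some headroom below the nominal capacity is probably wise, since file sizes alone understate what ends up on the disc.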


I store my most valuable data in cuneiform on clay tablets buried in sand. Lasts forever.
posted by quercus at 9:03 PM on February 17, 2002


Microfilm and paper may last as long as 500 years, while digital formats are rated at less than 100 years.

Let's make a 25 year fruitcake (Joy of Cooking recipe), set our Palm Pilots for 25 years from today, and meet right back here in this thread! Every 25 years we'll award the fruitcake to the person whose data has held up the best.
posted by sheauga at 9:24 PM on February 17, 2002


You have to wonder about the credibility of a guy that says stuff like this:
For example, I recently purchased a Commodore 64 system, at nearly the cost of a current PC, simply to be able to read some commercial program I wrote in the mid-1980s. The discs were still readable, but, save for a few Commodore 64 systems in attics or basements, the discs were practically unusable.
Nearly the cost of a current PC? I just looked on eBay. Commodore 64s are going for between $5 and $40, depending on how much extra junk comes with it (drives, modems, printer etc). And there are lots of them. There are hundreds of thousands, if not millions, of C-64 systems lying around in those attics and basements. If they were so rare, we'd see them showing up on Antiques Roadshow selling for $10,000 a pop.

Untold quantities of NASA data have died a silent death this way.

That's not the only way NASA's lost data. See my post to Risks Digest from last July for a completely different way they've lost valuable data (data that wasn't particularly useful at the time, but became very valuable 25 years later ... yet another important thing to consider about why we should hold on to as much old data as possible, even if it seems pointless now, but I suppose that's for another thread). The problem with old data often isn't bit rot, or the disappearance of outmoded disk drives, it's the use of weird, obscure programming languages or database formats that are only around for a short time before getting replaced by something else.

In other words, the question's often not "Will we run out of Commodore 64 floppy drives?", it's "Will anyone still remember how to program in Perl in 20 years?" "Will anyone still be using Cold Fusion in 2010?" Something along these lines happened with the Y2K mess, when 75-year-old COBOL programmers had to be brought out of retirement at insanely high per-hour rates to fix code that nobody had touched in 35 years ... even though much of the original physical hardware running the programs had been kept in tip-top shape and was still being used. And still is today, of course.

Just as a data point: You can still buy pianos that play piano rolls. That's a 92-year-old digital data format. In fact, judging from Google it remains a pretty decent-sized industry.
posted by aaron at 9:47 PM on February 17, 2002


I have four GCR 5 1/4" drives and three MFM 5 1/4" drives that haven't been thrown out for a decade.

Yes, but what are you going to hook them up to? Do modern computers even come with a controller that can run a 5.25? More and more are "legacy-free" these days. Better keep a whole machine around, it won't cost much more...
posted by kindall at 11:08 PM on February 17, 2002


You could try printing on good paper with good ink. Microprint -- something too small to read without magnification, but good for backup storage. There must be inks and papers that wouldn't decay for many years. By the time you need to read and replace the paper, you would have little reading robots (bookworms with little reading glasses and cardigans) to do the work.

Maybe the best digital permanent storage is the Internet. Keep your data in at least two or three different places, and let server backups and replacements keep refreshing and moving your data. Anything of lasting interest (and everything is interesting to someone, especially if you interlard it with porn*) will likely be copied to other places if you give people free access to it. I suspect that this thread will be somewhere in a thousand years, though there may be no one to read it.



* Hide important texts in pictures of fellatio and let a million one-handed typists distribute and store them for you. Or build a DNA drive -- something that will encode text in the junk sections of DNA and insert it into plants and animals. Let sex be your distribution system. That's not just a strep throat, that's the Encyclopaedia Britannica.
posted by pracowity at 11:58 PM on February 17, 2002


in a thousand years, though there may be no one to read it

Well.. DNA will continue to be replicated and copied. It's a sexy topic. It's all about replication. Nature's very own process of passing DNA information from one generation to the next shows how data can be preserved for millions of years using a living system. We are all born with one purpose: to keep the data alive. Not sure what the computer analogy is. Perhaps the Internet is a living system. I think the author's point is that data is "dead" once it's archived because no one is taking care of it. We can't assume a copy is good enough and forget about it. Data has to be living in order to survive. Clay tablets are only alive because we can still read the language.
posted by stbalbach at 12:40 AM on February 18, 2002


What I find more interesting is why we're compelled to pointlessly save every ridiculous byte of our digital existence. We archive our mail stores, our oh-so clever scripts, our laboriously formatted documents, etc.

Little do many of us realize that no one will bother with our data once we're dead. No one will sift through our digital flotsam, especially if it requires any arcane technological effort. All those meticulously labelled and carefully organized CD-Rs are destined for the landfill the moment we're not there to look after them.

I agree with those who think the Internet may be the only reliable digital storage device, but even then our data merely becomes another non-descript molecule in an ever-deepening ocean of information, of interest to no one eventually. Not even the techno-antiquarians of the future.
posted by johnnyace at 2:28 AM on February 18, 2002


of interest to no one eventually

Unless we make ourselves interesting. Better get cracking, the clock is ticking...
posted by rory at 2:44 AM on February 18, 2002


> Little do many of us realize that no one will bother with
> our data once we're dead.

Unless you are one of the few future-historical figures of our time, probably no one (except perhaps your descendants, but isn’t that enough?) will be interested in you alone, but historians will be interested in you statistically.

Today’s historians would love to have the equivalents of our recipes and pay stubs and shopping lists and love letters and memos and diaries and entertainment schedules and answering-machine messages and SMS messages and pornography and plumbing diagrams and medical charts and school records and telephone books from each resident of Rome 2000 years ago.

Future historians will worship at the shrine of Blogger. They will want to know about every cat picture and every stupid fucking online test. They will care whether you are hot or not.

We all want, individually and collectively, to be understood. If we can make ourselves understood to the future (through future historians) for the cost of a few measly giga-tera-whatever-bytes of permanent storage hidden under a mountain or out on an asteroid, it’s worthwhile.
posted by pracowity at 3:37 AM on February 18, 2002


pracowity, I agree that archives of the current 'net will be of great historical value, and that is one of the primary reasons I participate in online discussion forums such as MeFi, but I also suspect that eventually even ASCII text will go the way of the dinosaur.

Even so, there will always be those who enjoy digging through the past, so I'm confident that the 'net will be archived in some form or another from here on out. However, the collective amount of data is already more than anyone could ever hope to consume, and it's not even twenty years old.

I think collections like Blogger and MeFi are probably right at the fringe of what will one day be considered worthwhile archive material, especially after a few format paradigm shifts. Given that most of the 'net is totally free-form at the moment, it will be of little use if it can't be easily categorized and searched by machine in the future. Usenet is a good example; megs of useful information lost among gigs of blather and noise.
posted by johnnyace at 4:22 AM on February 18, 2002


Another way of keeping your data safe (not for the squeamish).
posted by jonathanbell at 4:23 AM on February 18, 2002


Data will be passed down in the way it always has been: by being copied again and again, though with digital data each copy should be lossless. I study the Roman Empire at university, and some of the losses and gaps are absolutely crucial to our understanding. Just imagining what Roman history would be like with a real Bloggus Caesari makes me have to lie down.

I only hope that some evil regeneration technique allows me back in 500 years to point to MeFi and say that I was part of that.
posted by nedrichards at 4:41 AM on February 18, 2002


Of course, don't forget about the CD eating fungus. Bookworms be damned!
posted by piskycritters at 4:45 AM on February 18, 2002


Aaron: Interesting point, but I think the COBOL/Y2K and NASA Mars examples are significantly less likely to happen in the future because so much work has been done on common data formats such as XML and ways to exchange information between separate systems (XML-RPC, SOAP, etc.).

There are still examples where we have been idiotic about the need for archival storage -- pre-1995 Usenet would've been lost entirely without a few hobbyist packrats -- but on the whole, the Web-driven impetus to share information has made things appreciably better.
posted by rcade at 7:13 AM on February 18, 2002


Actually, from what I've read from professional historians, too much data is as bad as too little data. Perhaps we should be considering that the basic fragmentary nature of the historical record is in many ways a good thing because in the process of the data getting lost and destroyed, quite a bit of distillation occurs leaving behind the most useful records.

For example, do we really need to have everybody's tax returns from 10 years ago archived anywhere? The descriptive statistics have already been calculated for this data, you can get the aggregate frequency distributions for every single line on the full 1040. As a result, keeping these records for eternity becomes a matter of preserving trivia for trivia sake.
posted by KirkJobSluder at 8:04 AM on February 18, 2002


KirkJobSluder. That's possibly so. But what about local historians looking to see what industries/commuting patterns were important in their area 150 years ago, when the very idea of commuting is bizarre? One of the many reasons the American Civil War and Crimean War are interesting is that you get large numbers of soldiers' diaries telling you what it was like for a man at the front. You cannot know what will be interesting for future historians or what statistical techniques they'll use.

Also, I don't know which professional historians you heard that from, or what periods they particularly studied, but I would say that a historian who doesn't want to preserve or expose the past is not a historian, but a mere controversialist. It should also be noted that you can’t prove something from its absence (just infer), so we don’t know what we’re missing. The very stuff of history, the primary source material, is letters and journals. I for one will be glad that your webpage will be saved at the web archive.

Surely any experience in complex systems of information would show that, generally speaking, the best stuff tends to rise to the top, but that doesn’t make the rest any less valid or historically interesting. To take up a point from earlier, the mere fact of the 'blog phenomenon' is historically interesting. I'll just make one note: lots of Roman historical authors were 'distilled' down into precis by the Byzantines. Having that precis is much better than nothing at all, but it's so much less important and interesting than those fragments where the full text survives.

This is rambling too much so I'll stop now but I'm glad that, so long as we have the space (and the way digital storage is going we'll probably have it for a long time yet) we will save things and that people are thinking about ways to save them better.
posted by nedrichards at 8:32 AM on February 18, 2002


For those that enjoy things archaeological and historical try Motel of the Mysteries.

A very good parody of archeology and American society that has some obscure relationship to the saving of digital data.
posted by bjgeiger at 8:45 AM on February 18, 2002


Also, I don't know which professional historians you heard that from, or what periods they particularly studied, but I would say that a historian who doesn't want to preserve or expose the past is not a historian, but a mere controversialist. It should also be noted that you can’t prove something from its absence (just infer), so we don’t know what we’re missing. The very stuff of history, the primary source material, is letters and journals. I for one will be glad that your webpage will be saved at the web archive.


It's not a matter of what these historians (Hexter and Mink for example) want. It's a matter of what they can actually get their hands on. Even letters, journals, and diaries at best only capture fragments of the past, those little bits of personalized fiction that the author deems to be important. For example, it appears that while we have several versions of Senet boards from the ancient world, we don't have any rules for playing on those boards. One hypothesis for this is that the game was considered to be so mind-bogglingly obvious and trivial that no one ever bothered to waste precious resources in preserving the rules.

Certainly the absence of information or loss of information is a serious problem. But too much information is just as bad of a problem. In order for that information to be useful it must be cataloged and sorted. Maintaining massive stores of information requires considerable time and money. The breakdown of the patent system can be largely attributed to the fact that technological innovation and publishing have outstripped the ability of patent offices to discover the previous research related to the patent.

Another problem with archiving Internet records is how frequently you choose to archive. For example, my current paper up for review went through over 10 revision cycles. Is it really worthwhile to save 10 different major drafts (eight of them were not very good to begin with)? What constitutes a major change that is worth documenting? Granted, Christopher Tolkien made tons of money by publishing drafts that his father thought no one would be interested in. But in most cases the obsessive documentation of drafts doesn't serve a historical purpose. With files that change on a daily basis, you get even more problems. For MetaFilter, is every day a different document?

Ultimately, you have to be selective in what you save, and what you discard. The Web starts off as a fragmentary record and becomes even more fragmentary.
posted by KirkJobSluder at 9:49 AM on February 18, 2002


This is all true. I'm not entirely sure how it can be solved. We will have to do something because all this information isn't likely to just go away again. An adaptation of the 'with many eyes all bugs are shallow' principle may work in the long term.

There does seem to be something endearingly human about the need to record your presence in this scary world of ours (or theirs) thus the wonderful ancient graffiti.
posted by nedrichards at 10:41 AM on February 18, 2002


Kindall, all PC computers with a floppy port (and I've never sold one without) support 5.25" floppy drives. Laptops excepted -- but only because you can't rip into them like PCs.

The only difficult part is finding floppy cables with the larger 5.25" connector. But making one is as simple as squeezing a vice, so I'm not worried. :-)

GCR drives (C= 1541 drives, for example) don't work on those ports, though. But a simple program like Star Commander lets me hook up those drives through the printer port (standard on all but the cheapest PCs).
posted by shepd at 3:42 PM on February 18, 2002


Unless you are one of the few future-historical figures of our time, probably no one (except perhaps your descendants, but isn’t that enough?) will be interested in you alone, but historians will be interested in you statistically.

I disagree. Not about your statement that historians will be interested in our statistical flotsam, but that they wouldn't be interested in us as individuals. They will be absolutely orgasmic over any individual's giant email spools, documents, blog posts, etc. Those items are vitally important in helping historians learn more about how people really interacted, how they thought, even the way their use of the English language differed at that moment in time from the time the historian will be doing the studying.

I dated a girl for a long time off and on in high school, college and beyond, who kept copious, extremely detailed journals about her daily life as she was growing up. She later died, and her family didn't know what to do with all those giant binders (and eventually floppies) filled with millions of words of writing. I found out by happenstance that the Smithsonian is always interested in such writings by Americans of all walks of life, because they always tell a tale of a given place and time in American history. So the journals were turned over to them. I'm sure they've just been filed away, as they're not of much interest to anyone at the moment. But eventually - 50, 100, 200 years from now - someone is going to have a reason to research what life was like for upper-middle-class girls living in Manhattan in the 1980s and early 1990s. And when they do, those things will be a treasure trove of vital information. (We did print out the contents of the floppies, btw, for the very reasons given in this thread.)
posted by aaron at 5:14 PM on February 18, 2002


Sorry, not the Smithsonian, it was the Library of Congress.
posted by aaron at 5:26 PM on February 18, 2002


> They will be absolutely orgasmic over any individual's
> giant email spools, documents, blog posts, etc. Those
> items are vitally important in helping historians learn
> more about how people really interacted...

Well, yes, but they likely will be interested in her as a representative of the masses, as just one of many upper-middle-class Manhattan girls of the time thinking the same thoughts and buying the same shoes, and not as a particularly interesting person, not in the way people are interested in, say, John Keats. Probably you could substitute any other girl's ephemera for hers and the historians would be as happy; they would be interested in her because she was average and taught them something about the times. But you couldn't substitute just any 1820s Londoner's correspondence for the letters Keats wrote to friends and family. That's the sort of difference I meant.
posted by pracowity at 10:28 PM on February 18, 2002


Oh, okay, I understand (though I thought she was an interesting person). But then, plenty of people aren't considered interesting by the masses until after they're dead. And who knows how much such writings will still manage to stick around for 100 years or more. Most personal writing of a century ago has long since just been thrown away. Somehow nowhere near as much of it manages to stick around as you'd think.
posted by aaron at 10:53 PM on February 18, 2002


Kindall, all PC computers with a floppy port (and I've never sold one without) support 5.25" floppy drives.

Huh. You would have thought price pressure would have eliminated that capability from the machines around 1990. If you can save 5 cents by not supporting 5.25" drives, it'd seem a no-brainer.
posted by kindall at 10:55 PM on February 18, 2002


aaron, now multiply your deceased friend's writing by even a small % of each human generation. Even if everyone's musings and records were digital and permanently stored reliably, how will historians ever hope to find it? What chance do they ever have of wading through the exponentially increasing unfathomable amounts of data to discover the occasional gem?

There's already so much historical information that's quietly disintegrating on paper, much less on modern media, that we can't hope to save it all. Eventually every byte we've massaged into place will be lost or forgotten. While I want to be remembered as much as anyone, I don't hold the illusion that "the future" will even know that I existed, or very much care.
posted by johnnyace at 12:24 AM on February 19, 2002


> While I want to be remembered as much as anyone, I
> don't hold the illusion that "the future" will even know
> that I existed, or very much care.

Probably nothing you write will matter to many, but some of your descendants would love to know what great great great great great great grandpa johnnyace thought about hamburgers and squirrels and music. They would take the time to read your old blog and check out your record collection and so on.
posted by pracowity at 7:53 AM on February 19, 2002




This thread has been archived and is closed to new comments


