A Better Wayback Machine
November 4, 2013 9:06 AM

A Much Better Wayback Machine. Mefi's own rajbot programs for the awesome Internet Archive, and recently helped add some sweet new features. You can now instantly save a page and get a permanent URL; insert a 404 handler on your site to lead users past broken links; use new APIs; and plenty of other good stuff. [via mefi projects]
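
For readers wondering what "save a page and get a permanent URL" looks like in practice, here is a minimal Python sketch. It assumes the Wayback Machine's URL conventions of `https://web.archive.org/save/<url>` for an on-demand capture and `https://web.archive.org/web/<timestamp>/<url>` for a fixed snapshot. The function names and the example address are made up for illustration, and nothing here contacts the network.

```python
# Base address of the Wayback Machine (assumed from its public URL scheme).
WAYBACK = "https://web.archive.org"


def save_request_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture page_url now."""
    return f"{WAYBACK}/save/{page_url}"


def snapshot_url(page_url: str, timestamp: str) -> str:
    """Build a permanent link to one capture of page_url.

    timestamp uses the YYYYMMDDhhmmss form. Requiring all 14 digits is a
    choice made for this sketch, not a rule taken from the Archive.
    """
    if not (timestamp.isdigit() and len(timestamp) == 14):
        raise ValueError("timestamp must be 14 digits: YYYYMMDDhhmmss")
    return f"{WAYBACK}/web/{timestamp}/{page_url}"


if __name__ == "__main__":
    page = "http://example.com/page"  # hypothetical page
    print(save_request_url(page))
    print(snapshot_url(page, "20131104090600"))
```

Opening the first address in a browser is what triggers the capture; the second is the kind of link you would paste into a citation.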

I somehow didn't know until just now that archive.org has been running a blog since 2004. Recent projects featured there are the curated library of key NSA statements by lawmakers and officials, and the JSMESS project, which does for vintage software what the Archive has already done for vintage text, audio, and video.
posted by Jacob Knitig (37 comments total) 102 users marked this as a favorite
This is great.
posted by Scientist at 9:11 AM on November 4, 2013

Wouldn't you want the browser's preload function to recognize dead links and substitute a Wayback Machine link -- or is that a web server's function?

In any case, these are all awesome features. I will have to dig up some of the old busted links that frustrated me in the past and see whether they work better now.

And jessamyn was right, the Wayback Machine's web pages look much better now!
posted by wenestvedt at 9:35 AM on November 4, 2013

"Sometimes dead is better."
posted by The Card Cheat at 9:40 AM on November 4, 2013

Yes kids, even your Snapchat usage will come back to bite you.
posted by The 10th Regiment of Foot at 9:44 AM on November 4, 2013

archive.org is an essential Internet service; it makes me nervous there's only one. Google has the data to construct something similar, and presumably so do Bing, Baidu, and maybe Yandex. But only the Internet Archive is serving it live. About time to make another donation...
posted by Nelson at 10:01 AM on November 4, 2013 [5 favorites]

The Internet Archive are some of the best folks online.
posted by Pope Guilty at 10:04 AM on November 4, 2013 [1 favorite]

Awesome! I think I ran across the new page saving functionality recently, but didn't pay enough attention at the time to realize how useful it might be. The Wayback Machine via this Web Cache extension for Chrome has been one of my go-to secret weapons.
posted by Wemmick at 10:57 AM on November 4, 2013

I routinely use the Wayback Machine to call out various federal agencies for failing to update guidance documents, standard requirements, and the like. This is going to make that process so, so much simpler, stronger, and more meaningful. Yes, it is indeed time to donate.
posted by late afternoon dreaming hotel at 11:02 AM on November 4, 2013 [2 favorites]

So. When are they going to decide that the most important part of an old book is in fact not the high-resolution scan of the texture of yellowing paper? Are they going to add the bits that will make their pdfs wieldy and not the horrible sluggish multi-layered yellow monstrosities that they are right now?

That said...

The Internet Archive are some of the best folks online.

posted by Pyrogenesis at 11:07 AM on November 4, 2013

The Best!
posted by jessamyn at 11:21 AM on November 4, 2013 [2 favorites]

Sometimes we can have nice things, and the Internet Archive is the evidence.
posted by Jimbob at 12:02 PM on November 4, 2013 [3 favorites]

webcitation.org was one way to do it, but with limits and they are shutting down. Wikipedia is looking at starting a new project to do it, to take over webcitation.org. But I always thought Wayback was the better choice, and here they are. They did it with little fanfare and will make a big impact. I suspect this will turn into a bigger thing as it evolves.

The next step will be the ability to search - hope that arrives sooner rather than later.
posted by stbalbach at 12:25 PM on November 4, 2013

Are they going to add the bits that will make their pdfs wieldy and not the horrible sluggish multi-layered yellow monstrosities that they are right now?

Do you mean that since the PDFs are multi-layered they don't work well in e-book readers like the Kindle? I found a solution is to convert the PDF to DjVu format, then convert back to PDF - it removes the layers and makes it lighter on Kindle resources. Just need the right tools; it's been a while, but ping me if you need some names.
posted by stbalbach at 12:29 PM on November 4, 2013

Excellent, most excellent.
posted by fluffy battle kitten at 12:40 PM on November 4, 2013

Fantastic. The fact that you can now get the Archive to save particular pages on demand is really cool.
posted by shivohum at 12:42 PM on November 4, 2013

Well, I know that when I talk to people about academic papers, I recommend they use an archive.org link. Why? I feel safe to recommend it because of the Archive's reliability, durability, amazing scope and impartiality. I have donated before, and, if I had any money right now, I would donate again. And again. Due to their amazing work.

(Also, this is a handy Firefox plugin. A dorky paradigm, but useful.)
posted by Samizdata at 2:55 PM on November 4, 2013

they also have a credit union :P IAFCU!
posted by kliuless at 3:45 PM on November 4, 2013

I have got to find the URL of my first web page that I made probably around 1992 when the only web browser was Mosaic. I bet it's in the archive. So I looked for the root URL and poked around my old University department... holy crap, they're still using parts of my very first web page!
posted by charlie don't surf at 6:07 PM on November 4, 2013

I wonder if we could get an answer from rajbot (policy, I understand, may interfere and that's OK) on the problem with robots.txt excluding from the archive (or at least public view) the contents of domains which have changed hands. I can understand that there's no real way the IA can interpolate the exclusion, that is, the IA must assume the current owner has the authority to ask for this voluntary exclusion, but I wonder what the IA thinking is on this. There are so few other resources that can be depended on in this area. This, in my experience, is one of the larger issues I've encountered in using the service.
posted by dhartung at 1:29 AM on November 5, 2013
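
To make the robots.txt problem concrete, here is a rough Python sketch of how a standard robots.txt file would be evaluated for an archive crawler. It assumes the crawler identifies itself as `ia_archiver`; the Archive's actual exclusion logic is not described in this thread, and the function name, sample files, and URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the user-agent name the archive crawler announces.
ARCHIVE_USER_AGENT = "ia_archiver"


def archive_allowed(robots_txt: str, page_url: str,
                    user_agent: str = ARCHIVE_USER_AGENT) -> bool:
    """Return True if robots_txt lets user_agent fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt as the original owner might have served it.
ORIGINAL_OWNER_ROBOTS = "User-agent: *\nDisallow: /private/\n"

# Hypothetical robots.txt after a squatter parks the domain.
PARKED_DOMAIN_ROBOTS = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    page = "http://example.com/essay.html"
    print(archive_allowed(ORIGINAL_OWNER_ROBOTS, page))
    print(archive_allowed(PARKED_DOMAIN_ROBOTS, page))
```

The same page flips from allowed to blocked purely because the file served today changed, which is the situation described above: a rule written by the current owner ends up governing captures made under the previous one.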

That's actually the issue I came in here to lament, dhartung.

I often find myself wishing there was a sort of pirate internet archive. One that simply archived everything and refused to ever spit anything back out. There's quite a few smallish sites I can't even look up now because they disappeared off the archive after some shitty domain squatter bought them up and parked them with shitty AdWords and turned off robots.txt.

I was especially sad when I realized the new shady ad company that bought out MySpace did this, in addition to changing a lot of URLs. Up until a year or two ago you could go browse significant slices of what that site was like at various times. It was honestly almost as interesting as going back and looking at old GeoCities or Angelfire pages.

Quite a lot of, I'd say, 2000-2010 internet history is just going to be a black hole in the future if a good solution to this isn't found. I still say something like 2002-2008 is going to be a black hole for a lot of photographs, since they're locked into long-since-discarded phones that obfuscated the ability to transfer photos off without paying for data or some scammy transfer method/kit, but that's a whole other discussion...
posted by emptythought at 3:30 AM on November 5, 2013 [1 favorite]

Hi dhartung,

Speaking from the engineering side, there is a small staff spread across a huge number of projects, so it is an issue of resources, I think. That is not a very exciting answer, I know. There is a lot left to do, and we continue to work hard on the Wayback and all the other projects as well.
posted by rajbot at 1:11 PM on November 5, 2013 [1 favorite]

Does the original author need to have saved it for it to be there? I am looking for an essay called: The Arrogance of Ignorance by Prometheus in his discontinued blog Photon In The Darkness. So far no luck. Several websites have links to it but those links are broken.
posted by RuvaBlue at 1:37 PM on November 5, 2013

RuvaBlue, no, web crawlers are used for archiving web pages. If you take a broken link and paste it into the search box at https://archive.org/web, you might be able to find an archived version of the page. In your case, I think this is the page you were looking for.
posted by rajbot at 1:50 PM on November 5, 2013 [2 favorites]
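
The lookup described above can also be scripted. The Python sketch below assumes the availability endpoint at `https://archive.org/wayback/available` and a JSON reply with an `archived_snapshots.closest` entry; treat both as assumptions to check against the Archive's own API notes. The helper names are invented, and the sample reply is illustrative, not real output.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed address of the Wayback availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(broken_link: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL asking whether broken_link has an archived copy."""
    params = {"url": broken_link}
    if timestamp:
        # Optional YYYYMMDD hint: ask for the capture closest to this date.
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text: str) -> Optional[str]:
    """Pull the archived URL out of a JSON reply, or None if there is no copy."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative reply in the assumed shape; not copied from a real request.
SAMPLE_RESPONSE = """
{"archived_snapshots": {"closest": {
    "available": true,
    "url": "http://web.archive.org/web/20130919044612/http://example.com/",
    "timestamp": "20130919044612",
    "status": "200"}}}
"""

if __name__ == "__main__":
    print(availability_query("example.com/old-page"))
    print(closest_snapshot(SAMPLE_RESPONSE))
```

To use it against the live service, fetch the URL from `availability_query` with any HTTP client and pass the response body to `closest_snapshot`.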

Rajbot: stbalbach raised an interesting idea. Will it ever be possible to search for terms on the Wayback Machine, instead of only searching for specific URLs?
posted by Jacob Knitig at 6:35 PM on November 5, 2013

I think so, but I don't know when it will happen. The Internet Archive runs a subscription crawl service called Archive-It that already provides search for terms within partner collections. Implementing search on the full crawl index is quite a bit more difficult.
posted by rajbot at 8:46 PM on November 5, 2013

Part of Internet Archive building badly burned in early morning fire. No one hurt, and it sounds like it's not too big a blow to operations, but they lost a lot of scanning gear, and of course it's a huge, expensive mess.
posted by Nelson at 3:32 PM on November 6, 2013

Part of Internet Archive building badly burned in early morning fire.

I hope they have an offsite backup of the Internet, at their main site.

BTW, I have had an exceptionally difficult time searching the Wayback Machine. You can't use Google, and there does not seem to be any internal indexing. The search feature appears to do nothing unless you feed it a fully formed URL.
posted by charlie don't surf at 3:44 PM on November 6, 2013

True story: I once got a tour of the Internet Archive datacenter back when it was in the Presidio. It was in a basement in an old army building. I noticed all the server racks were on little risers, like 6 inches off the ground. I asked. They explained the racks were 6 inches off the floor because the most the basement had ever flooded was about 3 inches.
posted by Nelson at 4:09 PM on November 6, 2013

I guess I wasn't very clear, in an attempt to make a joke. The question is, how do you back up more than 1.5 petabytes of data?
posted by charlie don't surf at 4:24 PM on November 6, 2013

Incrementally? Dunno. One thing that's nice about the Web Archive is you can afford to lose random pieces of it, like occasional hard drive failure, since no aspect of the archive was ever complete in the first place. Obviously you don't want to design for frequent failure, but it is a simpler requirement than, say, an archive of bank transactions.
posted by Nelson at 4:57 PM on November 6, 2013

Scanning Center Fire - Please Help Rebuild, some details on the Internet Archive blog.
posted by Nelson at 7:09 PM on November 6, 2013

> how do you back up more than 1.5 petabytes of data?

The Internet Archive now has 15 petabytes of data (and growing), stored redundantly. There is one complete copy of the data at the SF datacenter, another in Redwood City, and there are partial copies at Bibliotheca Alexandrina and in Amsterdam.
posted by rajbot at 8:15 PM on November 6, 2013 [3 favorites]

Part of Internet Archive building badly burned in early morning fire.

posted by homunculus at 12:12 PM on November 7, 2013

We're talking about you -- is your building burning?
posted by dhartung at 10:41 PM on November 7, 2013

Yeah, we have lost the scanning center, next door to our main SF office, to a fire... Fortunately, no one was hurt.

(For those who don't know, the Internet Archive operates 36 scanning centers around the world, which scan about 1500 books/day. The San Francisco scanning center was one of those 36, where we also scanned microfilm and movies, in addition to books.)

We are still dealing with the fire. Everyone in the community has been very helpful. Thank you for all the support!
posted by rajbot at 2:37 PM on November 8, 2013 [2 favorites]

Yes kids, even your Snapchat usage will come back to bite you.

Ten boys arrested for child porn distribution connected to Snapchat
...the boys allegedly lured seven girls into sending pornographic photos to them, using the fact that Snapchat messages self-destruct as bait. What the girls apparently did not know (or trusted would not happen) is that there are a slew of ways to preserve Snapchat messages.
posted by XMLicious at 1:07 AM on November 18, 2013


This thread has been archived and is closed to new comments