25,000 Transcribed Texts From 1473-1700 Published Online
January 28, 2015 7:12 AM

The University of Michigan Library, the University of Oxford's Bodleian Libraries and ProQuest have made public more than 25,000 manually transcribed texts from 1473-1700 — the first 200 years of the printed book. Full text access. Multiple format downloads, including ePUB. Or just download the entire corpus.

The texts represent a significant portion of the estimated total output of English-language work published during the first two centuries of printing in England.

The release via Creative Commons Public Domain Dedication marks the completion of the first phase in the Early English Books Online-Text Creation Partnership (EEBO-TCP). An anticipated 40,000 additional texts are planned for release into the public domain by the end of the decade.
posted by Bobby Rijndael (34 comments total) 80 users marked this as a favorite
 
Great stuff there! A few nuggets for fans of Sam Pepys.
posted by beagle at 7:23 AM on January 28, 2015 [5 favorites]


Creative Commons Public Domain Dedication (CC0 1.0 Universal)

As good as it gets, and how it should be. Too many libraries claim some form of license on old books that isn't full Public Domain, which is copyfraud; no library can overrule public domain law.
posted by stbalbach at 7:30 AM on January 28, 2015 [1 favorite]


Apropos a discussion recently had about the medieval distinction between ale (an alcoholic beverage made from grain) and beer (an alcoholic beverage made from grain and hops):
Bere is made of malte, of hoppes, and wa∣ter, it is a naturall drynke for a dutche man. And nowe of late dayes it is moche vsed in Englande to the detrimēt of many englysshe mē, specially it kylleth thē the which be trou¦bled with the colyke & the stone & the strāgu∣lyon, for the drynke is a colde drynke: yet it doth make a man fat, & doth inflate the bely, as it doth appere by the dutche mens faces & belyes.
From A compendyous regyment or a dyetary of healthe made in Mountpyllyer, by Andrewe Boorde (1490? - 1549).

Alas, by the end of the 16th century the Dutch fashion for contaminating ale with hops had come to stay. And with it came the commercialization of brewing and the end of brewing as a woman-dominated profession in England.
posted by Bobby Rijndael at 7:32 AM on January 28, 2015 [8 favorites]


I see some are marked "restricted" and others are "free". Oh well. Maybe they can claim rights on the transcription of a public domain book. Seems sketchy.
posted by stbalbach at 7:32 AM on January 28, 2015


Martin Mueller's recent article in Spenser Review gives some really good background on the project and some guidance about ways the public release could be used to strengthen the digital early modern corpus. [spoiler alert: 40,000 more texts are coming in 2020!]
posted by activitystory at 7:38 AM on January 28, 2015


They probably couldn't enforce any restriction on you possessing the text, but a web site can be as capricious as it likes in providing you access to a text with no need of legal justification.
posted by idiopath at 7:39 AM on January 28, 2015


I don't see why they manually transcribed them when they could have just used a scanner and maybe some OCR software.
posted by sexyrobot at 7:40 AM on January 28, 2015


I see some are marked "restricted" and others are "free". Oh well. Maybe they can claim rights on the transcription of a public domain book. Seems sketchy.

From the FAQ:
3.10. What is the difference between a freely available text and a restricted one?

The resources in the OTA collection have been deposited with the Archive under different licenses. Some depositors require that you register and sometimes also contact them before you are allowed to download their resource. These resources you have to request first by filling out a form. Other resources are able to be freely downloaded, but this still involves providing your email so we can send you a link at which you can download the text.

3.11. Why do some resources require asking for permission?

Some of our depositors want to be consulted before anyone can access their resources. It may be that they want to know who is using the resource or that they are working on improving or expanding the resource and may have a later version available. The OTA encourages all depositors to make their works freely available if at all possible.
So the restrictions seem to have been put in place by the institutions that had the original works, not the project itself, which encourages free, open access.
posted by Bobby Rijndael at 7:41 AM on January 28, 2015 [3 favorites]


A new song, called Jacke Doues

"And care not how ere the world goe"
posted by stbalbach at 7:41 AM on January 28, 2015


I don't see why they manually transcribed them when they could have just used a scanner and maybe some OCR software.

A lot of these books have unusual typefaces and layout. Almost all of them will have archaic and inconsistent spellings. Many will use characters and diacritical marks that are no longer common (or used at all) in English. It would give OCR software fits, in turn making full-text search pointless and text analysis difficult.
posted by Bobby Rijndael at 7:45 AM on January 28, 2015 [13 favorites]


I don't see why they manually transcribed them when they could have just used a scanner and maybe some OCR software.

OCR performs astonishingly badly on printed text of this period. Attempts are being made to improve this, but transcription is still much better than the current state of the art. Below a certain accuracy threshold (99.5% is, I think from memory, the figure often bandied around) it's actually cheaper to rekey the texts than it is to proof OCR'ed text, and more accurate.
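
To put that accuracy figure in perspective, here is a back-of-the-envelope sketch in Python. The 2,000 characters per page is purely an assumed round number for illustration, not a measurement from these books, and `expected_errors` is a made-up helper name.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of wrong characters on a page at a given character accuracy."""
    return round(chars_per_page * (1.0 - accuracy), 2)


# Assumption for illustration only: roughly 2,000 characters on a printed page.
CHARS_PER_PAGE = 2000

for accuracy in (0.95, 0.99, 0.995):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.1%} accurate: about {errors} errors per page to find and fix")
```

Even at the threshold figure, a proofreader would still be hunting for a handful of errors on every page under that assumption, which is roughly why rekeying can come out cheaper.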
posted by GeorgeBickham at 7:45 AM on January 28, 2015 [8 favorites]


OCR is not as easy with early modern typefaces as it is with 20thC texts, but the Early Modern OCR Project at Texas A&M is doing some amazing things.

The great thing about these texts being open is that people now can correct/mark-up/and otherwise improve them.
posted by activitystory at 7:45 AM on January 28, 2015 [2 favorites]


Yeah, not really all open access, is it. I just did a search for "potato" looking for early descriptions of the vegetable. All the earliest cites require me to login to see them. I don't have a way to login, so basically I'm SOL. So much for the spirit of free inquiry, eh?
posted by Chrischris at 7:49 AM on January 28, 2015


Great stuff there! A few nuggets for fans of Sam Pepys.

"The Portugal history, or, A relation of the troubles that happened in the court of Portugal in the years 1667 and 1668 in which is to be seen that great transaction of the renunciation of the crown by Alphonso the Sixth, the dissolution of his marriage with the Princess Maria Frances Isabella of Savoy : the marriage of the same princess to the Prince Don Pedro, regent of the realm of Portugal, and the reasons alledged at Rome for the dispensation thereof."

Snappy title there Sam.
posted by sobarel at 7:50 AM on January 28, 2015 [1 favorite]


Why would they write the books out by hand if they were already printed? For that vintage throwback feel? Stupid hipsters.
posted by The 10th Regiment of Foot at 8:13 AM on January 28, 2015 [1 favorite]


oh my god i thought this was going to be like the News of Old Twitter thing where these were a bunch of short messages that had been found and were being transcribed.

i really wanted that. "get ye some other wench to play milkmaid! thou art truly mad!"

or "verily, it is what the lady hath spake."
posted by sio42 at 8:53 AM on January 28, 2015


This could be a fantastic resource for OCR developers (and computer vision researchers in general) looking to improve their algorithms. As everybody notes above, OCR is generally horribly bad at this; having a huge amount of already-transcribed material to chew on is about the best thing someone in machine learning could wish for.
posted by clawsoon at 8:55 AM on January 28, 2015


More on why they keyboarded rather than using OCR.

This could be a fantastic resource for OCR developers (and computer vision researchers in general) looking to improve their algorithms. As everybody notes above, OCR is generally horribly bad at this; having a huge amount of already-transcribed material to chew on is about the best thing someone in machine learning could wish for.

I wouldn't have thought this was an ideal training-set for that purpose, as the transcripts are (somewhat lightly) regularised or corrected: things like the long s; the early-modern 'ct' and 'st' ligatures; the use of two vs for a w (early English printers mainly used Continental type); and other quirks like turned type, absent letters pulled through inking or made illegible (but which can be inferred through context) - all these have been silently corrected. The population of these regularisations is not immense, and the transcribers (many of whom are highly skilled, but anonymous - the editors outsourced the labour to commercial operators) did try to transcribe what they saw. But mapping an early-modern typecase and the vagaries of its use to a modern character-set is not straightforward, even before you get into things like diagrams, ornaments and pen-corrections. The point of this project was to provide reading texts for linguistic and general use, not to improve OCR, so I'm not sure if the amount of adaptation necessary for that purpose would make it worthwhile. Plus, you would have to get access to high-quality images of the originals (and they would have to be the exact same copies, as early books vary significantly) to perform the training. The keyboarders used low-quality scans from microfilms, which are paywalled behind ProQuest's EEBO product.
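
As a rough illustration of what this kind of regularisation looks like, here is a minimal Python sketch that folds the long s and two ligature code points into modern letters. The mapping table and the name `regularise` are my own assumptions for the example, not the TCP's actual rules or tooling, and the 'ct' ligature is left out because I'm not aware of a standard Unicode code point for it.

```python
# Hypothetical mapping for illustration; not the TCP's actual transcription rules.
REGULARISE = {
    "\u017f": "s",   # long s
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",  # st ligature
}


def regularise(text: str) -> str:
    """Replace each mapped early-modern glyph with its modern equivalent."""
    for old, new in REGULARISE.items():
        text = text.replace(old, new)
    return text


# A word set with a long s comes out with an ordinary s.
print(regularise("\u017fuck"))
```

Once this has been done silently, the transcript no longer records which glyph was on the page, which is exactly the information an OCR training set would want.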
posted by GeorgeBickham at 9:12 AM on January 28, 2015 [2 favorites]


Thank you for posting this!

The ability to look at books from this era makes me desk chair wiggle with joy.

I mean you gotta love titles like this:

The naturall and morall historie of the East and West Indies. Intreating of the remarkable things of heaven, of the elements, mettalls, plants and beasts which are proper to that country: together with the manners, ceremonies, lawes, governments, and warres of the Indians.


Purchas his pilgrimes In fiue bookes. The first, contayning the voyages and peregrinations made by ancient kings, patriarkes, apostles, philosophers, and others, to and thorow the remoter parts of the knowne world: enquiries also of languages and religions, especially of the moderne diuersified professions of Christianitie. The second, a description of all the circum-nauigations of the globe. The third, nauigations and voyages of English-men, alongst the coasts of Africa ... The fourth, English voyages beyond the East Indies, to the ilands of Iapan, China, Cauchinchina, the Philippinæ with others ... The fifth, nauigations, voyages, traffiques, discoueries, of the English nation in the easterne parts of the world.

It's an insight into a completely different mindset and worldview.
posted by Jalliah at 9:32 AM on January 28, 2015


I'm curious why, since these were transcribed, the use of VV for W, u for v, etc is followed. I thought these were just typographic, not actual spelling variations.
posted by CheeseDigestsAll at 9:33 AM on January 28, 2015 [1 favorite]


I'm curious why, since these were transcribed, the use of VV for W, u for v, etc is followed. I thought these were just typographic, not actual spelling variations.

I'd forgotten that they transcribe VV: not everyone would and some would argue that it depends on how they are kerned.

u and v (and i and j) are not entirely separate letters at this time: they are interchanged according to their position within the word. Each may also be pronounced as either a vowel or a consonant, so most editors will transcribe the glyph.
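
A minimal sketch of the positional habit, in Python. It assumes the usual printers' convention of v at the start of a word and u everywhere else; that convention is my gloss rather than anything taken from the TCP guidelines, real printers were not perfectly consistent, and `positional_uv` is a made-up name.

```python
def positional_uv(word: str) -> str:
    """Respell a modern word using the assumed rule: v word-initially, u elsewhere."""
    out = []
    for i, ch in enumerate(word):
        if ch in "uv":
            out.append("v" if i == 0 else "u")
        elif ch in "UV":
            out.append("V" if i == 0 else "U")
        else:
            out.append(ch)
    return "".join(out)


for modern in ("upon", "love", "universal"):
    print(modern, "->", positional_uv(modern))
```

Going the other way (early modern to modern) is the hard direction, since you have to decide from context whether each glyph stands for a vowel or a consonant, which is one reason to transcribe the glyph as printed.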
posted by GeorgeBickham at 9:45 AM on January 28, 2015 [1 favorite]


Fun times if you search the word "witch"

Lots of fiery sermons and prose
An account of at least one trial
and this
A perfect discovery of witches shewing the divine cause of the distractions of this kingdome, and also of the Christian world : very profitable to bee read by all sorts of people, especially judges of assizes, sheriffes, justices of the peace, and grand-jury-men, before they passe sentence on those that are condemned for witch-craft

I started reading it on my lunch hour and if I've got the gist of it, it's a book where the author is arguing about how wrong people have been about what a witch is and therefore making mistakes about witches.
Basically, that isn't witchcraft or this isn't a witch, this is what it is.

It's fascinating.
posted by Jalliah at 10:22 AM on January 28, 2015 [1 favorite]


EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
posted by corb at 10:31 AM on January 28, 2015


All I've got to say is that she turned me into a newt, Jalliah.
posted by The 10th Regiment of Foot at 11:10 AM on January 28, 2015 [1 favorite]


This is awesome! ACK!!!
posted by xarnop at 11:30 AM on January 28, 2015


How long 'til this gets included in Google's Ngram?
posted by jetsetsc at 12:41 PM on January 28, 2015


Not long at all, jetsetsc. Check out the Early Print project at Washington University in St Louis.
posted by activitystory at 1:02 PM on January 28, 2015 [2 favorites]


Search results for 'fuck': 7 matches in 4 records

That seems pretty low. I mean there are threads on the blue with more hits than that. (fuck, there are threads where there are more hits than that because of my casual vulgarity)
posted by el io at 1:04 PM on January 28, 2015


I think some of those "fuck"s are actually "suck"s mis-transcribed. Fewer fucks actually given.
posted by jetsetsc at 1:50 PM on January 28, 2015


Oooo, John Donne sermons and poetry both ...
posted by Quasirandom at 2:49 PM on January 28, 2015


For those not clear, the Oxford Text Archive (OTA) is not identical to the EEBO-TCP project. EEBO-TCP is open for downloading in its entirety, in XML, with no restrictions. OTA is simply one place the files are being hosted, and often puts its own restrictions on things.

For straight access to everything in Phase I of the TCP, the Michigan library link is ideal. You can "view full text" for anything in Phase I. (Phase II is planned for release five years after the last texts are transcribed, with access within those 5 years coming via partner libraries who have helped fund the effort. This has also been the case with Phase I for the last five years; it's been done for a while, but paywalled to pay for the project)

More information on the TCP venture can be found here, including, I believe, their transcription guidelines, which address the u/v distinction, vv/w, etc.

The Early Modern OCR Project has done a lot of great things, including developing the Typewright Tool that is integrated into the aggregator site 18th Connect. This allows you to access otherwise paywalled page images for purposes of correcting the (often terrible) OCR that is what ProQuest normally provides and what is actually searched when you work with these types of resources. We (I am part of this effort, broadly) hope to eventually integrate this with the EEBO and EEBO-TCP collection to better that OCR as well.
posted by scdjpowell at 6:20 PM on January 28, 2015 [3 favorites]


here I was thinking texting technology was all 21st century! :D
posted by labanjohnson at 6:57 PM on January 28, 2015


The XML transcripts are also on Github. Here is a CSV file listing all the texts.
posted by GeorgeBickham at 10:48 PM on January 28, 2015


Title: Villare Anglicum, or, A vievv of the tovvnes of England collected by the appointment of Sir Henry Spelman.

FINALLY!
posted by clavdivs at 11:28 PM on January 28, 2015

