
Online Corpora
January 24, 2011 6:47 PM

Online Corpora. In linguistics, a corpus is a collection of 'real world' writing and speech designed to facilitate research into language. These 6 searchable corpora together contain more than a billion words. The Corpus of Historical American English allows you to track changes in word use from 1810 to present; the Corpus del Español goes back to the 1200s.
posted by Paragon (11 comments total) 36 users marked this as a favorite

We can only pray that YouTube comments were not collected.
posted by Joe Beese at 6:57 PM on January 24, 2011

YouTube comments are collected, and putting them into this kind of format is an incredibly simple, if lengthy, task. I would go so far as to say that it would be enlightening, if not edifying, to use them in such a way.

Outside of the context of the videos they refer to (though title and posted description can be collected easily enough) they will make even less sense of course, but it's still data.
posted by aeschenkarnos at 7:58 PM on January 24, 2011

Love the idea, love the data, hate the UI. Entire windows full of buttons shouldn't jump wildly from left to right just because you moved the mouse.

It'd be nice to see them apply comparative graphing like the Google Ngram Viewer to this.
posted by otherthings_ at 8:25 PM on January 24, 2011

Heads up, linguistic nerds.

That Google Ngram corpus?

You can play with it, online, via Amazon EC2. Just mount the file system, it's a volume you can play with.

posted by effugas at 8:29 PM on January 24, 2011 [1 favorite]

I'm slightly obsessed with corpora. Here are a few that I like to visit, depending on what I need: (I have many more links gathered on delicious and Pinboard, too, if anybody is interested.)

The descriptions for each link have been copied from their respective websites.

posted by iamkimiam at 11:18 PM on January 24, 2011 [24 favorites]

Jesus X. Christ - can we swap that comment for the FPP? That's an amazing list.
posted by Paragon at 11:23 PM on January 24, 2011 [2 favorites]

To add to Kim's list, there are some more non-English corpora as well:
Ernestus Corpus of Casual Dutch
Nijmegen Corpus of Casual French
Nijmegen Corpus of Casual Spanish
Nijmegen Corpus of Casual Czech

One maintained by the LDC, linked by iamkimiam, that gets brought up a lot in some circles is the TIMIT Acoustic-Phonetic Continuous Speech Corpus, which has speakers reading off a number of sentences, rather than spontaneous productions.

My final addition, which is not a corpus, is for those of you into checking wordform and lemma frequencies in Dutch, German, and English via the web. WebCelex is an interface for Celex.
posted by knile at 4:17 AM on January 25, 2011 [1 favorite]


Take that, corpus.
posted by Faint of Butt at 9:13 AM on January 25, 2011

The link that effugas should have included: Google Books Ngrams on S3 (for AWS/EC2). I looked for this in December and it didn't exist then: I'm so happy to see it now.
posted by xueexueg at 11:43 AM on January 25, 2011

That list is amazing! Sadly, project We Say Tomato is no longer accepting submissions.
posted by nirvan at 10:49 PM on January 25, 2011

Oh my. Oh my.
posted by cortex at 12:42 PM on January 26, 2011

