Online Corpora
January 24, 2011 6:47 PM   Subscribe

Online Corpora. In linguistics, a corpus is a collection of 'real world' writing and speech designed to facilitate research into language. These 6 searchable corpora together contain more than a billion words. The Corpus of Historical American English allows you to track changes in word use from 1810 to the present; the Corpus del Español goes back to the 1200s.
posted by Paragon (11 comments total) 34 users marked this as a favorite
We can only pray that YouTube comments were not collected.
posted by Joe Beese at 6:57 PM on January 24, 2011

YouTube comments are collected, and putting them into this kind of format is an incredibly simple, if lengthy, task. I would go so far as to say that it would be enlightening, if not edifying, to use them in such a way.

Outside of the context of the videos they refer to (though titles and posted descriptions can be collected easily enough) they will make even less sense, of course, but it's still data.
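For what it's worth, the "putting them into this kind of format" step really is simple. A minimal sketch of turning raw comment strings into a word-frequency table — the comments here are made up, standing in for real scraped YouTube data:

```python
import re
from collections import Counter

def build_corpus(comments):
    """Tokenize raw comment strings into a word-frequency table,
    the most basic 'corpus format' step."""
    counts = Counter()
    for text in comments:
        # crude tokenization: lowercase, keep runs of letters/apostrophes
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tokens)
    return counts

# hypothetical scraped comments, stand-ins for real YouTube data
comments = ["First!!!", "this video is amazing", "amazing, simply amazing"]
corpus = build_corpus(comments)
print(corpus["amazing"])  # 3
```

A real pipeline would also want sentence boundaries and metadata (video id, timestamp), but the core transformation is just this.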
posted by aeschenkarnos at 7:58 PM on January 24, 2011

Love the idea, love the data, hate the UI. Entire windows full of buttons shouldn't jump wildly from left to right just because you moved the mouse.

It'd be nice to see them apply comparative graphing like the Google Ngram Viewer to this.
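The Ngram-Viewer-style comparison is essentially just relative frequency per time slice. A toy sketch — every number here is invented for illustration, not drawn from any of these corpora:

```python
# Ngram-viewer-style trend: relative frequency of a word per year,
# computed from toy counts (all numbers invented for illustration)
yearly_totals = {1900: 1000, 1950: 2000, 2000: 4000}       # total tokens per year
word_counts = {"phone": {1900: 1, 1950: 10, 2000: 80}}     # hits per year

def relative_frequency(word, year):
    """Share of that year's tokens accounted for by the word."""
    return word_counts[word].get(year, 0) / yearly_totals[year]

for year in sorted(yearly_totals):
    print(year, f"{relative_frequency('phone', year):.4f}")
```

Normalizing by each slice's total is what makes slices comparable; raw hit counts would mostly just track how much text survives from each period.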
posted by otherthings_ at 8:25 PM on January 24, 2011

Heads up, linguistic nerds.

That Google Ngram corpus?

You can play with it, online, via Amazon EC2. Just mount the file system; it's a volume you can explore.

posted by effugas at 8:29 PM on January 24, 2011 [1 favorite]

I'm slightly obsessed with corpora. Here are a few that I like to visit, depending on what I need:
  • CHAINS: Characterising Individual Speakers CHAINS is a research project funded by Science Foundation Ireland from April 2005 to March 2009. Its goal is to advance the science of speaker identification by investigating those characteristics of a person's speech that make them unique.
  • The Blog Authorship Corpus The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
  • Old Bailey Corpus The proceedings of the Old Bailey, London's central criminal court, were published from 1674 to 1913 and constitute a large body of texts from the beginning of Late Modern English. The Proceedings contain over 200,000 trials, totalling ca. 134 million words, and its verbatim passages are arguably as near as we can get to the spoken word of the period. The material thus offers the rare opportunity of analyzing everyday language in a period that has been neglected both with regard to the compilation of primary linguistic data and the description of the structure, variability, and change of English.
  • CoRD | Corpus of Early English Medical Writing (CEEM) The Corpus of Early English Medical Writing is a corpus of English vernacular medical writing. Consisting of three diachronically divided subcorpora, the corpus covers the entire history of medical writing in English from the earliest manuscripts to the beginning of modern clinical medicine.
  • The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) is a 1.5 million word syntactically-annotated corpus of Old English prose texts. As a sister corpus to the Penn-Helsinki Parsed Corpus of Middle English (PPCME2), it uses the same form of annotation and is accessed by the same search engine, CorpusSearch. The YCOE was created with a grant from the English Arts and Humanities Research Board (B/RG/AN5907/APN9528). The corpus itself (the annotated text files) is distributed by the Oxford Text Archive and can be obtained from them free of charge for non-commercial use.
  • ARCHER Corpus (The University of Manchester) ARCHER is a multi-genre corpus of British and American English covering the period 1650-1990, first constructed by Douglas Biber and Edward Finegan in the 1990s. It is managed as an ongoing project by a consortium of participants at fourteen universities in seven countries.
  • Santa Barbara Corpus of Spoken American English The Santa Barbara Corpus of Spoken American English is based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds. The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more.
  • The Newcastle Electronic Corpus of Tyneside English (NECTE) The Newcastle Electronic Corpus of Tyneside English (NECTE) is a corpus of dialect speech from Tyneside in North-East England. It is based on two pre-existing corpora, one of them collected in the late 1960s by the Tyneside Linguistic Survey (TLS) project, and the other in 1994 by the Phonological Variation and Change in Contemporary Spoken English (PVC) project. NECTE amalgamates the TLS and PVC materials into a single Text Encoding Initiative (TEI)-conformant XML-encoded corpus and makes them available in a variety of aligned formats: digitized audio, standard orthographic transcription, phonetic transcription, and part-of-speech tagged. This website describes the NECTE corpus in detail, and makes it available to academic researchers, educationalists, the media in non-commercial applications, and organisations such as language societies and individuals with a serious interest in historical dialect materials.
  • The Limerick Corpus of Irish English The Limerick Corpus of Irish English (L-CIE) has been developed by the University of Limerick in conjunction with Mary Immaculate College, Limerick. This one-million word spoken corpus of Irish English discourse includes conversations recorded in a wide variety of mostly informal settings throughout Ireland. The corpus is a collection of naturally occurring spoken data from everyday Irish contexts. There are currently 375 transcripts (totaling over 1,000,000 words) available at this site.
  • American National Corpus The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.
  • Great Britain (ICE-GB) The British component of ICE is based at the Survey of English Usage, University College London. The British ICE corpus (ICE-GB) was released in 1998 and is now available. The corpus is POS-tagged and parsed.
  • WebCorp: The Web as Corpus WebCorp LSE is a fully-tailored linguistic search engine to cache and process large sections of the web.
  • WaCKy "We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, index and search them. We try to keep everything very laid-back and flexible (minimal constraint on data representation, programming language, etc.) to make it easier for people with different backgrounds and goals to use our resources and/or contribute to the project. We built a few corpora you can download, and in the near future we'll have a web interface for direct online use of the corpora."
  • VISL - Corpus Eye VISL's grammatical and NLP research are both largely corpus-based. On the one hand, VISL develops taggers, parsers and computational lexica based on corpus data; on the other hand, these tools - once functional - are used for the grammatical annotation of large running text corpora, often with or for external partners (project list 1999-2009). The main methodological approach for automatic corpus annotation is Constraint Grammar (CG), a word-based annotation method.
  • TIME Magazine Corpus of American English This website allows you to quickly and easily search more than 100 million words of text of American English from 1923 to the present, as found in TIME magazine. You can see how words, phrases and grammatical constructions have increased or decreased in frequency and see how words have changed meaning over time.
  • Regex Dictionary by Lou Hevly The Regex Dictionary is a searchable online dictionary, based on The American Heritage Dictionary of the English Language, 4th edition, that returns matches based on strings —defined here as a series of characters and metacharacters— rather than on whole words, while optionally grouping results by their part of speech.
  • Michigan Corpus of Academic Spoken English The Michigan Corpus of Academic Spoken English (MICASE) is a collection of nearly 1.8 million words of transcribed speech (almost 200 hours of recordings) from the University of Michigan (U-M) in Ann Arbor, created by researchers and students at the U-M English Language Institute (ELI). MICASE contains data from a wide range of speech events (including lectures, classroom discussions, lab sections, seminars, and advising sessions) and locations across the university.
  • LDC - Linguistic Data Consortium New Corpora at the LDC include: Indian Language Part-of-Speech Tagset: Bengali ~ 100K words of manually annotated Bengali text; Message Understanding Conference 7 Timed (MUC7_T) ~ timed annotation for named entities; Asian Elephant Vocalizations ~ 57.5 hours of audio recordings of vocalizations by Asian Elephants; NIST 2005 Open Machine Translation (OpenMT) Evaluation ~ source data, reference translations, and scoring software used in the NIST 2005 OpenMT evaluation; TRECVID 2006 Keyframes & Transcripts ~ keyframes extracted from English, Chinese, and Arabic broadcast programming
  • Linas' collection of NLP data Here is a collection of linguistic data, including a collection of parsed texts from Voice of America, Project Gutenberg, the simple English Wikipedia, and a portion of the full English Wikipedia. This data is the result of many CPU-years worth of number-crunching, and is meant to provide pre-digested input for higher order linguistic processing. Two types of data are provided: parsed and tagged texts, and large SQL tables of statistical correlations. The texts were dependency parsed with a combination of RelEx and Link Grammar, and are marked with dependencies (subject, object, prepositional relations, etc.), with features (part-of-speech tags, verb-tense and noun-number tags, etc.), with Link Grammar linkage relations, and with phrasal constituency structure.
  • LexChecker LexChecker is a web-based corpus query tool that shows how English words are used. Users submit a word into the query box (like a Google search) and LexChecker returns a list of the patterns in which the word is typically used. Each pattern listed for a word is linked to sentences from the British National Corpus (BNC) that show the word occurring in that pattern. The patterns are what we have dubbed 'hybrid n-grams'. These are a uniquely useful form of corpus search result. They can consist of a string of words such as keep a close eye on or gain the upper hand. Or they could contain substitutable slots marked by specific parts of speech, for example run the risk of [v-ing] or stand [noun] in good stead or [verb] a storm of protest (as in raise/spark/cause/create/unleash a storm of protest).
  • Forensic Linguistics Institute (FLI) Corpus of Texts Appeals, Blackmail and Extortion, Confessions, Death Row Final Statements, Declarations of War, Last Wills and Testaments, Miscellaneous, Statements by Police, Suicide Notes
  • David Lee's Corpus-based Linguistics LINKS "These annotated links (c. 1,000 of them) are meant mainly for linguists and language teachers who work with corpora, not computational linguists/NLP (natural language processing) people, so although the language-engineering-type links here are fairly extensive, they are not exhaustive..."
  • CORPORA List The CORPORA list is open for information and questions about text corpora such as availability, aspects of compiling and using corpora, software, tagging, parsing, bibliography, conferences etc. The list is also open for all types of discussion with a bearing on corpora.
  • Corpora for Language Learning and Teaching
  • [bnc] British National Corpus The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
  • AUE: The alt.usage.english Home Page This is the web site of the alt.usage.english newsgroup. Contains audio archive.
  • Project We Say Tomato
  • Corpus of Historical American English (COHA) COHA allows you to quickly and easily search more than 400 million words of text of American English from 1810 to 2009 (see details on corpus composition). You can see how words, phrases and grammatical constructions have increased or decreased in frequency, how words have changed meaning over time, and how stylistic changes have taken place in the language.
  • Speech Accent Archive The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph and are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers.
  • IDEA - The International Dialects Of English Archive IDEA was created in 1997 as a free, online archive of primary source dialect and accent recordings for the performing arts. Its founder and director is Paul Meier, author of the best-selling Accents and Dialects for Stage and Screen, a leading dialect coach for theatre and film, and a specialist in accent reduction.
  • American Rhetoric: The Power of Oratory in the United States Database of and index to 5000+ full text, audio and video versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, other recorded media events, and a declaration or two.
  • The AMI Meeting Corpus The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings.
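(As a side note, the 'hybrid n-grams' LexChecker describes above are easy to sketch: ordinary n-grams in which individual tokens are abstracted to part-of-speech slots. The tagged sentence below is hand-written toy data, not LexChecker's or the BNC's output.)

```python
# Minimal sketch of the 'hybrid n-gram' idea: for each n-gram, also emit
# variants where one position is replaced by its part-of-speech slot.
def hybrid_ngrams(tagged, n=4):
    """tagged is a list of (word, tag) pairs; returns a set of patterns."""
    out = set()
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        words = [w for w, _ in window]
        out.add(" ".join(words))                       # literal n-gram
        for j, (_, tag) in enumerate(window):
            slotted = words[:j] + [f"[{tag}]"] + words[j + 1:]
            out.add(" ".join(slotted))                 # one slot abstracted
    return out

tagged = [("run", "verb"), ("the", "det"), ("risk", "noun"), ("of", "prep")]
grams = hybrid_ngrams(tagged)
print("run the risk of" in grams)    # True
print("run the [noun] of" in grams)  # True
```

A real system would abstract multiple slots at once and rank patterns by corpus frequency, but this is the shape of the search result.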
(I have many more links gathered on delicious and Pinboard, too, if anybody is interested.)

The descriptions for each link have been copied from their respective websites.

posted by iamkimiam at 11:18 PM on January 24, 2011 [26 favorites]

Jesus X. Christ - can we swap that comment for the FPP? That's an amazing list.
posted by Paragon at 11:23 PM on January 24, 2011 [2 favorites]

To add to Kim's list, there are some more non-English corpora as well:
Ernestus Corpus of Casual Dutch
Nijmegen Corpus of Casual French
Nijmegen Corpus of Casual Spanish
Nijmegen Corpus of Casual Czech

One maintained by the LDC, linked by iamkimiam, that gets brought up a lot in some circles is the TIMIT Acoustic-Phonetic Continuous Speech Corpus, which has speakers reading off a number of sentences, rather than spontaneous productions.

My final addition, which is not a corpus, is for those of you into checking wordform and lemma frequencies in Dutch, German, and English via the web. WebCelex is an interface for Celex.
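(The wordform/lemma distinction those Celex queries rest on is easy to illustrate; the tiny lemma table below is invented for the example, not Celex data.)

```python
from collections import Counter

# Wordform frequency counts surface forms as they appear; lemma frequency
# collapses inflected forms onto a dictionary headword first.
lemma_of = {"run": "run", "ran": "run", "runs": "run", "running": "run"}
tokens = ["run", "ran", "runs", "running", "ran"]

wordform_freq = Counter(tokens)
lemma_freq = Counter(lemma_of[t] for t in tokens)

print(wordform_freq["ran"])  # 2
print(lemma_freq["run"])     # 5
```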
posted by knile at 4:17 AM on January 25, 2011 [1 favorite]

Take that, corpus.
posted by Faint of Butt at 9:13 AM on January 25, 2011

The link that effugas should have included: Google Books Ngrams on S3 (for AWS/EC2). I looked for this in December and it didn't exist then: I'm so happy to see it now.
posted by xueexueg at 11:43 AM on January 25, 2011

That list is amazing! Sadly, Project We Say Tomato is no longer accepting submissions.
posted by nirvan at 10:49 PM on January 25, 2011

Oh my. Oh my.
posted by cortex at 12:42 PM on January 26, 2011

« Older Welcome to Chicago.   |   US National Archives says historian tampered with... Newer »

This thread has been archived and is closed to new comments