A Corpus of Corpora
September 12, 2015 12:33 PM

corpora is a GitHub repository containing machine-readable lists of interesting words and phrases that "are potentially useful in the creation of weird internet stuff." The corpora range from the mundane (common English words, animals, corporations, pizza toppings) to the obscure (types of knot, wrestling moves, Lovecraftian deities) to the absurd (states of drunkenness, deceased Spinal Tap drummers, unrhymable words).
posted by schmod (40 comments total) 59 users marked this as a favorite
 
Doesn't list sources, and claims CC0. This isn't doing Open Data very well.
posted by scruss at 1:00 PM on September 12, 2015


If you want some more 'serious' corpora, Mark Davies has a list of very big ones. Although, I'm not sure if my favorite there, The Corpus of American Soap Operas, qualifies as serious.
posted by mr.ersatz at 1:00 PM on September 12, 2015 [5 favorites]


Doesn't "assuage" rhyme with "beige"?
posted by alasdair at 1:03 PM on September 12, 2015


And "page" and "mage"....
posted by Greg_Ace at 1:38 PM on September 12, 2015


Do I alone have faith
There is a rhyme for 'eighth'?
posted by hexatron at 1:51 PM on September 12, 2015 [1 favorite]


i thought coming up with the word lists was the fun part of making a Twitter bot
posted by indubitable at 1:53 PM on September 12, 2015


Goth rhymes with doth.
posted by Ursula Hitler at 2:18 PM on September 12, 2015


i thought coming up with the word lists was the fun part of making a Twitter bot

It is sometimes, and sometimes, it isn't. Depends on the bot! Corpora, in part, powers @monstersubtypes (source). Without it, it would have taken much longer to make (maybe so long I wouldn't bother), and it would have ended up being more predictable.
posted by ignignokt at 2:18 PM on September 12, 2015 [1 favorite]


> Doesn't "assuage" rhyme with "beige"?

No (assuming you use the Frenchy zh-sound in "beige"), it rhymes, as Greg_Ace says, with "page" and "mage"; at least that's the official pronunciation. Apparently the version with zh is an accepted alternate form; I guess a lot of people have the impulse to pronounce it that way, for reasons that escape me. Does "assuage" sound French to people?
posted by languagehat at 2:44 PM on September 12, 2015 [1 favorite]


Thank you for this post and the other suggestions in the comments. I like the idea of rolling Twitter bots, but I can't always get them to that right balance between functional and funnily broken. Having more interesting corpora should seed my bots better.

Here's one I made recently, which has ugly code you should not look at. Its seed corpus was a list of common words in Google searches, if I recall right.
posted by mccarty.tim at 2:45 PM on September 12, 2015


This is one of my favorite repositories. I added English honorifics.

A recent botanical commit.
posted by jjwiseman at 3:34 PM on September 12, 2015 [1 favorite]


languagehat: "Doesn't "assuage" rhyme with "beige"?

No (assuming you use the Frenchy zh-sound in "beige"), it rhymes, as Greg_Ace says, with "page" and "mage"; at least that's the official pronunciation. Apparently the version with zh is an accepted alternate form; I guess a lot of people have the impulse to pronounce it that way, for reasons that escape me. Does "assuage" sound French to people?
"

I'm saying "assuage" and "beige" to myself over and over trying to figure out why someone would think they don't rhyme. Surely I've heard someone somewhere speak the word "assuage" in a sentence? Is it another one of those words for which I've made up my own pronunciation because I've only ever seen it in print? Is it an accent thing? How are you supposed to pronounce it? Now I'm paranoid I can't use the word "assuage" in conversation lest I make an ass of myself. Has everything I've been taught about beige been a lie?
posted by double block and bleed at 3:54 PM on September 12, 2015


If "assuage" and "beige" rhyme for you there are three possible explanations I can think of.

1) They don't rhyme for you, and you're mistaken

2) You pronounce both with a final /dʒ/, so that they also rhyme with "page" and "mage"; this is the same sound that begins the word "jail"

3) You pronounce both with a final /ʒ/, which is the Frenchy zh-sound, and is the same as the middle consonant in most Americans' pronunciation of "measure."

I would guess that the most likely explanation for you is (3). People that don't rhyme them will probably use the Frenchy zh-sound for "beige," and the other sound (the one in "page" and "mage") for "assuage."
posted by Kutsuwamushi at 4:27 PM on September 12, 2015 [3 favorites]


My old-school pronunciation key (New Oxford American Dictionary that ships with Mac OS 10.10) shows

assuage | ə'swāj |
beige | bāZH |


The answer to whether these two words rhyme is complicated. If we're talking simple rhyme (vowel) then, yes, these words rhyme.

The consonants are a bit different but to most speakers of American English the endings of those two words are virtually indistinguishable.

Now, if you're talking language poetry, then basically "rhyme" takes very non-intuitive meanings which include everything from orthography, typography, pagination, etc.

All that said, those two words rhyme for me, with the exception languagehat carves out.
posted by mistersquid at 4:31 PM on September 12, 2015 [1 favorite]


The reason I like this repo so much is because it's so simple, and still enabling & powerful. It's like the BASIC of natural language or semantic databases, or something. Instead of dealing with all the complexity of WordNet or NLTK or OpenNLP or Freebase, it's like, here's a list of planets or TV shows or Spinal Tap drummers: Go! Make a bot or something! It makes it easier for an interested amateur to just make a thing.
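
To make the "here's a list: Go!" idea concrete, here is a minimal sketch in Python. The inline JSON is a made-up stand-in, and the key name `planets`, the helper names, and the phrase template are all assumptions for illustration; the actual files in the repo may be laid out differently.

```python
import json
import random

# Hypothetical stand-in for one corpus file. The keys and entries here are
# illustrative assumptions, not copied from the repository.
SAMPLE = '{"description": "Example list", "planets": ["Mercury", "Venus", "Earth", "Mars"]}'


def load_items(raw, key):
    """Parse a JSON document and return the list stored under `key`."""
    data = json.loads(raw)
    items = data[key]
    if not isinstance(items, list):
        raise ValueError(f"expected a list under {key!r}")
    return items


def make_phrase(items, rng):
    """Drop one randomly chosen item into a fixed template (a toy 'bot')."""
    return f"Greetings from {rng.choice(items)}!"


planets = load_items(SAMPLE, "planets")
phrase = make_phrase(planets, random.Random(0))  # seeded so runs are repeatable
print(phrase)
```

Swapping in a different list and a different template is the whole trick; nothing here needs a parser, a tagger, or a database.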
posted by jjwiseman at 5:04 PM on September 12, 2015


I was wondering if I was wrong about goth and doth. Like, maybe doth is supposed to be pronounced like doath, or doathe. But then the rhymes would be oath or loathe. Any way you say it, there are rhymes for doth. (In addition to goth, there's also moth, froth, cloth, sloth and the ice planet Hoth.)
posted by Ursula Hitler at 5:16 PM on September 12, 2015


This is the weirdest derail ever.
posted by schmod at 5:39 PM on September 12, 2015 [3 favorites]


Metafilter challenge: Everyone make a bot using corpora!
posted by jjwiseman at 5:41 PM on September 12, 2015


Also see Fifteen Thousand Useful Phrases
posted by Confess, Fletch at 6:00 PM on September 12, 2015 [5 favorites]


What kind of weird internet stuff can someone create with these lists?
posted by wrabbit at 6:05 PM on September 12, 2015


This is the weirdest derail ever.

If you mean people pointing out that there are words with rhymes on the list of words that supposedly have no rhymes, how is that a derail? What rails should the thread be on?
posted by Ursula Hitler at 6:41 PM on September 12, 2015 [2 favorites]


So I'm a person who learned many words through reading without being curious enough to learn how they're pronounced. I have just learned that just because a word contains "sau" does not mean that "sausage" is a good word to base your pronunciation of it on.

Great...yet more shame!
posted by maxwelton at 7:10 PM on September 12, 2015


What rails should the thread be on?

Pale snail rails?
posted by Greg_Ace at 7:14 PM on September 12, 2015


They'll mail frail shale and kale-veiled quail to the pale snail rails in jail.

I contributed the curds corpus (from this project) and made this disgusting thing with the body fluids and music genres corpora.
posted by moonmilk at 7:19 PM on September 12, 2015 [3 favorites]


Using shm-reduplication don't all words have at least one rhyming twin?

angel - schmangel
beige - schmeige
batman - schmatman
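
A rough sketch of that rule in Python, assuming the simplest spelling-based version: swap everything before the first vowel letter for "schm". The function name `shm` and the vowel set are illustrative assumptions, and real shm-reduplication works on sounds, not letters, so this will misfire on some words.

```python
VOWELS = "aeiou"


def shm(word):
    """Replace the leading consonant letters of `word` with 'schm'.

    Words that start with a vowel just get 'schm' prefixed. Words with no
    vowel letters at all are also just prefixed, since there is no onset
    to strip.
    """
    lower = word.lower()
    i = 0
    while i < len(lower) and lower[i] not in VOWELS:
        i += 1
    if i == len(lower):
        return "schm" + lower
    return "schm" + lower[i:]


for w in ["angel", "beige", "batman"]:
    print(w, "-", shm(w))
```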
posted by urbanwhaleshark at 7:29 PM on September 12, 2015 [1 favorite]


"Pint" rhymes quite satisfactorily with "another pint" though I usually add "please" because my mama raised me right.
posted by Greg_Ace at 9:08 PM on September 12, 2015 [2 favorites]


I was wondering if I was wrong about goth and doth. Like, maybe doth is supposed to be pronounced like doath, or doathe. But then the rhymes would be oath or loathe. Any way you say it, there are rhymes for doth.

I think doth is pronounced duth, like with a schwa sound for the o.
posted by aka burlap at 11:10 PM on September 12, 2015 [2 favorites]


That is how I would say doth, yes. But then I'm Australian so this discussion is reading a bit weirdly.
posted by deadwax at 11:27 PM on September 12, 2015 [1 favorite]


the Frenchy zh-sound

Mefi username available
posted by iotic at 2:13 AM on September 13, 2015 [1 favorite]


I think doth is pronounced duth

Bismuth, then! Even pronounced like duth, doth STILL has a rhyme! (Also duth rhymes with Earth, if you're fancy.)
posted by Ursula Hitler at 6:00 AM on September 13, 2015


rhymeless_words.json is less than 0.1% of this repository, but is the subject of more than 50% of the comments on this post. I mean, damn, people, we have a list of almost 2000 curds here!

I'll just leave you with that, and this oprah_quote: "Energy is the essence of life. Every day you decide how you're going to use it by knowing what you want and what it takes to reach that goal, and by maintaining focus."
posted by jjwiseman at 9:24 AM on September 13, 2015 [2 favorites]


I think we've chosen our goal and maintained focus quite well, thank you very much.
posted by Greg_Ace at 10:04 AM on September 13, 2015


Aha! No, I was doing the zh sound in "assuage", probably primed by "beige", but I would normally pronounce it like "mage", so I was incorrect in my comment above. As you were!
posted by alasdair at 1:06 PM on September 13, 2015


You know, they accept pull requests. You could submit a patch for the rhyming issues, or any other issue for that matter.

Trying to think of a different way to use the corpora, I wrote a tool that attempts to automatically classify metafilter usernames by finding which corpora they seem to reference. I ran it over the entire list of mefi users to get some idea of the "most interesting" corpora: e.g. only Styrofoam references the "plastic brands" corpus, while hundreds or thousands of users reference "english stop words".

Some examples from this thread:

deadwax: "A list of words that naturally complete the phrase 'They were feeling...'."
aka burlap: "fabrics"
urbanwhaleshark: "List of household objects", data/animals/common.json (not all corpora have descriptions)
wrabbit: data/animals/common.json, "List of household objects"

5 users have names that refer to a popular strain of cannabis: A-Train, Blue Buddha, Mango, Nebula, and Satori.
6 usernames refer to toxic chemicals: maldrin, phenylphenol, PlutoniumX, Sarin Bellum, Sarine, strychnine.
3 usernames refer to sandwiches: Barbecue, Tavern, Tuna.
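
For anyone curious how a classifier like this could work, here is a minimal sketch in Python. It is not the actual tool: the tiny `CORPORA` dictionary, the function names, and the plain substring matching are all assumptions for illustration. Substring matching is crude and will over-match on short entries.

```python
from collections import Counter

# Hypothetical miniature corpora; names and entries are illustrative only.
CORPORA = {
    "fabrics": ["burlap", "denim", "silk"],
    "animals": ["rabbit", "whale", "shark"],
    "toxic chemicals": ["strychnine", "sarin"],
}


def classify(username, corpora):
    """Return the sorted names of corpora with an entry found in `username`."""
    name = username.lower()
    matches = []
    for corpus_name, entries in corpora.items():
        if any(entry.lower() in name for entry in entries):
            matches.append(corpus_name)
    return sorted(matches)


def corpus_counts(usernames, corpora):
    """Count how many usernames reference each corpus."""
    counts = Counter()
    for username in usernames:
        counts.update(classify(username, corpora))
    return counts


users = ["aka burlap", "urbanwhaleshark", "wrabbit", "strychnine", "schmod"]
counts = corpus_counts(users, CORPORA)
for corpus_name, n in sorted(counts.items()):
    print(corpus_name, n)
```

Sorting the counts from rarest to most common gives one rough notion of the "most interesting" corpora: the ones that very few usernames reference.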
posted by jjwiseman at 6:09 PM on September 13, 2015 [4 favorites]


Oops, make that 186 usernames that reference strains of cannabis, 426 usernames that reference sandwiches.
posted by jjwiseman at 7:58 PM on September 13, 2015


User "hare's breath": The only username that references "Commonly mistaken English phrases most likely caused by hearing them rather than reading them (eggcorns)"
posted by jjwiseman at 8:01 PM on September 13, 2015


I learned how to correctly pronounce "assuage" today, so this thread was worth it.
posted by Chrysostom at 9:05 PM on September 13, 2015


I think doth is pronounced duth

Bismuth, then! Even pronounced like duth, doth STILL has a rhyme! (Also duth rhymes with Earth, if you're fancy.)


I most often hear "doth" with the STRUT vowel.

Sometimes I hear it with the CLOTH vowel, such that it rhymes with "goth."

I've never heard it with the FOOT vowel, which is what it would take for it to rhyme with my pronunciation of "bismuth." But maybe I pronounce "bismuth" funny?
posted by nebulawindphone at 5:20 AM on September 14, 2015


I'd never heard it pronounced until I checked it with a couple of online sources. Their UK accented robot voices pronounced it bis-MUTH, FWIW.
posted by Ursula Hitler at 6:26 AM on September 14, 2015


This belated mention of a zombie film may seem like a non-sequitur, but only if you don't check with her.

Check with her, with her, with her, wither… whether.

Whether what? Wait, wait. Something's wrong here. Something's…

Pontypool is a low-budget zombie film with very little gore and with a preternaturally powerful narrative structure / delivery mechanism. I was COMPLETELY psychoanalytically freaked out the night/morning I watched it.

And if you watch it you'll understand how it relates to this thread.

Hoo boy, will you realize.

posted by mistersquid at 8:30 PM on September 24, 2015




This thread has been archived and is closed to new comments