also would make some good Desert Golf levels
September 20, 2019 1:40 PM

Hey, here's a methodology for a nice set of graphs visualizing the distribution of the letters a-z across the start, middle, and end of English words, using the Brown Corpus as a source. Quicker summary here.
posted by cortex (7 comments total) 16 users marked this as a favorite
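For anyone who wants to poke at the idea, here is a minimal sketch of the start/middle/end binning in Python. It is not the linked author's method: the function name `position_histogram`, the three-bin split, and the tiny word list are all assumptions for illustration. A real run would swap in a full word list such as the Brown Corpus.

```python
from collections import defaultdict

def position_histogram(words, bins=3):
    """Count how often each letter a-z lands in each relative-position bin.

    Hypothetical helper for illustration; the bin count and the treatment
    of one-letter words are assumptions, not the linked article's choices.
    """
    hist = defaultdict(lambda: [0] * bins)
    for word in words:
        letters = [c for c in word.lower() if "a" <= c <= "z"]
        n = len(letters)
        for i, c in enumerate(letters):
            if n == 1:
                b = bins // 2  # a one-letter word is counted as "middle"
            else:
                # Integer arithmetic maps index 0..n-1 onto bins 0..bins-1.
                b = min(i * bins // (n - 1), bins - 1)
            hist[c][b] += 1
    return dict(hist)

# Toy input only, taken from words mentioned in this thread.
sample = ["the", "toe", "bat", "brat", "aluminium"]
hist = position_histogram(sample)
for letter in sorted(hist):
    print(letter, hist[letter])
```

Each letter gets a list of [start, middle, end] counts, which could then be normalized and plotted. Weighting each word by its corpus frequency would be a natural next step, but it is left out here.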
 
This was very interesting, and the methodology was easy to follow. This is the kind of thing I think about when I space out in meetings. (I once compiled a list of letters whose pronunciation changes when you add an h, which led me to figure out what kind of mouth movements that requires, which then led me down the rabbit hole of characterizing mouth-sounds on Wikipedia.)
posted by FirstMateKate at 2:06 PM on September 20, 2019 [3 favorites]


I was curious which letters would have the most U-shaped distributions -- more common at both the beginning and the end than in the middle. Looks like it's maybe F, K, and S (I guessed S), but the funny part is that U seems to have the least-U-shaped distribution (tied maybe with V).
posted by straight at 4:25 PM on September 20, 2019 [1 favorite]
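One rough way to put a number on "U-shaped" is to compare the average of a letter's start and end shares against its middle share. The sketch below assumes you already have [start, middle, end] counts for each letter. The function name `u_shape_score` and the scoring formula are assumptions, and the example counts are made up rather than taken from the Brown Corpus.

```python
def u_shape_score(counts):
    """Score a [start, middle, end] count triple.

    Positive means the letter is relatively more common at the edges,
    negative means it is concentrated in the middle. This formula is one
    possible choice for illustration, not an established measure.
    """
    start, middle, end = counts
    total = start + middle + end
    if total == 0:
        return 0.0
    return (start + end) / (2 * total) - middle / total

# Made-up counts with placeholder names, purely to show the ranking step.
toy_counts = {
    "edge_heavy": [40, 20, 40],
    "middle_heavy": [10, 80, 10],
    "flat": [30, 30, 30],
}
ranked = sorted(toy_counts, key=lambda k: u_shape_score(toy_counts[k]), reverse=True)
print(ranked)
```

Sorting real per-letter counts by this score would give a quick check on the F, K, S guess without eyeballing the graphs.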


Recently I made a numerical version of something similar, using Scrabble dictionary words. I've been making such things as part of learning Python.

Here's an imgur image of my table for word lengths 2 to 7, in the Collins 2019 tournament dictionary. The data was compiled using Python and plopped into my now-kinda-old version of Excel, with color-scale highlighting for relative distribution of letters within each word length.

(I have done a bunch of scripting around analyzing the Scrabble tournament dictionary words. If anyone is interested in this type of thing, please memail me. It'll be something I've either already done, or would be happy to do as part of my programming exercises.)
posted by sylvanshine at 7:31 PM on September 20, 2019 [1 favorite]
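A table like the one described above (letter counts by position, grouped by word length) can be sketched in a few lines. This is not sylvanshine's script: the function name `letter_position_table` and the sample words are assumptions, and the Collins word list is not included, so a real run would load it from a file.

```python
from collections import Counter, defaultdict

def letter_position_table(words, min_len=2, max_len=7):
    """Map (word_length, position) to a Counter of letters seen there.

    Positions are 1-based. Hypothetical helper for illustration only.
    """
    table = defaultdict(Counter)
    for word in words:
        w = word.upper()
        if min_len <= len(w) <= max_len and w.isalpha():
            for pos, letter in enumerate(w, start=1):
                table[(len(w), pos)][letter] += 1
    return table

# Toy word list; a real run would read the dictionary from a file.
sample_words = ["AT", "TO", "CAT", "TOE", "BRAT", "ALUMINIUM"]
table = letter_position_table(sample_words)
for (length, pos), counts in sorted(table.items()):
    print(length, pos, dict(counts))
```

Writing the rows out with the `csv` module would make the table easy to drop into a spreadsheet for color-scale highlighting.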


This is very cool.

It’d be a gargantuan task but I’d love to see this done for sound, with words broken up into syllable onsets, nuclei and codas.

You could then do some very loose matching up of shapes. It’d get impossible fast, but it’d be neat to see which letters and phonemes cause the most havoc. For example, the phonemes and letters in ‘bat’ would be pretty comparable, but words like ‘brat’ or ‘bar’ or ‘cat’ get more complicated because of clusters, accents or sound-to-letter correspondences. And multisyllabic words like ‘aluminium’ are an absolute minefield.

I also wonder how a British corpus would shift his results, with the different distribution of s and z, and the higher prevalence of u. Or a modern media corpus, like Wikipedia, with all that wacky slang the kids use today.

I like how ‘the’ is the most common but ‘toe’ is the most representative word. I wonder what the most common and representative words are in speech and how different they are in frequency and form.
posted by iamkimiam at 11:53 PM on September 20, 2019 [1 favorite]


Incredibly useful for some card games I'm developing. Thank you.
posted by Hogshead at 6:20 PM on September 21, 2019


This is terrific. Thanks for sharing!
posted by Conrad Cornelius o'Donald o'Dell at 9:13 PM on September 21, 2019


What a lovely discussion of the choices that one has to make when constructing a data visualization.
posted by fantabulous timewaster at 11:40 AM on September 22, 2019



