Fun with n-grams and the internet's other discussion site
November 22, 2015 9:01 PM

You may have heard about n-grams, which are sequences of n consecutive words counted across a large corpus (a 3-gram could be "plate of beans"). You have probably played with Google Ngram search, which lets you look through millions of books to see the first use of a phrase, or when it was most popular (though be warned, recent research shows some limitations, such as the false popularity of a certain expletive in the 1700s). The newest is the Reddit ngram search by 538, which lets you chart the rise and fall of things progressive and regressive. I await more insights in the discussion...
posted by blahblahblah (20 comments total) 20 users marked this as a favorite
 
So 2012 was peak friendzone. Hm.
posted by Cash4Lead at 9:05 PM on November 22, 2015


If you search for "dickbutt", the graph kind of looks like a dick.
posted by fontor at 9:12 PM on November 22, 2015


This is neat, even if it's a little depressing to see "SJWs" get talked about as much as income inequality.

I regrettably note the spike in "MRA" in late May 2014, probably thanks to the Elliot Rodger shootings.
posted by JauntyFedora at 9:25 PM on November 22, 2015


> though be warned, recent research shows some limitations

If only this massive record of human writing wasn't more-or-less controlled by a single for-profit company, researchers might be able to develop new techniques for analyzing the data that would be more useful, say by attempting to classify works by type and creating separate ngram corpora for scientific vs literary works or by attempting to detect and filter out misdated works.
posted by zachlipton at 9:30 PM on November 22, 2015 [1 favorite]


> If only this massive record of human writing wasn't more-or-less controlled by a single for-profit company,

You can get a 450 million word corpus from BYU, plus another billion or so if you want to use their scraped online texts.

Ultimately, while the breadth of Google is cool, there are lots and lots of corpuses out there. (I know people using the Enron email corpus for research, for example). They just don't have the nice interface.
posted by blahblahblah at 9:37 PM on November 22, 2015 [3 favorites]


And, I forgot to mention, you can actually download the Google Ngram corpus, too.
posted by blahblahblah at 9:40 PM on November 22, 2015 [5 favorites]


And if you're so inclined, you can get frequency data for around 500 million words of Metafilter too. I really need to run more recent years on that, my code's just a little dodgy and manual so I've been putting it off.

I'd love to do 2-gram and 3-gram tables as well but that gets to be a real bear in terms of computation and memory footprint. One of these days.
posted by cortex at 10:01 PM on November 22, 2015 [12 favorites]
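
For anyone wondering what the 2-gram and 3-gram tables mentioned above would involve, here is a minimal sketch in Python. It is not cortex's code: the function name `count_ngrams` and the lowercase, whitespace-split tokenization are assumptions made for illustration. It also hints at why memory becomes a problem, since the number of distinct keys in the table grows quickly as n goes up.

```python
from collections import Counter, deque


def count_ngrams(lines, n):
    """Count word n-grams across an iterable of text lines.

    Hypothetical helper: lowercases and splits on whitespace, and does
    not let an n-gram span two lines.
    """
    counts = Counter()
    for line in lines:
        window = deque(maxlen=n)  # sliding window of the last n words
        for word in line.lower().split():
            window.append(word)
            if len(window) == n:
                counts[tuple(window)] += 1
    return counts


sample = [
    "a plate of beans",
    "another plate of beans please",
]
trigrams = count_ngrams(sample, 3)
print(trigrams[("plate", "of", "beans")])  # 2
```

Because the function takes any iterable of lines, a file handle can be passed in directly so the text itself never has to sit in memory; the table of counts is what ends up being large.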


I gather you can figure out some charts or some such with this stuff but I read all this and all I heard was some Peanuts-style WAHWAHWAH WAHWAH WAAAH WAH.

What Cortex just said sounds good, though. I mean, go team.
posted by fluffy battle kitten at 12:08 AM on November 23, 2015


> And, I forgot to mention, you can actually download the Google Ngram corpus, too.

You can download the corpus (it would be nice if they bothered to update it, as the linked article points out), but that's only the processed word-level data. You can't, for instance, try to classify the source texts by category and then pull n-grams by category of text. You also can't tell what words commonly come after other words.
posted by zachlipton at 12:11 AM on November 23, 2015


Holy shit, Ron Paul was a big fuckin' deal on Reddit! Unbelievably so, to the point that only Obama got more attention.

Reddit-clone Voat was a thing about as long as people were mad at Reddit CEO Ellen Pao. Surprise, surprise.

"Thug" as racist dogwhistle.

Less serious stuff - Starcraft: ded gaem or deadest game?

The meteoric rise of shitposts: could they be linked to the insidious threat of anime? We may be reaching a crisis point.
posted by knuckle tattoos at 12:40 AM on November 23, 2015 [2 favorites]


Convergence: W(H)OA(H)
posted by ardgedee at 4:40 AM on November 23, 2015


Happy v Merry
posted by Rock Steady at 5:05 AM on November 23, 2015 [1 favorite]


We are disappearing.
posted by Rock Steady at 5:09 AM on November 23, 2015 [2 favorites]


Sudden and mysterious spike in users saying "please" and "thank you" in late 2013, with more gradual increases since then. Was that a particularly polite time for reddit users?

Also THX, TBH, IDK and IMO have all been gradually gaining ground.
posted by subdee at 5:11 AM on November 23, 2015


Given the epic copyright battle that Google has been fighting until recently, it wouldn't surprise me if the project had been paused to keep from rocking the boat.

OTOH, knowing Google, it could also be because the lead engineer got bored and moved on to another project.
posted by CheeseDigestsAll at 6:05 AM on November 23, 2015 [1 favorite]


> Was that a particularly polite time for reddit users?

Oh, please.
posted by ChurchHatesTucker at 7:18 AM on November 23, 2015 [1 favorite]


Rock Steady - given the low percentages to begin with I'd say it's likely less Metafilter disappearing and more the amount of other things being talked about going up a lot.
posted by Wretch729 at 7:53 AM on November 23, 2015 [1 favorite]
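
The point about percentages can be made concrete with made-up numbers. The figures below are hypothetical and not taken from the Reddit data; they only show that a term's share of all words can fall while its raw count stays flat.

```python
# Hypothetical yearly totals: mentions of a term, and all words posted.
mentions = {2010: 1000, 2012: 1000, 2014: 1000}
total_words = {2010: 1_000_000, 2012: 4_000_000, 2014: 10_000_000}

# Share of all words, as a percentage, for each year.
share = {year: 100 * mentions[year] / total_words[year] for year in mentions}

for year in sorted(share):
    print(year, f"{share[year]:.3f}%")
```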


Yeah, I'm sure that's the case, but it's still disappearing, whether it's down a drain or under a flood.
posted by Rock Steady at 9:09 AM on November 23, 2015


> And, I forgot to mention, you can actually download the Google Ngram corpus, too.

That's sort of true. The corpus data only contains n-grams that occur >40 times. On the one hand, this makes the dataset a more manageable size (and even so it's very large). On the other hand, though, it makes the data useless for a wide variety of purposes, including using it as input to any kind of language model. I doubt that's an accident.
posted by dendrochronologizer at 12:06 PM on November 23, 2015
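
The cutoff described above is easy to picture with a toy example. This sketch assumes a hypothetical `apply_min_count` helper and made-up counts; the threshold of 40 is the figure quoted in the comment, not something verified here. Dropping rare n-grams shrinks the table, but it also throws away the long tail, which is the part a language model would need in order to estimate probabilities for rare sequences.

```python
from collections import Counter


def apply_min_count(counts, min_count):
    """Keep only n-grams seen more than min_count times (hypothetical helper)."""
    return Counter({gram: c for gram, c in counts.items() if c > min_count})


# Made-up counts for illustration only.
counts = Counter({
    ("plate", "of", "beans"): 120,
    ("of", "beans", "please"): 41,
    ("overthinking", "a", "plate"): 7,
})

kept = apply_min_count(counts, 40)
lost_mass = sum(counts.values()) - sum(kept.values())
print(len(kept), lost_mass)  # 2 7
```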


> Sudden and mysterious spike in users saying "please" and "thank you" in late 2013, with more gradual increases since then. Was that a particularly polite time for reddit users?

I have a theory . . .

Nah, I don't know. But wouldn't it be interesting if "please" and "thank you" gained popularity because Redditors started directly appealing to companies like Riot and Valve after they responded to complaints and open letters on social media? It crossed my mind because Diretide was also late 2013 - almost certainly a coincidence.
posted by knuckle tattoos at 1:16 PM on November 23, 2015



