To say Twitter is colloquial is putting it lightly.
May 20, 2010 7:02 AM

Lexicalist attempts to be 'a demographic dictionary of modern American English.' Here's how it works. Lexicalist's developer David Bamman goes into greater detail at Language Log.

Bamman has also worked on The Dynamic Lexicon, a project to automatically generate bilingual dictionaries using parallel texts.
posted by shakespeherian (28 comments total) 18 users marked this as a favorite

 
This web site has about a thousand entries for variations on hahahah.
posted by Rory Marinich at 7:11 AM on May 20, 2010


The sheer volume of data, however, gives us the flexibility to focus more on precision than on overall accuracy - we can throw away all tweets where we aren't over 99% sure of the physical location.

With this disambiguated data, we can map the usage of words and phrases across the US by normalizing each word's count by the volume of total data coming out of each state (to avoid biasing the statistics toward populous states such as New York and California).


Yes, the last thing you want to do (after using a non-random data source and throwing away a non-random subselection of it) is to bias your data!
posted by DU at 7:16 AM on May 20, 2010
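
For anyone wondering what the per-state normalization quoted above might look like, here is a minimal sketch. It assumes each state's raw count for a word is divided by that state's total token volume, and that the resulting rates are then rescaled so they sum to 100% across states. The rescaling step, the function name `normalized_shares`, and all the numbers are illustrative assumptions, not Lexicalist's actual code or data.

```python
from typing import Dict


def normalized_shares(word_counts: Dict[str, int],
                      total_tokens: Dict[str, int]) -> Dict[str, float]:
    """Turn raw per-state counts of one word into volume-normalized shares.

    Hypothetical helper: the per-state rate is count / total tokens, and the
    rates are rescaled to sum to 1 (the rescaling is an assumption).
    """
    rates = {}
    for state, total in total_tokens.items():
        if total <= 0:
            # No data volume for this state, so no rate can be computed.
            continue
        rates[state] = word_counts.get(state, 0) / total

    rate_sum = sum(rates.values())
    if rate_sum == 0:
        # The word never occurs anywhere: every share is zero.
        return {state: 0.0 for state in rates}
    return {state: rate / rate_sum for state, rate in rates.items()}


# Made-up numbers: populous states have more raw hits but a lower rate.
word_counts = {"California": 400, "New York": 300, "Montana": 20}
total_tokens = {"California": 4_000_000, "New York": 3_000_000, "Montana": 50_000}

shares = normalized_shares(word_counts, total_tokens)
for state, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {share:.1%}")
```

With these invented inputs, the small state with few raw hits comes out on top, which is the effect the normalization is meant to allow. It also means a handful of tweets from a low-volume state can swing that state's share a lot.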


On a whim I typed in "jew," and South Dakota inexplicably leads the pack at 10.6%, followed by West Virginia at 5.3%. Related words suggested are "drummer," "jls," "balloon," "trapped," "dull" and "shark."

Ooookay.
posted by Bromius at 7:18 AM on May 20, 2010


New Mexico represents 26% of all uses of 'gazillion.'
posted by shakespeherian at 7:23 AM on May 20, 2010


This is interesting, thanks! I sent the link to all my linguistics- and history-of-style-teaching friends.
posted by FelliniBlank at 7:24 AM on May 20, 2010


Related words for "fml":
-_-
-__-
ughh
ugh
ahhh
o_o

I think it works!
posted by escabeche at 7:27 AM on May 20, 2010


"keyword: schlep
We don't have any demographics for that word - try another!"

Feh.
posted by ROU_Xenophobe at 7:29 AM on May 20, 2010 [1 favorite]


Conservative institutions popular topic with older men in conservative states.

Liberal institutions popular topic with older men in conservative states.

(I had a surprisingly hard time even thinking of any liberal institutions. Come on, American Left. Get with the program.)
posted by DU at 7:29 AM on May 20, 2010


That "IMHO" is 72.1% male is so, so right. That the first related word is "online dating" is so, so wrong.
posted by escabeche at 7:31 AM on May 20, 2010


I got a 91.6% male with "ubuntu". Can anyone beat that with something more skewed?
posted by DU at 7:34 AM on May 20, 2010 [1 favorite]


Ahh, so this is based on the written rather than the spoken word. My lexicographic quest for information on the provenance and current use of the word "seent" (past tense of "seen" ((LOL)), observed in the wild in Philadelphia) continues.
posted by Mister_A at 7:41 AM on May 20, 2010


No sign of either Cthulhu or twinks.

Cubs and Bears seem to have been vastly overshadowed by sports fans in Chicago.
posted by ursus_comiter at 7:45 AM on May 20, 2010


Hmmm, Montana is four times more special than any other state.
posted by plastic_animals at 7:46 AM on May 20, 2010 [1 favorite]


http://www.lexicalist.com/search.cgi?s=crab&d=map

I got exactly what I expected with the first two results for this one.
posted by codacorolla at 7:49 AM on May 20, 2010


Damn it! This is the sort of thing that will keep me busy for hours...

I wonder why "stamps, dove, coma, basket, filipino and glamour" are the related words for 'gourmet'?
posted by plastic_animals at 7:57 AM on May 20, 2010


I've been putting football teams into the search, and for some teams it pops out their home state as the leader, and for others it doesn't...

Works:
ravens
browns
colts

Doesn't:
cardinals (tho the winner, Missouri, has a minor league team by the same name)
giants
panthers (the Iowa Panthers are a college basketball team)

It's really neat to watch states light up around the ones that it does work on. For example, the Dakotas, which don't have teams of their own, light up when searching for the Vikings, the team of neighboring Minnesota.
posted by codacorolla at 8:01 AM on May 20, 2010


cardinals (tho the winner, Missouri, has a minor league team by the same name)

/sound of a barstool getting pushed back abruptly

I'm sorry, could you say that again? I think I misheard you.
posted by Horace Rumpole at 8:05 AM on May 20, 2010 [2 favorites]


I can't decide if you're an Arizona Cardinal fan who's mad that they don't have a larger share in AZ, or a MO Cardinal fan who's mad that's not the first thing that popped into my head.
posted by codacorolla at 8:10 AM on May 20, 2010


I hate to keep replying to this thread, but I just remembered the St. Louis Cardinals. Hahahahaha, sorry. I don't really follow baseball, obviously.
posted by codacorolla at 8:11 AM on May 20, 2010


1 Hawaii 18.4%
2 California 8.4%
3 Oregon 7.4%
4 Colorado 6.4%
5 Tennessee 4.7%
6 Arizona 4.4%
7 Virginia 4.4%
8 Missouri 4.3%

So I guess surfers really do say gnarly.

Missouri?
posted by eye of newt at 8:12 AM on May 20, 2010


36.2%

What is it with Arkansas and 'bollocks'? (Related words include: 'twat', 'bastard', 'cock', and 'goldman sachs')
posted by robself at 8:14 AM on May 20, 2010


Hahahahaha, sorry. I don't really follow baseball, obviously.

/Sits back down, buys codacorolla a beer.
posted by Horace Rumpole at 8:16 AM on May 20, 2010


Ron Paul:

1 Connecticut 12.3%
2 Idaho 8.2%
3 Alaska 7.3%
4 North Dakota 6.0%
5 South Carolina 5.2%
posted by T.D. Strange at 9:00 AM on May 20, 2010


This is fascinating and I really love the aim of the project and how it's being implemented. However, there are a couple of unexplained bits that are either skewing the data, misrepresenting it, or producing really narrow results to account for really broad assumptions.

First off, gender in Computer Mediated Communication (CMC) is NOT as simple as determining who falls into which bucket and that's that. Studies have found that people ascribe to and use the gendered features of the CMC genre they're participating in (perpetuating the gendered style or gendered stereotypes of that medium/genre), regardless of their actual gender. So you might get more 'David's' in usernames on a site where the discourse style and/or demographics is male-dominated, is perceived as either of these things, or uses features that people also associate with male-ness. Not to mention the possibility of other gender categories besides male or female. Or the NLP techniques used here (I'm not questioning the methodology...I just don't know what it is and determining gender by teasing apart user names or use of pronouns in writing samples is way more complicated than one would think...If I'm quoting someone, if my username can be parsed differently, if I'm being vague/ironic/deceptive...all these things can complicate the assessment of my actual biological sex vs. my purported gender identity that I want the world to see...and so which are we measuring here?).

Another thing about the assessment of gender that is bugging me too though...if he used something that makes an assessment of gender, based on a probability, using his data as, well, the data, then aren't we in a loop here? Especially considering the problem mentioned above about CMC writing being gendered, regardless of actual gender (and there are other problems with assessing or describing gender in CMC, but I don't want to write an essay here). The same circularity concern goes for age and geography too.

For age, there are a couple of complicating factors in CMC...Susan Herring, in her 2008 article on online youth identity, sums up one such problem, "Past research on youth may help to shed light on the kinds of behaviors that young people can be expected to outgrow. For example, sociological research has found that sociability is greatest among adolescents and young adults, and decreases over the life course. All else being equal, this suggests that one should interpret observed differences in digital sociability between younger and older users as life-stage related, rather than as indicating an ongoing change in the direction of increased sociability for all digital media users." If we are measuring variables that can represent different stages in youth and identity development ('twilight'), with respect to what's popular or relevant - in general and to certain populations - then we need to take sociability and age grading into account here. It's just simply not a straight line where the results of 'twilight' indicate broad categories, such as 'these people' or 'this age group', etc. If so, then you might assume that 25-35 year olds in technologically connected places like Silicon Valley just LOVE Facebook and Gizmodo...what with all the hype about it on Twitter lately. Point is, we need to look at these trends and results in context. Third wave variationist sociolinguistics addresses this specifically.

This brings me to another point: Twitter is a very, very specific CMC genre, dominated by a certain age group (which is (not?) surprisingly NOT 14-18 year olds), constrained by several things (including message space), focused on certain topics (what's 'interesting' or what people are 'doing', etc.), available to and used by a limited population, and largely one-sided. It is important to study, to be sure, but it is not conversation like RL chatter, it is not writing like a novel or even a blog, and so it needs to be contextualized as such.

Also, in this context, it is a huge problem to throw away the data you can't categorize by gender, age, or geography, because that data is going to carry some weight.

Lastly, polysemy presents a problem. It's somewhat addressed with 'pop' but man, I want to know more! How is polysemy accounted for and teased apart? Or are the results not supposed to be that fine grained? It gets me thinking about something that has occurred to me a lot over the years...the meaning of and associations with the word 'columbine'. Before the Columbine HS shooting, the word was probably pretty rare and almost exclusively used in reference to the flower. But immediately after the shooting, we'd probably see a spike in the word across all available sources. Where that spike would occur is telling, but maybe not for the reasons you think (for ex. people near Columbine HS might be talking just as much about the shooting, but they may not be using the word 'columbine' as much as other, more far away places that may need to contextualize or preface the story). Nowadays, I wonder who is saying 'columbine' and in what context. On Twitter, my intuitive guess is that it's a tossup between flower uses and a reference to the school shooting. And 'columbine' is a long word...if there's another way to reference either, using fewer characters, then the frequency or collocation for the OTHER use would be higher, even if it is talked about less. In other words, it is the way things are talked about too (within the style and constraints of the genre) that matters.

Anyways, I don't mean to rip things apart, I just wanted to bring up some points and questions I have. I love this project and want to see it grow and expand, especially to further our knowledge about how people talk online, which is my favorite subject, meta or otherwise.
posted by iamkimiam at 10:06 AM on May 20, 2010 [1 favorite]


Ugh. Strike that second sentence. I have no intention to come out blazing with 'THERE ARE ALL THESE PROBLEMS, SEE!?!?!'. Yikes.
posted by iamkimiam at 10:07 AM on May 20, 2010


Columbine: "People are talking about this 17% more today than they were a month ago (on average, once every 4,917,678 words)."

Now that is interesting. I would imagine it would have gotten more mention in April, when the shooting happened. But maybe it has something to do with the flower blooming in late spring/early summer.

I'd love to see these results for April/May for the last 12 years or so.
posted by iamkimiam at 10:16 AM on May 20, 2010


Huh. I just realized that the shooting happened EXACTLY a month and 11 years ago, on April 20, 1999. I wonder how precise "People are talking about this 17% more today than they were a month ago" is. Is 'today' ellipsed from "a month ago [today]"?
posted by iamkimiam at 10:18 AM on May 20, 2010


There's a lot of comedy here:

in the top states

in the age range

in the gender disparity
posted by shii at 2:50 PM on May 20, 2010




This thread has been archived and is closed to new comments