Good is to MetaFilter as evil is to LOLCats
January 3, 2016 6:14 PM

A web tool (scroll down) built by Radim Řehůřek allows you to compute analogies between English words using Google's word2vec semantic representation, trained on 100 billion words of Google News. "He" is to "Linda" as "she" is to "Steve." "Wisconsin" is to "Milwaukee" as "Maryland" is to "Baltimore." "Good" is to "MetaFilter" as "evil" is to "LOLCats."

Řehůřek is the creator of gensim, a Python library for semantic analysis.

A relevant 2014 paper by Omer Levy and Yoav Goldberg about what word2vec is really doing when it talks about semantic relations between words and analogies.

Previously on MetaFilter.
posted by escabeche (30 comments total) 10 users marked this as a favorite
Cat is to bird as dog is to bald eagle, apparently. (At least that is the second result. The first just gives "birds". The first result for just about everything I tried was pretty useless, but the second or third is often interesting.)
posted by lollusc at 6:22 PM on January 3, 2016

Apparently Metafilter is to Good as Reddit is to Bad.

I mean, I knew that already, but it's nice to know that the semantic representation engine agrees with me.
posted by leotrotsky at 6:43 PM on January 3, 2016 [7 favorites]

butts is to lol as dongs is to tho. Those dongs tho.
posted by cortex at 6:46 PM on January 3, 2016 [4 favorites]

Apparently a woman needs a man like a fish needs a trout.
posted by leotrotsky at 6:46 PM on January 3, 2016 [5 favorites]

oxygen is to eight as carbon is to

"But professor, I put the correct answer down! Why didn't I get credit!"
posted by Wolfdog at 6:47 PM on January 3, 2016

Stupid : Is :: Stupid : Does
posted by Cool Papa Bell at 6:48 PM on January 3, 2016 [2 favorites]

wool is to sheep as
silk is to [["goats",0.4809827506542206],["cows",0.47358667850494385],["shepherds_tending",0.4675234258174896],["mink",0.46212467551231384],["pigs",0.4528849422931671]]
posted by Wolfdog at 6:51 PM on January 3, 2016

Trump : President :: Hell : President

This thing is good.
posted by leotrotsky at 6:52 PM on January 3, 2016 [4 favorites]

Rose : Sweet :: Romeo : Zanjoe

Well, this thing is alright.
posted by leotrotsky at 6:55 PM on January 3, 2016

Awesome, it's the return of Google Sets!
posted by JHarris at 6:58 PM on January 3, 2016 [2 favorites]

ROFLMAO is most similar to ["KeithOlbermann_@",0.7536270618438721].

posted by solarion at 6:58 PM on January 3, 2016 [1 favorite]

language : Esperanto :: music : klezmer revival

Metafilter : cortex :: Enterprise : hippocampus

Potter : wand :: Frodo : nunchuk controller
posted by zompist at 7:01 PM on January 3, 2016 [10 favorites]

"But professor, I put the correct answer down! Why didn't I get credit!"

Probably because you copied the wrong answer to the answer sheet -- you're looking at the debugging info, but the algorithm's final choice shows up in the sentence, replacing the question mark.
posted by effbot at 7:05 PM on January 3, 2016

scooby : dooby :: dooby : dum_dum_dum_dum

So close.
posted by carter at 7:07 PM on January 3, 2016 [2 favorites]

man : woman :: lithium : lithium carbonate
posted by wormwood23 at 7:37 PM on January 3, 2016

lime : tequila :: catfood : Big Gulps
Is this why the cats look so disappointed at meals?
posted by pernoctalian at 8:19 PM on January 3, 2016

cat is to kitten as sloth is to gluttony.
head is to hat as butt is to ass.
posted by knuckle tattoos at 9:04 PM on January 3, 2016 [1 favorite]

I typed in many childish things and was duly amused.
posted by feckless fecal fear mongering at 9:47 PM on January 3, 2016

garbage : in :: garbage : out
posted by yinchiao at 10:21 PM on January 3, 2016 [1 favorite]

cow : beef :: pig : pork
Not bad.
cow : beef :: chicken : meat
Well, I can't argue with that.
finger : hand :: toe : Gallinari_stubbed
Apple : iPod :: Nintendo : Nintendo_DS
Nice. Hmm...
Apple : iPhone :: Nintendo : 3DS
Very nice.
crepe : France :: burrito : Germany
This is news to me...
posted by NMcCoy at 11:16 PM on January 3, 2016 [8 favorites]

"Metafilter" is most similar to... "Geekosystem".

...I can't really argue.
posted by eykal at 12:18 AM on January 4, 2016

Additionally, Scotland is to England as Metafilter is to Buzzfeed, which seems a tad harsh on old England there. And life is to death what Metafilter is to... Make of that what you will.
posted by eykal at 12:34 AM on January 4, 2016 [2 favorites]

I've built several things like this in the past, i.e. models that translate words or symbols into vectors that are supposed to represent meaning in some way...

Mostly they were relatively small sets of words/symbols though... Like all different make/model combinations for cars (a few thousand) or all words used in a speech therapy program for aphasics (hundreds).

I'll say this in defense of the word2vec thing: this is hard. Even if you have orders of magnitude more documents or usage examples than words, you'll always get a lot of results that make sense and a lot that don't. And you can never check them all.

When models based on vector representations break down and produce ridiculous results, it's mostly because the obvious answer depends on context much more than we realize. If the semantics of symbols can be modeled as a linear space at all (which is highly doubtful), it can only be done within a given context.

So if you're thinking about weather, Oslo is a lot like Anchorage (guessing); if you're thinking about anything else, they're probably pretty different...

And if weather is what you care about, it might make sense to come up with some temperature/rainfall/whatever continuum where cities with similar weather end up close together. In a more general setting, this will end up making less and less sense.

So since the word2vec thing doesn't know what context you're thinking of, it has to either pick one (better, ideally based on some idea of its likelihood) or somehow mush them all together and hope for the best...

Kind of nice how often the 'correct' thing shows up in the list at least, though. I'll probably end up using this for something at some point.
posted by kleinsteradikaleminderheit at 7:26 AM on January 4, 2016 [1 favorite]
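The context point above (Oslo vs. Anchorage) can be made concrete with a toy sketch. Every feature value below is invented purely for illustration, and these are hand-built feature vectors rather than learned word2vec embeddings; the point is just that "similar" depends on which subspace — which context — you compare in:

```python
import numpy as np

# Hypothetical city vectors: [avg_temp_C, annual_rain_m, population_M, gdp_per_capita_k]
# Every number here is invented purely for illustration.
cities = {
    "Oslo":      np.array([ 6.0, 0.8, 0.70, 75.0]),
    "Anchorage": np.array([ 3.0, 0.4, 0.29, 60.0]),
    "Miami":     np.array([25.0, 1.5, 0.47, 70.0]),
}

WEATHER = slice(0, 2)  # temperature and rainfall dimensions
ECON = slice(2, 4)     # population and income dimensions

def dist(u, v, dims=slice(None)):
    """Euclidean distance, optionally restricted to a subspace (a context)."""
    return float(np.linalg.norm(u[dims] - v[dims]))

# In the weather context, Oslo looks like Anchorage;
# in the economic context, it looks more like Miami.
print(dist(cities["Oslo"], cities["Anchorage"], WEATHER))  # small
print(dist(cities["Oslo"], cities["Miami"], WEATHER))      # large
print(dist(cities["Oslo"], cities["Miami"], ECON))         # small
print(dist(cities["Oslo"], cities["Anchorage"], ECON))     # large
```

word2vec has no such labeled dimensions to select from; it mushes every context into one space, which is exactly why the "obvious" answer sometimes loses.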

Can someone who understands this better help me understand how it works?

I began with their test question: Man is to king as woman is to ? and got the response Queen, which indicated it is working properly.

So I tried Man is to cowboy as woman is to ? and got Cowgirl. I see the same logic that provided the King/Queen answer.

And Man is to Uncle as woman is to ? gives the answer Aunt.

So I decided to switch genders and see what happened, and submitted Man is to Aunt as woman is to ? I didn't get Uncle. I got "mother." Any idea why?

Also, Man is to president as woman is to ? gave me the answer President (with a capital P, although I used lower case in the submission). Any idea why it's President instead of president?

And finally, Man is to coach as woman is to ? gives the response "Head Coach."

I tried reading part of the Levy and Goldberg paper but didn't see anything that helped me understand how these particular responses are being generated.
posted by layceepee at 8:32 AM on January 4, 2016

math is to maths as sports is to sport.
posted by madcaptenor at 8:46 AM on January 4, 2016 [2 favorites]

Harvard : Boston :: Yale : Hartford [ooh, so close]
Harvard : Boston :: Princeton : Cleveland [lol wut]
Harvard : Boston :: Vanderbilt : Atlanta [are you trolling?]
Harvard : Boston :: Auburn : Jacksonville [this feels like trolling]
Harvard : Boston :: Stanford : San_Diego [definitely trolling]
posted by mhum at 9:09 AM on January 4, 2016 [2 favorites]

And yes, I know Harvard is in Cambridge but if you plug in Harvard : Cambridge, the results are polluted by interpreting the latter as University of Cambridge not Cambridge, MA.
posted by mhum at 9:14 AM on January 4, 2016

layceepee: The program doesn't "actually understand" any relations. That is, it doesn't model family relationships etc.

It derives all its knowledge from words occurring very near other words. So "king" occurs very often near "queen", near "England", near "Tut", and so on. In their corpus "king" occurs over 50,000 times, and the vicinity of each one of those instances is a context.

Next, they throw those contexts at a neural net. As they say, the results of this are a bit "opaque", but (as I understand it) the net "notices" different axes of similarity. E.g. it notices that there is a clump of words that tend to co-occur, like "crown", "king", "majesty", "tut", "queen", etc. We can mark all words as belonging to this clump or not — this is a dimension of the vector space. Humans can recognize it as "words referring to royalty", but it's still just based on nearness in the texts.

Once the net has a few hundred dimensions, each word can be seen as a vector pointing somewhere in this multi-dimensional space.

The claim is that a relation "a : a* :: b : ?" can be approximated by doing some vector arithmetic. In particular, the missing word b* will have a vector very similar to (b - a + a*).

The whole thing works when the neural net has found pretty solid clumps. That mostly comes down to having lots of examples. So the analogies have to be pretty simple, corresponding to a dimension it knows (like 'find the female equivalent of this common word'). It's also quite good at finding grammatical relationships, e.g. work : works :: be : is. (Although even this fails with rare words.)

Where it fails hilariously (as in the samples I posted), it's because it doesn't have good clumps. E.g. it has "MetaFilter" a bunch of times in the corpus, but "cortex" didn't occur close enough to it, often enough, for the net to "really know" that they were related. Notice that the failure mode is that the missing word b* is similar to a* but doesn't have the right relationship. E.g. "hippocampus" is similar to "cortex" as they are both brain parts.

With your examples, my guess is that the female-related terms occur much less often in the database, and so it's just not as good at finding their interrelationships.
posted by zompist at 10:12 AM on January 4, 2016 [4 favorites]
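The arithmetic zompist describes can be sketched in a few lines. The vectors below are tiny and hand-picked purely for illustration (real word2vec vectors have hundreds of dimensions learned from text); gensim exposes the same computation as most_similar(positive=['woman', 'king'], negative=['man']).

```python
import numpy as np

# Hand-picked toy vectors, purely for illustration.
# Dim 0 ~ "person-ness", dim 1 ~ "femaleness", dim 2 ~ "royalty".
vectors = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([0.2, 0.0, 1.0]),
    "queen": np.array([0.2, 1.0, 1.0]),
    "apple": np.array([0.9, 0.5, 0.1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, a_star, b):
    """Answer 'a : a* :: b : ?' by finding the word nearest to b - a + a*."""
    target = vectors[b] - vectors[a] + vectors[a_star]
    # Exclude the query words themselves, as the real tools do.
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))  # → queen
```

The failure mode zompist mentions falls out of this directly: if no candidate word sits near the target point, max() still returns whatever happens to be nearest, however wrong.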

Thanks, zompist, that's very helpful.

It sounds similar to the way Watson, the AI that competed successfully on Jeopardy, works. Is that true, or have I misunderstood one or the other?
posted by layceepee at 12:01 PM on January 4, 2016

With mine, too, it often would just put the plural of b* in my analogies. But these?

bird: fly :: fish : longline_vessels
soccer : goal :: rugby : Elgan_O'Donnell

posted by Metro Gnome at 8:23 PM on January 4, 2016


This thread has been archived and is closed to new comments