Find a separating hyperplane with one weird kernel trick!
April 8, 2013 8:48 AM

 
This is a post about funny things in the field of machine learning.
posted by curuinor at 8:54 AM on April 8, 2013 [23 favorites]


My kernel trick has always been to use make menuconfig. I hope I didn't rip off too many Pennsylvanians with that.
posted by DU at 8:54 AM on April 8, 2013 [3 favorites]


It's very clear to me. Are you from Pennsylvania?
posted by weapons-grade pandemonium at 8:54 AM on April 8, 2013 [8 favorites]


Dear creator of that site,

I hate you for making light of my profession. And I love you for pretty much everything else, including the above.

Now I need to go wash out my eyes with a toothbrush.
posted by blindcarboncopy at 8:55 AM on April 8, 2013 [1 favorite]


How does using crc32 help my classifier? How is that better than anything else from the algorithm's perspective?

Can a programmer with a degree break this down for me so I can go to the management team with it?
posted by Ad hominem at 8:56 AM on April 8, 2013 [1 favorite]


For those unaware, oneweirdkerneltrick.com is tied to some papers presented at SIGBOVIK 2013. If that's your kind of humor, you can lose the rest of your day there.
posted by Going To Maine at 9:04 AM on April 8, 2013 [2 favorites]


Yeah, SIGBOVIK happened.
posted by oonh at 9:05 AM on April 8, 2013 [2 favorites]


Ad hominem, there are two questions here: why the hashing trick works, and why you would want to use it.

Let's start with the second question. The reason the hashing trick was invented was to deal with an annoying problem in machine learning: as the number of features used in your model starts growing, the old algorithms break down, scale-wise. Take random forests: they are wonderful for all kinds of things, but you try to shove e.g. 10^8 sparse features into one, and it will pretty much fall apart. Training will take too long, and then you can't even use your resulting model because it's so huge you can't load it into a single machine. So now you are also solving difficult parallelization challenges just because you have so many damn features. At this point, most sane people would go to feature engineering (moving from 10^8 to e.g. 10^4), but this is not an easy thing to get right, and you are throwing away perfectly good information the whole time.

Now, the question is, why would you want 10^8 features in the first place? There are lots of domains where this is a fact of life, but the most prominent one is text classification. Take a large corpus and try to use all of it: once you include 1-grams and 2-grams, you are right there. Game over. Unless... unless there were a way to have a high-quality model that can be trained sufficiently quickly and has a fixed output size, regardless of how many features you put into it.

The hashing trick is that way. Hash the features into a fixed-size array. So if your output hash size is 20 bits, you always have 2^20 potentially-dense features, even if you started with 10^12 sparse ones. Guaranteed. Combine this with online learning, and your input length can be essentially unlimited as well. Unlimited number of input records, unlimited number of input features, and a high-quality, fixed-size model comes out every time. What's not to love?
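To make that concrete, here is a minimal sketch of the mechanism (my own toy illustration, not anything from the linked papers; it happens to use crc32 as the hash, like Ad hominem mentioned, but any decent hash would do):

    import zlib

    NUM_BITS = 20                  # fixed output size: 2**20 buckets, no matter what
    NUM_BUCKETS = 1 << NUM_BITS

    def hash_features(tokens):
        """Map any number of string features into a fixed-size
        sparse vector of bucket -> count via the hashing trick."""
        vec = {}
        for tok in tokens:
            bucket = zlib.crc32(tok.encode("utf-8")) % NUM_BUCKETS
            vec[bucket] = vec.get(bucket, 0) + 1
        return vec

    # 1-grams and 2-grams of a document, all squashed into the same 2**20 slots.
    words = "find a separating hyperplane with one weird kernel trick".split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    print(len(hash_features(grams)), "non-zero buckets out of", NUM_BUCKETS)

The learner downstream only ever sees bucket indices, so its size is bounded by 2^20 regardless of how many distinct n-grams the corpus throws at it.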

Now, as to why the hashing trick works, I will point you to the best, most accessible introduction I've seen: http://blog.someben.com/2013/01/hashing-lang/. There are links to more papers in there if you are interested, but this blog nails the intuition nicely. The gist is that under some very easy-to-meet distributional assumptions, going from "raw" to "hashed" features loses very little information and adds very little overfitting.
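The blog has the actual math; purely as a toy illustration of the information-loss half (my own made-up numbers and feature names), here is what happens to the similarity between two near-identical documents after squashing them into a deliberately tiny hash space with a signed variant of the trick:

    import random
    import zlib

    NUM_BUCKETS = 1 << 12   # deliberately small so collisions really happen

    def hashed(vec):
        """Hash a {feature: weight} dict into a dense vector, using one bit of
        the hash as a sign so that collisions cancel out in expectation.
        (In practice you'd use a second, independent hash for the sign.)"""
        out = [0.0] * NUM_BUCKETS
        for feat, val in vec.items():
            h = zlib.crc32(feat.encode("utf-8"))
            sign = 1.0 if (h >> 31) else -1.0   # top bit picks the sign
            out[h % NUM_BUCKETS] += sign * val
        return out

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    random.seed(0)
    vocab = [f"ngram_{i}" for i in range(200_000)]

    # Two similar "documents": same 3,000 active features, slightly different weights.
    doc_a = {f: random.random() for f in random.sample(vocab, 3_000)}
    doc_b = {f: w * random.uniform(0.9, 1.1) for f, w in doc_a.items()}

    exact = sum(doc_a[f] * doc_b[f] for f in doc_a)
    approx = dot(hashed(doc_a), hashed(doc_b))
    print(f"exact similarity {exact:.1f}  vs  hashed similarity {approx:.1f}")

The two numbers should come out close to each other, even though 200,000 possible features were crammed into 4,096 buckets.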

So this was a long way to answer a short question. If you already have a working classifier, the hashing trick may not help it much. It is there to solve the kinds of problems for which you are unlikely to have a previously-working classifier in the first place.
posted by blindcarboncopy at 9:17 AM on April 8, 2013 [31 favorites]


Well, there's also the nice property that anything you can hash, you can also learn from. A little bit useful for the people dealing with natural language, but pretty goddamn useful for somebody dealing with, say, a billion JSON thingies.
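Something like this, say (a rough sketch with hypothetical field names, nothing more): flatten each JSON blob into "path=value" strings, then push those through the exact same hashing step as the n-grams above.

    import json
    import zlib

    NUM_BUCKETS = 1 << 20

    def json_features(obj, prefix=""):
        """Flatten arbitrarily nested JSON into "path=value" string features."""
        if isinstance(obj, dict):
            for key, val in obj.items():
                yield from json_features(val, f"{prefix}{key}.")
        elif isinstance(obj, list):
            for item in obj:
                yield from json_features(item, prefix)
        else:
            yield f"{prefix.rstrip('.')}={obj}"

    def hash_features(feats):
        vec = {}
        for feat in feats:
            bucket = zlib.crc32(feat.encode("utf-8")) % NUM_BUCKETS
            vec[bucket] = vec.get(bucket, 0) + 1
        return vec

    record = json.loads('{"user": {"country": "US", "tags": ["cats", "svm"]}, "clicks": 7}')
    print(hash_features(json_features(record)))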
posted by curuinor at 9:20 AM on April 8, 2013 [1 favorite]


Well played, Gentlemen, well played indeed. But I believe the game is mine... Mornington Crescent for the win.
posted by Naberius at 9:24 AM on April 8, 2013 [11 favorites]


OK, I get it. You are actually counting on collisions when you throw away most of the hash, but due to Zipf's law and the nature of language, you assume that when you do get a collision, one of the values will have been used fairly infrequently. So you trade some precision for speed.
posted by Ad hominem at 9:33 AM on April 8, 2013


"If this isn't a stunt/spam post, you need to do a better job making that clear."

I can see how it might not be totally clear for everyone, but this is not a stunt post, this post is amazing.
posted by Blasdelb at 9:37 AM on April 8, 2013 [8 favorites]


Ad hominem, exactly right, but I would change the wording slightly to "Throw away some precision for speed and feasibility", to underscore that once your input is sufficiently "wide" and "long", it is very hard to build a model that can take advantage of all the information in the input stream, even if you didn't care about speed. So yes, you are losing information, but under high-scale conditions it is about the smallest information loss you can realistically get away with anyway.
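If you want to put a rough number on the Zipf intuition, here is a quick back-of-the-envelope simulation (mine, nothing rigorous; in the real trick colliding features are summed together rather than thrown away, but it quantifies the same point). It compares how many terms lose a collision with how much of the total token mass those losers actually account for:

    import zlib

    NUM_BUCKETS = 1 << 18
    VOCAB_SIZE = 1_000_000

    # Zipf-ish frequencies: the term at rank r occurs proportionally to 1/(r+1).
    freqs = [1.0 / (rank + 1) for rank in range(VOCAB_SIZE)]
    total_mass = sum(freqs)

    # In each bucket the most frequent term "wins"; everything else in that
    # bucket is overshadowed by the collision.
    heaviest = {}
    for rank, f in enumerate(freqs):
        b = zlib.crc32(f"term_{rank}".encode("utf-8")) % NUM_BUCKETS
        heaviest[b] = max(heaviest.get(b, 0.0), f)

    losing_terms = VOCAB_SIZE - len(heaviest)
    losing_mass = total_mass - sum(heaviest.values())
    print(f"{100 * losing_terms / VOCAB_SIZE:.1f}% of terms lose a collision,")
    print(f"but they account for {100 * losing_mass / total_mass:.1f}% of the token mass")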

(Disclaimer: this statement is from an implementation perspective, as opposed to what is theoretically possible. Not intended to be construed as medical advice. Do not remove the label.)
posted by blindcarboncopy at 9:38 AM on April 8, 2013 [1 favorite]


Oh my god those protest signs are freaking amazing. I am sending this to everyone in my lab.
posted by en forme de poire at 9:39 AM on April 8, 2013


Right, better to have it slightly imprecise than not at all.

Thanks for the explanation.
posted by Ad hominem at 9:39 AM on April 8, 2013


PROTECT OUR RANDOM FORESTS
posted by en forme de poire at 9:46 AM on April 8, 2013 [9 favorites]


You ever see little kids who laugh at jokes the adults are telling, just because they're trying to pick up social cues? That's what this post is to me. I can tell just enough to know that it's a joke, and not enough to understand any aspect of it otherwise.
posted by Navelgazer at 10:08 AM on April 8, 2013 [22 favorites]


Lost it at "SUPPORT VECTOR MACHINES!"
posted by Nomyte at 10:10 AM on April 8, 2013 [1 favorite]


I can see how it might not be totally clear for everyone, but this is not a stunt post, this post is amazing.

I can tell just enough to know that it's a joke, and not enough to understand any aspect of it otherwise.

Actually, I have a bit of a beef with this post, because it mixes up things that are specifically intended to be humorous (the protest, one weird kernel trick) with things that are humorously phrased but not actually jokes (the hashing trick, the Jeopardy app, the cat data). I mean, I can barely keep up with this stuff and am often uncertain whether I actually am, so mixing fact and fiction is getting to me a little bit.
posted by Going To Maine at 10:27 AM on April 8, 2013 [2 favorites]


"Oh sure, going in that direction will totally minimize the objective function" —Sarcastic Gradient Descent.

Hee. That -is- funny. Yes, I'm a CS nerd.
posted by Iosephus at 10:34 AM on April 8, 2013 [2 favorites]


Most impressive is that they didn't spell it "kernal" like I would have.
posted by alms at 10:40 AM on April 8, 2013


Haha! Dusting off my MetaFilter account to comment. I'm a robotics PhD student at CMU and one of the creators of the oneweirdkerneltrick website and the papers within. (Please check out the papers, including the slides and paper for the Kardashian Kernel further down the page.)

For context, this is work we presented at SIGBOVIK, an annual joke conference at CMU. It is obviously meant to be satire. (Though if you're willing to invest in our n-dimensional polytope scheme please send us money - it's totally not a pyramid scheme!).

If you don't know what's going on in this page/papers, I recommend a) reading lots of books on machine learning and computer vision, and b) reading lots of celebrity gossip blogs, "teen swagg" Facebook posts, and other dens of frivolous pop/teen/internet culture.
posted by instantlunch at 11:19 AM on April 8, 2013 [19 favorites]


I genuinely appreciate the reminder that there are people a lot nerdier than I am. :)
posted by Foosnark at 11:27 AM on April 8, 2013 [1 favorite]


Is our machines learning?
posted by srboisvert at 12:14 PM on April 8, 2013 [3 favorites]


The first link was enough to put me right off. It looks like it should be funny, but it might as well be in Chinese. And I can't read a word of Chinese, as big as a house.

*shrug*
posted by Too-Ticky at 12:32 PM on April 8, 2013


The first link was enough to put me right off. It looks like it should be funny, but it might as well be in Chinese. And I can't read a word of Chinese, as big as a house.
Try the Chinese Room.
posted by delmoi at 12:46 PM on April 8, 2013 [9 favorites]


Try the Chinese Room.

Well I got that joke at least.
posted by Navelgazer at 12:52 PM on April 8, 2013 [1 favorite]


Professor Bovik is also alluded to in the Java spec:

"For example, a package named edu.cs.cmu.bovik.cheese …"
posted by zippy at 12:56 PM on April 8, 2013


I laughed so hard I snorked. Thanks for posting.
posted by benito.strauss at 1:37 PM on April 8, 2013


The one weird kernel trick is almost perfect, but in the real ads they always use digits rather than spelling out the numbers, and it drives me mad for some reason. (So it would be "1 weird kernel trick").
posted by Pyry at 2:16 PM on April 8, 2013 [2 favorites]


Here's the cat head detection paper (PDF).
posted by whir at 2:16 PM on April 8, 2013


Man. Tenure is a hell of a thing...
posted by schmod at 9:25 PM on April 8, 2013 [1 favorite]


Just pop pop it on the stack!
posted by blue_beetle at 9:26 PM on April 8, 2013


I had a Loompanics copy of the Principia Discordia when I was a kid that read like this site.
posted by midnightscout at 10:48 PM on April 8, 2013


That reminds me...

/has sex with giant apple
posted by obiwanwasabi at 1:48 AM on April 9, 2013


This article is probably the best computer vision paper that I have ever read. Sure, it's an extension of eigenfaces, but you have to realize that human faces are not that variable! In contrast, cat faces may vary in shape, size, colour, ... Indeed, given that the algorithm is almost entirely image-based, it is impressive that the authors did not even bother to align the pictures; the cats aren't even looking in the same direction as the test images, for chrissakes!

Having said that, I am disappointed at the lack of quantitative analysis. I would like to see some error measurements pooled across a large set of images -- the examples seem somewhat cherry-picked at the moment. I'd also like to see a visualization of the purrincipal catponents: the eigenfaces were great because you could see which features contributed the most to the facial shape (e.g., hairline).

Still, I am very impressed with this work. The authors allude to possibly combining the results with a hierarchical feline stack, which seems like a fruitful research area, and I will be keeping an eye on this research group.
posted by tickingclock at 9:41 PM on April 9, 2013 [1 favorite]






This thread has been archived and is closed to new comments