

Cinematch++
June 26, 2009 4:44 PM   Subscribe

Over three years later, has the Netflix Prize been won? Today our team submitted our solution to the Netflix Prize, resulting in a score of .8558, which corresponds to an improvement over Netflix's Cinematch algorithm of 10.05%. This is the first submission in the competition to break the 10% barrier and sets off a 30-day period where all competitors are invited to submit their best and final solutions. (Previously.)
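[For anyone checking the arithmetic behind the headline number: the contest scored submissions by root-mean-squared error (RMSE), and "improvement" is measured relative to Cinematch's RMSE on the same hidden data, which the contest leaderboard listed as 0.9514 on the quiz set. A quick sketch:]

```python
# Back-of-the-envelope check of the announced 10.05% figure.
# 0.9514 is the Cinematch RMSE published on the contest leaderboard
# for the quiz set; 0.8558 is the score from this submission.
cinematch_rmse = 0.9514
submission_rmse = 0.8558

# Percent improvement is the relative reduction in RMSE.
improvement = (cinematch_rmse - submission_rmse) / cinematch_rmse
print(f"{improvement:.2%}")  # prints 10.05%
```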

The team includes Bob Bell and Chris Volinsky of the statistics research department at AT&T Research (members of the 2007 and 2008 Progress Prize-winning teams); Andreas Toscher and Michael Jahrer, machine learning experts at commendo research and consulting in Austria (members of the 2008 winning team); Martin Piotte and Martin Chabbert, engineers and founders of Pragmatic Theory in Montreal; and Yehuda Koren, a senior scientist at Yahoo Research in Israel (a member of the 2007 and 2008 winning teams).
posted by youarenothere (58 comments total) 7 users marked this as a favorite

 
Is this what sports are like for normal people? I was watching that leaderboard like. Well. A scoreboard.
posted by GilloD at 4:51 PM on June 26, 2009


Sounds nice. About when do they plan on implementing the winner, if they're going to do the selection in a month? And I can't seem to find what they want to do with the $1 million, if they've stated it anywhere.
posted by mccarty.tim at 4:53 PM on June 26, 2009


Finally, I can stop asking friends, librarians and video store employees what to watch next.

But if Netflix thinks you're gay, the chances are 85.58% that it's right?
posted by filthy light thief at 4:53 PM on June 26, 2009


This is the first submission in the competition to break the 10% barrier and sets off a 30-day period where all competitors are invited to submit their best and final solutions.

This has me wondering whether there's a realistic possibility of strong teams intentionally fronting slightly deficient output—if your algo had pushed past the 10% mark in practice but you wanted more time to tune it tighter, would it be worth the gamble to hold off on reporting that, claim to be at like 9.5% or so, and wait to see if another team will limp past 10% honestly while you're tinkering? Then, bam, you get some "progress" from nowhere once the 30-day window opens.

Of course, if one team could do it, why not several teams? In which case that becomes a hell of a hand of poker once the cards come up.

I'm not really sure how progress is measured on a per-team basis, though, so maybe this whole idea is moot. But it seems like with decent stakes and a long-form contest like this, some game theory would come into it.
posted by cortex at 4:54 PM on June 26, 2009 [4 favorites]


Also, I don't know if Netflix has said as much, but it seems reasonable that a job offer might come along with that cash prize. If you're buying golden eggs anyway...
posted by cortex at 4:56 PM on June 26, 2009


Could someone explain how they determine the accuracy of the algorithm to someone who doesn't really understand algorithms? Do they go by whether or not the people choose the suggested movies?
posted by orme at 4:58 PM on June 26, 2009


Does this mean Netflix will finally stop recommending Benjamin Button to me?
posted by scody at 5:05 PM on June 26, 2009 [3 favorites]


Not too surprising; they were getting closer and closer.

This has me wondering whether there's a realistic possibility of strong teams intentionally fronting slightly deficient output—if your algo had pushed past the 10% mark in practice but you wanted more time to tune it tighter, would it be worth the gamble to hold off on reporting that, claim to be at like 9.5% or so, and wait to see if another team will limp past 10% honestly while you're tinkering? Then, bam, you get some "progress" from nowhere once the 30-day window opens.

I assume the contestants don't get to see the data that Netflix actually uses to test against, which makes it extremely difficult to tune an engine to get 9.78% but not >10.0%.
posted by delmoi at 5:09 PM on June 26, 2009


orme - I believe the accuracy is determined by comparing predicted ratings based on a user's historical ratings with the ratings they actually give to a movie once they see it.
posted by telegraph at 5:10 PM on June 26, 2009


Cool. I remember talking with Koren when I wrote the Wired feature about the Netflix Prize last year (see the "previously" link above.) He told me he really wanted his team (himself, Bell, and Volinsky) to get past 10% on their own. But he conceded that if they got really close, he probably wouldn't be able to resist joining forces with other leading contenders to get over the top.
posted by escabeche at 5:13 PM on June 26, 2009 [3 favorites]


They give the contestants a random anonymized slice of their massive data set. Contestants tweak and tune their algorithm against that random slice. They then upload their results to Netflix, which runs them against the full(?) corpus, comparing what the algorithm predicts for a person to what they actually rated. That results in a number which goes on the scoreboard and tracks teams' progress.
posted by Rhomboid at 5:13 PM on June 26, 2009 [1 favorite]


I assume the contestants don't get to see the data that Netflix actually uses to test against, which makes it extremely difficult to tune an engine to get 9.78% but not >10.0%.

But that wouldn't be necessary -- you'd just leave your 9.78% algorithm unupdated.

There was certainly speculation among the contestants I talked to that some of this gamesmanship was going on, but nobody copped to it themselves.
posted by escabeche at 5:15 PM on June 26, 2009


Could someone explain how they determine the accuracy of the algorithm to someone who doesn't really understand algorithms? Do they go by whether or not the people choose the suggested movies?

I think they go by the rating. Like user K has rated movies A1 through A100. You give the algorithm A1 through A99, and ask it to predict the rating for movie A100. Or you could give it A1 through A50 and ask it to predict A51-A100. Then you give it a score based on how accurate the guesses are.

If you have an algorithm that can predict the correct score for a movie that the user has seen and rated (but that the algorithm hasn't been shown) then you can probably guess correct scores for movies that the user hasn't seen.

The thing about the Netflix Prize, though, is that after the first few years there probably wasn't that much innovation going on, just fiddling with the parameters of some SVD model or Bayesian network or whatever they were using.
posted by delmoi at 5:16 PM on June 26, 2009
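[A minimal sketch of the scoring delmoi describes, using made-up ratings: hide some known ratings, predict them, and compute the RMSE the leaderboard reports.]

```python
import math

# Hide some of a user's known ratings, ask the model to predict them,
# and score the predictions by root-mean-squared error (the contest's
# metric). The ratings and the "model output" here are fabricated.
actual    = [4, 3, 5, 2, 4]            # held-out ratings the model never saw
predicted = [3.8, 3.4, 4.5, 2.6, 3.9]  # hypothetical model predictions

rmse = math.sqrt(
    sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
)
print(round(rmse, 4))  # prints 0.405
```

The real contest computed this same quantity over millions of held-out ratings, which is why progress was measured in thousandths of a star.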


Finally, I can stop asking friends, librarians and video store employees what to watch next.

In my experience, Netflix is much more accurate already than any individual--even one who knows me well. I find this really interesting, actually. There's a neat thing on Netflix where you can look at the ratings of people who rate similarly to the way you do; but even the closest individual match out there in Netflix land is at best in about 45% agreement with me--and will absolutely love some films that I absolutely loathe and vice versa. Hidden somewhere in these algorithms are going to be some really interesting insights into the nature of taste and the ways in which we are both utterly predictable in our tastes and wildly individual.
posted by yoink at 5:18 PM on June 26, 2009 [1 favorite]


But that wouldn't be necessary -- you'd just leave your 9.78% algorithm unupdated.

Right, but then how would you know it would actually do better than 10% when it was finally run against the test data?

Here's the thing. You have data on your hard drive you've downloaded. Netflix has more data you've never seen. You run your algorithm on your data and get whatever score you get. Then you upload it to Netflix and they run it on their data.

You really have no idea what kind of score you're going to get when you upload the code, because you don't know how the data will be different. There's no guarantee that code that gets 9.78% on your data will still get 9.78% on their data. In fact, it's pretty unlikely. The scores will probably be close, though.
posted by delmoi at 5:19 PM on June 26, 2009


You really have no idea what kind of score you're going to get when you upload the code, because you don't know how the data will be different. There's no guarantee that code that gets 9.78% on your data will still get 9.78% on their data. In fact, it's pretty unlikely. The scores will probably be close, though.

The question then becomes "how close is close", I guess. Modifying the gamesmanship theory to account for the blind testing against the full set, the idea that a team might come up with what they think is a significant improvement to their calculations and then test a crippled version of that improvement isn't implausible.

Say, uploading a version that does a random A/B divvying-up of Bog Standard and New & Improved methodologies, and seeing the delta there, and taking the risk that a projection of the improvement for the mixed method onto the full-on New & Improved version would be close to accurate.

Dicey stuff, and I'm really curious now to see how the teams have approached it and what all Netflix has done to control for this sort of thing, but this is exactly the sort of setup in which I have a hard time believing "it's hard!" would deter folks from doing something clever and sleighty—the whole exercise is a glorious round of "it's hard!", with cash money on the line as a bonus.
posted by cortex at 5:29 PM on June 26, 2009
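[Cortex's A/B divvying idea can actually be made quantitative, at least in principle: because mean squared error is an average of per-rating squared errors, a random mixture's score is a weighted average of the two methods' scores, so the full-strength score can be backed out. A purely hypothetical simulation; every model and number below is invented for illustration.]

```python
import math
import random

random.seed(0)

# Simulate a submission that randomly mixes predictions from an old
# ("Bog Standard") and a new ("New & Improved") method, then project
# the new method's full score from the mixed score.
n = 100_000
actual = [random.uniform(1, 5) for _ in range(n)]
old = [a + random.gauss(0, 0.95) for a in actual]  # weaker fake model
new = [a + random.gauss(0, 0.85) for a in actual]  # stronger fake model

p = 0.3  # fraction of predictions drawn from the new model
mix = [nv if random.random() < p else ov for ov, nv in zip(old, new)]

def mse(pred):
    return sum((x - a) ** 2 for x, a in zip(pred, actual)) / n

# Since MSE_mix ≈ (1 - p) * MSE_old + p * MSE_new, solve for MSE_new:
projected_new_mse = (mse(mix) - (1 - p) * mse(old)) / p
print(math.sqrt(projected_new_mse), math.sqrt(mse(new)))
```

With enough ratings the projection lands close to the new model's true RMSE, which is why holding back a full-strength method while still estimating its score isn't obviously crazy.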


Dicey stuff, and I'm really curious now to see how the teams have approached it and what all Netflix has done to control for this sort of thing, but this is exactly the sort of setup in which I have a hard time believing "it's hard!" would deter folks from doing something clever and sleighty—the whole exercise is a glorious round of "it's hard!", with cash money on the line as a bonus.

Wouldn't the fairest thing be if the "one month" clock was reset every time a new champion emerged? That is, if someone beats this team, all the other teams--including this one--get one month to see if they can top it. This would go on until a month went by with no new champion.

That seems so obviously the easy way to avoid any of this gamesmanship I'm surprised that those aren't the actual rules.
posted by yoink at 5:32 PM on June 26, 2009 [1 favorite]


I'm on Netflix, & I often check their ratings to avoid stinkers. I just noted a one-star difference between average & like-me ratings to say "Transformers 2 isn't for me" over on another thread.
posted by Pronoiac at 5:39 PM on June 26, 2009


I just noted a one-star difference between average & like-me ratings to say "Transformers 2 isn't for me" over on another thread.

Yeah, when you see a really huge delta between "all other users" and "our best guess for you" you know something's up. I find it very rare, these days, that Netflix will be off by more than a star in their rating prediction for me. Except in straight-up comedy, actually. That is such a "am I in the right mood for this" genre, I guess.
posted by yoink at 5:42 PM on June 26, 2009


This shit is not worth anything until I can be sure that the recommendation will be accurate for me 90% of the time. Throw in the fact that people change as their life experiences change, and I wonder what the best-case scenario is. I've been obsessed with films for most of my life and have friends with whom I've worked on, and discussed films with, for decades. I know their tastes very well. And I don't think I could hit 90% with any of them. They regularly surprise me, and not minor stuff, but like I hate it and they love it (f.ex. Juno - I hated, they loved). I hold no hope for this, not in my lifetime, and not in my next 7 Hindu lifetimes either. The recommendations from Netflix right now strike me mostly as completely random. I watch this contest with the same idle curiosity as I'd watch a snail race, and with the same hope of applicability in my life.
posted by VikingSword at 5:49 PM on June 26, 2009 [1 favorite]


That seems so obviously the easy way to avoid any of this gamesmanship I'm surprised that those aren't the actual rules.

There isn't going to be any gamesmanship. These guys have been slowly grinding out 0.01% increases for a year now. The top groups only made it over 10% through a gradual merging of their algorithms; the winning group is a combination of at least three other teams: Pragmatic Theory (a couple of engineers out of Quebec), BellKor (some AT&T eggheads), and Big Chaos (two guys from an Austrian research and consulting firm).

Note that the team in 2nd place is Pragmatic Theory by itself at 9.80. The team in third place is BellKor in BigChaos at 9.71, a combination of BellKor and Big Chaos. Those are the three teams that combined to get above 10%. The next-place team is something called "Grand Prize Team", another melding of many teams. The odds of them being able to come up with another 0.33% improvement in the next 30 days are somewhere between none and none.

No one else is going to come close to 10%: These were far and away the best guys in the competition and none of them could do it by themselves either.
posted by Justinian at 5:52 PM on June 26, 2009 [2 favorites]
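[The payoff of those mergers is the classic ensemble effect: averaging models whose errors point in different directions cancels much of the error. A toy sketch with fabricated ratings and predictions:]

```python
import math

def rmse(pred, actual):
    """Root-mean-squared error between predictions and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

# Two invented models whose errors tend to point in opposite directions.
actual  = [4, 3, 5, 2, 4, 1, 3, 5]
model_a = [4.4, 2.5, 4.6, 2.6, 3.5, 1.4, 3.3, 4.4]
model_b = [3.5, 3.4, 5.3, 1.5, 4.4, 0.7, 2.6, 5.5]

# A simple 50/50 blend of the two sets of predictions.
blend = [0.5 * a + 0.5 * b for a, b in zip(model_a, model_b)]

# The blend's RMSE comes out far below either individual model's.
print(rmse(model_a, actual), rmse(model_b, actual), rmse(blend, actual))
```

These numbers are rigged so the errors cancel almost perfectly; on real data the gain from blending is much smaller, but the direction is the same, which is why every leading "team" ended up being a coalition.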


The recommendations from netflix right now, strike me mostly as completely random

Really? I'm very surprised by that. How many films have you rated on the site?
posted by yoink at 5:52 PM on June 26, 2009


Let me retract my statement slightly: The Pragmatic Theory or BellKor/BigChaos guys may well have eventually been able to claw past 10%, but it would have been a long agonizing process over the next, I would guess, year and a half.
posted by Justinian at 5:59 PM on June 26, 2009


How many films have you rated on the site?

Approximately 400 or so, since 2006. And I don't think of myself as being capricious - as a data point, I usually watch Netflix DVDs only with my wife, and we very seldom have a difference of opinion on the films (the one exception is old Japanese movies, but even there, we agree more than we disagree). Point being, my taste is shared by at least one other human being, so I'm not some kind of two-headed cow freak - and yet, Netflix fails abysmally.
posted by VikingSword at 6:08 PM on June 26, 2009


Approximately 400 or so, since 2006.

Hmmm. I've rated over 1600, so my experience might reflect that. Still, I'd have thought 400 would be enough so that at least, say, when you go into their "movies you'll love" section and list films under a genre by rating, the top twenty or so should be "likely winners." I wonder if there is some particular genre faultline that you're straddling?

It could also be that you like only very, very outre films--in which case Netflix's datasets of fellow viewers/recommenders just may not be deep enough to catch you. I do notice that in some areas Netflix recommendations are simply pointless. Opera, for example, just isn't watched heavily enough for the ratings ever to move much out of the neutral 3-star area; similarly dance performance DVDs I never bother to notice the rating on. Maybe you only hang out in relatively under-rented and under-rated areas?
posted by yoink at 6:17 PM on June 26, 2009


Taking rating dates into account is really helpful: it lets you weigh the more recent ratings more heavily, & if you watch two in rapid succession, you'll exaggerate the difference. These are things I've read about being in use in contest entries.

For me, there are many types of movie that depend on mood. "The Darjeeling Limited" & "Synecdoche, New York" might have worked better with more advance warning of non-linearity.

VikingSword: are you saying it isn't guessing your ratings well, or that it's suggesting movies badly?
posted by Pronoiac at 6:19 PM on June 26, 2009


Vikingsword: They regularly surprise me, and not minor stuff, but like I hate it and they love it (f.ex. Juno - I hated, they loved).

Juno, and other movies like it (indie quirky films, like Wes Anderson's movies), have been identified as aberrations in Netflix ratings and recommendations; this point was discussed in last year's NYTimes Magazine article about the prize.
posted by telegraph at 6:20 PM on June 26, 2009


Well, I do watch mostly foreign, but I do a fair share of U.S. movies, though again, mostly indies, or David Lynch type stuff, plus documentaries.

And it both guesses my rating wrong, and gives bad recommendations.
posted by VikingSword at 6:25 PM on June 26, 2009


I vaguely recall that by far the "worst" aberration was Napoleon Dynamite - almost impossible to predict with anything resembling accuracy. This doesn't surprise me at all, given the almost random distribution of opinions about it among my social circles.
posted by Tomorrowful at 6:26 PM on June 26, 2009


If they've just been slowly eking out minor improvements over the last year, isn't it likely that they've been overfitting to the data set they were given? In that case, is it not likely that the algorithm at 9.8% will do better on Netflix's test?
posted by painquale at 6:30 PM on June 26, 2009


In that case, is it not likely that the algorithm at 9.8% will do better on Netflix's test?

Doesn't that depend on how large their data set is? Presumably for a set large enough, it shouldn't matter, no?
posted by VikingSword at 6:33 PM on June 26, 2009


I always wondered if the Netflix prize was rigged. Since you have to submit your code to Netflix to get the progress prizes, couldn't they just update their in-house algorithm with the winner of the progress prize, making it more difficult to beat the house every year?
posted by pravit at 6:38 PM on June 26, 2009 [1 favorite]


... couldn't they just update their in-house algorithm ... making it more difficult to beat the house every year?

I don't think that that's how it works. I think your score is measured against the benchmark set by their in-house algorithm at the beginning of the competition.
posted by mhum at 7:07 PM on June 26, 2009


I watch this contest with the same idle curiosity as I'd watch a snail race, and with the same hope of applicability in my life.
posted by VikingSword


They don't have to predict the preferences of niche viewers like yourself, just the other 99 percent of the people who use Netflix.
posted by mecran01 at 7:09 PM on June 26, 2009


What mhum said: the "qualifying RMSE" - 0.8572 - is listed in the rules as the number to meet or beat, so they can't move the goalposts like that. They mention how they derived that number, but it's not shifting with time.
posted by Pronoiac at 7:11 PM on June 26, 2009


They don't have to predict the preferences of niche viewers like yourself, just the other 99 percent of the people who use Netflix.

Yeah, this is a pretty key point. Netflix isn't throwing a million bucks around because they want to help the self-motivated cinema geek with something he neither wants nor expects help from them on.

I mean, I hear you; the accuracy of their ratings has zero practical effect on me, either, partly because I'm willfully stupid about my viewing choices ("It looks awful, but Christopher Walken is in it? Add to motherfuckin' Queue!"), partly because I use other social resources to get leads on films, and partly because I never have enough time to watch through my existing backlog of Stuff To Rent such that I have to resort to consulting the recommendations and numbers-of-stars they so helpfully provide.

But it's still a fascinating bit of territory to me. These snails have a really interesting destination, and the implications of the race even being run in the first place are pretty great.
posted by cortex at 7:23 PM on June 26, 2009 [3 favorites]


Wow . . . technology and ingenuity really never cease to amaze.

I love metafilter because it forces me to exercise parts of my brain I wouldn't otherwise flex, and it's an interesting measuring stick on copious levels.
posted by eggman at 7:29 PM on June 26, 2009


Or maybe it's 'cause I just smoked a little, something I rarely do?

Yeah, that must be it . . .
posted by eggman at 7:30 PM on June 26, 2009 [1 favorite]


I vaguely recall that by far the "worst" aberration was Napoleon Dynamite - almost impossible to predict with anything resembling accuracy. This doesn't surprise me at all, given the almost random distribution of opinions about it among my social circles.

You'd think they could build this into the recommendation system. They could pop up a little "people either love this or hate it" flag for films that are particularly difficult to track. No doubt, of course, there'd be some people who very predictably like such "quirky" films.
posted by yoink at 7:57 PM on June 26, 2009


It helps with serendipity. You know how it suggests movies when you put them in the queue? I got, to start: Big Man Japan, from that The Calamari Wrestler, from that The Rug Cop. My roommates & I found that string of recommendations amusing.

If you have Greasemonkey & you trust their estimated ratings, Netflix Queue Sorter is awesome. I sorted a week or two ago, then manually put movies above TV series for now, & seeing the ratings move around .1 or .2 is something I've only just noticed. I wonder what ratings in my queue will look like if I sort just before the new ratings engine kicks on.

I'd worry about unconsciously formed, unknown cliques influencing ratings on movies, per some experiment I read last year about music, but I can't find that & I kind of shrug.

Oh, if you want contentious movies, there's a list. "Napoleon Dynamite's" #5.
posted by Pronoiac at 8:03 PM on June 26, 2009


Apparently I confuse Netflix, because at some point it's just stopped recommending me movies. The really odd thing is that if I add a movie after reading a review (say, at Ebert's site), I will get a "movies most like this one" screen that sometimes includes movies with 3.5+ "predicted ratings." Those movies will never be recommended to me, though.
posted by sonic meat machine at 11:13 PM on June 26, 2009


Is the code open? If so, it would seem like someone could set up a standalone service à la FreeDB.
posted by Mitheral at 11:32 PM on June 26, 2009


scody: "Does this mean Netflix will finally stop recommending Benjamin Button to me?"

1) This doesn't mean that Netflix will stop recommending movies that you've prejudged. I don't think there's any reliable way to predict those kinds of things, without inputting a bunch of outside-the-system data. The goal here wasn't to recommend movies you were interested in renting, but rather to recommend movies that you would rate highly once you did see them. A thin distinction, perhaps, but I know I can think of a few movies that I'm never interested in seeing that I honestly would probably enjoy.

2) There's always been a "not interested" button below the ratings stars on Netflix. That seems like a good way to get Button off your recommendations even under the current regime.
posted by Plutor at 7:17 AM on June 27, 2009


What if a bunch of assholes jumped onto Netflix and started choosing the opposite of the movie they'd most like to see, just to screw up the performance of the algorithm? Or, say, made every fourth pick completely at random?

Also, in theory you could win the million-dollar prize if you had 10,000 guys who could each check with your team's algorithm before making an identical choice. And you could do that with 10 guys, if it just happened to be the 10 guys that Netflix was using to judge your algorithm's performance. Instead of trying to perfect an algorithm, it might be easier to just try hacking the contest.

Here's the thing: for me, Netflix's algorithm actually suggests movies that I've seen and that I know I don't like. So I guess all this reverence for Netflix sort of rubs me the wrong way. But maybe I'm just frustrated because there is no real right answer. I bet Michael Jackson's The Wiz is really popular today. I bet it wasn't last week. And what happens if, ultimately, a group of 10 people will always choose some different movies than the previous group of 10 people? Then, really, there's a certain element of randomness to all of this.
posted by destinyland at 8:52 AM on June 27, 2009


What if a bunch of assholes jumped onto Netflix and started choosing the opposite of the movie they'd most like to see, just to screw up the performance of the algorithm? Or, say, made every fourth pick completely at random?

1) They're already looking at some pretty noisy data. People rent the wrong movie by mistake. They rent movies they expect to hate because they're completist fans of some actor or director. They rent movies for their kids, for their friends, for school assignments, blah blah blah. So one of the challenges for anyone making this sort of algorithm is to deal with that kind of noise, and one more source of noise won't kill anyone who's got a basically sound approach.

2) You'd need a whole lot of assholes. Their data set, if I understand correctly, is every single movie any of their customers has ever rated. You're not going to get enough volume to reach even a fraction of a percent of that with a couple of bored /b/tards on a long weekend.
posted by nebulawindphone at 11:11 AM on June 27, 2009


True. Although Netflix only has something like 35,000 titles (according to Wikipedia).

If a thousand guys all gave five stars, say, only to movies whose titles start with the letters in BCHAN -- and then tipped off one of the development teams -- wouldn't that team then enjoy a competitive advantage?
posted by destinyland at 12:47 PM on June 27, 2009


If one b/tard writes a script that rates Every Movie Ever according to alphanumeric taxicab distance from bchan, and then distributes that script to a thousand other b/tards, then suddenly you have some extremely broad (though not deep) sample biasing. Remember that flap where they destroyed the Time most-important-people poll?
posted by kaibutsu at 5:37 PM on June 27, 2009


The scores are based on samples from before the beginning of the contest. So you'll also need a time machine to rig this.
posted by Pronoiac at 7:17 PM on June 27, 2009


This is an incredible contest, and as someone who pretty much does statistics for a living, I would really love to see how they did it.

I've done a lot of these forecasting models, and yeah, a lot of times it's a series of gradual steps, with a few steps backwards every once in a while, which gets you places.
posted by gushn at 9:34 PM on June 27, 2009


The finish is going to be more interesting than I had thought because all the big players NOT on the current leader team are starting to group up to try to catch them. The second place team "Grand Prize Team" which is a big coalition is up to 9.90%.

Of course it's always possible that the BellKor's Pragmatic Chaos team has been refining their entry and will improve on 10.05%.
posted by Justinian at 12:11 AM on July 9, 2009


Anybody who is at all interested in this contest should probably go and take a look at the current leaderboard with less than 24 hours to go.

I strongly advise it.
posted by Justinian at 7:51 PM on July 25, 2009


I feel mildly, and admittedly only very provisionally, vindicated about my game-theoretical witterings up-thread.
posted by cortex at 9:45 PM on July 25, 2009


Re: a comment in your PM, cortex: I believe this doesn't push the deadline back at all. The 30 days end tomorrow midday even though BellKor et al. were beaten with less than 24 hours remaining.

I wonder if The Ensemble and BellKor et al. will team up overnight so that the winning "team" is essentially a giant conglomerate of every single serious participant in the contest. Everybody's a winner!
posted by Justinian at 10:20 PM on July 25, 2009


"In accord with the Rules, teams have thirty (30) days, until July 26, 2009 18:42:37 UTC, to make submissions that will be considered for this Prize."

*checks*

Last chance in just over 18 hours.
posted by Pronoiac at 10:39 PM on July 25, 2009


Guess I should get started. I'm such a procrastinator!
posted by Justinian at 11:48 PM on July 25, 2009


Well, looks like BellKor put in one last submission a couple of minutes ago, but it wasn't enough to bump them back to the top.

Bear in mind that the scores reported on the leaderboard are for the "quiz" data, which is a random 50% of the qualifying data set. The official contest results will be determined by the results on the other half, the "test" data, for which the scores have never been published. So this could still go either way.
posted by teraflop at 11:28 AM on July 26, 2009


BellKor actually put in a submission that tied The Ensemble's previous submission of 10.09, but sadly for BK the Ensemble team also put in a new submission that squeaked out an extra 0.01 to maintain that lead.
posted by Justinian at 3:49 PM on July 26, 2009


Huh. It looks like Ensemble's 0.01 final lead (on the quiz data set) came with about 4 minutes to go in the competition. Crazy.

But yeah, the final result is decided based on the test data set. Who knows what's going to happen.
posted by Justinian at 3:57 PM on July 26, 2009



