MapReduce: running large-scale computations in parallel
September 8, 2006 12:02 AM

A look at an algorithm Google uses to run large-scale computations in parallel on thousands of cheap PCs: MapReduce. Via Joel on Software.
posted by russilwvong (16 comments total)
 
(reduce #'hype (map 'list #'grumble '(everything old is new again)))
posted by oncogenesis at 12:24 AM on September 8, 2006


I read this paper a couple of years ago, and implementing something that runs on the servers is fairly simple. You just write a map function and a reduce function (usually the simpler of the two), and that's it. I know that many Google engineers write MapReduce programs regularly, so it's not just large-scale problems and top engineers that use them.

It's interesting how many problems can be reduced to MapReduce form. Lots of real-world problems, as well as your standard textbook problems, can be easily mapped to MapReduce (no pun intended).

I'm also surprised Google would publish a paper on their internal workings, since everything is so secretive there. Google isn't really the publishing machine that Microsoft Research or IBM Research are.
posted by tasty at 12:30 AM on September 8, 2006
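The division of labor tasty describes is easy to sketch. Below is a toy single-machine word count in Python (all names hypothetical): the user supplies only map_fn and reduce_fn, while run_mapreduce stands in for the framework's shuffle-and-group phase, which the real system distributes across thousands of machines.

```python
from collections import defaultdict

def map_fn(doc):
    # User-supplied map: emit a (word, 1) pair for every word.
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    # User-supplied reduce: sum the counts emitted for one word.
    return word, sum(counts)

def run_mapreduce(docs):
    # Toy driver standing in for the framework's shuffle/sort phase,
    # which groups all intermediate values by key before reducing.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())
```

Everything outside the two user functions (partitioning, retries, machine failures) is the framework's problem, which is exactly why so many engineers can use it.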


Hadoop is an open-source implementation of MapReduce as a distributed system. I've never tried it, but the docs look promising, and it was pioneered by Doug Cutting, the same genius behind Lucene and Nutch.

Sure, the core model behind MapReduce is nothing new (as the name itself acknowledges), but I don't think anyone has ever implemented and used a distributed system like MapReduce at the scale, and with the effectiveness, that Google has. It's very powerful.

For a good example of an application of MapReduce, see the Sawzall paper. Google analyzes log files via MapReduce. Really big ones. Very fast. It was one of my favourite things I learned at Google.
posted by Nelson at 3:13 AM on September 8, 2006


You were supposed to keep this secret!
posted by furtive at 3:19 AM on September 8, 2006


Any programmer worth his salt knows how to write programs using functional programming techniques, and takes them for granted.

In his article, Joel seems to be introducing them as if they're a new concept or something, using a paper that's about grid computing, not functional programming, as his reference.

I've never really liked Joel's writing, and his software engineering calibre is most of the reason for my distaste. His article on exceptions made me want to scream and shout.
posted by Jerub at 3:21 AM on September 8, 2006


Just to be clear: 'map and 'reduce are nothing new in computing, although if you've only worked in C-like languages there's often a fun moment of revelation when you understand how powerful these functional primitives are.

But an easily usable map/reduce distributed across thousands of machines is something interesting and very useful. And it was certainly 'new' for me, working at Google, to discover that mapreduce works well for so many problems. Parallel computing is all about finding programming paradigms that make it easy for sequentially-thinking humans to efficiently use parallel computers. And it turns out that mapreduce works pretty well. That's cool.
posted by Nelson at 3:59 AM on September 8, 2006
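Nelson's distinction (the primitives are old; the parallel harness is the news) can be illustrated even on one machine: each map call is independent, so it can be fanned out to a pool of workers, and the reduce folds the partial results back together. A Python sketch with hypothetical names, using a thread pool in place of a cluster:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def square(x):
    return x * x

def parallel_sum_of_squares(xs, workers=4):
    # The map step is embarrassingly parallel: each element is
    # independent, so a framework can fan it out to workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        squared = list(pool.map(square, xs))
    # The reduce step folds the partial results into one answer.
    return reduce(operator.add, squared, 0)
```

The calling code never mentions threads or machines; that opacity is what lets sequentially-thinking humans use the parallel hardware.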


I dunno, Jerub, I was actually having a discussion with some developers at my work a few days ago about this very gripe:
I think the reason programmers in C/C++/Java style languages have been attracted to exceptions is simply because the syntax does not have a concise way to call a function that returns multiple values
I just can't believe the fine folks at Sun haven't come up with a method for returning multiple data types that doesn't use some "bundling" hack (for example, returning an array of Objects). It would sure make my life easier.
posted by Civil_Disobedient at 4:01 AM on September 8, 2006
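For contrast, here is what the "concise way to call a function that returns multiple values" looks like in a language with first-class tuples (a Python sketch, purely illustrative): the callee returns several values and the caller unpacks them, with no wrapper class or Object[] bundling.

```python
def parse_port(text):
    # Return both a success flag and a value in one expression;
    # no wrapper object or array-of-Objects "bundling" needed.
    try:
        return True, int(text)
    except ValueError:
        return False, 0

ok, port = parse_port("8080")  # tuple unpacking at the call site
```

With this available, the error path is just another return value rather than something that has to be smuggled out via an exception.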


How is this different from graph reduction as an approach to expression evaluation? Simon Peyton Jones did some interesting work on techniques for 'executing' functional programs by reducing the graph to a single node using parallelism. While nobody in their right mind would want to write any code of significance in Miranda (a functional language), it turns out to be reasonable to compile Lisp/Scheme into Miranda and then graph-reduce the program. Further, the compiler that compiles Scheme into Miranda is fairly easy to write in Scheme, which means it can compile itself into Miranda and is now a parallelized compiler. Effectively, if you can compile a program into Miranda, then you get scaling parallelism.
posted by plinth at 5:33 AM on September 8, 2006


I just can't believe the fine folks at Sun haven't come up with a method for returning multiple data types that doesn't use some "bundling" hack

You can't? :P
posted by sonofsamiam at 6:33 AM on September 8, 2006


I should point out that Google's Map and Reduce don't correspond precisely to the traditional definitions of those functions. This is a paper by Ralf Lämmel that provides a more rigorous description of the MapReduce programming model. (Warning: Haskell.)
posted by teferi at 7:39 AM on September 8, 2006
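teferi's caveat can be made concrete: the traditional reduce folds one whole list into a single value, whereas Google's Reduce is invoked once per intermediate key, after the framework has grouped all values sharing that key. A Python sketch with made-up data:

```python
from functools import reduce
from collections import defaultdict

# Traditional reduce (the Lisp/functional sense): fold one list
# into one value.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

# Google-style Reduce: called once per intermediate key, on the
# list of values the framework grouped under that key.
pairs = [("a", 1), ("b", 2), ("a", 3)]
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
per_key = {k: sum(vs) for k, vs in grouped.items()}
```

So Google's Reduce is closer to a grouped fold over a multimap than to the classic single-list reduce the name suggests.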


I just can't believe the fine folks at Sun haven't come up with a method for returning multiple data types that doesn't use some "bundling" hack

Anonymous structs are too good for you people.
posted by public at 8:04 AM on September 8, 2006


Unless you mean a function with multiple different possible return types, in which case...

Anonymous unions are too good for you people.
posted by public at 8:05 AM on September 8, 2006


Or virtual classes...
posted by public at 8:06 AM on September 8, 2006


I think he means something like Common Lisp's (values rv1 rv2 ...) syntax, public.
posted by teferi at 8:14 AM on September 8, 2006


cool. we need more geek posts and threads.
posted by dminor at 3:03 PM on September 8, 2006


Anonymous unions are too good for you people.

Uh, Java doesn't have unions.
posted by Civil_Disobedient at 4:09 PM on September 8, 2006



