pseudofish

powered by text files

technology

MapReduce Examples

In investigating MongoDB I decided to explore MapReduce to analyse the posts from a heap of RSS feeds. However, the MongoDB documentation is fairly basic. This post has a bit more detail on more complex map/reduce functions.

Looking outside of MongoDB, there are a few other groups working with MapReduce. Google started the whole thing off, and has some good resources:

Some examples of using MapReduce include:

The reverse web-link graph sounds interesting:

The map function outputs pairs for each link to a target URL found in a page named “source”. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: .

There is also some work on using MapReduce to allow for machine classification to run in parallel. I also found a presentation on Large Scale Data Analysis that was helpful.

The main thing to get my head around is that the map function can return more complicated results than just a simple count. The reduce function then just glues the results together in some way.

And then there is the idea of multi-pass MapReduce. That could get interesting.

For now, I’m doing a lot of reading. The next step will be start to put into practice some of these ideas for analysing data. Actually, the next step is collecting the data, but that’s on a different thread.