Lately, I have become extremely interested in MapReduce, specifically its open source implementation in Hadoop.
From Wikipedia (MapReduce):
MapReduce is a software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers. This framework is largely taken from map and reduce functions commonly used in functional programming.
That description, though quite accurate, does not do justice to MapReduce.
A MapReduce framework, in essence, allows you to distribute the processing of a large amount of data over a set of commodity systems. The framework takes care of all the underlying issues of managing the distributed system and provides you with a simple yet powerful interface for processing the data, one reminiscent of the map and reduce combinators from functional programming languages like Lisp.
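To make the map and reduce idea a little more concrete, here is a minimal sketch of the classic word count example in plain Python. It runs in a single process, so it shows only the shape of the programming model and none of the distribution or fault tolerance a real framework provides. The function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen for illustration; they are not part of Hadoop's or Google's API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key. A real framework does this across machines;
    here it is just an in-memory dictionary (an assumption of this sketch)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all the values seen for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

The point to notice is that you only write map_phase and reduce_phase; in a real MapReduce system, everything in between (splitting the input, moving the pairs around, retrying failed machines) is the framework's job.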
Here are a couple of quick links to get you started on MapReduce:
Introduction to Parallel Programming and MapReduce: This tutorial covers the basics of parallel programming and the MapReduce programming model.
MapReduce: Simplified Data Processing on Large Clusters: The original paper by Jeff Dean and Sanjay Ghemawat of Google describing the framework.
Hadoop
MapReduce is a framework; Hadoop is an open source implementation of it. In addition to MapReduce, Hadoop also implements a distributed file system akin to the Google File System (Google’s distributed file system), which works very well with MapReduce. HadoopMapReduce describes Hadoop’s implementation of MapReduce.
Criticisms
Like any new paradigm, MapReduce is starting to find its critics. MapReduce: A major step backwards is an article recently published by the well-respected database gurus David J. DeWitt and Michael Stonebraker (Stonebraker was the architect behind Ingres and Postgres, among other things). It argues that using MapReduce is like going back to the stone age, and that it amounts to forgetting many of the lessons learned over decades of work with relational database management systems (RDBMSs).
In my opinion, the complaints are misplaced and stem from looking at MapReduce from the wrong perspective. The authors see MapReduce as a replacement for an RDBMS, albeit in some specific cases, and complain that it does not employ some of the most elegant features of RDBMSs: schemas, indexing, keeping the data and schema separate, etc. ‘Databases are hammers; MapReduce is a screwdriver’ and ‘Relational Database Experts Jump The MapReduce Shark’ are a couple of well-written responses to this particular piece of criticism.
I will be posting some interesting stuff on MapReduce in general and Hadoop in particular on this blog, so keep checking back.