Editorials

Map Reduce Explained

SSWUG TV
With Stephen Wynkoop
What, Mango on the Windows Phone?
Check it out on SSWUG TV.

Map Reduce Explained

David shares more insight into the history of the Map Reduce pattern. His practical explanation should provide some ideas for how you can utilize this pattern in your own systems.

David Writes:

The description of Map and Reduce is not quite right, as the traditional Map requires the input to already be “chunked” for it (though MapReduce does not; more on that later).


Map and Reduce come from functional programming, specifically Lisp. In JavaScript, Map corresponds to the Array object’s map method.


Basically, Map takes an array of “something”s and a function to operate on each “something”, and produces an array of “something else”s. Since each execution of the mapping function is independent of every other execution, this is inherently parallel.
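
To make that concrete, here is a small JavaScript sketch using the Array object’s map method; the names array is invented purely for the example:

var names = ['Bob', 'Alice', 'Carol'];
// Each call of the mapping function sees one element in isolation,
// which is why the calls could, in principle, run in parallel.
var lengths = names.map(function(name) {
    return name.length;   // a "something" (string) becomes a "something else" (number)
});
// lengths is now [3, 5, 5]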


Reduce takes an array of “something”s and a function that takes two such “something”s and combines them into one. This reducer function is applied repeatedly, pair by pair, until all of the elements have been combined into a single result. Reduce can only be parallelized if the reduction operation is associative and commutative, so that the elements can be grouped and combined in any order:
http://mathworld.wolfram.com/Commutative.html
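
And a matching sketch of Reduce, summing an array of numbers (again, the data is made up for the example):

var total = [3, 5, 5].reduce(function(runningTotal, next) {
    return runningTotal + next;   // combine two values into one
});
// total is now 13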


For instance, determining the number of people named ‘Bob’ who have pogo sticks can be written in SQL as

select COUNT(id) from myTable
where firstName = 'Bob' and hasPogoStick = 1


and in the Map-Reduce pattern (using JavaScript, since its syntax will be familiar to most readers) as

myTable.map(function(record) {
    if (record.firstName == 'Bob' && record.hasPogoStick) {
        return 1;
    }
    return 0;
}).reduce(function(recA, recB) {
    return recA + recB;
});


Now, both traditional Map and Reduce expect fully-formed arrays as input before they start. Both can be parallelized, but taken literally that means not a single reducer call happens until every mapper call has finished. If we make sure our reduction is commutative, that doesn’t have to be true: mapped results can be combined in whatever order they become available. And even if the reduction isn’t commutative, the reducer can start as soon as the first two results (in order) are ready.
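
As a rough sketch of that idea, assuming addition as the reducer and invented worker results: partial results can be folded together in whatever order they arrive, without waiting for every mapper to finish.

// Partial results produced by mapper calls, arriving in no particular order.
var partials = [];

function onMapperResult(value) {
    partials.push(value);
    // With a commutative (and associative) reducer such as +, whatever has
    // arrived so far can be collapsed immediately.
    while (partials.length > 1) {
        partials.push(partials.pop() + partials.pop());
    }
}

onMapperResult(4);   // nothing to combine yet
onMapperResult(2);   // the reducer runs as soon as the second result lands
onMapperResult(7);
// partials[0] now holds the combined total so far: 13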


Further, if we’re streaming in a large amount of data, we want to start operating on it before the entire array has loaded (Map, conceptually, doesn’t depend on any other element of the input), and we may not be able to hold all of the data in memory on the server in the first place. The data may also be continuously updated with new results. If all we care about is the reduced result, and each new piece can simply be applied to the previous result, why not constantly stream the new data in as it appears, map it into an easy-to-process element, and reduce it with the previous result in near real time? Finally, perhaps the input data can feed multiple outputs, so why not have one stage that pre-parses the input into several map-reduce “streams”?
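
A minimal sketch of that near-real-time idea, reusing the Bob-with-a-pogo-stick example (the record shape and function name are invented here):

// Running result from everything seen so far; no full array is ever built.
var bobsWithPogoSticks = 0;

function onNewRecord(record) {
    // map: turn the record into an easy-to-process element (0 or 1)
    var mapped = (record.firstName == 'Bob' && record.hasPogoStick) ? 1 : 0;
    // reduce: fold it into the previous result immediately
    bobsWithPogoSticks += mapped;
}

onNewRecord({ firstName: 'Bob',   hasPogoStick: true });
onNewRecord({ firstName: 'Alice', hasPogoStick: true });
// bobsWithPogoSticks is now 1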


That added complexity is essentially what Google’s MapReduce is: it turns the two-step map-reduce pattern into six steps, wrapping input and output streams (instead of simple arrays) and sharding and sorting control (for cases where the reducer is not commutative) around the traditional map and reduce functions.
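
Roughly, the six stages look like the outline below; the stage names follow Google’s MapReduce paper, and this is only a sketch of the pipeline’s shape, not the actual API:

// 1. Input reader  - split the incoming stream into records ("chunking" the input)
// 2. Map           - turn each record into an easy-to-process element
// 3. Partition     - shard the mapped elements across the reducer machines
// 4. Sort/compare  - order each shard (this is what rescues a non-commutative reducer)
// 5. Reduce        - combine each shard's elements into one result
// 6. Output writer - stream the reduced results back out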


MapReduce is perfect for processing a torrential storm of data into condensed, understandable chunks when you don’t actually need to keep the raw data around after it’s processed. You can apply the technique to traditional databases, but the conceptual and networking overhead involved means it is often overkill there; just look at the size difference between the SQL and the JavaScript. Even with a ternary operator for the mapper function, the JS is longer. On the other hand, if the input is not already in a database, the overhead of standing up a database, parsing the input into relational form, performing the reporting operations (that’s really what this is), and then dumping the database afterwards can be much higher than running a map-reduce, even on a small set of data.
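
For reference, the ternary version of the mapper mentioned above would look something like this, and it is indeed still longer than the two-line SQL:

myTable.map(function(record) {
    return (record.firstName == 'Bob' && record.hasPogoStick) ? 1 : 0;
}).reduce(function(recA, recB) {
    return recA + recB;
});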

This ends our discussion of NoSQL and optimization techniques for now. If you have other comments, questions or insights you’d like to share, be sure to forward them to btaylor@sswug.org.

Cheers,

Ben


Featured Article(s)
Ten Breakthroughs That Changed DB2 Forever (Part 1)
Today, DB2 is a high-speed, enterprise DBMS with unparalleled transaction processing capabilities. But it wasn’t always that way. These innovations helped push DB2 to the top.

Featured White Paper(s)
Structuring the Unstructured: How to Dimensionalize Semi-Structured Business Data
Written by Interactive Edge

The Business Intelligence industry … (read more)