MapReduce
MapReduce is a buzzword that has been hard to miss for the last few years. Google's use of it to scale made the concept famous, and Google has even patented the technique.
This is one of the more interesting data storage techniques, one where a software development pattern and data persistence have in a sense merged into a single concept. Hadoop, an open source storage engine built on a distributed file system and integrated with MapReduce capability, is an example of this tight integration.
MapReduce is a technique for breaking up work so that it can be distributed to multiple CPUs for processing. SETI@home is a great example of the map step: it takes large chunks of radio noise from space and distributes them to a large grid of computers (your PC, if you choose to participate) for processing, and the results are returned to the SETI host to be integrated with the results from other machines.
MapReduce also takes chunks of work to be processed. A map program processes a large data source, breaking it down into pieces to be handled by other processors. Mapping may occur in more than one pass, so the output of an initial map may become the input for a more detailed mapping pass.
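To make the map step a bit more concrete, here is a rough single-machine sketch in Python. The map_page function and the (word, 1) pairs are just my illustration, not any particular framework's API:

    # A minimal map-step sketch: turn one input record into key/value pairs.
    # Plain Python for illustration only; not any framework's API.
    def map_page(url, text):
        """Emit a (word, 1) pair for every word on a page."""
        for word in text.lower().split():
            yield (word, 1)

    pairs = list(map_page("http://example.com", "to be or not to be"))
    # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]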
Once the mapped output is available, reduction begins. In the reduce phase, all the data that was broken out into buckets during the map phase is processed across multiple processors. After reduction is complete, the results are gathered back into meaningful data.
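Here is the same idea sketched for the reduce side, again just illustrative Python with names of my own choosing (shuffle, reduce_word): the mapped pairs are grouped into buckets by key, and each bucket is collapsed to a single result.

    from collections import defaultdict

    def shuffle(pairs):
        """Group mapped key/value pairs into buckets, one bucket per key."""
        buckets = defaultdict(list)
        for key, value in pairs:
            buckets[key].append(value)
        return buckets

    def reduce_word(word, counts):
        """Collapse one bucket into a single (word, total) result."""
        return (word, sum(counts))

    # Output of the map phase above (in practice this arrives from many machines).
    pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
    results = dict(reduce_word(w, c) for w, c in shuffle(pairs).items())
    print(results)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}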
For this data to be communicated across processors, and even across machines on the network, technologies such as Hadoop have evolved. Hadoop stores data as key/value pairs: a simple hash can act as the key, and the value can be anything that can be stored on a file system. The buckets produced during mapping become files of key/value pairs, available to the entire cloud.
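As a rough sketch of that idea, one bucket per file, named by a hash of the key, might look like the following. The directory layout, file names, and JSON format here are my own illustration, not how Hadoop actually lays out its storage:

    import hashlib
    import json
    import os

    def write_bucket(out_dir, key, values):
        """Persist one map bucket as a small key/value file.

        The file name is a hash of the key, so any worker in the cluster
        can find the bucket it has been assigned without a central index.
        """
        os.makedirs(out_dir, exist_ok=True)
        name = hashlib.sha1(key.encode("utf-8")).hexdigest()
        with open(os.path.join(out_dir, name + ".json"), "w") as f:
            json.dump({"key": key, "values": values}, f)

    # Each bucket produced by the map phase becomes a file other machines can read.
    write_bucket("buckets", "to", [1, 1])
    write_bucket("buckets", "be", [1, 1])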
This file storage method fits quite nicely with the MapReduce process. For companies with large amounts of data to chunk through, such as web URLs, keywords in a page's metadata, or even the full text of web pages, MapReduce combined with key/value pair storage becomes an optimal way to seek, sort, map, and reduce data.
Wikipedia has a great article on the subject of MapReduce, along with some great links comparing this storage method to structured storage.
As you can see, with the increased speed of networks and multiple cores or processors per machine, the ability to distribute work across all of those processors is enhanced by techniques such as MapReduce.
Are you working with parallel processing techniques? How about sharing your experience with us? I’d love to hear about frameworks or patterns you have found useful. Drop me a note on Facebook, Twitter, or e-mail at btaylor@SSWUG.org.
DBTechCon
It’s coming April 20th–22nd. Be sure to get registered before it’s too late.
Cheers,
Ben
Featured Article(s)
Report on Reporting
Is your Reporting server a bit of a mystery? We all know that this is something your company takes very seriously. The purpose of this article is to review how much information we can gather from your reporting server without having to create an elaborate solution.
Featured White Paper(s)
Upgrading to Microsoft SQL Server 2008 R2 from Microsoft SQL Server 2008, SQL Server 2005, and SQL Server 2000
More than ever, organizations rely on data storage and analysis for business operations. Companies need the ability to deploy… (read more)
Featured Script
Check Free Space on Server
Checks disk space on the server… (read more)