Posted 2 years ago

Interesting Map Reduce

Map Reduce has become a de-facto standard for parallelizing programs. Thanks to Google for introducing it. Also, I am grateful to the course “Dealing with Massive Data” by Prof. Sergei Vassilvitski  

I wish to use this blog to post the Map Reduce versions of algorithms that I would propose for a variety of problems.

What is Map Reduce ? 

Map Reduce programming model doesn’t maintain state as it was built based on inspirations drawn from Functional Programming paradigm. 

As the name suggests, Map Reduce consists of a Map step, a Reduce step and     an intermediate shuffler/sorter step. 

  • The Mappers reads the input stream (may be a key,value pair) and after processing outputs a key, value pair.
  • The intermediate shuffler sorts the key, value pairs and passes it to Reducer in the form of a Key and associated list of values
  • The Reducers act upon a key and its associated list of values and produce the result.

The Mapper / Reducer step accounts for the scaling/ parallelism  and that is where the hardware power is utilized. Ex. Reducers can simultaneously act upon different key, value pairs.

The above three steps together is called ONE Map Reduce round. Map Reduce designers design and determine the number of rounds and the behavior of mapper / reducer in every round.

for further readings :  click here or Google / Bing / Yahoo / Wiki

How do I do Map Reduce ? 

I did my initial map reduce programming in python. Yelp has given us a brilliant API called mrjob. The special features of mrjob include that you can 

1. Test your map reduce code on a single machine (locally)

Python is the only requirement.

2. On Hadoop configured clusters

HADOOP_HOME should have been exported in the environment

3. Run it on Amazon EC2 machines 

AMAZON AWS SECRET / public keys should be exported and BOTO python egg, an Ec2 client should have been installed.

Most of the times , people start learning Map Reduce programming model with counting the number of occurrences of words. To be different, Let us start with writing a map reduce code that can generate adjacency list from the list of edges(indicating unweighted undirected edge) that constitute the graph.

Let the input file contains list of edges in the form of comma seperated values.

1,2

2,3

2,4

Algorithm/ python code 

import mrjob;

from mrjob.job import MRJob;

class MRAdjList(MRJob):

  def mapper(self,key, line):

    nodes = line.split(‘,’);

    yield (nodes[0],nodes[1]);

    yield (nodes[1],nodes[0]);

 def reducer(self, key, values):

   adjList = list(values);

   node = key;

   yield node, adjList;

if __name__ == “__main__”:

   MRAdjList.run();

Input File

1,2

3,4

2,3

2,4

1,3

output 

“1” [“2”, “3”]

“2” [“1”, “3”, “4”]

“3” [“1”, “2”, “4”]

“4” [“2”, “3”]


Execution python adjList.py < input.txt > output.txt. Here mapper makes two yields as the graph is undirected. In case of directed graphs , single yield should be sufficient.

To be continued with more Examples :-)

Raghavan Muthuregunathan.

Learning can never reach a saturation point - An Indian Godess :)