I wish to use this blog to post the Map Reduce versions of algorithms that I would propose for a variety of problems.
What is Map Reduce?
The Map Reduce programming model does not maintain state between steps; it was built on inspirations drawn from the functional programming paradigm.
As the name suggests, Map Reduce consists of a Map step, a Reduce step, and an intermediate shuffle/sort step in between.
The Mapper and Reducer steps account for the scaling and parallelism, and that is where the hardware power is utilized. For example, many reducers can act upon different (key, value) pairs simultaneously.
These three steps together are called ONE Map Reduce round. Map Reduce designers determine the number of rounds and the behavior of the mapper and reducer in every round.
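The three steps of a round can be sketched on a single machine in a few lines of Python. This is only an illustrative toy (the function and variable names are my own), not how a real framework distributes the work:

```python
from itertools import groupby

# a toy, single-machine sketch of ONE Map Reduce round
def map_reduce_round(records, mapper, reducer):
    # Map step: each record may emit any number of (key, value) pairs
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle/sort step: bring all values for the same key together
    pairs.sort(key=lambda kv: kv[0])
    # Reduce step: each key is reduced independently -- this is the
    # part a real cluster runs in parallel across machines
    return [out
            for key, group in groupby(pairs, key=lambda kv: kv[0])
            for out in reducer(key, [v for _, v in group])]

# e.g. summing values per key
result = map_reduce_round([("a", 1), ("b", 2), ("a", 3)],
                          mapper=lambda kv: [kv],
                          reducer=lambda k, vs: [(k, sum(vs))])
# result == [("a", 4), ("b", 2)]
```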
For further reading: search Google / Bing / Yahoo / Wikipedia.
How do I do Map Reduce?
I did my initial map reduce programming in Python. Yelp has given us a brilliant API called mrjob. The special feature of mrjob is that the same code can run in three ways:
1. Test your map reduce code on a single machine (locally)
Python is the only requirement.
2. Run it on Hadoop-configured clusters
HADOOP_HOME should be exported in the environment.
3. Run it on Amazon EC2 machines via Elastic MapReduce (EMR)
The AWS secret/public keys should be exported, and boto, a Python client for AWS, should be installed.
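For instance, once the example job later in this post is saved as adjList.py, mrjob selects the execution mode with its -r runner flag (the input/output file names here are placeholders):

```shell
# 1. locally, with nothing but Python installed
python adjList.py input.txt > output.txt

# 2. on a Hadoop cluster (HADOOP_HOME must be exported)
python adjList.py -r hadoop input.txt > output.txt

# 3. on Amazon Elastic MapReduce (AWS keys exported, boto installed)
python adjList.py -r emr input.txt > output.txt
```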
Most of the time, people start learning the Map Reduce programming model by counting the number of occurrences of words. To be different, let us start by writing map reduce code that generates an adjacency list from the list of edges (each indicating an unweighted, undirected edge) that constitute a graph.
Let the input file contain the list of edges in the form of comma-separated values, one edge per line.
Algorithm / Python code
from mrjob.job import MRJob

class AdjList(MRJob):
    def mapper(self, key, line):
        # each input line is one undirected edge "u,v"
        u, v = line.split(',')
        # emit the edge in both directions, since the graph is undirected
        yield u, v
        yield v, u

    def reducer(self, key, values):
        # after the shuffle, all neighbours of a node arrive together
        yield key, list(values)

if __name__ == "__main__":
    AdjList.run()
Sample output (for a graph with edges 1-2, 1-3, 2-3, 2-4, 3-4):
"1" ["2", "3"]
"2" ["1", "3", "4"]
"3" ["1", "2", "4"]
"4" ["2", "3"]
Execution: python adjList.py < input.txt > output.txt. Here the mapper yields twice per edge because the graph is undirected. In the case of directed graphs, a single yield is sufficient.
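To see why the double yield produces the table above, here is a small single-machine simulation of the same map/shuffle/reduce logic. The edge list is a hypothetical input, chosen to be consistent with the sample output shown:

```python
from collections import defaultdict

# hypothetical input: one undirected edge per line, comma separated
edges = ["1,2", "1,3", "2,3", "2,4", "3,4"]

# Map step: emit each edge in both directions
pairs = []
for line in edges:
    u, v = line.split(',')
    pairs.extend([(u, v), (v, u)])

# Shuffle step: group values by key
grouped = defaultdict(list)
for node, neighbour in pairs:
    grouped[node].append(neighbour)

# Reduce step: the grouped values are exactly the adjacency list
adj = {node: sorted(nbrs) for node, nbrs in grouped.items()}
# adj["2"] == ["1", "3", "4"]
```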
To be continued with more Examples :-)
Learning can never reach a saturation point - An Indian Goddess :)