Nitendra Gautam

Introduction to the Hadoop MapReduce Framework

The Hadoop MapReduce framework is a big data processing framework that consists of the MapReduce programming model and the Hadoop Distributed File System (HDFS).

The MapReduce framework is a parallel programming model for processing huge amounts of data. It splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

MapReduce applications specify the input/output locations and supply map and reduce functions via implementations of the appropriate Hadoop interfaces, such as Mapper and Reducer. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (JAR/executable, etc.) and its configuration to the JobTracker. The JobTracker then assumes responsibility for distributing the software/configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client.

The MapReduce framework operates exclusively on <key, value> pairs, so it views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. Computation can occur both on data stored in the file system and on structured data stored in databases.

The following are the main MapReduce components.

Job:

A job contains the compiled binary with our mapper and reducer functions, the implementations of those functions, and the configuration information that drives the job.

Input Format:

Determines how input files are parsed into the MapReduce pipeline. Different inputs have different formats; for example, images use a binary format, and there are also database and record-based input formats.

MapReduce Phases

Split Phase (Input Splits):

Input data is divided into input splits based on the InputFormat. Each input split corresponds to one map task, and the map tasks run in parallel. In this phase, data stored in HDFS is split and sent to the mappers. The default input format is the text format, in which the input is broken up line by line.

Map Phase:

Transforms the records in each input split into key/value pairs based on user-defined code. The mapper emits intermediate key/value pairs, which are later grouped by key for the reducers.
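A minimal sketch of such a mapper, assuming the standard word-count example and the new org.apache.hadoop.mapreduce API (the class name is illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input record is one line; the key is the byte offset of the line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }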

Combiner Phase:

It is a local reducer which runs after the map phase but before the shuffle and sort phase. It is an optimization, as it saves network bandwidth by aggregating map output on the node that produced it. Generally it reuses the same reduce code.

A combiner is like a mini reducer function that allows us to perform a local aggregation of the map output before it is transferred to the reduce phase. Basically, it is used to optimize network bandwidth usage during a MapReduce job by cutting down the amount of data that is transferred from the mappers to the reducers.
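For example, in word count the reduce function (summing partial counts) is associative and commutative, so the same reducer class can be reused as the combiner. A hedged driver fragment, assuming the word-count classes sketched in this post:

    // Reuse the reducer as a combiner; this is only valid when the reduce
    // logic is associative and commutative (summing counts is, averaging is not).
    job.setCombinerClass(WordCountReducer.class);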

Shuffle and Sort Phase:

Moves the map outputs to the reducers and sorts them by key. It is carried out on the worker (data) nodes.

Shuffle: partitions and groups the data by key.

Sort: sorts the data and sends it to the reducers.

The shuffle and sort phase consumes most of a job's network bandwidth and uses the worker (data) nodes to shuffle and sort the data. A custom partitioner can control which reduce partition each key is sent to, as sketched below.
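A minimal partitioner sketch, assuming the word-count key/value types used above; it simply mirrors the default hash-partitioning behaviour so that all records with the same key reach the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Same behaviour as the default HashPartitioner: records with the
            // same key always go to the same reduce partition.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }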

Reduce Phase (Reducers):

It aggregates the key/value pairs based on user-defined code. The reducer receives the data already sorted by key; for each key, the reduce function receives an iterable of input values and combines them, typically returning a single output value per key.
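A minimal word-count reducer sketch, continuing the example above:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                            // combine all counts for this key
            }
            context.write(key, new IntWritable(sum));      // emit (word, total count)
        }
    }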

(input) -> map(k1, v1) -> list(k2, v2) -> shuffle/sort -> reduce(k2, list(v2)) -> list(k3, v3) -> (output)

Figure: Hadoop MapReduce Framework (Hadoop in Action)

Configuration parameters required to run a MapReduce job

The main configuration parameters that users need to specify for the MapReduce framework are given below; a sample driver class that wires them together follows the list.

  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format of data
  • Output format of data
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes
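A minimal driver sketch that sets these parameters, assuming the word-count mapper and reducer classes sketched earlier (input and output paths are passed as command-line arguments):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);           // JAR containing mapper, reducer and driver
            job.setMapperClass(WordCountMapper.class);          // class containing the map function
            job.setCombinerClass(WordCountReducer.class);       // optional local reducer
            job.setReducerClass(WordCountReducer.class);        // class containing the reduce function

            job.setInputFormatClass(TextInputFormat.class);     // input format of data (the default)
            job.setOutputFormatClass(TextOutputFormat.class);   // output format of data (the default)

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // job's input location in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job's output location in HDFS

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The job would then be packaged into a JAR and submitted with something like hadoop jar wordcount.jar WordCountDriver /input /output.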

Joins using MapReduce Framework

There are three types of joins that can be used to join tables in MapReduce: the reduce-side join, the map-side join, and the memory-backed join.

Map-Side Join

A map-side join performs the join before the data reaches the map function. It has strong prerequisites that must be satisfied before data can be joined on the map side, listed below; a configuration sketch follows the list.

  • The data should be partitioned and sorted in a particular way.
  • Each input dataset should be divided into the same number of partitions.
  • Each dataset must be sorted on the same key.
  • All the records for a particular key must reside in the same partition.
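A hedged configuration fragment using the older org.apache.hadoop.mapred API, which ships a CompositeInputFormat for inputs that already satisfy these prerequisites; the paths and the "inner" join expression are illustrative assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    JobConf conf = new JobConf();
    conf.setInputFormat(CompositeInputFormat.class);
    // Build a join expression over two inputs that are partitioned and
    // sorted identically on the join key.
    String joinExpr = CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path("/data/customers"), new Path("/data/orders"));
    conf.set("mapred.join.expr", joinExpr);
    // The mapper then receives the join key together with a TupleWritable
    // holding the matching records from each input.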

Reduce Side Join

A reduce-side join occurs on the reducer side and is also called a repartitioned join or a repartitioned sort-merge join; it is the most commonly used join type. Because the join is performed on the reduce side, the data has to go through the shuffle and sort phase, which incurs network overhead. To keep it simple, the main concepts involved in a reduce-side join are listed below, followed by a sketch of the pattern.

  • Data source: refers to the data source files, often taken from an RDBMS.
  • Tag: every record is tagged with its source name so that its source can be identified at any point, whether in the map or the reduce phase.
  • Group key: the column used as the join key between the two data sources.
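A hedged sketch of the pattern follows; the dataset names (customers, orders), the comma-separated field layout, and the position of the join key are assumptions made for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Mapper: emit (joinKey, taggedRecord); the tag identifies the source file.
    class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String tag;

        @Override
        protected void setup(Context context) {
            // Assumption: the input file name (e.g. "customers.csv") serves as the tag.
            tag = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Assumption: the join key is the first comma-separated field.
            context.write(new Text(fields[0]), new Text(tag + "\t" + value));
        }
    }

    // Reducer: all records sharing a join key arrive together; separate them
    // by tag and emit the joined combinations.
    class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> customers = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if (parts[0].startsWith("customers")) {
                    customers.add(parts[1]);
                } else {
                    orders.add(parts[1]);
                }
            }
            for (String c : customers) {
                for (String o : orders) {
                    context.write(key, new Text(c + "|" + o));  // joined record
                }
            }
        }
    }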

Memory Backed Join

We use a memory-backed join for small tables that can fit in the memory of the worker (data) nodes; a sketch is given below. Among these join types, the reduce-side join is the most commonly used one, as it joins the tables on keys that are shuffled and sorted before going to the reducer; Hadoop sends identical keys to the same reducer, so by default the data is already organized for the join.
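A minimal sketch of a memory-backed (replicated) join, assuming a comma-separated small table shipped to each node, for example via the distributed cache described later in this post; the file name and field layout are illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MemoryBackedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> smallTable = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load the small table from a local (cached) file into memory.
            try (BufferedReader reader = new BufferedReader(new FileReader("small_table.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",", 2);
                    smallTable.put(fields[0], fields[1]);    // join key -> rest of record
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", 2);
            String match = smallTable.get(fields[0]);        // probe the in-memory table
            if (match != null) {
                context.write(new Text(fields[0]), new Text(fields[1] + "|" + match));
            }
        }
    }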

Map-side join and its advantages

A map-side join is a process in which two data sets are joined by the mapper, before the data reaches the reducers.

The advantages of using a map-side join in MapReduce are as follows:

  • Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stages.
  • Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.

InputFormat in Hadoop

InputFormat defines the input specifications for a job in MapReduce. It performs the following functions; a minimal custom InputFormat sketch follows the list.

  • Validates the input specification of the job.
  • Splits the input file(s) into logical instances called InputSplits. Each of these splits is then assigned to an individual mapper.
  • Provides an implementation of RecordReader to extract input records from these splits for further processing by the mapper.
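A minimal custom InputFormat sketch, assuming we want each file to be processed whole by a single mapper; the class name is hypothetical, while TextInputFormat and isSplitable() are standard Hadoop APIs:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // never split: each file becomes exactly one InputSplit
        }
    }

It would then be registered in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).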

Distributed Cache

There are situations when you are writing a MapReduce application and need to copy certain files such as configs, archives, libraries, and JARs to each of the nodes. To solve this problem, the Hadoop framework provides a tool called the distributed cache. With it we can distribute application-specific, large, read-only files efficiently to the slave nodes before any tasks for the job are executed on those nodes.

It is efficient in the sense that it copies the necessary files only once to each slave. It is designed to distribute a small number of medium-sized artifacts, ranging from a few MB to a few tens of MB.
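A hedged fragment showing the newer cache API (Hadoop 2.x); the HDFS path is an assumption, and the #lookup fragment creates a symlink named lookup in each task's working directory:

    import java.net.URI;

    // In the driver, before submitting the job:
    job.addCacheFile(new URI("hdfs:///apps/lookup/small_table.txt#lookup"));

    // In a mapper or reducer, the cached files can be listed with
    // context.getCacheFiles(), or read directly through the "lookup" symlink.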

Different Modes in Hadoop

There are three modes in which a Hadoop MapReduce application can be executed.

  • Standalone Mode: The default mode of Hadoop; it uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files. It is much faster than the other modes.

  • Pseudo-Distributed Mode (Single-Node Cluster): In this mode, you need configuration for all three files mentioned above. All daemons run on one node, so the master and slave nodes are the same.
  • Fully Distributed Mode (Multi-Node Cluster): This is the production mode of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes of a Hadoop cluster, and separate nodes are allotted as master and slaves.

References

Apache Hadoop

Gautam, N. "Analyzing Access Logs Data Using Stream Based Architecture." Master's thesis, North Dakota State University, 2018.