What is MapReduce in Hadoop?
A program, such as a query-language job or a script, is used to process the stored data. When a client wants to process stored files, the JobTracker applies that program to the data in HDFS. The JobTracker sends a request to the NameNode asking which nodes hold the file's blocks, and the NameNode replies with the metadata. The JobTracker then assigns the work to the nearest TaskTrackers, which run the program on the blocks; this processing is called the Map phase.
If a TaskTracker cannot complete a task, for example because its DataNode goes down, the JobTracker reassigns the task to another TaskTracker. Every 3 seconds, each TaskTracker sends the JobTracker a heartbeat along with its number of free slots. If a TaskTracker stops sending heartbeats, the JobTracker does not declare it dead immediately, because the tracker may simply be running slowly under a heavy load; only after a configurable grace period (10 minutes by default in Hadoop 1.x) does the JobTracker assume it is dead and reassign its tasks.
If the nearest TaskTracker is already busy running many mappers, the JobTracker asks a TaskTracker farther from the data to run the task, trading away some data locality. The NameNode and JobTracker are single points of failure, so they must be made highly reliable.
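The heartbeat bookkeeping described above can be sketched as follows. This is a simplified single-process illustration, not Hadoop's actual Java implementation; the class and method names are invented for the example, and the expiry value mirrors Hadoop 1.x's default of 10 minutes.

```python
import time

HEARTBEAT_INTERVAL = 3   # seconds between TaskTracker heartbeats
EXPIRY_INTERVAL = 600    # grace period before a tracker is assumed dead (Hadoop 1.x default: 10 min)

class JobTrackerSketch:
    """Illustrative bookkeeping: when was each TaskTracker last heard from?"""

    def __init__(self):
        self.last_heartbeat = {}  # tracker name -> timestamp of last heartbeat
        self.free_slots = {}      # tracker name -> reported free slots

    def heartbeat(self, tracker, free_slots, now=None):
        # A TaskTracker reports in with its number of free slots.
        now = time.time() if now is None else now
        self.last_heartbeat[tracker] = now
        self.free_slots[tracker] = free_slots

    def expired_trackers(self, now=None):
        # Trackers silent for longer than the grace period are assumed dead,
        # and their tasks would be reassigned.
        now = time.time() if now is None else now
        return [t for t, ts in self.last_heartbeat.items()
                if now - ts > EXPIRY_INTERVAL]
```

For example, a tracker last heard from at time 0 is reported as expired when checked at time 700, while one heard from at time 500 is not.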
Reducers combine the output of all the mappers to produce the final result. The result's block locations are recorded in the NameNode's metadata, so the client can request them through the NameNode and then read the blocks directly.
- The number of mappers equals the number of input splits.
- The number of output files equals the number of reducers.
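The whole map-and-reduce flow above can be sketched as a single-process word count in Python. Hadoop's real API is Java and runs across many nodes; the function names here are invented for the illustration.

```python
from collections import defaultdict

def mapper(offset, line):
    """Map phase: emit a (word, 1) pair for every word in the record."""
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: combine all the values collected for one key."""
    return (word, sum(counts))

def run_job(splits):
    # One mapper per input split, as stated above.
    map_output = []
    for split in splits:
        for offset, line in split:
            map_output.extend(mapper(offset, line))
    # Shuffle: group the mappers' output by key before reducing.
    grouped = defaultdict(list)
    for key, value in map_output:
        grouped[key].append(value)
    # Reducers combine the grouped values into the final output.
    return dict(reducer(k, v) for k, v in grouped.items())
```

Running `run_job` on one split containing the records `(0, "hi how are you")` and `(15, "hi there")` counts "hi" twice and every other word once.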
Map Reduce Flow Chart in Hadoop
Mappers and reducers work only with key-value pairs.
The text lines in each input split must therefore be converted to key-value pairs before being passed to the mappers. The mappers then produce new key-value pairs, which become the input to the reducers. RecordReader is a predefined interface that converts a text input file into key-value pairs; there is one RecordReader per input split, and hence one per mapper.
In Hadoop, text lines are called records.
The RecordReader reads one record from the input split at a time and converts it into a key-value pair according to the file's input format. The four input formats are:
- Text Input Format
- Key Value Text Input Format
- Sequence File Input Format
- Sequence File As Text Input Format
If no file format is specified, the RecordReader defaults to Text Input Format. In that case each line is converted into the key-value pair (byteOffset, entire line), which is passed to the mapper. The byte offset is the position of the line within the file. Ex: (0, hi how are you).
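The (byteOffset, entire line) conversion can be sketched as below; this is an illustrative stand-in for Text Input Format's RecordReader, assuming `\n` line endings, not Hadoop's actual Java code.

```python
def text_input_records(data):
    """Yield (byteOffset, line) pairs, like Text Input Format's RecordReader."""
    offset = 0
    for line in data.splitlines(keepends=True):
        # The key is the byte position of the line's first character in the file.
        yield (offset, line.rstrip("\n"))
        offset += len(line.encode("utf-8"))
```

For the file contents `"hi how are you\nfine\n"`, the first record is `(0, "hi how are you")` and the second is `(15, "fine")`, since the first line plus its newline occupies 15 bytes.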
The mapper takes the key-value pair and processes it according to the logic written in the mapper code. The RecordReader then goes back to the input split to read the next record, and the process repeats for every record: the mapper runs once for each key-value pair the RecordReader produces.
This happens for all the input splits at once, and that parallel processing is what lets Hadoop handle large data sets quickly; it is the main concept behind Hadoop.
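The once-per-record mapper invocation and the split-level parallelism can be sketched together; this is a thread-based single-machine illustration with invented helper names, whereas Hadoop actually distributes splits across TaskTrackers.

```python
from concurrent.futures import ThreadPoolExecutor

def run_mapper_on_split(split, map_fn):
    """The mapper runs once per (key, value) pair handed over by the RecordReader."""
    output = []
    for key, value in split:
        output.extend(map_fn(key, value))
    return output

def run_all_splits(splits, map_fn):
    # Each split is processed independently, so the splits can run in parallel.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: run_mapper_on_split(s, map_fn), splits)
    return [pair for split_output in results for pair in split_output]
```

Each thread plays the role of one mapper; the combined output is the concatenation of every mapper's emitted pairs.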
Hadoop introduced box classes: the object (wrapper) types used for (key, value) pairs correspond to the Java primitives as below:
|Primitive Data Types|Box classes|
|---|---|
|int|IntWritable|
|long|LongWritable|
|float|FloatWritable|
|double|DoubleWritable|
|String|Text|
|boolean|BooleanWritable|
Conversions between the two are made manually:
- To convert int to IntWritable – new IntWritable(int)
- To convert IntWritable to int – the get() method
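Hadoop's box classes are Java types; as a rough Python analogy of boxing with `new IntWritable(int)` and unboxing with `get()`, one could write:

```python
class IntWritable:
    """Python analogy of Hadoop's Java IntWritable box class (illustrative only)."""

    def __init__(self, value):
        # Boxing: wrap a plain int, as new IntWritable(int) does in Java.
        self._value = int(value)

    def get(self):
        # Unboxing: return the plain int, as get() does in Java.
        return self._value
```

So `IntWritable(7).get()` returns the primitive value 7 again.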
Mappers take key-value pairs typed with these box classes.
- If the file format is Text Input Format, the RecordReader reads each line and converts it to a (byteOffset, entire line) pair.
- If the file is in Key Value Text Input Format (or any other non-default format), we must specify the format explicitly in the driver code, so that the RecordReader understands the file format and can convert each line into the appropriate key-value pair.
- If the file is in Key Value Text Input Format, the RecordReader reads each line up to the first tab character and takes that as the key; everything after the tab becomes the value. The pair is converted to (Text, Text). Keys are not required to be unique, and duplicate keys are not merged.
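The tab-splitting rule above can be sketched in Python; this is an illustrative stand-in for Key Value Text Input Format's per-record conversion, not Hadoop's Java code.

```python
def key_value_text_record(line, separator="\t"):
    """Split one record on the FIRST separator: key before it, value after it."""
    key, sep, value = line.partition(separator)
    # If the line contains no separator, the whole line becomes the key
    # and the value is empty.
    return (key, value)
```

For example, the record `"apple\tred fruit"` becomes the pair `("apple", "red fruit")`, and any further tabs in the line stay inside the value.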
The same mapper code then runs once per input key-value pair, in parallel, and the mappers emit their output as key-value pairs; for example, the output might be of type (Text, IntWritable).