Hadoop MapReduce:
Map Phase: In this phase the Mappers accept their tasks and process their portion of the input locally on each node (the computation is divided across the nodes).
The result is a set of key-value pairs. This is called the intermediate output and is stored on the node's local disk.
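As a sketch, assuming the standard Hadoop Java API, a word-count Mapper (the class name TokenMapper is illustrative) emits a <word, 1> pair for every word in its input split, matching the word-count walkthrough further below:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit <word, 1> for each one.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate output, spilled to local disk
            }
        }
    }
}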
Sort and Shuffle:
The key-value pairs emitted by every Mapper are collected, the values are grouped together by key, and the result is stored on local disk. Once sorting and shuffling by key is complete, the grouped values are sent to the Reducers.
Reduce: The grouped output from sort and shuffle is now reduced (aggregated) and the result is stored in HDFS. This is the final output.
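A matching Reducer sketch (SumReducer is an illustrative name): it sums all the counts received for each key after the sort and shuffle, and the totals it writes become the final output in HDFS:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum every count received for this word and write <word, total>.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total); // written to HDFS as the final output
    }
}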
Key-Value Pair: This is the output of the Mapper, which is handed over for sorting and merging.
Combiner: It is often called a mini reducer. It is generally used to pre-aggregate each Mapper's output before the shuffle, for example when searching a data set for a maximum (such as the highest salary in an employee table).
It finds the highest value within each Mapper's output from the Map stage.
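As a sketch of the highest-salary idea (the class and field names are illustrative): because taking a maximum is associative, the same class can serve both as the Combiner (local maximum per Mapper) and as the final Reducer, and is registered in the driver with job.setCombinerClass(...):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Works both as a Combiner (per-Mapper local maximum) and as the final
// Reducer, because the "max of maxes" equals the overall maximum.
public class MaxSalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable max = new IntWritable();

    @Override
    protected void reduce(Text department, Iterable<IntWritable> salaries, Context context)
            throws IOException, InterruptedException {
        int best = Integer.MIN_VALUE;
        for (IntWritable salary : salaries) {
            best = Math.max(best, salary.get());
        }
        max.set(best);
        context.write(department, max);
    }
}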
Sample Text (input records arrive at the Mapper as <key, line> pairs):
<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>
Map:
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>
Combiner:
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
Partitioner: The Partitioner routes each intermediate key-value pair from the Map stage to the appropriate Reducer based on its key.
Number of partitions = number of Reducers
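A sketch of a custom Partitioner (WordPartitioner is an illustrative name) that routes each key to a reducer by hashing it modulo the number of reduce tasks, which is essentially what Hadoop's default HashPartitioner does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // The same key always lands on the same reducer; partitions = reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}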
Reducer:
<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
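Putting it together, a driver sketch (WordCountDriver and the input/output path arguments are assumptions) that wires the Mapper, Combiner, Partitioner and Reducer sketched above into one word-count job; setNumReduceTasks fixes the number of partitions, since the number of partitions equals the number of Reducers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);          // Map phase
        job.setCombinerClass(SumReducer.class);         // mini reducer on each Mapper's output
        job.setPartitionerClass(WordPartitioner.class); // key -> reducer routing
        job.setReducerClass(SumReducer.class);          // final aggregation, written to HDFS
        job.setNumReduceTasks(2);                       // number of partitions = number of reducers

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}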