Advent of Hadoop
To understand Hadoop better, let's start with an example. Say we have a site for selling mobile phones called MobileZone, and our technical system looks something like the one below.



Now suppose we need to find out how many iPhones we should order to sell. We need data from both the e-commerce and the inventory systems. We load the data coming from e-commerce and inventory into a traditional data warehouse, do the reporting using tools like Tableau, and decide based on the report. This system works well as long as the volume of data generated is relatively small.
Now let's change the question: how many customers bought the iPhone, loved it, but hated the delivery? These customers posted their reviews on Twitter and other sites. Data generated from such sources does not fit into our traditional database system.

This data, which comes from different sources and in different formats, is called Big Data. We need something new to store and process this Big Data. That's when Hadoop came into the picture.
Hadoop
In a crude way, think of Hadoop as a very big data warehouse which takes data from any source and in any format. It hosts a master and many worker nodes, and it gives us 2 services, i.e. storage and processing.
After Hadoop processes the data, the processed data can be loaded into analytics and reporting tools, and thereby we can predict or decide on future sales.

Hadoop is a framework for the distributed processing of large data sets across clusters of computers, using simple programming models for data processing.
Architecture
Hadoop Processing:
Let us say we have a site and we need to create a dashboard which shows how many people liked or viewed the site. Our first task is to set up the cluster, so the Hadoop admin sets up a cluster with one master node, also called the NameNode, and 4 DataNodes. We will see more about the NameNode and DataNodes further on. Once the cluster is set up, the data is ingested into Hadoop.

For example, suppose we ingest a file facebook.json (640 MB) into Hadoop. It is broken into blocks of 128 MB each, and each 128 MB block is replicated 3 times to provide fault tolerance; this is a big part of what makes Hadoop so reliable. So we have 5 blocks, each stored 3 times, for a total of 15 blocks.
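
Below is a minimal sketch, assuming the HDFS Java client API, of what this ingestion looks like from the client side; it also reproduces the block arithmetic (ceil(640 MB / 128 MB) = 5 blocks, times 3 replicas = 15 stored copies). The local and HDFS paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IngestFacebookJson {
        public static void main(String[] args) throws Exception {
            // Assumes core-site.xml / hdfs-site.xml on the classpath point at the cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy the local file into HDFS; it is split into blocks as it is written.
            Path local = new Path("facebook.json");         // hypothetical local path
            Path target = new Path("/data/facebook.json");  // hypothetical HDFS path
            fs.copyFromLocalFile(local, target);

            // Block arithmetic from the example: 640 MB / 128 MB = 5 blocks,
            // and 5 blocks x 3 replicas = 15 stored copies in the cluster.
            FileStatus status = fs.getFileStatus(target);
            long blocks = (status.getLen() + status.getBlockSize() - 1) / status.getBlockSize();
            long copies = blocks * status.getReplication();
            System.out.println("blocks: " + blocks + ", stored copies: " + copies);
        }
    }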



To process facebook.json, the data about the data (the metadata describing which blocks are loaded into which DataNodes) is stored in the NameNode, while the actual data is stored in the respective DataNodes.
As shown below, the blocks are divided among the DataNodes accordingly.



The NameNode hosts a service called the JobTracker, and the DataNodes host a service called the TaskTracker. Once the data is loaded, the JobTracker reads the metadata stored in the NameNode and assigns the respective tasks to the respective TaskTrackers, which perform their jobs locally on the data they host. Once the data is processed, the results can be collected for reporting and analytics.


NameNode :

       The NameNode is the master node in an HDFS file system. It maintains the file system namespace and keeps track of the metadata for the blocks stored across the DataNodes. It stores only the metadata of the data, not the data itself. Client applications talk to the NameNode whenever they need to locate a file, or when they want to add/copy/move/delete a file, and in response the NameNode returns a list of the relevant DataNode servers where the data lives.
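
To make this client/NameNode conversation concrete, here is a minimal sketch, assuming the HDFS Java API and the hypothetical /data/facebook.json path from the earlier example, that asks the NameNode which DataNodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocateBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The NameNode answers this metadata query; no block data is read here.
            FileStatus status = fs.getFileStatus(new Path("/data/facebook.json"));
            BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : locations) {
                // Each entry lists the DataNodes that hold a replica of that block.
                System.out.println("offset " + block.getOffset()
                        + " -> hosts: " + String.join(", ", block.getHosts()));
            }
        }
    }

The actual reads and writes then go directly to those DataNodes; the NameNode only serves the metadata.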

Secondary NameNode :

 The NameNode is a single point of failure for the HDFS cluster, as the metadata is stored only on the NameNode. This means HDFS is not a highly available system: when the NameNode goes down, the file system goes down. We can host an optional SecondaryNameNode on a separate machine, but it only creates checkpoints of the namespace by merging the edits file into the fsimage file, and hence does not provide any real redundancy. Hadoop 0.21+ adds a BackupNameNode that the user can configure to make the cluster more highly available.

DataNodes :

  An HDFS cluster can have many DataNodes. DataNodes store the blocks of data, and blocks from different files can be stored on the same DataNode. Each DataNode periodically marks its presence by sending a heartbeat message, essentially saying "I am alive." This helps the NameNode keep track of the DataNodes and maintain the metadata accordingly.

JobTracker Service :
  
  The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster according to the client's job, ideally the nodes that hold the data, or at least nodes in the same rack.

 TaskTracker :

  A TaskTracker is a service that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. To schedule a task, the JobTracker first looks for an empty slot on the same server that hosts the DataNode containing the data; if none is available, it looks for an empty slot on a machine in the same rack. The TaskTracker starts a separate JVM process for each task, so that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these processes, capturing the output and exit codes. Once a process completes, whether successfully or not, the tracker notifies the JobTracker accordingly. The TaskTrackers also send heartbeat messages to the JobTracker to confirm that they are still alive and active, so that the JobTracker can keep its record of empty slots up to date.

MapReduce Execution Process :

  •  The job (for example, a WordCount program like the one sketched after this list) is submitted to the JobTracker.
  •  The JobTracker connects to the NameNode to find the location of the data.
  •  The JobTracker locates TaskTracker nodes with available slots at or near the data.
  •  The JobTracker assigns the tasks to the identified TaskTracker nodes.
  •  The TaskTracker nodes are monitored through their heartbeat signals; if a node does not send heartbeats often enough, it is deemed to have failed and its work is scheduled on a different TaskTracker.
  •  The JobTracker is notified if a task fails. It may then resubmit the job, skip that specific record from processing, or blacklist the TaskTracker as unreliable.
  •  The status is updated once the TaskTracker completes the task.
  •  Client applications can poll the JobTracker for information on the job's progress.
  •  The JobTracker is a single point of failure for the MapReduce service; all jobs halt if it goes down.
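
To see how these steps play out for a real job, here is the classic WordCount program as a minimal sketch using the org.apache.hadoop.mapreduce API. The input and output paths are passed as command-line arguments, and submitting it hands the job to the JobTracker, which schedules map and reduce tasks on the TaskTrackers as described above.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs on the TaskTrackers, close to the data blocks.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);   // emit (word, 1) for every token
                }
            }
        }

        // Reduce phase: sums the counts for each word after the shuffle.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion submits the job and polls the JobTracker until it finishes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A typical invocation (with hypothetical jar name and paths) would be: hadoop jar wordcount.jar WordCount /data/input /data/output, after which the per-word counts are written to the output directory in HDFS.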