From my previous blog, we learnt how to set up a Hadoop Single Node Installation. Now, I will show how to set up a Hadoop Multi Node Cluster. A multi-node cluster in Hadoop contains two or more DataNodes in a distributed Hadoop environment. This is what organizations use in practice to store and analyse petabytes and exabytes of data.

Here in this blog, we will use three machines to set up the multi-node cluster: a master (MN) and two data nodes (DN1 and DN2).

  • Master node (MN) will run the NameNode and ResourceManager daemons.
  • Data nodes (DN1 and DN2) will store the actual data and provide processing power to run the jobs. Both hosts will run the DataNode and NodeManager daemons.

Software Required:

  • RHEL 7 – Set up MN and DN1/DN2 with the RHEL 7 operating system – Minimal Install.
  • Hadoop-2.7.3
  • Java 7
  • SSH

Configure the System

First of all, we have to edit the hosts file in the /etc/ folder on the master node (MN), specifying the IP address of each system followed by its host name.
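A sketch of the resulting file (the IP addresses below are placeholders; substitute the real addresses of your machines, and replicate the same entries on DN1 and DN2):

```
# /etc/hosts
192.168.1.10   MN
192.168.1.11   DN1
192.168.1.12   DN2
```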

Disable the firewall restrictions.
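On RHEL 7 this typically means stopping firewalld, for example:

```shell
# Stop the firewall now and keep it disabled across reboots (run as root)
systemctl stop firewalld
systemctl disable firewalld
```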

Now set up OS group and user for Hadoop software.
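A minimal sketch, assuming a group and user both named hadoop (run as root on each node):

```shell
# Create a dedicated OS group and user for the Hadoop software
groupadd hadoop
useradd -g hadoop hadoop
passwd hadoop        # set a password for the hadoop user
```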

Add the directories to keep the hdfs files.
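For example (the paths below are an assumption; adjust them to your own layout and keep them consistent with hdfs-site.xml later):

```shell
# Directories for NameNode metadata and DataNode blocks
mkdir -p /home/hadoop/hdfs/namenode
mkdir -p /home/hadoop/hdfs/datanode
chown -R hadoop:hadoop /home/hadoop/hdfs
```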

Download and Unpack Hadoop/Java Binaries.

Download and extract the Java tar file on the master node. Similarly, download the Hadoop 2.7.3 package on the master node (MN) and extract the Hadoop tar file.
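A sketch of the Hadoop download and unpack steps (the JDK tarball name below is an assumption; the Oracle JDK must be downloaded separately after accepting the licence):

```shell
cd /home/hadoop
# Fetch and unpack Hadoop 2.7.3 from the Apache archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
mv hadoop-2.7.3 hadoop
# Unpack the previously downloaded JDK tarball (file name is an assumption)
tar -xzf jdk-7u80-linux-x64.tar.gz
```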

Set Environment Variables:

Add the Hadoop and Java binaries to your PATH. Edit /home/hadoop/.bash_profile on the master node and add the following lines. Then save the file and close it.
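A sketch of the lines to append (the JAVA_HOME path is an assumption; point it at your extracted JDK directory):

```shell
# Appended to /home/hadoop/.bash_profile on MN
export JAVA_HOME=/home/hadoop/jdk1.7.0_80
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```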

To apply these changes to the current terminal session, execute the source command.

To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the terminal, execute the java -version and hadoop version commands.
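For example, as the hadoop user on MN:

```shell
# Reload the profile, then confirm both tools are on the PATH
source /home/hadoop/.bash_profile
java -version        # should print the Java 7 version string
hadoop version       # should print Hadoop 2.7.3
```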

Now clone the master node (MN) to the data nodes DN1 and DN2.

Distribute Authentication Key-pairs for the Hadoop User:

Log in to MN as the hadoop user and generate an SSH key pair. Append the generated public key to the master node's own authorized_keys file.

Copy the master node's public key to the authorized_keys files on DN1 and DN2.
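The steps above can be sketched as follows (run on MN as the hadoop user):

```shell
# Generate a key pair; accept the defaults and leave the passphrase empty
ssh-keygen -t rsa
# Distribute the public key: the master trusts itself and both data nodes
ssh-copy-id hadoop@MN
ssh-copy-id hadoop@DN1
ssh-copy-id hadoop@DN2
# Quick check: this should log in without prompting for a password
ssh DN1 hostname
```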

Now test passwordless login through SSH from MN to each data node.

Configure Hadoop:

Now edit the configuration files in the hadoop/etc/hadoop directory on the master node. Set the NameNode location in core-site.xml.
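A minimal core-site.xml sketch, assuming the host name MN from /etc/hosts and the conventional NameNode RPC port 9000:

```xml
<!-- hadoop/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://MN:9000</value>
  </property>
</configuration>
```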

Set path for HDFS:

Edit hdfs-site.xml on the master node to set the NameNode and DataNode file locations.

The last property, dfs.replication, indicates how many times data is replicated in the cluster. You can set it to 2 to have all the data duplicated on both data nodes. Don't enter a value higher than the actual number of data nodes.
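A sketch of hdfs-site.xml, assuming the HDFS directories created earlier (adjust the paths if yours differ):

```xml
<!-- hadoop/etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```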

Set YARN as Job Schedule:

Copy mapred-site.xml.template to mapred-site.xml in the configuration folder, then edit mapred-site.xml on the master node. Set YARN as the default framework for MapReduce operations.
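For example, after running `cp mapred-site.xml.template mapred-site.xml`, a minimal file looks like:

```xml
<!-- hadoop/etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```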

Configure YARN:

Edit yarn-site.xml on master node.
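A minimal sketch, pointing the NodeManagers at the ResourceManager on MN and enabling the shuffle service that MapReduce needs:

```xml
<!-- hadoop/etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>MN</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```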

Configure Slaves

The slaves file is used by the startup scripts to start the required daemons on all nodes. Edit ~/hadoop/etc/hadoop/slaves to list the data nodes.
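With our host names, the file contains one data node per line:

```
DN1
DN2
```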

Format HDFS:

Format the NameNode (on the master machine only).
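For example, as the hadoop user on MN (note that reformatting an existing cluster wipes all HDFS metadata):

```shell
hdfs namenode -format
```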

Run and Monitor HDFS:

Now start the Hadoop services by executing the following commands. This will start the NameNode and SecondaryNameNode on MN, and a DataNode on DN1 and DN2, according to the configuration in the slaves config file.
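The startup commands, run on MN as the hadoop user (these scripts live in $HADOOP_HOME/sbin, which we added to the PATH earlier):

```shell
start-dfs.sh     # NameNode + SecondaryNameNode on MN, DataNodes on DN1/DN2
start-yarn.sh    # ResourceManager on MN, NodeManagers on DN1/DN2
```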

In addition to the previous HDFS daemons, you should see a ResourceManager on MN, and a NodeManager on DN1 and DN2.

Check all the daemons running on both master and slave machines.
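The jps tool (shipped with the JDK) lists the running Java processes on each machine:

```shell
# On MN, expect NameNode, SecondaryNameNode and ResourceManager;
# on DN1/DN2, expect DataNode and NodeManager
jps
```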

Web Interface:

At last, open a browser on your master machine and go to master:50070/dfshealth.html; this will give you the NameNode interface.

To view the Hadoop cluster and all applications, open the ResourceManager web interface in your browser.

There you will also see information about the NodeManagers.

Secondary NameNode information is available via the following link.
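Assuming the default Hadoop 2.x web ports (an assumption; check your configuration if these do not respond), the interfaces are:

```
http://MN:50070/dfshealth.html   # NameNode
http://MN:8088/cluster           # ResourceManager (cluster and applications)
http://MN:50090/status.html      # Secondary NameNode
```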

Stop the Services:

Services can be stopped in the following order.
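Run on MN as the hadoop user, in the reverse order of startup:

```shell
stop-yarn.sh    # stop ResourceManager and NodeManagers
stop-dfs.sh     # stop NameNode, SecondaryNameNode and DataNodes
```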

I hope you have successfully installed a Hadoop Multi Node Cluster. If you face any problem, you can comment below and we will reply shortly.


