Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open source project in the big data field and provides high-throughput access to application data.

The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

Environment: This blog has been tested with the following software versions.

  • RHEL (Red Hat Enterprise Linux 7.4) on VirtualBox 5.2
  • Hadoop 2.7.3

Update the /etc/hosts file with the hostname and IP address of the machine:

[root@cdhs ~]# cat /etc/hosts
10.0.2.5 cdhs

Dedicated Hadoop system user:

After the VM is set up, add a dedicated non-sudo user for Hadoop; this user will be used to configure and run Hadoop. The following commands add the group hadoop and the user hduser to the VM.
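
For example, run these as root (hduser and hadoop are the names used in the rest of this blog; pick any names you prefer):

[root@cdhs ~]# groupadd hadoop
[root@cdhs ~]# useradd -g hadoop hduser
[root@cdhs ~]# passwd hduser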

Apache Software:

Download the Apache Hadoop software from the official site.
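
For example, as hduser, something like the following fetches and unpacks the 2.7.3 release (the download URL below is the Apache archive mirror; check the official download page for a current link):

[hduser@cdhs ~]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
[hduser@cdhs ~]$ tar -xzf hadoop-2.7.3.tar.gz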

 Install Java:

Hadoop is written in Java, hence before installing Apache Hadoop we need to install Java. First download the JDK from the Oracle website, then install it on the system.
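
For example, assuming you downloaded an RPM build of the JDK (the file name depends on the exact version you picked):

[root@cdhs ~]# rpm -ivh jdk-<version>-linux-x64.rpm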

Once Java is installed on your system, you can check the version of Java using the following command.
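
[hduser@cdhs ~]$ java -version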

Now edit the .bash_profile file of hduser using your favorite editor and add the Hadoop and Java home directories.
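
For example, add lines like these (the paths are only illustrative: they assume the JDK was installed under /usr/java/default and Hadoop was unpacked in hduser's home directory; adjust them to your layout):

export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/home/hduser/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin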

After unpacking the downloaded Hadoop distribution, edit the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh to define some parameters as follows. It is important that we set the Java path here, otherwise Hadoop will not be able to find Java.
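
In hadoop-env.sh, set JAVA_HOME explicitly (again assuming the JDK lives under /usr/java/default):

export JAVA_HOME=/usr/java/default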

Once done, you can check whether the environment variables are set correctly. Run the following commands.
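
[hduser@cdhs ~]$ source ~/.bash_profile
[hduser@cdhs ~]$ echo $HADOOP_HOME
[hduser@cdhs ~]$ hadoop version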

Configuring Hadoop:

Hadoop has many configuration files, which are located in the $HADOOP_HOME/etc/hadoop directory. Since we are installing Hadoop on a single node in pseudo-distributed mode, we only need to edit a few of these files for Hadoop to work. You can view the full list of configuration files using the following command.
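
[hduser@cdhs ~]$ ls $HADOOP_HOME/etc/hadoop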

The first file is core-site.xml, which contains the host name and port number used by HDFS.
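
Place the following between the <configuration> tags (cdhs is the hostname from /etc/hosts above; port 9000 is just a common choice for a test setup):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cdhs:9000</value>
  </property>
</configuration>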

Next, edit hdfs-site.xml, which holds the configuration for the NameNode and DataNode.
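
A minimal hdfs-site.xml for a single node looks like this (replication is 1 because there is only one DataNode; the two directory paths are simply the locations used in this blog and are created in a later step):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hduser/hadoopdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hduser/hadoopdata/datanode</value>
  </property>
</configuration>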

Then edit yarn-site.xml, which holds the configuration for the ResourceManager and NodeManager.
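
Add the following between the <configuration> tags to enable the MapReduce shuffle service on the NodeManager:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>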

Now copy mapred-site.xml.template to mapred-site.xml using the following command, and edit mapred-site.xml to configure MapReduce applications and the JobHistory Server.
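
[hduser@cdhs ~]$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

Then add the following between the <configuration> tags so that MapReduce jobs run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>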

Now we will need to create two directories to store the NameNode and DataNode data, using the following commands.
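
These paths match the values used in hdfs-site.xml above; use whatever locations you prefer, as long as the two agree:

[hduser@cdhs ~]$ mkdir -p /home/hduser/hadoopdata/namenode
[hduser@cdhs ~]$ mkdir -p /home/hduser/hadoopdata/datanode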

Now you will need to configure SSH keys for the new user so that the Hadoop scripts can securely log in to the node without a password.
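
For example, as hduser:

[hduser@cdhs ~]$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
[hduser@cdhs ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hduser@cdhs ~]$ chmod 600 ~/.ssh/authorized_keys
[hduser@cdhs ~]$ ssh cdhs

The last command should log you in without asking for a password (you may have to accept the host key the first time).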

We have now configured Hadoop to work as a single-node cluster. We can now initialize the HDFS file system by formatting the NameNode using the following command.
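
[hduser@cdhs ~]$ hdfs namenode -format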

Now we can start the Hadoop cluster. Navigate to the $HADOOP_HOME/sbin directory using the following command.
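
[hduser@cdhs ~]$ cd $HADOOP_HOME/sbin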

You can start the HDFS services (NameNode, DataNode and Secondary NameNode) by executing the following command.
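
[hduser@cdhs sbin]$ ./start-dfs.sh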

Now start YARN using the following command.
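
[hduser@cdhs sbin]$ ./start-yarn.sh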

You can check the status of the services using the following command.
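
[hduser@cdhs sbin]$ jps

In pseudo-distributed mode you should see the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager processes listed.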

This shows that Hadoop is successfully running on the server.

You can now browse the Apache Hadoop web interfaces through your browser. By default, the Apache Hadoop NameNode web UI is available on port 50070. Go to the following address using your favorite browser.

http://10.0.2.5:50070

To view the Hadoop cluster and all applications (the ResourceManager web UI), open the following address in your browser.

http://10.0.2.5:8088

Information about the NodeManager is available at the following address.

http://10.0.2.5:8042

Secondary NameNode information is available via the following link.

http://10.0.2.5:50090/

If you have problems opening any of the URLs above, disable the firewall (iptables/firewalld) service or open the required ports.
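
On RHEL 7 the firewall is usually managed by firewalld; a quick way to turn it off for testing (assuming firewalld is in use; for production, open the individual ports instead) is:

[root@cdhs ~]# systemctl stop firewalld
[root@cdhs ~]# systemctl disable firewalld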

Conclusion:

In this tutorial we have learnt how to install Apache Hadoop on a single node in pseudo-distributed mode.

 
