
Install Cloudera Manager from a Local Repository

This section explains how to set up a local yum repository to install CDH on the machines in your cluster. There are a number of reasons you might want to do this, for example:

  • Servers in your cluster don’t have access to the internet. You can still use yum to install packages on those machines by creating a local yum repository.
  • It ensures that every node installs the same version of the software.
  • A local repository is more efficient, since packages are downloaded from the internet only once and then served over the local network.

We still need an internet connection on one machine to download the repository packages initially.

Set up Local Repo:

Create a local web publishing directory, install a web server such as Apache httpd on the machine that will host the RPMs, and start the HTTP server.
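
A minimal sketch of those steps (package names and the cloudera-manager repo ID are assumptions; adjust paths for your environment):

# On the machine that will host the repository (needs internet access once)
yum install -y httpd createrepo yum-utils
systemctl start httpd && systemctl enable httpd

# Create the web publishing directory and mirror the Cloudera Manager RPMs into it
# (assumes the cloudera-manager .repo file already exists in /etc/yum.repos.d/)
mkdir -p /var/www/html/cm
reposync --repoid=cloudera-manager --download_path=/var/www/html/cm

# Generate the repository metadata
createrepo /var/www/html/cm/cloudera-manager

# On every cluster node, create /etc/yum.repos.d/cm-local.repo pointing at this server:
#   [cm-local]
#   name=Cloudera Manager Local Repo
#   baseurl=http://<repo-host>/cm/cloudera-manager/
#   enabled=1
#   gpgcheck=0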

Hadoop Cluster via Cloudera Manager

I have written a couple of blogs on setting up Hadoop in single-node and multi-node environments, and deploying, configuring, and running a Hadoop cluster manually is rather time-consuming and costly. Here’s a helping hand to create a fully distributed Hadoop cluster with Cloudera Manager. In this blog, we’ll see how fast and easy it is to install a Hadoop cluster with Cloudera Manager.

Software used:

  • CDH5
  • Cloudera Manager – 5.7
  • OS – RHEL 7
  • VirtualBox – 5.2

Prepare Servers:

For a minimal, non-production cluster, we need three servers.

  • CM – Cloudera Manager + other Hadoop services (minimum 8 GB RAM)
  • DN1/DN2 – DataNodes

Please perform the following steps on the Cloudera Manager (CM) machine.

Disable Selinux:
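
A quick sketch of the usual commands (run as root):

# Disable SELinux for the current session
setenforce 0

# Persist across reboots: set SELINUX=disabled in /etc/selinux/config
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config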

Setup NTP:
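
For example, with the classic ntpd (RHEL 7 ships chronyd by default, so either works; pool.ntp.org is just an example server):

yum install -y ntp
ntpdate -u pool.ntp.org            # one-off clock sync before starting the daemon
systemctl start ntpd && systemctl enable ntpd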


Commissioning/Decommissioning – Datanode in Hadoop

Commissioning a node means adding a new DataNode to the cluster, and decommissioning means removing a node from the cluster. You can’t directly add or remove a DataNode in a large, real-time cluster, as it can cause a lot of disturbance. So if you want to scale your cluster, you need commissioning, and the steps are below.

Commission:

Pre-requirements:

  • Clone an existing node.
  • Change the IP address and hostname – 192.168.1.155 and DN3.
  • Update the hosts file on all nodes – add the entry “192.168.1.155 DN3” to /etc/hosts.
  • Set up passwordless SSH to the new node (see the sketch after this list).
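
A minimal passwordless-SSH sketch, assuming the Hadoop user is hduser (use your own account):

# On the NameNode, generate a key once and copy it to the new node
ssh-keygen -t rsa -P ""            # accept the default key location
ssh-copy-id hduser@DN3
ssh hduser@DN3 hostname            # should print DN3 without prompting for a password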

Configuration changes:

We need to update the include file on both the ResourceManager and the NameNode. If it is not present, create an include file on both nodes. The include file lists the hosts that are allowed to connect to the cluster.

Go to your NameNode and reference the include file in the hdfs-site.xml file.
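
For example, assuming the include file lives at /etc/hadoop/conf/include (the path is an assumption), the property looks like this:

<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>

The include file itself simply lists the hosts allowed to connect to the NameNode, one per line.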

Also update the slaves file on the NameNode and add the new DataNode’s IP address.
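
A sketch of those updates, assuming a typical install layout (adjust paths for your environment):

# On the NameNode: add the new node to the slaves file and the include file
echo "192.168.1.155" >> /opt/hadoop/etc/hadoop/slaves
echo "192.168.1.155" >> /etc/hadoop/conf/include

# Ask the NameNode to re-read the include file
hdfs dfsadmin -refreshNodes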

Edit the yarn-site.xml file on the node where the ResourceManager is running.
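
There, point the ResourceManager at the same include file and refresh it; a sketch, reusing the path assumed above:

<property>
  <name>yarn.resourcemanager.nodes.include-path</name>
  <value>/etc/hadoop/conf/include</value>
</property>

Then run:

yarn rmadmin -refreshNodes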

High Availability Set up – HDFS/YARN using Quorum

In this blog, I am going to talk about how to configure and manage a high-availability HDFS (CDH 5.12.0) cluster. In earlier releases, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine. The Secondary NameNode did not provide failover capability.

The HA architecture solved this problem of NameNode availability by allowing us to have two NameNodes in an active/passive configuration. So, we have two running NameNodes at the same time in a High Availability cluster:

  • Active NameNode
  • Standby/Passive NameNode.

We can implement the Active and Standby NameNode configuration in the following two ways:

  • Using Quorum Journal Nodes
  • Shared Storage using NFS

Using the Quorum Journal Manager (QJM) is the preferred method for achieving high availability for HDFS. Read here to know more about the QJM and NFS methods. In this blog, I’ll implement the HA configuration for quorum-based storage, and here are the machines and their corresponding names/roles.

  • NameNode machines – NN1/NN2 of equivalent hardware and spec
  • JournalNode machines – The JournalNode daemon is relatively lightweight, so these daemons can reasonably be co-located on machines with other Hadoop daemons, for example the NameNodes, the JobTracker, or the YARN ResourceManager. There must be at least three JournalNode daemons, since edit log modifications must be written to a majority of JournalNodes. So three JNs run on NN1, NN2, and the MGT server.
  • Note that when running with N JournalNodes, the system can tolerate at most (N – 1) / 2 failures and continue to function normally.
  • The ZKFailoverController (ZKFC) is a ZooKeeper client that also monitors and manages the NameNode status. Each NameNode also runs a ZKFC, which is responsible for periodically monitoring the health of its NameNode.
  • The ResourceManager runs on the same NameNode machines, NN1/NN2.
  • Two Data Nodes – DN1 and DN2
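
As a preview, the core quorum-based HA properties in hdfs-site.xml look roughly like this (the nameservice ID mycluster is an example; host names follow the roles above, and the ports are Hadoop defaults):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>NN1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>NN2:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://NN1:8485;NN2:8485;MGT:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>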


Set up Hadoop Cluster – Multi-Node

From my previous blog, we learnt how to set up a Hadoop single-node installation. Now, I will show how to set up a Hadoop multi-node cluster. A multi-node cluster in Hadoop contains two or more DataNodes in a distributed Hadoop environment. This is practically used in organizations to store and analyse petabytes and exabytes of data.

Here in this blog, we are taking three machines to set up the multi-node cluster – MN and DN1/DN2.

  • Master node (MN) will run the NameNode and ResourceManager daemons.
  • Data nodes (DN1 and DN2) will be our DataNodes that store the actual data and provide processing power to run the jobs. Both hosts will run the DataNode and NodeManager daemons.

Software Required:

  • RHEL 7 – Set up MN and DN1/DN2 with the RHEL 7 operating system – minimal install.
  • Hadoop 2.7.3
  • Java 7
  • SSH

Configure the System

First of all, we have to edit the hosts file in the /etc folder on the master node (MN), specifying the IP address of each system followed by its hostname.
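
For example, with our three machines (the IP addresses are examples; use your own):

[root@MN ~]# cat /etc/hosts
192.168.1.100 MN
192.168.1.101 DN1
192.168.1.102 DN2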

Disable the firewall restrictions.
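
On RHEL 7 that typically means stopping firewalld (acceptable for a non-production lab cluster):

systemctl stop firewalld
systemctl disable firewalld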

Install Apache Hadoop – Single Node RHEL 7

Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open-source project in the big data field and provides high-throughput access to application data.

The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

Environment: This blog has been tested with the following software versions.

  • RHEL (Red Hat Enterprise Linux 7.4) on VirtualBox 5.2
  • Hadoop 2.7.3
  • Update the /etc/hosts file with the hostname and IP address, for example:

[root@cdhs ~]# cat /etc/hosts
10.0.2.5 cdhs

Dedicated Hadoop system user:

After the VM is set up, please add a non-sudo user dedicated to Hadoop, which will be used to configure Hadoop. The following commands add the user hduser and the group hadoop to the VM (passwd will prompt for the new user’s password):
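
[root@cdhs ~]# groupadd hadoop
[root@cdhs ~]# useradd -g hadoop hduser
[root@cdhs ~]# passwd hduser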