HDFS Command line – Manage files and directories.

In my previous post, we configured Hadoop in single-node and cluster setups. Now let’s create files and directories on the Hadoop Distributed File System (HDFS). You can see the full list of commands here.

When I started using the HDFS commands, I got confused by three different command syntaxes. All three appear to be the same but have some differences, as explained below.

  • hadoop fs {args}

fs refers to a generic file system and can point to any supported file system, such as the local file system or HDFS. Use it when you are dealing with different file systems such as Local FS, (S)FTP, S3, and others.

  • hadoop dfs {args}

dfs is specific to HDFS and works only for HDFS operations. It has been deprecated; use hdfs dfs instead.

  • hdfs dfs {args}

Same as the second form: it works for all HDFS operations and is the recommended replacement for hadoop dfs.
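
As a minimal sketch of the recommended form, the Python snippet below builds a few common `hdfs dfs` invocations as argument lists and prints them. The helper name `hdfs_dfs` and the paths are hypothetical and used only for illustration; nothing is executed against a cluster.

```python
import shlex


def hdfs_dfs(*args):
    """Return the argument list for an `hdfs dfs` invocation."""
    return ["hdfs", "dfs", *args]


# Hypothetical paths, chosen only for this example.
commands = [
    hdfs_dfs("-mkdir", "-p", "/user/demo/input"),            # create a directory tree
    hdfs_dfs("-put", "localfile.txt", "/user/demo/input/"),  # upload a local file
    hdfs_dfs("-ls", "/user/demo/input"),                     # list the directory
]

for argv in commands:
    # Print the command line as it would be typed in a shell.
    print(shlex.join(argv))
```

To actually run one of these on a machine with a Hadoop client installed, the list could be passed to `subprocess.run(argv, check=True)`.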

Create/Restore a snapshot of an HDFS directory

In this tutorial, we focus on HDFS snapshots. Common use cases of HDFS snapshots include backups and protection against user errors.

Create a snapshot of an HDFS directory:

An HDFS directory must be enabled for snapshots before a snapshot of it can be created. The steps are:

  • From the Clusters tab, select the HDFS service.
  • Go to the File Browser tab and select the directory.


  • Verify the Snapshottable Path and click Enable Snapshots.
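
The steps above use the Cloudera Manager UI. For comparison, here is a rough sketch of the equivalent HDFS command lines, built as argument lists in Python. The function names, the directory `/data`, and the snapshot name `s1` are hypothetical; the sketch only constructs the commands and does not run them.

```python
def snapshot_commands(path, name):
    """Commands to make `path` snapshottable and take a snapshot called `name`."""
    return [
        ["hdfs", "dfsadmin", "-allowSnapshot", path],    # requires HDFS admin rights
        ["hdfs", "dfs", "-createSnapshot", path, name],
    ]


def restore_command(path, name, relative_file):
    """Command that copies one file out of a snapshot back into the live directory."""
    base = path.rstrip("/")
    src = f"{base}/.snapshot/{name}/{relative_file}"  # snapshots live under .snapshot
    dst = f"{base}/{relative_file}"
    return ["hdfs", "dfs", "-cp", src, dst]


for argv in snapshot_commands("/data", "s1"):
    print(" ".join(argv))
print(" ".join(restore_command("/data", "s1", "report.csv")))
```

Restoring here is just a copy from the read-only `.snapshot` path, which is why a snapshot protects against accidental deletes in the live directory.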


ClouderaManager – Installation on Google Cloud

In this post, I explain how to set up a Hadoop cluster on Google Cloud Platform.

Register on Google Cloud Platform

First of all, you have to register on Google Cloud. It’s easy: just sign in with your Gmail ID and fill in your credit card details. Once registered, you get a 1-year free subscription on Google Cloud with 300 USD of credit.

How to create Virtual Machines

  • Create a new project. Give your project a name or keep the one provided by Google.
  • Now click the icon in the top left corner of your homepage. A list of the products and services that Google Cloud provides will appear. Click Compute Engine and then VM instances. When the VM Instances page opens, select Create Instance.


Install ClouderaManager from Local Repository

This section explains how to set up a local yum repository to install CDH on the machines in your cluster. There are a number of reasons you might want to do this, for example:

  • The servers in your cluster don’t have internet access. You can still use YUM to install on those machines by creating a local YUM repository.
  • You want to make sure that every node has the same software version installed.
  • A local repository is more efficient, since packages are downloaded from the internet once rather than by every node.

We still need an internet connection to download the repository packages in the first place.

Set up Local Repo:

Create a local web publishing directory. Then install a web server such as Apache (httpd) on the machine that hosts the RPMs, and start the HTTP server.
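
Once the web server is publishing the RPMs, each node needs a `.repo` file pointing at it. The sketch below generates the text of such a file in Python. The repository ID, the host `repohost.example.com`, and the URL path are placeholders, not values from this setup, and `gpgcheck` is turned off only to keep the example short.

```python
def repo_file(repo_id, name, baseurl, gpgcheck=False):
    """Return the text of a yum .repo file with a single repository section."""
    lines = [
        f"[{repo_id}]",
        f"name={name}",
        f"baseurl={baseurl}",
        "enabled=1",
        f"gpgcheck={1 if gpgcheck else 0}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values; replace them with your own web server and directory.
text = repo_file(
    "cloudera-manager-local",
    "Cloudera Manager (local mirror)",
    "http://repohost.example.com/cm/5/",
)
print(text)
```

The resulting text would typically be saved under `/etc/yum.repos.d/` on every node, so that all of them install from the same local source.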

Hadoop Cluster via Cloudera Manager

I have written a couple of posts on setting up Hadoop in single-node and multi-node cluster environments. Deploying, configuring, and running a Hadoop cluster manually is rather time-consuming and costly. Here’s a helping hand for creating a fully distributed Hadoop cluster with Cloudera Manager. In this post, we’ll see how fast and easy it is to install a Hadoop cluster with Cloudera Manager.

Software used:

  • CDH5
  • Cloudera Manager – 5.7
  • OS – RHEL 7
  • VirtualBox – 5.2

Prepare Servers:

For a minimal non-production cluster, we need 3 servers.

  • CM – Cloudera Manager + other Hadoop services (minimum 8 GB)
  • DN1/DN2 – Data Nodes

Perform the following steps on the Cloudera Manager (CM) machine:

Disable SELinux:
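
One common way to do this is to set `SELINUX=disabled` in `/etc/selinux/config` and reboot. The Python sketch below only shows the text transformation on a sample of that file; the function name and sample content are illustrative assumptions, and it does not touch the real file.

```python
import re


def disable_selinux(config_text):
    """Return the config text with the SELINUX= setting changed to disabled."""
    # (?m) makes ^ and $ match per line; SELINUXTYPE= lines are left untouched.
    return re.sub(r"(?m)^SELINUX=\w+$", "SELINUX=disabled", config_text)


# Sample content resembling /etc/selinux/config (illustrative only).
sample = (
    "# SELINUX= can take one of these values: enforcing, permissive, disabled\n"
    "SELINUX=enforcing\n"
    "SELINUXTYPE=targeted\n"
)
print(disable_selinux(sample))
```

The change in the file takes effect after a reboot.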

Setup NTP:


Commissioning/Decommissioning – Datanode in Hadoop

Commissioning a node means adding a new DataNode to the cluster, and decommissioning means removing a node from it. You can’t simply add or remove a DataNode in a large, running cluster, as that can cause a lot of disruption. So to scale your cluster, you need to commission nodes properly; the steps are below.



  • Clone an existing node.
  • Change IP address and hostname – and DN3
  • Update Hosts files on all nodes – add this entry in /etc/hosts file “ DN3”
  • Set up passwordless SSH access to the new node.

Configuration changes:

We need to update the include file on both the ResourceManager and the NameNode. If it is not present, create an include file on both nodes.

On the NameNode, reference the include file in the hdfs-site.xml file.

Also update the slaves file on the NameNode and add the new DataNode’s IP address.
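
In stock Hadoop, the include file is referenced from hdfs-site.xml through the `dfs.hosts` property. As a hedged sketch, the Python snippet below builds that `<property>` element; the helper name and the file path `/etc/hadoop/conf/dfs.include` are assumptions made for illustration, so use whatever path your include file actually has.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Build a Hadoop-style <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


# Hypothetical location of the include file on the NameNode.
include_file = "/etc/hadoop/conf/dfs.include"
snippet = ET.tostring(hadoop_property("dfs.hosts", include_file), encoding="unicode")
print(snippet)
```

The printed element belongs inside the `<configuration>` root of hdfs-site.xml.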

Edit the “yarn-site.xml” file on the node where the ResourceManager is running.
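
On the YARN side, the corresponding setting in stock Hadoop is `yarn.resourcemanager.nodes.include-path`, and both daemons re-read their include files when asked to refresh their node lists. The sketch below formats the yarn-site.xml property and lists the two refresh commands. The include-file path is a placeholder and the function name is hypothetical; nothing is executed.

```python
YARN_INCLUDE_PROPERTY = "yarn.resourcemanager.nodes.include-path"


def yarn_include_xml(include_file):
    """Return the yarn-site.xml <property> block that points at the include file."""
    return (
        "<property>\n"
        f"  <name>{YARN_INCLUDE_PROPERTY}</name>\n"
        f"  <value>{include_file}</value>\n"
        "</property>"
    )


# Commands that make the NameNode and ResourceManager re-read their node lists.
REFRESH_COMMANDS = [
    ["hdfs", "dfsadmin", "-refreshNodes"],
    ["yarn", "rmadmin", "-refreshNodes"],
]

# Placeholder path; use the include file you created on the ResourceManager.
print(yarn_include_xml("/etc/hadoop/conf/yarn.include"))
for argv in REFRESH_COMMANDS:
    print(" ".join(argv))
```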