Configure HA – Hive Metastore and Load Balancing for HiveServer2

Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Configuring high availability for Hive requires the following components to be fault tolerant (a client-side view of the HiveServer2 load balancing is sketched after the list):

  • Hive MetaStore – RDBMS (MySQL)
  • ZooKeeper
  • Hive MetaStore Server
  • HiveServer2
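
With HA in place, clients do not connect to a single HiveServer2 host; they discover a live instance through ZooKeeper, which spreads connections across the registered servers. A minimal sketch of such a connection, assuming three ZooKeeper hosts named zk1–zk3 (the hostnames are examples):

    # Beeline discovers a live HiveServer2 instance via ZooKeeper,
    # so client connections are balanced across the registered servers.
    beeline -u "jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"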

Set up the MySQL database:

First of all, set up the Hive metastore as a MySQL database. Here are the steps:

Now log in to MySQL, create the Hive database and user, and grant the required privileges, as sketched below.
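
A minimal sketch, assuming the metastore database is named metastore and the user is hive; substitute your own names and a strong password:

    # Create the metastore database and hive user, then grant privileges.
    # The database name, user, and password here are examples.
    mysql -u root -p <<'SQL'
    CREATE DATABASE metastore;
    CREATE USER 'hive'@'%' IDENTIFIED BY 'hive_password';
    GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
    FLUSH PRIVILEGES;
    SQL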

Install Hive:

Add the service to the cluster through Cloudera Manager.

Create/Restore a snapshot of an HDFS directory

In this tutorial, we focus on HDFS snapshots. Common use cases of HDFS snapshots include backups and protection against user errors.

Create a snapshot of an HDFS directory:

An HDFS directory must have snapshots enabled before snapshots of it can be created. The steps are:

  • From the Clusters tab, select the HDFS service.
  • Go to the File Browser tab and select the directory.


  • Verify the Snapshottable Path and click Enable Snapshots. (The equivalent command-line steps are sketched below.)
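
A minimal command-line sketch, assuming a snapshottable directory /data and a file important.csv to restore (both names are examples); the dfsadmin command must be run as the HDFS superuser:

    # Allow snapshots on the directory (HDFS superuser required)
    hdfs dfsadmin -allowSnapshot /data

    # Create a named snapshot
    hdfs dfs -createSnapshot /data snap1

    # Restore a lost file by copying it back from the read-only snapshot
    hdfs dfs -cp /data/.snapshot/snap1/important.csv /data/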


Decommission/Recommission – DataNode in Cloudera

Commissioning a node means adding a new node to the cluster that runs your Hadoop framework; decommissioning means removing a node from the cluster. This is a very useful feature for handling node failures during cluster operation without stopping the entire cluster.

Decommission:

You can't decommission a DataNode (or a host running a DataNode) if the number of DataNodes equals the replication factor. If you attempt to decommission a DataNode in that situation, the decommission process will never complete; you have to abort it and lower the replication factor first.


In my case, I have two DataNodes, and decommissioning one will leave only one. So before starting the decommission process, change the replication factor to 1.

The same can be done via the command line, as sketched below.
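
A minimal sketch, assuming the exclude file configured in dfs.hosts.exclude lives at /etc/hadoop/conf/dfs.exclude and the node to remove is datanode2.example.com (both are examples; Cloudera Manager normally manages these files for you):

    # Lower the replication factor of existing files to 1 (-w waits for completion)
    hdfs dfs -setrep -w 1 /

    # Add the DataNode's hostname to the exclude file referenced by dfs.hosts.exclude
    echo "datanode2.example.com" >> /etc/hadoop/conf/dfs.exclude

    # Tell the NameNode to re-read the include/exclude files and start decommissioning
    hdfs dfsadmin -refreshNodes

    # Watch the decommission progress
    hdfs dfsadmin -report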

Now restart the stale services.

Configure High Availability – HDFS/YARN via Cloudera

The NameNode is the centerpiece of an HDFS file system. In earlier releases, it was a single point of failure (SPOF) in an HDFS cluster: each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole was unavailable until the NameNode was restarted or brought up on a separate machine. The Secondary NameNode did not provide failover capability. The HA architecture solves this availability problem by allowing two NameNodes in an active/passive configuration.

To enable NameNode HA in Cloudera, ensure that the two NameNode hosts have the same configuration in terms of memory, disk, etc., for optimal performance. Here are the steps.

ZooKeeper:

First, install ZooKeeper to set up HA for the NameNode.

Select the cluster -> Actions -> Add Service, and a pop-up will appear.

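Once the HA wizard completes, you can verify the state of each NameNode from the command line. A minimal sketch, assuming the NameNode IDs are nn1 and nn2 (yours may differ; check dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

    # Show the HA state (active/standby) of each NameNode
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Check the health of a NameNode
    hdfs haadmin -checkHealth nn1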

Cloudera Manager – Installation on Google Cloud

In this post, I am going to show you how to set up a Hadoop cluster on Google Cloud Platform.

Register on Google Cloud Platform

First of all, you have to register on Google Cloud. It's easy: just sign in with your Gmail ID and fill in your credit card details. Once registered, you get a one-year free trial with 300 USD of credit on Google Cloud.

How to create Virtual Machines

  • Create a new project. Give your project a name or leave the one provided by Google.
  • Now click the icon in the top left corner of your homepage. A list of the products and services that Google Cloud provides will appear. Click Compute Engine, then click VM instances. When the VM Instances page opens, select Create Instance. (A gcloud CLI equivalent is sketched below.)
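
If you prefer the command line, the same VM can be created with the gcloud CLI. A minimal sketch, assuming a project named my-hadoop-project and an instance named cm-node1 (all names and sizes here are examples):

    # Select the project (the name is an example)
    gcloud config set project my-hadoop-project

    # Create a VM sized for a small Cloudera Manager node
    gcloud compute instances create cm-node1 \
        --zone=us-central1-a \
        --machine-type=n1-standard-4 \
        --image-family=centos-7 \
        --image-project=centos-cloud \
        --boot-disk-size=100GB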


Install Cloudera Manager from a Local Repository

This section explains how to set up a local yum repository to install CDH on the machines in your cluster. There are a number of reasons you might want to do this, for example:

  • The servers in your cluster don't have access to the internet. You can still use yum to install on those machines by creating a local yum repository.
  • To make sure that each node has the same version of the software installed.
  • A local repository is more efficient, since packages are downloaded from the internet only once.

You still need an internet connection on one machine to download the repo files and packages initially.

Set up Local Repo:

Create a local web publishing directory, install a web server such as Apache httpd on the machine that will host the RPMs, and start the HTTP server, as sketched below.
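
A minimal sketch, assuming a CentOS/RHEL host and /var/www/html/cloudera-repos as the publishing directory (the path is an example):

    # Install and start the Apache web server
    sudo yum install -y httpd
    sudo systemctl start httpd

    # Create the web publishing directory and place the downloaded
    # Cloudera Manager RPMs inside it
    sudo mkdir -p /var/www/html/cloudera-repos

    # Generate yum repository metadata for the directory
    sudo yum install -y createrepo
    sudo createrepo /var/www/html/cloudera-repos

Each cluster node can then point a .repo file at http://<webserver>/cloudera-repos to install from this repository.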