
Amazon Redshift – Part 2

All the interaction in my previous post was done via SQL. Amazon Redshift also has a management console that provides insight into the operation of the system, so let’s have a look.

Examine the load operation:

Amazon Redshift maintains information about every data load query performed. You can see the query duration, the start time and the SQL executed.
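The same load history can also be pulled with SQL instead of the console. The following is a minimal sketch, not the console's own implementation: it assumes you already have a DB-API 2.0 connection to the cluster, it reads the `STL_QUERY` system table, and the filter on `COPY` statements and the limit of 20 rows are my own illustrative choices.

```python
# Sketch: list recent load (COPY) queries with start time, duration and SQL text.
# Assumption: `connection` is any DB-API 2.0 connection to the Redshift cluster.
LOAD_HISTORY_SQL = """
SELECT query,
       starttime,
       DATEDIFF(second, starttime, endtime) AS duration_seconds,
       TRIM(querytxt) AS sql_text
FROM stl_query
WHERE querytxt ILIKE 'copy %'
ORDER BY starttime DESC
LIMIT 20;
"""


def fetch_recent_loads(connection):
    """Run the load-history query and return all rows as a list of tuples."""
    cursor = connection.cursor()
    try:
        cursor.execute(LOAD_HISTORY_SQL)
        return cursor.fetchall()
    finally:
        # Always release the cursor, even if the query fails.
        cursor.close()
```

Each returned row holds the query ID, start time, duration in seconds and the SQL text, which mirrors what the console shows for a load.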


You can see the system performance under cluster performance: CPU utilization, network throughput, write IOPS and so on. From the same page you can see the queries that executed in the cluster. I highlighted a query, and its information is displayed on the left side.


Use the Status tab to view information about the cluster.

Amazon Redshift – Part 1

As your application gains popularity and traction, the size of the data you have to analyze grows rapidly. Some queries start taking a long time, and the data becomes unmanageable in traditional databases. At that point we start looking for a data warehouse solution that keeps the data organized and easily accessible. If you are evaluating data warehouse solutions, keep Redshift in mind.

What is Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a couple hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Redshift builds a powerful solution by using various AWS services. Its features are:

  • Exceptionally fast at loading data and querying it for analytical and reporting purposes.
  • High performance due to massive parallelism across multiple nodes, columnar storage that reduces I/O, and data compression that reduces the memory footprint and greatly improves I/O speed.
  • Scales horizontally and integrates well with other parts of the AWS ecosystem, such as S3 and EMR.
  • Redshift comes with various security features.
  • ANSI SQL compatible.
  • Redshift uses a 1 MB block size; the larger block size reduces the number of I/O requests, hence better performance.
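To make the block-size point concrete, here is a small worked calculation. It is only a sketch: the 1 GiB column size is a hypothetical figure, and the 8 KB block is an assumed comparison value that is typical for traditional row-store databases, not something taken from Redshift.

```python
def io_requests(data_bytes: int, block_bytes: int) -> int:
    """Number of block reads needed to scan data_bytes, rounded up."""
    return -(-data_bytes // block_bytes)  # ceiling division


GIB = 1024 ** 3
column_bytes = 1 * GIB        # hypothetical size of one column on disk
redshift_block = 1024 ** 2    # 1 MB block size, as stated above
small_block = 8 * 1024        # 8 KB block, assumed for comparison

print(io_requests(column_bytes, redshift_block))  # 1024
print(io_requests(column_bytes, small_block))     # 131072
```

Under these assumptions, scanning the same column takes 128 times fewer read requests with 1 MB blocks.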

Redshift Architecture:

[Image: Redshift architecture diagram]

Configure HA – HiveMetastore and Load Balancing for HiveServer2

Apache Hive is data warehouse software built on top of Apache Hadoop for data summarization, query and analysis. Hive provides a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop.

Configuring High Availability for Hive requires the following components to be fault tolerant:

  • Hive MetaStore – RDBMS (MySQL)
  • ZooKeeper
  • Hive MetaStore Server
  • HiveServer2

Set up the MySQL database:

First of all, set up a MySQL database for the Hive metastore. Here are the steps:

Now log in to MySQL, create the Hive database and user, and grant the privileges.
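The MySQL step can be sketched as a handful of statements. This is an illustration under assumptions, not the exact commands from my setup: the database name `metastore`, the user `hive`, the host pattern `%` and the password are all hypothetical placeholders you should replace with your own values.

```python
import re
from typing import List

_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")


def metastore_setup_sql(database: str, user: str, host: str, password: str) -> List[str]:
    """Build the MySQL statements that create the metastore database and user."""
    if not _IDENTIFIER.match(database) or not _IDENTIFIER.match(user):
        raise ValueError("database and user must be simple identifiers")
    if any(ch in host + password for ch in "'\\"):
        raise ValueError("quotes and backslashes are not allowed in host or password")
    return [
        f"CREATE DATABASE IF NOT EXISTS `{database}`;",
        f"CREATE USER '{user}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{user}'@'{host}';",
        "FLUSH PRIVILEGES;",
    ]


# Hypothetical values for illustration only.
for statement in metastore_setup_sql("metastore", "hive", "%", "change_me"):
    print(statement)
```

Run the printed statements in the `mysql` client as an administrative user. Granting all privileges keeps the sketch short; a tighter grant may suit a production metastore better.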

Install Hive:

Add the Hive service to the cluster through Cloudera Manager.

Create/Restore a snapshot of an HDFS directory

In this tutorial, we focus on HDFS snapshots. Common use cases of HDFS snapshots include backups and protection against user errors.

Create a snapshot of an HDFS directory:

An HDFS directory must be enabled for snapshots before snapshots of it can be created. The steps are:

  • From the Clusters tab, select the HDFS service.
  • Go to the File Browser tab and select the directory.


  • Verify the Snapshottable Path and click Enable Snapshots.
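The steps above use Cloudera Manager, but the same operations are available from the HDFS command line. Here is a minimal sketch that only builds the commands; the directory `/data`, the snapshot name `snap1` and the file name are hypothetical, and actually running the commands assumes the `hdfs` client is on your PATH and that you have the required permissions.

```python
import subprocess
from typing import List


def allow_snapshot_cmd(path: str) -> List[str]:
    """Command that marks a directory as snapshottable (needs HDFS superuser)."""
    return ["hdfs", "dfsadmin", "-allowSnapshot", path]


def create_snapshot_cmd(path: str, name: str) -> List[str]:
    """Command that creates a named snapshot of a snapshottable directory."""
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def restore_cmd(path: str, name: str, relative: str, destination: str) -> List[str]:
    """Command that copies one item back out of the read-only .snapshot area."""
    source = f"{path.rstrip('/')}/.snapshot/{name}/{relative}"
    return ["hdfs", "dfs", "-cp", source, destination]


def run(cmd: List[str]) -> None:
    """Execute a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


# Hypothetical usage; printed rather than executed.
for cmd in (
    allow_snapshot_cmd("/data"),
    create_snapshot_cmd("/data", "snap1"),
    restore_cmd("/data", "snap1", "report.csv", "/data/"),
):
    print(" ".join(cmd))
```

Restoring here simply means copying files out of the `.snapshot` directory, which is how a file deleted by mistake can be recovered.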


Decommission/Recommission – DataNode in Cloudera

Commissioning nodes means adding new nodes to the cluster that runs your Hadoop framework. In contrast, decommissioning nodes means removing nodes from the cluster. This is a very useful feature for handling node failure while the Hadoop cluster is operating, without stopping all the nodes in the cluster.


You can’t decommission a DataNode, or a host with a DataNode role, if the number of DataNodes equals the replication factor. If you attempt to decommission a DataNode in that situation, the decommission process will not complete; you have to abort it and change the replication factor.


In my case, I have two DataNodes, and decommissioning one will leave only one DataNode. Before starting the decommission process, change the replication factor to 1.

The same can be done via the command line.
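The replication-factor rule and the command-line route can be sketched as follows. This is an illustration under assumptions: the helper names are my own, and the sketch lowers replication on existing files under `/` with `hdfs dfs -setrep`. The default replication for new files is a separate configuration setting that this sketch does not change.

```python
from typing import List


def can_decommission(live_datanodes: int, nodes_to_remove: int, replication_factor: int) -> bool:
    """True if enough DataNodes remain to hold every replica after removal."""
    return live_datanodes - nodes_to_remove >= replication_factor


def setrep_cmd(replication: int, path: str = "/") -> List[str]:
    """Command that sets replication on existing files and waits (-w) until done."""
    return ["hdfs", "dfs", "-setrep", "-w", str(replication), path]


# My case: two DataNodes, removing one.
print(can_decommission(2, 1, 2))  # False
print(can_decommission(2, 1, 1))  # True
print(" ".join(setrep_cmd(1)))    # hdfs dfs -setrep -w 1 /
```

With two DataNodes and a replication factor of 2 the check fails, which matches the stuck decommission described above; after lowering the factor to 1 it passes.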

Now restart the stale services.

Configure High Availability – HDFS/YARN via Cloudera

The NameNode is the centerpiece of an HDFS file system. In earlier releases, it was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine. The Secondary NameNode did not provide failover capability. The HA architecture solved this problem of NameNode availability by allowing us to have two NameNodes in an active/passive configuration.

To enable NameNode HA in Cloudera, make sure the two NameNode hosts have the same configuration in terms of memory, disk, etc. for optimal performance. Here are the steps.


First of all, install ZooKeeper, which is needed to set up HA for the NameNode.

Select the cluster -> Actions -> Add Service, and a pop-up will appear.
