Analyze Big data with EMR

Amazon Elastic MapReduce (EMR) is a managed cluster platform for processing and analyzing large amounts of data. As data volumes grow, you eventually run into processing bottlenecks. EMR addresses this by using a Hadoop cluster to split big data sets into smaller jobs and distribute them across many compute nodes. It does this with big data frameworks and open source projects, including:

  • Apache Hadoop, Spark, HBase
  • Presto
  • Zeppelin, Ganglia, Pig, Hive, etc.

Amazon EMR is mainly used for log processing and analysis, ETL processing, clickstream analysis, and machine learning.
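The split-and-distribute idea can be sketched on a single machine. The snippet below is only an illustration of the map and reduce steps, not EMR or Hadoop code: the function names and the sample log lines are made up for this example, and a thread pool stands in for the cluster's compute nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Split one large input into smaller, independent jobs."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count the log level (first word) of each line in one chunk."""
    return Counter(line.split()[0] for line in chunk if line.strip())


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_log_levels(lines, chunk_size=2, workers=4):
    chunks = split_into_chunks(lines, chunk_size)
    # Threads stand in for compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Hypothetical sample input for illustration only.
SAMPLE_LOG = [
    "ERROR disk full",
    "INFO job started",
    "INFO job finished",
    "WARN slow node",
    "ERROR task failed",
]

if __name__ == "__main__":
    print(count_log_levels(SAMPLE_LOG))
```

Each chunk is processed independently, which is what allows a real cluster to hand the chunks to different nodes and merge the partial results afterwards.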

EMR Architecture:

The Amazon EMR architecture contains the following three types of nodes:

  • Master node:
    • By default, an EMR cluster has a single master node, with no second master node to fail over to.
    • Manages the resources of the cluster.
    • Coordinates the distribution and parallel execution of MapReduce executables.
    • Tracks and directs HDFS.
    • Monitors the health of core and task nodes.
    • The YARN ResourceManager also runs on the master node and is responsible for scheduling resources.
  • Core nodes:
    • Core nodes are worker nodes that run tasks as directed by the master node.
    • Core nodes store data in HDFS (EMRFS data, by contrast, lives in Amazon S3), so the HDFS data daemons run on core nodes.
    • Core nodes also run the NodeManager, which takes instructions from the ResourceManager on how to manage resources.
    • The ApplicationMaster is a task that negotiates resources with the ResourceManager and works with the NodeManager to execute and monitor application containers.
  • Task nodes:
    • Task nodes are also controlled by the master node and are optional.
    • They provide extra capacity to the cluster in terms of CPU and memory.
    • They can be added to or removed from a running cluster at any time.
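The three node types map onto the instance groups you pass when creating a cluster. The sketch below only builds a request dictionary in the shape accepted by boto3's `run_job_flow`; it makes no AWS call. The helper names, instance type, release label and role names are assumptions chosen for illustration, so adjust them to your account.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Describe the master, core and (optional) task groups of a cluster."""
    groups = [
        # Exactly one master node in this sketch.
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and hold HDFS data.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional extra compute capacity.
        groups.append({"Name": "Task", "InstanceRole": "TASK",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


def build_cluster_request(name, core_count=2, task_count=0,
                          instance_type="m5.xlarge", release_label="emr-6.15.0"):
    """Assemble keyword arguments for an EMR cluster (values are assumptions)."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": build_instance_groups(core_count, task_count, instance_type),
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", task_count=2)
    for group in request["Instances"]["InstanceGroups"]:
        print(group["InstanceRole"], group["InstanceCount"])
```

With boto3 installed and credentials configured, the request could then be submitted with `boto3.client("emr").run_job_flow(**request)`.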


Amazon Machine Learning – Introduction

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. With the rise of big data, machine learning has become a key technique for solving problems.

Machine learning uses two types of techniques:

  • Unsupervised
    • Used for unlabeled data, where the output is not known.
    • Self-guided learning algorithm.
    • Clustering technique: the aim is to use exploratory data analysis to find hidden patterns or groupings in data.
  • Supervised
    • Used for labelled data, where the desired output is known.
    • The algorithm is given training data to learn from.
    • Techniques available: classification and regression.
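The difference between the two techniques can be illustrated with a small standard-library sketch. It is not how Amazon ML works internally: `fit_line` is a basic least-squares regression (supervised, since each x comes with a known y), and `two_means` is a one-dimensional two-cluster grouping (unsupervised, since no labels are given). Both function names and the sample numbers are made up for this example.

```python
def fit_line(xs, ys):
    """Supervised: learn slope and intercept from labelled (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def two_means(values, iterations=10):
    """Unsupervised: group unlabelled numbers into two clusters."""
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is only one group.
        return sorted(values), []
    for _ in range(iterations):
        # Assign each value to the nearer of the two centres.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each centre to the mean of its group.
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return sorted(near_low), sorted(near_high)


if __name__ == "__main__":
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print("predicted y for x=5:", slope * 5 + intercept)
    print("clusters:", two_means([1, 2, 3, 10, 11, 12]))
```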

Amazon Machine Learning:

Amazon ML is a robust machine learning platform that allows developers to train predictive models. Amazon ML creates models from supervised data sets. A model is created from a set of known observations, called training data. When setting up a new model in Amazon ML, we first need to upload our data. The data needs to be CSV-formatted, with the first row containing the name of each data field and each following row containing a data sample. Training data sets can be huge, so they need to be uploaded from either Amazon S3 or Redshift storage.
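Before uploading, it can help to check that a file follows the layout described above: a header row naming every field, followed by rows with the same number of fields. The sketch below is a local sanity check using Python's `csv` module, not part of Amazon ML; the function name and the sample column names (`review_text`, `stars`, `liked`) are assumptions for illustration.

```python
import csv
import io


def check_training_csv(text):
    """Return (field_names, row_count); raise ValueError on a malformed file."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header or any(not name.strip() for name in header):
        raise ValueError("first row must name every data field")
    row_count = 0
    for row_number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(
                f"row {row_number} has {len(row)} fields, expected {len(header)}"
            )
        row_count += 1
    return header, row_count


# Hypothetical sample data; a quoted field may contain commas.
SAMPLE_CSV = (
    "review_text,stars,liked\n"
    '"Great food, friendly staff",5,1\n'
    "Cold and slow,1,0\n"
)

if __name__ == "__main__":
    print(check_training_csv(SAMPLE_CSV))
```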

To test Amazon ML, I uploaded two datasets to S3. The first uses customer review data to predict whether a customer will like a restaurant or not. The second predicts house prices based on previous sales.

Amazon Redshift – Part 2

All the interaction in my previous post was done via SQL. Amazon Redshift also has a management console that provides insight into the operation of the system. So let's have a look.

Examine Load operation:

Amazon Redshift maintains information about every data load query performed. You can see the query duration, the start time, and the SQL executed.


You can see the system performance under cluster performance: CPU utilization, network throughput, write IOPS, etc. From the same page you can see the queries that executed in the cluster. I highlighted a query, and its information is displayed on the left side.


Use the Status tab to view information about the cluster.

Amazon Redshift – Part 1

As your application gains popularity and traction, the size of the data that you have to analyze increases exponentially. Some queries start taking a lot of time, and the data becomes unmanageable in traditional databases. At that point we start looking for a data warehouse solution that keeps the data organized and easily accessible. If you are looking for one, keep Redshift in mind.

What is Redshift??

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a couple hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Redshift builds a powerful solution on top of various AWS services. Its features include:

  • Exceptionally fast at loading data and querying it for analytical and reporting purposes.
  • High performance due to massive parallelism across multiple nodes, plus columnar storage and data compression, which reduce the storage footprint and the amount of I/O.
  • Scales horizontally and integrates well with the rest of the AWS ecosystem, such as S3 and EMR.
  • Redshift comes with various security features.
  • ANSI SQL compatible.
  • Redshift uses a 1 MB block size; the larger block size reduces the number of I/O requests and hence improves performance.
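Why columnar storage and compression cut I/O can be shown with a toy example. This is a conceptual sketch only and does not model Redshift's actual block layout or compression encodings; the table, column names, and helper functions are invented for illustration.

```python
from itertools import groupby

# Hypothetical row-oriented table: (order_id, country, amount).
ROWS = [
    (1, "US", 120),
    (2, "US", 80),
    (3, "US", 45),
    (4, "DE", 200),
    (5, "DE", 60),
]
COLUMN_NAMES = ("order_id", "country", "amount")


def to_columns(rows, names):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Compress runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


if __name__ == "__main__":
    columns = to_columns(ROWS, COLUMN_NAMES)
    # An aggregate over one column touches only that column's values,
    # instead of every field of every row.
    print("total amount:", sum(columns["amount"]))
    # Values in a single column are often repetitive, so they compress well.
    print("encoded country column:", run_length_encode(columns["country"]))
```

A query such as a sum over `amount` reads one of the three columns here, and the `country` column shrinks from five stored values to two runs.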

Redshift Architecture:

[Image: Redshift architecture]