Analyze Big Data with EMR

Amazon EMR (Elastic MapReduce) is a fully managed cluster platform that processes and analyzes large amounts of data. As data volumes grow, a single machine eventually runs into processing limits. Using a Hadoop cluster, EMR addresses this by splitting big data sets into smaller jobs and distributing them across many compute nodes. EMR does this with open source big data frameworks and projects, including:

  • Apache Hadoop, Spark, HBase
  • Presto
  • Zeppelin, Ganglia, Pig, Hive, etc.
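
As a rough illustration, here is a minimal sketch of launching an EMR cluster with a few of these frameworks pre-installed, using the boto3 SDK. The cluster name, log bucket, region, and instance settings are placeholder assumptions, not values from this article:

```python
import boto3

# Minimal sketch, assuming the default EMR IAM roles already exist
# and an S3 bucket for logs (here "my-emr-logs") is available.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-bigdata-cluster",        # placeholder cluster name
    ReleaseLabel="emr-6.15.0",          # any current EMR release label
    LogUri="s3://my-emr-logs/",         # hypothetical log bucket
    Applications=[                      # frameworks from the list above
        {"Name": "Hadoop"},
        {"Name": "Spark"},
        {"Name": "Hive"},
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,             # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance/service roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```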

Amazon EMR is mainly used for log processing and analysis, ETL processing, clickstream analysis, and machine learning.
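
For example, an ETL or log-processing job is typically submitted to a running cluster as a step. The sketch below adds a Spark step with boto3; the cluster ID, script location, and bucket paths are hypothetical:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster ID and S3 paths, for illustration only.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SCRIPT = "s3://my-etl-bucket/jobs/clean_logs.py"

emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "clean-web-logs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar lets EMR run spark-submit on the cluster
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    SCRIPT,
                    "--input", "s3://my-etl-bucket/raw/",
                    "--output", "s3://my-etl-bucket/curated/",
                ],
            },
        }
    ],
)
```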

EMR Architecture:

The Amazon EMR architecture contains the following three types of nodes:

  • Master Nodes:
    • EMR has a single master node; there is no standby master node to fail over to.
    • The master node manages the resources of the cluster.
    • It coordinates the distribution and parallel execution of MapReduce executables.
    • It tracks and directs HDFS.
    • It monitors the health of the core and task nodes.
    • The ResourceManager also runs on the master node and is responsible for scheduling resources.
  • Core nodes:
    • Core nodes are slave nodes and run tasks as directed by the master node.
    • Core nodes hold data as part of HDFS or EMRFS, so the data daemons run on core nodes and store the data.
    • Core nodes also run the NodeManager, which takes instructions from the ResourceManager on how to manage resources.
    • The ApplicationMaster is a task that negotiates resources with the ResourceManager and works with the NodeManager to execute and monitor application containers.
  • Task Nodes:
    • Task nodes are also controlled by the master node and are optional.
    • These nodes provide extra capacity to the cluster in terms of CPU and memory.
    • They can be added to or removed from a running cluster at any time (see the sketch after this list).
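
As a rough sketch of that elasticity, extra task capacity can be added to a running cluster as a new instance group, often on Spot Instances since task nodes hold no HDFS data. The cluster ID, instance type, and Spot price below are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add two Spot task nodes to an existing cluster (hypothetical cluster ID).
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {
            "Name": "extra-task-capacity",
            "InstanceRole": "TASK",   # task nodes: compute only, no HDFS data
            "InstanceType": "m5.xlarge",
            "InstanceCount": 2,
            "Market": "SPOT",         # Spot is a common choice for task nodes
            "BidPrice": "0.10",       # optional max Spot price (assumption)
        }
    ],
)
```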
