Amazon Enterprise MapReduce is a fully managed cluster platform that process and analyze larger amount of data. When you run a large amount of data you eventually run into processing problems. By using hadoop cluster EMR can help in reducing large processing problems and split big data sets into smaller jobs and distribute them across many compute nodes. EMR can do this with big data framework and open source projects. Big data framework includes :
- Apache Hadoop, Spark, Hbase
- Zeppelin, Ganglia, Pig, hive etc..
Amazon EMR mainly used for log processing and analysis, ETL Processing, Clickstream analysis and Machine learning.
Amazon EMR architecture contains following three types of nodes:
- Master Nodes:
- EMR have Single Master Node and don’t have another master node to fail over.
- Master node manages resources of the cluster
- Co-ordinates distribution and parallel execution of MapReduce executable.
- Tracking and directing HDFS.
- Monitor health of core and task nodes.
- Resource Manager also running on master nodes which is responsible for scheduling the resources.
- Core nodes:
- Core nodes are slaves nodes and run the tasks as directed by master node.
- Core contains data as part of HDFS or EMRFS. So data daemons runs on core nodes and store the data.
- Core nodes also run NodeManager which takes action from Resource Manager like how to manage the resources.
- ApplicationMaster is task which negotiates the resources with Resource Manager and working with NodeManager to execute and monitor application containers.
- Task Nodes:
- Task nodes also controlled by master and are optional.
- These nodes are required to provide extra capacity to the cluster in terms of CPU and memory
- Can be added/removed any time from running cluster.
Various storage option are available for EMR cluster.
- Instance store: Local storage attached to EC2 instance but data lose after terminating the EMR cluster. Can be used where high i/o or high IOPS at low-cost.
- EBS volume: EBS volume for Data storage but data lost after EMR cluster termination.
- EMR FS: An implementation of HDFS which allows cluster to store/ingest data directly from S3. Data copy from S3 to HDFS can be done via S3DistCp
Launch a Cluster:
You can launch EMR cluster with few clicks. Sign up for AWS cloud and go to AWS console.
On service Menu click EMR and further click create cluster. There are two options to create cluster.
I’ll go through Quick One first. Fill up the following information.
- Cluster Name: testcluster
- Logging : S3 buket to keep the hadoop logs
- Release : Choose the latest version
- Applications : Only 4 sets are available and need to choose from those sets.
- Instance type: m4.large ( all nodes same type)
- Number of instances: 2 ( 1 Master and 1 core)
- EC2 key pair: processed without Key pair
- EMR role and profile : leave it default(EMR will create the default roles for you)
- Click create cluster and will take 5 mins to create the cluster.
When you click on create cluster, choose go to advance option.
- From here you can choose the EMR release and custom application that you want to install. In quick option, you won’t able to choose the specific application but have to select one set out of 4 sets. I choose cluster with application hive and spark.
- Secondly you can enter hadoop configuration like change the YARN job logs in JSON format.
- You can load JSON file as well from S3.
- In advance option you can specify the VPC and subnet settings and EBS volume sizes etc.. This option is not available in quick option.
- Advance option allow you to choose the different configuration of master, core and task nodes.In quick option all nodes are of same type.
- Auto scaling can also be configured in Advance option.
- Terminate protection is enabled by default. This will protect the cluster from being accidentally terminated by Amazon.
- But for Transient cluster ( like temp cluster who do particular task and automatically terminate after the task), disable the terminate protection.
- Boot strap scripts can be executed in advance EMR option.
- Authentication and encryption: can be changed.
- security groups: additional security rules can be applied.
Cluster will be ready with in 4-5 mins time. Download the key-pair as well to connect to the cluster.
EMR log file:
Like any Hadoop environment, Amazon EMR would generate a number of log files. While some of these log files are common for any Hadoop system, others are specific to EMR. Here is a brief introduction to these different types of log files.
- Bootstrap Action Logs: These logs are specific to Amazon EMR. It’s possible to run bootstrap actions when the cluster is created. An example of bootstrapping can be installing a Hadoop component not included in EMR, or using a certain parameters in a configuration file. The bootstrap action logs contain the output of these actions.
- Instance State Logs: These log files contain infrastructure resource related information, like CPU, memory, or garbage collection.
- Hadoop/YARN Component Logs: These logs are associated with the Hadoop daemons like those related to HDFS, YARN, Oozie, Pig, or Hive. I instructed to create YARN related log files in S3.
- Step Logs: This type log is specific to Amazon EMR. As we said, an EMR cluster can run one or more steps of a submitted job. These steps can be defined when the cluster is created, or submitted afterwards. Each step of a job generates four types of log files. Collectively, these logs can help troubleshoot any user-submitted job.
Loading Data into Hive:
Once cluster is ready connect to the cluster by using hadoop user with private key. Now run the hive program and create the table. The dataset is already uploaded to “s3://testcluster-emr/input/restaurant.data” bucket.
Once hive table is created query the Hive table to list the rating for the restaurant. It took almost 33 seconds to execute the query.
# Create table statement in Hive
CREATE EXTERNAL TABLE `restaurants_data`(
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
# Run query to get the result.
select count(*), rating from restaurants_data group by rating;
Now exit hive and get ready with Spark SQL.
Before running the Spark SQL turn off the verbose logging. Spark SQL is compatible with hive metastore which means tables created with Hive don’t need to be re-created for Spark SQL.
Now cache the restaurant table created by hive in Spark SQL. Caching will take time depends upon how big the table is??? Once done, run the same select Sql query that we ran with hive. You’ll get the query result in 3 seconds.
Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast.
Spark SQL setting (Turn off verbose logging):
sudo sed -i -e 's/rootCategory=INFO/rootCategory=WARN/' /etc/spark/conf/log4j.properties
select count(*), rating from restaurants_data group by rating
Submit your Hive script as a step:
Use the Add Step option to submit your Hive script to the cluster using the console. The Hive script have been uploaded to Amazon S3 for you. My hive script contain the select statement only.
Go to the cluster and scroll to the Steps section and expand it, then choose Add step.
- Step type: Hive program
- Name: Count hive table or anything
- Script : S3 location ( My script contain the select statement only)
- Input S3 location : Dataset ( you can upload a script to create the table and dataset can be as input file)
- output: The query writes results
- Arguments: Include the following argument to allow column names that are the same as reserved words if any.
- For Action on failure, accept the default option Continue.
- Click Add.
You can see the hive step logs and jobs once completed.
This example was a simple demonstration. In real life, there may be dozens of steps with complex logic, each generating very large log files. Manually looking through thousands of lines of log may not be practical for troubleshooting purposes. So S3 is good candidate for placing step logs for troubleshooting.
EMR CLI commands are listed here. You can run these to EMR nodes.
[hadoop@ip-172-31-64-121 emr]$ aws emr list-clusters
[hadoop@ip-172-31-64-121 emr]$ aws emr list-instances --cluster-id j-3RUZVHS6CKWYC
EMR cluster management console provide the GUI console to monitor, resize, terminate and a lot of other features. The same can be achieved EMR CLI.
You can monitor the cluster status, node stats, CPU, I/O and application jobs completed etc… Really handy.
Application history can be monitored separately as well which will show the type of the job and successful/failure status etc..
Select the cluster and see the overall status from one screen. You can resize the cluster from hardware option.
You can increase the core nodes from here. There is only one master node and you can’t change that. NO HA at the moment but meta data can be stored in metastore and later can be used to restore any failed cluster. Just update the count in core node and click green tick.
Resizing will take time and once done cluster core nodes status will be in running state.
Additional task nodes can be added from same tab.
To terminate the cluster, you need to turn off the termination protection.
This option is available in the summary page.
In this post, we had a quick introduction to Amazon EMR cluster, different launch options, running the hive/spark SQL queries and different types of log files.