First, a short introduction to the two services. Amazon EMR is a web service that makes it easy to process large amounts of data efficiently using Hadoop, Spark, HBase, Presto, Hive, and other big data frameworks. You can create a Hadoop cluster of any size through the web console, through the CLI, or programmatically. Cloudera is open source (an Enterprise version is also available), so you have access to the source code and can inspect it for debugging and modify it as required.
Amazon EMR vs. Cloudera Comparison
|Feature||AWS EMR||Cloudera on EC2|
|Auto Scaling||EMR divides slave nodes into two subtypes: core nodes and task nodes. Task nodes do not store HDFS data, so they can run on Spot Instances and be added or removed freely, which gives high scalability at low cost.||Cloudera on EC2 does not categorize slave nodes into core and task nodes. This increases the risk of losing HDFS data if a node is removed or lost.|
|Dynamic Orchestration||You can orchestrate a new cluster on demand within a very short time. Once the jobs complete successfully, the cluster can be terminated, which improves utilization and reduces costs drastically.||You need to stop the instances when you are not using them and start them again when your application needs to process data. A cluster left running on EC2 while idle consumes resources unnecessarily.|
|Access to Amazon S3||You can access data on S3 from EMR directly or through Hive tables. EMR is highly tuned for working with data on S3 through AWS-proprietary binaries, and it works seamlessly with other Amazon services.||Cloudera on EC2 uses Apache libraries (s3a) to access data on S3, whereas EMR uses AWS-proprietary code for faster access to S3.|
|High Availability||The EMR service continuously monitors the slave nodes and replaces any unhealthy node with a new one behind the scenes.||Unlike EMR, Cloudera on EC2 does not categorize slave nodes into core and task nodes. This increases the risk of losing HDFS data if a node is removed or lost.|
|Ease of Use||AWS manages the EMR Hadoop service as well as the underlying AWS infrastructure, so you can quickly start a new Hadoop cluster and begin processing data.||Cloudera is comparatively more difficult to learn and configure, but once it is set up it is far more flexible than EMR, and there is no additional infrastructure cost. Cloudera Manager has an easy-to-use web GUI that helps manage and monitor Hadoop services, clusters, and physical host hardware.|
|Hadoop Management Console||AWS does not provide a management console for EMR comparable to Apache Ambari or Cloudera Manager. This makes it difficult to manage and monitor the various Hadoop services on a running cluster.||In addition to Cloudera Manager, Cloudera provides Cloudera Director to enable self-service use of CDH in the cloud. It offers a single-pane-of-glass administration experience that lets central IT reduce costs and deliver agility, and lets end users easily provision and scale clusters.|
|On-Premise and Cloud Options||AWS does not provide an on-premise option, and EMR relies on other Amazon services.||Cloudera offers both on-premise and cloud options. This lets you reuse on-premise expertise: experience, human resources, and lessons learned.|
|Additional Features, Data Flexibility, and Debugging||EMR has a standard feature set and uses S3 for data processing. You cannot debug issues in the service itself, and you have little control over it.||Cloudera uses the open source Hue framework for its user interface. If you need new features in the web interface, you can implement them using the Hue SDK. Also, because you have control over the source code, you can debug issues more easily.|
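The core/task split and the create-then-terminate lifecycle described in the table can be sketched as a request for boto3's `run_job_flow` call. This is a minimal sketch under stated assumptions, not a production configuration: the helper names, the cluster name, the node counts, the release label `emr-6.15.0`, the S3 script path, and the use of the default EMR roles (`EMR_DefaultRole`, `EMR_EC2_DefaultRole`) are all illustrative and would need to match your own account.

```python
from typing import Any, Dict, List


def build_transient_cluster_request(
    name: str,
    core_count: int,
    task_count: int,
    steps: List[Dict[str, Any]],
    instance_type: str = "m4.large",
    release_label: str = "emr-6.15.0",  # assumed release label; use one available to you
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    groups = [
        # The master and core nodes stay on On-Demand capacity;
        # core nodes hold HDFS data, so losing them risks data loss.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so they can use cheaper Spot capacity.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": groups,
            # False makes the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request: Dict[str, Any]) -> str:
    """Submit the request; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical Spark step; the bucket and script path are placeholders.
example_step = {
    "Name": "nightly-job",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/jobs/job.py"],
    },
}

request = build_transient_cluster_request(
    "demo-cluster", core_count=2, task_count=3, steps=[example_step]
)
roles = [g["InstanceRole"] for g in request["Instances"]["InstanceGroups"]]
print(roles)
```

Calling `launch(request)` would start a real, billable cluster, so the sketch only builds and inspects the request. Keeping the builder separate from the AWS call also makes the configuration easy to review before anything is provisioned.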
Let’s take the example of configuring a 6-node Hadoop cluster (m4.large) for our data processing. These calculations are for the US East region.
|Cost item||AWS EMR||Cloudera on EC2|
|Instance cost per year (6 nodes)||$0.030 per hour × 6 × 24 × 365 = $1,576.80||$0.120 per hour × 6 × 24 × 365 = $6,307.20|
We have taken the worst-case scenario, in which the big data processing has to run throughout the year. With EMR, we can have one master node and five slave nodes. With Cloudera, we need six EC2 instances to run the six nodes. We could reduce the cost by using Reserved Instances, but in most scenarios these clusters are needed only for a shorter duration. The calculations above therefore suggest that EMR is much cheaper than a plain EC2 cluster running Cloudera.
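The arithmetic behind the comparison can be reproduced with a few lines of Python. This sketch takes the hourly rates quoted in the table at face value and ignores any other charges; the function name and the 4-hours-per-day scenario are hypothetical and serve only to illustrate a cluster that is needed for a shorter duration.

```python
def yearly_cost(rate_per_hour: float, nodes: int,
                hours_per_day: float = 24.0, days: int = 365) -> float:
    """Cost of running `nodes` instances at `rate_per_hour` for a year."""
    return round(rate_per_hour * nodes * hours_per_day * days, 2)


EMR_RATE = 0.030           # USD per instance-hour, as quoted in the table
CLOUDERA_EC2_RATE = 0.120  # USD per instance-hour, as quoted in the table
NODES = 6

# Worst case: all six nodes run around the clock for the whole year.
emr_full_year = yearly_cost(EMR_RATE, NODES)
cloudera_full_year = yearly_cost(CLOUDERA_EC2_RATE, NODES)

# Hypothetical lighter workload: the cluster is only up 4 hours per day.
emr_four_hours = yearly_cost(EMR_RATE, NODES, hours_per_day=4)

print(f"EMR, 24x7:        ${emr_full_year:,.2f}")       # $1,576.80
print(f"Cloudera, 24x7:   ${cloudera_full_year:,.2f}")  # $6,307.20
print(f"EMR, 4 hours/day: ${emr_four_hours:,.2f}")
```

Because the rates and node count are parameters, the same function can be rerun with current prices or a different cluster size before drawing a conclusion for a specific workload.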
Your choice will depend on your particular use case and on the effective cost on each platform.
- If you don’t want to invest time in managing and updating your distribution, AWS EMR is the best option for you.
- If your data is stored in S3 and you want to run an occasional job on it and write the results back to S3, Elastic MapReduce (EMR) makes sense.
- If you need to run a full Hadoop/HBase stack 24×7 and your data lives in a custom format or store (other than S3), Cloudera is the best option for you.
- If you need to debug issues and integrate with other software, Cloudera is the best option for you.