Amazon EMR (Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
EMR simplifies management with rapid cluster provisioning, managed scaling, and automated software installation. Using Hadoop, Spark, and related open-source projects such as Apache Hive and Presto, we can process petabyte-scale data for analytics and business intelligence workloads.
The main component of Amazon EMR is the cluster, which is a collection of Amazon Elastic Compute Cloud (EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.
The node types in Amazon EMR are as follows:
- Master node: Every cluster has at least one master node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster.
- Core node: A core node has software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on the cluster. Every multi-node cluster has at least one core node.
- Task node: A cluster can have optional task nodes with software components that only run tasks and do not store data in HDFS.
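
The three node types above map onto the instance roles `MASTER`, `CORE`, and `TASK` used when a cluster is requested through the AWS SDK. The following is a minimal sketch, not a definitive setup: the helper name `build_instance_groups`, the group names, the default counts, and the `m5.xlarge` instance type are all assumptions made for illustration. It only builds the configuration as a plain Python list and does not call AWS.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Build an instance-group list with one entry per EMR node type.

    Hypothetical helper for illustration: the defaults and the instance
    type are assumptions, not recommendations.
    """
    if core_count < 0 or task_count < 0:
        raise ValueError("node counts must not be negative")

    def group(name, role, count):
        # One instance group: a set of identical EC2 instances sharing a role.
        return {
            "Name": name,
            "InstanceRole": role,
            "InstanceType": instance_type,
            "InstanceCount": count,
            "Market": "ON_DEMAND",
        }

    # The master node coordinates the cluster; this sketch uses exactly one.
    groups = [group("Master", "MASTER", 1)]

    # Core nodes run tasks and store data in HDFS.
    if core_count > 0:
        groups.append(group("Core", "CORE", core_count))

    # Task nodes are optional: they run tasks but hold no HDFS data.
    if task_count > 0:
        groups.append(group("Task", "TASK", task_count))

    return groups


if __name__ == "__main__":
    for g in build_instance_groups(core_count=2, task_count=3):
        print(g["InstanceRole"], g["InstanceCount"])
```

With boto3, a list of this shape would typically be passed as `Instances={"InstanceGroups": groups}` to the EMR client's `run_job_flow` call, alongside a cluster name, a release label, and IAM roles that are not shown here.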