23: Amazon EMR and EMR Serverless Guide 📘

Hey readers! 👋
In this blog, I walk through my hands-on journey with Amazon EMR—starting from creating a traditional EMR cluster to leveraging the flexibility of EMR Serverless for both batch and interactive workloads. From provisioning resources and running PySpark jobs to handling data stored in S3, this guide captures the complete cycle of developing and executing Spark applications on AWS.
Definitions and EC2-based Setup
EMR is a managed big data platform that lets us deploy clusters with a desired number of nodes, pre-configured with applications such as Spark, Hive and Hadoop.
Firstly, we select Amazon EMR running on Amazon EC2 and create a cluster ‘mehul-testing-cluster‘. Note that we select the required application bundle.
Amazon EMR’s functionality revolves around different types of EC2 instances. There are three types of EC2 nodes:
Primary Node (Master): Runs the HDFS NameNode service and the YARN ResourceManager.
Core Nodes (Workers): Responsible for both storage and processing. In HDFS, they run the DataNode service and in YARN, they run the NodeManagers.
Task Nodes: Dedicated entirely to processing; they run YARN NodeManagers but do not store HDFS data.
There are two types of cluster configurations as well:
Uniform Instance Groups: It’s a fixed, predefined structure where each type of node is set up with a specific instance type, and all nodes of the same type share identical configurations.
Flexible Instance Fleets: It’s a flexible configuration allowing us to specify a mix of instance types and purchasing options, with a target capacity for each node type.
For our case, we choose Uniform Instance Groups, with all node types (Primary, Core & Task) as m4.large.
- However, we set the Task nodes to Spot Instances to reduce costs. Since Task nodes can tolerate interruptions, they are ideal for Spot capacity.
- We set the cluster size manually: 2 Core nodes and 1 Task node.
- Each region provides a default Virtual Private Cloud (VPC), with a subnet allocated for each Availability Zone within it.
- To connect to the cluster, we create an EC2 key pair ‘myemrkeypair’.
- We provide this key pair in the cluster’s security configuration.
- Afterwards, we click Create cluster and the cluster gets provisioned.
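For reference, the same console setup can also be expressed programmatically. Below is a minimal boto3 sketch, assuming the default EMR roles already exist; the region, release label and application list are assumptions and should be adjusted to match the actual console choices.

```python
# Minimal boto3 sketch of the cluster setup done above in the console.
# Region, release label and application list are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="mehul-testing-cluster",
    ReleaseLabel="emr-6.15.0",                       # assumed EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 2},
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m4.large", "InstanceCount": 1,
             "Market": "SPOT"},                      # Task nodes on Spot, as above
        ],
        "Ec2KeyName": "myemrkeypair",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Assumes the default roles were created with `aws emr create-default-roles`.
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```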
Running Jobs in EMR
- We connect to the cluster’s primary node over SSH using PuTTY (an SSH client).
- Once the connection is established, we can begin running our PySpark code, since PySpark comes pre-installed on the cluster.
- Consider a scenario where we have an S3 bucket ‘mehul-retaildata01‘ in which we have a file ‘orders.csv‘.
- From the PySpark shell, we can read this file into a DataFrame.
- To find the count of each ‘order_status’ in our data, we can then run a Spark SQL query. A sketch of both steps follows below.
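A minimal sketch of both steps, run from the PySpark shell on the primary node (where `spark` is already available). The header and schema-inference options are assumptions about how ‘orders.csv‘ is formatted.

```python
# Read orders.csv from S3 into a DataFrame (CSV options are assumptions).
orders_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://mehul-retaildata01/orders.csv")
)

# Register a temporary view so we can query it with Spark SQL.
orders_df.createOrReplaceTempView("orders")

# Count of each order_status.
spark.sql("""
    SELECT order_status, COUNT(*) AS total_orders
    FROM orders
    GROUP BY order_status
    ORDER BY total_orders DESC
""").show()
```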
Multiple Datasets
- Consider that inside our S3 bucket, we create a folder ‘input‘ and load three files into it - ‘orders.csv‘, ‘order_items.csv‘ and ‘customers.csv‘.
- We also create a folder ‘spark-script‘ inside the bucket and upload a file named ‘emr_aggregated_orders_data.py‘.
- The ‘emr_aggregated_orders_data.py’ file contains the PySpark code sketched after this list.
- The code reads three datasets, aggregates them using joins and writes the aggregated data back in Parquet format.
- In the ‘Add step’ section of the EMR console, we add a Spark application step and provide the S3 path to our Python script ‘emr_aggregated_orders_data.py‘.
- We provide arguments for the script: the three input data paths and one output path.
- The step runs successfully.
- The output folder in S3 bucket gets populated as expected, with Parquet files.
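Since the original script was shown as an image, here is a minimal sketch of what ‘emr_aggregated_orders_data.py‘ could look like. The join keys, column names and exact aggregation are assumptions based on a typical retail dataset; the four command-line arguments match the step arguments described above.

```python
# emr_aggregated_orders_data.py -- sketch of the aggregation script described above.
# Join keys, column names and the aggregation itself are assumptions.
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def read_csv(spark, path):
    """Read a CSV file from S3 with a header row (schema inference assumed)."""
    return spark.read.option("header", "true").option("inferSchema", "true").csv(path)


def main():
    # Step arguments: three input paths followed by one output path.
    orders_path, order_items_path, customers_path, output_path = sys.argv[1:5]

    spark = SparkSession.builder.appName("emr_aggregated_orders_data").getOrCreate()

    orders = read_csv(spark, orders_path)
    order_items = read_csv(spark, order_items_path)
    customers = read_csv(spark, customers_path)

    # Join the three datasets and aggregate order revenue per customer and status.
    aggregated = (
        orders
        .join(order_items, orders["order_id"] == order_items["order_item_order_id"])
        .join(customers, orders["order_customer_id"] == customers["customer_id"])
        .groupBy("customer_id", "customer_fname", "order_status")
        .agg(F.sum("order_item_subtotal").alias("total_amount"))
    )

    # Write the aggregated data back to S3 in Parquet format.
    aggregated.write.mode("overwrite").parquet(output_path)

    spark.stop()


if __name__ == "__main__":
    main()
```

With this layout, the three data-source arguments and the output path provided in the step map directly to sys.argv[1] through sys.argv[4].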
EMR Serverless
- In EMR Serverless, there is no need to specify the number or size of resources upfront, as they get allocated only when needed.
- Until now, we have been using the root user; this time, we create a new IAM user ‘mehul‘.
- We provide the new user with administrator access.
- We log in with the new user’s credentials, then create and launch an EMR Studio to manage EMR Serverless.
- Of the available Spark and Hive workload options, we choose a Spark application and name it ‘demo_spark_application1‘.
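For reference, creating the same Spark application can also be scripted with boto3; the following is a minimal sketch, with an assumed region and release label.

```python
# Minimal boto3 sketch of creating an EMR Serverless Spark application.
# Region and release label are assumptions; pick a release available in your region.
import boto3

serverless = boto3.client("emr-serverless", region_name="us-east-1")  # assumed region

app = serverless.create_application(
    name="demo_spark_application1",
    type="SPARK",
    releaseLabel="emr-6.15.0",  # assumed release label
)
print(app["applicationId"])
```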
Batch Workloads
First, from the application setup options, we set the application up for batch jobs.
Consider a new bucket ‘mehul-emr-serverless-demo1‘, inside which we create three folders: ‘logs’, ‘input_data’ and ‘script’.
- We upload the ‘orders.csv’ file into the ‘input_data‘ folder.
- We upload a PySpark script named ‘application_orders.py‘ into the ‘script‘ folder; a sketch of it follows after this list.
- This code reads the data from the input path, counts the occurrences of each ‘order_status‘ and writes the result in Parquet format to the output S3 path.
- We submit a batch job run by providing a runtime role (an IAM role whose policies grant access to the S3 bucket) and the script location, along with two script arguments: the input path and the output path.
- Once we submit the job, the application gets started and begins provisioning the required resources. We can see that the job ran successfully.
- Inside our bucket, the ‘output_data‘ folder gets created and gets populated with the required Parquet files.
- The ‘logs‘ folder also gets populated with the output logs.
- The ‘stdout.gz‘ object shows the output printed by the driver.
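Since the original ‘application_orders.py‘ was shown as an image, here is a minimal sketch of what it could look like; the column names and CSV options are assumptions.

```python
# application_orders.py -- sketch of the batch script described above.
# Column names and CSV options are assumptions.
import sys

from pyspark.sql import SparkSession


def main():
    # Script arguments: input path and output path.
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("application_orders").getOrCreate()

    orders = spark.read.option("header", "true").csv(input_path)

    # Count the occurrences of each order_status.
    status_counts = orders.groupBy("order_status").count()
    status_counts.show()  # this driver output is what ends up in stdout.gz

    # Write the result to S3 in Parquet format.
    status_counts.write.mode("overwrite").parquet(output_path)

    spark.stop()


if __name__ == "__main__":
    main()
```

The job submission itself can also be sketched with boto3; the application ID and role ARN below are placeholders, and the runtime role must grant read/write access to the bucket.

```python
# Sketch of submitting the batch job run with boto3 (IDs and ARNs are placeholders).
import boto3

serverless = boto3.client("emr-serverless", region_name="us-east-1")  # assumed region

run = serverless.start_job_run(
    applicationId="00f0xxxxxxxxxxxx",  # placeholder application ID
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-runtime-role",  # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://mehul-emr-serverless-demo1/script/application_orders.py",
            "entryPointArguments": [
                "s3://mehul-emr-serverless-demo1/input_data/orders.csv",
                "s3://mehul-emr-serverless-demo1/output_data/",
            ],
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://mehul-emr-serverless-demo1/logs/"
            }
        }
    },
)
print(run["jobRunId"])
```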
Interactive Workloads
- Now, for the second application, we create one enabled for interactive workloads and name it ‘demo_spark_application2‘. Note that interactive applications come with pre-initialized capacity.
- Since an application enabled for interactive workloads can run both batch and interactive jobs, we will submit a batch job first.
- We specify a new output folder ‘output_data_2‘ in the script arguments.
- The batch job inside the interactive application also runs successfully.
- The ‘output_data_2‘ folder gets populated with Parquet files as expected.
- Now we will run an interactive notebook with our interactive application. We create a new Workspace with the interactive workloads option.
- By default, a compute (our interactive application) gets attached to the Workspace for interactive analysis. This means that when we run consecutive jobs, there is no delay in acquiring resources, unlike with batch jobs.
- Now, we can begin using the interactive notebooks, which allow us to collaboratively analyze data and run Spark jobs in a user-friendly interface.
- After reading the DataFrame, we register it as a temporary view and run a Spark SQL query to calculate the count of each ‘order_status‘ in our ‘orders‘ data, as sketched below.
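A minimal notebook sketch of this analysis, assuming the same ‘orders.csv‘ location used in the batch example:

```python
# Read the orders data from S3 (path and CSV options are assumptions).
orders_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://mehul-emr-serverless-demo1/input_data/orders.csv")
)

# Register a temporary view so the data can be queried with Spark SQL.
orders_df.createOrReplaceTempView("orders")

# Count of each order_status.
spark.sql("""
    SELECT order_status, COUNT(*) AS status_count
    FROM orders
    GROUP BY order_status
""").show()
```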
Conclusion
By exploring both EC2-based and serverless EMR setups, we've seen how scalable and cost-efficient data processing can be achieved on AWS. Whether you're building one-off batch pipelines or working interactively with large datasets, EMR provides the control, flexibility, and power needed for modern big data workloads.
Stay tuned!