How do I make my Glue job run faster?

Raju Mandal

When I started using AWS Glue, I was impressed by how quickly I could spin up a serverless data pipeline without worrying about managing infrastructure. But that excitement didn't last long. As my data grew and the workflows became more complex, my Glue jobs started to crawl instead of fly. Processing times stretched from minutes to hours, and costs began creeping up.

Like many, I assumed AWS Glue “just works” out of the box. But under the hood, it’s still Apache Spark, and Spark needs tuning to perform well, especially in cloud environments with large-scale datasets.

If you're working with AWS Glue and wondering how to improve performance (and save some money), this post is for you.

The Pain of Slow Glue Jobs

Let me set the stage first.

I was working on a Glue ETL job that ingested JSON logs from an S3 bucket, transformed them into a structured format, and then wrote the output to Amazon Redshift for downstream analytics. Simple enough, right?

Except, the job was taking over 45 minutes to run on just 3 GB of data.

Here’s what I observed:

  • Stage bottlenecks in Spark UI with long shuffle times

  • High memory usage, leading to stage retries

  • Skewed partitions causing some tasks to take far longer than others

  • Glue job logs filled with generic “Job is still running...” messages

  • Overall cost stacking up because of the long runtime and retries

The worst part? This job was on a daily schedule, meaning I was burning time and money regularly.

At first, I thought: “Maybe I just need to throw more DPUs at it.” But increasing the allocated DPUs (Glue’s measure of compute power) made very little difference—the real issue wasn’t capacity, it was how I was using it.

That’s when I decided to go deeper.

Analyzing Spark UI & Job Metrics

Before you can optimize a Glue job, you need to know where the bottlenecks are. Glue provides a lot of hidden performance clues; you just have to know where to look.

Accessing the Metrics

First, I enabled job metrics in AWS Glue by navigating to:

Glue Console → Jobs → [Your Job] → Monitoring tab

This gives you:

  • Driver & executor CPU/memory usage

  • Number of Spark stages and tasks

  • Error counts and retry rates

  • Execution timeline

Then I dug deeper by opening the Spark UI logs. These reveal what’s happening under the hood:

  • Stage-level DAGs

  • Skewed task execution times

  • Shuffle operations

  • Wide vs narrow transformations
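
For reference, both the CloudWatch job metrics and the Spark UI event logs are switched on through job parameters. Here's a minimal boto3 sketch of how you might apply them; the job name and the event-log S3 path are placeholders, and the same keys can simply be set in the console under the job's parameters instead:

import boto3

glue = boto3.client("glue")

# Fetch the current job definition so we can merge in the monitoring arguments.
job = glue.get_job(JobName="my-log-etl-job")["Job"]

args = dict(job.get("DefaultArguments", {}))
args.update({
    "--enable-metrics": "true",                                # CloudWatch job/executor metrics
    "--enable-spark-ui": "true",                               # write Spark event logs
    "--spark-event-logs-path": "s3://my-bucket/spark-logs/",   # where the Spark UI reads from
})

# Note: update_job resets fields you leave out of JobUpdate, so a production
# script should carry over the full job configuration (worker type, timeout, etc.).
glue.update_job(
    JobName=job["Name"],
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": args,
    },
)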

What I Learned from the Metrics

Here’s what stood out:

  • Shuffle-heavy stages: Several .join() and .groupBy() operations were triggering expensive shuffles.

  • Skewed partitions: Some tasks completed in seconds while others took minutes, a clear sign the data was skewed.

  • Large stage retries: Spark was retrying entire stages due to memory pressure.

  • Input/output imbalance: The data read from S3 wasn’t partitioned correctly, causing some tasks to process huge chunks while others stayed idle.

This analysis gave me a roadmap for what to tackle first: input partitioning, filtering logic, and memory optimization.

Next, I focused on fixing the S3 partitioning strategy, which turned out to be a game-changer.

Partitioning the Input Data Smartly

One of the biggest performance killers in AWS Glue (and Spark in general) is reading unpartitioned or poorly partitioned data. That’s exactly what was happening in my case.

My input data was being dumped into S3 in a flat structure like this:

s3://my-bucket/logs/2025-04-01.json
s3://my-bucket/logs/2025-04-02.json
...

This setup offered zero partitioning benefits to Spark. Every time the Glue job ran, it had to scan all the files, regardless of the date range needed. No wonder it was choking.

What did I change?

I restructured the S3 layout to a partitioned format based on the log date:

s3://my-bucket/logs/date=2025-04-01/log.json
s3://my-bucket/logs/date=2025-04-02/log.json
...

And then, in my Glue script, I made sure to push filters down to the catalog:

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="log_db",
    table_name="logs_table",
    push_down_predicate="date >= '2025-04-01' and date <= '2025-04-07'"
)

This simple change brought two big benefits:

  1. Spark only reads the relevant partitions, drastically reducing I/O.

  2. Execution time dropped by almost 60% just from this optimization.

Partitioning Tips That Helped Me

  • Partition on fields you actually filter by in queries (e.g., date, region, event_type), and avoid very high-cardinality keys that explode the number of partitions.

  • Don’t over-partition. Too many small files can slow things down, too.

  • Use consistent formatting (e.g., date=YYYY-MM-DD) for partition keys.
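
If you're producing the data with a Glue job in the first place, the partitioned layout can also be created at write time instead of restructuring files afterwards. A minimal sketch, assuming a DynamicFrame named transformed that already contains a date column (the bucket path is a placeholder):

# Writing with partitionKeys produces s3://my-bucket/logs/date=YYYY-MM-DD/... prefixes
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/logs/",
        "partitionKeys": ["date"],
    },
    format="parquet",
)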

With this fix in place, I moved on to the next logical bottleneck: filtering data early in the pipeline, not after loading.

Filtering Early, Not Late

One of the easiest ways to kill performance in Spark (and Glue) is to load everything and then filter. That’s exactly what I was doing:

# What I used to do
df = glueContext.create_dynamic_frame.from_catalog(
    database="log_db",
    table_name="logs_table"
).toDF()

filtered_df = df.filter(df["event_type"] == "ERROR")

Looks fine, right? But it’s a performance trap. This approach loads all records into memory before applying any filters. For large datasets, that’s a nightmare.

What did I change?

Instead of filtering after loading, I applied the filter as early as possible using predicate pushdown:

# Much better
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="log_db",
    table_name="logs_table",
    push_down_predicate="event_type == 'ERROR'"
)

Under the hood, this tells Glue (and ultimately Spark) to apply the filter while reading, not after. (Note that push_down_predicate prunes at the partition level, so the column you filter on needs to be a partition key in the Data Catalog.) That means:

  • Fewer records are loaded into memory

  • Smaller shuffle stages

  • Lower network I/O

  • Less garbage collection
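
For columns that aren't partition keys, the next best thing is to filter immediately after the read, before any joins or aggregations; with columnar formats like Parquet, Spark can also push simple comparisons down to the file scan. A rough sketch (the status_code column here is hypothetical):

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="log_db",
    table_name="logs_table",
    push_down_predicate="date >= '2025-04-01'"  # prunes partitions at the source
)

# Filter non-partition columns right away, before any wide transformations.
df = dynamic_frame.toDF().filter("status_code >= 500")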

Benchmarks from My Job

Stage                    Before Filter Pushdown    After Filter Pushdown
Data Read Time           ~5 min                    ~1.2 min
Shuffle Size             3.1 GB                    400 MB
Total Execution Time     25 min                    9 min

That’s a 2.7x improvement from just moving one line of logic upstream. Wild.

By now, I was seeing tangible improvements, but I knew the engine needed tuning too. So next, I went into job-level parameter tuning.

Tuning Glue Job Parameters

Glue is built on Apache Spark, and Spark has dozens of tunable knobs. Out of the box, AWS Glue picks some defaults for you, but if your job grows in size or complexity, those defaults may choke.

That’s what was happening to me. Even after improving partitioning and filtering, my job still had:

  • Stage retries

  • Long task durations

  • Unused executors idling around

Here are the key parameters I tweaked, along with their impact:

--conf spark.sql.shuffle.partitions

Default: 200
New Value: 64

Lowering this reduced the number of small shuffle tasks. Since my dataset wasn’t huge, fewer partitions made task coordination faster.

--conf spark.executor.memory

Default (Glue 2.0): Managed by AWS
New Value (Glue 3.0/4.0): 6g

Explicitly bumping executor memory helped eliminate out-of-memory errors in complex transformations.

--conf spark.sql.adaptive.enabled=true

Enabled adaptive query execution (AQE), which allowed Spark to optimize joins and shuffles at runtime. This was HUGE for handling skewed data.

Glue Worker Type

Switching from Standard to G.1X (for memory-heavy workloads) gave me better price-performance for this job.

Job Timeout

Previously set too high (120 mins). I lowered it to 30 mins so runaway jobs fail fast (and retry sooner) instead of quietly burning DPU hours.

Other Settings Worth Trying

  • --enable-glue-datacatalog : Lets Spark SQL use the Glue Data Catalog as its Hive metastore, so table metadata comes straight from the catalog.

  • --conf spark.serializer=org.apache.spark.serializer.KryoSerializer : Faster serialization.

  • --conf spark.sql.broadcastTimeout=600 : Increased broadcast timeout to avoid join failures.
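
For reference, here's roughly how those settings translate into job parameters. Spark properties all go through a single --conf default argument, with extra properties chained inside the same value (a commonly used pattern, since a job only accepts one --conf key); treat the values below as a sketch and tune them to your own workload:

default_args = {
    # Spark properties, chained inside one --conf value
    "--conf": (
        "spark.sql.shuffle.partitions=64"
        " --conf spark.sql.adaptive.enabled=true"
        " --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
        " --conf spark.sql.broadcastTimeout=600"
    ),
    # Use the Data Catalog as the Spark SQL metastore
    "--enable-glue-datacatalog": "true",
}

These can be pasted into the job's parameters in the console, or applied with update_job the same way as the monitoring arguments earlier.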

After tuning these, my job’s Spark UI looked dramatically cleaner. No retries. Shuffle times cut in half. Fewer stages. Less executor idle time.

With the engine running smoother, my next move was to clean up the ETL logic itself, reducing unnecessary transformations that were silently eating performance.

Cleaning Up Unnecessary Transformations

Spark is lazy by design, which is great for optimization, but it can also hide inefficiencies. When I reviewed my ETL script with fresh eyes, I realized: I was over-transforming everything.

Here’s a simplified version of what I had:

df = df.withColumn("clean_col", clean_udf(df["raw_col"]))
df = df.withColumn("ts", to_timestamp(df["event_time"]))
df = df.drop("raw_col")
df = df.cache()  # unnecessary
df = df.filter(df["region"].isNotNull())

Looks normal, right? But there were problems:

  • I was chaining too many .withColumn() calls; each one adds another projection to the query plan, and the overhead compounds.

  • I had redundant .drop() calls scattered mid-pipeline that only added more projections.

  • I used .cache() thinking it would speed things up, but caching a large DataFrame you never reuse only adds memory pressure.

  • My UDF (clean_udf) was slow, unoptimized Python logic that Spark couldn't inspect or optimize, and it paid a serialization cost on every row.

What did I change?

  1. Combined transformations:
    Reduced column rewrites by combining logic inside a single .select() block:

     from pyspark.sql import functions as F

     df = df.select(
         F.to_timestamp("event_time").alias("ts"),
         F.when(F.col("raw_col").isNotNull(), F.col("raw_col")).alias("clean_col"),
         "other_column"
     )
    
  2. Removed caching:
    Unless you’re reusing a DataFrame multiple times, caching can do more harm than good.

  3. Optimized UDFs:
    Wherever possible, I replaced Python UDFs with Spark-native functions. They're faster and more scalable (see the sketch right after this list).

  4. Dropped only at write time:
    Instead of dropping columns mid-pipeline, I deferred drops to just before write:

     df.drop("debug_cols", "temp_cols").write.format("parquet").save(output_path)
    

What was the impact, you ask?

These changes led to:

  • Lower memory usage

  • Shorter stage execution times

  • Fewer shuffle operations

  • Cleaner DAG in Spark UI

This step gave my job the “spark” it needed. But I still noticed some friction in how Glue’s DynamicFrames were behaving, so the next tweak was all about switching from DynamicFrames to DataFrames in the right places.
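
That switch deserves its own write-up, but the mechanics are simple: a DynamicFrame converts to a regular Spark DataFrame and back, so the heavy relational work can happen in the DataFrame API while Glue-specific sources, sinks, and transforms keep their DynamicFrames. A minimal sketch (same catalog table as before):

from awsglue.dynamicframe import DynamicFrame

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="log_db",
    table_name="logs_table"
)

# Do the heavy lifting with the DataFrame API...
df = dyf.toDF().filter("region is not null")

# ...and convert back only where a Glue-specific writer or transform needs it.
result = DynamicFrame.fromDF(df, glueContext, "result")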

Conclusion

Optimizing AWS Glue jobs isn’t just about throwing more compute at the problem. The real wins come from:

  • Understanding your data: Partition it effectively and apply filters early.

  • Tuning Spark parameters: Don't just accept the defaults; fine-tune memory, shuffle, and parallelism settings.

  • Optimizing transformations: Use the right data structure (DynamicFrames vs DataFrames) and eliminate unnecessary steps.

  • Leveraging Glue features: Job bookmarks and parallelism are powerful tools for incremental data processing and scalability.
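
Job bookmarks didn't come up earlier, so as a quick pointer: they're switched on with the --job-bookmark-option job parameter and keyed off the transformation_ctx you pass to each source, roughly like this:

# Job parameter: --job-bookmark-option = job-bookmark-enable
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="log_db",
    table_name="logs_table",
    transformation_ctx="datasource"  # Glue tracks what this source has already processed
)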

By using these strategies, I was able to turn a frustrating, slow job into a fast, cost-efficient pipeline. Whether you’re processing large datasets or just trying to improve reliability, the principles here apply to most AWS Glue jobs.

And that’s how I tuned my AWS Glue jobs to run 10x faster! If you have any questions or other performance tips, feel free to reach out in the comments.
