PySpark RDD Cheat Sheet

Soyoola Sodunke

This cheat sheet provides a quick reference to the most commonly used PySpark RDD operations. PySpark RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark, providing fault-tolerant, distributed data processing capabilities.

1. Creating RDDs

  • From a local collection:

    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
  • From a text file:

    rdd = sc.textFile("file_path.txt")
  • From a directory of text files (one (path, content) pair per file):

    rdd = sc.wholeTextFiles("directory_path")

2. Basic Transformations

  • map(func): Apply a function to each element.

    rdd.map(lambda x: x * 2)
  • flatMap(func): Apply a function to each element and flatten the result.

    rdd.flatMap(lambda x: x.split(" "))
  • filter(func): Filter elements based on a condition.

    rdd.filter(lambda x: x > 2)
  • distinct(): Return distinct elements.

  • sample(withReplacement, fraction, seed): Sample a fraction of the data.

    rdd.sample(False, 0.5, 42)

3. Key-Value Pair Transformations

  • mapValues(func): Apply a function to the value of each key-value pair.

    rdd.mapValues(lambda x: x * 2)
  • flatMapValues(func): Apply a function to the value of each key-value pair and flatten the result.

    rdd.flatMapValues(lambda x: x.split(" "))
  • reduceByKey(func): Aggregate values for each key.

    rdd.reduceByKey(lambda x, y: x + y)
  • groupByKey(): Group values for each key.

  • sortByKey(ascending=True): Sort RDD by key.

  • keys(): Extract keys from key-value pairs.

  • values(): Extract values from key-value pairs.


4. Actions

  • collect(): Return all elements of the RDD as a list.

  • count(): Return the number of elements in the RDD.

  • first(): Return the first element of the RDD.

  • take(n): Return the first n elements of the RDD.

  • takeSample(withReplacement, num, seed): Return a sample of num elements.

    rdd.takeSample(False, 5, 42)
  • reduce(func): Aggregate elements using a function.

    rdd.reduce(lambda x, y: x + y)
  • foreach(func): Apply a function to each element on the executors (no return value).

    rdd.foreach(lambda x: print(x))
  • saveAsTextFile(path): Save the RDD as a directory of text part files at path.


5. Set Operations

  • union(other): Return the union of two RDDs.

  • intersection(other): Return the intersection of two RDDs.

  • subtract(other): Return elements in the first RDD but not in the second.

  • cartesian(other): Return the Cartesian product of two RDDs.


6. Advanced Transformations

  • coalesce(numPartitions): Decrease the number of partitions, avoiding a full shuffle by default.

  • repartition(numPartitions): Increase or decrease the number of partitions; always performs a full shuffle.

  • zip(other): Zip two RDDs together; both must have the same number of partitions and the same number of elements in each partition.

  • zipWithIndex(): Zip RDD elements with their index.

  • zipWithUniqueId(): Zip RDD elements with a unique ID.


7. Persistence (Caching)

  • persist(storageLevel): Persist the RDD in memory, on disk, or both, depending on the storage level.

  • unpersist(): Remove the RDD from persistence.


8. Debugging and Inspection

  • getNumPartitions(): Get the number of partitions.

  • glom(): Return an RDD of partitions as lists.

  • id(): Get the RDD's unique ID.

9. Joins

  • join(other): Inner join two RDDs.

  • leftOuterJoin(other): Left outer join two RDDs.

  • rightOuterJoin(other): Right outer join two RDDs.

  • fullOuterJoin(other): Full outer join two RDDs.


10. Broadcast and Accumulator Variables

  • Broadcast Variables:

    broadcast_var = sc.broadcast([1, 2, 3])
    rdd.map(lambda x: x + broadcast_var.value[0])
  • Accumulator Variables:

    accum = sc.accumulator(0)
    rdd.foreach(lambda x: accum.add(1))

This cheat sheet covers the most essential PySpark RDD operations. For more advanced use cases, refer to the official PySpark documentation.
