Lesson 1: Spark

Here's a structured tutorial for Lesson 1: Spark, built around four code examples with explanations.
Example 1: Creating a SparkSession
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate()
# Print Spark version
print(spark.version)
Explanation:
Importing SparkSession:
from pyspark.sql import SparkSession
imports the SparkSession class, required for working with Spark's DataFrame API in PySpark.
Initializing SparkSession:
spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate()
creates a Spark session, which is the entry point for Spark applications.
Naming the Spark App:
appName("Lesson1Tutorial")
sets the name of the Spark application, making it easier to monitor in the Spark UI.
Printing Spark Version:
print(spark.version)
retrieves and prints the current version of Spark installed in the environment.
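A note on getOrCreate(): it returns the existing SparkSession if one is already active, so it is safe to call more than once. For local experimentation, the builder can also take extra settings. Here is a minimal sketch, assuming a local (non-cluster) setup; the master("local[*]") and config() values are illustrative assumptions, not part of the lesson's example:
from pyspark.sql import SparkSession
# Assumption: running locally; local[*] uses all available CPU cores
spark = (
    SparkSession.builder
    .appName("Lesson1Tutorial")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")  # fewer shuffle partitions suit small local datasets
    .getOrCreate()
)
print(spark.version)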
Example 2: Creating a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Explanation:
Defining Data:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
represents a list of tuples, where each tuple contains a name and an age.
Column Names:
columns = ["Name", "Age"]
defines column names for the DataFrame, making it structured like a table.
Creating DataFrame:
df = spark.createDataFrame(data, columns)
converts the list of tuples into a Spark DataFrame, which is optimized for distributed processing.
Displaying DataFrame:
df.show()
prints the contents of the DataFrame in tabular form, showing the "Name" and "Age" columns.
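By default, createDataFrame() infers column types from the data (here, string and long). When exact types matter, a schema can be passed explicitly instead of a list of column names. A minimal sketch using the standard pyspark.sql.types classes, assuming the same data list as above:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Explicit schema: names plus types; the final True marks the column as nullable
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()  # verify Age is an integer rather than an inferred long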
Example 3: Filtering Data in a DataFrame
filtered_df = df.filter(df.Age > 28)
filtered_df.show()
Explanation:
Applying Filter:
filtered_df = df.filter(df.Age > 28)
selects only rows where the "Age" column value is greater than 28.
Condition Evaluation:
df.Age > 28
acts as a boolean filter, returning only rows meeting the condition.
Creating a New DataFrame:
filtered_df = df.filter(...)
results in a new DataFrame containing only the filtered results, without modifying the original DataFrame.
Displaying Filtered Data:
filtered_df.show()
prints the new DataFrame, showing only people older than 28.
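The df.Age attribute style works only when the column name is a valid Python identifier; the col() function is the more general spelling, and individual conditions can be combined with & (and) and | (or). A short sketch; the second condition is illustrative, not part of the lesson:
from pyspark.sql.functions import col
# Each condition needs its own parentheses because & binds tighter than comparison operators
adults_not_bob = df.filter((col("Age") > 28) & (col("Name") != "Bob"))
adults_not_bob.show()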
Example 4: Aggregating Data with GroupBy
from pyspark.sql.functions import avg
df_grouped = df.groupBy("Age").count()
df_grouped.show()
Explanation:
Importing Functions:
from pyspark.sql.functions import avg
imports aggregation functions such as avg(); avg() is not used in this snippet, but a sketch applying it follows below.
Grouping Data:
df.groupBy("Age")
groups rows with the same "Age" together, preparing for aggregation.
Applying Count Aggregation:
df_grouped = df.groupBy("Age").count()
calculates the number of times each age appears in the DataFrame.
Displaying Grouped Data:
df_grouped.show()
prints the result, showing the count of each unique age value in the dataset.
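Since avg() was imported but not used above, here is how it would be applied. A minimal sketch: agg() accepts aggregation functions such as avg(), and alias() names the result column (the "avg_age" name is just an illustration):
from pyspark.sql.functions import avg
# Average over the whole DataFrame: one row containing the mean age (30.0 for this data)
df.agg(avg("Age").alias("avg_age")).show()
# The same function per group; with distinct names each group has one row
df.groupBy("Name").agg(avg("Age").alias("avg_age")).show()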
This tutorial gives a solid foundation for working with Spark and PySpark, covering session creation, DataFrame operations, filtering, and aggregation.