PySpark - 1

What are Spark and PySpark?

Spark is an open-source, distributed computing framework designed for fast and general-purpose cluster computing.

  • Fast: Leverages in-memory caching to significantly speed up computations compared to traditional MapReduce.

  • Versatile: Supports a wide range of processing tasks, including data analysis, machine learning, stream processing, and graph processing.

  • Distributed: Can process massive datasets across a cluster of machines, enabling parallel processing for faster results.

  • Easy to use: Provides high-level APIs in several languages (Scala, Java, Python, R) for easy development.

PySpark is the Python API for Apache Spark. It provides a user-friendly interface for interacting with Spark from the Python programming language.

What are SparkSession and SparkConf?

SparkSession is the central entry point for interacting with Spark functionality and the primary way to create and manipulate Spark DataFrames and Datasets.

Key Responsibilities:

  • Manages the Spark context and its configuration.

  • Provides methods to create DataFrames and Datasets.

  • Enables SQL queries over DataFrames registered as tables.

  • Handles caching and persistence of data.

SparkConf is an object used to set various configuration properties for a Spark application. These properties control how Spark executes tasks, allocates resources, and interacts with the underlying cluster.

  • master: Specifies the cluster manager / master URL to connect to (e.g., "local", "yarn", "spark://master:7077").

  • appName: Sets the name of the Spark application.

  • spark.executor.cores: Configures the number of cores per executor.

  • spark.executor.memory: Sets the memory allocated to each executor.

Note: SparkSession internally uses a SparkConf object to manage the configuration settings for the Spark application.

from pyspark.sql import SparkSession

# Create a SparkSession (or return the existing one for this application)
spark = SparkSession.builder \
    .appName("YourAppName") \
    .getOrCreate()
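
The configuration properties listed above are usually set through a SparkConf (or, equivalently, through .config() calls on the builder). Below is a minimal sketch of that pattern; the application name, master URL, and resource values are placeholders to adapt for your own environment.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build a SparkConf with the properties discussed above
# (app name, master URL, and resource values are placeholders).
conf = (
    SparkConf()
    .setAppName("YourAppName")
    .setMaster("local[*]")               # run locally on all available cores
    .set("spark.executor.cores", "2")    # cores per executor (relevant in cluster mode)
    .set("spark.executor.memory", "2g")  # memory per executor
)

# SparkSession picks up the settings from the SparkConf
spark = SparkSession.builder.config(conf=conf).getOrCreate()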

Spark DataFrames and RDDs

| Feature | RDD | SQL DataFrame |
| --- | --- | --- |
| Data Representation | Unstructured | Structured with schema |
| Immutability | Immutable | Immutable |
| Fault Tolerance | Resilient through lineage | Resilient through lineage and persistence |
| Parallelism | Inherent parallelism | Inherent parallelism |
| API | Lower-level API | Higher-level API with SQL support |
| Performance | Can be slower for complex operations | Generally faster due to optimizations |
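
To make the API difference concrete, here is a small sketch (using the spark session created earlier and made-up sample records) that expresses the same filter as an RDD operation and as a DataFrame operation.

# The same records handled as an RDD (no schema) and as a DataFrame (named columns)
rows = [("Alice", 34), ("Bob", 45)]

# RDD: lower-level API, positional access to fields
rdd = spark.sparkContext.parallelize(rows)
over_40_rdd = rdd.filter(lambda r: r[1] > 40).collect()

# DataFrame: higher-level API with a schema and SQL-style expressions
df = spark.createDataFrame(rows, ["name", "age"])
over_40_df = df.filter(df.age > 40).collect()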

Lineage and persistence are complementary mechanisms behind Spark's fault tolerance. Lineage records how each dataset was derived, so lost partitions can be recomputed automatically, while persistence caches data in memory or on disk, speeding up reuse and reducing how much must be recomputed after a failure. Used together, they make Spark applications both resilient to node failures and efficient on large datasets.
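
As a quick illustration of persistence (the CSV path is the placeholder used later in this post, and the grouping column is hypothetical):

from pyspark import StorageLevel

# Persist a DataFrame that will be reused; if a node fails, lost partitions
# are recomputed from lineage and re-cached as needed.
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.persist(StorageLevel.MEMORY_AND_DISK)  # or df.cache() for the default level

df.count()                                 # first action materializes the cache
df.groupBy("some_column").count().show()   # reuses the cached partitions

df.unpersist()                             # release the cache when done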

Basic Functions

| Task | PySpark Code | Explanation |
| --- | --- | --- |
| List All Tables | spark.catalog.listTables() | Returns a list of all tables registered in the Spark session. |
| Pandas DataFrame to Spark DataFrame | spark_df = spark.createDataFrame(pandas_df) | Creates a Spark DataFrame from an existing Pandas DataFrame. |
| Spark DataFrame to Pandas DataFrame | pandas_df = spark_df.toPandas() | Converts a Spark DataFrame to a Pandas DataFrame. Note: this can be expensive for large datasets because all the data is transferred to the driver. |
| Create Temp View | spark_df.createOrReplaceTempView("table_name") | Creates or replaces a temporary view with the given name, allowing SQL queries to be executed on the Spark DataFrame. |
| Read Data (e.g., from CSV) | spark_df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True) | Reads data from a CSV file into a Spark DataFrame. header=True: the first row contains column names. inferSchema=True: automatically infers the data type of each column. |
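
Putting a few of these together, here is a short sketch (assuming pandas is installed, and using made-up sample data and a hypothetical view name) that converts a Pandas DataFrame to Spark, queries it with SQL, and brings a small result back to the driver.

import pandas as pd

# Pandas -> Spark
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 45]})
spark_df = spark.createDataFrame(pandas_df)

# Register a temporary view and query it with SQL
spark_df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

# Inspect what is registered in the session
print(spark.catalog.listTables())

# Spark -> Pandas (only for small results; everything is collected to the driver)
result_pd = spark.sql("SELECT * FROM people").toPandas()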

Stay tuned for the next in the series….

Written by

Manas Chandan Behera