Comparing Python Consumers and Spark Structured Streaming: Reading from Kinesis and Kafka

Ajay Veerabomma
6 min read

In the world of real-time data processing, Amazon Kinesis and Apache Kafka are popular tools for handling large streams of data. Processing this data can be done using various frameworks and libraries, among which Python consumers and Apache Spark Structured Streaming stand out. This blog explores how these approaches differ when reading from Kinesis and Kafka, highlighting their capabilities, mechanisms, and use cases.

Python Consumers: Reading from Kinesis and Kafka

Python consumers offer flexibility and direct control over how data is ingested and processed. Here’s how they interact with Kinesis and Kafka:

Reading from Kinesis

Direct Shard Access:

Individual Shard Readers: Python consumers can read from Kinesis shards independently by managing shard iterators. Each consumer reads from a specific shard, which is straightforward to implement but requires manual coordination across consumers.

Python Consumers Reading from Kinesis

+-----------------------+
|   Kinesis Stream      |
+-----------------------+
        |   |   |
       S1  S2  S3  ... Sn
        |   |   |
       C1  C2  C3  ... Cn

Coordination: When scaling to multiple consumers, coordination is required so that each shard is read by exactly one consumer at a time. This is typically managed through a centralized database or a distributed key-value store, such as the DynamoDB lease table used by the Kinesis Client Library (a sketch follows the reader example below).

Fault Tolerance: Python consumers must track the last processed sequence number themselves and handle retries, which complicates fault tolerance.

import time

import boto3

def read_from_shard(shard_id, stream_name):
    kinesis = boto3.client('kinesis')

    # Start at the tip of the shard; use 'TRIM_HORIZON' to read from the oldest record instead.
    shard_iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType='LATEST'
    )['ShardIterator']

    while shard_iterator:
        response = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
        for record in response['Records']:
            process(record)

        # NextShardIterator is absent once the shard is closed (e.g. after a reshard).
        shard_iterator = response.get('NextShardIterator')

        # Pause between calls to stay within the per-shard read throughput limits.
        time.sleep(1)

def process(record):
    print(record['Data'])

# Example usage
read_from_shard('shardId-000000000000', 'your-stream-name')
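
To coordinate which worker owns each shard and to checkpoint progress, a small lease table can be used; the Kinesis Client Library implements this same pattern on top of DynamoDB. The sketch below is a minimal illustration assuming a hypothetical DynamoDB table named shard-leases with shard_id as its partition key; production code would also handle lease expiry and renewal.

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')
LEASE_TABLE = 'shard-leases'  # hypothetical table keyed by shard_id

def try_claim_shard(shard_id, worker_id):
    # Conditional write: succeeds only if no other worker has claimed this shard.
    try:
        dynamodb.put_item(
            TableName=LEASE_TABLE,
            Item={'shard_id': {'S': shard_id}, 'owner': {'S': worker_id}},
            ConditionExpression='attribute_not_exists(shard_id)'
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False
        raise

def save_checkpoint(shard_id, sequence_number):
    # Record the last processed sequence number so a restarted worker can resume.
    dynamodb.update_item(
        TableName=LEASE_TABLE,
        Key={'shard_id': {'S': shard_id}},
        UpdateExpression='SET last_sequence = :seq',
        ExpressionAttributeValues={':seq': {'S': sequence_number}}
    )

def load_checkpoint(shard_id):
    # Return the saved sequence number, or None if this shard has no checkpoint yet.
    item = dynamodb.get_item(
        TableName=LEASE_TABLE,
        Key={'shard_id': {'S': shard_id}}
    ).get('Item')
    return item['last_sequence']['S'] if item and 'last_sequence' in item else None

A worker that claims a shard would call load_checkpoint, request its iterator with ShardIteratorType='AFTER_SEQUENCE_NUMBER' and the saved value, and call save_checkpoint periodically inside the read loop.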

Reading from Kafka

Partition-Specific Consumers:

Consumer Groups: Python consumers use Kafka's built-in consumer groups to read from partitions. Each consumer in a group reads from one or more partitions, which Kafka assigns dynamically to balance the load.

Python Consumers Reading from Kafka

+-----------------------+
|    Kafka Topic        |
+-----------------------+
       |   |   |   |
      P1  P2  P3  ... Pn
       |   |   |   |
      C1  C2  C3  ... Cm

Automatic Load Balancing: Kafka handles the partition assignment and rebalancing when consumers join or leave the group, making it easier to scale and manage.

Offset Management: consumers commit their offsets back to Kafka (automatically by default), so a restarted consumer resumes from the last committed offset, which simplifies fault tolerance (a manual-commit variant follows the example below).

from kafka import KafkaConsumer

# Join the consumer group 'your-group'; Kafka assigns partitions to this consumer.
consumer = KafkaConsumer(
    'your-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='your-group',
    auto_offset_reset='earliest'  # start from the oldest record if no committed offset exists
)

for message in consumer:
    print(message.value)
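
For at-least-once processing, auto-commit can be disabled so that offsets are committed only after a record has been handled. A minimal sketch with the same kafka-python library:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'your-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='your-group',
    enable_auto_commit=False,     # commit explicitly instead of on a background timer
    auto_offset_reset='earliest'
)

for message in consumer:
    print(message.value)          # process the record first...
    consumer.commit()             # ...then synchronously commit the consumed offsets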

In the setups above, the maximum processing parallelism is capped at the number of shards or partitions, which constrains scalability. To overcome this limitation, we can decouple compute from storage, similar to how Massively Parallel Processing (MPP) architectures operate: MPP systems separate storage from compute to achieve high scalability and flexibility. Spark Structured Streaming adopts this approach, effectively decoupling storage from compute to enhance parallelism and processing capabilities. Let's explore how Spark accomplishes this with Kinesis and Kafka.

Spark Structured Streaming: Reading from Kinesis and Kafka

Apache Spark Structured Streaming abstracts the complexities of stream processing, offering a unified approach for handling both Kinesis and Kafka data streams. Here’s how it works:

Reading from Kinesis

Unified DataFrame Approach:

Stream Abstraction: Spark reads data from all Kinesis shards and combines it into a single streaming DataFrame. This abstracts away shard-specific details and provides a consistent interface for processing the entire stream.

Spark Structured Streaming Reading from Kinesis

+-----------------------+
|   Kinesis Stream      |
+-----------------------+
        |   |   |
       S1  S2  S3  ... Sn
        \   |   /
        +------+
        |Spark |
        +------+
         / | \
        E1 E2 E3 ... Em

Parallel Processing: Spark partitions the unified DataFrame internally and distributes those partitions across the available executors, so downstream processing is not tied to the shard count (a repartitioning sketch follows the example below).

Fault Tolerance: Spark handles checkpointing and state management, allowing it to resume processing from the last checkpointed state in case of failures.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KinesisExample") \
    .getOrCreate()

# Note: the "kinesis" source is not bundled with open-source Spark; it comes from a
# connector (for example, the Databricks Kinesis source or the spark-sql-kinesis
# package), and option names can vary slightly between connectors.
kinesisDF = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", "your-stream-name") \
    .option("region", "us-east-1") \
    .option("initialPosition", "LATEST") \
    .load()

processedDF = kinesisDF.selectExpr("CAST(data AS STRING)")

query = processedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
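
Because the stream is exposed as a regular DataFrame, processing parallelism is not limited to the number of shards. The snippet below is a sketch that reuses kinesisDF from the example above; the partition count of 64 is an arbitrary illustration.

# Spread downstream processing across more tasks than there are shards.
repartitionedDF = kinesisDF.repartition(64)

# Aggregations and joins will also use this many shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "64")

processedDF = repartitionedDF.selectExpr("CAST(data AS STRING)")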

Reading from Kafka

Consumer Group Integration:

Partition-Aware Reading: Spark reads from Kafka partitions through its built-in Kafka source. The driver assigns topic partitions to executors and tracks offsets itself (under an internally generated consumer group id), rather than relying on Kafka's consumer-group rebalancing; each executor reads from one or more partitions.

Spark Structured Streaming Reading from Kafka

+-----------------------+
|    Kafka Topic        |
+-----------------------+
       |   |   |   |
      P1  P2  P3  ... Pn
       \   |   /
        +------+
        |Spark |
        +------+
         / | \
        E1 E2 E3 ... Em

Offset Management: Spark tracks offsets for each Kafka partition and stores them in the checkpoint directory. This allows Spark to resume from the last processed offset for each partition independently.

Scalability: the number of Kafka partitions bounds how many tasks read in parallel, but Spark can repartition the resulting DataFrame, or split partition ranges with the minPartitions source option, so that processing scales beyond the initial number of Kafka partitions (see the options sketch after the example below).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KafkaExample") \
    .getOrCreate()

kafkaDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "your-topic") \
    .load()

processedDF = kafkaDF.selectExpr("CAST(value AS STRING)")

query = processedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
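
The Kafka source also exposes options that control where reading starts and how finely the work is split across tasks. The values below are illustrative only, not recommendations.

kafkaDF = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "your-topic")
    # Where to begin when there is no checkpoint yet (the default is "latest").
    .option("startingOffsets", "earliest")
    # Cap the number of records pulled into each micro-batch.
    .option("maxOffsetsPerTrigger", "10000")
    # Ask Spark to split Kafka partitions into at least this many read tasks.
    .option("minPartitions", "64")
    .load()
)

minPartitions is what lets read parallelism exceed the Kafka partition count, complementing the repartitioning approach shown for Kinesis.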

Comparing Python Consumers and Spark Structured Streaming

| Aspect | Python Consumers (Kinesis) | Python Consumers (Kafka) | Spark Structured Streaming (Kinesis) | Spark Structured Streaming (Kafka) |
| --- | --- | --- | --- | --- |
| Data Ingestion | Direct shard readers | Kafka consumer groups | Unified DataFrame from all shards | Built-in Kafka source |
| Parallelism | Independent consumer per shard | Consumers read partitions in parallel | Unified DataFrame partitioned across executors | Executors read partitions; unified DataFrame |
| Coordination | Manual coordination required | Kafka manages partition assignment | Abstracted; no manual coordination needed | Abstracted; Spark assigns partitions to executors |
| Fault Tolerance | Must be implemented manually | Kafka-committed offsets and rebalancing | Handled by Spark with checkpointing | Handled by Spark with per-partition offsets in the checkpoint |
| Scalability | Limited by number of shards | Limited by number of partitions | Can repartition beyond the shard count | Scales with partitions, executors, and repartitioning |
| API Complexity | Detailed shard management required | Simple consumer-group API | Simplified by the unified DataFrame abstraction | Simplified by the unified DataFrame abstraction |
| Latency | Lower; direct shard reads | Lower; direct partition reads | Higher; micro-batch processing | Higher; micro-batch processing |

Conclusion

Choosing between Python consumers and Spark Structured Streaming depends on your specific requirements for real-time data processing:

  • Python Consumers: Offer flexibility and control, ideal for simple use cases or environments where you need direct interaction with shards or partitions. They require more effort for fault tolerance and coordination.

  • Spark Structured Streaming: Provides a high-level abstraction that simplifies stream processing and scales efficiently. It handles fault tolerance and parallelism internally, making it suitable for complex data processing and large-scale applications.

Both approaches have their advantages and are well-suited to different scenarios. By understanding their strengths and limitations, you can make an informed choice for your streaming data processing needs.
