Comparing Python Consumers and Spark Structured Streaming: Reading from Kinesis and Kafka

Ajay Veerabomma
6 min read

In the world of real-time data processing, Amazon Kinesis and Apache Kafka are popular tools for handling large streams of data. Processing this data can be done using various frameworks and libraries, among which Python consumers and Apache Spark Structured Streaming stand out. This blog explores how these approaches differ when reading from Kinesis and Kafka, highlighting their capabilities, mechanisms, and use cases.

Python Consumers: Reading from Kinesis and Kafka

Python consumers offer flexibility and direct control over how data is ingested and processed. Here’s how they interact with Kinesis and Kafka:

Reading from Kinesis

Direct Shard Access:

Individual Shard Readers: Python consumers can read from Kinesis shards independently by managing shard iterators. Each consumer reads from a specific shard, which is straightforward to implement but requires manual coordination across consumers.

Python Consumers Reading from Kinesis

+-----------------------+
|   Kinesis Stream      |
+-----------------------+
        |   |   |
       S1  S2  S3  ... Sn
        |   |   |
       C1  C2  C3  ... Cn

Coordination: When scaling to multiple consumers, coordination is required so that each shard is read by exactly one consumer at a time. This is typically managed through a centralized database or a distributed key-value store, such as the DynamoDB lease table used by the Kinesis Client Library (a sketch follows the reader example below).

Fault Tolerance: Python consumers must track the last processed sequence number themselves and handle retries, which complicates fault tolerance.

import time

import boto3

def read_from_shard(shard_id, stream_name):
    kinesis = boto3.client('kinesis')

    # Start at the tip of the shard; use 'TRIM_HORIZON' to read from the oldest record instead.
    shard_iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType='LATEST'
    )['ShardIterator']

    while shard_iterator:
        response = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
        for record in response['Records']:
            process(record)

        # NextShardIterator is absent once the shard is closed (e.g. after a reshard).
        shard_iterator = response.get('NextShardIterator')

        # Pause between calls to stay within the per-shard read throughput limits.
        time.sleep(1)

def process(record):
    print(record['Data'])

# Example usage
read_from_shard('shardId-000000000000', 'your-stream-name')
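
To coordinate which worker owns each shard and to checkpoint progress, a small lease table can be used; the Kinesis Client Library implements this same pattern on top of DynamoDB. The sketch below is a minimal illustration assuming a hypothetical DynamoDB table named shard-leases with shard_id as its partition key; production code would also handle lease expiry and renewal.

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')
LEASE_TABLE = 'shard-leases'  # hypothetical table keyed by shard_id

def try_claim_shard(shard_id, worker_id):
    # Conditional write: succeeds only if no other worker has claimed this shard.
    try:
        dynamodb.put_item(
            TableName=LEASE_TABLE,
            Item={'shard_id': {'S': shard_id}, 'owner': {'S': worker_id}},
            ConditionExpression='attribute_not_exists(shard_id)'
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False
        raise

def save_checkpoint(shard_id, sequence_number):
    # Record the last processed sequence number so a restarted worker can resume.
    dynamodb.update_item(
        TableName=LEASE_TABLE,
        Key={'shard_id': {'S': shard_id}},
        UpdateExpression='SET last_sequence = :seq',
        ExpressionAttributeValues={':seq': {'S': sequence_number}}
    )

def load_checkpoint(shard_id):
    # Return the saved sequence number, or None if this shard has no checkpoint yet.
    item = dynamodb.get_item(
        TableName=LEASE_TABLE,
        Key={'shard_id': {'S': shard_id}}
    ).get('Item')
    return item['last_sequence']['S'] if item and 'last_sequence' in item else None

A worker that claims a shard would call load_checkpoint, request its iterator with ShardIteratorType='AFTER_SEQUENCE_NUMBER' and the saved value, and call save_checkpoint periodically inside the read loop.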

Reading from Kafka

Partition-Specific Consumers:

Consumer Groups: Python consumers use Kafka's built-in consumer groups to read from partitions. Each consumer in a group reads from one or more partitions, which Kafka assigns dynamically to balance the load.

Python Consumers Reading from Kafka

+-----------------------+
|    Kafka Topic        |
+-----------------------+
       |   |   |   |
      P1  P2  P3  ... Pn
       |   |   |   |
      C1  C2  C3  ... Cm

Automatic Load Balancing: Kafka handles the partition assignment and rebalancing when consumers join or leave the group, making it easier to scale and manage.

Offset Management: consumers commit their offsets back to Kafka (automatically by default), so a restarted consumer resumes from the last committed offset, which simplifies fault tolerance (a manual-commit variant follows the example below).

from kafka import KafkaConsumer

# Join the consumer group 'your-group'; Kafka assigns partitions to this consumer.
consumer = KafkaConsumer(
    'your-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='your-group',
    auto_offset_reset='earliest'  # start from the oldest record if no committed offset exists
)

for message in consumer:
    print(message.value)
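
For at-least-once processing, auto-commit can be disabled so that offsets are committed only after a record has been handled. A minimal sketch with the same kafka-python library:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'your-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='your-group',
    enable_auto_commit=False,     # commit explicitly instead of on a background timer
    auto_offset_reset='earliest'
)

for message in consumer:
    print(message.value)          # process the record first...
    consumer.commit()             # ...then synchronously commit the consumed offsets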

In the setups above, the maximum processing parallelism is capped at the number of shards or partitions, which constrains scalability. To overcome this limitation, we can decouple compute from storage, similar to how Massively Parallel Processing (MPP) architectures operate: MPP systems separate storage from compute to achieve high scalability and flexibility. Spark Structured Streaming adopts this approach, effectively decoupling storage from compute to enhance parallelism and processing capabilities. Let's explore how Spark accomplishes this with Kinesis and Kafka.

Spark Structured Streaming: Reading from Kinesis and Kafka

Apache Spark Structured Streaming abstracts the complexities of stream processing, offering a unified approach for handling both Kinesis and Kafka data streams. Here’s how it works:

Reading from Kinesis

Unified DataFrame Approach:

Stream Abstraction: Spark reads data from all Kinesis shards and combines it into a single streaming DataFrame. This abstracts away shard-specific details and provides a consistent interface for processing the entire stream.

Spark Structured Streaming Reading from Kinesis

+-----------------------+
|   Kinesis Stream      |
+-----------------------+
        |   |   |
       S1  S2  S3  ... Sn
        \   |   /
        +------+
        |Spark |
        +------+
         / | \
        E1 E2 E3 ... Em

Parallel Processing: Spark partitions the unified DataFrame internally and distributes those partitions across the available executors, so downstream processing is not tied to the shard count (a repartitioning sketch follows the example below).

Fault Tolerance: Spark handles checkpointing and state management, allowing it to resume processing from the last checkpointed state in case of failures.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KinesisExample") \
    .getOrCreate()

# Note: the "kinesis" source is not bundled with open-source Spark; it comes from a
# connector (for example, the Databricks Kinesis source or the spark-sql-kinesis
# package), and option names can vary slightly between connectors.
kinesisDF = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", "your-stream-name") \
    .option("region", "us-east-1") \
    .option("initialPosition", "LATEST") \
    .load()

processedDF = kinesisDF.selectExpr("CAST(data AS STRING)")

query = processedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
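
Because the stream is exposed as a regular DataFrame, processing parallelism is not limited to the number of shards. The snippet below is a sketch that reuses kinesisDF from the example above; the partition count of 64 is an arbitrary illustration.

# Spread downstream processing across more tasks than there are shards.
repartitionedDF = kinesisDF.repartition(64)

# Aggregations and joins will also use this many shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "64")

processedDF = repartitionedDF.selectExpr("CAST(data AS STRING)")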

Reading from Kafka

Consumer Group Integration:

Partition-Aware Reading: Spark reads from Kafka partitions through its built-in Kafka source. The driver assigns topic partitions to executors and tracks offsets itself (under an internally generated consumer group id), rather than relying on Kafka's consumer-group rebalancing; each executor reads from one or more partitions.

Spark Structured Streaming Reading from Kafka

+-----------------------+
|    Kafka Topic        |
+-----------------------+
       |   |   |   |
      P1  P2  P3  ... Pn
       \   |   /
        +------+
        |Spark |
        +------+
         / | \
        E1 E2 E3 ... Em

Offset Management: Spark tracks offsets for each Kafka partition and stores them in the checkpoint directory. This allows Spark to resume from the last processed offset for each partition independently.

Scalability: the number of Kafka partitions bounds how many tasks read in parallel, but Spark can repartition the resulting DataFrame, or split partition ranges with the minPartitions source option, so that processing scales beyond the initial number of Kafka partitions (see the options sketch after the example below).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KafkaExample") \
    .getOrCreate()

kafkaDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "your-topic") \
    .load()

processedDF = kafkaDF.selectExpr("CAST(value AS STRING)")

query = processedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
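
The Kafka source also exposes options that control where reading starts and how finely the work is split across tasks. The values below are illustrative only, not recommendations.

kafkaDF = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "your-topic")
    # Where to begin when there is no checkpoint yet (the default is "latest").
    .option("startingOffsets", "earliest")
    # Cap the number of records pulled into each micro-batch.
    .option("maxOffsetsPerTrigger", "10000")
    # Ask Spark to split Kafka partitions into at least this many read tasks.
    .option("minPartitions", "64")
    .load()
)

minPartitions is what lets read parallelism exceed the Kafka partition count, complementing the repartitioning approach shown for Kinesis.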

Comparing Python Consumers and Spark Structured Streaming

| Aspect | Python Consumers (Kinesis) | Python Consumers (Kafka) | Spark Structured Streaming (Kinesis) | Spark Structured Streaming (Kafka) |
| --- | --- | --- | --- | --- |
| Data Ingestion | Direct shard readers | Kafka consumer groups | Unified DataFrame from all shards | Built-in Kafka source |
| Parallelism | Independent consumer per shard | Consumers read partitions in parallel | Unified DataFrame partitioned across executors | Executors read partitions; unified DataFrame |
| Coordination | Manual coordination required | Kafka manages partition assignment | Abstracted; no manual coordination needed | Abstracted; Spark assigns partitions to executors |
| Fault Tolerance | Must be implemented manually | Kafka-committed offsets and rebalancing | Handled by Spark with checkpointing | Handled by Spark with per-partition offsets in the checkpoint |
| Scalability | Limited by number of shards | Limited by number of partitions | Can repartition beyond the shard count | Scales with partitions, executors, and repartitioning |
| API Complexity | Detailed shard management required | Simple consumer-group API | Simplified by the unified DataFrame abstraction | Simplified by the unified DataFrame abstraction |
| Latency | Lower; direct shard reads | Lower; direct partition reads | Higher; micro-batch processing | Higher; micro-batch processing |

Conclusion

Choosing between Python consumers and Spark Structured Streaming depends on your specific requirements for real-time data processing:

  • Python Consumers: Offer flexibility and control, ideal for simple use cases or environments where you need direct interaction with shards or partitions. They require more effort for fault tolerance and coordination.

  • Spark Structured Streaming: Provides a high-level abstraction that simplifies stream processing and scales efficiently. It handles fault tolerance and parallelism internally, making it suitable for complex data processing and large-scale applications.

Both approaches have their advantages and are well-suited to different scenarios. By understanding their strengths and limitations, you can make an informed choice for your streaming data processing needs.
