Comparing Python Consumers and Spark Structured Streaming: Reading from Kinesis and Kafka
In the world of real-time data processing, Amazon Kinesis and Apache Kafka are popular tools for handling large streams of data. Processing this data can be done using various frameworks and libraries, among which Python consumers and Apache Spark Structured Streaming stand out. This blog explores how these approaches differ when reading from Kinesis and Kafka, highlighting their capabilities, mechanisms, and use cases.
Python Consumers: Reading from Kinesis and Kafka
Python consumers offer flexibility and direct control over how data is ingested and processed. Here’s how they interact with Kinesis and Kafka:
Reading from Kinesis
Direct Shard Access:
Individual shard readers: Python consumers can read independently from Kinesis shards by managing shard iterators. Each consumer reads from a specific shard, which is straightforward to implement but requires manual coordination.
Python Consumers Reading from Kinesis
+-----------------------+
| Kinesis Stream |
+-----------------------+
| | |
S1 S2 S3 ... Sn
| | |
C1 C2 C3 ... Cn
Coordination: When scaling to multiple consumers, coordination is required to ensure that each shard is only read by one consumer at a time. This can be managed through centralized databases or distributed key-value stores.
Fault Tolerance: Python consumers must track the last processed record and handle retries themselves, which complicates fault tolerance (a coordination and checkpointing sketch follows the example below).
import boto3
import time

def read_from_shard(shard_id, stream_name):
    kinesis = boto3.client('kinesis')
    # Start at the tip of the shard; use 'TRIM_HORIZON' instead to replay from the oldest record.
    shard_iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType='LATEST'
    )['ShardIterator']
    while shard_iterator:
        response = kinesis.get_records(ShardIterator=shard_iterator)
        for record in response['Records']:
            process(record)
        shard_iterator = response.get('NextShardIterator')
        time.sleep(1)  # throttle polling to stay within the shard's read limits

def process(record):
    print(record['Data'])

# Example usage
read_from_shard('shardId-000000000000', 'your-stream-name')
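The coordination and checkpointing mentioned above have to be built by hand. One common pattern is a shared table with conditional writes: a consumer claims a shard by inserting a lease row, and records the last processed sequence number so a restart can resume. The sketch below uses DynamoDB purely as an illustration; the table name shard-leases and its attributes are hypothetical, and any store with compare-and-set semantics would work.
import boto3

dynamodb = boto3.client('dynamodb')
LEASE_TABLE = 'shard-leases'  # hypothetical table keyed on shard_id

def try_claim_shard(shard_id, consumer_id):
    """Claim a shard lease; fails if another consumer already holds it."""
    try:
        dynamodb.put_item(
            TableName=LEASE_TABLE,
            Item={'shard_id': {'S': shard_id}, 'owner': {'S': consumer_id}},
            ConditionExpression='attribute_not_exists(shard_id)'
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False

def checkpoint(shard_id, sequence_number):
    """Record the last processed sequence number for this shard."""
    dynamodb.update_item(
        TableName=LEASE_TABLE,
        Key={'shard_id': {'S': shard_id}},
        UpdateExpression='SET last_sequence = :seq',
        ExpressionAttributeValues={':seq': {'S': sequence_number}}
    )
On restart, a consumer that re-acquires the lease can pass the stored sequence number to get_shard_iterator with ShardIteratorType='AFTER_SEQUENCE_NUMBER' instead of 'LATEST' to pick up where it left off.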
Reading from Kafka
Partition-Specific Consumers:
Consumer groups: Python consumers use Kafka's built-in consumer groups to read from partitions. Each consumer in a group reads from one or more partitions, which Kafka assigns dynamically to balance the load.
Python Consumers Reading from Kafka
+-----------------------+
| Kafka Topic |
+-----------------------+
| | | |
P1 P2 P3 ... Pn
| | | |
C1 C2 C3 ... Cm
Automatic Load Balancing: Kafka handles the partition assignment and rebalancing when consumers join or leave the group, making it easier to scale and manage.
Offset Management: Kafka stores committed offsets, so consumers can resume from the last committed record after a failure, simplifying fault tolerance (see the explicit-commit sketch after the example below).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'your-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='your-group',
    auto_offset_reset='earliest'
)

for message in consumer:
    print(message.value)
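Offsets are only useful for recovery if they are committed after the work is done. A minimal sketch of that pattern with kafka-python, disabling auto-commit and committing explicitly once each message has been handled (handle is a stand-in for your own processing logic):
from kafka import KafkaConsumer

def handle(value):
    print(value)  # placeholder for real processing

consumer = KafkaConsumer(
    'your-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='your-group',
    enable_auto_commit=False,      # commit manually for at-least-once semantics
    auto_offset_reset='earliest'
)

for message in consumer:
    handle(message.value)
    consumer.commit()  # commit only after successful processing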
In both setups above, the maximum processing parallelism is capped by the number of shards or partitions, which constrains scalability. One way past this limit is to decouple compute from storage, much as Massively Parallel Processing (MPP) architectures do to achieve scalability and flexibility. Spark Structured Streaming follows the same idea: it reads from the shards or partitions and can then redistribute the data across however many executors the cluster provides, so downstream processing is not tied to the source's parallelism. Let's explore how Spark accomplishes this with Kinesis and Kafka.
Spark Structured Streaming: Reading from Kinesis and Kafka
Apache Spark Structured Streaming abstracts the complexities of stream processing, offering a unified approach for handling both Kinesis and Kafka data streams. Here’s how it works:
Reading from Kinesis
Unified DataFrame Approach:
Stream abstraction: Spark reads data from all Kinesis shards and combines it into a single streaming DataFrame. This abstracts away shard-specific details and provides one consistent interface for processing the entire stream (a parsing sketch follows the example below).
Spark Structured Streaming Reading from Kinesis
+-----------------------+
| Kinesis Stream |
+-----------------------+
| | |
S1 S2 S3 ... Sn
\ | /
+-------+
| Spark |
+-------+
/ | \
E1 E2 E3 ... Em
Parallel Processing: Spark splits the unified DataFrame into tasks and distributes them across the available executors, so processing parallelism is governed by the cluster rather than by individual shards.
Fault Tolerance: Spark handles checkpointing and state management, allowing it to resume processing from the last checkpointed state in case of failures.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KinesisExample") \
    .getOrCreate()

# Note: the "kinesis" source is not bundled with open-source Spark; it requires a
# Kinesis connector (for example, the Databricks Kinesis source or spark-sql-kinesis).
kinesisDF = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", "your-stream-name") \
    .option("region", "us-east-1") \
    .option("initialPosition", "LATEST") \
    .load()

# The record payload arrives as binary; cast it to a string for processing.
processedDF = kinesisDF.selectExpr("CAST(data AS STRING)")

query = processedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
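Because every shard feeds the same DataFrame, downstream logic is ordinary DataFrame code. As a rough sketch, assuming each record's payload is a JSON document with hypothetical device_id and temperature fields, it could be parsed and aggregated like any other stream:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical payload schema; replace with the shape of your own records.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

parsedDF = (kinesisDF
    .selectExpr("CAST(data AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("event"))
    .select("event.*"))

# Writing avgDF as a stream would require outputMode("update") or "complete".
avgDF = parsedDF.groupBy("device_id").avg("temperature")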
Reading from Kafka
Consumer Group Integration:
Partition-aware reading: Spark's Kafka source subscribes to the topic and assigns one or more partitions to each executor. Although it uses a consumer group id internally, Spark tracks offsets in its own checkpoint rather than relying on Kafka's offset commits and rebalancing.
Spark Structured Streaming Reading from Kafka
+-----------------------+
| Kafka Topic |
+-----------------------+
| | | |
P1 P2 P3 ... Pn
\ | /
+------+
|Spark |
+------+
/ | \
E1 E2 E3 ... Em
Offset Management: Spark tracks offsets for each Kafka partition and stores them in the checkpoint directory. This allows Spark to resume from the last processed offset for each partition independently.
Scalability: Spark distributes the streaming DataFrame across executors, and a repartition lets downstream stages run with more tasks than there are Kafka partitions (see the sketch after the example below).
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KafkaExample") \
    .getOrCreate()

kafkaDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "your-topic") \
    .load()

# Kafka delivers key and value as binary; cast the value to a string for processing.
processedDF = kafkaDF.selectExpr("CAST(value AS STRING)")

query = processedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
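To make the earlier point about decoupling compute from storage concrete: the source still produces one task per Kafka partition, but the stream can be repartitioned so heavier downstream work uses whatever parallelism the cluster provides. A minimal sketch that could replace the write in the example above (the partition count 64 is an arbitrary placeholder):
# Spread downstream processing across more tasks than there are Kafka partitions.
# The shuffle introduced by repartition() adds some latency but removes the
# partition-count ceiling on processing parallelism.
widenedDF = processedDF.repartition(64)

query = widenedDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()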
Comparing Python Consumers and Spark Structured Streaming
| Aspect | Python Consumers (Kinesis) | Python Consumers (Kafka) | Spark Structured Streaming (Kinesis) | Spark Structured Streaming (Kafka) |
| --- | --- | --- | --- | --- |
| Data Ingestion | Direct shard readers | Uses Kafka consumer groups | Unified DataFrame from all shards | Unified DataFrame from all partitions |
| Parallelism | Independent consumer per shard | Consumers read partitions in parallel | Unified DataFrame partitioned across executors | Executors read partitions; stream can be repartitioned |
| Coordination | Requires manual coordination | Kafka manages partition assignment | Abstracted; no manual coordination needed | Abstracted; Spark assigns partitions to executors |
| Fault Tolerance | Must be implemented manually | Kafka stores committed offsets | Handled by Spark with checkpointing | Handled by Spark with per-partition offsets in checkpoints |
| Scalability | Limited by number of shards | Limited by number of partitions | Scales beyond shard count via repartitioning | Scales beyond partition count via repartitioning and executors |
| API Complexity | Requires detailed shard management | Simple consumer-group API | Simplified by unified DataFrame abstraction | Simplified by unified DataFrame abstraction |
| Latency | Lower; direct shard reads | Lower; direct partition reads | Slightly higher due to micro-batch processing | Slightly higher due to micro-batch processing |
Conclusion
Choosing between Python consumers and Spark Structured Streaming depends on your specific requirements for real-time data processing:
Python Consumers: Offer flexibility and control, ideal for simple use cases or environments where you need direct interaction with shards or partitions. They require more effort for fault tolerance and coordination.
Spark Structured Streaming: Provides a high-level abstraction that simplifies stream processing and scales efficiently. It handles fault tolerance and parallelism internally, making it suitable for complex data processing and large-scale applications.
Both approaches have their advantages and are well-suited to different scenarios. By understanding their strengths and limitations, you can make an informed choice for your streaming data processing needs.