Google Dataflow Optimization: Streaming Engine
What is Streaming Engine
"By default, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. For more information, see Streaming Engine.
The Streaming Engine model has the following benefits:
Reduced CPU, memory, and Persistent Disk storage resource usage on the worker VMs. Streaming Engine works best with smaller worker machine types (
n1-standard-2
instead ofn1-standard-4
). It doesn't require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.More responsive Horizontal Autoscaling in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.
Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge associated with the use of Streaming Engine."
How it Works
The core of the Streaming Engine is to assign a key to each message being processed and perform all processing in the context of the key. The key allows the Streaming Engine to track the state across related messages. The state is tracked, using the key, in a Key-Value persistent store. When Dataflow transforms modify the messages, these are persisted in the store. Also, by using keys, the messages can be more easily partitioned and load-balanced among workers, as each worker can process a range of the key space.
For more details, please see this article
Pricing
For streaming pipelines, the Dataflow Streaming Engine moves streaming shuffle and state processing out of the worker VMs and into the Dataflow service backend
Streaming Engine usage is billed by the volume of streaming data processed, which depends on the following:
The volume of data ingested into your streaming pipeline
The complexity of the pipeline
The number of pipeline stages with shuffle operation or with stateful DoFns
Examples of what counts as a byte processed include the following items:
Input flows from data sources
Flows of data from one fused pipeline stage to another fused stage
Flows of data persisted in user-defined state or used for windowing
Output messages to data sinks, such as to Pub/Sub or BigQuery
Please see this Google Cloud Platform pricing page for the latest information on costs per region.
How to Use Streaming Engine
All you need to do is add the following CLI parameter when you are deploying your Dataflow pipeline. --enable_streaming_engine
. Then, your dataflow streaming job will start to use the streaming engine for processing!
Useful Links
Subscribe to my newsletter
Read articles from Nikhil Rao directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Nikhil Rao
Nikhil Rao
Los Angeles