A Practical Guide to Making Iceberg Good for Streaming Data

Table of contents
- Why Apache Iceberg Needs Streaming Support
- The Challenges of Streaming into Apache Iceberg
- RisingWave: Making Apache Iceberg Streaming Native
- Practical Example: Real-Time Revenue Calculation with RisingWave and Iceberg
- Beyond Ingestion: Query and Transform Iceberg Data with RisingWave
- Advantages of Streaming Data into Apache Iceberg with RisingWave
- Conclusion

Why Apache Iceberg Needs Streaming Support
Apache Iceberg has changed how data lakes are built by letting cloud object storage such as Amazon S3 behave like a database. Iceberg provides schema evolution, transactional safety, hidden partitioning, snapshot-based time travel, and rich table metadata. Its ability to query data efficiently on S3 has made Iceberg the go-to table format for modern data platforms.
However, there's one significant drawback: Iceberg struggles with streaming data.
While Iceberg excels with periodic batch loads (hourly or daily), it struggles with continuous, event-driven ingestion such as Kafka streams or database CDC (Change Data Capture) logs. Common issues include too many small files, oversized metadata manifests, and degraded query performance. Iceberg expects data to be cleaned and structured before it lands, which leaves teams to solve the streaming problems upstream on their own.
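To get a feel for how quickly small files pile up, here is a rough back-of-the-envelope sketch. It assumes, as a simplification, that every commit writes one data file per writer for each partition it touches; the 10 partitions and the 10-second commit interval are illustrative numbers, not measurements.

```python
def files_per_day(commit_interval_s: float, partitions_touched: int, writers: int = 1) -> int:
    """Rough count of data files written per day, assuming each commit
    produces one file per writer per partition touched (a simplification)."""
    commits_per_day = 86_400 / commit_interval_s
    return int(commits_per_day * partitions_touched * writers)


# Hypothetical workloads: same table, different commit cadence.
hourly_batch = files_per_day(commit_interval_s=3600, partitions_touched=10)
streaming = files_per_day(commit_interval_s=10, partitions_touched=10)

print(f"hourly batch: {hourly_batch} files/day")
print(f"10s streaming commits: {streaming} files/day")
print(f"ratio: {streaming // hourly_batch}x")
```

Under these assumptions the streaming writer produces 360 times as many files for the same data volume, and every one of them must be tracked in manifests and opened at query time.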
The Challenges of Streaming into Apache Iceberg
To stream data efficiently into Iceberg, you need to master these critical tasks:
- High-throughput ingestion from Kafka or CDC logs (PostgreSQL, MySQL, MongoDB)
- Handling late-arriving data using event timestamps
- Real-time transformation and cleansing
- Grouping data into efficiently sized Iceberg files
- Atomically committing snapshots
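Of these tasks, late-arriving data is the one that most often surprises people. A common approach is a watermark that trails the largest event timestamp seen so far; anything older than the watermark counts as late. The sketch below is a generic illustration of that idea, not how any particular engine implements it. The `Event` type, the function name, and the 60-second allowance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_time: int  # seconds, taken from the record itself (not arrival time)
    value: float


def split_on_time_and_late(events, allowed_lateness_s):
    """Classify events against a watermark that trails the max event time seen."""
    max_seen = float("-inf")
    on_time, late = [], []
    for ev in events:  # iteration order is arrival order, not event-time order
        max_seen = max(max_seen, ev.event_time)
        watermark = max_seen - allowed_lateness_s
        if ev.event_time < watermark:
            late.append(ev)
        else:
            on_time.append(ev)
    return on_time, late


# Hypothetical arrival sequence: the last event is 80 seconds behind the newest one.
arrivals = [Event(100, 1.0), Event(105, 2.0), Event(98, 3.0), Event(200, 4.0), Event(120, 5.0)]
on_time, late = split_on_time_and_late(arrivals, allowed_lateness_s=60)

print([e.event_time for e in on_time])
print([e.event_time for e in late])
```

The event stamped 98 arrives out of order but still within the allowance, so it is accepted; the event stamped 120 arrives after the watermark has moved to 140 and is treated as late.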
Typically, teams cobble together Kafka, Debezium, Flink, and custom Iceberg connectors. This approach is fragile and costly, and it demands extensive custom code and operational effort. Query performance also suffers from S3's inherent latency, which forces teams to precompute aggregates and transformations to keep queries fast.
RisingWave: Making Apache Iceberg Streaming Native
RisingWave offers a simpler, integrated solution. It treats Iceberg as an active table engine rather than a passive storage destination. With RisingWave, Iceberg becomes a streaming-aware, live table backed by incremental processing.
You can ingest data declaratively from Kafka and database CDC streams directly in SQL. RisingWave's PostgreSQL-compatible syntax lets you work with Iceberg-backed tables the same way you work with any other table: queries, inserts, updates, and materialized views are all supported.
RisingWave handles data compaction and snapshotting for you, which prevents the small-file problem. Committed data remains queryable by external engines such as Trino or Spark.
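Compaction is conceptually simple: pick small files and rewrite them together into fewer, larger ones. The sketch below shows a generic greedy grouping, purely as an illustration. It is not RisingWave's or Iceberg's actual algorithm, and the 128 MB target and the file sizes are assumed values.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into rewrite batches of at most target_mb.

    Files already at or above the target are left alone.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough, no rewrite needed
        if current and current_size + size > target_mb:
            groups.append(current)  # close the batch before it overflows
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical file sizes in MB, as left behind by frequent small commits.
sizes = [5, 8, 200, 40, 64, 12, 30, 3]
groups = plan_compaction(sizes)

for group in groups:
    print(f"rewrite {len(group)} files ({sum(group)} MB) into one")
```

Here seven small files collapse into two rewrite batches, while the 200 MB file is skipped because it is already above the target.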
Practical Example: Real-Time Revenue Calculation with RisingWave and Iceberg
Suppose you want to process order events from Kafka in real time, calculate hourly revenue per region, and store the results in Iceberg:
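-- Assumes the Iceberg table engine's catalog and storage connection is already configured.
-- 1. Declare the Kafka topic as a streaming source (column types are assumed).
CREATE SOURCE source_stream (
    region VARCHAR,
    order_total DOUBLE PRECISION,
    order_time TIMESTAMP
)
WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;

-- 2. Create the Iceberg-backed table that stores the hourly results.
CREATE TABLE hourly_revenue (
    window_start TIMESTAMP,
    window_end TIMESTAMP,
    region VARCHAR,
    revenue DOUBLE PRECISION,
    PRIMARY KEY (window_start, region)
)
WITH (commit_checkpoint_interval = 120)
ENGINE = iceberg;

-- 3. Continuously aggregate one-hour tumbling windows into the table.
--    The sink name is a placeholder chosen for this example.
CREATE SINK hourly_revenue_sink INTO hourly_revenue AS
SELECT
    window_start,
    window_end,
    region,
    SUM(order_total) AS revenue
FROM TUMBLE (
    source_stream,
    order_time,
    INTERVAL '1 hour'
)
GROUP BY window_start, window_end, region;
Together, these three statements: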
- Streams Kafka data
- Processes event-time tumbling windows
- Compacts data into optimized Iceberg files in S3
- Enables immediate query access for downstream analytics
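If tumbling windows are new to you, the following plain-Python sketch mirrors what the windowed aggregation computes: each order is assigned to the fixed one-hour window containing its `order_time`, and `order_total` is summed per window and region. The helper names and the sample orders are made up for illustration; a streaming engine does this continuously rather than over a list in memory.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def tumble_start(ts: datetime, size: timedelta) -> datetime:
    """Return the start of the fixed-size window that contains ts."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // size) * size


def hourly_revenue(orders):
    """Sum order_total per (window_start, window_end, region)."""
    size = timedelta(hours=1)
    totals = defaultdict(float)
    for order_time, region, order_total in orders:
        start = tumble_start(order_time, size)
        totals[(start, start + size, region)] += order_total
    return dict(totals)


# Hypothetical order events: (order_time, region, order_total).
orders = [
    (datetime(2024, 1, 1, 9, 5), "eu", 20.0),
    (datetime(2024, 1, 1, 9, 55), "eu", 5.0),
    (datetime(2024, 1, 1, 10, 0), "eu", 7.5),
    (datetime(2024, 1, 1, 9, 30), "us", 12.0),
]
result = hourly_revenue(orders)

for (start, end, region), revenue in sorted(result.items()):
    print(start, end, region, revenue)
```

Note that the order at exactly 10:00 falls into the 10:00 to 11:00 window, because tumbling windows include their start and exclude their end.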
Beyond Ingestion: Query and Transform Iceberg Data with RisingWave
RisingWave extends Iceberg beyond simple ingestion. You can:
- Directly query Iceberg tables within RisingWave
- Join Iceberg tables with other real-time data sources
- Create incremental materialized views, significantly reducing compute and storage overhead
For instance, you can build a continuously updated leaderboard of the top three regions by revenue over the last six hours:
CREATE MATERIALIZED VIEW top_regions AS
SELECT region, SUM(revenue) AS total
FROM hourly_revenue
WHERE window_start >= NOW() - INTERVAL '6 hours'
GROUP BY region
ORDER BY total DESC
LIMIT 3;
This view updates incrementally, providing near-real-time analytics without redundant computations or separate batch jobs.
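To make "incrementally" concrete, here is a small conceptual sketch of the idea: instead of rescanning every row on each refresh, the view keeps a running total per region and folds each change into it. This is an illustration only, not RisingWave's implementation. The class name and the sample figures are invented, and the six-hour filter is left out (an expiring row could be modeled as a negative delta).

```python
import heapq
from collections import defaultdict


class TopRegions:
    """Keeps per-region running totals and answers top-k on demand."""

    def __init__(self, k=3):
        self.k = k
        self.totals = defaultdict(float)

    def apply(self, region, revenue_delta):
        """Fold one change into the running total without rescanning history."""
        self.totals[region] += revenue_delta

    def top(self):
        """Return the k regions with the highest totals, largest first."""
        return heapq.nlargest(self.k, self.totals.items(), key=lambda kv: kv[1])


view = TopRegions(k=3)
for region, delta in [("eu", 100.0), ("us", 250.0), ("apac", 80.0), ("latam", 40.0), ("eu", 200.0)]:
    view.apply(region, delta)
before = view.top()

view.apply("latam", 50.0)  # one new row arrives; only one total changes
after = view.top()

print(before)
print(after)
```

A single new row for `latam` is enough to push it past `apac` in the leaderboard, and the work done is proportional to the change rather than to the size of the table.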
Advantages of Streaming Data into Apache Iceberg with RisingWave
RisingWave transforms the way you use Iceberg by offering:
- Real-time data processing and incremental updates
- Reduced operational complexity
- Improved query performance and freshness
- Seamless integration with existing analytical tools (Trino, Spark)
Instead of complex pipelines, RisingWave delivers a streaming-native experience for Iceberg:
- Fully incremental processing
- SQL-native operations
- Operational simplicity
- Direct compatibility with Iceberg semantics
Conclusion
Apache Iceberg doesn't inherently need streaming capabilities of its own. What it needs is a streamlined, streaming-first infrastructure around it. RisingWave provides exactly that: it integrates with Iceberg and delivers robust, real-time, SQL-driven data processing.
If you're looking to fully leverage Apache Iceberg for streaming data, RisingWave is built to bridge this critical gap, unlocking new efficiencies and analytics capabilities for your data lakehouse.
Written by

RisingWave Labs
RisingWave is an open-source distributed SQL database for stream processing. It is designed to reduce the complexity and cost of building real-time applications. RisingWave offers users a PostgreSQL-like experience specifically tailored for distributed stream processing. Learn more: https://risingwave.com/github. RisingWave Cloud is a fully managed cloud service that encompasses the entire functionality of RisingWave. By leveraging RisingWave Cloud, users can effortlessly engage in cloud-based stream processing, free from the challenges associated with deploying and maintaining their own infrastructure. Learn more: https://risingwave.cloud/. Talk to us: https://risingwave.com/slack.