Cost-efficient event ingestion into Iceberg S3 Tables on AWS


Amazon S3 Tables was launched on December 3rd 2024, and provides you “storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format”.
While S3 Tables can be queried with a variety of services, such as Amazon Athena, Amazon EMR, Apache Spark or, even better, DuckDB, data ingestion is currently pretty limited: you can insert data via SQL queries in Athena, via EMR Spark clusters, and, as a fairly new feature, via Amazon Data Firehose.
For use cases that involve (mass) event ingestion and also need to be cost-efficient, using Athena would be impractical (and slow), whereas using EMR Spark clusters is very costly. In contrast, Amazon Data Firehose’s pricing is strictly pay-per-use for Apache Iceberg tables (not sure why they use a different name for S3 Tables, though):
For Firehose streams that is configured with Apache Iceberg Tables as a destination, you will be billed for the amount of data processed to the destination. Pricing is billed per GB ingested with no 5KB increments. If data processed bytes before delivery is more than the ingested bytes due to custom Lambda processing, then the additional bytes are also billed. Additional bytes are billed at the same rate as shown in Kinesis Data Streams as a source to Apache Iceberg tables as a destination for all sources of ingestion including Direct PUT.
So that’s a pretty sweet deal at $0.075/GB (for the first 250TB/month) one might say!
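To put a number on it: ingesting 100GB of events per month would cost roughly 100GB × $0.075/GB = $7.50 in Firehose charges (not counting S3 Tables storage, request and compaction costs).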
Example Use Case
What would an event ingestion pipeline that lands the data in S3 Tables look like? A typical example use case would be web analytics, where client browsers send requests containing information about the browser, the device used, the user location and potentially many more details to a backend. There can be different event types, such as normal pageviews, or tracking events of varying structure.
The backend then ingests the events, validates, cleans, transforms and stores them, so that different metrics can be calculated in later steps, for example visitor counts, pageview counts, session lengths etc.
For sending the events, we will use the OwnStats Client, which is a plugin for getanalytics.io. Alternatively, you can also send sample events via HTTP clients.
Architectural Implementation Options
But how should the actual events be “landed” on AWS, if we consider them as typical JSON payloads? Below you’ll find some architectural variants that achieve the same or similar goals, but with different services involved. As we’d like to optimize for costs, we want to focus on services that are pay-per-use, but also discuss break-even points compared to fixed-price services.
Variant 1: API Gateway, Lambda and Kinesis Data Firehose to S3
The usual suspect for event ingestion would be an API created with API Gateway fronting Lambda functions, which then publish the events to a Kinesis Data Firehose stream that forwards the data to S3 Tables, or, more “classically”, to S3 as Hive-partitioned Parquet files.
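To illustrate Variant 1, here's a minimal sketch (not part of this project) of what such a Lambda handler behind API Gateway could look like, forwarding incoming JSON events to a Firehose stream via the AWS SDK for JavaScript; the stream name and payload fields are placeholders:

// variant1-handler.ts — sketch of an API Gateway Lambda that forwards events to Firehose
import { FirehoseClient, PutRecordCommand } from "@aws-sdk/client-firehose";
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const firehose = new FirehoseClient({});

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  // The event payload arrives as a JSON string in the request body
  const payload = JSON.parse(event.body ?? "{}");

  // Add an arrival timestamp and forward the record to the (placeholder) delivery stream
  await firehose.send(new PutRecordCommand({
    DeliveryStreamName: "event-ingestion-stream", // placeholder name
    Record: {
      Data: Buffer.from(JSON.stringify({ ...payload, arrival_timestamp: new Date().toISOString() }) + "\n"),
    },
  }));

  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};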
Variant 2: CloudFront, Realtime Logs, Kinesis Data Streams & Firehose to S3 via Lambda
CloudFront basically replaces API Gateway here by hosting a 1×1 pixel GIF file to which event payloads can be passed via GET request query strings, which is the same method many web analytics providers use (such as Google Analytics).
CloudFront Realtime Logs forward a 100% “sample” of the logs containing the actual event payloads to a Kinesis Data Stream, which in turn feeds the events to a Kinesis Data Firehose stream. The Firehose stream uses a processing Lambda function that decodes the event payloads into a tabular form and adds Hive partitioning info back to the stream. Firehose then writes the data in batches to S3.
Unfortunately, it’s currently not possible for Realtime Logs to write directly to Kinesis Data Firehose instead of going through Kinesis Data Streams first, which imposes a fixed-cost “tax” of $0.015/hour on this architecture, i.e. at least $10.80/month.
This architecture is implemented by OwnStats, which is further documented and available on GitHub.
Variant 3: CloudFront, CloudWatch Logs, Kinesis Data Firehose to S3 via Lambda
In this architecture variant, the CloudFront Realtime Logs and the Kinesis Data Stream are replaced by a CloudWatch Logs Delivery, which forwards the standard logs (v2!) from CloudFront to the Kinesis Data Firehose.
This gets rid of the fixed costs for the Kinesis Data Stream (as outlined above) and replaces them with pay-per-use pricing of $0.25/GB for the Logs Delivery. The Kinesis Data Stream’s $10.80/month corresponds to roughly 43GB at $0.25/GB, so if you ingest less than about 40GB of events per month, you’ll definitely spend less with this variant.
The data doesn’t land in S3 as Hive-partitioned Parquet files in this case, but in the self-optimizing Apache Iceberg format in an S3 Table. The processing Lambda function is updated to route the events to the correct S3 Table, parse the raw logs from CloudFront, and transform them into the right tabular format.
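A heavily simplified sketch of such a processing function is shown below. It follows the Firehose transformation Lambda contract (base64-encoded records in, records with recordId, result and data out) and assumes that the CloudWatch Logs Delivery hands over the CloudFront standard logs as tab-separated plain-text lines, and that routing to the Namespace/S3 Table is done via otfMetadata on each output record. The field positions, parameter names and the Namespace/table names are assumptions and differ from the actual implementation:

// processing-lambda.ts — sketch of a Firehose transformation Lambda for Variant 3
import type { FirehoseTransformationEvent } from "aws-lambda";

// Hypothetical field order of the incoming CloudFront standard log (v2) lines
const LOG_FIELDS = ["timestamp", "cs-uri-stem", "cs-uri-query", "sc-status", "x-edge-location"];

interface OutputRecord {
  recordId: string;
  result: "Ok" | "Dropped" | "ProcessingFailed";
  data: string;
  metadata?: {
    otfMetadata: { destinationDatabaseName: string; destinationTableName: string; operation: string };
  };
}

export const handler = async (event: FirehoseTransformationEvent): Promise<{ records: OutputRecord[] }> => {
  const records: OutputRecord[] = event.records.map((record) => {
    // Firehose delivers each record base64-encoded
    const line = Buffer.from(record.data, "base64").toString("utf-8").trim();
    const values = line.split("\t");
    const row = Object.fromEntries(LOG_FIELDS.map((name, i) => [name, values[i]]));

    // Decode the event payload from the query string and build the tabular record
    const query = new URLSearchParams(row["cs-uri-query"] ?? "");
    const output = {
      event_type: query.get("t") ?? "pageview", // placeholder parameter name
      event_date: (row["timestamp"] ?? "").slice(0, 10),
      request_path: row["cs-uri-stem"],
      edge_id: row["x-edge-location"],
    };

    return {
      recordId: record.recordId,
      result: "Ok",
      data: Buffer.from(JSON.stringify(output) + "\n").toString("base64"),
      // Routing information for the Iceberg/S3 Tables destination (assumed otfMetadata contract)
      metadata: {
        otfMetadata: {
          destinationDatabaseName: "iceberg_event_analytics",
          destinationTableName: "event_data",
          operation: "insert",
        },
      },
    };
  });

  return { records };
};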
Caveats
There are currently multiple downsides to this approach:
There is no complete CloudFormation support for S3 Tables, meaning that you can use it to manage S3 Table Buckets, but not Namespaces or actual S3 Tables; those can only be created via the CLI, SDKs or the API. This means that we can’t deploy everything as one single stack, but have to take a multi-step/multi-stack approach with manual interventions.
The forwarding of the data from the Kinesis Data Firehose to the S3 Table only works if a LakeFormation Data Lake Administrator has been set up. Furthermore, the IAM role you deploy the CloudFormation stacks with needs to have the correct permissions / managed policy as well.
When creating the LakeFormation Permission via CloudFormation, there seems to be a strange bug: even if you define a DependsOn dependency so that the permission is created before the Kinesis Data Firehose resource that references the actual Namespace and S3 Table, you’ll see the following error:
Role arn:aws:iam::112233445566:role/rolename is not authorized to perform: glue:GetTable for the given table or the table does not exist
If the Kinesis Data Firehose resource is created in another subsequent CloudFormation stack, it works. This seems to be some kind of race condition in the CloudFormation service.
S3 Tables currently only supports the Apache Iceberg primitive types, as per the docs: “S3 Tables supports all Apache Iceberg primitive types. For more information, see the Apache Iceberg documentation.” As we’d like to store and query JSON payloads, it would be much nicer if semi-structured types like variant were supported as well.
Nevertheless, we want to move forward to the implementation, while taking the caveats into account.
Implementation of the Event Ingestion Pipeline
As Variant 1 has already been implemented by many other people, and Variant 2 is implemented for example by OwnStats (while producing fixed costs), Variant 3 shall be the focus, because it offers an overall pay-per-use approach and is technically novel, and thus potentially more interesting for the community, even though it has the outlined caveats.
The focus lies on implementing the Example Use Case of web analytics events, as outlined above, and builds upon the insights gained while creating OwnStats, a completely serverless web analytics solution.
Prerequisites
The following prerequisites need to be available on the machine you want to deploy from:
AWS IAM credentials that have the respective permissions to deploy the different stacks’ resources
A Node.js installation (> v18, including npm)
A global installation of the Serverless Framework (v3, not v4!) via
npm i -g serverless@3
Structure of the project
The Iceberg Event Pipeline project consists of three sequentially deployable CloudFormation stacks, as well as a set of scripts that create the Namespace and the actual S3 Table resources, as they currently have no CloudFormation support themselves.
Base Stack
The Base Stack contains the CloudFront Distribution, the S3 Bucket used as a backup for data that can’t be forwarded from the Kinesis Data Firehose to the S3 Table, and the S3 Table Bucket.
Processing Stack
The Processing Stack contains the Kinesis Data Firehose Policy, the LakeFormation permissions, the Glue Database Resource Link, and the Lambda function for the event processing.
Storage Stack
The Storage Stack contains the actual Kinesis Data Firehose, and the CloudWatch Logs Delivery as well as its Source and Destination.
Configuration
In the root folder of the project, there’s a file called config.json, which contains the project’s configuration that’s used during the stack deployments. It looks like this:
{
  "profile": "default",
  "stage": "prd",
  "region": "us-east-1",
  "s3": {
    "tableBucket": "iceberg-event-analytics-data",
    "namespace": "iceberg_event_analytics",
    "tableName": "event_data"
  }
}
You can configure the AWS credentials profile name, the stage, and the AWS region you want to deploy the stacks to. In the s3 object, the S3 Tables-related resource names can be configured. You can also just use the defaults.
S3 Table Structure
The structure of the target table is defined in a JSON file as well, which is used during the manual creation of the S3 Table resources (see below); a minimal sketch of such a definition file follows the column list below. The table structure is reused from the OwnStats incoming_events table, though slightly adapted/simplified.
Column name | Data type | Is partition key? | Description |
event_id | uuid | yes | The unique ID of the event |
event_date | string | no | The date of the event (YYYY-MM-DD), as string |
event_timestamp | timestamp | no | The exact event timestamp |
arrival_timestamp | timestamp | no | The exact timestamp when the event arrived in the Kinesis Data Stream |
edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
edge_state | string | no | The state of the edge location |
edge_country | string | no | The country of the edge location |
edge_country_code | string | no | The country code of the edge location |
edge_latitude | float | no | The latitude of the edge location |
edge_longitude | float | no | The longitude of the edge location |
edge_id | string | no | The original id of the edge location |
domain_name | string | no | The domain name where the event was recorded |
referrer | string | no | The referrer |
referrer_domain_name | string | no | The domain name of the referrer |
browser_name | string | no | The name of the browser |
browser_version | string | no | The version of the browser |
browser_os_name | string | no | The OS name of the browser |
browser_os_version | string | no | The OS version of the browser |
browser_timezone | string | no | The timezone of the browser |
browser_language | string | no | The language of the browser |
device_type | string | no | The device type |
device_vendor | string | no | The device vendor |
device_outer_resolution | string | no | The outer resolution of the device |
device_inner_resolution | string | no | The inner resolution of the device |
device_color_depth | int | no | The color depth of the device |
device_platform | string | no | The platform of the device |
device_memory | float | no | The memory of the device (in MB) |
device_cores | int | no | The number of cores of the device |
utm_source | string | no | Identifies which site sent the traffic |
utm_campaign | string | no | Identifies a specific product promotion or strategic campaign |
utm_medium | string | no | Identifies what type of link was used, such as cost per click or email |
utm_content | string | no | Identifies what specifically was clicked to bring the user to the site |
utm_term | string | no | Identifies search terms |
request_url | string | no | The full requested URL |
request_path | string | no | The path of the requested URL |
request_query_string | string | no | The query string of the requested URL |
request_bytes | int | no | The size of the request in bytes |
request_status_code | int | no | The HTTP status code of the request |
request_cache_status | string | no | The CloudFront cache status |
request_delivery_time_ms | int | no | The time in ms it took for CloudFront to complete the request |
request_asn | int | no | The ASN of the requestor |
event_type | string | yes | The type of the event (pageview or track) |
event_name | string | no | The name of the event for tracking events |
event_data | string | no | The stringified event payload for tracking events |
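To give an impression of what such a definition could look like: the project’s actual file isn’t reproduced here, but a minimal excerpt in roughly the shape the S3 Tables CreateTable API expects for an Iceberg schema might look like this (field list shortened, file name and exact key names are assumptions, so check the S3 Tables API reference and the project code):

{
  "tableName": "event_data",
  "metadata": {
    "iceberg": {
      "schema": {
        "fields": [
          { "name": "event_id", "type": "uuid", "required": true },
          { "name": "event_date", "type": "string" },
          { "name": "event_timestamp", "type": "timestamp" },
          { "name": "device_color_depth", "type": "int" },
          { "name": "event_type", "type": "string" }
        ]
      }
    }
  }
}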
Deployment
To be able to deploy the project, you need to have set up all of the above-mentioned prerequisites on the machine you want to deploy from.
Deployment of the Base Stack
cd base-stack
# Deploy Base Stack
sls deploy
# Return to project base directory
cd ..
Manual creation of S3 Table resources & Asset sync for CloudFront
cd manual-setup
# Install dependencies
npm i
# Create S3 Table resources
npm run create-resources
# Sync assets to S3 Bucket that backs the Cloudfront Distribution
npm run sync-s3
# Return to project base directory
cd ..
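Since there’s no CloudFormation support for Namespaces and S3 Tables yet (see the caveats above), the create-resources step has to call the S3 Tables API directly. A minimal sketch of what such a script could look like with the AWS SDK for JavaScript is shown below; the client and command names come from @aws-sdk/client-s3tables, but the exact parameter shapes, the table-definition file name and the account ID are assumptions, so check the actual project code and the SDK reference:

// create-resources.ts — hedged sketch of creating the Namespace and the S3 Table via the SDK
import { readFileSync } from "node:fs";
import { S3TablesClient, CreateNamespaceCommand, CreateTableCommand } from "@aws-sdk/client-s3tables";

const config = JSON.parse(readFileSync("../config.json", "utf-8"));
const tableDefinition = JSON.parse(readFileSync("./table-definition.json", "utf-8")); // hypothetical file name

const client = new S3TablesClient({ region: config.region });

// The table bucket ARN is an output of the Base Stack; shown here with a placeholder account ID
const tableBucketARN = `arn:aws:s3tables:${config.region}:<account-id>:bucket/${config.s3.tableBucket}`;

// Create the Namespace first, then the Iceberg table inside it
await client.send(new CreateNamespaceCommand({
  tableBucketARN,
  namespace: [config.s3.namespace], // the API expects the namespace as a list (assumption)
}));

await client.send(new CreateTableCommand({
  tableBucketARN,
  namespace: config.s3.namespace,
  name: config.s3.tableName,
  format: "ICEBERG",
  metadata: tableDefinition.metadata, // Iceberg schema as sketched above (assumption)
}));

console.log(`Created ${config.s3.namespace}.${config.s3.tableName} in ${config.s3.tableBucket}`);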
Deployment of the Processing Stack
cd processing-stack
# Install dependencies
npm i
# Deploy Processing Stack
sls deploy
# Return to project base directory
cd ..
Deployment of the Storage Stack
cd storage-stack
# Deploy Storage Stack
sls deploy
# Return to project base directory
cd ..
Sending test events
If everything went smoothly, you can now send events to your pipeline, which will land in the S3 Table. To do so, you can either use the OwnStats client as stated in the Use Case description, or use the provided script that produces the same request signature:
cd manual-setup
# Send event
npm run send-event -- "My Event" '{"testevent":true,"message":"foobar"}'
Querying the data in the S3 Table
Using Athena
You can query the data with Athena in the AWS Console. Please make sure that you select the correct catalog and database before you run a sample query on the event_data table:
SELECT * FROM event_data ORDER BY arrival_timestamp DESC
If everything worked, you should see your events in the result browser.
Using DuckDB
I recently wrote a blog post about querying S3 Tables with DuckDB; please have a look to learn how you can do it.
Verdict
Creating low-cost, pay-per-use event data pipelines that store their data in the Apache Iceberg format on AWS is possible. Currently, it still requires a lot of configuration effort, and single-stack IaC isn’t supported, neither via CloudFormation nor the AWS CDK.
This is a major hurdle for the DX from my current point of view, especially in light of the very recent product releases from Cloudflare on 2025-04-10, namely Cloudflare Pipelines and the R2 Data Catalog.
While those are still beta releases, it’s visible that innovation in the event processing / efficient Data Lake storage space is accelerating. Once Cloudflare enables Pipelines to directly write Iceberg tables on R2, together with the R2 Data Catalog and the upcoming Containers product (e.g. running DuckDB as an API), you could host your complete Lakehouse data stack on Cloudflare, with potentially much less complexity compared to AWS at the moment.
Also from a cost perspective, this is bound to gain some market share, especially for smaller, non-enterprise companies:
Workers Paid ($5/month): Ingestion: first 50 GB per month included; Delivery to R2: first 50 GB per month included
In this scenario, you’d pay $5 for ingesting 50GB of data to R2, while on AWS you’d already have to pay $12.50 (50GB × $0.25/GB) for the CloudWatch Logs Delivery to Kinesis Data Firehose alone.
Also, there’s much less complexity from an IaC perspective compared to the pretty large infrastructure that’s required on AWS. I’m curious how / if AWS will react.