Cost-efficient event ingestion into Iceberg S3 Tables on AWS

Tobias Müller

Amazon S3 Tables was launched on December 3rd, 2024, and provides you with “storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format”.

While S3 Tables can be queried with a variety of AWS services, such as Amazon Athena, Amazon EMR, Apache Spark or, even better, DuckDB, the options for ingesting data are currently pretty limited. You can insert data via SQL queries in Athena, via EMR Spark clusters, and, as a pretty new feature, via Amazon Data Firehose.

For use cases that involve (mass) event ingestion and also want to be cost-efficient, using Athena would be impractical (and slow), whereas going through EMR Spark clusters is very costly. In contrast, Amazon Data Firehose’s pricing is strictly pay-per-use for Apache Iceberg tables (not sure why they use a different name for S3 Tables, though):

For Firehose streams that is configured with Apache Iceberg Tables as a destination, you will be billed for the amount of data processed to the destination. Pricing is billed per GB ingested with no 5KB increments. If data processed bytes before delivery is more than the ingested bytes due to custom Lambda processing, then the additional bytes are also billed. Additional bytes are billed at the same rate as shown in Kinesis Data Streams as a source to Apache Iceberg tables as a destination for all sources of ingestion including Direct PUT.

So that’s a pretty sweet deal at $0.075/GB (for the first 250 TB/month), one might say: ingesting 100 GB of events per month would cost just $7.50 in Firehose charges.

Example Use Case

What would an event ingestion pipeline that lands the data in S3 Tables look like? A typical example use case would be web analytics, where client browsers send requests containing information about the browser, the device used, the user location, and potentially many more details to a backend. There can be different event types, such as normal pageviews, or tracking events of varying structure.

The backend then ingests the events, validates, cleans, transforms, and stores them, so that different metrics can be calculated in later steps, for example visitor counts, pageview counts, session lengths, etc.

For sending the events, we will use the OwnStats Client, which is a plugin for getanalytics.io. Alternatively, you can also send sample events via HTTP clients.

Architectural Implementation Options

But how should the actual events be “landed” on AWS, if we consider them as typical JSON payloads? Below you’ll find some architectural variants that achieve the same or similar goals, but with different services involved. As we’d like to optimize for costs, we want to focus on services that are pay-per-use, but also discuss break-even points compared to fixed-priced services.

Variant 1: API Gateway, Lambda and Kinesis Data Firehose to S3

The usual suspect for event ingestion would be an API created with API Gateway fronting Lambda functions, which then publish the events to a Kinesis Data Firehose stream that forwards the data to S3 Tables, or, more “classically”, to S3 as Hive-partitioned Parquet files.
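To make this more concrete, here’s a minimal sketch of what such a forwarding Lambda function could look like, assuming the AWS SDK for JavaScript v3 and a Firehose stream name passed via an environment variable (all names are illustrative, not taken from an actual implementation):

// Hypothetical Lambda handler (Node.js 18+, AWS SDK v3) behind API Gateway
const { FirehoseClient, PutRecordCommand } = require("@aws-sdk/client-firehose");

const firehose = new FirehoseClient({});

exports.handler = async (event) => {
  // API Gateway (proxy integration) passes the raw JSON payload in event.body
  const payload = JSON.parse(event.body || "{}");

  // Add an arrival timestamp on the server side
  payload.arrival_timestamp = new Date().toISOString();

  // Forward the event to the Firehose stream (stream name is illustrative)
  await firehose.send(new PutRecordCommand({
    DeliveryStreamName: process.env.FIREHOSE_STREAM_NAME,
    Record: { Data: Buffer.from(JSON.stringify(payload) + "\n") },
  }));

  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};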

Variant 2: CloudFront, Realtime Logs, Kinesis Data Streams & Firehose to S3 via Lambda

CloudFront basically replaces API Gateway here by hosting a 1×1 pixel GIF file that event payloads can be passed to via GET request query strings, which is the same method many web analytics providers (such as Google Analytics) use.
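As an illustration, a browser-side client could encode an event into the pixel request roughly like this (the pixel path and parameter names are made up for this sketch; the actual OwnStats client uses its own format):

// Hypothetical browser-side snippet: encode the event as query string
// parameters of the 1x1 pixel GIF request (domain and path are illustrative)
function sendEvent(eventType, data) {
  const params = new URLSearchParams({
    event_type: eventType,            // e.g. "pageview" or "track"
    url: window.location.href,
    referrer: document.referrer,
    language: navigator.language,
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    data: JSON.stringify(data || {}), // stringified payload for tracking events
  });
  // The GET request shows up in the CloudFront logs, including the query string
  new Image().src = `https://cdn.example.com/pixel.gif?${params.toString()}`;
}

sendEvent("pageview");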

CloudFront Realtime Logs forward a 100% “sample” of the logs containing the actual event payloads to a Kinesis Data Stream, which in turn feeds the events into a Kinesis Data Firehose stream. A processing Lambda function attached to Firehose decodes the event payloads into a tabular form and also adds Hive partitioning info back to the stream. Firehose then writes the data in batches to S3.

Unfortunately, it’s currently not possible for the Realtime Logs to write directly to Kinesis Data Firehose instead of going through Kinesis Data Streams first, which imposes a fixed-cost “tax” of $0.015/hour on this architecture, i.e. at least $10.80/month (one shard for 720 hours).

This architecture is implemented by OwnStats, which is further documented and available on GitHub.

Variant 3: CloudFront, CloudWatch Logs, Kinesis Data Firehose to S3 via Lambda

In this architecture variant, the CloudFront Realtime Logs and the Kinesis Data Stream are replaced by a CloudWatch Logs Delivery, which forwards the standard logs (v2!) from CloudFront to the Kinesis Data Firehose.

This gets rid of the fixed costs for the Kinesis Data Stream (as outlined above) and replaces them with pay-per-use pricing of $0.25/GB for the Logs Delivery. The break-even point sits at roughly $10.80 / $0.25 ≈ 43 GB, so if you ingest less than about 40 GB of events per month, you’ll definitely spend less with this variant.

The data doesn’t land in S3 as Hive-partitioned Parquet files in this case, but in the self-optimizing Apache Iceberg format in an S3 Table. The processing Lambda function is updated to route the events to the correct S3 Table, parse the raw logs from CloudFront, and transform them into the right tabular format.
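A simplified sketch of such a processing Lambda function could look like the following. Note the assumptions: the CloudWatch Logs Delivery is configured with JSON output, the log field names and column mapping are abbreviated, and the otfMetadata routing block follows my reading of the Firehose documentation for Iceberg destinations, so double-check the field names against the current docs:

// Hypothetical Firehose record-transformation Lambda for Variant 3 (Node.js 18+)
const { randomUUID } = require("crypto");

exports.handler = async (event) => {
  const records = event.records.map((record) => {
    try {
      // Each record contains one CloudFront standard log entry (base64-encoded JSON assumed)
      const log = JSON.parse(Buffer.from(record.data, "base64").toString("utf8"));
      const qs = new URLSearchParams(log["cs-uri-query"] || "");

      // Map the raw log fields to the target table columns (abbreviated, field names illustrative)
      const row = {
        event_id: randomUUID(),
        event_date: String(log.timestamp || "").slice(0, 10), // assuming an ISO-formatted timestamp
        event_type: qs.get("event_type") || "pageview",
        domain_name: log["x-host-header"],
        request_path: log["cs-uri-stem"],
        edge_id: log["x-edge-location"],
        // ... remaining columns omitted for brevity
      };

      return {
        recordId: record.recordId,
        result: "Ok",
        data: Buffer.from(JSON.stringify(row)).toString("base64"),
        // Routing info for the Iceberg destination (names taken from config.json)
        metadata: {
          otfMetadata: {
            destinationDatabaseName: "iceberg_event_analytics",
            destinationTableName: "event_data",
            operation: "insert",
          },
        },
      };
    } catch (err) {
      // Records that can't be parsed end up in the backup S3 bucket
      return { recordId: record.recordId, result: "ProcessingFailed", data: record.data };
    }
  });

  return { records };
};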

Caveats

There are currently multiple downsides of this approach:

  • There is no complete CloudFormation support for S3 Tables, meaning that you can use it to manage S3 Table Buckets, but not Namespaces or actual S3 Tables; those can only be managed via the CLI, the SDKs, or the API. This means that we can’t deploy this as one single stack, but have to take a multi-step/multi-stack approach with manual interactions.

  • The forwarding of the data from the Kinesis Data Firehose to the S3 Table only works if a LakeFormation Data Lake Administrator has been set up (see the sketch after this list). Furthermore, the IAM role you deploy the CloudFormation stacks with needs to have the correct permissions / managed policy as well.

  • When creating the LakeFormation Permission via CloudFormation, there seems to be a strange bug: even when you define a DependsOn dependency so that the permission exists before the Kinesis Data Firehose resource (which references the actual Namespace and S3 Table) is created, you’ll see the following error:

Role arn:aws:iam::112233445566:role/rolename is not authorized to perform: glue:GetTable for the given table or the table does not exist

  • If the Kinesis Data Firehose resource is created in a subsequent CloudFormation stack instead, it works. This seems to be some kind of race condition in the CloudFormation service.

  • S3 Tables currently only support the Apache Iceberg primitive types as per the docs. As we’d like to store and query JSON payloads, it’d be much nicer if it also supported semi-structured types like variant. The docs currently state:

S3 Tables supports all Apache Iceberg primitive types. For more information, see the Apache Iceberg documentation.
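Regarding the LakeFormation Data Lake Administrator caveat: you can set the administrator up in the LakeFormation console, or script it. Here’s a minimal sketch using the AWS SDK for JavaScript v3 (the role ARN is a placeholder, and this is not part of the project’s stacks):

// Hypothetical one-off script to register a LakeFormation Data Lake Administrator
const {
  LakeFormationClient,
  GetDataLakeSettingsCommand,
  PutDataLakeSettingsCommand,
} = require("@aws-sdk/client-lakeformation");

const client = new LakeFormationClient({ region: "us-east-1" });

async function addDataLakeAdmin(principalArn) {
  // PutDataLakeSettings replaces the whole settings object,
  // so fetch the current settings first and extend the admin list
  const { DataLakeSettings } = await client.send(new GetDataLakeSettingsCommand({}));
  const admins = DataLakeSettings.DataLakeAdmins || [];

  if (!admins.some((a) => a.DataLakePrincipalIdentifier === principalArn)) {
    admins.push({ DataLakePrincipalIdentifier: principalArn });
  }

  await client.send(new PutDataLakeSettingsCommand({
    DataLakeSettings: { ...DataLakeSettings, DataLakeAdmins: admins },
  }));
}

// Role ARN is a placeholder; use the role you deploy / Firehose operates with
addDataLakeAdmin("arn:aws:iam::112233445566:role/deployment-role").catch(console.error);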

Nevertheless, we want to move forward with the implementation while taking the caveats into account.

Implementation of the Event Ingestion Pipeline

As Variant 1 has been implemented by many other people already, and Variant 2 is implemented for example by OwnStats (while producing fixed costs), Variant 3 shall be the focus, because it offers an overall pay-per-use approach and is technically novel, and thus potentially more interesting for the community, even though it has the outlined caveats.

The focus lies on implementing the example use case of web analytics events, as outlined above, and builds upon the insights gained while creating OwnStats, a completely serverless web analytics solution.

Prerequisites

The following prerequisites need to be available on the machine you want to deploy from:

  • AWS IAM credentials that have the respective permissions to deploy the different stacks’ resources

  • A Node.js installation (> v18, including npm)

  • A global installation of the Serverless Framework (v3, not v4!) via npm i -g serverless@3

Structure of the project

The Iceberg Event Pipeline project consists of three sequentially deployable CloudFormation stacks, as well as a set of scripts that create the Namespace and the actual S3 Table resources, as they currently have no CloudFormation support themselves.
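For illustration, the resource creation that the manual-setup script performs boils down to something like the following sketch with the AWS SDK for JavaScript v3 (the ARN is a placeholder; the actual script also applies the table schema from its JSON definition):

// Hypothetical sketch of the manual S3 Table resource creation (names from config.json)
const {
  S3TablesClient,
  CreateNamespaceCommand,
  CreateTableCommand,
} = require("@aws-sdk/client-s3tables");

const client = new S3TablesClient({ region: "us-east-1" });

// ARN of the S3 Table Bucket created by the Base Stack (account ID is a placeholder)
const tableBucketARN = "arn:aws:s3tables:us-east-1:112233445566:bucket/iceberg-event-analytics-data";

async function createTableResources() {
  // Namespaces have no CloudFormation support yet, so create them via the API
  await client.send(new CreateNamespaceCommand({
    tableBucketARN,
    namespace: ["iceberg_event_analytics"],
  }));

  // Create the Iceberg table; the column schema (see the table structure below)
  // is applied separately by the project's script from its JSON definition
  await client.send(new CreateTableCommand({
    tableBucketARN,
    namespace: "iceberg_event_analytics",
    name: "event_data",
    format: "ICEBERG",
  }));
}

createTableResources().catch(console.error);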

Base Stack

The Base Stack contains the CloudFront Distribution, the S3 Bucket used as a backup for data that the Kinesis Data Firehose couldn’t forward to the S3 Table, and the S3 Table Bucket.

Processing Stack

The Processing Stack contains the Kinesis Data Firehose Policy, the LakeFormation permissions, the Glue Database Resource Link, and the Lambda function for the event processing.

Storage Stack

The Storage Stack contains the actual Kinesis Data Firehose, as well as the CloudWatch Logs Delivery with its Source and Destination.
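For context, the Delivery, its Source, and its Destination map to the CloudWatch Logs vended log delivery APIs. A rough sketch of the same wiring via the AWS SDK for JavaScript v3 looks like this (ARNs are placeholders, and the stack itself uses the corresponding CloudFormation resources instead):

// Hypothetical sketch of the CloudWatch Logs Delivery wiring for CloudFront standard logs (v2)
const {
  CloudWatchLogsClient,
  PutDeliverySourceCommand,
  PutDeliveryDestinationCommand,
  CreateDeliveryCommand,
} = require("@aws-sdk/client-cloudwatch-logs");

const logs = new CloudWatchLogsClient({ region: "us-east-1" });

async function wireLogDelivery() {
  // Source: the CloudFront Distribution's standard access logs (v2)
  await logs.send(new PutDeliverySourceCommand({
    name: "cloudfront-access-logs",
    resourceArn: "arn:aws:cloudfront::112233445566:distribution/EDFDVBD6EXAMPLE",
    logType: "ACCESS_LOGS",
  }));

  // Destination: the Kinesis Data Firehose stream created in the same stack
  const { deliveryDestination } = await logs.send(new PutDeliveryDestinationCommand({
    name: "firehose-destination",
    deliveryDestinationConfiguration: {
      destinationResourceArn: "arn:aws:firehose:us-east-1:112233445566:deliverystream/event-ingestion",
    },
  }));

  // Delivery: connects the source to the destination
  await logs.send(new CreateDeliveryCommand({
    deliverySourceName: "cloudfront-access-logs",
    deliveryDestinationArn: deliveryDestination.arn,
  }));
}

wireLogDelivery().catch(console.error);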

Configuration

In the root folder of the project, there’s a file called config.json, which contains the project’s configuration that’s used during the stack deployments. It looks like this:

{
  "profile": "default",
  "stage": "prd",
  "region": "us-east-1",
  "s3": {
    "tableBucket": "iceberg-event-analytics-data",
    "namespace": "iceberg_event_analytics",
    "tableName": "event_data"
  }
}

You can configure the AWS credentials profile name, the stage, and the AWS region you want to deploy the stacks to. In the s3 object, the S3 Tables-related resource names can be configured. You can also just use the defaults.

S3 Table Structure

The structure of the target table is defined in a JSON file as well, which is used during the manual creation of the S3 Table resources (see below). The table structure is reused from the OwnStats incoming_events table, though slightly adapted/simplified.

Column name | Data type | Is partition key? | Description
--- | --- | --- | ---
event_id | uuid | yes | The unique ID of the event
event_date | string | no | The date of the event (YYYY-MM-DD), as string
event_timestamp | timestamp | no | The exact event timestamp
arrival_timestamp | timestamp | no | The exact timestamp when the event arrived in the Kinesis Data Stream
edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs)
edge_state | string | no | The state of the edge location
edge_country | string | no | The country of the edge location
edge_country_code | string | no | The country code of the edge location
edge_latitude | float | no | The latitude of the edge location
edge_longitude | float | no | The longitude of the edge location
edge_id | string | no | The original id of the edge location
domain_name | string | no | The domain name where the event was recorded
referrer | string | no | The referrer
referrer_domain_name | string | no | The domain name of the referrer
browser_name | string | no | The name of the browser
browser_version | string | no | The version of the browser
browser_os_name | string | no | The OS name of the browser
browser_os_version | string | no | The OS version of the browser
browser_timezone | string | no | The timezone of the browser
browser_language | string | no | The language of the browser
device_type | string | no | The device type
device_vendor | string | no | The device vendor
device_outer_resolution | string | no | The outer resolution of the device
device_inner_resolution | string | no | The inner resolution of the device
device_color_depth | int | no | The color depth of the device
device_platform | string | no | The platform of the device
device_memory | float | no | The memory of the device (in MB)
device_cores | int | no | The number of cores of the device
utm_source | string | no | Identifies which site sent the traffic
utm_campaign | string | no | Identifies a specific product promotion or strategic campaign
utm_medium | string | no | Identifies what type of link was used, such as cost per click or email
utm_content | string | no | Identifies what specifically was clicked to bring the user to the site
utm_term | string | no | Identifies search terms
request_url | string | no | The full requested URL
request_path | string | no | The path of the requested URL
request_query_string | string | no | The query string of the requested URL
request_bytes | int | no | The size of the request in bytes
request_status_code | int | no | The HTTP status code of the request
request_cache_status | string | no | The CloudFront cache status
request_delivery_time_ms | int | no | The time in ms it took for CloudFront to complete the request
request_asn | int | no | The ASN of the requestor
event_type | string | yes | The type of the event (pageview or track)
event_name | string | no | The name of the event for tracking events
event_data | string | no | The stringified event payload for tracking events

Deployment

To be able to deploy the project, you need to have set up all the above-mentioned prerequisites on the machine you want to deploy from.

💡
You need to deploy the stacks and run the manual steps in the order outlined below, otherwise it won’t work, as they depend on each other.

Deployment of the Base Stack

cd base-stack

# Deploy Base Stack
sls deploy

# Return to project base directory
cd ..

Manual creation of S3 Table resources & Asset sync for CloudFront

cd manual-setup

# Install dependencies
npm i 

# Create S3 Table resources
npm run create-resources

# Sync assets to S3 Bucket that backs the Cloudfront Distribution
npm run sync-s3

# Return to project base directory
cd ..

Deployment of the Processing Stack

cd processing-stack

# Install dependencies
npm i

# Deploy Processing Stack
sls deploy

# Return to project base directory
cd ..

Deployment of the Storage Stack

cd storage-stack

# Deploy Storage Stack
sls deploy

# Return to project base directory
cd ..

Sending test events

If everything went smoothly, you can now send events to your pipeline, which will land in the S3 Table. To do so, you can either use the OwnStats client as stated in the use case description, or use the provided script that produces the same request signature:

cd manual-setup

# Send event
npm run send-event -- "My Event" '{"testevent":true,"message":"foobar"}'

Querying the data in the S3 Table

Using Athena

You can query the data with Athena in the AWS Console. Please make sure that you select the correct catalog and database before you run a sample query on the event_data table:

SELECT * FROM event_data ORDER BY arrival_timestamp DESC

If everything worked, you should see your events in the result browser.

Using DuckDB

I recently wrote a blog post about querying S3 Tables with DuckDB; please have a look to learn how you can do it.

Verdict

Creating low-cost, pay-per-use event data pipelines that store their data in the Apache Iceberg format on AWS is possible. Currently, it still requires a lot of configuration effort, and doesn’t support single-stack IaC, neither via CloudFormation nor via the AWS CDK.

This is a major hurdle for the DX from my current point of view, especially in light of the very recent product releases from Cloudflare on 2025-04-10, namely Pipelines and the R2 Data Catalog.

While those are still beta releases, it’s visible that the innovation in the event processing / efficient Data Lake storage space is accelerating. Once Cloudflare enables Pipelines to directly write Iceberg tables on R2, together with the R2 Data Catalog and the upcoming Containers product (e.g. running DuckDB as an API), you can host your complete Lakehouse data stack on Cloudflare, with potentially much less complexity compared to AWS at the moment.

Also from a cost perspective, this is bound to gain some share of the market, especially for smaller, non-enterprise companies:

Workers Paid ($5 / month) |
--- | ---
Ingestion | First 50 GB per month included
Delivery to R2 | First 50 GB per month included

In this scenario, you’d pay $5 for ingesting 50 GB of data to R2, while on AWS you’d already have to pay $12.50 (50 GB × $0.25/GB) for the CloudWatch Logs Delivery to Kinesis Data Firehose alone.

Also, there’s much less complexity from an IaC perspective compared to the pretty huge infrastructure required on AWS. I’m curious how / if AWS will react.
