Amazon S3 is more than storage: it brings a lot to the analytics ecosystem


At AWS re:Invent 2024, Amazon S3 announced S3 Tables and S3 Metadata (preview), built specifically for analytics workloads. Although I am not deep into building analytics workloads, I come from a background of building ETL pipelines around S3 and Parquet data, so I wanted to explore these new capabilities!
Parquet data in S3
Parquet is a columnar storage format that is efficient for data storage and retrieval and is widely used by ETL and big data processing frameworks such as Apache Spark, Hive, and Amazon Athena. Its columnar layout makes the data query-friendly for services like Athena, which can run SQL directly against Parquet files. Parquet-formatted data is stored like any other object in an S3 bucket, typically identified by the `.parquet` file extension.
Storing this Parquet data in Amazon S3 opens up the opportunity to leverage cloud storage by integrating AWS services such as AWS Glue for ETL, Amazon Redshift for data warehousing, and Amazon SageMaker for ML workloads. Parquet on S3 is not only performant but also cost-efficient: the files are smaller than their CSV equivalents, and storage costs can be reduced further by choosing S3 storage classes based on how frequently the data is queried.
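As a quick sketch of that workflow (the bucket names, the `sales` table, and the Athena workgroup below are all hypothetical, and it assumes an AWS Glue table already points at the Parquet prefix):

```shell
# Upload a local Parquet file to S3 -- it is stored like any other object:
aws s3 cp sales_2024.parquet s3://my-analytics-bucket/sales/sales_2024.parquet

# Query it with Athena (assumes a Glue table "sales" is already defined
# over s3://my-analytics-bucket/sales/):
aws athena start-query-execution \
  --query-string "SELECT region, SUM(amount) FROM sales GROUP BY region" \
  --work-group primary \
  --result-configuration "OutputLocation=s3://my-athena-results-bucket/"
```

Because Athena scans only the columns a query touches, the columnar Parquet layout keeps both scan time and per-query cost down compared with row-oriented formats like CSV.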
S3 Tables
Amazon S3 launched a new bucket type, table buckets, designed for structured data stored as Apache Iceberg tables backed by Apache Parquet data. AWS claims up to 3x faster query performance and up to 10x higher transactions per second compared with Parquet stored as plain objects in a general-purpose S3 bucket, making table buckets ideal for data analytics workloads.
S3 Tables structures data as a hierarchy: a table bucket contains namespaces, and each namespace contains the tables that can be queried from services like Amazon Athena. The catch is that every table must live inside a namespace, and at launch, loading data into tables is supported only from Amazon EMR or open-source Apache Spark. That forces developers through the environment setup of an EMR cluster or a self-hosted Spark installation, so the getting-started experience is effectively gated behind EMR or Spark.
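The hierarchy above can be sketched with the `aws s3tables` CLI commands; the bucket, namespace, table names, and the account ARN below are placeholders:

```shell
# Create the table bucket (the top of the hierarchy):
aws s3tables create-table-bucket --name my-table-bucket

# The call returns the bucket ARN, which the next commands need:
TB_ARN=arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket

# Create a namespace inside the table bucket:
aws s3tables create-namespace \
  --table-bucket-arn "$TB_ARN" \
  --namespace analytics_ns

# Create an Iceberg table inside that namespace:
aws s3tables create-table \
  --table-bucket-arn "$TB_ARN" \
  --namespace analytics_ns \
  --name daily_sales \
  --format ICEBERG
```

This only creates the empty table; as noted above, actually loading data into it at launch means going through EMR or a self-hosted Spark session.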
Setting up one-time integration with AWS Analytics Services
Another step is the one-time, region-wide integration with AWS analytics services, an additional piece of environment setup to follow. Honestly, this could have been done natively by AWS, leaving developers to manage only the permissions for the different analytics services.
S3 Tables with built-in management
What makes S3 Tables performant is the way it handles typical table maintenance:
Data compaction - combines small table objects into larger ones, with a target file size configurable between 64 MB and 512 MB.
Snapshot management - enforces a snapshot lifecycle with a minimum number of snapshots to retain and a maximum snapshot age, eventually deleting expired snapshots.
These factors combine to make S3 Tables performant; you can read about how S3 Tables uses compaction to improve query performance by up to 3x, along with benchmarks comparing uncompacted tables in general-purpose buckets with compacted tables in table buckets.
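Both maintenance behaviors can be tuned per table. A sketch using the `PutTableMaintenanceConfiguration` API follows; the names are placeholders and the exact JSON shapes should be verified against the current S3 Tables documentation:

```shell
TB_ARN=arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket

# Compaction: target file size is configurable between 64 and 512 MB.
aws s3tables put-table-maintenance-configuration \
  --table-bucket-arn "$TB_ARN" \
  --namespace analytics_ns --name daily_sales \
  --type icebergCompaction \
  --value '{"status":"enabled","settings":{"icebergCompaction":{"targetFileSizeMB":256}}}'

# Snapshot management: keep at least 1 snapshot, expire after 48 hours.
aws s3tables put-table-maintenance-configuration \
  --table-bucket-arn "$TB_ARN" \
  --namespace analytics_ns --name daily_sales \
  --type icebergSnapshotManagement \
  --value '{"status":"enabled","settings":{"icebergSnapshotManagement":{"minSnapshotsToKeep":1,"maxSnapshotAgeHours":48}}}'
```

The appeal is that compaction and snapshot expiry run as managed background jobs, rather than Spark jobs you schedule yourself as you would with self-managed Iceberg.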
S3 Metadata
Along with S3 Tables, Amazon S3 also announced S3 Metadata, a bucket-level feature that can be enabled to capture metadata about each object in a general-purpose S3 bucket. S3 Metadata stores this information in S3 Tables, with the power of Parquet underneath, so object metadata becomes queryable and searching for S3 objects by their metadata gets much more efficient.
Types of S3 Metadata
S3 Metadata supports each object in a bucket with metadata that falls into two categories:
System-defined - metadata controlled natively by Amazon S3, such as `Date`, `Content-Length`, `Last-Modified`, and `ETag`, whose values are immutable to the user, along with metadata such as `Cache-Control`, `Content-Disposition`, and `Content-Type`, whose values the user can change.
User-defined - metadata that can be assigned by the user when the object is uploaded; these keys are prefixed with `x-amz-meta-`.
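As a small illustration of user-defined metadata (the bucket and key names are hypothetical), the CLI's `--metadata` flag stores each key under the `x-amz-meta-` prefix:

```shell
# Upload an object with user-defined metadata; "team=finance" is stored
# by S3 as the header x-amz-meta-team.
aws s3api put-object \
  --bucket my-analytics-bucket \
  --key reports/q4.parquet \
  --body ./q4.parquet \
  --metadata team=finance,category=quarterly-report

# Read back both system-defined and user-defined metadata:
aws s3api head-object --bucket my-analytics-bucket --key reports/q4.parquet
```

The `head-object` response includes system-defined fields like `ContentLength`, `LastModified`, and `ETag` alongside a `Metadata` map holding the user-defined keys.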
S3 Metadata in action
Enable S3 Metadata for the S3 bucket from the console, which requires you to set the name of the metadata table; following creation, it also creates the namespace `aws_s3_metadata`. When you navigate to Table Buckets, you can see the S3 table bucket with the tables created and listed. When you upload a new object to the S3 bucket, you can optionally define metadata, either system-defined or user-defined. Once uploaded, you can view the metadata from the S3 console and also query it with Athena.
Why would I choose S3 Metadata?
Since the data is stored as Apache Iceberg via S3 Tables, one question that hit me was: why would I enable S3 Metadata instead of implementing the same thing directly with S3 Tables?
S3 Metadata is crucial for organizing and managing the data in S3 buckets. Imagine a system that extensively uses S3 buckets to store data of any format; enabling S3 Metadata means you get queryable metadata for every object (including user-defined metadata) without building and maintaining that ingestion pipeline yourself, which helps with data categorization and management. Additionally, with object metadata available as extra attributes, data discovery becomes a pivotal capability for applications.
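To make the discovery point concrete, here is a hypothetical Athena query filtering on a user-defined key, assuming the generated metadata table exposes user metadata as a map-typed column (verify the column name and type against your generated schema):

```shell
# Find all objects tagged at upload time with
# x-amz-meta-category: quarterly-report.
aws athena start-query-execution \
  --query-string "SELECT key
                  FROM \"s3tablescatalog/my-table-bucket\".\"aws_s3_metadata\".\"my_bucket_metadata\"
                  WHERE user_metadata['category'] = 'quarterly-report'" \
  --work-group primary \
  --result-configuration "OutputLocation=s3://my-athena-results-bucket/"
```

Building this yourself on raw S3 Tables would mean writing the capture pipeline (event notifications, a writer job, schema management); S3 Metadata gives you the same queryable view as a managed feature.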
Written by

Jones Zachariah Noel N
A Developer Advocate experiencing the DevRel ecospace at Freshworks. Previously part of the Bengaluru-based start-up Mobil80 Solutions, where I enjoyed and learnt a lot wearing multiple hats while transitioning from a full-stack developer to a Cloud Architect for serverless. An AWS Serverless Hero who loves interacting with the community, which has helped me learn and share my knowledge. I write about AWS serverless and also talk about new features and announcements from AWS. Speaker at various conferences globally, including AWS Community Days, AWS Summits, and AWS DevDays, sharing about cloud, AWS, serverless, and the Freshworks Developer Platform.