Using Amazon SageMaker Lakehouse with DuckDB


Preconditions
To use the Amazon SageMaker Lakehouse with DuckDB, you first have to create a S3 Table bucket, a namespace and an actual S3 Table. All those steps are described in my other blog post “Query S3 Tables with DuckDB”, so please make sure you followed the outlined (manual) steps of the Setting up a S3 Table section before continuing with this blog post.
The setup of the pretty complicated permissions on the AWS side is implemented according to their blog post “Access data in Amazon S3 Tables using PyIceberg through the AWS Glue Iceberg REST endpoint”
Create IAM role and policy
Check your ARNs
Before creating your IAM policy, you need to check the ARNs of your S3 Table bucket, and also note the namespace.
So, if your S3 Table bucket lives in us-east-1, go to the AWS Console and check the ARN for it:
In our case it’s arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog/duckdb-test
and the namespace we created as of the referenced blog post is test
.
Create the policy
Now, go the the IAM Policies screen of the AWS Console, and click on the Create policy button. Click on the JSON button, and paste the following JSON policy into the editor.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "SagemakerLakehouseTest",
"Effect": "Allow",
"Action": [
"glue:GetCatalog",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:CreateTable",
"glue:UpdateTable"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog",
"arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog/duckdb-test",
"arn:aws:glue:us-east-1:123456789012:table/s3tablescatalog/duckdb-test/test/*",
"arn:aws:glue:us-east-1:123456789012:database/s3tablescatalog/duckdb-test/test"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": "*"
}
]
}
Then update your AWS Account ID, the S3 Table bucket name and the namespace accordingly if you chose other values for them.
Then press Next, and use s3tables-duckdb-test
as policy name and save the policy with Save policy.
Create the role
Go to the IAM Roles screen of the AWS Console, and click on the Create role button.
Now click on Next. Select your previously created policy s3tables-duckdb-test
as trust policy by clicking the checkbox in front of the policy and clicking Next:
Enter the role name s3tables-duckdb-test-role
and click on Create role.
Define access control using Lake Formation
Application integration setup
In Lake Formation, enable full table access for external engines to access data. This allows third-party applications to get the Lake Formation temporary credential using an IAM role(s) that has full permissions (ALL) on the requested table.
Go to the AWS Lake Formation Dashboard, and then on the Left Pane, expand the Administration section, then Application integration settings and choose Allow external engines to access data in Amazon S3 locations with full table access and click on Save:
Set up Lake Formation grants
For the newly created role s3tables-duckdb-test-role
to be able to actually access the database and the tables, you need to provide database and table level permissions.
Database-level permissions
In the Lake Formation Data permissions, click on Grant. Then, in the Principals section, choose the radio button IAM users and roles, and from the drop-down choose s3tables-duckdb-test-role
. In the LF-Tags or catalog resources section, choose Named Data Catalog resources:
Choose
123456789012:s3tablescatalog/duckdb-test
for CatalogChoose
test
for DatabasesChoose CREATE TABLE, DESCRIBE for database permissions
Click on Grant
Table-level permissions
In the Lake Formation Data permissions, click on Grant. Then, in the Principals section, choose the radio button IAM users and roles, and from the drop-down choose s3tables-duckdb-test-role
. In the LF-Tags or catalog resources section, choose Named Data Catalog resources:
Choose
123456789012:s3tablescatalog/duckdb-test
for CatalogChoose
test
for DatabasesChoose All Tables for tables
Choose SUPER for table permissions.
Click on Grant.
Quick check in Athena
We verify in Athena that the table daily_sales
exists, and that we have data, before going to the DuckDB CLI to test the catalog integration:
Validate IAM role
You can validate the created IAM role on the target machine by issuing the following command (this assumes you have installed the AWS CLI):
aws sts assume-role --role-arn "arn:aws:iam::123456789012:role/s3tables-duckdb-test-role" --role-session-name s3tables-duckdb-test-role
Usage with DuckDB
Create DuckDB secret for catalog access
CREATE SECRET (
TYPE s3,
PROVIDER credential_chain,
CHAIN sts,
ASSUME_ROLE_ARN 'arn:aws:iam::123456789012:role/s3tables-duckdb-test-role',
REGION 'us-east-1'
);
Attach the Lake Formation (Glue) catalog
ATTACH '123456789012:s3tablescatalog/duckdb-test' AS s3tables_datalake (
TYPE ICEBERG,
ENDPOINT_TYPE 'GLUE'
);
List tables
SHOW ALL TABLES;
Query table
SELECT * FROM s3tables_datalake.test.daily_sales LIMIT 5;
CLI screenshot
Summary
We were able to show that the integration with the AWS Sagemaker Catalog (a.k.a. AWS Glue Iceberg REST endpoint) works with DuckDB. Once the quite tedious IAM role/policy, as well as the Lake Formation permission setup are done, the usage is pretty straight forward with simple CREATE SECRET
and ATTACH
statements.
References
DuckDB docs: Amazon Sagemaker Lakehouse
AWS Blog Post: Access data in Amazon S3 Tables using PyIceberg through the AWS Glue Iceberg REST endpoint
Blog post: Query S3 Tables with DuckDB
Subscribe to my newsletter
Read articles from Tobias Müller directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
