Using Amazon SageMaker Lakehouse with DuckDB

Tobias MüllerTobias Müller
5 min read

Preconditions

To use the Amazon SageMaker Lakehouse with DuckDB, you first have to create a S3 Table bucket, a namespace and an actual S3 Table. All those steps are described in my other blog post “Query S3 Tables with DuckDB”, so please make sure you followed the outlined (manual) steps of the Setting up a S3 Table section before continuing with this blog post.

The setup of the pretty complicated permissions on the AWS side is implemented according to their blog post “Access data in Amazon S3 Tables using PyIceberg through the AWS Glue Iceberg REST endpoint”

Create IAM role and policy

💡
As outlined in the previous blog post, you already created a S3 Table bucket called “duckdb-test” and a namespace “test“. In the following examples, I will not use my real AWS Account ID, but the artificial “123456789012” instead. Please keep in mind to use your actual AWS Account ID if you follow to implement this in your account.

Check your ARNs

Before creating your IAM policy, you need to check the ARNs of your S3 Table bucket, and also note the namespace.

So, if your S3 Table bucket lives in us-east-1, go to the AWS Console and check the ARN for it:

In our case it’s arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog/duckdb-test and the namespace we created as of the referenced blog post is test.

Create the policy

Now, go the the IAM Policies screen of the AWS Console, and click on the Create policy button. Click on the JSON button, and paste the following JSON policy into the editor.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SagemakerLakehouseTest",
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:123456789012:catalog",
                "arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog",
                "arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog/duckdb-test",
                "arn:aws:glue:us-east-1:123456789012:table/s3tablescatalog/duckdb-test/test/*",
                "arn:aws:glue:us-east-1:123456789012:database/s3tablescatalog/duckdb-test/test"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}

Then update your AWS Account ID, the S3 Table bucket name and the namespace accordingly if you chose other values for them.

Then press Next, and use s3tables-duckdb-test as policy name and save the policy with Save policy.

Create the role

Go to the IAM Roles screen of the AWS Console, and click on the Create role button.

💡
This will require an existing IAM role or IAM user. Normally, this will be a role you’re assuming via STS, or a user whose credentials you stored on the machine you’re using. The setup of this is not part of this blog post.

Now click on Next. Select your previously created policy s3tables-duckdb-test as trust policy by clicking the checkbox in front of the policy and clicking Next:

Enter the role name s3tables-duckdb-test-role and click on Create role.

Define access control using Lake Formation

Application integration setup

In Lake Formation, enable full table access for external engines to access data. This allows third-party applications to get the Lake Formation temporary credential using an IAM role(s) that has full permissions (ALL) on the requested table.

Go to the AWS Lake Formation Dashboard, and then on the Left Pane, expand the Administration section, then Application integration settings and choose Allow external engines to access data in Amazon S3 locations with full table access and click on Save:

Set up Lake Formation grants

For the newly created role s3tables-duckdb-test-role to be able to actually access the database and the tables, you need to provide database and table level permissions.

Database-level permissions

In the Lake Formation Data permissions, click on Grant. Then, in the Principals section, choose the radio button IAM users and roles, and from the drop-down choose s3tables-duckdb-test-role. In the LF-Tags or catalog resources section, choose Named Data Catalog resources:

  • Choose 123456789012:s3tablescatalog/duckdb-test for Catalog

  • Choose test for Databases

  • Choose CREATE TABLE, DESCRIBE for database permissions

  • Click on Grant

Table-level permissions

In the Lake Formation Data permissions, click on Grant. Then, in the Principals section, choose the radio button IAM users and roles, and from the drop-down choose s3tables-duckdb-test-role. In the LF-Tags or catalog resources section, choose Named Data Catalog resources:

  • Choose 123456789012:s3tablescatalog/duckdb-test for Catalog

  • Choose test for Databases

  • Choose All Tables for tables

  • Choose SUPER for table permissions.

  • Click on Grant.

Quick check in Athena

We verify in Athena that the table daily_sales exists, and that we have data, before going to the DuckDB CLI to test the catalog integration:

Validate IAM role

You can validate the created IAM role on the target machine by issuing the following command (this assumes you have installed the AWS CLI):

aws sts assume-role --role-arn "arn:aws:iam::123456789012:role/s3tables-duckdb-test-role" --role-session-name s3tables-duckdb-test-role

Usage with DuckDB

Create DuckDB secret for catalog access

CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain,
    CHAIN sts,
    ASSUME_ROLE_ARN 'arn:aws:iam::123456789012:role/s3tables-duckdb-test-role',
    REGION 'us-east-1'
);

Attach the Lake Formation (Glue) catalog

ATTACH '123456789012:s3tablescatalog/duckdb-test' AS s3tables_datalake (
    TYPE ICEBERG,
    ENDPOINT_TYPE 'GLUE'
);

List tables

SHOW ALL TABLES;

Query table

 SELECT * FROM s3tables_datalake.test.daily_sales LIMIT 5;

CLI screenshot

Summary

We were able to show that the integration with the AWS Sagemaker Catalog (a.k.a. AWS Glue Iceberg REST endpoint) works with DuckDB. Once the quite tedious IAM role/policy, as well as the Lake Formation permission setup are done, the usage is pretty straight forward with simple CREATE SECRET and ATTACH statements.

References

0
Subscribe to my newsletter

Read articles from Tobias Müller directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Tobias Müller
Tobias Müller