Understanding Amazon Athena Partitioning: Traditional vs. Projection

Phil
5 min read

What is Partitioning?

In the context of data warehousing and querying, partitioning is the process of dividing a large dataset into smaller, more manageable parts based on the values of one or more columns, known as partition keys. This division allows for more efficient data retrieval, as queries can target only the relevant partitions, significantly reducing the amount of data scanned.

Think of it like organizing a library. Instead of searching through every book, you categorize books by genre, author, or publication date. When you need a specific book, you only search within the relevant category.

Partitioning in Amazon Athena

Amazon Athena, a serverless query service, empowers users to analyze data directly in S3. But as your data grows, optimizing query performance becomes critical. That's where partitioning comes in. Two main approaches exist: traditional partitioning and partition projection. Let's delve into their differences and help you decide which is best for your use case.

Traditional Partitioning: The Classic Approach

Traditional partitioning involves physically organizing your data in S3 into separate folders based on partition keys (e.g., accountid, region, year, month, day). Athena then uses these folder structures to filter data during queries.

How it works:

  • Data Organization: You manually organize your S3 buckets with nested folders based on your chosen partition keys, adhering to either Hive-style (e.g., s3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-26/) or non-Hive-style structures (e.g., s3://my-bucket/vpcflowlogs/827364678983/ap-northeast-1/2023/10/26/).

  • Table Definition: You define your Athena table by specifying the partition keys.
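    As a sketch, a table for the VPC Flow Logs layout above might be defined like this. The table name and the column list are hypothetical (a small subset of flow log fields), chosen only to illustrate the PARTITIONED BY clause:

    ```sql
    -- Partition columns (accountid, region, date) are declared separately
    -- from data columns; they are derived from the S3 folder structure.
    CREATE EXTERNAL TABLE vpcflowlogs (
      version int,
      srcaddr string,
      dstaddr string,
      action  string
    )
    PARTITIONED BY (
      accountid string,
      region    string,
      `date`    string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3://my-bucket/vpcflowlogs/';
    ```

    Note that the partition columns do not appear in the data files themselves; Athena resolves their values from each partition's location.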

  • MSCK REPAIR TABLE: After adding new partitions, you run MSCK REPAIR TABLE to update the AWS Glue Data Catalog with the new partition locations. Note that MSCK REPAIR TABLE only supports Hive-style partitions.

  • Managing partitions manually: Partitions can be added to the Glue Data Catalog by using the command ALTER TABLE ADD PARTITION. This can be done manually, or automated by integrating with services like Amazon EventBridge and AWS Lambda.

    • For example, if your S3 location is structured as s3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-26/ and your table is partitioned by accountid, region, and date, you need to add partitions when new logs are written to your S3 or when a new account joins. You can do this using ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE.

        ALTER TABLE table_name ADD IF NOT EXISTS
          PARTITION (accountid = '827364678983', region = 'ap-northeast-1', `date` = '2023-10-26')
          LOCATION 's3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-26/';
      
        MSCK REPAIR TABLE table_name;
      
      
  • Querying: When processing queries, Athena makes a GetPartitions API call to the AWS Glue Data Catalog before performing partition pruning.
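    For instance, assuming a table named vpcflowlogs partitioned by accountid, region, and date (matching the S3 layout above; column names are illustrative), a query that filters on all three partition keys lets Athena prune the scan to a single day's folder:

    ```sql
    -- Only the date=2023-10-26 partition under the matching account
    -- and region is scanned; all other folders are skipped.
    SELECT srcaddr, dstaddr, action
    FROM vpcflowlogs
    WHERE accountid = '827364678983'
      AND region = 'ap-northeast-1'
      AND `date` = '2023-10-26';
    ```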

  • Cost and Performance:

    • Service Costs: Manual partition management incurs costs associated with the AWS services used for partition addition. Additionally, integrating multiple services may introduce latency, impacting real-time data access. Lambda function timeouts can pose limitations for large datasets.

      • Scenario: When a new object arrives in S3, it triggers an AWS Lambda function via Amazon EventBridge to add a partition to the AWS Glue Data Catalog.
    • API Calls: Athena initiates GetPartitions API calls to the AWS Glue Data Catalog before data retrieval, potentially increasing costs and introducing latency, especially with large datasets.

    • Metadata Storage: A high volume of registered partitions in the AWS Glue Data Catalog contributes to increased metadata storage costs.

  • Operational Maintenance: Manual partition management introduces operational overhead and increases the risk of errors, such as missing partitions or inconsistencies.

Partition Projection: The Dynamic Approach

Partition Projection eliminates the need for manual or automated partition management in the Glue Data Catalog by dynamically inferring partitions based on predefined patterns. As stated in AWS documentation, Athena calculates partition values and locations using table properties defined in AWS Glue. This allows Athena to 'project' or determine necessary partition information, bypassing time-consuming metadata lookups.

How it works:

  • Data Organization: You manually organize your S3 buckets with nested folders based on your chosen partition keys, adhering to either Hive-style (e.g., s3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-26/) or non-Hive-style structures (e.g., s3://my-bucket/vpcflowlogs/827364678983/ap-northeast-1/2023/10/26/).

  • Table Definition: To use partition projection, you specify the range of partition values and projection types for each partition column in the table properties when creating a table in the AWS Glue Data Catalog. For example, you can define a year partition with a range from 2020 to 2024 or up to the present. For more information on supported projection types, see the AWS documentation on partition projection.
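    A sketch of a projection-enabled table for the layout above might look like this. The table name, column list, and the enum values for region are hypothetical; the projection properties themselves (projection.enabled, projection.*.type, storage.location.template, and so on) are the table properties Athena reads at query time:

    ```sql
    CREATE EXTERNAL TABLE vpcflowlogs (
      version int,
      srcaddr string,
      dstaddr string,
      action  string
    )
    PARTITIONED BY (
      accountid string,
      region    string,
      `date`    string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3://my-bucket/vpcflowlogs/'
    TBLPROPERTIES (
      'projection.enabled'       = 'true',
      -- 'injected' means the query must supply the accountid value
      'projection.accountid.type' = 'injected',
      'projection.region.type'    = 'enum',
      'projection.region.values'  = 'ap-northeast-1,us-east-1',
      'projection.date.type'      = 'date',
      'projection.date.range'     = '2020-01-01,NOW',
      'projection.date.format'    = 'yyyy-MM-dd',
      -- Maps projected partition values back onto the S3 folder layout
      'storage.location.template' =
        's3://my-bucket/vpcflowlogs/accountid=${accountid}/region=${region}/date=${date}/'
    );
    ```

    With an injected column like accountid, queries must pin it to a specific value (an equality predicate) so Athena can substitute it into the location template.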

  • Managing Partitions: Partitions are managed automatically. MSCK REPAIR TABLE and ALTER TABLE ADD PARTITION are not required. When new data arrives in your S3 data lake, Athena dynamically determines partition locations based on the defined patterns during query execution.

    • For example, if your S3 location is structured as s3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-26/ and your table is partitioned by accountid, region, and date, you don't need to manually add partitions when new data arrives for dates like the 27th and 28th. This means there's no need for ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE to update the Glue Data Catalog.

        s3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-27/
        s3://my-bucket/vpcflowlogs/accountid=827364678983/region=ap-northeast-1/date=2023-10-28/
      
  • Querying: When processing queries, Athena does not make a GetPartitions API call to the AWS Glue Data Catalog. Instead, it computes the relevant S3 locations from the projection configuration and the partition key values you provide in the query, and fetches the data directly.
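    Queries are written exactly as with traditional partitioning. Assuming a projection-enabled table named vpcflowlogs as in the layout above (names are illustrative), Athena expands a date range into the matching S3 prefixes without any catalog lookup:

    ```sql
    -- Athena computes the folders for 2023-10-26 through 2023-10-28
    -- from the projection configuration; no GetPartitions call is made.
    SELECT `date`, count(*) AS flows
    FROM vpcflowlogs
    WHERE accountid = '827364678983'
      AND region = 'ap-northeast-1'
      AND `date` BETWEEN '2023-10-26' AND '2023-10-28'
    GROUP BY `date`;
    ```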

  • Cost and Performance:

    • Service Costs: Eliminating manual partition management reduces costs associated with auxiliary AWS services and eliminates latency concerns.

    • API Calls: Athena bypasses GetPartitions API calls to the AWS Glue Data Catalog, reducing API-related costs.

    • Metadata Storage: Partition projection avoids storing partition metadata in the AWS Glue Data Catalog, minimizing metadata storage costs.

  • Operational Maintenance: This approach greatly reduces maintenance work. You set up partition patterns once, and Athena takes care of the rest. It's easier to manage and update partitions, especially for data that changes often.

Summary Table:

| Feature     | Traditional Partitioning                                                                                   | Partition Projection                                            |
| ----------- | ---------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| Cost        | Potential for increased costs due to metadata requests and storage; incurs costs for auxiliary AWS services. | Generally cost-effective due to the elimination of metadata management. |
| Performance | MSCK REPAIR TABLE and ALTER TABLE ADD PARTITION may introduce latency.                                      | Typically faster due to dynamic partition inference.            |
| Maintenance | High manual effort, prone to errors and inconsistencies.                                                    | Low overhead, simplified management.                            |

Conclusion

Both traditional partitioning and partition projection are effective strategies for optimizing Athena query performance. The optimal approach depends on your specific data characteristics and workflow requirements. Partition projection generally offers advantages in terms of reduced overhead and improved performance. However, traditional partitioning remains valuable for complex scenarios requiring fine-grained control.

