AWS Athena Overview


Key Features
Serverless – No need to manage infrastructure; AWS handles provisioning, scaling, and maintenance.
SQL-Based Queries – Uses standard SQL (Presto/Trino) to analyze data stored in Amazon S3.
Pay-Per-Query – Charges are based on the amount of data scanned per query.
Schema-on-Read – Unlike traditional databases, Athena does not require predefined schema; it infers the schema when querying data.
Integration with AWS Services – Works well with AWS Glue for metadata cataloging, AWS QuickSight for visualization, and AWS Lake Formation for data governance.
Secure and Compliant – Supports IAM-based access control, data encryption with AWS KMS, and integration with AWS Macie for security insights.
Supports Multiple Data Formats – Can query structured and semi-structured data in formats like CSV, JSON, ORC, Avro, and Parquet.
Federated Query Support – Can query data across multiple sources, including relational databases, NoSQL databases, and on-premises data stores.
How AWS Athena Works
Data Storage:
Athena queries data stored in Amazon S3.
Data remains in S3, and Athena does not store data separately.
Schema Definition:
Users define a schema using AWS Glue or manually in Athena’s console.
Schema-on-Read approach means you don’t need to load data before querying.
Query Execution:
Users submit SQL queries via the AWS Console, API, or SDKs.
Athena uses a distributed query engine (Presto/Trino) to process the request.
The result is returned to the user and optionally stored in S3.
Performance Optimization:
Use partitioning to reduce the amount of data scanned.
Convert data to columnar formats like Parquet or ORC to improve query performance.
Utilize AWS Glue Catalog for schema management.
Use Case Example
Scenario: Analyzing Web Logs Stored in Amazon S3
Context:
A company stores its web server logs in Amazon S3 and wants to analyze user behavior without setting up a traditional database.
Solution Using AWS Athena:
Upload log files (e.g., JSON or CSV format) to Amazon S3.
Create a table in Athena with a schema matching the log file structure.
Run SQL queries like:
SELECT user_ip, COUNT(*) AS visit_count FROM web_logs WHERE request_url = '/checkout' GROUP BY user_ip ORDER BY visit_count DESC LIMIT 10;
Results show the top 10 users accessing the checkout page, helping the business optimize conversions.
Benefits:
✅ No need to set up a database.
✅ Pay only for the queries run.
✅ Scale automatically for large datasets.
Topics for AWS DVA-C02
For the AWS Developer Associate exam (DVA-C02), focus on the following Athena-related topics:
Integration with AWS S3
How Athena queries S3 data
Best practices for storing and structuring S3 data
Query Performance Optimization
Using partitions and bucketing
Converting data to Parquet/ORC for efficiency
Security & Access Control
Managing IAM policies for Athena
Encrypting query results with AWS KMS
AWS Glue & Schema Management
Using AWS Glue Catalog to define schemas
Automating schema discovery with Glue Crawlers
Federated Queries
- Querying external databases using Athena Federated Query
Monitoring & Logging
Using AWS CloudTrail to audit Athena queries
Tracking query execution history in AWS Console
Subscribe to my newsletter
Read articles from Trushang Suthar directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Trushang Suthar
Trushang Suthar
Code is like a puzzle—sometimes you just need to step back, take a breath, and the solution clicks.