22: AWS Athena Setup and Optimization 📊

Hi Data Folks! 👋
In this blog, we’ll walk through setting up AWS Athena for querying data in S3, defining table structures using both manual metadata and Glue Crawlers, and optimizing query performance with techniques like partitioning and columnar file formats. Let’s dive in!
Example Scenario and Setup
AWS Athena is a powerful serverless query service that enables data analysis using standard SQL directly on data stored in Amazon S3.
Consider a scenario where we need to analyze 5 TB of data stored in Amazon S3 once a week. Since performance is not a key requirement, we can use Athena, which charges only for the amount of data scanned during each analysis.
Since data is not moved from S3 into a cluster via an ETL process, we define metadata on top of the S3 data to achieve a table structure. This metadata can either be defined manually or through a Glue Crawler, which stores the metadata inside the Glue Data Catalog.
Inside Athena's query editor, we first need to set a query result location folder in S3; we can begin running queries thereafter.
- We create an S3 bucket 'mehul-athena-demo', which will store the required folders.
- Inside the S3 bucket, we create two folders, 'athena-input' and 'athena-query-results', which will store the input data and the query results respectively.
- Now, inside Athena's query editor settings, we provide the location of the query results folder created above.
- We create a new database 'mehul_db' and select it as the database in which we will create tables and query the data.
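For reference, the same database can also be created directly from the query editor with a single DDL statement; a minimal sketch (the walkthrough above uses the console settings instead):

-- Creates the database if it doesn't already exist
CREATE DATABASE IF NOT EXISTS mehul_db;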
Manual Metadata Definition
- Consider the data stored in the 'students_with_header.csv' file, which we wish to query using Athena.
- We upload this CSV file into the 'athena-input' folder inside the S3 bucket.
- Following the first approach of defining the table structure manually, we create a new table 'students' from the S3 bucket data.
- We provide the 'athena-input' folder's location as the dataset location and the corresponding file format (CSV). Note the table type and serialization-deserialization (SerDe) values inside the data format specifications as well.
- In order to define the schema of the table, we can bulk add columns corresponding to the CSV data that we uploaded into the 'athena-input' folder.
- Inside table properties, we provide the value for 'skip.header.line.count' as 1, so that the first line of our CSV file is treated as a header.
- Afterwards, we click on 'Create table' and the 'students' table gets created successfully, using the schema defined above; see the equivalent DDL sketch after this list.
- On previewing the data using a SQL query, we can see the records.
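Behind the scenes, the console generates a CREATE EXTERNAL TABLE statement. A minimal sketch of equivalent DDL is shown below; the 'id' and 'name' columns are assumed purely for illustration, since only 'subject' and 'score' appear in the queries that follow:

-- Hedged sketch of the DDL the console produces for a CSV-backed table
CREATE EXTERNAL TABLE IF NOT EXISTS mehul_db.students (
  id INT,           -- assumed column, for illustration only
  name STRING,      -- assumed column, for illustration only
  subject STRING,
  score DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
STORED AS TEXTFILE
LOCATION 's3://mehul-athena-demo/athena-input/'
TBLPROPERTIES ('skip.header.line.count' = '1');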
Let's say we want to write a SQL query to find the average score of students in one particular subject.
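A sketch of such a query, using the column names assumed above:

-- Average score for a single subject; with CSV, Athena still scans the full dataset
SELECT AVG(score) AS avg_score
FROM students
WHERE subject = 'Math';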
As we can see, even after selecting only two columns (subject, score) and filtering on only one value in the 'subject' column (Math), the entire dataset gets scanned (57.41 MB), which is quite inefficient.
- The main reason for this huge data scan is that we used a row-based file format (CSV) above.
Column-based File Format and Partitioning
- In order to minimize the amount of data scanned by a query, we can optimize in three ways:
  - Use compression techniques like Snappy, Gzip, etc.
  - Use column-based file formats like Parquet or ORC, through which we can achieve column-level pruning.
  - Partition the data so that we can achieve partition pruning.
For our scenario, we will change the file format to Parquet, to skip unnecessary columns, and create a partition on the 'subject' column, to skip unnecessary rows.
In order to generate the 'students' dataset in Parquet format, partitioned on the 'subject' column, we provide the 'students_with_header.csv' file to the following PySpark code, which generates a new folder 'students_output_parquet'.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Parquet_Dataset") \
        .getOrCreate()

    # Read the CSV file, using the header row and inferring column types
    students_df = spark.read \
        .format("csv") \
        .option("inferSchema", True) \
        .option("header", True) \
        .option("path", "/content/data/students_with_header.csv") \
        .load()

    # Write the data in Parquet format, partitioned by the 'subject' column
    students_df.write \
        .partitionBy("subject") \
        .parquet("/content/data/students_output_parquet")
- On executing the code, we get the 'students_output_parquet' data folder in Parquet format, partitioned by 'subject'.
- Notice the Parquet files inside each individual 'subject' partition folder.
- Now, we upload these 14 partition folders into a new folder 'athena-input-parquet' inside our S3 bucket.
- Also, we create a new table 'students_partitioned' on top of this input folder, and provide 'subject' as the partition column in the partition details; a sketch of the equivalent DDL is shown below.
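A minimal sketch of the equivalent DDL for the partitioned table (same assumed columns as before, with 'subject' moved into the partition specification):

CREATE EXTERNAL TABLE IF NOT EXISTS mehul_db.students_partitioned (
  id INT,           -- assumed column, for illustration only
  name STRING,      -- assumed column, for illustration only
  score DOUBLE
)
PARTITIONED BY (subject STRING)   -- partition column is not repeated in the column list
STORED AS PARQUET
LOCATION 's3://mehul-athena-demo/athena-input-parquet/';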
Before previewing the data inside the 'students_partitioned' table, we need to run the 'MSCK REPAIR TABLE' command on this table, without which the partitions don't get recognized by the metastore. This command adds the partitions into the metastore.
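The command is simply:

-- Scans the table's S3 location and registers the subject=<value> partition folders
MSCK REPAIR TABLE students_partitioned;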
Now, even if we select the entire dataset, only 10.48 MB of data gets scanned.
- If we run the earlier SQL query that finds the average score of students in a particular subject, we note that it scans only 137.73 KB of data instead of the entire 57.41 MB scanned earlier, as sketched below.
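The same aggregation against the partitioned Parquet table (same assumed schema) now benefits from both column pruning and partition pruning:

-- Only the 'score' column and the subject=Math partition are read
SELECT AVG(score) AS avg_score
FROM students_partitioned
WHERE subject = 'Math';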
Dynamic Metadata Definition
A Glue Crawler scans data stored in S3 and automatically infers column names, data types, and other metadata to create or update tables in the Glue Data Catalog.
Inside AWS Glue, we create a crawler 'mehul-athena-crawler-1' to catalog the data in the S3 bucket.
- We add the S3 path of the 'athena-input' folder as the input data source.
- After creating an IAM role that grants Glue the required access permissions, we create the crawler.
- Afterwards, we run the crawler; it completes successfully and creates the required table.
- We can confirm that the table 'mehulathena_input' gets created successfully.
- Also, inside the Glue Data Catalog, we can see all the schema information and metadata related to the 'mehulathena_input' table.
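Once the crawler has populated the catalog, the table can be queried from Athena like any other; a quick sanity check might look like this:

-- Preview a few records from the crawler-created table
SELECT *
FROM mehulathena_input
LIMIT 10;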
Types of Athena Analytics Engines
Trino SQL Engine: We have used the Trino SQL engine for the queries above. This engine is designed for massive data processing and utilizes resources from a shared pool to handle diverse data queries.
Apache Spark Engine: While creating a workgroup, we can also select the Apache Spark engine as the analytics engine. This engine is suitable for complex data processing tasks where intensive, advanced analytics are required.
Conclusion
AWS Athena provides a flexible and cost-effective way to analyze large datasets directly in S3. By optimizing your data with columnar file formats like Parquet, partitioning, and dynamic schema definitions using Glue Crawlers, you can significantly reduce query costs and improve performance. These techniques can be powerful tools for building efficient data pipelines and scalable analytics solutions.
Stay tuned for more such blogs! 🔔