Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide
Introduction
Amazon S3 is a widely used cloud storage service for storing and retrieving data. AWS Data Wrangler (awswrangler) is a Python library that simplifies interacting with various AWS services, including Amazon S3, especially in combination with Pandas DataFrames. In this article, I will guide you through the process of using the awswrangler library effectively to interact with Amazon S3, focusing on data manipulation with Pandas DataFrames.
Table of Contents
Introduction to AWS Data Wrangler
What is AWS Data Wrangler?
Key Features and Benefits
Prerequisites
Set Up Your AWS Account
Install Required Libraries
Connecting to Amazon S3
Creating an S3 Bucket
Creating a Connection
Uploading and Downloading Data
Uploading Data to S3
Reading Data from S3
Conclusion
Leveraging awswrangler for S3 Data Operations
Further Learning Resources
1. Introduction to AWS Data Wrangler
What is AWS Data Wrangler?
AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.
Key Features and Benefits
Seamless Integration: Integrate AWS services with Pandas DataFrames using familiar methods.
Efficient Data Manipulation: Perform data transformations efficiently using optimized Pandas methods.
Simplified Connection: Easily configure AWS credentials and establish connections to AWS services.
Error Handling: Built-in error handling and logging mechanisms for improved reliability.
2. Prerequisites
Set Up Your AWS Account
Before you start, ensure you have an AWS account set up, an IAM (Identity and Access Management) user with S3 access permissions, and the AWS CLI configured locally.
Install Required Libraries
It is generally good practice to work in isolated environments, especially when trying out new Python libraries, so if you are a conda user, you should first create a conda environment where you will install awswrangler.
First create your conda environment by running:
conda create -n data-wrangling python=3.11
Then activate the environment by running:
conda activate data-wrangling
Now it is time to install the required libraries inside the environment using the following command:
pip install awswrangler pandas boto3
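If you want to make sure the installation went well, you can open a Python session in the new environment and print the library version as a quick sanity check:

# Quick sanity check that awswrangler is importable in the new environment
import awswrangler as wr
print(wr.__version__)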
3. Connecting to Amazon S3
Creating an S3 Bucket
To create the S3 bucket we will use the AWS CLI. If you followed the previous guidelines on setting up your AWS account, your access keys should be stored in your %USERPROFILE%\.aws directory (on Windows, typically C:\Users\<username>\.aws).
You will then be able to create the bucket from the command line by running:
aws s3api create-bucket --bucket aws-sdk-pandas72023 --region us-east-2 --create-bucket-configuration LocationConstraint=us-east-2
I have called the bucket aws-sdk-pandas72023, but you can name it whatever you like as long as it follows the S3 bucket naming rules.
The command line will return a confirmation with the bucket's location, and you will be able to see the newly created bucket in the S3 console of your AWS account.
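If you prefer to verify from the terminal, you can list the buckets in your account with the AWS CLI:

aws s3 ls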
Creating a Connection
The awswrangler library internally handles sessions and AWS credentials through boto3 in order to connect to your bucket, so besides importing awswrangler you should also:
import boto3
All the packages you would need to import are the following:
#Importing required libraries
import awswrangler as wr
import yfinance as yf
import boto3
import pandas as pd
import datetime as dt
from datetime import date, timedelta
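awswrangler picks up the default boto3 session and credentials automatically, but you can also create an explicit session and pass it to any call. The following is a minimal setup sketch: the profile name is an assumption (adjust it to your own AWS configuration), and the bucket variable refers to the bucket created earlier.

# Minimal setup sketch: the profile name is hypothetical, adjust to your own AWS config
session = boto3.Session(profile_name="default", region_name="us-east-2")
bucket = "aws-sdk-pandas72023"
# Any awswrangler call accepts an explicit boto3 session, e.g.:
# wr.s3.list_objects(f"s3://{bucket}/", boto3_session=session)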
You can also git clone the repository that has the code used in this tutorial.
If you have cloned the repository, you will have noticed that we are using the yfinance library to extract stock data from the API and store it in a pandas DataFrame, so we can then write the extracted DataFrames to the previously created S3 bucket using awswrangler.
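The exact implementation of get_data_from_api lives in the repository; as a rough, hypothetical sketch of what such a function might look like with yfinance (the body below is my own illustration, not the repository code):

def get_data_from_api(ticker: str, days: int = 365) -> pd.DataFrame:
    # Download roughly one year of daily price history for the given ticker
    end = date.today()
    start = end - timedelta(days=days)
    return yf.download(ticker, start=start, end=end)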
The awswrangler API can read and write data in a wide range of file formats and interact with numerous AWS services; refer to this list for more information.
4. Uploading and Downloading Data from S3 Buckets
Covering all the services available in the API to read and write data would make this tutorial quite long, so we will focus on one of the most commonly used storage services in data projects: S3.
Uploading Data to S3
In the repository shared above, I have written a function that writes the DataFrame extracted with the get_data_from_api function to the S3 bucket created earlier:
def write_data_to_bucket(df: pd.DataFrame, file_name: str, mode: str):
    """
    Write a DataFrame to the S3 bucket as a CSV dataset.

    Parameters:
    ----------
    df(pd.DataFrame): DataFrame to write (extracted with get_data_from_api)
    file_name(str): Folder name under raw-data/ where the CSV will be stored
    mode(str): Available write modes are 'append', 'overwrite' and 'overwrite_partitions'
    """
    path = f"s3://{bucket}/raw-data/{file_name}"
    # Write the DataFrame of the corresponding ticker to the bucket
    wr.s3.to_csv(
        df=df,
        path=path,
        index=True,
        dataset=True,
        mode=mode
    )
So let's put get_data_from_api into action by passing the NVDA stock symbol, and then load the resulting DataFrame into the bucket using write_data_to_bucket.
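A minimal usage sketch follows (the exact calls in the repository may differ slightly):

# Hypothetical usage: extract NVDA price history and load it into the bucket
df = get_data_from_api("NVDA")
write_data_to_bucket(df, file_name="NVDA", mode="overwrite")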
If we go to the S3 bucket, we'll see a new folder named after the stock, with a CSV file inside it.
You can also pass multiple DataFrames to the function so each one is written to the bucket, as in the sketch below.
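For example, a hypothetical loop over a few tickers (the ticker list here is just for illustration):

# Hypothetical: write several tickers to the bucket in one loop
for ticker in ["NVDA", "MSFT", "AMZN"]:
    write_data_to_bucket(get_data_from_api(ticker), file_name=ticker, mode="overwrite")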
Reading Data from S3
Reading data from an S3 bucket using awswrangler is a very straightforward task: you only need to pass the S3 path where your files are stored and the path_suffix corresponding to the file type you want to read, in this case .csv for read_csv. You can find more parameters available for this method in the awswrangler API reference.
In the tutorial repository I have written a function that takes the name of the folder where the CSV file was stored when we wrote the API data to the S3 bucket:
def read_csv_from_bucket(folder_name: str) -> pd.DataFrame:
    # Read every CSV file under the given folder in the raw-data prefix
    df = wr.s3.read_csv(path=f"s3://{bucket}/raw-data/{folder_name}/",
                        path_suffix=".csv")
    return df
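A quick usage sketch, assuming the NVDA folder written earlier:

# Read the NVDA CSV back into a pandas DataFrame
nvda_df = read_csv_from_bucket("NVDA")
print(nvda_df.head())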
5. Conclusion
There are many more methods available in the API Reference page for interacting with S3 and multiple other AWS services.
I encourage you to keep testing the ones you find useful to integrate into your ETLs and pipelines.
AWS Data Wrangler simplifies the interaction with Amazon S3, providing a seamless experience for performing data operations with Pandas DataFrames. This tutorial covered the fundamental concepts of connecting to S3 and uploading and downloading data. By leveraging AWS Data Wrangler, you can streamline your data workflows and focus on deriving insights from your data.
Cheers and happy coding!!
Resources for Further Learning
AWS Data Wrangler Documentation: https://aws-data-wrangler.readthedocs.io/
AWS Data Wrangler GitHub Repository: https://github.com/awslabs/aws-data-wrangler
AWS SDK for Python (Boto3) Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html