Feature Engineering with SageMaker Processing

Anshul Garg
3 min read

Introduction

In machine learning, pre-processing large datasets efficiently is a critical task. Amazon SageMaker Processing offers a scalable, fully managed way to run pre-processing, post-processing, and model evaluation workloads on dedicated clusters, freeing up resources on your notebook instance.

Data scientists and ML engineers commonly use SageMaker Processing to run Python scripts or custom containers against large datasets. A processing job is defined and executed through the SageMaker Python SDK, and it can run on SageMaker's pre-built containers or on a custom container to meet specific processing needs.

Why Use SageMaker Processing?

SageMaker Processing runs large-scale processing tasks on clusters that are independent of the notebook instance, providing:

  • Scalability: Process terabytes of data in a managed, scalable environment.

  • Flexibility: Run custom processing scripts within SageMaker’s environment, with control over the instance type and number of instances.

  • Efficiency: Keep the notebook instance running on a smaller, less expensive configuration while utilizing powerful SageMaker resources for processing.

Hands-On Guide: Feature Engineering with SageMaker Processing

We’re running the code below in a Jupyter Notebook created in SageMaker Studio. You can skip these steps and use our Jupyter Notebook directly from here if you prefer.

  1. Load the data to an S3 bucket. First upload the bank-additional-full.csv file to your notebook environment using the Jupyter Notebook upload feature, then push it to S3:

     # Upload the raw dataset to the default SageMaker S3 bucket
     import sagemaker

     sess = sagemaker.Session()
     bucket = sess.default_bucket()
     prefix = 'mlops/sagemaker-processing-activity'

     input_source = sess.upload_data('./bank-additional-full.csv', bucket=bucket, key_prefix=f'{prefix}/input_data')
     input_source  # S3 URI of the uploaded file
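
    Optionally, you can confirm the upload landed where you expect by listing the prefix (IPython expands the $bucket and $prefix Python variables inside the shell command):

     # Optional sanity check: list the uploaded object
     !aws s3 ls s3://$bucket/$prefix/input_data/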
    
  2. Define the IAM Role.

     # Look up the IAM role attached to this notebook environment
     from sagemaker import get_execution_role

     role = get_execution_role()
    
  3. Run the Processing Job
    We'll use a custom Python script for feature engineering, which will:

    • Load data from an S3 bucket.

    • Perform necessary transformations.

    • Save the processed data back to the S3 bucket.

We’ll use a pre-built Scikit-learn container in this example, which includes essential libraries for data manipulation.
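
The processing script itself runs inside the container: it reads its input from /opt/ml/processing/input and writes its results under /opt/ml/processing/output. The snippet below is only an illustrative sketch of that pattern, assuming the semicolon-delimited UCI bank-marketing CSV, a placeholder one-hot encoding step, and a simple 70/20/10 split; the actual feature-engg-script.py fetched in the next step may differ.

    # Illustrative sketch of a processing script (not the exact contents of feature-engg-script.py)
    import os
    import numpy as np
    import pandas as pd

    input_dir = "/opt/ml/processing/input"
    output_base = "/opt/ml/processing/output"

    # Read the raw CSV mounted by the ProcessingInput (the UCI file is semicolon-delimited)
    df = pd.read_csv(os.path.join(input_dir, "bank-additional-full.csv"), sep=";")

    # Placeholder transformation: one-hot encode the categorical columns
    df = pd.get_dummies(df)

    # Shuffle and split roughly 70/20/10 into train/validation/test
    train, validation, test = np.split(
        df.sample(frac=1, random_state=42),
        [int(0.7 * len(df)), int(0.9 * len(df))],
    )

    # Write each split where the matching ProcessingOutput expects it
    for name, split in (("train", train), ("validation", validation), ("test", test)):
        out_dir = os.path.join(output_base, name)
        os.makedirs(out_dir, exist_ok=True)
        split.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)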

    # Fetch Preprocessing Script
    !wget --no-check-certificate https://raw.githubusercontent.com/garganshulgarg/learn-mlops-with-sagemaker/refs/heads/main/applications/feature-engineering/feature-engg-script.py

    # S3 destinations for the processed data splits
    train_path = f"s3://{bucket}/{prefix}/train"
    validation_path = f"s3://{bucket}/{prefix}/validation"
    test_path = f"s3://{bucket}/{prefix}/test"

    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker import get_execution_role

    # Processor backed by SageMaker's pre-built scikit-learn container
    sklearn_processor = SKLearnProcessor(
        framework_version="0.23-1",
        role=get_execution_role(),
        instance_type="ml.m5.large",
        instance_count=1,
        base_job_name='mlops-sklearnprocessing'
    )

    # Run the job: the raw CSV is mounted into the container and the output splits are uploaded to S3
    sklearn_processor.run(
        code='feature-engg-script.py',
        # arguments = ['arg1', 'arg2'],
        inputs=[
            ProcessingInput(
                source=input_source, 
                destination="/opt/ml/processing/input",
                s3_input_mode="File",
                s3_data_distribution_type="ShardedByS3Key"
            )
        ],
        outputs=[
            ProcessingOutput(
                output_name="train_data", 
                source="/opt/ml/processing/output/train",
                destination=train_path,
            ),
            ProcessingOutput(output_name="validation_data", source="/opt/ml/processing/output/validation", destination=validation_path),
            ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test", destination=test_path),
        ]
    )
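
If you pass values through the commented-out arguments parameter (for example, arguments=['--train-split', '0.7']), they arrive in the script as command-line arguments. A hypothetical way to read them inside the processing script is with argparse; the --train-split flag below is an illustration, not part of the original script:

    # Inside the processing script: parse values supplied via `arguments=[...]`
    # (the --train-split flag is hypothetical)
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--train-split", type=float, default=0.7)
    args, _ = parser.parse_known_args()
    print(f"Using train split of {args.train_split}")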
  4. Monitor the Job

    Track the job status through the SageMaker Processing Jobs console. Once completed, verify the output data in the specified S3 paths.
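
    Alternatively, you can query the job from the notebook with the SDK; this assumes the sklearn_processor object from the previous step is still in scope:

     # Describe the most recent processing job and print its status
     latest_job = sklearn_processor.jobs[-1]
     print(latest_job.describe()["ProcessingJobStatus"])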

  5. Validate the processed data

     !aws s3 ls $train_path/
     !aws s3 ls $validation_path/
     !aws s3 ls $test_path/
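
    Beyond listing the objects, you can download a split and preview it with pandas. The file name below (train.csv) is an assumption about what the processing script writes; adjust it to match the actual output of feature-engg-script.py:

     # Download the processed training split and preview it (train.csv is an assumed file name)
     import pandas as pd
     from sagemaker.s3 import S3Downloader

     S3Downloader.download(f"{train_path}/train.csv", "processed/train")
     train_df = pd.read_csv("processed/train/train.csv")
     print(train_df.shape)
     train_df.head()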
    

Conclusion

Using SageMaker Processing with custom Python scripts for feature engineering allows flexibility and scalability in pre-processing datasets for machine learning tasks. By separating the processing workload from the notebook environment, we can use powerful instances only when needed, reducing costs and enhancing performance.

You can find the complete code and Jupyter notebook for this example on my GitHub repository.


