Building a Machine Learning Model with AWS SageMaker

Anshul Garg

In this article, we will walk through how to set up an environment in AWS SageMaker for building a machine learning model using the XGBoost algorithm. We will break down the process into simple steps, making it easy to follow even if you're new to machine learning or AWS.

Setting Up the Environment

To start, we need to initialize our AWS SageMaker environment and define where our data will be stored. This involves importing necessary libraries and setting key variables.
Ensure your data is preprocessed as outlined in our previous article, Feature Engineering with SageMaker Processing, before proceeding with model setup.

from sagemaker import Session
import sagemaker
import boto3
from sagemaker import get_execution_role

# Obtain the SageMaker execution role for permissions
role = get_execution_role()

# Set the default S3 bucket for storing data
bucket = sagemaker.Session().default_bucket()

# Define a prefix for organizing data within the S3 bucket
prefix = 'mlops/activity-3'

# Initialize a SageMaker session
sess = Session()

# Define S3 paths for training, validation, and test data
train_path = f"s3://{bucket}/{prefix}/train/"
validation_path = f"s3://{bucket}/{prefix}/validation/"
test_path = f"s3://{bucket}/{prefix}/test/"
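The channel paths above follow a simple bucket/prefix/channel layout. As a quick sanity check, here is a minimal sketch of a helper that builds these URIs; the bucket name is illustrative, and the commented upload call assumes your preprocessed CSVs from the previous article exist locally and your AWS credentials are configured:

```python
def channel_uri(bucket: str, prefix: str, channel: str) -> str:
    """Build the S3 URI for a data channel (train/validation/test)."""
    return f"s3://{bucket}/{prefix}/{channel}/"

# Illustrative bucket name; in the walkthrough it comes from default_bucket()
train_path = channel_uri("my-bucket", "mlops/activity-3", "train")
print(train_path)  # s3://my-bucket/mlops/activity-3/train/

# To stage a local CSV into a channel (requires AWS credentials), you could use:
# sess.upload_data(path="train.csv", bucket=bucket, key_prefix=f"{prefix}/train")
```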

Retrieve the XGBoost Container Image

Next, we retrieve the URI of SageMaker's managed XGBoost container image. This is essential, as the container provides the training environment for our model.

# Pin a specific algorithm version; recent SageMaker SDK releases no longer accept version='latest'
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='1.7-1')

Define Input Data Locations

We will specify where our training and validation datasets are stored in S3. This helps SageMaker know where to look for the data during training.

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=f's3://{bucket}/{prefix}/train/', content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=f's3://{bucket}/{prefix}/validation/', content_type='csv')
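Before launching training, it can help to confirm that each channel actually contains objects. A small sketch of such a check, assuming boto3 credentials are available; the URI parser itself is plain string handling and the bucket name is illustrative:

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key/prefix URI into (bucket, key_prefix)."""
    without_scheme = uri.removeprefix("s3://")
    bucket, _, key = without_scheme.partition("/")
    return bucket, key

bucket_name, key_prefix = split_s3_uri("s3://my-bucket/mlops/activity-3/train/")
print(bucket_name, key_prefix)

# With AWS credentials configured, list a few objects under the prefix:
# import boto3
# s3 = boto3.client("s3")
# resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=key_prefix, MaxKeys=5)
# print(resp.get("KeyCount", 0), "objects found")
```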

Set Up and Train the XGBoost Model

Now, we can configure and start training our XGBoost model. We will define important parameters such as instance type and hyperparameters that help optimize our model's performance.

from time import gmtime, strftime

# Generate a unique name based on the current time, used to prefix the training job
model_name = "xgboost-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# Define an XGBoost estimator
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=sess,
    base_job_name=model_name  # the Estimator names training jobs from base_job_name
)

# Set hyperparameters for training
xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=1,  # replaces the deprecated 'silent' parameter in newer XGBoost versions
    objective='binary:logistic',
    num_round=100
)

# Launch the training job
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})
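Once fit() completes, SageMaker writes the trained model archive under the estimator's output_path, keyed by the training job name; the SDK also exposes this location as xgb.model_data. A sketch of that naming convention, using illustrative bucket and job names:

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """SageMaker stores the trained model at <output_path>/<job_name>/output/model.tar.gz."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"

uri = model_artifact_uri("s3://my-bucket/mlops/activity-3/output",
                         "xgboost-2024-01-01-00-00-00")
print(uri)

# After training, the SDK reports the same location:
# print(xgb.model_data)
```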

Conclusion

In this article, we covered how to set up an AWS SageMaker environment and train an XGBoost model. By following these steps, you can manage your data and train machine learning models efficiently. For more detailed code and resources, check out my GitHub repository. Feel free to explore further and dive into machine learning with AWS SageMaker!

Ready to deploy your trained model? Check out the next articles on Deploying the Trained XGBoost Model as a Real-Time Endpoint and Deploying a Serverless Machine Learning Model on AWS SageMaker.
