Unlocking the Power of Amazon Bedrock: A Step-by-Step Guide to Model Evaluation

Navya A · 7 min read

Introduction

Amazon Bedrock is an AWS service that enables developers to leverage Generative AI models without the complexity of training, hosting, or scaling them. With Bedrock, users can seamlessly integrate AI models into their applications and workflows. One of its powerful features is model evaluation, which allows users to assess and compare different AI models before deployment.

In this blog, we will walk through a hands-on guide to configuring an Amazon S3 bucket for use with Amazon Bedrock and running a model evaluation job to compare the performance of a Generative AI model.


What We Are Going to Do

This tutorial will cover:

  • Creating an Amazon S3 bucket to store datasets and results.

  • Configuring permissions and CORS settings.

  • Setting up an evaluation job in Amazon Bedrock.

  • Interpreting evaluation results for model selection.

Environment before: architecture diagram of the environment before lab completion.

Environment after: architecture diagram of the environment after lab completion.

Use Cases

Amazon Bedrock’s model evaluation feature is useful for:
- Machine Learning Engineers: Testing and comparing different Generative AI models.
- Cloud Architects: Integrating AI models into cloud-based applications.
- DevOps Engineers: Managing AI-based workflows in cloud environments.
- Software Engineers: Embedding AI capabilities into web and mobile applications.

Prerequisites

Familiarity with the following AWS services will be beneficial:
- Amazon S3 (for storage management)
- Amazon Bedrock (for Generative AI services)

If you're unfamiliar with these, consider reading AWS documentation before proceeding.

Step 1: Creating Amazon S3 Buckets for Dataset and Results

Amazon Bedrock requires two S3 buckets:

  1. Dataset Bucket – Stores the input dataset (prompt-response pairs) for model evaluation.

  2. Results Bucket – Stores the evaluation results after Amazon Bedrock processes the dataset.

1.1 Creating the Dataset Bucket

  1. Log in to the AWS Management Console.

  2. In the search bar, type S3 and click on Amazon S3.

  3. Click the Create Bucket button.

  4. Enter a unique bucket name (e.g., bedrock-dataset-bucket).

  5. Select the AWS Region where you want to store the dataset.

  6. Leave other settings as default and click Create Bucket.
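If you prefer to script this step, here is a minimal boto3 sketch of the same bucket creation. The bucket name and Region are the tutorial's example values, so substitute your own (bucket names must be globally unique):

import boto3

BUCKET_NAME = "bedrock-dataset-bucket"  # example name from this tutorial; must be globally unique
REGION = "us-east-1"                    # assumption: pick the Region you will use for Bedrock

s3 = boto3.client("s3", region_name=REGION)

# us-east-1 is the default location and must not be passed as a LocationConstraint
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    s3.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )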

1.2 Uploading a Prompt Dataset

  1. Open the bedrock-dataset-bucket.

  2. Click Upload and add a file named prompt_dataset.jsonl.

  3. Amazon Bedrock expects the prompt dataset in JSON Lines format (one JSON object per line, with a .jsonl extension). Each record contains a prompt, an optional category, and the expected referenceResponse:

{ "prompt": "The chemical symbol for gold is", "category": "Chemistry", "referenceResponse": "Au" }
{ "prompt": "The tallest mountain in the world is", "category": "Geography", "referenceResponse": "Mount Everest" }
{ "prompt": "The author of 'Great Expectations' is", "category": "Literature", "referenceResponse": "Charles Dickens" }
  4. Click Upload to store the dataset in S3.
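The upload can be scripted as well. A minimal boto3 sketch, assuming prompt_dataset.jsonl sits in your working directory and the bucket from section 1.1 already exists:

import boto3

s3 = boto3.client("s3")

# Upload the JSON Lines prompt dataset to the dataset bucket created in section 1.1
s3.upload_file(
    Filename="prompt_dataset.jsonl",
    Bucket="bedrock-dataset-bucket",
    Key="prompt_dataset.jsonl",
)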

1.3 Creating the Results Bucket

  1. Go back to Amazon S3.

  2. Click the Create Bucket button again.

  3. Enter a unique bucket name (e.g., bedrock-results-bucket).

  4. Select the same AWS Region as the dataset bucket.

  5. Leave other settings as default and click Create Bucket.

Now, Amazon Bedrock will use:

  • bedrock-dataset-bucket to read the dataset.

  • bedrock-results-bucket to store evaluation results.


Step 2: Configuring Bucket Permissions and CORS

Amazon Bedrock needs permission to read the dataset from S3 and to store results in the other bucket. The service itself gets that access through the IAM role you attach to the evaluation job (step 3); in addition, the buckets need a Cross-Origin Resource Sharing (CORS) configuration so the Bedrock console can reach them from your browser.

2.1 Understanding CORS and Why It’s Needed

Cross-Origin Resource Sharing (CORS) is a browser security mechanism that blocks a web page from making requests to a different origin unless that origin explicitly allows it. The Amazon Bedrock console runs in your browser on a different origin than your S3 buckets, so the buckets must explicitly allow those cross-origin requests.

Without a CORS configuration, the Bedrock console cannot browse the dataset or display the results stored in S3, and creating the evaluation job will fail with permission errors.

2.2 Setting Up CORS Policy

To allow the Bedrock console to access the dataset and store results, apply the following CORS configuration to both buckets, starting with the dataset bucket:

  1. Open the bedrock-dataset-bucket in Amazon S3.

  2. Go to the Permissions tab.

  3. Scroll down to Cross-origin resource sharing (CORS) and click Edit.

  4. Add the following CORS configuration:

[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": ["Access-Control-Allow-Origin"]
  }
]
  5. Click Save Changes, then repeat steps 1–5 for the bedrock-results-bucket.
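The same configuration can be pushed from code. A sketch using boto3's put_bucket_cors, looping over both tutorial buckets:

import boto3

s3 = boto3.client("s3")

cors_configuration = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"],
        }
    ]
}

# Apply the rule to both buckets so the Bedrock console can reach each of them
for bucket in ("bedrock-dataset-bucket", "bedrock-results-bucket"):
    s3.put_bucket_cors(Bucket=bucket, CORSConfiguration=cors_configuration)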

Step 3: Running a Model Evaluation Job in Amazon Bedrock

Now, we will set up an evaluation job in Amazon Bedrock to compare an AI model’s responses with expected answers.

3.1 Setting Up the Evaluation

  1. Open the AWS Management Console.

  2. In the search bar, type Bedrock and select Amazon Bedrock.

  3. Click on Evaluations in the left menu.

  4. Click Create and select Automatic: Programmatic.

  5. Configure the evaluation job with the following details:

    • Evaluation Name: bedrock-eval-job-1

    • Model Provider: Amazon

    • Model: Titan Text G1 - Express

    • Task Type: Question and Answer

    • Metrics: Accuracy

    • Prompt Dataset Location:

        s3://bedrock-dataset-bucket/prompt_dataset.jsonl
      
    • Evaluation Results Location:

        s3://bedrock-results-bucket/evaluation-results/
      
    • IAM Role: Select an appropriate IAM Role with S3 access.

  6. Click Create to start the evaluation. The job will take approximately 10 minutes to complete.
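The console walkthrough above also has a programmatic equivalent in the Bedrock control-plane API. A sketch using boto3's create_evaluation_job; the account ID and role name in the ARN are placeholders, and the field values mirror the console choices above:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # assumption: same Region as the buckets

response = bedrock.create_evaluation_job(
    jobName="bedrock-eval-job-1",
    # Placeholder ARN: supply a role that can read the dataset bucket and write to the results bucket
    roleArn="arn:aws:iam::111122223333:role/BedrockEvaluationRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "prompt_dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://bedrock-dataset-bucket/prompt_dataset.jsonl"
                        },
                    },
                    "metricNames": ["Builtin.Accuracy"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "amazon.titan-text-express-v1"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://bedrock-results-bucket/evaluation-results/"},
)
print(response["jobArn"])  # use this ARN to poll job status with get_evaluation_job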


Step 4: Viewing and Analyzing Results

Once the evaluation is complete, follow these steps to view the results:

  1. Navigate to the Model Evaluation Jobs page in Amazon Bedrock.

  2. Click on the completed job.

  3. Open the Evaluation Results Location to access the JSONL file stored in S3.

  4. Download and open the file.

4.1 Example Evaluation Results

Once the evaluation job is completed, Amazon Bedrock generates a JSONL file containing the results. Below is an example output from an evaluation job:

{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Accuracy",
        "result": 0
      }
    ]
  },
  "inputRecord": {
    "prompt": "The chemical symbol for gold is",
    "referenceResponse": "Au"
  },
  "modelResponses": [
    {
      "response": " “Au”.",
      "modelIdentifier": "amazon.titan-text-express-v1"
    }
  ]
}
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Accuracy",
        "result": 1
      }
    ]
  },
  "inputRecord": {
    "prompt": "The tallest mountain in the world is",
    "referenceResponse": "Mount Everest"
  },
  "modelResponses": [
    {
      "response": " Mount Everest.",
      "modelIdentifier": "amazon.titan-text-express-v1"
    }
  ]
}
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Accuracy",
        "result": 0
      }
    ]
  },
  "inputRecord": {
    "prompt": "The author of 'Great Expectations' is",
    "referenceResponse": "Charles Dickens"
  },
  "modelResponses": [
    {
      "response": "Sorry - this model is unable to respond to this request.",
      "modelIdentifier": "amazon.titan-text-express-v1"
    }
  ]
}
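Because every line in the output file is a standalone JSON object, the results are easy to post-process. A minimal sketch, assuming you have downloaded the generated file locally as results.jsonl (the exact object key under evaluation-results/ is generated by Bedrock):

import json

# Read one JSON object per line from the downloaded results file
with open("results.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Collect the Accuracy score for each prompt and report the average
scores = [
    score["result"]
    for record in records
    for score in record["automatedEvaluationResult"]["scores"]
    if score["metricName"] == "Accuracy"
]
print(f"Mean accuracy over {len(scores)} prompts: {sum(scores) / len(scores):.2f}")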

Explanation of Evaluation Results

  1. Each result contains three key parts:

    • inputRecord: The original question (prompt) and expected answer (referenceResponse).

    • modelResponses: The AI model’s generated response.

    • automatedEvaluationResult: The score assessing how accurate the response is.

  2. How Accuracy Is Calculated

    • For the question-and-answer task type, the accuracy metric is reported as an F1 score: the harmonic mean of precision (the fraction of words in the model's response that also appear in the reference answer) and recall (the fraction of words in the reference answer that the response recovers). A rough implementation is sketched below.

    • The F1 score ranges from 0 to 1, where 1 indicates a response that matches the reference and 0 indicates no overlap at all.
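To make the metric concrete, here is a rough word-overlap F1 sketch in Python. This is an illustration, not Bedrock's internal scoring code; the normalization step is an assumption, but it is consistent with the scores in the results above:

import string

def normalize(text: str) -> list[str]:
    # Lowercase and strip ASCII punctuation; Bedrock's exact normalization is not documented
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def f1_score(response: str, reference: str) -> float:
    resp, ref = normalize(response), normalize(reference)
    common = sum(min(resp.count(w), ref.count(w)) for w in set(resp))
    if common == 0:
        return 0.0
    precision = common / len(resp)  # correct words among all response words
    recall = common / len(ref)      # reference words the response recovered
    return 2 * precision * recall / (precision + recall)

print(f1_score("Mount Everest.", "Mount Everest"))  # 1.0 once the trailing period is stripped
print(f1_score("“Au”.", "Au"))  # 0.0: curly quotes are not ASCII punctuation, so the token never matches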


Analysis of the Example Results

| Prompt | Expected Answer | Model Response | Accuracy Score | Observation |
| --- | --- | --- | --- | --- |
| The chemical symbol for gold is | Au | “Au”. | 0 | The response included extra formatting (quotation marks), causing a mismatch. |
| The tallest mountain in the world is | Mount Everest | Mount Everest. | 1 | Correct answer, properly formatted. |
| The author of 'Great Expectations' is | Charles Dickens | Sorry - this model is unable to respond to this request. | 0 | The model failed to answer the question, which may indicate a knowledge gap. |

Key Takeaways from the Evaluation

  • Formatting Matters: The first prompt received a score of 0 because the model enclosed the answer in quotation marks (“Au”.).

  • Correct Responses Get Full Score: The second prompt scored 1.0, as the response exactly matched the expected answer.

  • Model Limitations: The third prompt failed because the model did not have an answer. This could be due to:

    • A temporary issue with the model.

    • The model lacking sufficient training data on literary topics.


How to Improve Model Performance

  • Refine Prompt Engineering: Modify how questions are framed to reduce false negatives (e.g., specifying that responses should not include extra punctuation).

  • Choose the Right AI Model: Different foundation models have varying strengths. Testing with multiple models can help find the best fit.

  • Use Custom Training Data: If a model frequently fails on domain-specific questions, consider fine-tuning it on a custom dataset.


Summary

By following this guide, you have successfully:
✅ Created two Amazon S3 buckets for storing datasets and results.
✅ Configured CORS permissions for Amazon Bedrock.
✅ Set up an Amazon Bedrock model evaluation job.
✅ Analyzed evaluation results to compare AI model performance.
