Evaluating AWS Bedrock Models: A Serverless Approach for Efficiency and Scalability

DataOps Labs

Problem Statement

Evaluating multiple LLMs for text summarization is a resource-intensive and complex process. Developers often struggle to orchestrate workflows, manage parallelization, and store results in a scalable, cost-effective manner, all of which adds overhead and inefficiency. Internally, we needed a tool that helps developers understand how the models they use score before moving to production. This tool aims to address those challenges, streamlining the evaluation process and providing clear insight into model performance.

Solution

This solution harnesses AWS serverless services to build an automated, scalable, and cost-effective architecture for evaluating LLMs. It incorporates several key features:

GitHub URL: https://github.com/jayyanar/llmeval-bedrock-summarize-scale

Automatic Deployment and Scaling

The evaluation workflow is automatically triggered by changes pushed to the GitHub repository. AWS Step Functions handle the orchestration of parallel evaluations for multiple LLMs, ensuring efficient resource utilization without the need for manual intervention.
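
As a rough sketch of this trigger path, the GET request from API Gateway can land on a Lambda function that starts the state machine. The handler name, environment variable, and payload shape below are illustrative assumptions, not code from the repository:

```python
# Hypothetical Lambda handler invoked by the API Gateway GET request.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # STATE_MACHINE_ARN would be configured as a Lambda environment variable.
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({
            # One entry per model to evaluate; IDs are placeholders.
            "models": ["model-1", "model-2", "model-3"],
        }),
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"executionArn": response["executionArn"]}),
    }
```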

Scalability and Cost-Effectiveness

The architecture leverages AWS serverless services such as AWS Step Functions, Amazon API Gateway, and Amazon Bedrock. This allows the system to scale dynamically with demand, minimizing overhead costs when the system is idle.

Centralized Result Storage

Evaluation results for each LLM are stored in a centralized database or data store. This centralization facilitates easy analysis and comparison of results, providing valuable insights into model performance.

Modular Architecture

The architecture is designed to be modular, with loosely coupled components. This design enables easy integration of new LLMs or the replacement of existing ones, promoting flexibility and extensibility in the evaluation process.

Technical Architecture Overview

  1. Code Deployment: Developers push code changes to a GitHub repository.

  2. Trigger Detection: The Amplify WebHosting service detects the changes and triggers a GET request to the Amazon API Gateway.

  3. Workflow Orchestration: The API Gateway invokes an AWS Step Function, which orchestrates the evaluation process.

  4. Parallel Evaluation: The Step Function parallelizes the evaluation by triggering one instance of the Call Bedrock LLM for Model Evaluation step per model (Model 1, Model 2, Model 3). A sketch of this state machine appears after this list.

  5. Result Storage: The evaluation results of each model are stored in a centralized database or data store.

  6. Request Routing: The Amazon API Gateway acts as a proxy, receiving requests from the Amplify WebHosting service and routing them to the appropriate Step Function.
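
To make the fan-out in step 4 concrete, here is a hedged sketch of what the Step Functions definition (Amazon States Language) could look like, written as a Python dict. The state names, Lambda ARN, table name, and result fields are assumptions for illustration; the repository's actual definition may differ:

```python
# Amazon States Language definition expressed as a Python dict; all ARNs and
# names are placeholders. A Map state runs one iteration per model ID.
evaluation_state_machine = {
    "Comment": "Fan out one Bedrock evaluation per model, then store results",
    "StartAt": "EvaluateModels",
    "States": {
        "EvaluateModels": {
            "Type": "Map",
            "ItemsPath": "$.models",  # the list passed in at execution start
            "MaxConcurrency": 3,      # all three models run in parallel
            "Iterator": {
                "StartAt": "CallBedrockLLM",
                "States": {
                    "CallBedrockLLM": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-bedrock-eval",
                        "Next": "StoreResult",
                    },
                    "StoreResult": {
                        "Type": "Task",
                        # Step Functions' optimized DynamoDB integration.
                        "Resource": "arn:aws:states:::dynamodb:putItem",
                        "Parameters": {
                            "TableName": "llm-eval-results",
                            "Item": {
                                "modelId": {"S.$": "$.modelId"},
                                # DynamoDB's N type is string-encoded, so the
                                # Lambda is assumed to return score as a string.
                                "score": {"N.$": "$.score"},
                            },
                        },
                        "End": True,
                    },
                },
            },
            "End": True,
        }
    },
}
```

Setting MaxConcurrency to the number of models keeps all evaluations running simultaneously; adding a model later only means extending the input list.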

Detailed Implementation

AWS Bedrock

AWS Bedrock provides a suite of foundation models for a variety of machine learning tasks, including text summarization. The architecture leverages Bedrock to perform LLM evaluations by integrating its capabilities into the serverless workflow. You can find more details about Bedrock in the AWS Bedrock Documentation.
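
As a minimal example of what one evaluation branch might run, the sketch below calls a Bedrock text model for summarization via boto3's Converse API. The model ID, prompt, and inference parameters are examples, not the repository's exact settings:

```python
# A single summarization call against Bedrock via the Converse API.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

document = "Long article text to be summarized goes here..."

response = bedrock_runtime.converse(
    modelId="amazon.titan-text-express-v1",  # any Bedrock text model ID
    messages=[{
        "role": "user",
        "content": [{"text": f"Summarize the following text:\n\n{document}"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```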

Bedrock Evaluation API

The Bedrock Evaluation API is a critical component of this architecture. It allows for the efficient execution and management of LLM evaluations. By utilizing this API, the architecture can automate the evaluation process, ensuring that each model is tested against the same dataset under consistent conditions. More information about the Bedrock Evaluation API is available in the AWS documentation.
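
A hedged sketch of kicking off an automated evaluation job through boto3 is shown below. The role ARN, S3 URIs, dataset, and metric names are placeholders; check the Bedrock model-evaluation documentation for the exact options available in your region:

```python
# Starting an automated Bedrock model-evaluation job; values are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="summarization-eval-titan",
    roleArn="arn:aws:iam::123456789012:role/bedrock-eval-role",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Summarization",
                "dataset": {
                    "name": "custom-summarization-set",
                    "datasetLocation": {"s3Uri": "s3://my-eval-bucket/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Accuracy", "Builtin.Toxicity"],
            }]
        }
    },
    inferenceConfig={
        "models": [{
            "bedrockModel": {"modelIdentifier": "amazon.titan-text-express-v1"},
        }]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},
)
print(response["jobArn"])
```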

Detailed Workflow

  1. Initial Trigger: Code changes are pushed to GitHub, which automatically triggers the Amplify WebHosting service.

  2. API Gateway: Amplify WebHosting sends a GET request to Amazon API Gateway, which then invokes an AWS Step Function.

  3. Orchestration with Step Functions: The Step Function manages the evaluation workflow, initiating parallel evaluations for each LLM. This involves invoking the Bedrock Evaluation API for each model (e.g., Amazon Titan, Llama 3, Claude 3.5).

  4. Parallel Processing: Each model is evaluated in parallel, with results sent back to the Step Function for aggregation (see the aggregation sketch after this list).

  5. Centralized Storage: Results are stored in a centralized database, allowing for easy access and comparison.

  6. User Interface: The evaluation results are displayed in the Amplify-hosted web application, providing developers with a user-friendly interface to view and analyze the outcomes.
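
The aggregation in step 4 could be as simple as a Lambda that ranks the per-model results emitted by the Map state. The handler below is a hypothetical illustration; the result shape (modelId, accuracy) is an assumption:

```python
# Hypothetical aggregation Lambda: `event` is assumed to be the Map state's
# output, a list of {"modelId": ..., "accuracy": ...} dicts, one per model.
def aggregate_handler(event, context):
    ranked = sorted(event, key=lambda result: result["accuracy"], reverse=True)
    return {"bestModel": ranked[0]["modelId"], "ranking": ranked}
```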

Scalability and Cost-Effectiveness

The use of AWS serverless technologies, such as AWS Step Functions and API Gateway, ensures that the architecture can scale according to demand. This eliminates unnecessary costs when the system is idle, making it a cost-effective solution for enterprises.

Centralized Result Storage

Storing evaluation results in a centralized database (DynamoDB) not only facilitates easy comparison and analysis but also ensures that data is readily available for future reference. This centralized approach improves data governance and security.
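
As an illustration of this storage layer, the sketch below writes one model's scores to DynamoDB with boto3. The table name and key schema (partition key modelId, sort key runId) are assumptions, not the repository's actual schema:

```python
# Writing one model's evaluation scores to DynamoDB; names are assumptions.
from decimal import Decimal  # the DynamoDB resource API rejects Python floats

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("llm-eval-results")  # hypothetical table name

def store_result(model_id: str, run_id: str, scores: dict) -> None:
    # Partition on model ID and sort on run ID so every run of a model
    # can be queried together for side-by-side comparison.
    table.put_item(Item={"modelId": model_id, "runId": run_id, **scores})

store_result(
    "amazon.titan-text-express-v1",
    "2024-06-01T12:00:00Z",
    {"accuracy": Decimal("0.84"), "toxicity": Decimal("0.01")},
)
```

Keying the table by model and run makes the comparison query for the web UI a straightforward scan or per-model query, with no joins required.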

Conclusion

The serverless architecture designed by Aadhi and Ayyanar streamlines the evaluation of LLMs by automating deployment, ensuring scalability, and centralizing result storage. This approach not only reduces overhead costs but also enhances the efficiency and flexibility of LLM evaluation, making it a valuable asset for enterprises.

To learn more about their work, you can connect with Aadhi here and AJ here.


Written by

DataOps Labs

I'm Ayyanar Jeyakrishnan, aka AJ. With over 18 years in IT, I'm a passionate Multi-Cloud Architect specialising in crafting scalable and efficient cloud solutions. I've successfully designed and implemented multi-cloud architectures for diverse organisations, harnessing AWS, Azure, and GCP. My track record includes delivering Machine Learning and Data Platform projects with a focus on high availability, security, and scalability. I'm a proponent of DevOps and MLOps methodologies, accelerating development and deployment. I actively engage with the tech community, sharing knowledge in sessions, conferences, and mentoring programs. Constantly learning and pursuing certifications, I provide cutting-edge solutions to drive success in the evolving cloud and AI/ML landscape.