AWS Transcribe Client: A Lightweight Speech-to-Text Solution with Voice Activity Detection

Introduction

In this article, I'm excited to introduce the aws-transcribe-client, a lightweight TypeScript library for integrating Amazon's real-time speech-to-text capabilities into web applications. This library was born out of real-world needs during the development of Co-Author AI, an application for authoring documents via AI and voice commands.

Let me walk you through the rationale behind this project, its architecture, and how you can use it in your own applications.

The Origin Story

The aws-transcribe-client library was initially developed to power speech recognition in Co-Author AI, where voice commands significantly enhance the document authoring experience. When building Co-Author AI, we needed a speech recognition solution that was:

  1. Accurate and reliable for professional document creation
  2. Browser-compatible across all modern browsers, including Safari
  3. Intelligent with automatic silence detection and handling
  4. Easy to integrate with React-based applications
  5. Cost-efficient through smart management of AWS Transcribe streaming sessions

After evaluating several options, we decided to build a custom AWS Transcribe streaming client that would meet all these requirements. The library has since matured into a standalone solution that can be used in any web application.

Architecture Overview

The aws-transcribe-client library is structured around two main components:

1. Core Client (AWSTranscribeClient)

The core client handles all interactions with AWS Transcribe streaming API and includes:

  • Credential Management: Secure handling of AWS credentials with automatic refresh
  • Voice Activity Detection (VAD): RMS-based audio analysis to detect when speech is occurring
  • Silence Management: Intelligent detection of silence periods to start/stop transcription
  • Session Management: Tracking and reporting of usage time
  • WebAudio Integration: Cross-browser audio capture and processing
  • AWS SDK Integration: Proper handling of the Transcribe Streaming API

2. React Component (ReactAWSTranscribe)

Built on top of the core client, the React component provides:

  • State Management: React hooks to manage the transcription state
  • UI Components: Default UI elements with customization options
  • Render Props: Flexible rendering options for custom UI integration
  • Event Handling: React-friendly callback system

The library's architecture follows these design principles:

  • Framework-Agnostic Core: The core functionality works independently of any UI framework
  • Progressive Enhancement: Features like voice activity detection enhance the experience but aren't required
  • Browser Compatibility: Special handling for Safari and other browsers' audio API quirks
  • Type Safety: Full TypeScript support throughout the codebase

Technical Deep Dive

Let's look at some of the interesting technical challenges solved in this library:

Custom ReadableStream Implementation

One major challenge was Safari's incomplete support for the standard ReadableStream API, in particular the async iterator interface that the AWS SDK relies on to consume the audio stream. To solve this, the library implements a custom ReadableStream class that exposes that interface and behaves consistently across all modern browsers.
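To make the idea concrete, here is a minimal sketch of an async-iterable audio source. This is not the library's actual class, and the real client also has to wrap each chunk in the event shape the Transcribe Streaming SDK expects; the class and method names below are purely illustrative:

// Illustrative sketch: an async-iterable audio source for browsers whose native
// ReadableStream does not support async iteration. Chunks are queued with push()
// and consumed with `for await ... of`.
class SimpleAudioStream {
  private chunks: Uint8Array[] = [];
  private closed = false;
  private waiters: Array<() => void> = [];

  // Called from the audio capture pipeline with encoded PCM chunks.
  push(chunk: Uint8Array): void {
    this.chunks.push(chunk);
    this.waiters.splice(0).forEach((wake) => wake());
  }

  // Signals that no more audio will arrive.
  close(): void {
    this.closed = true;
    this.waiters.splice(0).forEach((wake) => wake());
  }

  // Exposing Symbol.asyncIterator is what lets the consumer iterate the audio
  // even where native ReadableStream async iteration is unavailable.
  async *[Symbol.asyncIterator](): AsyncGenerator<Uint8Array> {
    while (true) {
      if (this.chunks.length > 0) {
        yield this.chunks.shift() as Uint8Array;
      } else if (this.closed) {
        return;
      } else {
        // Park until push() or close() wakes us up.
        await new Promise<void>((resolve) => this.waiters.push(resolve));
      }
    }
  }
}

Because `for await (const chunk of stream)` works on any object that implements Symbol.asyncIterator, a consumer can iterate this class in every browser regardless of its ReadableStream support.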

Voice Activity Detection

The library implements a simple but effective voice activity detection algorithm based on RMS (Root Mean Square) analysis of audio data:

private _detectVoiceActivity(inputData: Float32Array): boolean {
    // Compute the root mean square of the audio frame as a measure of its energy.
    const rms = Math.sqrt(inputData.reduce((acc, val) => acc + val * val, 0) / inputData.length);
    // Treat the frame as speech when the energy exceeds the configured threshold.
    return rms > this.config.vadThreshold;
}

This approach minimizes processing overhead while reliably detecting speech, and the threshold is configurable to adapt to different environments.

Intelligent Silence Handling

To provide a good user experience, the library implements a dual-timeout system for silence:

  1. A short timeout (default: 1 second) that pauses transcription when the speaker briefly stops talking
  2. A longer timeout (default: 60 seconds) that stops the transcription session entirely during extended silence

This approach preserves natural speech patterns while preventing unnecessary AWS Transcribe usage during long periods of inactivity.
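As a rough illustration of the pattern (not the library's actual code), both timers can simply be reset every time voice activity is detected; the SilenceMonitor class and the shortSilenceMs, longSilenceMs, onSpeechEnd, and onAutoStop names below are hypothetical:

// Illustrative dual-timeout silence handling: reset both timers on every
// detected voice frame; fire callbacks when the respective silence elapses.
class SilenceMonitor {
  private shortTimer?: ReturnType<typeof setTimeout>;
  private longTimer?: ReturnType<typeof setTimeout>;

  constructor(
    private shortSilenceMs: number,  // e.g. 1_000: treat the speaker as paused
    private longSilenceMs: number,   // e.g. 60_000: stop the session entirely
    private onSpeechEnd: () => void,
    private onAutoStop: () => void
  ) {}

  // Call whenever voice activity is detected (e.g. when RMS exceeds the threshold).
  voiceDetected(): void {
    clearTimeout(this.shortTimer);
    clearTimeout(this.longTimer);
    this.shortTimer = setTimeout(this.onSpeechEnd, this.shortSilenceMs);
    this.longTimer = setTimeout(this.onAutoStop, this.longSilenceMs);
  }

  dispose(): void {
    clearTimeout(this.shortTimer);
    clearTimeout(this.longTimer);
  }
}

The key design choice is that the short timer only marks a pause in speech, while the long timer tears down the streaming session, so brief hesitations never interrupt the user but abandoned sessions stop accruing AWS Transcribe charges.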

Getting Started with aws-transcribe-client

Installation

npm install aws-transcribe-client

Basic Usage with the Core Client

import { AWSTranscribeClient, TranscribeCredentials } from 'aws-transcribe-client';
import { LanguageCode } from '@aws-sdk/client-transcribe-streaming';

// Create a credentials provider function
const credentialsProvider = async (minutesUsed: number): Promise<TranscribeCredentials> => {
    // Fetch credentials from your server
    const response = await fetch('/api/aws-credentials');
    return await response.json();
};

// Create instance
const transcribeClient = new AWSTranscribeClient({
    region: 'us-east-1',
    languageCode: LanguageCode.EN_US,
    credentialsProvider,
    onTranscript: ({ transcript, interimTranscript }) => {
        console.log('Final transcript:', transcript);
        console.log('Interim transcript:', interimTranscript);
    },
    onSpeechStart: () => console.log('Speech started'),
    onSpeechEnd: () => console.log('Speech ended'),
    onError: (error) => console.error('Error:', error),
    onStateChange: (state) => console.log('State changed:', state)
});

// Start/stop transcription
transcribeClient.start();
transcribeClient.stop();

// Or use toggle
transcribeClient.toggle();

Using the React Component

import React, { useState } from 'react';
import { ReactAWSTranscribe, TranscriptData } from 'aws-transcribe-client';

const TranscriptionApp: React.FC = () => {
  const [transcript, setTranscript] = useState<string>('');
  const [interimTranscript, setInterimTranscript] = useState<string>('');

  const credentialsProvider = async () => {
    const response = await fetch('/api/aws-credentials');
    const credentials = await response.json();
    return credentials;
  };

  const handleTranscript = (data: TranscriptData) => {
    setTranscript(data.transcript);
    setInterimTranscript(data.interimTranscript);
  };

  return (
    <div>
      <h1>Transcription App</h1>

      <ReactAWSTranscribe
        region="us-east-1"
        credentialsProvider={credentialsProvider}
        onTranscript={handleTranscript}
      />

      <div className="transcript-container">
        <h2>Final Transcript:</h2>
        <p>{transcript}</p>

        <h2>Interim Transcript:</h2>
        <p className="interim">{interimTranscript}</p>
      </div>
    </div>
  );
};

Custom UI with Render Props

The React component supports custom UI through render props:

<ReactAWSTranscribe
  credentialsProvider={credentialsProvider}
  onTranscript={handleTranscript}
>
  {({ isListening, isActivelySpeaking, toggleListening }) => (
    <div className="my-custom-ui">
      <button 
        onClick={toggleListening}
        className={isActivelySpeaking ? 'active-speaking' : ''}
      >
        {isListening ? 'Stop' : 'Start'} Listening
      </button>
    </div>
  )}
</ReactAWSTranscribe>

Security Best Practices

Security is critical when working with AWS credentials. The library provides a credentials provider pattern that allows you to implement secure credential handling:

  1. Never hardcode AWS credentials in your client-side code
  2. Always generate temporary credentials server-side with limited permissions
  3. Use secure HTTPS connections for all credential transfers
  4. Implement a credential rotation strategy

Here's an example of a secure server-side implementation using Python/FastAPI:

import os

from fastapi import FastAPI, HTTPException
import aioboto3

app = FastAPI()

# Configure this value based on your requirements
RESERVED_MINUTES = 15

@app.get("/api/aws-transcribe/get-credentials")
async def get_credentials():
    try:
        # Retrieve the IAM role ARN from environment variables
        sts_role_arn = os.environ.get("AWS_TRANSCRIBE_ROLE_ARN")
        if not sts_role_arn:
            raise HTTPException(status_code=500, detail="Role ARN not configured.")

        # Create a session and assume the role
        session = aioboto3.Session()
        async with session.client("sts") as sts:
            assumed_role = await sts.assume_role(
                RoleArn=sts_role_arn,
                RoleSessionName="TranscribeSession",
                DurationSeconds=RESERVED_MINUTES * 60,
            )

        credentials = assumed_role.get("Credentials")
        if not credentials:
            raise HTTPException(status_code=500, detail="Could not assume role.")

        # Return the credentials to the client
        return {
            "accessKeyId": credentials["AccessKeyId"],
            "secretAccessKey": credentials["SecretAccessKey"],
            "sessionToken": credentials["SessionToken"],
            "expiration": credentials["Expiration"],
            "reservedMinutes": RESERVED_MINUTES,
        }
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating credentials: {str(e)}")
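On the client side, the credentials provider you pass to the library can cache these temporary credentials and only call the endpoint again shortly before they expire. The sketch below is illustrative rather than part of the library: it assumes the endpoint from the FastAPI example above, and it assumes the returned payload includes the expiration field (the exact shape of TranscribeCredentials may differ in your setup):

import { TranscribeCredentials } from 'aws-transcribe-client';

// Illustrative caching wrapper: reuse temporary credentials until they are
// about to expire, then fetch a fresh set from the server.
let cached: (TranscribeCredentials & { expiration?: string }) | null = null;

const credentialsProvider = async (minutesUsed: number): Promise<TranscribeCredentials> => {
  // `minutesUsed` is reported by the client and could feed quota tracking on your server.
  const msRemaining = cached?.expiration
    ? new Date(cached.expiration).getTime() - Date.now()
    : 0;

  // Refresh when nothing is cached or less than a minute of validity remains.
  if (!cached || msRemaining < 60_000) {
    const response = await fetch('/api/aws-transcribe/get-credentials');
    if (!response.ok) {
      throw new Error(`Credential request failed with status ${response.status}`);
    }
    cached = await response.json();
  }
  return cached as TranscribeCredentials;
};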

Real-World Use in Co-Author AI

At Co-Author AI, we use the aws-transcribe-client library to enable voice commands for document authoring. Here's how it's integrated:

  1. The transcription client runs in the background, continuously monitoring for voice input
  2. When speech is detected, it's transcribed in real-time and sent to an NLP processing pipeline
  3. The processed commands are used to manipulate the document (e.g., "insert a table with 3 rows and 4 columns")
  4. Voice activity detection ensures transcription only happens when the user is speaking

This approach provides a natural interface for document creation that would be cumbersome with traditional input methods.

Performance Considerations

When integrating speech recognition into your applications, consider these performance aspects:

  1. Browser Compatibility: Test thoroughly across all target browsers, especially Safari
  2. Audio Processing Overhead: The VAD system is lightweight but does require continuous processing
  3. Network Reliability: AWS Transcribe requires a stable internet connection
  4. Cost Management: Implement proper session management to control AWS costs

Conclusion

The aws-transcribe-client library provides a robust, browser-compatible solution for integrating Amazon Transcribe's real-time speech recognition into web applications. Its voice activity detection, intelligent silence handling, and React integration make it particularly well-suited for interactive applications like Co-Author AI.

As voice interfaces become increasingly important in modern applications, tools like this will be essential for developers looking to enhance their user experience with speech recognition.

We welcome contributions and feedback on the project! Visit the GitHub repository to get involved.

About the Author

Kirmanie L. Ravariere is the creator of aws-transcribe-client and the founder of SoFREE LLC, which develops innovative applications including Co-Author AI.
