Contributing to ChatCraft

Introduction

Open-source contributions often come with challenges that test your problem-solving skills. In this post, I’ll discuss how I added audio file transcription to ChatCraft, a ChatGPT clone for developers. The feature leverages OpenAI's speech-to-text API to process audio uploads and convert them into text messages.

The Problem

ChatCraft already allowed users to upload documents (e.g., PDFs, Word files), which were processed into text and displayed as chat messages. However, audio files weren’t supported.

The goal was to:

  1. Enable audio file uploads.

  2. Convert the audio to text using OpenAI’s speech-to-text API.

  3. Display the transcription in the chat as a message that users can then prompt against.

Challenges

1. Understanding the Codebase

The ChatCraft codebase is modular, with responsibilities spread across several files and components. While modularity improves scalability, it posed significant challenges:

  • Navigating Dependencies: Tracing through multiple files to understand how file imports, AI interaction, and chat message generation were interconnected.

  • Learning Existing Workflows: For example, PDF files were processed in the use-file-import.tsx file, converting them to text via an external API and displaying the content in the chat. I needed to replicate this flow for audio files while keeping the code consistent.
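
To make this concrete, here’s a rough sketch of that per-type branching; `importFile` and `pdfToText` are hypothetical stand-ins, not ChatCraft’s actual identifiers:

```typescript
// Rough sketch of the per-type branching in use-file-import.tsx.
// importFile and pdfToText are illustrative names, not the real ones.
async function pdfToText(file: File): Promise<string> {
  // Stand-in for the external PDF-to-text API call
  return `(extracted text of ${file.name})`;
}

async function importFile(file: File): Promise<string> {
  if (file.type === "application/pdf") {
    return pdfToText(file); // PDF flow: external API → text → chat message
  }
  if (file.type.startsWith("text/")) {
    return file.text(); // plain text is read directly
  }
  throw new Error(`Unsupported file type: ${file.type}`);
}
```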

2. Integrating Audio Transcription

The transcription functionality for audio files already existed in src/lib/speech-recognition.ts as part of the speech-to-text feature. However, the problem lay in its implementation:

  • The transcription logic was tied to the SpeechRecognition class, which required initialization with a model and client.

  • This initialization depended on React hooks like useModels, which couldn’t be directly accessed in non-React files like use-file-import.tsx.
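
In sketch form, the coupling looked roughly like this (SpeechRecognition and useModels are from the codebase; the types and constructor shape here are simplified assumptions):

```typescript
// Simplified sketch: the real class lives in src/lib/speech-recognition.ts.
// OpenAIClient and Model are stand-in types, not the actual ones.
type OpenAIClient = { transcribe(audio: File): Promise<string> };
type Model = { id: string };

class SpeechRecognition {
  constructor(
    private client: OpenAIClient,
    private model: Model
  ) {}

  async transcribe(audio: File): Promise<string> {
    // Uses the client and model it was constructed with
    return this.client.transcribe(audio);
  }
}

// Fine inside a React component, where hooks are available:
//   const { model } = useModels();
//   const stt = new SpeechRecognition(client, model);
// But the import logic in use-file-import.tsx couldn't call useModels,
// so it had no clean way to construct SpeechRecognition.
```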

3. Fragmented Logic

To make audio transcription work, I initially had to split the logic across components:

  • The file-import logic returned an empty string for audio files, bypassing the usual flow.

  • The transcription was handled downstream in a React component, where the necessary hooks were available.
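
Sketched with hypothetical names, the interim split looked something like this:

```typescript
// Interim workaround (illustrative names): the import layer punts on
// audio, and a React component finishes the job downstream.
declare function processAsText(file: File): Promise<string>; // existing flow

async function importFile(file: File): Promise<string> {
  if (file.type.startsWith("audio/")) {
    return ""; // bypass: transcription happens later, outside this flow
  }
  return processAsText(file); // PDFs, text files, etc.
}

// Downstream, inside a component where hooks work:
//   const { model } = useModels();
//   if (file.type.startsWith("audio/")) {
//     const text = await new SpeechRecognition(client, model).transcribe(file);
//     appendChatMessage(text); // hypothetical helper
//   }
```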

While this approach worked, it introduced problems:

  1. Missed Trigger Points: The progress indicator for file processing didn’t activate for audio uploads.

  2. Code Inconsistency: The logic for handling audio files was fragmented and diverged from how other file types (like PDFs) were processed.

4. Refactoring for Scalability

The challenge wasn’t just implementing the feature—it was doing so in a scalable, maintainable way. The existing code structure wasn’t well-suited for background processes like transcription, which required decoupling logic from the UI.

I consulted the maintainers at this stage, and their detailed feedback helped me understand, and then execute, the design they wanted from a software architecture point of view.

Implementation

Step 1: Refactor AI Logic

To enable audio transcription, I needed to refactor the AI logic. This involved:

  • Moving the existing AI interaction logic out of React hooks into a separate service.

  • Ensuring the service could be reused for both file uploads and background tasks like transcription.
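
A minimal sketch of the shape this refactor takes, assuming the API key can be read from stored settings instead of a hook; `createClient` and the storage key are illustrative, while `audio.transcriptions.create` is the OpenAI SDK’s documented transcription call:

```typescript
import OpenAI from "openai";

// Hook-free transcription service: callable from use-file-import.tsx
// and from background tasks alike. createClient and the settings key
// are hypothetical; the real app has its own settings plumbing.
function createClient(): OpenAI {
  const apiKey = localStorage.getItem("openai-api-key") ?? "";
  return new OpenAI({ apiKey, dangerouslyAllowBrowser: true });
}

export async function transcribeAudio(audio: File): Promise<string> {
  const client = createClient();
  const result = await client.audio.transcriptions.create({
    file: audio,
    model: "whisper-1",
  });
  return result.text;
}
```

With the logic in a plain module, the file-import code can simply `await transcribeAudio(file)` and push the result through the same text pipeline that other formats already use.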

Step 2: Add Audio File Support

Frontend Changes

  • File Input Update: Modified the file upload component to accept audio formats such as .mp3 and .wav.
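
As a simplified sketch (not ChatCraft’s exact markup), the change boils down to widening the input’s accept list:

```tsx
import * as React from "react";

// Simplified sketch: audio MIME types added alongside the document
// formats that were already accepted. Handler and prop names are illustrative.
function FileUploadInput({ onFile }: { onFile: (file: File) => void }) {
  return (
    <input
      type="file"
      accept=".pdf,.docx,.txt,audio/mpeg,audio/wav"
      onChange={(e) => {
        const file = e.target.files?.[0];
        if (file) onFile(file);
      }}
    />
  );
}
```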

Backend Changes

  • Audio Transcription Service: Created a new backend service to process audio files:

    1. Receive the uploaded audio file.

    2. Send it to OpenAI’s API for transcription.

    3. Format the transcription and return it to the frontend.

  • Error Handling: Managed scenarios like API failures, large file uploads, or unsupported formats.
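
Here’s a hedged sketch of such a service, assuming an Express route and the OpenAI Node SDK; the route path, field name, and size limit are illustrative choices rather than ChatCraft’s actual values:

```typescript
import express from "express";
import multer from "multer";
import OpenAI, { toFile } from "openai";

const app = express();
// Keep uploads in memory; OpenAI's transcription endpoint caps files at 25 MB
const upload = multer({ limits: { fileSize: 25 * 1024 * 1024 } });
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical route; the real integration may be wired differently.
app.post("/api/transcribe", upload.single("audio"), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: "No audio file uploaded" });
    }
    // Wrap the buffer so the SDK can stream it to the API
    const file = await toFile(req.file.buffer, req.file.originalname);
    const result = await openai.audio.transcriptions.create({
      file,
      model: "whisper-1",
    });
    res.json({ text: result.text });
  } catch (err) {
    // API failures (bad key, unsupported format, etc.) land here;
    // multer rejects oversized uploads before this handler runs
    res.status(502).json({ error: "Transcription failed" });
  }
});

app.listen(3000);
```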

Step 3: Integration and Testing

After implementing the feature, I tested it thoroughly:

  • Uploaded various audio files to ensure accurate transcription.

  • Checked the chat UI for consistent formatting of transcribed messages.

The Result

Now, ChatCraft users can upload audio files, which are automatically transcribed into text and displayed in the chat. This feature aligns seamlessly with the existing document upload workflow, enhancing ChatCraft's functionality for developers.

Here’s the issue and pull request for this feature:

What I Learned

  1. Navigating Modular Code: Tracing through interdependent modules taught me how to break down complex systems into manageable parts.

  2. The Value of Refactoring: Moving logic out of hooks made the codebase more scalable and maintainable.

  3. Integrating APIs: Using OpenAI’s API deepened my understanding of handling external services in web applications.

Conclusion

Contributing to ChatCraft was a rewarding experience, and I’m excited to take on more challenges in the open-source world!
