Transcript Generator

Farhan Khoja

I wanted to figure out how to self-host open-source ASR models and interact with them. So I built an API that generates transcripts using OpenAI's Whisper tiny English-only model, which is available open-source on Hugging Face. This article is about how I worked out the system design of the application.

The implementation can be found here: https://github.com/DEVunderdog/transcript-generator

The technology stack I used:

  • Golang for writing the API server

  • Python for managing the LLMs

So basically we have two services: an API server and a transcript-generation service. Both services are decoupled and communicate via GCP Pub/Sub events.

I built an end-to-end solution by leveraging GCP resources and GitHub Actions for the CI/CD pipelines.

Application Overview

How are we handling Authentication and Authorization?

  • My goal was simply to secure the service, so I went with API keys here.

  • Anybody can register with their email and get an API key.

  • That API key is then used as the authorization mechanism for the other resources, i.e. the endpoints.

  • I kept it simple so that I could focus more on the other parts and keep the overall server lightweight (not technically, but as a design idea). A minimal sketch of the idea is shown after this list.
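
As a rough illustration (the actual API server is written in Go; everything below, including the names and the idea of storing only a hash of each key, is a hypothetical Python sketch and not taken from the repository), the key handling could look like this:

```python
import hashlib
import secrets

# Hypothetical in-memory stand-in for the database table that maps
# hashed API keys to the registered email.
api_keys_by_hash: dict[str, str] = {}

def register(email: str) -> str:
    """Issue a new API key for an email; only its hash is stored."""
    raw_key = secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    api_keys_by_hash[key_hash] = email
    return raw_key  # shown to the user once, never stored in plain text

def authorize(raw_key: str) -> str | None:
    """Resolve an incoming API key to its registered email, or None if unknown."""
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    return api_keys_by_hash.get(key_hash)
```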

What about the API Server?

  • Well, the API server manages users, i.e. the API keys registered via email in the database.

  • The API server provides the endpoints to upload audio files to GCP Cloud Storage.

  • The core idea is to generate the transcript asynchronously: through the API server we trigger, or in other words publish, an event to a GCP Pub/Sub topic saying that we need a transcript of this audio file.

  • The event message payload consists of the email registered with the API key and the object name of the file in GCP Cloud Storage. Publishing such an event is sketched below.
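
The real publisher lives in the Go API server; below is only a minimal Python sketch of publishing such an event with the google-cloud-pubsub client. The project ID, topic name, and the JSON payload shape are assumptions, not taken from the repository.

```python
import json
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-gcp-project", "transcript-requests")

def request_transcript(email: str, object_name: str) -> None:
    """Publish an event asking the transcript service to process one object."""
    payload = json.dumps({"email": email, "object_name": object_name})
    future = publisher.publish(topic_path, data=payload.encode("utf-8"))
    future.result()  # block until Pub/Sub has accepted the message
```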

What about the Transcript Service?

  • The transcript service continuously listens on its GCP Pub/Sub subscription for events.

  • Whenever it receives an event, it downloads the audio file from GCP Cloud Storage to a temporary location.

  • Then it reads the audio file and resamples it to 16 kHz, which is the standard sampling rate these models expect.

  • Then OpenAI's Whisper tiny (English-only) model is instantiated using the Hugging Face transformers library.

  • During instantiation we set the model ID, the device (CPU), and the chunk length.

  • Then, using the transformers library's pipeline function, we pass the file to the pipeline and wait for the results.

  • Once the result is received, we write it to a PDF file.

  • Then we send that PDF via the Gmail SMTP server to the email address we received earlier in the event. A condensed sketch of the transcription step is shown below.
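
Here is a condensed Python sketch of the core transcription step, assuming the openai/whisper-tiny.en checkpoint on Hugging Face and the transformers pipeline API; downloading from Cloud Storage, writing the PDF, and sending the email are omitted, and the helper name is hypothetical.

```python
import librosa                      # pip install librosa
from transformers import pipeline   # pip install transformers torch

# Whisper tiny, English-only checkpoint, running on CPU,
# with long audio split into 30-second chunks.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny.en",
    device=-1,          # -1 = CPU
    chunk_length_s=30,
)

def transcribe(audio_path: str) -> str:
    """Resample the audio to 16 kHz and run it through the Whisper pipeline."""
    waveform, sampling_rate = librosa.load(audio_path, sr=16_000)
    result = asr({"raw": waveform, "sampling_rate": sampling_rate})
    return result["text"]
```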

How are we managing files?

  • Basically, we have endpoints in the API server through which we upload the file using forms.

  • We are tracking those files in the database.

  • Now, storing the file physically and tracking its status are two somewhat different concerns.

  • The file is stored remotely in GCP Cloud Storage at a single location, but classifying those files per user is a separate concern, hence we need to track them in the database.

  • Now there are possibilities of conflicts, like:

    • If we first create a file record in the database and then upload the file to the Cloud Storage bucket, what if the upload fails? Then we have a stale record in our database that doesn't mean anything.

    • If we instead upload the file to the Cloud Storage bucket first and then write the file record to the database, what if the upload succeeds but the database transaction fails? Then we have a stale object in the Cloud Storage bucket that our application knows nothing about.

  • So the idea here is that we need a transaction-like mechanism for managing the remote resource.

  • We are going to use a two-phase-commit style flow combined with the Saga pattern. The file record we track is sketched after this list.
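
As a rough sketch of what we track per file (the actual schema lives in the Go service; the field names here are hypothetical):

```python
import enum
from dataclasses import dataclass

class UploadStatus(str, enum.Enum):
    PENDING = "PENDING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"

@dataclass
class FileRecord:
    id: int
    email: str                    # owner, resolved from the API key
    file_name: str
    object_uri: str | None        # filled in only after the upload succeeds
    upload_status: UploadStatus
    lock_status: bool             # True while an upload is in flight
```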

Two-Phase Commit and Saga Pattern

  • We have three states for a file record in the database: PENDING, FAILED, and SUCCESS.

  • Additionally, we explicitly manage a LOCK_STATUS on the record for concurrent transactions, alongside row-level locking.

  • Initially, when the user uploads a file, we create an empty record with the file metadata in the database, set that record's status to PENDING, and set LOCK_STATUS to True.

  • Then we actually upload the file to the Cloud Storage bucket, and once the upload is successful we update the same record in the database with the object URI of that resource.

  • But how do we ensure that the record isn't tampered with during the upload?

    • We can't hold a row-level lock there, because we don't know how long the upload to the Cloud Storage bucket will take, and holding a lock for that long is bad practice.

    • Instead, we check that the record's LOCK_STATUS is still True and verify when it was last updated. The overall flow is sketched below.
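
A minimal Python sketch of this flow, using an in-memory dict as a stand-in for the database and a dummy upload function (the real implementation is the Go API server talking to the database and Cloud Storage; names below are hypothetical):

```python
import uuid

# Hypothetical in-memory stand-in for the file records table.
file_records: dict[str, dict] = {}

def upload_to_bucket(object_name: str, data: bytes) -> str:
    """Stand-in for the real Cloud Storage upload; returns the object URI."""
    return f"gs://my-bucket/{object_name}"

def upload_file(email: str, file_name: str, data: bytes) -> str:
    # Phase 1: create a PENDING, locked record before touching the bucket.
    record_id = str(uuid.uuid4())
    file_records[record_id] = {
        "email": email,
        "file_name": file_name,
        "object_uri": None,
        "upload_status": "PENDING",
        "lock_status": True,
    }
    try:
        object_uri = upload_to_bucket(f"{email}/{file_name}", data)
    except Exception:
        # Compensating action: mark the record FAILED and release the lock.
        file_records[record_id].update(upload_status="FAILED", lock_status=False)
        raise
    # Phase 2: commit by storing the object URI and releasing the lock.
    file_records[record_id].update(
        object_uri=object_uri, upload_status="SUCCESS", lock_status=False
    )
    return record_id
```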

Scenarios of Conflicts

  • If we create an empty record in the database with the file metadata and the upload to the Cloud Storage bucket fails, we unlock the record and update its status to FAILED.

  • But if we create an empty record in the database, the upload to the Cloud Storage bucket succeeds, and then updating and unlocking the file record in the database fails, the consequence is that we have a locked record in the database and an object in the Cloud Storage bucket, and as long as the record is locked we can't use it.

  • So, to tackle this we use Sync.

Sync

  • It's a cleanup process that removes conflicts.

  • Currently it is triggered manually by calling an API endpoint, but it could also run as a background process (I didn't set it up as a background process just to save on my Cloud Run costs).

  • First, it lists the conflicting file records from the database:

    • Records whose LOCK_STATUS is True and upload status is SUCCESS

    • Records whose LOCK_STATUS is True and upload status is FAILED

    • Records whose LOCK_STATUS is True and upload status is PENDING

    • Records whose LOCK_STATUS is False and upload status is PENDING

    • Records whose LOCK_STATUS is False and upload status is FAILED

  • Note that, since Sync is configured as a manually triggered process, we assume at that point that the database state contains conflicting records, which is why we also include records whose LOCK_STATUS is True and upload status is PENDING. If it were running as a background process we wouldn't consider that case, because due to concurrency such records could belong to uploads that are still in progress.

  • After listing the conflicting files, we fetch the objects from the remote Cloud Storage bucket, but only the ones that belong to that user (because the URI of each object is arranged per user).

  • Then we check which files exist and which don't using Go's map (hash map), which makes the lookups fast.

  • Once that is done, the records whose files aren't among the fetched objects need to be deleted.

  • And the records whose objects were found in the remote bucket need to be unlocked by setting LOCK_STATUS to False and the status to SUCCESS. A sketch of this reconciliation is shown below.
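
The repository does this in Go with a map; the same reconciliation idea as a Python sketch (the record shape and field names are hypothetical) looks like this:

```python
def reconcile(conflicting_records: list[dict], bucket_object_names: list[str]):
    """Split conflicting records into ones to unlock and ones to delete."""
    existing = set(bucket_object_names)  # O(1) lookups, like the Go map
    to_unlock, to_delete = [], []
    for record in conflicting_records:
        if record["object_name"] in existing:
            to_unlock.append(record)   # the object really exists in the bucket
        else:
            to_delete.append(record)   # stale record with no backing object
    return to_unlock, to_delete
```

The caller would then delete the to_delete records from the database and, for the to_unlock ones, set LOCK_STATUS to False and the status to SUCCESS.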

How did we deliver it?

How do we leverage GCP?

CI/CD Pipelines

  • The pipelines are written in GitHub Actions.

  • Local changes are pushed to the remote repository on GitHub.

  • GitHub Actions determines which service changed, since we have a monorepo.

  • It builds Docker images and pushes them to GCP Artifact Registry.

  • Then it deploys those images to Cloud Run.

These are all my learnings while building this project. If you have any queries, please contact me via email.

Thanks :)
