Speech Refiner

Tuhin Dutta
8 min read

In the digital age, the way we communicate in writing matters more than ever. Whether it’s an email, a feedback form, or a social media comment, tone and politeness can make or break a message. That is why Speech Refiner exists: a free-to-use, lightweight AI tool designed to improve and soften any written communication.

Instead of delving into the mathematics behind the scenes, this post focuses only on the practical implementation and deployment needed to deliver a ready-to-use tool.

Motivation

We’ve all been there - waking up to promotional emails, automated follow-ups, or vague LinkedIn messages asking for “a quick connect.” While many of these messages border on spam, we often still need to respond, especially in professional settings, without sounding rude or dismissive.

In day-to-day communication, whether it's replying to persistent sales emails, addressing unsolicited collaboration requests, or giving constructive feedback to a colleague, tone plays a crucial role. A blunt or emotionally charged message can harm relationships or escalate misunderstandings.

That’s where the idea for Speech Refiner came from:

A tool that helps you express what you mean - firmly if needed, but in a polite, professional tone.

In a world driven by remote communication and asynchronous messaging, being clear without being curt is not just good etiquette; it’s a skill. Speech Refiner aims to make that easier, one sentence at a time.

Q: Why not just use ChatGPT?
A: This tool offers a high degree of customization and full control to the user. It follows a plug-and-play architecture, allowing users to seamlessly integrate any local or preferred language model into the backend as per their requirements.

Project Overview

The system is composed of two parts:

  • Frontend: An intuitive interface where users can either record their voice or upload a pre-recorded audio clip containing the message they wish to refine. (GitHub documentation)

  • Backend: A custom API powered by an LLM engine that receives recordings from the frontend, transcribes them, and rewrites the message with a softened tone, courtesy phrases, and contextual phrasing. (GitHub documentation)

Whether it's a voice note, a spontaneous thought, or a casual remark, Speech Refiner transforms it into a professional, composed version - ready to be used in emails, meetings, or digital conversations. It bridges the gap between natural speech and formal communication, helping users convey their intent with clarity and courtesy.

Application Screenshot

(Screenshot of the Speech Refiner interface)

The focus will primarily be on the backend, given the AI-centric nature of our discussion, though we’ll also briefly touch on the frontend toward the end.

Backend

The backbone of the Speech Refiner application is a robust, modular, and production-ready backend service. Designed with a focus on speech-to-text transcription and language refinement, the backend plays a critical role in transforming voice inputs into polished, professional responses.

Let’s take a look at how it works under the hood.

The backend API GitHub repository has been kept private to prevent misuse, such as unauthorized API calls or abuse.

Overview

The Politeness Engine Backend is a RESTful API built using Flask, designed to handle audio files submitted from the frontend. It performs two key tasks:

  1. Transcribes speech from the uploaded audio using a speech recognition engine.

  2. Refines the transcribed text using a Large Language Model (LLM) hosted on Groq.

This backend is lightweight, scalable, and secure, making it ready for integration into real-world applications, whether as a web client, desktop interface, or mobile app.

System Architecture

The core system is organized across three main files:

  • main.py
    Acts as the entry point of the application. It defines the /upload endpoint, handles file uploads, manages temporary file storage, and orchestrates the processing flow.

  • utils.py
    Contains two utility classes (both are sketched below):

    • Transcription: Converts audio input into text using the Google Speech Recognition API via the speech_recognition package.

    • LLM: Sends the transcribed text to a Groq-hosted LLM API with a predefined prompt, returning an enhanced, more polite version of the message.

  • requirements.txt
    Lists all dependencies required to run the service, including Flask, CORS handling, rate limiting, and audio/LLM utilities.
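
Since the backend repository is kept private, here is a rough sketch of what the Transcription utility could look like, based purely on the description above. The class and method names are assumptions, not the actual code; the LLM counterpart is sketched later in the LLM Integration section.

# utils.py -- illustrative sketch only; the real implementation is private
import speech_recognition as sr


class Transcription:
    """Converts a WAV recording into raw text via Google Speech Recognition."""

    def __init__(self):
        self.recognizer = sr.Recognizer()

    def transcribe(self, wav_path: str) -> str:
        # Read the whole WAV file into an AudioData object.
        with sr.AudioFile(wav_path) as source:
            audio = self.recognizer.record(source)
        # Uses the free Google Web Speech API exposed by the speech_recognition package.
        return self.recognizer.recognize_google(audio)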

API Endpoint: /upload

The backend exposes a single public endpoint:

  • Method: POST

  • Route: /upload

  • Content-Type: multipart/form-data

  • Form Field: audio - Accepts audio files (WAV)

  • Request Flow:

    1. User uploads an audio file from the frontend interface.

    2. File is saved temporarily in the uploads/ directory.

    3. Audio is loaded into memory using scipy.io.wavfile.

    4. The Transcription module converts the audio into raw text.

    5. The text is truncated to 100 words to manage LLM input token limits.

    6. The LLM module sends the text to the Groq API for refinement.

    7. Refined output is returned alongside the original transcription.

    8. The temporary file is deleted to maintain a clean and secure environment.
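
Put together, that flow might look roughly like the sketch below. The real main.py is private, so the names are assumptions; the actual code also loads the audio through scipy.io.wavfile, which is omitted here for brevity.

# main.py -- illustrative sketch of the /upload flow (not the private production code)
import os
import traceback
from flask import Flask, request, jsonify
from utils import Transcription, LLM   # class names assumed from the section above

app = Flask(__name__)
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

transcriber = Transcription()
llm = LLM()

@app.route("/upload", methods=["POST"])
def upload():
    # Steps 1-2: receive the file and store it temporarily.
    if "audio" not in request.files:
        return jsonify({"error": "No audio file provided"}), 400
    audio_file = request.files["audio"]
    path = os.path.join(UPLOAD_DIR, audio_file.filename)
    audio_file.save(path)
    try:
        # Steps 3-4: transcribe the WAV file into raw text.
        text = transcriber.transcribe(path)
        # Step 5: truncate to 100 words to respect the LLM input budget.
        text = " ".join(text.split()[:100])
        # Steps 6-7: refine via the Groq-hosted LLM and return both versions.
        refined = llm.query_llm(text)
        return jsonify({"input": text, "output": refined})
    except Exception:
        # Unexpected failure: return the trace for debugging (see Error Handling below).
        return jsonify({"error": traceback.format_exc()}), 500
    finally:
        # Step 8: always delete the temporary file.
        if os.path.exists(path):
            os.remove(path)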

Successful Response (200 OK)

{
  "input": "Raw transcribed text",
  "output": "Refined and polite version of the message"
}
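
For reference, any HTTP client can exercise the same contract. Here is a minimal example using the Python requests library (the URL is a placeholder for the deployed backend):

import requests

# Placeholder URL -- replace with the deployed backend address.
URL = "https://<your-backend-host>/upload"

with open("message.wav", "rb") as f:
    resp = requests.post(URL, files={"audio": f})

resp.raise_for_status()
data = resp.json()
print("Transcription:", data["input"])
print("Refined:", data["output"])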

Error Handling

  • If no audio file is provided, a 400 Bad Request is returned with a helpful error message.

  • In case of unexpected issues (e.g., invalid formats, API failures), a 500 Internal Server Error is returned with the error trace for debugging.

Rate Limiting & Security

To prevent abuse and ensure fair usage, the backend implements IP-based rate limiting using Flask-Limiter. Limits are:

  • 10 requests per minute

  • 150 requests per day

Rate limits are stored in Redis, managed through:

  • Redis Hosting: Upstash

  • Configuration: Passed via REDIS_URI environment variable

  • Fallback: Defaults to in-memory storage if Redis is unavailable
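
Wired up with Flask-Limiter, that configuration could look roughly like the following sketch (the options shown are standard Flask-Limiter parameters; the exact wiring in the private repo may differ):

# Rate-limiting sketch -- assumed wiring, not the private production code
import os
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)   # in the real service this is the app defined in main.py

limiter = Limiter(
    key_func=get_remote_address,                      # IP-based limiting
    app=app,
    default_limits=["150 per day", "10 per minute"],
    # Use Upstash Redis when REDIS_URI is set; otherwise fall back to in-memory storage.
    storage_uri=os.environ.get("REDIS_URI", "memory://"),
)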

LLM Integration (Groq API)

The refinement is powered by a Groq-hosted Large Language Model (llama-3.3-70b-versatile):

  • Transcribed text is passed to the LLM with a carefully crafted prompt.

  • The LLM.query_llm() method sends this data via a secure API request.

  • The API key (GROQ_API_KEY) is stored as an environment variable and never exposed to the client.

This ensures that the entire language processing logic remains server-side and protected.
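
Groq exposes an OpenAI-compatible chat-completions endpoint, so LLM.query_llm() plausibly boils down to a request like the sketch below. The prompt text here is a placeholder, not the project’s carefully crafted production prompt, and the class layout is assumed:

# LLM class sketch -- the real prompt and error handling live in the private repo
import os
import requests

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"


class LLM:
    def __init__(self, model="llama-3.3-70b-versatile"):
        self.model = model
        self.api_key = os.environ["GROQ_API_KEY"]   # never shipped to the client

    def query_llm(self, text: str) -> str:
        payload = {
            "model": self.model,
            "messages": [
                # Placeholder prompt -- stands in for the actual refinement prompt.
                {"role": "system", "content": "Rewrite the user's message politely and professionally."},
                {"role": "user", "content": text},
            ],
        }
        resp = requests.post(
            GROQ_URL,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json=payload,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]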

Deployment Overview

The backend is deployed on Render, a modern cloud platform for hosting APIs.

  • Server Stack: Flask app served via Gunicorn, a production-ready WSGI HTTP server.

  • Environment Variables:

    • GROQ_API_KEY – LLM API key

    • REDIS_URI – Redis connection string (via Upstash)

Uploaded audio files are never persisted long-term—they’re deleted immediately after processing.

Key Dependencies

The application uses the following Python libraries:

  • Flask, flask-cors – REST API and CORS support

  • Flask-Limiter, redis – Rate limiting infrastructure

  • speechrecognition, scipy, soundfile, numpy – Audio handling and transcription

  • requests – Communicating with the Groq API

  • gunicorn – Serving the app in production

Frontend

GitHub Repository

The Speech Refiner frontend acts as a clean and intuitive interface for interacting with the backend AI engine. It provides two primary modes of input:

  • Live voice recording using the browser microphone

  • Audio file upload (WAV format)

Once an audio input is submitted, it is sent to the /upload endpoint of the backend. Upon successful processing, users receive the original transcription and the refined, polite version of their message, rendered instantly within the interface.

Tech Stack

  • HTML + Vanilla JavaScript: Lightweight and dependency-free

  • Web Audio API & MediaRecorder: For live voice recording

  • Fetch API: Handles communication with the Flask backend

Packaging for Distribution

  • To run the app locally:

      npx electron .
    
  • To build the .exe:

      npm install
      npx electron-packager . SpeechRefiner --platform=win32 --arch=x64 --icon=favicon.ico --overwrite
    

    This will create a packaged version of the app using Electron Packager or Electron Forge (as configured).

Security Considerations

Since audio data is sensitive, the design ensures that no processing happens on the client side. All voice inputs are securely transmitted to the backend over HTTPS, where transcription and LLM processing take place. The backend API URL is abstracted, and no secrets or tokens are exposed on the frontend.

This separation of concerns ensures a secure and privacy-respecting user experience.

Content Security Policy (CSP) Troubleshooting

  1. Local API vs Hosted API Issues

    • Issue: Hosted API (http://192.168.x.x:5000) didn’t respond as expected inside Electron.

    • Cause: Hosted API had longer response time (~5 sec) with no visual feedback.

    • Fixes:

      • Added a Processing... loader during API call.

      • Verified API response behavior using Postman/browser.

  2. Unsupported Audio Format (WebM instead of WAV)

    • Error:

        File format b'\x1aE\xdf\xa3' not understood. Only 'RIFF' and 'RIFX' supported.
      
    • Cause: MediaRecorder API defaulted to WebM; Flask expected .wav.

    • Fix: Switched to recorder.js which generates proper .wav output.

  3. Invalid WAV Header (nAvgBytesPerSec mismatch)

    • Error:

        WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign
      
    • Cause: Some versions of Recorder.js generated incorrect headers.

    • Fixes:

      • Loading recorder.js from a CDN failed due to MIME issues.

      • Forked versions had broken download links.

      • Finally, a corrected recorder.js was downloaded manually from GitHub and loaded locally.

  4. MIME Type Execution Errors

    • Error:

        Refused to execute script from CDN because its MIME type was 'text/plain'
      
    • Fix: Used local version of recorder.js:

        <script src="recorder.js"></script>
      
  5. Electron Warning

    • Message:

        Insecure Content-Security-Policy: no CSP or unsafe-eval used
      
    • Strict CSP Attempt

        <meta http-equiv="Content-Security-Policy" content="default-src 'self'; script-src 'self'; connect-src http://192.168.x.x:5000;">
      
      • Issue:

        • Inline scripts blocked.

        • Microphone stopped.

        • API hit prematurely without file.

      • Root Cause

        • Electron apps commonly use inline scripts or libraries requiring relaxed policies.

        • Strict CSP blocks eval, inline JavaScript, and dynamic execution.

      • Solutions Attempted

        • Tried relaxed CSP with:

            script-src 'self' 'unsafe-inline'
          
        • Inline scripts worked, but reintroduced security risks (e.g., XSS).

      • Final Decision

        • CSP not applied now due to dev-time constraints.

        • Plan:

          • Keep .exe private.

          • Share code with API placeholder.

          • Let users build locally and request API key if needed.

recorder.js is downloaded from here.

Recommendations for Future Deployment

  • Extract all inline scripts into external files.

  • Set a strict and secure CSP header.

  • Remove unsafe-inline and unsafe-eval.

  • Validate microphone permissions and backend headers for production use.
