Speech Refiner


In the digital age, the way we communicate in writing matters more than ever. Whether it’s an email, a feedback form, or a social media comment, tone and politeness can make or break a message. That is why I’m sharing Speech Refiner, a free-to-use, lightweight AI tool designed to improve and soften any written communication.
Instead of delving into the mathematics behind the scenes, this post focuses only on the practical implementation and deployment needed to deliver a ready-to-use tool.
Motivation
We’ve all been there - waking up to promotional emails, automated follow-ups, or vague LinkedIn messages asking for “a quick connect.” While many of these messages border on spam, we often still need to respond, especially in professional settings, without sounding rude or dismissive.
In day-to-day communication, whether it's replying to persistent sales emails, addressing unsolicited collaboration requests, or giving constructive feedback to a colleague, tone plays a crucial role. A blunt or emotionally charged message can harm relationships or escalate misunderstandings.
That’s where the idea for Speech Refiner came from:
A tool that helps you express what you mean - firmly if needed, but in a polite, professional tone.
In a world driven by remote communication and asynchronous messaging, being clear without being curt is not just good etiquette; it’s a skill. Speech Refiner aims to make that easier, one sentence at a time.
Q: Why not just use ChatGPT?
A: This tool offers a high degree of customization and full control to the user. It follows a plug-and-play architecture, allowing users to seamlessly integrate any local or preferred language models into the backend as per their requirements.
Project Overview
The system is composed of two parts:
Frontend: An intuitive interface where users can either record their voice or upload a pre-recorded audio clip containing the message they wish to refine. (GitHub documentation)
Backend: A custom API powered by an LLM engine that receives recordings from the frontend and rewrites them with a softened tone, courtesy phrases, and contextual phrasing. (GitHub documentation)
Whether it's a voice note, a spontaneous thought, or a casual remark, Speech Refiner transforms it into a professional, composed version - ready to be used in emails, meetings, or digital conversations. It bridges the gap between natural speech and formal communication, helping users convey their intent with clarity and courtesy.
Application Screenshot
The focus will primarily be on the backend, given the AI-centric nature of our discussion, though we’ll also briefly touch on the frontend toward the end.
Backend
The backbone of the Speech Refiner application is a robust, modular, and production-ready backend service. Designed with a focus on speech-to-text transcription and language refinement, the backend plays a critical role in transforming voice inputs into polished, professional responses.
Let’s take a look at how it works under the hood.
The backend API GitHub repository has been kept private to prevent misuse, such as unauthorized API calls or abuse.
Overview
The Politeness Engine Backend is a RESTful API built using Flask, designed to handle audio files submitted from the frontend. It performs two key tasks:
Transcribes speech from the uploaded audio using a speech recognition engine.
Refines the transcribed text using a Large Language Model (LLM) hosted on Groq.
This backend is lightweight, scalable, and secure, making it ready for integration into real-world applications, whether as a web client, desktop interface, or mobile app.
System Architecture
The core system is organized across three main files:
main.py
Acts as the entry point of the application. It defines the /upload endpoint, handles file uploads, manages temporary file storage, and orchestrates the processing flow.
utils.py
Contains two utility classes (the Transcription class is sketched right after this list):
Transcription: Converts audio input into text using the Google Speech Recognition API via the speech_recognition package.
LLM: Sends the transcribed text to a Groq-hosted LLM API with a predefined prompt, returning an enhanced, more polite version of the message.
requirements.txt
Lists all dependencies required to run the service, including Flask, CORS handling, rate limiting, and audio/LLM utilities.
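Below is a minimal sketch of the Transcription half of utils.py, assuming the speech_recognition package's Recognizer with the free Google Web Speech API. The transcribe() method name is illustrative, and for brevity the file is read with AudioFile rather than the scipy.io.wavfile step described later; the LLM class is sketched in the LLM Integration section further down.

# utils.py (Transcription half) -- minimal sketch; transcribe() is an assumed method name
import speech_recognition as sr


class Transcription:
    """Converts a WAV file into raw text via the Google Speech Recognition API."""

    def __init__(self):
        self.recognizer = sr.Recognizer()

    def transcribe(self, wav_path: str) -> str:
        # Read the whole WAV file and send it to the free Google Web Speech API
        with sr.AudioFile(wav_path) as source:
            audio = self.recognizer.record(source)
        return self.recognizer.recognize_google(audio)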
API Endpoint: /upload
The backend exposes a single public endpoint:
Method: POST
Route: /upload
Content-Type: multipart/form-data
Form Field: audio – accepts audio files (WAV)
Request Flow (a condensed code sketch follows the list):
1. The user uploads an audio file from the frontend interface.
2. The file is saved temporarily in the uploads/ directory.
3. The audio is loaded into memory using scipy.io.wavfile.
4. The Transcription module converts the audio into raw text.
5. The text is truncated to 100 words to manage LLM input token limits.
6. The LLM module sends the text to the Groq API for refinement.
7. The refined output is returned alongside the original transcription.
8. The temporary file is deleted to maintain a clean and secure environment.
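As referenced above, here is a condensed sketch of how main.py might wire these steps together. The /upload route, the audio form field, the uploads/ directory, the 100-word cap, and the Transcription/LLM classes come from the description above; the uuid-based filename and the transcribe() helper are assumptions, the scipy loading step is folded into Transcription for brevity, and error handling is shown separately in the next section.

# main.py -- condensed sketch of the happy-path /upload flow described above
import os
import uuid

from flask import Flask, request, jsonify

from utils import Transcription, LLM  # sketched elsewhere in this post

app = Flask(__name__)
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

transcriber = Transcription()
llm = LLM()


@app.route("/upload", methods=["POST"])
def upload():
    # 1-2. Receive the file and store it temporarily under uploads/
    audio = request.files["audio"]
    path = os.path.join(UPLOAD_DIR, f"{uuid.uuid4()}.wav")
    audio.save(path)

    try:
        # 3-4. Transcribe the audio into raw text
        text = transcriber.transcribe(path)

        # 5. Truncate to 100 words to respect LLM input token limits
        text = " ".join(text.split()[:100])

        # 6-7. Refine via the Groq-hosted LLM and return both versions
        refined = llm.query_llm(text)
        return jsonify({"input": text, "output": refined})
    finally:
        # 8. Delete the temporary file to keep the environment clean
        os.remove(path)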
Successful Response (200 OK)
{
"input": "Raw transcribed text",
"output": "Refined and polite version of the message"
}
Error Handling
If no audio file is provided, a 400 Bad Request is returned with a helpful error message.
In case of unexpected issues (e.g., invalid formats, API failures), a 500 Internal Server Error is returned with the error trace for debugging.
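A sketch of how those two branches could sit around the same route; the error messages and overall structure are assumptions, shown in isolation from the flow above.

# Error-path sketch for /upload; messages and structure are illustrative
from flask import Flask, request, jsonify

app = Flask(__name__)  # stands in for the app defined in main.py


@app.route("/upload", methods=["POST"])
def upload():
    if "audio" not in request.files:
        # Missing form field -> 400 Bad Request with a helpful message
        return jsonify({"error": "No audio file provided in form field 'audio'."}), 400
    try:
        # ... save, transcribe, truncate, refine, clean up (see the flow sketch above)
        return jsonify({"input": "...", "output": "..."})
    except Exception as exc:
        # Invalid formats, API failures, etc. -> 500 with the error trace for debugging
        return jsonify({"error": str(exc)}), 500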
Rate Limiting & Security
To prevent abuse and ensure fair usage, the backend implements IP-based rate limiting using Flask-Limiter. Limits are:
10 requests per minute
150 requests per day
Rate limits are stored in Redis, managed through:
Redis Hosting: Upstash
Configuration: Passed via the REDIS_URI environment variable
Fallback: Defaults to in-memory storage if Redis is unavailable
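A minimal sketch of that setup, assuming Flask-Limiter's standard storage_uri option and the REDIS_URI variable mentioned above; the app here stands in for the one created in main.py.

# Rate-limiting setup sketch: Redis-backed when REDIS_URI is set, in-memory otherwise
import os

from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)  # stands in for the app defined in main.py

limiter = Limiter(
    get_remote_address,  # key requests by client IP
    app=app,
    default_limits=["150 per day", "10 per minute"],
    storage_uri=os.getenv("REDIS_URI", "memory://"),  # Upstash Redis, or in-memory fallback
)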
LLM Integration (Groq API)
The refinement is powered by a Groq-hosted Large Language Model (llama-3.3-70b-versatile):
The transcribed text is passed to the LLM with a carefully crafted prompt.
The LLM.query_llm() method sends this data via a secure API request.
The API key (GROQ_API_KEY) is stored as an environment variable and never exposed to the client.
This ensures that the entire language processing logic remains server-side and protected.
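A hedged sketch of the LLM class from utils.py, assuming Groq's OpenAI-compatible chat-completions endpoint; the prompt wording is illustrative, and only query_llm(), the model name, and the GROQ_API_KEY variable come from the description above.

# utils.py (LLM half) -- sketch of query_llm(); the prompt text is an illustration
import os

import requests

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"


class LLM:
    """Sends transcribed text to a Groq-hosted model and returns the refined version."""

    def query_llm(self, text: str) -> str:
        prompt = (
            "Rewrite the following message so it is polite, professional, and clear, "
            "without changing its meaning:\n\n" + text
        )
        response = requests.post(
            GROQ_URL,
            headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
            json={
                "model": "llama-3.3-70b-versatile",
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]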
Deployment Overview
The backend is deployed on Render, a modern cloud platform for hosting APIs.
Server Stack: Flask app served via Gunicorn, a production-ready WSGI HTTP server.
Environment Variables:
GROQ_API_KEY – LLM API key
REDIS_URI – Redis connection string (via Upstash)
Uploaded audio files are never persisted long-term; they’re deleted immediately after processing.
Key Dependencies
The application uses the following Python libraries:
Flask, flask-cors – REST API and CORS support
Flask-Limiter, redis – Rate limiting infrastructure
speechrecognition, scipy, soundfile, numpy – Audio handling and transcription
requests – Communicating with the Groq API
gunicorn – Serving the app in production
Frontend
The Speech Refiner frontend acts as a clean and intuitive interface for interacting with the backend AI engine. It provides two primary modes of input:
Live voice recording using the browser microphone
Audio file upload (WAV format)
Once an audio input is submitted, it is sent to the /upload endpoint of the backend. Upon successful processing, users receive both the original transcription and the refined, polite version of their message, rendered instantly within the interface.
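For reference, the same contract can be exercised outside the browser; here is a quick sanity check with Python's requests library. The base URL and sample file are placeholders, since the hosted API is private.

# Quick out-of-browser check of the /upload contract (base URL and file are placeholders)
import requests

BASE_URL = "http://localhost:5000"  # assumption: a locally running backend

with open("sample.wav", "rb") as f:
    resp = requests.post(f"{BASE_URL}/upload", files={"audio": f})

resp.raise_for_status()
data = resp.json()
print("Transcription:", data["input"])
print("Refined:", data["output"])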
Tech Stack
HTML + Vanilla JavaScript: Lightweight and dependency-free
Web Audio API & MediaRecorder: For live voice recording
Fetch API: Handles communication with the Flask backend
Packaging for Distribution
To run the app, run:
npx electron .
To build the .exe:
npm install
npx electron-packager . SpeechRefiner --platform=win32 --arch=x64 --icon=favicon.ico --overwrite
This will create a packaged version of the app using Electron Packager or Electron Forge (as configured).
Security Considerations
Since audio data is sensitive, the design ensures that no processing happens on the client side. All voice inputs are securely transmitted to the backend over HTTPS, where transcription and LLM processing take place. The backend API URL is abstracted, and no secrets or tokens are exposed on the frontend.
This separation of concerns ensures a secure and privacy-respecting user experience.
Content Security Policy (CSP) Troubleshooting
Local API vs Hosted API Issues
Issue: The hosted API (http://192.168.x.x:5000) didn’t respond as expected inside Electron.
Cause: The hosted API had a longer response time (~5 sec) with no visual feedback.
Fixes:
Added a Processing... loader during the API call.
Verified API response behavior using Postman/browser.
Unsupported Audio Format (WebM instead of WAV)
Error: File format b'\x1aE\xdf\xa3' not understood. Only 'RIFF' and 'RIFX' supported.
Cause: The MediaRecorder API defaulted to WebM, while Flask expected .wav.
Fix: Switched to recorder.js, which generates proper .wav output.
Invalid WAV Header (nAvgBytesPerSec mismatch)
Error:
WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign
Cause: Some versions of Recorder.js generated incorrect headers.
Fixes:
The CDN version failed due to MIME issues.
Forked versions had broken links.
Manually downloaded a corrected recorder.js from GitHub and loaded it locally.
MIME Type Execution Errors
Error:
Refused to execute script from CDN because its MIME type was 'text/plain'
Fix: Used local version of recorder.js:
<script src="recorder.js"></script>
Electron Warning
Message:
Insecure Content-Security-Policy: no CSP or unsafe-eval used
Strict CSP Attempt
<meta http-equiv="Content-Security-Policy" content="default-src 'self'; script-src 'self'; connect-src http://192.168.x.x:5000;">
Issues:
Inline scripts were blocked.
The microphone stopped working.
The API was hit prematurely, without a file.
Root Cause
Electron apps commonly use inline scripts or libraries requiring relaxed policies.
Strict CSP blocks eval, inline JavaScript, and dynamic execution.
Solutions Attempted
Tried relaxed CSP with:
script-src 'self' 'unsafe-inline'
Inline scripts worked, but reintroduced security risks (e.g., XSS).
Final Decision
A CSP is not applied for now due to dev-time constraints.
Plan:
Keep .exe private.
Share code with API placeholder.
Let users build locally and request API key if needed.
recorder.js is downloaded from here.
Recommendations for Future Deployment
Extract all inline scripts into external files.
Set a strict and secure CSP header.
Remove unsafe-inline and unsafe-eval.
Validate microphone permissions and backend headers for production use.