Building a Local Speech-to-Speech Translator in Java


Introduction
Creating a local speech-to-speech translator is now more accessible than ever, thanks to advancements in open-source speech recognition, translation, and text-to-speech technologies. In this blog, I'll walk you through how I built a fully local Speech-to-Speech Translator in Java, leveraging cutting-edge tools like Whisper for speech recognition, a multilingual LLM for translation, and Edge TTS for natural-sounding speech synthesis—all running on your own machine.
Project repository: local-speech-to-speech-translator on GitHub
1. Project Overview
The goal of this project is to enable real-time speech translation from English to French, entirely on your local system. The pipeline consists of:
Speech-to-Text: Convert spoken English to text using Whisper, run locally.
Translation: Translate the recognized English text to French using a local LLM (LLaMAX3-8B-Alpaca-GGUF) running on Ollama.
Text-to-Speech: Synthesize French speech from the translated text using Edge TTS, integrated with JavaFX.
2. Whisper Speech-to-Text: Local and Efficient
OpenAI's Whisper is a state-of-the-art speech recognition model. To use Whisper locally from Java, I integrated it through the Java Native Interface (JNI), which lets Java code call native libraries written in C or C++. The WhisperJNI bindings expose the model to Java this way.
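// Load the native library once at startup and silence its logging
WhisperJNI.loadLibrary();
WhisperJNI.setLibraryLogger(null);
var whisper = new WhisperJNI();
// Load the model file (large-v3, GGML format) from the user's home directory
whisperContext = whisper.init(Path.of(System.getProperty("user.home"), "ggml-large-v3.bin"));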
How it works:
The Java application records audio input from the user.
Through JNI, the Java code calls the locally installed Whisper model to transcribe the audio to English text.
This approach ensures all processing stays on your machine, maintaining privacy and reducing latency.
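The recording step above can be sketched as follows. Whisper models work on 16 kHz mono audio given as floating-point samples, while a Java microphone line delivers raw PCM bytes, so a conversion is needed in between. This is a minimal sketch, not the project's actual code: the class name WhisperAudio and the method pcm16ToFloat are hypothetical, and it assumes the microphone is opened with 16-bit signed little-endian PCM.

```java
import javax.sound.sampled.AudioFormat;

/** Helpers for preparing microphone audio for Whisper (hypothetical names). */
final class WhisperAudio {

    /** 16 kHz, 16-bit, mono, signed, little-endian PCM. */
    static final AudioFormat FORMAT = new AudioFormat(16_000f, 16, 1, true, false);

    private WhisperAudio() {}

    /**
     * Converts signed 16-bit little-endian PCM bytes to floats in [-1.0, 1.0).
     * A trailing odd byte, if any, is ignored.
     */
    static float[] pcm16ToFloat(byte[] pcm) {
        float[] samples = new float[pcm.length / 2];
        for (int i = 0; i < samples.length; i++) {
            int lo = pcm[2 * i] & 0xFF;   // low byte, treated as unsigned
            int hi = pcm[2 * i + 1];      // high byte keeps its sign
            samples[i] = ((hi << 8) | lo) / 32768f;
        }
        return samples;
    }
}
```

The bytes themselves would typically come from a TargetDataLine obtained with AudioSystem.getTargetDataLine(WhisperAudio.FORMAT), and the resulting float array is what gets handed to the transcription call.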
3. Translation: Multilingual LLM with Ollama
For translation, I use the mrjacktung/mradermacher-llamax3-8b-alpaca-gguf model running on Ollama. LLaMAX3-8B is a multilingual language model, fine-tuned for translation tasks and supporting over 100 languages, including English and French.
> ollama pull mrjacktung/mradermacher-llamax3-8b-alpaca-gguf
> ollama serve
Why LLaMAX3-8B-Alpaca-GGUF?
It provides robust translation quality, outperforming similarly sized models on benchmarks like Flores-101.
Runs locally via Ollama, so once the model has been downloaded, no internet connection or cloud API is required.
Easy integration with Java: The Java app sends the English text to Ollama's API, specifying English as the source and French as the target language. The model returns the French translation.
Sample translation logs:
2025-06-18T22:40:26.472+02:00 DEBUG 42692 --- [Semantic Sardine] [nuous-processor] d.l.http.client.log.LoggingHttpClient : HTTP request:
- method: POST
- url: http://localhost:11434/api/chat
- headers: [Content-Type: application/json]
- body: {
"model" : "mrjacktung/mradermacher-llamax3-8b-alpaca-gguf",
"messages" : [ {
"role" : "system",
"content" : "You are an English-to-French translation assistant.\n\nStrict rules:\n1. Always translate the input text from English to French only. Never translate to any other language.\n2. If the input is not in English, or is ambiguous, return an empty string.\n3. Output only the French translation as plain text. Do not include any explanations, comments, formatting, quotes, or repetition of the input.\n4. Never apologize, never explain, and never provide meta-comments.\n5. If the input contains both English and non-English, translate only the English parts and ignore the rest.\n6. After translating, always verify that your output is in French. If it is not French, return an empty string.\n7. Under no circumstances should you output any language but French.\n\nExamples:\n Input: Hello, how are you?\n Output: Bonjour, comment ça va ?\n\n Input: What time is it?\n Output: Quelle heure est-il ?\n\n Input: This is a test.\n Output: Ceci est un test.\n\n Input: Bonjour, comment ça va ? (already in French)\n Output:\n\n Input: 你好,你会说英语吗? (Chinese)\n Output:\n\n Input: Hello, bonjour, how are you? (mixed English and French)\n Output: Bonjour, comment ça va ?\n\n Input: (empty input)\n Output:"
}, {
"role" : "user",
"content" : "A tiny door appears in my bedroom window."
} ],
"options" : {
"temperature" : 0.0
},
"stream" : false
}
2025-06-18T22:40:28.272+02:00 DEBUG 42692 --- [Semantic Sardine] [nuous-processor] d.l.http.client.log.LoggingHttpClient : HTTP response:
- status code: 200
- headers: [Content-Length: 386], [Date: Wed, 18 Jun 2025 20:40:28 GMT], [Content-Type: application/json; charset=utf-8]
- body: {"model":"mrjacktung/mradermacher-llamax3-8b-alpaca-gguf","created_at":"2025-06-18T20:40:28.266696Z","message":{"role":"assistant","content":"Une petite porte apparaît dans ma fenêtre de chambre."},"done_reason":"stop","done":true,"total_duration":1775358458,"load_duration":35716958,"prompt_eval_count":299,"prompt_eval_duration":1301636250,"eval_count":14,"eval_duration":435475000}
The Java code handles this interaction seamlessly, making the translation step transparent to the end user.
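For readers who want to reproduce the request shown in the logs without any client library, here is a minimal sketch using only the JDK's java.net.http package. It is not the project's code: the class OllamaTranslator and its methods are hypothetical names, the JSON is built and read by hand to stay dependency-free, and a real application would more likely use a JSON library. The URL, model name, and request fields are taken from the log above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Dependency-free client for Ollama's chat endpoint (hypothetical class). */
final class OllamaTranslator {
    static final String CHAT_URL = "http://localhost:11434/api/chat";
    static final String MODEL = "mrjacktung/mradermacher-llamax3-8b-alpaca-gguf";

    private OllamaTranslator() {}

    /** Escapes a string for use inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the same request shape as in the log: system + user message, temperature 0, no streaming. */
    static String buildChatBody(String systemPrompt, String userText) {
        return "{\"model\":\"" + jsonEscape(MODEL) + "\","
            + "\"messages\":["
            + "{\"role\":\"system\",\"content\":\"" + jsonEscape(systemPrompt) + "\"},"
            + "{\"role\":\"user\",\"content\":\"" + jsonEscape(userText) + "\"}],"
            + "\"options\":{\"temperature\":0.0},"
            + "\"stream\":false}";
    }

    /** Reads the first "content" string of a response body (the assistant message). */
    static String extractContent(String json) {
        String key = "\"content\":\"";
        int start = json.indexOf(key);
        if (start < 0) throw new IllegalArgumentException("no content field in response");
        StringBuilder out = new StringBuilder();
        for (int i = start + key.length(); i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '"') return out.toString();   // unescaped quote ends the string
            if (c != '\\') { out.append(c); continue; }
            char e = json.charAt(++i);             // character after the backslash
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    out.append((char) Integer.parseInt(json.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default: out.append(e);            // covers \" \\ and \/
            }
        }
        throw new IllegalArgumentException("unterminated content string");
    }

    /** Sends the English text to the local Ollama server and returns the model's reply. */
    static String translate(String systemPrompt, String englishText)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(CHAT_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(buildChatBody(systemPrompt, englishText)))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Ollama returned HTTP " + response.statusCode());
        }
        return extractContent(response.body());
    }
}
```

Calling OllamaTranslator.translate(systemPrompt, "A tiny door appears in my bedroom window.") requires the Ollama server to be running with the model available. Setting the temperature to 0.0, as the logged request does, keeps translations as repeatable as possible.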
4. Text-to-Speech: JavaFX & Edge TTS
The final step is converting the translated French text into natural-sounding speech. For this, I use Edge TTS, Microsoft's text-to-speech technology, which supports over 300 voices in 40+ languages.
Integration in Java:
The JavaFX library provides the GUI and audio playback capabilities.
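Edge TTS is accessed from Java through the tts-edge-java library, which lets you select a French voice, adjust the speed, and synthesize speech from the translated text. One caveat: Edge TTS is Microsoft's online speech service, so unlike the other two stages, this step sends the translated text over the network.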
<dependency>
    <groupId>io.github.whitemagic2014</groupId>
    <artifactId>tts-edge-java</artifactId>
    <version>1.2.6</version>
</dependency>
The resulting audio is played back to the user via the JavaFX application.
5. Bringing It All Together
The entire workflow runs locally:
User speaks in English.
Whisper (via JNI) transcribes speech to English text.
Java app sends text to LLaMAX3-8B-Alpaca-GGUF (Ollama) for translation to French.
Translated French text is synthesized to speech using Edge TTS and played to the user.
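The four steps above can be wired together with three small interfaces, one per stage. This is a sketch under assumptions rather than the project's actual design: SpeechToText, Translator, TextToSpeech, and TranslationPipeline are hypothetical names, and each stage would be backed by WhisperJNI, Ollama, and Edge TTS respectively. The sketch also handles the case from the system prompt where the model returns an empty string for non-English input, by skipping speech synthesis.

```java
import java.util.Optional;

/** Pipeline stages as single-method interfaces (hypothetical names). */
interface SpeechToText { String transcribe(float[] samples); }
interface Translator { String translate(String englishText); }
interface TextToSpeech { void speak(String frenchText); }

/** Chains the three stages for one recorded utterance. */
final class TranslationPipeline {
    private final SpeechToText stt;
    private final Translator translator;
    private final TextToSpeech tts;

    TranslationPipeline(SpeechToText stt, Translator translator, TextToSpeech tts) {
        this.stt = stt;
        this.translator = translator;
        this.tts = tts;
    }

    /** Returns the French text that was spoken, or empty if there was nothing to say. */
    Optional<String> process(float[] samples) {
        String english = stt.transcribe(samples);
        if (english == null || english.isBlank()) {
            return Optional.empty();               // silence or failed recognition
        }
        String french = translator.translate(english.strip());
        if (french == null || french.isBlank()) {
            return Optional.empty();               // model declined, e.g. non-English input
        }
        String cleaned = french.strip();
        tts.speak(cleaned);
        return Optional.of(cleaned);
    }
}
```

Because each stage is an interface, any of them can be replaced with a lambda in tests, or with a different model or language pair later, without touching the rest of the pipeline.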
This architecture ensures:
Privacy: Your recorded audio and the English transcript never leave your machine.
Speed: No network latency for inference.
Flexibility: Swap out models or languages as needed.
6. Why Java?
Java remains a powerful choice for cross-platform applications, offering robust libraries for audio, GUI (JavaFX), and native integration (JNI). This project demonstrates that even advanced AI workflows can be effectively orchestrated in Java, making it a viable option for modern, privacy-focused language tools.
7. Try It Yourself
Check out the GitHub repository for setup instructions, code samples, and details on running the translator locally. Contributions and feedback are welcome!
Written by Ravi Ranjan
