Integrate the OpenAI Whisper API in Android
Initial Setup
In this section, we'll go through setting up your Android project to integrate with the OpenAI Whisper API using Kotlin and Ktor.
Step 1: Create a New Android Project
Start by creating a new Android project in Android Studio:
Open Android Studio and click on "New Project."
Choose "Empty Compose Activity" or "Empty Activity," and follow the prompts to set up your project.
Once the project is created, navigate to the build.gradle file of your app module.
Step 2: Add Dependencies
To work with Ktor for making network requests, you'll need to add the following dependencies to your build.gradle (Module: app) file:
dependencies {
    // Ktor for Android networking
    implementation("io.ktor:ktor-client-android:2.x.x")
    // Ktor core
    implementation("io.ktor:ktor-client-core:2.x.x")
    // Content negotiation for serializing JSON
    implementation("io.ktor:ktor-client-content-negotiation:2.x.x")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.x.x")
    // Ktor logging for debugging HTTP requests
    implementation("io.ktor:ktor-client-logging:2.x.x")
    // Kotlin serialization for handling JSON data
    implementation("io.ktor:ktor-client-serialization:2.x.x")
}
Replace 2.x.x with the latest stable version of Ktor. You can check the latest version on the Ktor releases page on GitHub.
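For reproducible builds you may prefer to pin the version in one place and reuse it across the dependencies. A minimal sketch, assuming the Kotlin DSL (build.gradle.kts); the version string below is only an example — use the latest stable release:

// build.gradle.kts (Module: app)
// Example version (assumption) — replace with the current stable Ktor release
val ktorVersion = "2.3.12"

dependencies {
    implementation("io.ktor:ktor-client-android:$ktorVersion")
    implementation("io.ktor:ktor-client-core:$ktorVersion")
    implementation("io.ktor:ktor-client-content-negotiation:$ktorVersion")
    implementation("io.ktor:ktor-serialization-kotlinx-json:$ktorVersion")
    implementation("io.ktor:ktor-client-logging:$ktorVersion")
}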
Step 3: Add Kotlin Serialization Plugin
In order to handle JSON serialization seamlessly, you’ll need to add the Kotlin Serialization plugin to your project. Add this line to your build.gradle (Project: your_project_name) file:
plugins {
    id("org.jetbrains.kotlin.plugin.serialization") version "1.9.0"
}
Make sure the plugin version matches your Kotlin version.
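Depending on how your project is organized, you may instead declare the plugin at the project level with apply false and apply it in the app module where your @Serializable classes live. A minimal sketch, assuming the Kotlin DSL:

// build.gradle.kts (Project) — declare the plugin version without applying it here
plugins {
    id("org.jetbrains.kotlin.plugin.serialization") version "1.9.0" apply false
}

// build.gradle.kts (Module: app) — apply it in the module that contains the serializable classes
plugins {
    id("org.jetbrains.kotlin.plugin.serialization")
}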
After adding the dependencies and plugin, sync your Gradle project.
This setup is the foundation for making HTTP requests with Ktor and handling JSON responses using Kotlin serialization. In the next sections, we will focus on interacting with the Whisper API to send and receive audio transcriptions.
Setting Up the Whisper API Client
In this section, we'll set up a class to interact with the OpenAI Whisper API for audio transcription. We'll also define a reusable function to configure the HttpClient that will handle the network requests. This class will accept an audio file, send it to the Whisper API, and return the transcription result.
Step 1: Define the Whisper API Response Data
First, we need to define the data model for the transcription result we expect to receive from the Whisper API. The model includes the transcription text (mapped to a property named summary here), the detected language, the audio duration, and the segmented transcription data.
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable

@Serializable
data class TranscriptionResult(
    @SerialName("text") val summary: String,
    @SerialName("language") val language: String,
    @SerialName("duration") val duration: Double,
    @SerialName("segments") val segments: List<Segment>,
)

@Serializable
data class Segment(
    @SerialName("id") val id: Int,
    @SerialName("seek") val seek: Int,
    @SerialName("start") val start: Double,
    @SerialName("end") val end: Double,
    @SerialName("text") val text: String,
)
This data model covers the fields we need from the verbose_json response returned by the Whisper API; extra fields in the response (such as per-segment token data) are ignored because we'll configure the Json instance with ignoreUnknownKeys = true.
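As a quick sanity check, you can decode a hand-written, abridged sample of the verbose_json shape into these classes. A minimal sketch — the JSON literal below is illustrative only, not a real API response:

import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

fun main() {
    // Illustrative payload — a real response contains more fields per segment
    val sample = """
        {
          "text": "Hello world.",
          "language": "english",
          "duration": 1.7,
          "segments": [
            { "id": 0, "seek": 0, "start": 0.0, "end": 1.7, "text": "Hello world.", "avg_logprob": -0.2 }
          ]
        }
    """.trimIndent()

    // ignoreUnknownKeys lets us skip fields we didn't model, such as avg_logprob
    val json = Json { ignoreUnknownKeys = true }
    val result = json.decodeFromString<TranscriptionResult>(sample)
    println(result.segments.first().text)
}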
Step 2: Create the Whisper API Class
Now, let's define the WhisperApiImpl class, which will handle the API call for sending audio files and receiving transcription results. This class will use Ktor's HttpClient to make a POST request with the audio file and retrieve the transcription result.
import io.ktor.client.*
import io.ktor.client.call.*
import io.ktor.client.plugins.*
import io.ktor.client.plugins.contentnegotiation.*
import io.ktor.client.plugins.logging.*
import io.ktor.client.request.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import io.ktor.serialization.kotlinx.json.*
import kotlinx.serialization.json.Json

private const val baseUrl = "https://api.openai.com/v1/audio/transcriptions"

// Don't ship a hardcoded key in production — load it from a secure source instead
private const val apiKey = "YOUR_API_KEY"

class WhisperApiImpl(
    private val client: HttpClient
) : WhisperApi {

    override suspend fun transcribe(
        bytes: ByteArray,
        filename: String
    ): Result<TranscriptionResult> = runCatching {
        val response: HttpResponse =
            client.submitFormWithBinaryData(url = baseUrl, formData = formData {
                append("file", bytes, audioHeaders(filename))
                append("model", "whisper-1")
                append("response_format", "verbose_json")
            }) {
                bearerAuth(apiKey)
            }
        response.body<TranscriptionResult>()
    }

    private fun audioHeaders(filename: String) = Headers.build {
        append(HttpHeaders.ContentDisposition, "filename=\"$filename\"")
    }
}
In this class, the transcribe function sends an audio file as binary data to the Whisper API. We specify the "model" as "whisper-1" and set the response_format to "verbose_json" to receive detailed transcription data, including segments. The API key is passed using the bearerAuth function.
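Beyond model and response_format, the transcription endpoint also accepts optional form fields such as language and prompt, which can improve accuracy; adding them is just a matter of extra append calls in the same form body. A minimal sketch — the helper name transcriptionFormData and the sample values are illustrative, not part of the class above:

import io.ktor.client.request.forms.*
import io.ktor.http.*
import io.ktor.http.content.*

// Builds the same multipart body as transcribe, plus optional fields the endpoint supports
fun transcriptionFormData(bytes: ByteArray, filename: String): List<PartData> = formData {
    append("file", bytes, Headers.build {
        append(HttpHeaders.ContentDisposition, "filename=\"$filename\"")
    })
    append("model", "whisper-1")
    append("response_format", "verbose_json")
    append("language", "en")              // optional ISO-639-1 hint for the spoken language
    append("prompt", "Android tutorial")  // optional context that can guide the transcription
}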
Step 3: Create a Reusable HttpClient
Next, let's define a function that returns a configured HttpClient instance. This client will handle the HTTP requests and responses, including timeouts, logging, and content negotiation for JSON responses.
// Example timeout values (assumption) — tune these for your audio length and network conditions
private const val REQUEST_TIMEOUT = 60_000L
private const val CONNECT_TIMEOUT = 30_000L

fun createHttpClient(json: Json) = HttpClient {
    install(ContentNegotiation) {
        json(json)
    }
    install(Logging) {
        level = LogLevel.ALL
        logger = object : Logger {
            override fun log(message: String) {
                // Napier is a separate logging dependency (io.github.aakira:napier);
                // you can use android.util.Log or println here instead
                Napier.d(message = message, tag = "WhisperApi")
            }
        }
    }
    install(HttpTimeout) {
        requestTimeoutMillis = REQUEST_TIMEOUT
        connectTimeoutMillis = CONNECT_TIMEOUT
    }
    // Note: submitFormWithBinaryData sets the multipart/form-data Content-Type (including the
    // boundary) on each request itself, so we don't set Content-Type as a default header here —
    // Ktor treats it as an engine-controlled header.
}
This HttpClient is configured with:
Content Negotiation: To handle JSON data using Kotlinx serialization.
Logging: To log HTTP requests and responses for debugging.
Timeout: To handle request and connection timeouts.
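Logging everything is handy during development but noisy (and potentially sensitive) in release builds, so you may want to gate the log level on the build type. A minimal sketch: replace the install(Logging) block above with something like the following, assuming your app module's generated BuildConfig is imported:

install(Logging) {
    // Full wire logging in debug builds only; silent in release builds
    level = if (BuildConfig.DEBUG) LogLevel.ALL else LogLevel.NONE
    logger = object : Logger {
        override fun log(message: String) {
            Napier.d(message = message, tag = "WhisperApi")
        }
    }
}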
Step 4: Implement the API Interface
We define a WhisperApi interface that will be implemented by the WhisperApiImpl class. This makes the API class easier to mock or extend in the future.
interface WhisperApi {
    suspend fun transcribe(bytes: ByteArray, filename: String): Result<TranscriptionResult>
}
This function returns a Result<TranscriptionResult>, which encapsulates both success and failure cases.
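Because the rest of the app depends only on this interface, you can substitute a fake while building the UI or writing tests. A minimal sketch — the FakeWhisperApi name and canned values are illustrative:

class FakeWhisperApi : WhisperApi {
    // Returns a fixed result immediately, so UI and tests don't need network access
    override suspend fun transcribe(bytes: ByteArray, filename: String): Result<TranscriptionResult> =
        Result.success(
            TranscriptionResult(
                summary = "This is a canned transcription for previews and tests.",
                language = "english",
                duration = 3.2,
                segments = listOf(
                    Segment(id = 0, seek = 0, start = 0.0, end = 3.2, text = "This is a canned transcription.")
                )
            )
        )
}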
In the next section, we'll explore how to integrate this API into your application, including sending audio files and handling the transcription results from the Whisper API.
Using the Whisper API
Before you start, ensure you have added the Internet permission to your AndroidManifest.xml, as the Whisper API requires network access:
<uses-permission android:name="android.permission.INTERNET" />
In this section, we'll integrate the WhisperApiImpl into your MainActivity and display the transcription result using Jetpack Compose. It's essential to ensure that the URI and filename of the audio file are correct and that the file has the proper extension (e.g., .mp3, .m4a, etc.).
Step 1: Integrate Whisper API in MainActivity
In MainActivity, you'll create an instance of WhisperApiImpl, set up UI elements, and use Jetpack Compose to display the transcription result.
Ensure that the audio file’s URI and filename are valid before sending it to the Whisper API. If the URI or filename is incorrect, the Whisper API will reject the file.
Here’s how to set it up:
import android.net.Uri
import android.os.Bundle
import android.widget.Toast
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.compose.foundation.layout.*
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.tooling.preview.Preview
import androidx.compose.ui.unit.dp
import io.ktor.client.*
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import kotlinx.serialization.json.Json

class MainActivity : ComponentActivity() {

    private lateinit var whisperApi: WhisperApi
    private val json = Json { ignoreUnknownKeys = true }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Initialize the API with the HTTP client
        whisperApi = WhisperApiImpl(createHttpClient(json))
        setContent {
            WhisperApp()
        }
    }

    @Composable
    fun WhisperApp() {
        var transcription by remember { mutableStateOf("Transcription will appear here.") }
        val coroutineScope = rememberCoroutineScope()

        Column(
            modifier = Modifier
                .fillMaxSize()
                .padding(16.dp)
        ) {
            Text(
                text = transcription,
                modifier = Modifier
                    .fillMaxWidth()
                    .weight(1f),
                style = MaterialTheme.typography.bodyMedium
            )
            Spacer(modifier = Modifier.height(8.dp))
            Button(
                onClick = {
                    // Call the Whisper API using the file URI
                    val fileUri = getAudioFileUri()
                    val filename = "recorded_audio.m4a" // Ensure correct file extension

                    // Validate the URI and filename before sending to the API
                    if (fileUri != null) {
                        coroutineScope.launch {
                            // Read the file off the main thread to avoid blocking the UI
                            val fileBytes = withContext(Dispatchers.IO) { uriToByteArray(fileUri) }
                            val result = whisperApi.transcribe(fileBytes, filename)
                            result.onSuccess {
                                transcription = it.summary
                            }.onFailure {
                                Toast.makeText(this@MainActivity, "Transcription failed", Toast.LENGTH_SHORT).show()
                            }
                        }
                    } else {
                        Toast.makeText(this@MainActivity, "Invalid URI or filename", Toast.LENGTH_SHORT).show()
                    }
                }
            ) {
                Text(text = "Start Transcription")
            }
        }
    }

    // Mock function to get audio file URI (replace with actual implementation)
    private fun getAudioFileUri(): Uri? {
        // Here you can return a URI from cache or local file directory
        return null
    }

    // Function to convert URI to ByteArray
    private fun uriToByteArray(uri: Uri): ByteArray {
        contentResolver.openInputStream(uri)?.use { inputStream ->
            return inputStream.readBytes()
        }
        return ByteArray(0)
    }
}
Step 2: UI Breakdown
Text Display: The transcription result is displayed in a Text composable, which is updated once the transcription is received from the API.
Button: A button triggers the transcription process by sending the audio file to the Whisper API.
Validation: Before making the API call, the code ensures that both the URI and the filename (with a correct extension) are valid.
Step 3: Important Points
Correct URI: Ensure that the URI points to the correct audio file. This can be obtained from the app cache, a file picker, or other sources.
Correct Filename with Extension: The filename must include the appropriate extension (e.g., .m4a, .mp3). Incorrect file extensions may result in the Whisper API rejecting the file. Rather than hardcoding the name, you can read it from the URI, as shown in the sketch after this list.
File Conversion: Ensure that the file is converted to a ByteArray correctly. The uriToByteArray function handles this by reading the file’s content as bytes before sending it to the API.
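If the audio comes from a file picker or another app, the display name (and therefore the extension) can be read from the ContentResolver instead of being hardcoded. A minimal sketch — the helper name getFilenameFromUri is illustrative, and the fallback filename is an assumption:

import android.content.Context
import android.net.Uri
import android.provider.OpenableColumns

// Queries the content provider for the user-visible file name, e.g. "voice_note.m4a"
fun getFilenameFromUri(context: Context, uri: Uri): String {
    context.contentResolver.query(uri, arrayOf(OpenableColumns.DISPLAY_NAME), null, null, null)?.use { cursor ->
        val nameIndex = cursor.getColumnIndex(OpenableColumns.DISPLAY_NAME)
        if (cursor.moveToFirst() && nameIndex >= 0) {
            return cursor.getString(nameIndex)
        }
    }
    // Fallback (assumption): default to an .m4a name if the provider doesn't expose one
    return "recorded_audio.m4a"
}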
Step 4: Testing the Transcription
After setting everything up, run the app. When the Start Transcription button is pressed:
The app checks if the file URI and filename are valid.
The app reads the audio file as a ByteArray.
The app sends the file to the Whisper API.
The transcription result is displayed in the Text composable if the request is successful.
Wrap-Up
In this guide, we successfully integrated OpenAI's Whisper API into an Android app using Kotlin. We covered the project setup, built a reusable Whisper API service, and displayed the transcription results using Jetpack Compose. Key takeaways include ensuring the correct file URI and filename are used for accurate API requests. With this foundation, you're set to expand the app by adding features like progress indicators, handling multiple audio formats, and improving error handling.
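For example, a loading indicator only needs one extra piece of state that you set to true before calling transcribe and back to false afterwards. A minimal sketch — the TranscribeButton composable is illustrative and not part of the code above:

import androidx.compose.foundation.layout.size
import androidx.compose.material3.Button
import androidx.compose.material3.CircularProgressIndicator
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun TranscribeButton(isLoading: Boolean, onClick: () -> Unit) {
    // Disable the button and show a spinner while a transcription request is in flight
    Button(onClick = onClick, enabled = !isLoading) {
        if (isLoading) {
            CircularProgressIndicator(
                modifier = Modifier.size(20.dp),
                strokeWidth = 2.dp
            )
        } else {
            Text(text = "Start Transcription")
        }
    }
}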
This setup provides a simple yet powerful way to integrate audio transcription into your Android apps.
Written by
Asim Latif
I’m Asim, a Software Engineer II (Mobile) at Vyro AI. I specialize in building AI-powered mobile applications, with a focus on photo editing, image generation, and music creation tools. Passionate about the intersection of AI and mobile technology, I’m constantly exploring innovative solutions to deliver seamless and creative user experiences. Let’s connect and share ideas on AI, mobile development, and tech innovation!