Integrate the OpenAI Whisper API in Android
Initial Setup
In this section, we'll go through setting up your Android project to integrate with the OpenAI Whisper API using Kotlin and Ktor.
Step 1: Create a New Android Project
Start by creating a new Android project in Android Studio:
Open Android Studio and click on "New Project."
Choose "Empty Compose Activity" or "Empty Activity," and follow the prompts to set up your project.
Once the project is created, navigate to the build.gradle file of your app module.
Step 2: Add Dependencies
To work with Ktor for making network requests, you'll need to add the following dependencies to your build.gradle (Module: app) file:
dependencies {
    // Ktor for Android networking
    implementation("io.ktor:ktor-client-android:2.x.x")
    // Ktor core
    implementation("io.ktor:ktor-client-core:2.x.x")
    // Content negotiation for serializing JSON
    implementation("io.ktor:ktor-client-content-negotiation:2.x.x")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.x.x")
    // Ktor logging for debugging HTTP requests
    implementation("io.ktor:ktor-client-logging:2.x.x")
    // Kotlin serialization for handling JSON data
    implementation("io.ktor:ktor-client-serialization:2.x.x")
}
Replace 2.x.x with the latest stable version of Ktor. You can check the latest version on the Ktor releases page on GitHub.
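For reproducible builds you may prefer to pin the version in one place and reuse it across the dependencies. A minimal sketch, assuming the Kotlin DSL (build.gradle.kts); the version string below is only an example — use the latest stable release:

// build.gradle.kts (Module: app)
// Example version (assumption) — replace with the current stable Ktor release
val ktorVersion = "2.3.12"

dependencies {
    implementation("io.ktor:ktor-client-android:$ktorVersion")
    implementation("io.ktor:ktor-client-core:$ktorVersion")
    implementation("io.ktor:ktor-client-content-negotiation:$ktorVersion")
    implementation("io.ktor:ktor-serialization-kotlinx-json:$ktorVersion")
    implementation("io.ktor:ktor-client-logging:$ktorVersion")
}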
Step 3: Add Kotlin Serialization Plugin
In order to handle JSON serialization seamlessly, you’ll need to add the Kotlin Serialization plugin to your project. Add this line to your build.gradle (Project: your_project_name) file:
plugins {
    id("org.jetbrains.kotlin.plugin.serialization") version "1.9.0"
}
Make sure the plugin version matches your Kotlin version.
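Depending on how your project is organized, you may instead declare the plugin at the project level with apply false and apply it in the app module where your @Serializable classes live. A minimal sketch, assuming the Kotlin DSL:

// build.gradle.kts (Project) — declare the plugin version without applying it here
plugins {
    id("org.jetbrains.kotlin.plugin.serialization") version "1.9.0" apply false
}

// build.gradle.kts (Module: app) — apply it in the module that contains the serializable classes
plugins {
    id("org.jetbrains.kotlin.plugin.serialization")
}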
After adding the dependencies and plugin, sync your Gradle project.
This setup is the foundation for making HTTP requests with Ktor and handling JSON responses using Kotlin serialization. In the next sections, we will focus on interacting with the Whisper API to send and receive audio transcriptions.
Setting Up the Whisper API Client
In this section, we'll set up a class to interact with the OpenAI Whisper API for audio transcription. We'll also define a reusable function to configure the HttpClient that will handle the network requests. This class will accept an audio file, send it to the Whisper API, and return the transcription result.
Step 1: Define the Whisper API Response Data
First, we need to define the data model for the transcription result we expect to receive from the Whisper API. The model includes the transcription text (mapped to a property named summary here), the detected language, the audio duration, and the segmented transcription data.
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable

@Serializable
data class TranscriptionResult(
    @SerialName("text") val summary: String,
    @SerialName("language") val language: String,
    @SerialName("duration") val duration: Double,
    @SerialName("segments") val segments: List<Segment>,
)

@Serializable
data class Segment(
    @SerialName("id") val id: Int,
    @SerialName("seek") val seek: Int,
    @SerialName("start") val start: Double,
    @SerialName("end") val end: Double,
    @SerialName("text") val text: String,
)
This data model covers the fields we need from the verbose_json response returned by the Whisper API; extra fields in the response (such as per-segment token data) are ignored because we'll configure the Json instance with ignoreUnknownKeys = true.
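As a quick sanity check, you can decode a hand-written, abridged sample of the verbose_json shape into these classes. A minimal sketch — the JSON literal below is illustrative only, not a real API response:

import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

fun main() {
    // Illustrative payload — a real response contains more fields per segment
    val sample = """
        {
          "text": "Hello world.",
          "language": "english",
          "duration": 1.7,
          "segments": [
            { "id": 0, "seek": 0, "start": 0.0, "end": 1.7, "text": "Hello world.", "avg_logprob": -0.2 }
          ]
        }
    """.trimIndent()

    // ignoreUnknownKeys lets us skip fields we didn't model, such as avg_logprob
    val json = Json { ignoreUnknownKeys = true }
    val result = json.decodeFromString<TranscriptionResult>(sample)
    println(result.segments.first().text)
}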
Step 2: Create the Whisper API Class
Now, let's define the WhisperApiImpl class, which will handle the API call for sending audio files and receiving transcription results. This class will use Ktor's HttpClient to make a POST request with the audio file and retrieve the transcription result.
import io.ktor.client.*
import io.ktor.client.call.*
import io.ktor.client.plugins.*
import io.ktor.client.plugins.contentnegotiation.*
import io.ktor.client.plugins.logging.*
import io.ktor.client.request.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import io.ktor.serialization.kotlinx.json.*
import kotlinx.serialization.json.Json

private const val baseUrl = "https://api.openai.com/v1/audio/transcriptions"

// Don't ship a hardcoded key in production — load it from a secure source instead
private const val apiKey = "YOUR_API_KEY"

class WhisperApiImpl(
    private val client: HttpClient
) : WhisperApi {

    override suspend fun transcribe(
        bytes: ByteArray,
        filename: String
    ): Result<TranscriptionResult> = runCatching {
        val response: HttpResponse =
            client.submitFormWithBinaryData(url = baseUrl, formData = formData {
                append("file", bytes, audioHeaders(filename))
                append("model", "whisper-1")
                append("response_format", "verbose_json")
            }) {
                bearerAuth(apiKey)
            }
        response.body<TranscriptionResult>()
    }

    private fun audioHeaders(filename: String) = Headers.build {
        append(HttpHeaders.ContentDisposition, "filename=\"$filename\"")
    }
}
In this class, the transcribe function sends an audio file as binary data to the Whisper API. We specify the "model" as "whisper-1" and set the response_format to "verbose_json" to receive detailed transcription data, including segments. The API key is passed using the bearerAuth function.
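Beyond model and response_format, the transcription endpoint also accepts optional form fields such as language and prompt, which can improve accuracy; adding them is just a matter of extra append calls in the same form body. A minimal sketch — the helper name transcriptionFormData and the sample values are illustrative, not part of the class above:

import io.ktor.client.request.forms.*
import io.ktor.http.*
import io.ktor.http.content.*

// Builds the same multipart body as transcribe, plus optional fields the endpoint supports
fun transcriptionFormData(bytes: ByteArray, filename: String): List<PartData> = formData {
    append("file", bytes, Headers.build {
        append(HttpHeaders.ContentDisposition, "filename=\"$filename\"")
    })
    append("model", "whisper-1")
    append("response_format", "verbose_json")
    append("language", "en")              // optional ISO-639-1 hint for the spoken language
    append("prompt", "Android tutorial")  // optional context that can guide the transcription
}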
Step 3: Create a Reusable HttpClient
Next, let's define a function that returns a configured HttpClient instance. This client will handle the HTTP requests and responses, including timeouts, logging, and content negotiation for JSON responses.
// Example timeout values (assumption) — tune these for your audio length and network conditions
private const val REQUEST_TIMEOUT = 60_000L
private const val CONNECT_TIMEOUT = 30_000L

fun createHttpClient(json: Json) = HttpClient {
    install(ContentNegotiation) {
        json(json)
    }
    install(Logging) {
        level = LogLevel.ALL
        logger = object : Logger {
            override fun log(message: String) {
                // Napier is a separate logging dependency (io.github.aakira:napier);
                // you can use android.util.Log or println here instead
                Napier.d(message = message, tag = "WhisperApi")
            }
        }
    }
    install(HttpTimeout) {
        requestTimeoutMillis = REQUEST_TIMEOUT
        connectTimeoutMillis = CONNECT_TIMEOUT
    }
    // Note: submitFormWithBinaryData sets the multipart/form-data Content-Type (including the
    // boundary) on each request itself, so we don't set Content-Type as a default header here —
    // Ktor treats it as an engine-controlled header.
}
This HttpClient is configured with:
Content Negotiation: To handle JSON data using Kotlinx serialization.
Logging: To log HTTP requests and responses for debugging.
Timeout: To handle request and connection timeouts.
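Logging everything is handy during development but noisy (and potentially sensitive) in release builds, so you may want to gate the log level on the build type. A minimal sketch: replace the install(Logging) block above with something like the following, assuming your app module's generated BuildConfig is imported:

install(Logging) {
    // Full wire logging in debug builds only; silent in release builds
    level = if (BuildConfig.DEBUG) LogLevel.ALL else LogLevel.NONE
    logger = object : Logger {
        override fun log(message: String) {
            Napier.d(message = message, tag = "WhisperApi")
        }
    }
}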
Step 4: Implement the API Interface
We define a WhisperApi interface that will be implemented by the WhisperApiImpl class. This makes the API class easier to mock or extend in the future.
interface WhisperApi {
    suspend fun transcribe(bytes: ByteArray, filename: String): Result<TranscriptionResult>
}
This function returns a Result<TranscriptionResult>, which encapsulates both success and failure cases.
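Because the rest of the app depends only on this interface, you can substitute a fake while building the UI or writing tests. A minimal sketch — the FakeWhisperApi name and canned values are illustrative:

class FakeWhisperApi : WhisperApi {
    // Returns a fixed result immediately, so UI and tests don't need network access
    override suspend fun transcribe(bytes: ByteArray, filename: String): Result<TranscriptionResult> =
        Result.success(
            TranscriptionResult(
                summary = "This is a canned transcription for previews and tests.",
                language = "english",
                duration = 3.2,
                segments = listOf(
                    Segment(id = 0, seek = 0, start = 0.0, end = 3.2, text = "This is a canned transcription.")
                )
            )
        )
}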
In the next section, we'll explore how to integrate this API into your application, including sending audio files and handling the transcription results from the Whisper API.
Using the Whisper API
Before you start, ensure you have added the Internet permission to your AndroidManifest.xml, as the Whisper API requires network access:
<uses-permission android:name="android.permission.INTERNET" />
In this section, we'll integrate the WhisperApiImpl into your MainActivity and display the transcription result using Jetpack Compose. It's essential to ensure that the URI and filename of the audio file are correct and that the file has the proper extension (e.g., .mp3, .m4a, etc.).
Step 1: Integrate Whisper API in MainActivity
In MainActivity, you'll create an instance of WhisperApiImpl, set up UI elements, and use Jetpack Compose to display the transcription result.
Ensure that the audio file’s URI and filename are valid before sending it to the Whisper API. If the URI or filename is incorrect, the Whisper API will reject the file.
Here’s how to set it up:
import android.net.Uri
import android.os.Bundle
import android.widget.Toast
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.compose.foundation.layout.*
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.tooling.preview.Preview
import androidx.compose.ui.unit.dp
import io.ktor.client.*
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import kotlinx.serialization.json.Json

class MainActivity : ComponentActivity() {

    private lateinit var whisperApi: WhisperApi
    private val json = Json { ignoreUnknownKeys = true }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Initialize the API with the HTTP client
        whisperApi = WhisperApiImpl(createHttpClient(json))
        setContent {
            WhisperApp()
        }
    }

    @Composable
    fun WhisperApp() {
        var transcription by remember { mutableStateOf("Transcription will appear here.") }
        val coroutineScope = rememberCoroutineScope()

        Column(
            modifier = Modifier
                .fillMaxSize()
                .padding(16.dp)
        ) {
            Text(
                text = transcription,
                modifier = Modifier
                    .fillMaxWidth()
                    .weight(1f),
                style = MaterialTheme.typography.bodyMedium
            )
            Spacer(modifier = Modifier.height(8.dp))
            Button(
                onClick = {
                    // Call the Whisper API using the file URI
                    val fileUri = getAudioFileUri()
                    val filename = "recorded_audio.m4a" // Ensure correct file extension

                    // Validate the URI and filename before sending to the API
                    if (fileUri != null) {
                        coroutineScope.launch {
                            // Read the file off the main thread to avoid blocking the UI
                            val fileBytes = withContext(Dispatchers.IO) { uriToByteArray(fileUri) }
                            val result = whisperApi.transcribe(fileBytes, filename)
                            result.onSuccess {
                                transcription = it.summary
                            }.onFailure {
                                Toast.makeText(this@MainActivity, "Transcription failed", Toast.LENGTH_SHORT).show()
                            }
                        }
                    } else {
                        Toast.makeText(this@MainActivity, "Invalid URI or filename", Toast.LENGTH_SHORT).show()
                    }
                }
            ) {
                Text(text = "Start Transcription")
            }
        }
    }

    // Mock function to get audio file URI (replace with actual implementation)
    private fun getAudioFileUri(): Uri? {
        // Here you can return a URI from cache or local file directory
        return null
    }

    // Function to convert URI to ByteArray
    private fun uriToByteArray(uri: Uri): ByteArray {
        contentResolver.openInputStream(uri)?.use { inputStream ->
            return inputStream.readBytes()
        }
        return ByteArray(0)
    }
}
Step 2: UI Breakdown
Text Display: The transcription result is displayed in a Text composable, which is updated once the transcription is received from the API.
Button: A button triggers the transcription process by sending the audio file to the Whisper API.
Validation: Before making the API call, the code ensures that both the URI and the filename (with a correct extension) are valid.
Step 3: Important Points
Correct URI: Ensure that the URI points to the correct audio file. This can be obtained from the app cache, a file picker, or other sources.
Correct Filename with Extension: The filename must include the appropriate extension (e.g., .m4a, .mp3). Incorrect file extensions may result in the Whisper API rejecting the file. Rather than hardcoding the name, you can read it from the URI, as shown in the sketch after this list.
File Conversion: Ensure that the file is converted to a ByteArray correctly. The uriToByteArray function handles this by reading the file’s content as bytes before sending it to the API.
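If the audio comes from a file picker or another app, the display name (and therefore the extension) can be read from the ContentResolver instead of being hardcoded. A minimal sketch — the helper name getFilenameFromUri is illustrative, and the fallback filename is an assumption:

import android.content.Context
import android.net.Uri
import android.provider.OpenableColumns

// Queries the content provider for the user-visible file name, e.g. "voice_note.m4a"
fun getFilenameFromUri(context: Context, uri: Uri): String {
    context.contentResolver.query(uri, arrayOf(OpenableColumns.DISPLAY_NAME), null, null, null)?.use { cursor ->
        val nameIndex = cursor.getColumnIndex(OpenableColumns.DISPLAY_NAME)
        if (cursor.moveToFirst() && nameIndex >= 0) {
            return cursor.getString(nameIndex)
        }
    }
    // Fallback (assumption): default to an .m4a name if the provider doesn't expose one
    return "recorded_audio.m4a"
}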
Step 4: Testing the Transcription
After setting everything up, run the app. When the Start Transcription button is pressed:
The app checks if the file URI and filename are valid.
The app reads the audio file as a ByteArray.
The app sends the file to the Whisper API.
The transcription result is displayed in the Text composable if the request is successful.
Wrap-Up
In this guide, we successfully integrated OpenAI's Whisper API into an Android app using Kotlin. We covered the project setup, built a reusable Whisper API service, and displayed the transcription results using Jetpack Compose. Key takeaways include ensuring the correct file URI and filename are used for accurate API requests. With this foundation, you're set to expand the app by adding features like progress indicators, handling multiple audio formats, and improving error handling.
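For example, a loading indicator only needs one extra piece of state that you set to true before calling transcribe and back to false afterwards. A minimal sketch — the TranscribeButton composable is illustrative and not part of the code above:

import androidx.compose.foundation.layout.size
import androidx.compose.material3.Button
import androidx.compose.material3.CircularProgressIndicator
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun TranscribeButton(isLoading: Boolean, onClick: () -> Unit) {
    // Disable the button and show a spinner while a transcription request is in flight
    Button(onClick = onClick, enabled = !isLoading) {
        if (isLoading) {
            CircularProgressIndicator(
                modifier = Modifier.size(20.dp),
                strokeWidth = 2.dp
            )
        } else {
            Text(text = "Start Transcription")
        }
    }
}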
This setup provides a simple yet powerful way to integrate audio transcription into your Android apps.
Written by
Asim Latif
I’m Asim, a Software Engineer II (Mobile) at Vyro AI. I specialize in building AI-powered mobile applications, with a focus on photo editing, image generation, and music creation tools. Passionate about the intersection of AI and mobile technology, I’m constantly exploring innovative solutions to deliver seamless and creative user experiences. Let’s connect and share ideas on AI, mobile development, and tech innovation!