Active Speaker Detection using Swift for iOS and other Apple Platforms

Carlos Mbendera

Hello there! In this article I’ll walk you through an implementation of Active Speaker Detection I created this past semester. One important heads up: I was primarily interested in the audio aspect of this project, so the Computer Vision and Face Tracking component of my implementation doesn’t receive nearly as much attention as the other parts. I would greatly appreciate any concerns, suggestions or contributions in the comments.

Anyways, let’s start coding.

Link to the final project: https://github.com/carlosmbe/ActiveSpeakerDetectionStarter. The relevant files are in the ASDFiles folder.

An Overview - How Does It Work?

My current implementation uses a combination of Speech Diarization and Transcription Models paired with Apple’s Vision framework. The algorithm, in broad strokes, does the following:

  1. Run Speech Diarization on the Video Clip to generate time ranges for when each speaker’s talking

  2. Use Speech Transcription to identify time stamps for when words are being spoken

  3. Using the transcription time stamps, use Vision to identify which faces are talking and log their respective positions

  4. Repeat Step 3 until we’re confident that we’ve identified the speaker, then match that ID with the time ranges from Step 1’s Diarization Model

  5. Use the positions obtained from the Vision framework to do some really cool Spatial Development stuff

In this article, I will focus on steps 1 through 4, since once we know who is talking and where they are, how you use that information is a pretty subjective choice.
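To keep the big picture in mind, here’s a minimal sketch of what the pipeline looks like once everything is wired together. The type and method names mirror the classes we’ll build over the rest of this article (SpeechAnalyzer, VisionAnalyzer and the coordinator), so treat it as a map rather than finished code.

import Foundation

// Bird's-eye view of the pipeline, using the classes built later in this article.
// This is a sketch of the flow, not a drop-in replacement for the coordinator below.
func runActiveSpeakerDetection(videoURL: URL) async throws {
    let speech = SpeechAnalyzer()
    let vision = VisionAnalyzer(videoURL: videoURL)

    // Steps 1 & 2: diarization (who spoke when) and transcription (word timestamps)
    speech.prepareAudio(sourceURL: videoURL)
    while !speech.isAudioReady {                      // wait for the WAV conversion to finish
        try await Task.sleep(nanoseconds: 100_000_000)
    }
    let utterances = try await speech.performSpeechRecognition()
    let speakers = await speech.performSpeakerDiarization()

    // Step 3: detect and track faces, estimating mouth openness per sampled frame
    let faces = try await vision.detectAndTrackFaces()

    // Step 4: pair word timestamps with diarized speakers, then match speakers to faces
    let matched = speech.matchUtterancesToSpeakers(utterances: utterances, speakerProfiles: speakers)
    print("Matched \(matched.count) utterances across \(speakers.count) speakers and \(faces.count) faces")
}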

Speech Models - Diarization and Transcription

Diarization

For simplicity’s sake, we’ll use the Diarization starter project I created a while back, since setting it up from scratch is a fairly involved process. Here’s an article if you’d like to learn more about it.

  1. Go to https://github.com/carlosmbe/SpeechDiarizationStarter

  2. Clone the project and follow the build instructions, particularly those involving building and adding the Sherpa-Onnx and Onnxruntime frameworks. If you get stuck and need help, feel free to open an issue on my repository or on the official Sherpa-Onnx repository.

  3. Build and run the test app. Assuming the frameworks have been added correctly, you should have Speech Diarization working.
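Once the starter builds, the piece we care about is its SDViewModel, which wraps Sherpa-Onnx and hands back diarization segments. Here’s a hedged sketch of how we’ll call it later from SpeechAnalyzer; the exact signature comes from the starter project, so double-check it against your copy.

import Foundation

// Minimal sketch: run diarization on a converted WAV file and print who spoke when.
// SDViewModel and runDiarization(waveFileName:numSpeakers:fullPath:) come from the
// SpeechDiarizationStarter project; the segment fields (.speaker, .start, .end) match
// how they're used later in SpeechAnalyzer.
func printDiarization(for wavURL: URL) async {
    let viewModel = SDViewModel()
    let segments = await viewModel.runDiarization(
        waveFileName: "",   // unused here because we pass a full path
        numSpeakers: 2,     // hardcoded speaker count, same as the MVP below
        fullPath: wavURL
    )

    for segment in segments {
        print("Speaker \(segment.speaker): \(segment.start)s - \(segment.end)s")
    }
}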

Transcription.Swift

For transcription we’re using Apple’s built-in Speech framework. It’s quite a Swifty API, so it’s not too complicated. Here’s an example class I’ve created, with a few comments to explain what’s happening. This class isn’t needed for our app; I’m including it for educational purposes.

import Speech

//Struct for the results of our transcription
struct RecognizedUtterance {
    let text: String
    let startTime: TimeInterval
    let endTime: TimeInterval
}

class Transcriber: ObservableObject {

    //Create an instance of the Speech Recognizer. If you're using a language other than English, you'd initialize it here.
    //You can also write a clever algorithm for automatic detection or allow users to pick their own language
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))

    // Results for speech recognition data
    var recognizedUtterances: [RecognizedUtterance] = []


    //We call this function to perform the actual transcription
    func performSpeechRecognition(audioURL: URL?) async throws -> [RecognizedUtterance] {
            guard let audioURL = audioURL else {
                throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Audio URL not available"])
            }

            guard let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else {
                throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Speech recognizer is not available"])
            }

            return try await withCheckedThrowingContinuation { continuation in

                let request = SFSpeechURLRecognitionRequest(url: audioURL)
                request.taskHint = .dictation
                request.shouldReportPartialResults = false

                let recognitionTask: SFSpeechRecognitionTask? = speechRecognizer.recognitionTask(with: request) { [self] result, error in
                    if let error = error {
                        continuation.resume(throwing: error)
                        return
                    }

                    guard let result = result else {
                        continuation.resume(returning: [])
                        return
                    }

                    if result.isFinal {
                        let segments = result.bestTranscription.segments
                        for segment in segments {
                            let utterance = RecognizedUtterance(
                                text: segment.substring,
                                startTime: segment.timestamp,
                                endTime: segment.timestamp + segment.duration
                            )
                            //Debugging Print Statement
                            print(utterance)
                            recognizedUtterances.append(utterance)
                        }
                        continuation.resume(returning: recognizedUtterances)

                    }
                }
            }
        }
    }
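One thing the class above glosses over: speech recognition needs the user’s permission before it will run. Here’s a minimal usage sketch, assuming you’ve added the NSSpeechRecognitionUsageDescription key to your Info.plist and that audioFileURL points at an audio file SFSpeechRecognizer can read; the function and parameter names here are placeholders, not part of the project.

import Foundation
import Speech

// Hedged usage sketch: request speech-recognition permission, then transcribe a file.
func transcribe(_ audioFileURL: URL) async {
    let status: SFSpeechRecognizerAuthorizationStatus = await withCheckedContinuation { continuation in
        SFSpeechRecognizer.requestAuthorization { continuation.resume(returning: $0) }
    }
    guard status == .authorized else {
        print("Speech recognition not authorized: \(status.rawValue)")
        return
    }

    do {
        let utterances = try await Transcriber().performSpeechRecognition(audioURL: audioFileURL)
        for utterance in utterances {
            print("\(utterance.text): \(utterance.startTime)s - \(utterance.endTime)s")
        }
    } catch {
        print("Transcription failed: \(error)")
    }
}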

Models.Swift

These are some helper structs and definitions we’ll be using for our project. They’ll make more sense once we cover the analyzers.

import Foundation
import SwiftUI

struct SpeakerProfile: Identifiable {
    let id = UUID()
    var speakerID: Int
    var faceID: UUID?
    var segments: [TimeSegment]
    var embedding: [Float]? 

    struct TimeSegment {
        let start: Float
        let end: Float
    }
}

struct FaceProfile: Identifiable {
    let id = UUID()
    var trackID: UUID
    var timeRanges: [TimeRange]
    var avgPosition: CGPoint
    var mouthOpennessHistory: [Double] = []
    var avgMouthOpenness: Double = 0.0

    struct TimeRange {
        let timestamp: Double
        let boundingBox: CGRect
        let isSpeaking: Bool
        let mouthOpenness: Double
    }
}


class CombinedAnalysisResult: ObservableObject {
    @Published var matchedSpeakers: [MatchedSpeaker] = []
    @Published var preprocessingComplete = false

    struct MatchedSpeaker: Identifiable {
        let id = UUID()
        let speakerID: Int
        let faceID: UUID?
        let position: CGPoint?
        let segments: [SpeakerProfile.TimeSegment]
        var isCurrentlySpeaking: Bool = false
    }
}

SpeechAnalyzer.Swift

Alright, now that we have Speech Diarization and Transcription, let’s combine the two into an analyzer class that handles the speech and audio side of the project. Afterwards, we’ll create a separate class for the Vision analysis, and then connect the two classes and share data between them.


import AVFoundation
import Speech

class SpeechAnalyzer: ObservableObject {
    @Published var isAudioReady = false
    @Published var recognizedUtterances: [RecognizedUtterance] = []

    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var audioURL: URL?
    private let sdViewModel = SDViewModel()

    struct RecognizedUtterance {
        let text: String
        let startTime: TimeInterval
        let endTime: TimeInterval
    }

    func prepareAudio(sourceURL: URL) {
        Task {
            do {
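                // convertMediaToMonoFloat32WAV is a helper that ships with the project files in
                // the repo (not shown here); it extracts the clip's audio track and converts it to
                // the mono Float32 WAV that the diarization model expects.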
                let convertedAudioURL = try await convertMediaToMonoFloat32WAV(inputURL: sourceURL)

                DispatchQueue.main.async {
                    self.audioURL = convertedAudioURL
                    self.isAudioReady = true
                }
            } catch {
                print("Error converting audio: \(error)")
            }
        }
    }

    func performSpeechRecognition() async throws -> [RecognizedUtterance] {
        guard let audioURL = self.audioURL else {
            throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Audio URL not available"])
        }

        guard let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else {
            throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Speech recognizer is not available"])
        }

        return try await withCheckedThrowingContinuation { continuation in
            let request = SFSpeechURLRecognitionRequest(url: audioURL)
            request.taskHint = .dictation
            request.shouldReportPartialResults = false

            var utterances: [RecognizedUtterance] = []

            let recognitionTask = speechRecognizer.recognitionTask(with: request) { result, error in
                if let error = error {
                    continuation.resume(throwing: error)
                    return
                }

                guard let result = result else {
                    continuation.resume(returning: [])
                    return
                }

                if result.isFinal {
                    let segments = result.bestTranscription.segments
                    for segment in segments {
                        let utterance = RecognizedUtterance(
                            text: segment.substring,
                            startTime: segment.timestamp,
                            endTime: segment.timestamp + segment.duration
                        )
                        utterances.append(utterance)
                    }

                    continuation.resume(returning: utterances)
                }
            }
        }
    }

    func performSpeakerDiarization() async -> [SpeakerProfile] {
        guard let audioURL = self.audioURL else { return [] }

        let speakerCount = 2
        let segments = await sdViewModel.runDiarization(
            waveFileName: "",
            numSpeakers: speakerCount,
            fullPath: audioURL
        )

        var speakerMap: [Int: [SpeakerProfile.TimeSegment]] = [:]

        for segment in segments {
            let timeSegment = SpeakerProfile.TimeSegment(start: segment.start, end: segment.end)
            if speakerMap[segment.speaker] == nil {
                speakerMap[segment.speaker] = []
            }
            speakerMap[segment.speaker]?.append(timeSegment)
        }

        var speakerProfiles: [SpeakerProfile] = []
        for (speakerID, segments) in speakerMap {
            let profile = SpeakerProfile(
                speakerID: speakerID,
                faceID: nil,
                segments: segments,
                embedding: nil
            )
            speakerProfiles.append(profile)
        }

        return speakerProfiles
    }

    func matchUtterancesToSpeakers(
        utterances: [RecognizedUtterance],
        speakerProfiles: [SpeakerProfile]
    ) -> [(utterance: RecognizedUtterance, speakerID: Int)] {
        var matchedUtterances: [(utterance: RecognizedUtterance, speakerID: Int)] = []

        for utterance in utterances {
            if utterance.endTime - utterance.startTime < 0.5 {
                continue
            }

            var bestSpeakerID = -1
            var longestOverlap: Float = 0

            for speaker in speakerProfiles {
                var totalOverlap: Float = 0

                for segment in speaker.segments {
                    let overlapStart = max(Float(utterance.startTime), segment.start)
                    let overlapEnd = min(Float(utterance.endTime), segment.end)

                    if overlapEnd > overlapStart {
                        totalOverlap += overlapEnd - overlapStart
                    }
                }

                if totalOverlap > longestOverlap {
                    longestOverlap = totalOverlap
                    bestSpeakerID = speaker.speakerID
                }
            }

            if bestSpeakerID >= 0 && longestOverlap > 0 {
                matchedUtterances.append((utterance, bestSpeakerID))
            }
        }

        return matchedUtterances
    }
}
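To make the overlap matching concrete, here’s a toy example you could run in a playground or a unit test. The timestamps are made up: a one-second utterance overlaps speaker 0’s segment by 0.8 seconds and speaker 1’s by only 0.2 seconds, so it gets assigned to speaker 0.

// Toy example of the overlap-based matching above (made-up timestamps).
func demoUtteranceMatching() {
    let speakers = [
        SpeakerProfile(speakerID: 0, faceID: nil,
                       segments: [.init(start: 0.5, end: 1.8)], embedding: nil),
        SpeakerProfile(speakerID: 1, faceID: nil,
                       segments: [.init(start: 1.8, end: 3.0)], embedding: nil)
    ]
    // A one-second utterance that overlaps speaker 0 by 0.8s and speaker 1 by 0.2s.
    let utterance = SpeechAnalyzer.RecognizedUtterance(text: "hello", startTime: 1.0, endTime: 2.0)

    let matches = SpeechAnalyzer().matchUtterancesToSpeakers(
        utterances: [utterance],
        speakerProfiles: speakers
    )
    print(matches.map { ($0.utterance.text, $0.speakerID) })   // [("hello", 0)]
}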

VisionAnalyzer.Swift

In this class we pull frames from the video, detect the faces in each frame, and track those faces across frames. We also estimate how open each face’s mouth is, which we’ll use later as a speaking signal.

import AVFoundation
import CoreML
import Vision

class VisionAnalyzer: ObservableObject {
    private let videoURL: URL
    private var videoAsset: AVAsset
    private var videoTrack: AVAssetTrack?
     let sequenceRequestHandler = VNSequenceRequestHandler()

    private var faceTrackingHistory: [UUID: [FaceProfile.TimeRange]] = [:]
    @Published var faceProfiles: [FaceProfile] = []

    init(videoURL: URL) {
        self.videoURL = videoURL
        self.videoAsset = AVAsset(url: videoURL)

        Task {
            let tracks = try await videoAsset.loadTracks(withMediaType: .video)
            if let track = tracks.first {
                self.videoTrack = track
            }
        }
    }

    func detectAndTrackFaces() async throws -> [FaceProfile] {
        let generator = AVAssetImageGenerator(asset: videoAsset)
        generator.requestedTimeToleranceBefore = .zero
        generator.requestedTimeToleranceAfter = .zero
        generator.appliesPreferredTrackTransform = true

        let duration = try await videoAsset.load(.duration)
        let nominalFrameRate = try await videoTrack?.load(.nominalFrameRate) ?? 30.0
        let frameCount = Int(duration.seconds * Double(nominalFrameRate))
        let samplingInterval = max(frameCount / 300, 1)

        var activeFaceTracks: [UUID: (lastBox: CGRect, lastTime: Double, avgPosition: CGPoint)] = [:]
        var faceTemporalData: [UUID: (history: [Double], sum: Double)] = [:]
        var faceTrackingHistory: [UUID: [FaceProfile.TimeRange]] = [:]

        for frameIdx in stride(from: 0, to: frameCount, by: samplingInterval) {
            let time = CMTime(seconds: Double(frameIdx) / Double(nominalFrameRate), preferredTimescale: 600)

            guard let cgImage = try await getVideoFrame(at: time) else {
                continue
            }

            let faceRequest = VNDetectFaceLandmarksRequest()
            faceRequest.revision = VNDetectFaceLandmarksRequestRevision3

            //The simulator cannot use the Neural Engine, so adding this line allows me to debug without a real device
            #if targetEnvironment(simulator)
            if #available(iOS 17.0, *) {
                if let cpuDevice = MLComputeDevice.allComputeDevices.first(where: { $0.description.contains("MLCPUComputeDevice") }) {
                    faceRequest.setComputeDevice(.some(cpuDevice), for: .main)
                }
            } else {
                faceRequest.usesCPUOnly = true
            }
            #endif

            try sequenceRequestHandler.perform([faceRequest], on: cgImage)

            guard let observations = faceRequest.results as? [VNFaceObservation] else {
                continue
            }

            for observation in observations {
                let boundingBox = observation.boundingBox
                let timestamp = time.seconds

                var mouthOpenness: Double = 0.0
                if let innerLips = observation.landmarks?.innerLips,
                   let outerLips = observation.landmarks?.outerLips {

                    let innerPoints = innerLips.normalizedPoints
                    let outerPoints = outerLips.normalizedPoints

                    let innerVertical = (innerPoints.max(by: { $0.y < $1.y })?.y ?? 0) -
                                       (innerPoints.min(by: { $0.y < $1.y })?.y ?? 0)
                    let outerVertical = (outerPoints.max(by: { $0.y < $1.y })?.y ?? 0) -
                                       (outerPoints.min(by: { $0.y < $1.y })?.y ?? 0)

                    mouthOpenness = Double(max(innerVertical, outerVertical) * boundingBox.height)
                }

                let faceCenter = CGPoint(
                    x: boundingBox.midX,
                    y: boundingBox.midY
                )

                var matchedTrackID: UUID?
                for (trackID, trackInfo) in activeFaceTracks {
                    let distance = sqrt(pow(faceCenter.x - trackInfo.avgPosition.x, 2) +
                                   pow(faceCenter.y - trackInfo.avgPosition.y, 2))

                    if (distance < 0.15) && ((timestamp - trackInfo.lastTime) < 1.0) {
                        matchedTrackID = trackID

                        let newAvgPos = CGPoint(
                            x: (trackInfo.avgPosition.x * 0.7 + faceCenter.x * 0.3),
                            y: (trackInfo.avgPosition.y * 0.7 + faceCenter.y * 0.3)
                        )

                        activeFaceTracks[trackID] = (boundingBox, timestamp, newAvgPos)
                        break
                    }
                }

                let trackID = matchedTrackID ?? UUID()
                if matchedTrackID == nil {
                    activeFaceTracks[trackID] = (boundingBox, timestamp, faceCenter)
                }

                var trackData = faceTemporalData[trackID] ?? (history: [], sum: 0.0)
                trackData.history.append(mouthOpenness)
                trackData.sum += mouthOpenness
                faceTemporalData[trackID] = trackData

                let avgOpenness = trackData.history.isEmpty ? 0.0 :
                                trackData.sum / Double(trackData.history.count)
                let isSpeaking = mouthOpenness > max(0.05, avgOpenness * 1.5)

                let timeRange = FaceProfile.TimeRange(
                    timestamp: timestamp,
                    boundingBox: boundingBox,
                    isSpeaking: isSpeaking,
                    mouthOpenness: mouthOpenness
                )

                if faceTrackingHistory[trackID] == nil {
                    faceTrackingHistory[trackID] = []
                }
                faceTrackingHistory[trackID]?.append(timeRange)
            }

            let currentTime = time.seconds
            activeFaceTracks = activeFaceTracks.filter {
                currentTime - $0.value.lastTime <= 1.0
            }
        }

        var faceProfiles: [FaceProfile] = []
        for (trackID, timeRanges) in faceTrackingHistory {
            guard timeRanges.count >= 10 else {
                continue
            }

            let avgX = timeRanges.map { $0.boundingBox.midX }.reduce(0, +) / Double(timeRanges.count)
            let avgY = timeRanges.map { $0.boundingBox.midY }.reduce(0, +) / Double(timeRanges.count)

            let totalMouthOpenness = timeRanges.reduce(0.0) { $0 + $1.mouthOpenness }
            let avgMouthOpenness = totalMouthOpenness / Double(timeRanges.count)
            let speakingCount = timeRanges.filter { $0.isSpeaking }.count

            let profile = FaceProfile(
                trackID: trackID,
                timeRanges: timeRanges,
                avgPosition: CGPoint(x: avgX, y: avgY),
                mouthOpennessHistory: timeRanges.map { $0.mouthOpenness },
                avgMouthOpenness: avgMouthOpenness
            )

            faceProfiles.append(profile)
        }

        return faceProfiles.sorted { $0.avgPosition.x < $1.avgPosition.x }
    }

     func getVideoFrame(at time: CMTime) async throws -> CGImage? {
        let generator = AVAssetImageGenerator(asset: videoAsset)
        generator.requestedTimeToleranceBefore = .zero
        generator.requestedTimeToleranceAfter = .zero

        return try await withCheckedThrowingContinuation { continuation in
            generator.generateCGImagesAsynchronously(forTimes: [NSValue(time: time)]) {
                requestedTime, image, actualTime, result, error in

                switch result {
                case .succeeded where image != nil:
                    continuation.resume(returning: image)
                case .failed where error != nil:
                    continuation.resume(throwing: error!)
                default:
                    continuation.resume(returning: nil)
                }
            }
        }
    }
}
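Before wiring the vision side into the coordinator, it can be handy to sanity-check it on its own. A minimal sketch, assuming videoURL points at whatever test clip you’re using:

import Foundation

// Hedged sketch: run face detection/tracking in isolation and print a quick summary.
func summarizeFaces(in videoURL: URL) async {
    let analyzer = VisionAnalyzer(videoURL: videoURL)
    do {
        let faces = try await analyzer.detectAndTrackFaces()
        for (index, face) in faces.enumerated() {
            let speakingFrames = face.timeRanges.filter { $0.isSpeaking }.count
            print("Face \(index): avg position \(face.avgPosition), " +
                  "speaking in \(speakingFrames)/\(face.timeRanges.count) sampled frames")
        }
    } catch {
        print("Face tracking failed: \(error)")
    }
}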

AnalyzerCoOrdinator.Swift

Alright, now we’re almost done with the back end components, all that’s left is to combine this data and build our pipeline.

This part of the project is more of an MVP: the implementation here is relatively basic, and I would greatly appreciate suggestions on how to improve the algorithm. Essentially, what we’re doing is taking the timestamps we got from the Speech Analyzer and pairing that information with a face. Speech Diarization gives us time ranges for each speaker, while the transcription model gives us timestamps for when individual words are spoken.

Using the Vision analyzer, we grab frames at the timestamps that feature speech and score each visible face against the diarized speaker IDs; the score below weighs bounding-box overlap, temporal proximity, mouth openness, segment affinity, and spatial consistency. Once a speaker is matched to a face, we log that face’s position and use it for our cool immersive audio shenanigans.

import AVFoundation
import Combine
import Vision
import Speech

class CombinedAnalysisCoordinator: ObservableObject {
    @Published var analysisResult = CombinedAnalysisResult()
    @Published var processingProgress: Double = 0.0

    private let speechAnalyzer: SpeechAnalyzer
    private let visionAnalyzer: VisionAnalyzer
    private var cancellables = Set<AnyCancellable>()

    init(videoURL: URL, audioURL: URL? = nil) {
        self.speechAnalyzer = SpeechAnalyzer()
        self.visionAnalyzer = VisionAnalyzer(videoURL: videoURL)

        // Start the audio conversion process
        speechAnalyzer.prepareAudio(sourceURL: audioURL ?? videoURL)
    }

    func preprocessVideoAndAudio() async {
        // Wait for audio conversion to complete
        while !speechAnalyzer.isAudioReady {
            try? await Task.sleep(nanoseconds: 100_000_000)
        }

        do {
           print("Starting speech recognition...")

            // Perform speech recognition
            let utterances = try await speechAnalyzer.performSpeechRecognition()
            speechAnalyzer.recognizedUtterances = utterances

            print("Performing speaker diarization...")

            // Perform speaker diarization
            let speakerProfiles = await speechAnalyzer.performSpeakerDiarization()

            print("Detecting and tracking faces...")

            // Process video frames to detect and track faces
            let faceProfiles = try await visionAnalyzer.detectAndTrackFaces()
            visionAnalyzer.faceProfiles = faceProfiles

            print("Matching faces to speakers...")

            // Match speech utterances to speaker segments
            let matchedUtterances = speechAnalyzer.matchUtterancesToSpeakers(
                utterances: speechAnalyzer.recognizedUtterances,
                speakerProfiles: speakerProfiles
            )

            // Match faces to speakers
            try await matchFacesToSpeakersUsingUtterances(
                matchedUtterances: matchedUtterances,
                speakerProfiles: speakerProfiles,
                faceProfiles: faceProfiles
            )

            print( "Processing complete!")

            DispatchQueue.main.async {
                self.analysisResult.preprocessingComplete = true
            }
        } catch {
            print("Error preprocessing video and audio: \(error)")
        }
    }


    private func matchFacesToSpeakersUsingUtterances(
        matchedUtterances: [(utterance: SpeechAnalyzer.RecognizedUtterance, speakerID: Int)],
        speakerProfiles: [SpeakerProfile],
        faceProfiles: [FaceProfile]
    ) async throws {
        var matchedSpeakers: [CombinedAnalysisResult.MatchedSpeaker] = []
        var speakerToFaceMatches: [Int: [(faceID: UUID, score: Double)]] = [:]
        var faceSpeakerScores: [UUID: [Int: Double]] = [:]

        // 1. Pre-calculate face-speaker segment affinity scores
        for speaker in speakerProfiles {
            for segment in speaker.segments {
                let start = Double(segment.start)
                let end = Double(segment.end)

                for face in faceProfiles {
                    let speakingMoments = face.timeRanges.filter {
                        $0.timestamp >= start &&
                        $0.timestamp <= end &&
                        $0.isSpeaking
                    }

                    let segmentScore = speakingMoments.reduce(0.0) { total, moment in
                        let timeWeight = 1 - min(1, abs(moment.timestamp - (start + end)/2) / (end - start))
                        return total + moment.mouthOpenness * timeWeight
                    }

                    faceSpeakerScores[face.trackID, default: [:]][speaker.speakerID, default: 0] += segmentScore
                }
            }
        }

        // 2. Process each utterance with temporal sampling
        for (utterance, speakerID) in matchedUtterances {
            guard utterance.endTime - utterance.startTime >= 0.3 else { continue }

            let sampleCount = Int((utterance.endTime - utterance.startTime) / 0.25)
            let sampleStep = (utterance.endTime - utterance.startTime) / Double(max(1, sampleCount))

            for sampleIndex in 0..<max(1, sampleCount) {
                let analysisTime = utterance.startTime + Double(sampleIndex) * sampleStep

                guard let cgImage = try await visionAnalyzer.getVideoFrame(at: CMTime(seconds: analysisTime, preferredTimescale: 600)) else {
                    continue
                }

                let faceRequest = VNDetectFaceLandmarksRequest()
                try visionAnalyzer.sequenceRequestHandler.perform([faceRequest], on: cgImage)
                guard let observations = faceRequest.results as? [VNFaceObservation] else {
                    continue
                }

                for observation in observations {
                    let boundingBox = observation.boundingBox
                    let faceCenter = CGPoint(
                        x: boundingBox.midX,
                        y: boundingBox.midY
                    )

                    var mouthOpenness: Double = 0
                    if let innerLips = observation.landmarks?.innerLips,
                       let outerLips = observation.landmarks?.outerLips {
                        let innerPoints = innerLips.normalizedPoints
                        let outerPoints = outerLips.normalizedPoints

                        let innerVertical = (innerPoints.max(by: { $0.y < $1.y })?.y ?? 0 ) -
                                         (innerPoints.min(by: { $0.y < $1.y })?.y ?? 0)

                        let outerVertical = (outerPoints.max(by: { $0.y < $1.y })?.y ?? 0 ) - (outerPoints.min(by: { $0.y < $1.y })?.y ?? 0)

                        mouthOpenness = Double(max(innerVertical, outerVertical) * boundingBox.height)
                    }

                    guard mouthOpenness > 0.03 else { continue }

                    var bestMatch: (faceID: UUID, score: Double)? = nil

                    for faceProfile in faceProfiles {
                        guard let closestTimeRange = faceProfile.timeRanges.min(by: {
                            abs($0.timestamp - analysisTime) < abs($1.timestamp - analysisTime)
                        }) else { continue }

                        let timeDelta = abs(closestTimeRange.timestamp - analysisTime)
                        guard timeDelta < 0.5 else { continue }

                        let overlapRect = boundingBox.intersection(closestTimeRange.boundingBox)
                        let iouScore = overlapRect.width * overlapRect.height /
                                      (boundingBox.width * boundingBox.height +
                                       closestTimeRange.boundingBox.width * closestTimeRange.boundingBox.height -
                                       overlapRect.width * overlapRect.height)

                        let temporalScore = 1 - min(1, timeDelta / 0.5)
                        let segmentAffinity = faceSpeakerScores[faceProfile.trackID]?[speakerID] ?? 0
                        let normalizedSegmentScore = min(segmentAffinity / 100, 1.0)

                        let spatialConsistency = calculateSpatialConsistency(
                            faceID: faceProfile.trackID,
                            speakerID: speakerID,
                            faceProfiles: faceProfiles,
                            speakerProfiles: speakerProfiles
                        )

                        let score = (iouScore * 0.3) +
                                   (temporalScore * 0.2) +
                                   (mouthOpenness * 0.2) +
                                   (normalizedSegmentScore * 0.2) +
                                   (spatialConsistency * 0.1)

                        if score > (bestMatch?.score ?? 0.5) {
                            bestMatch = (faceProfile.trackID, score)
                        }
                    }

                    if let match = bestMatch, match.score > 0.5 {
                        speakerToFaceMatches[speakerID, default: []].append((match.faceID, match.score))
                    }
                }
            }
        }

        // 3. Determine best face match for each speaker
        for speakerID in speakerToFaceMatches.keys {
            guard let matches = speakerToFaceMatches[speakerID], !matches.isEmpty else { continue }

            var faceScores: [UUID: Double] = [:]
            for (faceID, score) in matches {
                faceScores[faceID, default: 0] += score
            }

            if let bestMatch = faceScores.max(by: { $0.value < $1.value }),
               let speaker = speakerProfiles.first(where: { $0.speakerID == speakerID }),
               let face = faceProfiles.first(where: { $0.trackID == bestMatch.key }) {

                let matchedSpeaker = CombinedAnalysisResult.MatchedSpeaker(
                    speakerID: speakerID,
                    faceID: bestMatch.key,
                    position: face.avgPosition,
                    segments: speaker.segments
                )

                matchedSpeakers.append(matchedSpeaker)
            }
        }

        // 4. Handle unmatched speakers with fallback strategy
        let matchedSpeakerIDs = Set(matchedSpeakers.map { $0.speakerID })
        let unmatchedSpeakers = speakerProfiles.filter { !matchedSpeakerIDs.contains($0.speakerID) }

        for speaker in unmatchedSpeakers {
            var bestFace: (id: UUID, score: Double)? = nil
            for face in faceProfiles {
                let score = faceSpeakerScores[face.trackID]?[speaker.speakerID] ?? 0
                if score > (bestFace?.score ?? 0) {
                    bestFace = (face.trackID, score)
                }
            }

            if let bestFace = bestFace, bestFace.score > 0 {
                let matchedSpeaker = CombinedAnalysisResult.MatchedSpeaker(
                    speakerID: speaker.speakerID,
                    faceID: bestFace.id,
                    position: faceProfiles.first { $0.trackID == bestFace.id }?.avgPosition,
                    segments: speaker.segments
                )
                matchedSpeakers.append(matchedSpeaker)
            }
        }

        DispatchQueue.main.async {
            self.analysisResult.matchedSpeakers = matchedSpeakers.sorted { $0.speakerID < $1.speakerID }
        }
    }

    private func calculateSpatialConsistency(
        faceID: UUID,
        speakerID: Int,
        faceProfiles: [FaceProfile],
        speakerProfiles: [SpeakerProfile]
    ) -> Double {
        guard let speaker = speakerProfiles.first(where: { $0.speakerID == speakerID }),
              let face = faceProfiles.first(where: { $0.trackID == faceID })
        else { return 0.0 }

        var positions: [CGPoint] = []
        var timestamps: [Double] = []

        for segment in speaker.segments {
            let start = Double(segment.start)
            let end = Double(segment.end)
            let step = (end - start) / 4

            for sampleTime in stride(from: start, through: end, by: step) {
                if let closest = face.timeRanges.min(by: {
                    abs($0.timestamp - sampleTime) < abs($1.timestamp - sampleTime)
                }) {
                    positions.append(CGPoint(
                        x: closest.boundingBox.midX,
                        y: closest.boundingBox.midY
                    ))
                    timestamps.append(closest.timestamp)
                }
            }
        }

        guard positions.count > 1 else { return 1.0 }

        let avgX = positions.map { $0.x }.reduce(0, +) / Double(positions.count)
        let avgY = positions.map { $0.y }.reduce(0, +) / Double(positions.count)
        let positionVariance = positions.reduce(0) {
            $0 + pow($1.x - avgX, 2) + pow($1.y - avgY, 2)
        } / Double(positions.count)

        let timeVariance = timestamps.reduce(0) {
            let mean = timestamps.reduce(0, +) / Double(timestamps.count)
            return $0 + pow($1 - mean, 2)
        } / Double(timestamps.count)

        let combinedScore = 1 / (1 + (positionVariance * 0.7 + timeVariance * 0.3))
        return min(max(combinedScore, 0), 1)
    }

    func getCurrentSpeaker(at time: Double) -> CombinedAnalysisResult.MatchedSpeaker? {
        for speaker in analysisResult.matchedSpeakers {
            for segment in speaker.segments {
                if Double(segment.start) <= time && time <= Double(segment.end) {
                    return speaker
                }
            }
        }
        return nil
    }

    func updateCurrentSpeakers(at time: Double) {
        DispatchQueue.main.async {
            for i in 0..<self.analysisResult.matchedSpeakers.count {
                self.analysisResult.matchedSpeakers[i].isCurrentlySpeaking = false
            }

            for i in 0..<self.analysisResult.matchedSpeakers.count {
                let speaker = self.analysisResult.matchedSpeakers[i]
                for segment in speaker.segments {
                    if Double(segment.start) <= time && time <= Double(segment.end) {
                        self.analysisResult.matchedSpeakers[i].isCurrentlySpeaking = true
                        break
                    }
                }
            }
            // This ensures that the View refreshes whenever we call this function
            self.objectWillChange.send()
        }
    }
}

AnalyzerView.Swift

We’re done with the back end! Yay! That’s what a few months of part time research look like. Now for the fun part. Putting it all together in a view and making it look nice. Now, I’d like to reiterate, the code above, especially the face matching logic is an MVP, and is more of a proof of concept than a final product. Please please improve it, debug it or expand it for additional cool features.

Moreover, this view has a hardcoded reference to Clip.mp4 in the app bundle, but you can add a file picker or something similar.

At the moment, all we’re doing with the position data is creating Text views of the relative positions; see the sketch after the view code for one way to push the same data into spatial audio.

import SwiftUI
import AVKit

struct CombinedAnalysisView: View {
    @StateObject private var coordinator: CombinedAnalysisCoordinator

    private let player: AVPlayer

    @State private var currentTime: Double = 0
    @State private var isPlaying: Bool = false
    @State private var isProcessing: Bool = true

    init() {
        guard let url = Bundle.main.url(forResource: "Clip", withExtension: "mp4") else {
            fatalError("Video not found in bundle.")
        }

        let playerItem = AVPlayerItem(url: url)
        let player = AVPlayer(playerItem: playerItem)

        player.pause()

        let coordinator = CombinedAnalysisCoordinator(videoURL: url)

        self.player = player
        self._coordinator = StateObject(wrappedValue: coordinator)
    }

    var body: some View {
        VStack {
            if !isProcessing {
                VideoPlayer(player: player)
                    .clipped()
                    .zIndex(0)
            } else {
                ProgressView()
            }

            // Speaker information
            if !coordinator.analysisResult.preprocessingComplete {
                Text(isProcessing ? "Processing Video / Audio..." : "Ready For Playback")
                    .padding()
            } else {
                if let currentSpeaker = coordinator.analysisResult
                    .matchedSpeakers
                    .first(where: { $0.isCurrentlySpeaking }) {
                    VStack(alignment: .leading) {

                        Text("Current Speaker: \(currentSpeaker.speakerID)")
                            .font(.headline)
                        if let position = currentSpeaker.position {
                            Text("Position: (\(String(format: "%.2f", position.x)), \(String(format: "%.2f", position.y)))")
                            Text("Audio Position: \(String(format: "%.2f", (position.x - 0.5) * 10)), \(String(format: "%.2f", (0.5 - position.y) * 10)), 0")
                                .font(.caption)
                        } else {
                            Text("Position: Unknown")
                        }
                    }
                    .padding()
                    .background(Color.gray.opacity(0.2))
                    .cornerRadius(8)
                }
            }


        }
        .onAppear {
            // Start preprocessing
            Task {
                await coordinator.preprocessVideoAndAudio()

                // Print for debugging
                isProcessing = false
                print("Preprocessing complete. Matched speakers: \(coordinator.analysisResult.matchedSpeakers.count)")
            }

            // Time observer to update current speaker
            let interval = CMTime(seconds: 0.1, preferredTimescale: 600)
            player.addPeriodicTimeObserver(forInterval: interval, queue: .main) { time in
                let currentTime = time.seconds
                self.currentTime = currentTime

                // Update current speakers and their positions
                coordinator.updateCurrentSpeakers(at: currentTime)
            }
        }
        .navigationTitle("Carlos' Active Speaker Detection Article ")
    }

}
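As a parting idea for step 5, here’s a hedged sketch of one way to push the matched position into spatial audio with AVAudioEngine. It reuses the same mapping the view prints as “Audio Position”; the class, the engine wiring, and the playback path are illustrative placeholders rather than part of the project above.

import AVFoundation
import CoreGraphics

// Hedged sketch: position a mono audio source in 3D space based on the active
// speaker's on-screen position. Playback itself (scheduling a file or buffer on
// the player node and starting the engine) is left out.
final class SpatialSpeakerAudio {
    private let engine = AVAudioEngine()
    private let environment = AVAudioEnvironmentNode()
    private let player = AVAudioPlayerNode()

    init() {
        engine.attach(environment)
        engine.attach(player)
        // Mono input into the environment node so it can be spatialized.
        let monoFormat = AVAudioFormat(standardFormatWithSampleRate: 44_100, channels: 1)
        engine.connect(player, to: environment, format: monoFormat)
        engine.connect(environment, to: engine.mainMixerNode, format: nil)
        environment.listenerPosition = AVAudio3DPoint(x: 0, y: 0, z: 0)
    }

    // Same mapping the view prints as "Audio Position": x centred on the screen's
    // midpoint, y flipped because Vision's normalized coordinates start at the bottom-left.
    func updatePosition(from normalizedFacePosition: CGPoint) {
        player.position = AVAudio3DPoint(
            x: Float(normalizedFacePosition.x - 0.5) * 10,
            y: Float(0.5 - normalizedFacePosition.y) * 10,
            z: 0
        )
    }
}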

Alright Folks! That’s it. Here’s an example execution using a clip from the Trash Taste Podcast.

The video’s copyright is owned by Trash Taste; the clip is only being used in this demo under fair use, for educational purposes.

Link to the final project: https://github.com/carlosmbe/ActiveSpeakerDetectionStarter. The relevant files are in the ASDFiles folder.

Pretty Cool Links:

  • Check out my GitHub for some other cool projects I’ve built.

  • Connect with me on LinkedIn.
