Active Speaker Detection using Swift for iOS and other Apple Platforms


Hello There! In this article I’ll walk you through an implementation of Active Speaker Detection I created this past semester. One important heads up: I was primarily interested in the audio aspect of this project, so the Computer Vision and Face Tracking components don’t receive nearly as much attention as the other parts. I’d greatly appreciate any concerns, suggestions, or contributions in the comments.
Anyways, let’s start coding.
Link to the final project: https://github.com/carlosmbe/ActiveSpeakerDetectionStarter. The relevant files are in the ASDFiles folder.
An Overview - How Does It Work?
My current implementation uses a combination of Speech Diarization and Transcription Models paired with Apple’s Vision framework. The algorithm, in broad strokes, does the following:
1. Run Speech Diarization on the video clip to generate time ranges for when each speaker is talking.
2. Use Speech Transcription to identify timestamps for when words are being spoken.
3. At those transcription timestamps, use Vision to identify which faces are talking and log their respective positions.
4. Repeat Step 3 until we’re confident that we’ve identified the speaker, then match that ID with the time ranges from Step 1’s Diarization model.
5. Use the positions obtained from the Vision framework to do some really cool Spatial Development stuff.
In this article, I will focus on steps 1 through 4, since once we know who is talking and where they are, how you use that information is a pretty subjective choice.
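Before we dive in, here’s a minimal sketch of how the pieces we’re about to build fit together. The types referenced here (CombinedAnalysisCoordinator and its analysisResult) are all implemented later in this article, so treat this as a roadmap rather than code you can paste in right now.
import Foundation

//videoURL is assumed to point at a local clip, like the bundled Clip.mp4 used later on
func runActiveSpeakerDetection(videoURL: URL) async {
    //Steps 1 to 4: diarization, transcription, face tracking and matching,
    //all wired together by the coordinator we build towards the end of the article
    let coordinator = CombinedAnalysisCoordinator(videoURL: videoURL)
    await coordinator.preprocessVideoAndAudio()

    //Step 5: use the matched IDs and on screen positions however you like,
    //for example to drive spatial audio or highlight the active speaker
    for speaker in coordinator.analysisResult.matchedSpeakers {
        print("Speaker \(speaker.speakerID) is at \(String(describing: speaker.position))")
    }
}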
Speech Models - Diarization and Transcription
Diarization
For simplicity’s sake, we’ll use the Diarization starter project I created a while back, as setting it up from scratch is a fairly involved process. Here’s an article if you’d like to learn more about it.
Clone the project and follow the build instructions, particularly those for building and adding the Sherpa-Onnx and Onnxruntime frameworks. If you get stuck and need help, feel free to open an issue on my repository or the official Sherpa-Onnx repository.
Build and run the test app. Assuming that the frameworks have been added correctly, you should have Speech Diarization working like this:
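With that in place, here’s roughly what the diarization call we’ll lean on later looks like. I’m taking the names straight from how SpeechAnalyzer.Swift below uses the starter project’s SDViewModel, so if your copy of the starter differs, adjust accordingly.
import Foundation

//Rough shape of how the starter project's view model gets used later in this article
func printDiarization(for wavURL: URL) async {
    let viewModel = SDViewModel()
    let segments = await viewModel.runDiarization(
        waveFileName: "",   //Unused here since we pass a full path below
        numSpeakers: 2,     //Hard coded speaker count for now
        fullPath: wavURL    //Mono Float32 WAV, see the converter sketch later on
    )
    for segment in segments {
        print("Speaker \(segment.speaker): \(segment.start)s to \(segment.end)s")
    }
}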
Transcription.Swift
For transcription we’re using Apple’s built-in Speech framework. It’s quite a Swifty API, so it’s not too complicated. Here’s an example class I’ve created with a few comments to explain what’s happening. This class isn’t needed for our app; I’m including it for educational purposes.
import Speech
//Struct for the results of our transcription
struct RecognizedUtterance {
let text: String
let startTime: TimeInterval
let endTime: TimeInterval
}
class Transcriber: ObservableObject {
//Create an instance of the Speech Recognizer. If you're using a language other than English, you'd initialize it here.
//You can also write a clever algorithm for automatic detection or allow users to pick their own language
private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
// Results for speech recognition data
var recognizedUtterances: [RecognizedUtterance] = []
//We call this function to perform the actual transcription
func performSpeechRecognition(audioURL: URL?) async throws -> [RecognizedUtterance] {
guard let audioURL = audioURL else {
throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Audio URL not available"])
}
guard let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else {
throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Speech recognizer is not available"])
}
return try await withCheckedThrowingContinuation { continuation in
let request = SFSpeechURLRecognitionRequest(url: audioURL)
request.taskHint = .dictation
request.shouldReportPartialResults = false
_ = speechRecognizer.recognitionTask(with: request) { [self] result, error in
if let error = error {
continuation.resume(throwing: error)
return
}
guard let result = result else {
continuation.resume(returning: [])
return
}
if result.isFinal {
let segments = result.bestTranscription.segments
for segment in segments {
let utterance = RecognizedUtterance(
text: segment.substring,
startTime: segment.timestamp,
endTime: segment.timestamp + segment.duration
)
//Debugging Print Statement
print(utterance)
recognizedUtterances.append(utterance)
}
continuation.resume(returning: recognizedUtterances)
}
}
}
}
}
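One thing the class above glosses over is permissions. Speech recognition requires user authorization and an NSSpeechRecognitionUsageDescription entry in Info.plist, so somewhere before calling performSpeechRecognition you’ll want something along these lines:
import Speech

//Wraps the callback based authorization API in async/await
func requestSpeechAuthorization() async -> Bool {
    await withCheckedContinuation { continuation in
        SFSpeechRecognizer.requestAuthorization { status in
            continuation.resume(returning: status == .authorized)
        }
    }
}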
Models.Swift
These are some helper structs and definitions we’ll be using for our project. They’ll make more sense once we cover the analyzers.
import Foundation
import SwiftUI
struct SpeakerProfile: Identifiable {
let id = UUID()
var speakerID: Int
var faceID: UUID?
var segments: [TimeSegment]
var embedding: [Float]?
struct TimeSegment {
let start: Float
let end: Float
}
}
struct FaceProfile: Identifiable {
let id = UUID()
var trackID: UUID
var timeRanges: [TimeRange]
var avgPosition: CGPoint
var mouthOpennessHistory: [Double] = []
var avgMouthOpenness: Double = 0.0
struct TimeRange {
let timestamp: Double
let boundingBox: CGRect
let isSpeaking: Bool
let mouthOpenness: Double
}
}
class CombinedAnalysisResult: ObservableObject {
@Published var matchedSpeakers: [MatchedSpeaker] = []
@Published var preprocessingComplete = false
struct MatchedSpeaker: Identifiable {
let id = UUID()
let speakerID: Int
let faceID: UUID?
let position: CGPoint?
let segments: [SpeakerProfile.TimeSegment]
var isCurrentlySpeaking: Bool = false
}
}
SpeechAnalyzer.Swift
Alright, now that we have Speech Diarization and Transcription, let’s combine the two into an analyzer class that handles the speech and audio side of the project. Afterwards, we’ll create a separate class for the Vision analysis, and then connect the two classes so they can share data.
import AVFoundation
import Speech
class SpeechAnalyzer: ObservableObject {
@Published var isAudioReady = false
@Published var recognizedUtterances: [RecognizedUtterance] = []
private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
private var audioURL: URL?
private let sdViewModel = SDViewModel()
struct RecognizedUtterance {
let text: String
let startTime: TimeInterval
let endTime: TimeInterval
}
func prepareAudio(sourceURL: URL) {
Task {
do {
let convertedAudioURL = try await convertMediaToMonoFloat32WAV(inputURL: sourceURL)
DispatchQueue.main.async {
self.audioURL = convertedAudioURL
self.isAudioReady = true
}
} catch {
print("Error converting audio: \(error)")
}
}
}
func performSpeechRecognition() async throws -> [RecognizedUtterance] {
guard let audioURL = self.audioURL else {
throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Audio URL not available"])
}
guard let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else {
throw NSError(domain: "SpeechRecognition", code: 1, userInfo: [NSLocalizedDescriptionKey: "Speech recognizer is not available"])
}
return try await withCheckedThrowingContinuation { continuation in
let request = SFSpeechURLRecognitionRequest(url: audioURL)
request.taskHint = .dictation
request.shouldReportPartialResults = false
var utterances: [RecognizedUtterance] = []
_ = speechRecognizer.recognitionTask(with: request) { result, error in
if let error = error {
continuation.resume(throwing: error)
return
}
guard let result = result else {
continuation.resume(returning: [])
return
}
if result.isFinal {
let segments = result.bestTranscription.segments
for segment in segments {
let utterance = RecognizedUtterance(
text: segment.substring,
startTime: segment.timestamp,
endTime: segment.timestamp + segment.duration
)
utterances.append(utterance)
}
continuation.resume(returning: utterances)
}
}
}
}
func performSpeakerDiarization() async -> [SpeakerProfile] {
guard let audioURL = self.audioURL else { return [] }
let speakerCount = 2 //Hard coded for the demo clip. Make this configurable or estimate it for other videos.
let segments = await sdViewModel.runDiarization(
waveFileName: "",
numSpeakers: speakerCount,
fullPath: audioURL
)
var speakerMap: [Int: [SpeakerProfile.TimeSegment]] = [:]
for segment in segments {
let timeSegment = SpeakerProfile.TimeSegment(start: segment.start, end: segment.end)
if speakerMap[segment.speaker] == nil {
speakerMap[segment.speaker] = []
}
speakerMap[segment.speaker]?.append(timeSegment)
}
var speakerProfiles: [SpeakerProfile] = []
for (speakerID, segments) in speakerMap {
let profile = SpeakerProfile(
speakerID: speakerID,
faceID: nil,
segments: segments,
embedding: nil
)
speakerProfiles.append(profile)
}
return speakerProfiles
}
func matchUtterancesToSpeakers(
utterances: [RecognizedUtterance],
speakerProfiles: [SpeakerProfile]
) -> [(utterance: RecognizedUtterance, speakerID: Int)] {
var matchedUtterances: [(utterance: RecognizedUtterance, speakerID: Int)] = []
for utterance in utterances {
if utterance.endTime - utterance.startTime < 0.5 {
continue
}
var bestSpeakerID = -1
var longestOverlap: Float = 0
for speaker in speakerProfiles {
var totalOverlap: Float = 0
for segment in speaker.segments {
let overlapStart = max(Float(utterance.startTime), segment.start)
let overlapEnd = min(Float(utterance.endTime), segment.end)
if overlapEnd > overlapStart {
totalOverlap += overlapEnd - overlapStart
}
}
if totalOverlap > longestOverlap {
longestOverlap = totalOverlap
bestSpeakerID = speaker.speakerID
}
}
if bestSpeakerID >= 0 && longestOverlap > 0 {
matchedUtterances.append((utterance, bestSpeakerID))
}
}
return matchedUtterances
}
}
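You may have noticed prepareAudio calls convertMediaToMonoFloat32WAV(inputURL:), which isn’t shown in this article (it lives in the project files). If you want to write your own, a sketch along these lines using AVAssetReader and AVAssetWriter should get you close; the 16 kHz sample rate here is my assumption based on what diarization models typically expect, so match it to whatever your model wants.
import AVFoundation

//A rough sketch of an audio converter: decodes whatever audio the clip has
//and writes it back out as mono Float32 linear PCM in a WAV container
func convertMediaToMonoFloat32WAV(inputURL: URL) async throws -> URL {
    let asset = AVAsset(url: inputURL)
    guard let audioTrack = try await asset.loadTracks(withMediaType: .audio).first else {
        throw NSError(domain: "AudioConversion", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "No audio track found"])
    }

    let outputURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("converted.wav")
    try? FileManager.default.removeItem(at: outputURL)

    //Mono, 32 bit float, 16 kHz - adjust the sample rate to whatever your model expects
    let pcmSettings: [String: Any] = [
        AVFormatIDKey: kAudioFormatLinearPCM,
        AVSampleRateKey: 16_000,
        AVNumberOfChannelsKey: 1,
        AVLinearPCMBitDepthKey: 32,
        AVLinearPCMIsFloatKey: true,
        AVLinearPCMIsBigEndianKey: false,
        AVLinearPCMIsNonInterleaved: false
    ]

    let reader = try AVAssetReader(asset: asset)
    let readerOutput = AVAssetReaderTrackOutput(track: audioTrack, outputSettings: pcmSettings)
    reader.add(readerOutput)

    let writer = try AVAssetWriter(outputURL: outputURL, fileType: .wav)
    let writerInput = AVAssetWriterInput(mediaType: .audio, outputSettings: pcmSettings)
    writer.add(writerInput)

    guard reader.startReading(), writer.startWriting() else {
        throw reader.error ?? writer.error ?? NSError(domain: "AudioConversion", code: 2,
            userInfo: [NSLocalizedDescriptionKey: "Could not start conversion"])
    }
    writer.startSession(atSourceTime: .zero)

    //Pump decoded buffers from the reader into the writer until the track runs out
    await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
        let queue = DispatchQueue(label: "audio.conversion")
        writerInput.requestMediaDataWhenReady(on: queue) {
            while writerInput.isReadyForMoreMediaData {
                if let sampleBuffer = readerOutput.copyNextSampleBuffer() {
                    if !writerInput.append(sampleBuffer) {
                        writerInput.markAsFinished()
                        continuation.resume()
                        break
                    }
                } else {
                    writerInput.markAsFinished()
                    continuation.resume()
                    break
                }
            }
        }
    }

    await writer.finishWriting()
    return outputURL
}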
VisionAnalyzer.Swift
In this class we pull frames from the video, then detect and track the faces we find in them.
import AVFoundation
import Vision
class VisionAnalyzer: ObservableObject {
private let videoURL: URL
private var videoAsset: AVAsset
private var videoTrack: AVAssetTrack?
let sequenceRequestHandler = VNSequenceRequestHandler()
private var faceTrackingHistory: [UUID: [FaceProfile.TimeRange]] = [:]
@Published var faceProfiles: [FaceProfile] = []
init(videoURL: URL) {
self.videoURL = videoURL
self.videoAsset = AVAsset(url: videoURL)
Task {
let tracks = try await videoAsset.loadTracks(withMediaType: .video)
if let track = tracks.first {
self.videoTrack = track
}
}
}
func detectAndTrackFaces() async throws -> [FaceProfile] {
let generator = AVAssetImageGenerator(asset: videoAsset)
generator.requestedTimeToleranceBefore = .zero
generator.requestedTimeToleranceAfter = .zero
generator.appliesPreferredTrackTransform = true
let duration = try await videoAsset.load(.duration)
let nominalFrameRate = try await videoTrack?.load(.nominalFrameRate) ?? 30.0
let frameCount = Int(duration.seconds * Double(nominalFrameRate))
let samplingInterval = max(frameCount / 300, 1)
var activeFaceTracks: [UUID: (lastBox: CGRect, lastTime: Double, avgPosition: CGPoint)] = [:]
var faceTemporalData: [UUID: (history: [Double], sum: Double)] = [:]
var faceTrackingHistory: [UUID: [FaceProfile.TimeRange]] = [:]
for frameIdx in stride(from: 0, to: frameCount, by: samplingInterval) {
let time = CMTime(seconds: Double(frameIdx) / Double(nominalFrameRate), preferredTimescale: 600)
guard let cgImage = try await getVideoFrame(at: time) else {
continue
}
let faceRequest = VNDetectFaceLandmarksRequest()
faceRequest.revision = VNDetectFaceLandmarksRequestRevision3
//The simulator cannot use the Neural Engine, so adding this block allows me to debug without a real device
#if targetEnvironment(simulator)
if #available(iOS 17.0, *) {
if let cpuDevice = MLComputeDevice.allComputeDevices.first(where: { $0.description.contains("MLCPUComputeDevice") }) {
faceRequest.setComputeDevice(.some(cpuDevice), for: .main)
}
} else {
faceRequest.usesCPUOnly = true
}
#endif
try sequenceRequestHandler.perform([faceRequest], on: cgImage)
guard let observations = faceRequest.results as? [VNFaceObservation] else {
continue
}
for observation in observations {
let boundingBox = observation.boundingBox
let timestamp = time.seconds
var mouthOpenness: Double = 0.0
if let innerLips = observation.landmarks?.innerLips,
let outerLips = observation.landmarks?.outerLips {
let innerPoints = innerLips.normalizedPoints
let outerPoints = outerLips.normalizedPoints
let innerVertical = (innerPoints.max(by: { $0.y < $1.y })?.y ?? 0) -
(innerPoints.min(by: { $0.y < $1.y })?.y ?? 0)
let outerVertical = (outerPoints.max(by: { $0.y < $1.y })?.y ?? 0) -
(outerPoints.min(by: { $0.y < $1.y })?.y ?? 0)
mouthOpenness = Double(max(innerVertical, outerVertical) * boundingBox.height)
}
let faceCenter = CGPoint(
x: boundingBox.midX,
y: boundingBox.midY
)
var matchedTrackID: UUID?
for (trackID, trackInfo) in activeFaceTracks {
let distance = sqrt(pow(faceCenter.x - trackInfo.avgPosition.x, 2) +
pow(faceCenter.y - trackInfo.avgPosition.y, 2))
if (distance < 0.15) && ((timestamp - trackInfo.lastTime) < 1.0) {
matchedTrackID = trackID
let newAvgPos = CGPoint(
x: (trackInfo.avgPosition.x * 0.7 + faceCenter.x * 0.3),
y: (trackInfo.avgPosition.y * 0.7 + faceCenter.y * 0.3)
)
activeFaceTracks[trackID] = (boundingBox, timestamp, newAvgPos)
break
}
}
let trackID = matchedTrackID ?? UUID()
if matchedTrackID == nil {
activeFaceTracks[trackID] = (boundingBox, timestamp, faceCenter)
}
var trackData = faceTemporalData[trackID] ?? (history: [], sum: 0.0)
trackData.history.append(mouthOpenness)
trackData.sum += mouthOpenness
faceTemporalData[trackID] = trackData
let avgOpenness = trackData.history.isEmpty ? 0.0 :
trackData.sum / Double(trackData.history.count)
let isSpeaking = mouthOpenness > max(0.05, avgOpenness * 1.5)
let timeRange = FaceProfile.TimeRange(
timestamp: timestamp,
boundingBox: boundingBox,
isSpeaking: isSpeaking,
mouthOpenness: mouthOpenness
)
if faceTrackingHistory[trackID] == nil {
faceTrackingHistory[trackID] = []
}
faceTrackingHistory[trackID]?.append(timeRange)
}
let currentTime = time.seconds
activeFaceTracks = activeFaceTracks.filter {
currentTime - $0.value.lastTime <= 1.0
}
}
var faceProfiles: [FaceProfile] = []
for (trackID, timeRanges) in faceTrackingHistory {
guard timeRanges.count >= 10 else {
continue
}
let avgX = timeRanges.map { $0.boundingBox.midX }.reduce(0, +) / Double(timeRanges.count)
let avgY = timeRanges.map { $0.boundingBox.midY }.reduce(0, +) / Double(timeRanges.count)
let totalMouthOpenness = timeRanges.reduce(0.0) { $0 + $1.mouthOpenness }
let avgMouthOpenness = totalMouthOpenness / Double(timeRanges.count)
let speakingCount = timeRanges.filter { $0.isSpeaking }.count
let profile = FaceProfile(
trackID: trackID,
timeRanges: timeRanges,
avgPosition: CGPoint(x: avgX, y: avgY),
mouthOpennessHistory: timeRanges.map { $0.mouthOpenness },
avgMouthOpenness: avgMouthOpenness
)
faceProfiles.append(profile)
}
return faceProfiles.sorted { $0.avgPosition.x < $1.avgPosition.x }
}
func getVideoFrame(at time: CMTime) async throws -> CGImage? {
let generator = AVAssetImageGenerator(asset: videoAsset)
generator.requestedTimeToleranceBefore = .zero
generator.requestedTimeToleranceAfter = .zero
return try await withCheckedThrowingContinuation { continuation in
generator.generateCGImagesAsynchronously(forTimes: [NSValue(time: time)]) {
requestedTime, image, actualTime, result, error in
switch result {
case .succeeded where image != nil:
continuation.resume(returning: image)
case .failed where error != nil:
continuation.resume(throwing: error!)
default:
continuation.resume(returning: nil)
}
}
}
}
}
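If you want to sanity check the vision side on its own before wiring everything together, a small hypothetical harness like this will print the tracks it found. The coordinator in the next section makes the same detectAndTrackFaces call as part of the full pipeline.
import Foundation

//Hypothetical standalone check of the VisionAnalyzer defined above
func printFaceTracks(for videoURL: URL) async throws {
    let analyzer = VisionAnalyzer(videoURL: videoURL)
    let profiles = try await analyzer.detectAndTrackFaces()
    for profile in profiles {
        print("Face \(profile.trackID): average position \(profile.avgPosition), " +
              "average mouth openness \(profile.avgMouthOpenness)")
    }
}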
AnalyzerCoOrdinator.Swift
Alright, now we’re almost done with the back-end components; all that’s left is to combine this data and build our pipeline.
This part of the project is more of an MVP: the implementation here is relatively basic, and I would greatly appreciate suggestions on how to improve the algorithm. Essentially, we take the timestamps we got from the Speech Analyzer and pair that information with a face. Speech Diarization gives us time ranges for each speaker, while the transcription model gives us timestamps for when words are spoken.
In the Vision file we grab frames at the timestamps that feature speech, then try to match each face to the Speech Diarization IDs. Finally, we log the face’s position and use it for our cool immersive audio shenanigans.
import AVFoundation
import Combine
import Vision
import Speech
class CombinedAnalysisCoordinator: ObservableObject {
@Published var analysisResult = CombinedAnalysisResult()
@Published var processingProgress: Double = 0.0
private let speechAnalyzer: SpeechAnalyzer
private let visionAnalyzer: VisionAnalyzer
private var cancellables = Set<AnyCancellable>()
init(videoURL: URL, audioURL: URL? = nil) {
self.speechAnalyzer = SpeechAnalyzer()
self.visionAnalyzer = VisionAnalyzer(videoURL: videoURL)
// Start the audio conversion process
speechAnalyzer.prepareAudio(sourceURL: audioURL ?? videoURL)
}
func preprocessVideoAndAudio() async {
// Wait for audio conversion to complete
while !speechAnalyzer.isAudioReady {
try? await Task.sleep(nanoseconds: 100_000_000)
}
do {
print("Starting speech recognition...")
// Perform speech recognition
let utterances = try await speechAnalyzer.performSpeechRecognition()
speechAnalyzer.recognizedUtterances = utterances
print("Performing speaker diarization...")
// Perform speaker diarization
let speakerProfiles = await speechAnalyzer.performSpeakerDiarization()
print("Detecting and tracking faces...")
// Process video frames to detect and track faces
let faceProfiles = try await visionAnalyzer.detectAndTrackFaces()
visionAnalyzer.faceProfiles = faceProfiles
print("Matching faces to speakers...")
// Match speech utterances to speaker segments
let matchedUtterances = speechAnalyzer.matchUtterancesToSpeakers(
utterances: speechAnalyzer.recognizedUtterances,
speakerProfiles: speakerProfiles
)
// Match faces to speakers
try await matchFacesToSpeakersUsingUtterances(
matchedUtterances: matchedUtterances,
speakerProfiles: speakerProfiles,
faceProfiles: faceProfiles
)
print( "Processing complete!")
DispatchQueue.main.async {
self.analysisResult.preprocessingComplete = true
}
} catch {
print("Error preprocessing video and audio: \(error)")
}
}
private func matchFacesToSpeakersUsingUtterances(
matchedUtterances: [(utterance: SpeechAnalyzer.RecognizedUtterance, speakerID: Int)],
speakerProfiles: [SpeakerProfile],
faceProfiles: [FaceProfile]
) async throws {
var matchedSpeakers: [CombinedAnalysisResult.MatchedSpeaker] = []
var speakerToFaceMatches: [Int: [(faceID: UUID, score: Double)]] = [:]
var faceSpeakerScores: [UUID: [Int: Double]] = [:]
// 1. Pre-calculate face-speaker segment affinity scores
for speaker in speakerProfiles {
for segment in speaker.segments {
let start = Double(segment.start)
let end = Double(segment.end)
for face in faceProfiles {
let speakingMoments = face.timeRanges.filter {
$0.timestamp >= start &&
$0.timestamp <= end &&
$0.isSpeaking
}
let segmentScore = speakingMoments.reduce(0.0) { total, moment in
let timeWeight = 1 - min(1, abs(moment.timestamp - (start + end)/2) / (end - start))
return total + moment.mouthOpenness * timeWeight
}
faceSpeakerScores[face.trackID, default: [:]][speaker.speakerID, default: 0] += segmentScore
}
}
}
// 2. Process each utterance with temporal sampling
for (utterance, speakerID) in matchedUtterances {
guard utterance.endTime - utterance.startTime >= 0.3 else { continue }
let sampleCount = Int((utterance.endTime - utterance.startTime) / 0.25)
let sampleStep = (utterance.endTime - utterance.startTime) / Double(max(1, sampleCount))
for sampleIndex in 0..<max(1, sampleCount) {
let analysisTime = utterance.startTime + Double(sampleIndex) * sampleStep
guard let cgImage = try await visionAnalyzer.getVideoFrame(at: CMTime(seconds: analysisTime, preferredTimescale: 600)) else {
continue
}
let faceRequest = VNDetectFaceLandmarksRequest()
try visionAnalyzer.sequenceRequestHandler.perform([faceRequest], on: cgImage)
guard let observations = faceRequest.results as? [VNFaceObservation] else {
continue
}
for observation in observations {
let boundingBox = observation.boundingBox
let faceCenter = CGPoint(
x: boundingBox.midX,
y: boundingBox.midY
)
var mouthOpenness: Double = 0
if let innerLips = observation.landmarks?.innerLips,
let outerLips = observation.landmarks?.outerLips {
let innerPoints = innerLips.normalizedPoints
let outerPoints = outerLips.normalizedPoints
let innerVertical = (innerPoints.max(by: { $0.y < $1.y })?.y ?? 0 ) -
(innerPoints.min(by: { $0.y < $1.y })?.y ?? 0)
let outerVertical = (outerPoints.max(by: { $0.y < $1.y })?.y ?? 0 ) - (outerPoints.min(by: { $0.y < $1.y })?.y ?? 0)
mouthOpenness = Double(max(innerVertical, outerVertical) * boundingBox.height)
}
guard mouthOpenness > 0.03 else { continue }
var bestMatch: (faceID: UUID, score: Double)? = nil
for faceProfile in faceProfiles {
guard let closestTimeRange = faceProfile.timeRanges.min(by: {
abs($0.timestamp - analysisTime) < abs($1.timestamp - analysisTime)
}) else { continue }
let timeDelta = abs(closestTimeRange.timestamp - analysisTime)
guard timeDelta < 0.5 else { continue }
let overlapRect = boundingBox.intersection(closestTimeRange.boundingBox)
let iouScore = overlapRect.width * overlapRect.height /
(boundingBox.width * boundingBox.height +
closestTimeRange.boundingBox.width * closestTimeRange.boundingBox.height -
overlapRect.width * overlapRect.height)
let temporalScore = 1 - min(1, timeDelta / 0.5)
let segmentAffinity = faceSpeakerScores[faceProfile.trackID]?[speakerID] ?? 0
let normalizedSegmentScore = min(segmentAffinity / 100, 1.0)
let spatialConsistency = calculateSpatialConsistency(
faceID: faceProfile.trackID,
speakerID: speakerID,
faceProfiles: faceProfiles,
speakerProfiles: speakerProfiles
)
let score = (iouScore * 0.3) +
(temporalScore * 0.2) +
(mouthOpenness * 0.2) +
(normalizedSegmentScore * 0.2) +
(spatialConsistency * 0.1)
if score > (bestMatch?.score ?? 0.5) {
bestMatch = (faceProfile.trackID, score)
}
}
if let match = bestMatch, match.score > 0.5 {
speakerToFaceMatches[speakerID, default: []].append((match.faceID, match.score))
}
}
}
}
// 3. Determine best face match for each speaker
for speakerID in speakerToFaceMatches.keys {
guard let matches = speakerToFaceMatches[speakerID], !matches.isEmpty else { continue }
var faceScores: [UUID: Double] = [:]
for (faceID, score) in matches {
faceScores[faceID, default: 0] += score
}
if let bestMatch = faceScores.max(by: { $0.value < $1.value }),
let speaker = speakerProfiles.first(where: { $0.speakerID == speakerID }),
let face = faceProfiles.first(where: { $0.trackID == bestMatch.key }) {
let matchedSpeaker = CombinedAnalysisResult.MatchedSpeaker(
speakerID: speakerID,
faceID: bestMatch.key,
position: face.avgPosition,
segments: speaker.segments
)
matchedSpeakers.append(matchedSpeaker)
}
}
// 4. Handle unmatched speakers with fallback strategy
let matchedSpeakerIDs = Set(matchedSpeakers.map { $0.speakerID })
let unmatchedSpeakers = speakerProfiles.filter { !matchedSpeakerIDs.contains($0.speakerID) }
for speaker in unmatchedSpeakers {
var bestFace: (id: UUID, score: Double)? = nil
for face in faceProfiles {
let score = faceSpeakerScores[face.trackID]?[speaker.speakerID] ?? 0
if score > (bestFace?.score ?? 0) {
bestFace = (face.trackID, score)
}
}
if let bestFace = bestFace, bestFace.score > 0 {
let matchedSpeaker = CombinedAnalysisResult.MatchedSpeaker(
speakerID: speaker.speakerID,
faceID: bestFace.id,
position: faceProfiles.first { $0.trackID == bestFace.id }?.avgPosition,
segments: speaker.segments
)
matchedSpeakers.append(matchedSpeaker)
}
}
DispatchQueue.main.async {
self.analysisResult.matchedSpeakers = matchedSpeakers.sorted { $0.speakerID < $1.speakerID }
}
}
private func calculateSpatialConsistency(
faceID: UUID,
speakerID: Int,
faceProfiles: [FaceProfile],
speakerProfiles: [SpeakerProfile]
) -> Double {
guard let speaker = speakerProfiles.first(where: { $0.speakerID == speakerID }),
let face = faceProfiles.first(where: { $0.trackID == faceID })
else { return 0.0 }
var positions: [CGPoint] = []
var timestamps: [Double] = []
for segment in speaker.segments {
let start = Double(segment.start)
let end = Double(segment.end)
let step = (end - start) / 4
//Skip zero length segments, which would crash stride with a zero step
guard step > 0 else { continue }
for sampleTime in stride(from: start, through: end, by: step) {
if let closest = face.timeRanges.min(by: {
abs($0.timestamp - sampleTime) < abs($1.timestamp - sampleTime)
}) {
positions.append(CGPoint(
x: closest.boundingBox.midX,
y: closest.boundingBox.midY
))
timestamps.append(closest.timestamp)
}
}
}
guard positions.count > 1 else { return 1.0 }
let avgX = positions.map { $0.x }.reduce(0, +) / Double(positions.count)
let avgY = positions.map { $0.y }.reduce(0, +) / Double(positions.count)
let positionVariance = positions.reduce(0) {
$0 + pow($1.x - avgX, 2) + pow($1.y - avgY, 2)
} / Double(positions.count)
let timeVariance = timestamps.reduce(0) {
let mean = timestamps.reduce(0, +) / Double(timestamps.count)
return $0 + pow($1 - mean, 2)
} / Double(timestamps.count)
let combinedScore = 1 / (1 + (positionVariance * 0.7 + timeVariance * 0.3))
return min(max(combinedScore, 0), 1)
}
func getCurrentSpeaker(at time: Double) -> CombinedAnalysisResult.MatchedSpeaker? {
for speaker in analysisResult.matchedSpeakers {
for segment in speaker.segments {
if Double(segment.start) <= time && time <= Double(segment.end) {
return speaker
}
}
}
return nil
}
func updateCurrentSpeakers(at time: Double) {
DispatchQueue.main.async {
for i in 0..<self.analysisResult.matchedSpeakers.count {
self.analysisResult.matchedSpeakers[i].isCurrentlySpeaking = false
}
for i in 0..<self.analysisResult.matchedSpeakers.count {
let speaker = self.analysisResult.matchedSpeakers[i]
for segment in speaker.segments {
if Double(segment.start) <= time && time <= Double(segment.end) {
self.analysisResult.matchedSpeakers[i].isCurrentlySpeaking = true
break
}
}
}
// This ensures that the View refreshes whenever we call this function
self.objectWillChange.send()
}
}
}
AnalyzerView.Swift
We’re done with the back end! Yay! That’s what a few months of part-time research looks like. Now for the fun part: putting it all together in a view and making it look nice. I’d like to reiterate that the code above, especially the face matching logic, is an MVP and more of a proof of concept than a final product. Please improve it, debug it, or expand it with additional cool features.
Moreover, this View has a hardcoded reference to Clip.mp4, but you can add a picker or something for it.
At the moment, all we’re doing with the position data is creating Text views of the relative positions.
import SwiftUI
import AVKit
struct CombinedAnalysisView: View {
@StateObject private var coordinator: CombinedAnalysisCoordinator
private let player: AVPlayer
@State private var currentTime: Double = 0
@State private var isPlaying: Bool = false
@State private var isProcessing: Bool = true
init() {
guard let url = Bundle.main.url(forResource: "Clip", withExtension: "mp4") else {
fatalError("Video not found in bundle.")
}
let playerItem = AVPlayerItem(url: url)
let player = AVPlayer(playerItem: playerItem)
player.pause()
let coordinator = CombinedAnalysisCoordinator(videoURL: url)
self.player = player
self._coordinator = StateObject(wrappedValue: coordinator)
}
var body: some View {
VStack {
if !isProcessing {
VideoPlayer(player: player)
.clipped()
.zIndex(0)
} else {
ProgressView()
}
// Speaker information
if !coordinator.analysisResult.preprocessingComplete {
Text(isProcessing ? "Processing Video / Audio..." : "Ready For Playback")
.padding()
} else {
if let currentSpeaker = coordinator.analysisResult
.matchedSpeakers
.first(where: { $0.isCurrentlySpeaking }) {
VStack(alignment: .leading) {
Text("Current Speaker: \(currentSpeaker.speakerID)")
.font(.headline)
if let position = currentSpeaker.position {
Text("Position: (\(String(format: "%.2f", position.x)), \(String(format: "%.2f", position.y)))")
Text("Audio Position: \(String(format: "%.2f", (position.x - 0.5) * 10)), \(String(format: "%.2f", (0.5 - position.y) * 10)), 0")
.font(.caption)
} else {
Text("Position: Unknown")
}
}
.padding()
.background(Color.gray.opacity(0.2))
.cornerRadius(8)
}
}
}
.onAppear {
// Start preprocessing
Task {
await coordinator.preprocessVideoAndAudio()
// Print for debugging
isProcessing = false
print("Preprocessing complete. Matched speakers: \(coordinator.analysisResult.matchedSpeakers.count)")
}
// Time observer to update current speaker
let interval = CMTime(seconds: 0.1, preferredTimescale: 600)
player.addPeriodicTimeObserver(forInterval: interval, queue: .main) { time in
let currentTime = time.seconds
self.currentTime = currentTime
// Update current speakers and their positions
coordinator.updateCurrentSpeakers(at: currentTime)
}
}
.navigationTitle("Carlos' Active Speaker Detection Article")
}
}
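And if you want to go beyond Text views, the “Audio Position” the view prints maps fairly naturally onto AVAudioEngine’s 3D mixing. This is just one possible direction and isn’t part of the project, but a sketch might look like this, with the speaker’s normalized position driving an AVAudioPlayerNode inside an AVAudioEnvironmentNode:
import AVFoundation
import CoreGraphics

//A possible spatial audio sink for the positions we computed, not part of the project
final class SpatialSpeakerAudio {
    private let engine = AVAudioEngine()
    private let environment = AVAudioEnvironmentNode()
    private let player = AVAudioPlayerNode()

    init() throws {
        engine.attach(environment)
        engine.attach(player)
        //3D positioning only applies to mono sources feeding an environment node
        let monoFormat = AVAudioFormat(standardFormatWithSampleRate: 44_100, channels: 1)
        engine.connect(player, to: environment, format: monoFormat)
        engine.connect(environment, to: engine.mainMixerNode, format: nil)
        environment.listenerPosition = AVAudio3DPoint(x: 0, y: 0, z: 0)
        try engine.start()
    }

    //Uses the same mapping the view prints above: (x - 0.5) * 10, (0.5 - y) * 10, 0
    func updateSpeakerPosition(_ position: CGPoint) {
        player.position = AVAudio3DPoint(
            x: Float(position.x - 0.5) * 10,
            y: Float(0.5 - position.y) * 10,
            z: 0
        )
    }
}
You’d call updateSpeakerPosition(_:) from the same periodic time observer that already calls updateCurrentSpeakers(at:).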
Alright Folks! That’s it. Here’s an example execution using a clip from the Trash Taste Podcast.
This video's copyright is owned by Trash Taste and is only being used for this demo under fair use and for educational purposes.
Link to the final project: https://github.com/carlosmbe/ActiveSpeakerDetectionStarter. The relevant files are in the ASDFiles folder.