How To Use Vision Hand Pose in SwiftUI (Updated)

Carlos Mbendera
7 min read

Back in the Olden Days, also known as 2023, I wrote an article on Medium documenting how a developer can set up the Vision framework in their iOS app and how I implemented it in my winning submission for the Apple WWDC23 Swift Student Challenge, Rhythm Snap. Since posting that article, a lot has happened in the SwiftUI world: we now have visionOS, Swift 6, and so much more. Likewise, I’ve also grown as a developer and can’t help but cringe at my older article. Thus, I have decided to update it.

Feel free to share some tips and advice in the comments; I’m always happy to learn something new.

What Is The Vision Framework?

Simply put, Apple’s Vision framework allows developers to access common Computer Vision tools in their Swift projects. It’s similar to how you can use Apple Maps with MapKit in your SwiftUI apps, except now we’re doing cool Computer Vision things.

Hand Pose is a subset of the Vision framework that focuses on, well… hands. With it, you can get detailed information on what hands are doing within some form of visual content, whether it be a video, a photo, a live stream from the camera, and so on. In this article, I’ll be focusing on setting up Hand Pose, but here’s a list of other interesting things you can do with the Vision framework:

  • Face Detection

  • Text Recognition

  • Tracking human and animal body poses

  • Image Analytics

  • Trajectory, contour, and horizon detection

  • More…

For the full list of capabilities, visit the Apple Vision documentation.

Sounds Cool. Now Show Me The Code

Before we write a single line of code, we need to decide what type of content we’re working with. If we’re working with frames being streamed from the Camera, then we need to write some logic that connects the Camera to the Vision Framework. On the other hand, if we’re working with local files, we don’t need to write the Camera logic and our main focus is on the Vision side.

This article focuses on the Camera approach. However, I recently wrote an implementation of Face Detection using a saved .mp4 file for another project. It should be a pretty good reference. I might write a follow-up article for that project sooner or later, depending on whether or not there’s demand for it.

Setting Up The Camera

First, to set up the camera, we will most likely need the user’s permission to access it. Thus, we need to tell Xcode what message to present to the user when we ask to use their camera.

Here’s Apple’s guide on doing that. For our use case, you’ll want to:

  1. Navigate to your Project’s target

  2. Go to the Target’s Info

  3. Add Privacy - Camera Usage Description and write a description for it
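For reference, the human-readable Privacy - Camera Usage Description entry maps to the raw NSCameraUsageDescription key in the Info.plist. If you’d like to check the permission state yourself before presenting the camera, here’s a minimal sketch using the async variant of the AVFoundation API; the checkCameraAccess name is just an example, not something from this project:

import AVFoundation

// A minimal sketch of checking camera authorization up front.
// The CameraViewController later in this article performs a similar check inside setupAVSession().
func checkCameraAccess() async -> Bool {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        return true
    case .notDetermined:
        // Presents the prompt with the usage description you just added.
        return await AVCaptureDevice.requestAccess(for: .video)
    default:
        // .denied or .restricted: point the user to Settings instead.
        return false
    }
}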

At this point, we need to write some code to create a viewfinder and access the camera. Since our main focus is on Vision, I will reuse and refactor most of my code from the first article. However, if you want a detailed breakdown of what’s going on, this article is rather thorough.

We Need 2 Files:

  1. CameraView

  2. CameraViewController

We’re essentially wrapping a UIKit View Controller for use in SwiftUI. This approach is based on the Apple Documentation for AVFoundation but stripped down a fair amount.

CameraView.swift

import SwiftUI

// A SwiftUI view that represents a `CameraViewController`.
struct CameraView: UIViewControllerRepresentable {

    // A closure that processes an array of CGPoint values.
    var handPointsProcessor: (([CGPoint]) -> Void)

    // Initializer that accepts a closure
    init(_ processor: @escaping ([CGPoint]) -> Void) {
        self.handPointsProcessor = processor
    }

    // Create the associated `UIViewController` for this SwiftUI view.
    func makeUIViewController(context: Context) -> CameraViewController {
        let camViewController = CameraViewController()
        camViewController.handPointsHandler = handPointsProcessor
        return camViewController
    }

    // Update the associated `UIViewController` for this SwiftUI view.
    // Currently not implemented as we don't need it for this app.
    func updateUIViewController(_ uiViewController: CameraViewController, context: Context) { }
}
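A quick note on the design: instead of a Coordinator, CameraView simply hands the view controller a closure. Every time the view controller extracts hand points from a frame, it calls that closure, which is how the data flows back into SwiftUI state.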

CameraViewController.swift

import AVFoundation
import UIKit
import Vision

enum CameraErrors: Error {
      case unauthorized, setupError, visionError
}

final class CameraViewController: UIViewController {

    // Queue for processing video data.
    private let videoDataOutputQueue = DispatchQueue(label: "CameraFeedOutput", qos: .userInteractive)
    private var cameraFeedSession: AVCaptureSession?

    //Vision Vars, these are used later
    var handPointsHandler: (([CGPoint]) -> Void)?

    // On loading, start the camera feed.
    override func viewDidLoad() {
        super.viewDidLoad()

        do {
            if cameraFeedSession == nil {
                try setupAVSession()
            }
            // Important: startRunning() blocks the calling thread, so dispatch it
            // to a background queue instead of calling it on the main thread,
            // otherwise it can hang the UI or crash.
            DispatchQueue.global(qos: .userInteractive).async {
                self.cameraFeedSession?.startRunning()
            }
        } catch {
            print(error.localizedDescription)
        }
    }


    // On disappearing, stop the camera feed.
    override func viewDidDisappear(_ animated: Bool) {
        cameraFeedSession?.stopRunning()
        super.viewDidDisappear(animated)
    }


    // Setting up the AV session.
    private func setupAVSession() throws {

        // Ask for camera permission; crash if it's denied.
        if AVCaptureDevice.authorizationStatus(for: .video) != .authorized {
            AVCaptureDevice.requestAccess(for: .video) { authorized in
                if !authorized {
                    fatalError("Camera access is required")
                }
            }
        }

        guard let videoDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front) else {
            throw CameraErrors.setupError
        }

        guard let deviceInput = try? AVCaptureDeviceInput(device: videoDevice) else {
            throw CameraErrors.setupError
        }

        let session = AVCaptureSession()
        session.beginConfiguration()
        session.sessionPreset = .high

        guard session.canAddInput(deviceInput) else {
            throw CameraErrors.setupError
        }

        session.addInput(deviceInput)

        let dataOutput = AVCaptureVideoDataOutput()
        if session.canAddOutput(dataOutput) {
            session.addOutput(dataOutput)
            dataOutput.alwaysDiscardsLateVideoFrames = true
            dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
        } else {
            throw CameraErrors.setupError
        }

        let previewLayer = AVCaptureVideoPreviewLayer(session: session)
        previewLayer.videoGravity = .resizeAspectFill
        view.layer.addSublayer(previewLayer)
        previewLayer.frame = view.bounds

        session.commitConfiguration()
        cameraFeedSession = session
    }

}
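One thing to watch out for: the preview layer’s frame is only set once in setupAVSession(), so it won’t track rotation or size changes. A possible refinement (not part of the original code, just a sketch) is to keep a reference to the layer and update it on layout:

// Hypothetical addition: store the preview layer as a property instead of a local,
// i.e. setupAVSession() would also assign to it (self.previewLayer = previewLayer).
private var previewLayer: AVCaptureVideoPreviewLayer?

// Keep the preview sized to the view whenever the layout changes.
override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    previewLayer?.frame = view.bounds
}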

Awesome! Now that we have the camera set up, you can add a CameraView to your ContentView.

An alternative and exciting approach for your camera setup is to use Combine. Here’s a Kodeco article if you want to try that out. You will have to write some form of frame manager class to handle your Vision calls, roughly along the lines of the sketch below.
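Here is a rough sketch of what such a class could look like; FrameManager and currentFrame are hypothetical names, not something taken from the Kodeco article:

import AVFoundation
import Combine

// A minimal sketch of a Combine-friendly frame manager.
// It publishes the latest camera frame so another object can run Vision requests on it.
final class FrameManager: NSObject, ObservableObject, AVCaptureVideoDataOutputSampleBufferDelegate {

    // The most recent pixel buffer from the camera.
    @Published var currentFrame: CVPixelBuffer?

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let buffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        DispatchQueue.main.async {
            self.currentFrame = buffer
        }
    }
}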

The Vision Stuff

To perform a VNDetectHumanHandPoseRequest, we’re going to make an extension for the view controller. You can place this code anywhere you want, really; I put it under the view controller.

// Extension to handle video data output and process it using Vision.
extension CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {

        // Vision request to detect human hand poses.
        let handPoseRequest = VNDetectHumanHandPoseRequest()
        // Using one hand to make debugging easier; you can change this value if you'd like to monitor more than one hand.
        handPoseRequest.maximumHandCount = 1

        var fingerTips: [CGPoint] = []

        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .up, options: [:])

        do {
            try handler.perform([handPoseRequest])
            guard let observations = handPoseRequest.results, !observations.isEmpty else {
                DispatchQueue.main.async {
                    self.handPointsHandler?([])
                }
                return
            }

            // Process the first detected hand
            guard let observation = observations.first else { return }

            // Get all hand points
            let points = try observation.recognizedPoints(.all)

            // Filter points with good confidence and convert coordinates
            let validPoints = points.filter { $0.value.confidence > 0.9 }
                .map { CGPoint(x: $0.value.location.x, y: 1 - $0.value.location.y) }

            DispatchQueue.main.async {
                self.handPointsHandler?(Array(validPoints))
            }


        } catch {
            cameraFeedSession?.stopRunning()
        }
    }
}
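One thing worth noting: recognizedPoints(.all) returns every joint Vision tracks on the hand, not just the tips, even though our array is called fingerTips. If you only want the five actual fingertips, a small variation (just a sketch, reusing the same confidence filter and Y-flip as above) could query specific joints instead:

// Sketch: extract only the five fingertip joints from the observation.
let tipJoints: [VNHumanHandPoseObservation.JointName] = [
    .thumbTip, .indexTip, .middleTip, .ringTip, .littleTip
]

let tipPoints: [CGPoint] = tipJoints.compactMap { joint in
    guard let point = try? observation.recognizedPoint(joint),
          point.confidence > 0.9 else { return nil }
    return CGPoint(x: point.location.x, y: 1 - point.location.y)
}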

ContentView.swift

This ContentView does not have an overlay or anything; it just has the viewfinder. You can use this as a checkpoint to make sure you’ve got everything set up.

Please note that it is in fact making calls to the Hand Pose API, so you can add some print statements if you’d like (there’s a tiny example after the code below) or branch off here and add your own logic.

import SwiftUI

struct ContentView: View {

    @State private var fingerTips: [CGPoint] = []

    var body: some View {
        ZStack {
            CameraView{ points in
                fingerTips = points
            }

        }
        .edgesIgnoringSafeArea(.all)
    }
}


#Preview {
    ContentView()
}
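If you want to confirm that points are actually coming through before building the overlay, a quick sanity check is to print them as they arrive:

CameraView { points in
    fingerTips = points
    print("Detected \(points.count) hand points")
}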

Finger Tip Overlay ContentView.swift

To have an overlay, we need to convert the coordinates of the fingertips from their position in the video to a relative position in the viewfinder. This makes sense when you consider all the different screen shapes of iPhones, Macs, and iPads.

import SwiftUI
import Vision

struct ContentView: View {
    @State private var fingerTips: [CGPoint] = []
    @State private var viewSize: CGSize = .zero

    // This points view is based on code that @LeeShinwon wrote. I thought it was more elegant than the previous overlay from my last article.
    private var pointsView: some View {
        ForEach(fingerTips.indices, id: \.self) { index in

            let pointWork = fingerTips[index]
            let screenSize = UIScreen.main.bounds.size
            let point = CGPoint(x: pointWork.y * screenSize.width, y: pointWork.x * screenSize.height)

            Circle()
                .fill(.orange)
                .frame(width: 15)
                .position(x: point.x, y: point.y)
        }
    }

    var body: some View {
        GeometryReader { geometry in
            ZStack {

                CameraView { points in
                    fingerTips = points
                }

                pointsView

            }
        }
        .edgesIgnoringSafeArea(.all)
    }
}

Additional Readings: Sample Projects And Helpful Articles

If you’d like to experiment with text and face detection, there is a pretty good and brief article from Hacker Noon.

Kodeco provides a very strong introduction to the Hand Pose Framework in this article.

Detailed breakdown of setting up the Camera from CreateWithSwift

CreateWithSwift’s similar take on this article: Detecting hand pose with the Vision framework

Rhythm Snap - App that taught users how to have better rhythm by using an iPad’s camera to monitor their finger taps in real time

Open Mouth Detection With Vision - Basically checks if any of the faces detected have their mouth open.

A Starter Project by Apple WWDC24 Swift Student Challenge Winner, Shinwon Lee

Apple WWDC20 Video on Vision Hand Pose

Apple Sample Project - Hand Pose


Written by

Carlos Mbendera

bff.fm tonal architect who occasionally writes cool software