How We Built a Real-Time Feedback-Assisted Auto Face Capture in React
Capturing a photo that meets strict criteria can be tricky: users need to keep their face aligned, the lighting has to be right, and nothing should obstruct the face. Recently, I had the opportunity to work on an auto face capture feature that guides users in real time while they take the photo.
The feature automatically captures a photo once all conditions are met, eliminating the need for manual intervention. It combines the MediaPipe Face Landmarker machine learning model with a secondary model that detects facial attributes the Face Landmarker cannot, all integrated into a React-based UI. In this blog, we'll mainly dive into how MediaPipe Face Landmarker can be used to process frames from a video stream and return near real-time results.
Overview
The auto face capture is designed to provide real-time feedback while the user is in front of the camera, ensuring their photo meets all criteria before it’s captured. Here’s a high-level overview of what this feature does:
Face Detection: Detects facial landmarks (eyes, nose, mouth, etc.) and facial attributes.
Validation: Checks for conditions such as proper lighting, face alignment, distance from the camera, and whether the face is covered.
Auto Capture: Once the face meets all the required conditions, the system automatically captures the frame after a short countdown.
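To make the real-time feedback concrete: each processed frame yields a status, and the UI maps that status to a message for the user. Here is a minimal sketch of such a mapping; the message wording is illustrative and not taken from the original project.

type CaptureStatus = 'TOO_DARK' | 'TOO_BRIGHT' | 'MULTIPLE_FACE' | 'GOOD_PHOTO';

// Illustrative user-facing feedback for each capture status
const FEEDBACK_MESSAGES: Record<CaptureStatus, string> = {
  TOO_DARK: 'The room is too dark, move to a brighter spot',
  TOO_BRIGHT: 'Too much light, avoid direct light on your face',
  MULTIPLE_FACE: 'Only one face should be in the frame',
  GOOD_PHOTO: 'Hold still, capturing in a moment…',
};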
Tools Used
MediaPipe Face Landmarker ML Model: This model identifies facial landmarks, providing the x, y, z coordinates of key points on the face.
Ref - https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker
React: for rendering the UI
Step-by-Step Implementation with React
Let's break down how this is implemented, step by step.
Creating Face Landmarker instance
First, we need to install Google's @mediapipe/tasks-vision package, which provides the face landmark detection APIs. Once installed, we can initialize the Face Landmarker instance, which also downloads the model binary (face_landmarker.task).
import { FaceLandmarker, FilesetResolver } from '@mediapipe/tasks-vision';

export const createFaceLandmarker = async () => {
  // Load the WASM fileset required by the vision tasks
  const filesetResolver = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm',
  );
  const faceLandmarker = await FaceLandmarker.createFromOptions(
    filesetResolver,
    {
      baseOptions: {
        modelAssetPath: `https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task`,
        delegate: 'CPU', // or 'GPU', check if GPU is available and set accordingly
      },
      outputFaceBlendshapes: true,
      runningMode: 'IMAGE',
      numFaces: 50,
    },
  );
  return faceLandmarker;
};
We will run this model in IMAGE mode, since each video frame is drawn onto a canvas and passed to the model as a static image.
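The delegate option in the snippet above is hard-coded to 'CPU', with a comment suggesting you check for GPU availability. One rough heuristic for that check (my own, not an official MediaPipe API) is to probe for a WebGL2 context and fall back to CPU when it is unavailable:

// Rough heuristic: the GPU delegate typically relies on WebGL2,
// so fall back to CPU when a WebGL2 context cannot be created.
const getPreferredDelegate = (): 'GPU' | 'CPU' => {
  try {
    const canvas = document.createElement('canvas');
    return canvas.getContext('webgl2') ? 'GPU' : 'CPU';
  } catch {
    return 'CPU';
  }
};

// Usage inside createFaceLandmarker:
// baseOptions: { modelAssetPath, delegate: getPreferredDelegate() },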
Setting up the Video Stream
Next, we access the device camera and stream the video feed into an HTML video element. We do this with the browser's navigator.mediaDevices.getUserMedia API and React's useRef to manage the video element.
const videoRef = useRef<HTMLVideoElement | null>(null);

useEffect(() => {
  if (navigator.mediaDevices.getUserMedia) {
    navigator.mediaDevices.getUserMedia({ video: true })
      .then((stream) => {
        if (videoRef.current) {
          videoRef.current.srcObject = stream;
        }
      })
      .catch((error) => {
        console.error("Camera access error: ", error);
      });
  }
}, []);
videoRef: a reference to the video element where the video stream is displayed. This is essential for accessing and controlling the video feed within the React component.
useEffect: this hook ensures that camera access is requested and the stream is attached to the video element as soon as the component mounts.
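One thing worth adding, not shown in the snippet above, is cleanup: stopping the camera tracks when the component unmounts so the camera isn't left running. A minimal sketch that extends the same effect:

useEffect(() => {
  let stream: MediaStream | null = null;

  navigator.mediaDevices.getUserMedia({ video: true })
    .then((s) => {
      stream = s;
      if (videoRef.current) {
        videoRef.current.srcObject = s;
      }
    })
    .catch((error) => console.error('Camera access error: ', error));

  // Stop all tracks on unmount so the camera is released
  return () => {
    stream?.getTracks().forEach((track) => track.stop());
  };
}, []);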
Processing Video Frames Using Canvas and ML Models
Once the video stream is active, we need to process each frame and run it through the machine learning models. We use an HTML <canvas> element (controlled via a React useRef) to capture and process the video frames in real time.

import { FaceLandmarkerResult } from '@mediapipe/tasks-vision';

const canvasRef = useRef<HTMLCanvasElement | null>(null);
const isModelRunningRef = useRef(false);
const [captureStatus, setCaptureStatus] = useState('');

const validateFrame = (
  faceLandmarkerResult?: FaceLandmarkerResult,
  canvas?: HTMLCanvasElement,
  modelResult?: unknown, // result of the secondary model, used by the other checks
) => {
  const { isTooBright, isTooDark } = isTooDarkOrTooBright(canvas);
  if (isTooDark) {
    return 'TOO_DARK';
  }
  if (isTooBright) {
    return 'TOO_BRIGHT';
  }
  if (isMultipleFaces(faceLandmarkerResult)) {
    return 'MULTIPLE_FACE';
  }
  // ...all other checks can be added here.
  return 'GOOD_PHOTO';
};

const runModel = (canvas, faceLandmarker) => {
  // Run the models on a new frame only if processing of the previous frame is complete.
  // Note that this means some frames are skipped and never processed.
  if (isModelRunningRef.current === true) return;
  isModelRunningRef.current = true;

  // Process the frame using Face Landmarker
  const faceLandmarks = faceLandmarker.detect(canvas);

  // Process the frame using the internal (secondary) ML model
  const modelResult = runTFLiteModel(canvas);

  // Validate the frame
  const captureStatus = validateFrame(faceLandmarks, canvas, modelResult);
  setCaptureStatus(captureStatus); // use this state to show feedback on the UI

  if (captureStatus !== 'GOOD_PHOTO') {
    stopCapture();
    isModelRunningRef.current = false;
    return;
  }

  // captureStatus is GOOD_PHOTO, start the capture
  startCapture();
  isModelRunningRef.current = false;
};

const processFrame = (faceLandmarker) => {
  const canvas = canvasRef.current;
  const video = videoRef.current;
  if (canvas && video) {
    const context = canvas.getContext('2d')!;
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;

    // Draw the current video frame onto the canvas
    context.drawImage(video, 0, 0, canvas.width, canvas.height);
    runModel(canvas, faceLandmarker);

    // Continue processing frames recursively
    window.requestAnimationFrame(() => processFrame(faceLandmarker));
  }
};

navigator.mediaDevices
  .getUserMedia(constraints)
  .then((stream) => {
    streamRef.current = stream;
    if (videoRef.current == null) return;
    videoRef.current.srcObject = stream;
    videoRef.current.play();
    // Start processing frames on the `loadeddata` event of the video element.
    videoRef.current.addEventListener('loadeddata', () =>
      processFrame(faceLandmarker),
    );
  });
canvasRef: a reference to the canvas element onto which each frame from the video is drawn.
processFrame: this function is called recursively via window.requestAnimationFrame, which means it runs after each repaint done by the browser. This has its own advantage: for instance, if the tab is not active, processFrame is not called.
startCapture(): starts the countdown and handles whatever is needed once the countdown begins.
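For completeness, here's roughly what the JSX could look like; this is a sketch, and the styling as well as the decision to keep the canvas hidden are my own assumptions. The video element shows the live preview, while the canvas is used only for frame processing.

return (
  <div>
    {/* Live camera preview */}
    <video ref={videoRef} autoPlay playsInline muted />

    {/* Off-screen canvas used only for frame processing */}
    <canvas ref={canvasRef} style={{ display: 'none' }} />

    {/* Real-time feedback driven by captureStatus */}
    <p>{captureStatus}</p>
  </div>
);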
Validating the frame
To ensure the captured photo meets all the necessary criteria, we validate each frame by running various checks. Here are the utility functions used for validation:
Lighting Validation (isTooDarkOrTooBright)
This utility checks whether the lighting is too dark or too bright, based on the average brightness of all pixels in the frame.

const TOO_DARK_THRESHOLD = 60;
const TOO_BRIGHT_THRESHOLD = 200;

// This function converts each pixel to grayscale and returns the average over all pixels,
// so the final value is between 0 (darkest) and 255 (brightest).
const getFrameBrightness = (canvas: HTMLCanvasElement) => {
  const ctx = canvas.getContext('2d');
  if (!ctx) return;
  let colorSum = 0;
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const data = imageData.data;
  let r, g, b, avg;
  for (let x = 0, len = data.length; x < len; x += 4) {
    r = data[x];
    g = data[x + 1];
    b = data[x + 2];
    avg = Math.floor((r + g + b) / 3);
    colorSum += avg;
  }
  // value between 0 - 255
  const brightness = Math.floor(colorSum / (canvas.width * canvas.height));
  return brightness;
};

const isTooDarkOrTooBright = (canvas: HTMLCanvasElement) => {
  const brightness = getFrameBrightness(canvas);
  let isTooDark = false;
  let isTooBright = false;
  if (brightness == null) {
    return { isTooBright, isTooDark };
  }
  if (brightness < TOO_DARK_THRESHOLD) {
    isTooDark = true;
  } else if (brightness > TOO_BRIGHT_THRESHOLD) {
    isTooBright = true;
  }
  return { isTooBright, isTooDark };
};
Checking for Multiple Faces (isMultipleFaces)
The result returned by the Face Landmarker can be passed to this utility; if landmarks for more than one face are present, it returns true.

import { FaceLandmarkerResult } from '@mediapipe/tasks-vision';

export const isMultipleFaces = (faceLandmarkerResult?: FaceLandmarkerResult) => {
  if (faceLandmarkerResult && faceLandmarkerResult.faceLandmarks.length > 1) {
    return true;
  }
  return false;
};
Face Cutoff Detection (isFaceCutOffScreen)
This function checks whether any of the face landmarks fall outside the boundaries of the image (canvas). Since the x and y coordinates in the Face Landmarker result are normalized, we convert them to actual pixel coordinates by multiplying with the frame width and height.

import { NormalizedLandmark } from '@mediapipe/tasks-vision';

function isFaceCutOffScreen(
  faceLandmarks: NormalizedLandmark[],
  imgW: number,
  imgH: number,
): boolean {
  for (const landmark of faceLandmarks) {
    const x = Math.round(landmark.x * imgW);
    const y = Math.round(landmark.y * imgH);
    // A landmark on or beyond the frame border means the face is cut off
    if (x <= 0 || x >= imgW || y <= 0 || y >= imgH) {
      return true;
    }
  }
  return false;
}
Face Distance Detection (isFaceTooClose, isFaceTooFar)
These functions determine whether the face is too close to or too far from the camera by measuring the pixel distance between the eyes.

import { NormalizedLandmark } from '@mediapipe/tasks-vision';

// Calculate the Euclidean distance between two points
const getDistance = (point1: number[], point2: number[]): number => {
  const [x1, y1] = point1;
  const [x2, y2] = point2;
  return Math.sqrt(Math.pow(x2 - x1, 2) + Math.pow(y2 - y1, 2));
};

const FACE_TOO_CLOSE_THRESHOLD = 370;
const FACE_TOO_FAR_THRESHOLD = 300;

function isFaceTooFar(
  landmark: NormalizedLandmark[],
  imgW: number,
  imgH: number,
  threshold: number = FACE_TOO_FAR_THRESHOLD,
): boolean {
  // Landmarks 33 and 263 are eye-corner landmarks in the MediaPipe face mesh
  const leftEye = [landmark[33].x * imgW, landmark[33].y * imgH];
  const rightEye = [landmark[263].x * imgW, landmark[263].y * imgH];
  // Calculate the distance between the eyes
  const eyeDistance = getDistance(leftEye, rightEye);
  return eyeDistance < threshold;
}

function isFaceTooClose(
  landmark: NormalizedLandmark[],
  imgW: number,
  imgH: number,
  threshold: number = FACE_TOO_CLOSE_THRESHOLD,
): boolean {
  const leftEye = [landmark[33].x * imgW, landmark[33].y * imgH];
  const rightEye = [landmark[263].x * imgW, landmark[263].y * imgH];
  // Calculate the distance between the eyes
  const eyeDistance = getDistance(leftEye, rightEye);
  return eyeDistance > threshold;
}
Is the Face Centered?
These functions check whether the face is positioned too far to the left, right, up, or down in the frame. This is done by checking the leftmost, rightmost, topmost, and bottommost landmark points and comparing them against thresholds.

// Note: these thresholds are in pixels and depend on the frame resolution.
const FACE_TOO_RIGHT_THRESHOLD = 500;
const FACE_TOO_LEFT_THRESHOLD = 600;
const FACE_TOO_FAR_UP_THRESHOLD = 150;
const FACE_TOO_FAR_DOWN_THRESHOLD = 450;

function isFaceTooFarLeft(
  landmark: NormalizedLandmark[],
  imgWidth: number,
  thresholdRatio: number = FACE_TOO_LEFT_THRESHOLD,
): boolean {
  const leftmostX = Math.min(
    landmark[1].x * imgWidth,
    landmark[263].x * imgWidth,
  );
  return leftmostX > thresholdRatio;
}

function isFaceTooFarRight(
  landmark: NormalizedLandmark[],
  imgWidth: number,
  thresholdRatio: number = FACE_TOO_RIGHT_THRESHOLD,
): boolean {
  const rightmostX = Math.max(
    landmark[1].x * imgWidth,
    landmark[263].x * imgWidth,
  );
  return rightmostX < thresholdRatio;
}

function isFaceTooFarUp(
  landmark: NormalizedLandmark[],
  imgHeight: number,
  thresholdRatio: number = FACE_TOO_FAR_UP_THRESHOLD,
): boolean {
  const topmostY = landmark[10].y * imgHeight;
  return topmostY < thresholdRatio;
}

function isFaceTooFarDown(
  landmark: NormalizedLandmark[],
  imgHeight: number,
  thresholdRatio: number = FACE_TOO_FAR_DOWN_THRESHOLD,
): boolean {
  const bottommostY = landmark[10].y * imgHeight;
  return bottommostY > thresholdRatio;
}
Are Eyes Closed?
Fortunately, the Face Landmarker also returns face blendshapes, which describe facial attributes such as whether the eyes are closed or where the person is looking. We can use two of these attributes (eyeBlinkLeft and eyeBlinkRight) to check whether the eyes are closed.
For more such attributes, refer to this codepen - https://codepen.io/mediapipe-preview/pen/OJBVQJm

import { FaceLandmarkerResult } from '@mediapipe/tasks-vision';

const isEyesClosed = (faceLandmarkResult: FaceLandmarkerResult) => {
  // Pick the eyeBlinkLeft and eyeBlinkRight blendshape scores
  const result = faceLandmarkResult?.faceBlendshapes?.[0]?.categories
    ?.filter(
      (category: any) =>
        category.categoryName === 'eyeBlinkLeft' ||
        category.categoryName === 'eyeBlinkRight',
    )
    ?.map((category: any) => category.score);
  if (!result) return false;
  // Either eye is considered closed when its blink score crosses 0.5
  return result[0] > 0.5 || result[1] > 0.5;
};
Detecting Head Orientation
To check whether the user is looking up, down, left, or right, we can calculate the yaw and pitch angles of the head. One way to compute these angles is with the OpenCV library, which involves some fairly complex calculations on the landmark points. You can check it out here - https://medium.com/@susanne.thierfelder/head-pose-estimation-with-mediapipe-and-opencv-in-javascript-c87980df3acb
I did not want to add OpenCV as a dependency just for this use case, so I found an alternative approach that does a decent job. You can read more about it here - https://medium.com/@sshadmand/a-simple-and-efficient-face-direction-detection-in-react-e02cd9d547e5
Here's how I implemented it:
import { FaceLandmarkerResult, NormalizedLandmark } from '@mediapipe/tasks-vision';

// PITCH_UP_THRESHOLD, PITCH_DOWN_THRESHOLD, YAW_LEFT_THRESHOLD and YAW_RIGHT_THRESHOLD
// are tunable angle thresholds (in degrees) defined elsewhere in the project.

const getAngleBetweenLines = (
  midpoint: NormalizedLandmark,
  point1: NormalizedLandmark,
  point2: NormalizedLandmark,
) => {
  const vector1 = { x: point1.x - midpoint.x, y: point1.y - midpoint.y };
  const vector2 = { x: point2.x - midpoint.x, y: point2.y - midpoint.y };

  // Calculate the dot product of the two vectors
  const dotProduct = vector1.x * vector2.x + vector1.y * vector2.y;

  // Calculate the magnitudes of the vectors
  const magnitude1 = Math.sqrt(vector1.x * vector1.x + vector1.y * vector1.y);
  const magnitude2 = Math.sqrt(vector2.x * vector2.x + vector2.y * vector2.y);

  // Calculate the cosine of the angle between the two vectors
  const cosineTheta = dotProduct / (magnitude1 * magnitude2);

  // Use the arccosine function to get the angle in radians
  const angleInRadians = Math.acos(cosineTheta);

  // Convert the angle to degrees
  const angleInDegrees = (angleInRadians * 180) / Math.PI;
  return angleInDegrees;
};

const calculateDirection = (faceLandmarkerResult: FaceLandmarkerResult) => {
  const landmarks = faceLandmarkerResult.faceLandmarks[0];

  // Nose tip, leftmost and rightmost points of the nose
  if (!landmarks?.[1] || !landmarks?.[279] || !landmarks?.[49])
    return {
      isLookingDown: false,
      isLookingLeft: false,
      isLookingRight: false,
      isLookingUp: false,
    };
  const noseTip = { ...landmarks[1] };
  const leftNose = { ...landmarks[279] };
  const rightNose = { ...landmarks[49] };

  // The midsection of the nose acts as the base of the perpendicular
  const midpoint: NormalizedLandmark = {
    x: (leftNose.x + rightNose.x) / 2,
    y: (leftNose.y + rightNose.y) / 2,
    z: (leftNose.z + rightNose.z) / 2,
    visibility: 0,
  };
  const perpendicularUp: NormalizedLandmark = {
    x: midpoint.x,
    y: midpoint.y - 50,
    z: midpoint.z,
    visibility: 0,
  };

  // Calculate the angles
  const pitch = getAngleBetweenLines(midpoint, noseTip, perpendicularUp);
  const yaw = getAngleBetweenLines(midpoint, rightNose, noseTip);

  const isLookingUp = pitch < PITCH_UP_THRESHOLD;
  const isLookingDown = pitch > PITCH_DOWN_THRESHOLD;
  const isLookingLeft = yaw > YAW_LEFT_THRESHOLD;
  const isLookingRight = yaw < YAW_RIGHT_THRESHOLD;

  return { isLookingDown, isLookingLeft, isLookingRight, isLookingUp };
};
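Putting the checks together: here is a rough sketch (my own, not a verbatim excerpt from the project) of how the earlier validateFrame could wire up all of these utilities, returning the first failing status so the UI can show the most relevant feedback. The status strings beyond the ones used earlier and the ordering are my own choices, and the directional checks from calculateDirection are folded into a single FACE_NOT_STRAIGHT status for brevity.

const validateFrame = (
  faceLandmarkerResult?: FaceLandmarkerResult,
  canvas?: HTMLCanvasElement,
) => {
  if (!canvas) return 'NO_FRAME';

  // 1. Lighting
  const { isTooBright, isTooDark } = isTooDarkOrTooBright(canvas);
  if (isTooDark) return 'TOO_DARK';
  if (isTooBright) return 'TOO_BRIGHT';

  // 2. Face presence and count
  const landmarks = faceLandmarkerResult?.faceLandmarks?.[0];
  if (!landmarks) return 'NO_FACE';
  if (isMultipleFaces(faceLandmarkerResult)) return 'MULTIPLE_FACE';

  const { width, height } = canvas;

  // 3. Framing: cut-off, distance and centering
  if (isFaceCutOffScreen(landmarks, width, height)) return 'FACE_CUT_OFF';
  if (isFaceTooFar(landmarks, width, height)) return 'TOO_FAR';
  if (isFaceTooClose(landmarks, width, height)) return 'TOO_CLOSE';
  if (
    isFaceTooFarLeft(landmarks, width) ||
    isFaceTooFarRight(landmarks, width) ||
    isFaceTooFarUp(landmarks, height) ||
    isFaceTooFarDown(landmarks, height)
  ) {
    return 'NOT_CENTERED';
  }

  // 4. Head pose and eyes
  const { isLookingUp, isLookingDown, isLookingLeft, isLookingRight } =
    calculateDirection(faceLandmarkerResult!);
  if (isLookingUp || isLookingDown || isLookingLeft || isLookingRight) {
    return 'FACE_NOT_STRAIGHT';
  }
  if (isEyesClosed(faceLandmarkerResult!)) return 'EYES_CLOSED';

  return 'GOOD_PHOTO';
};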
Face Capture and Final Confirmation
Once all validations pass and the frame is deemed valid, a countdown starts, and the frame is captured automatically.
A useCountdown hook can be implemented from scratch or consumed from an external package. I used the usehooks-ts package since I did not want to reinvent the wheel, and it handles the nitty-gritty details of the hook's implementation.

import { useCountdown } from 'usehooks-ts';

const isCapturingRef = useRef(false);
const [photo, setPhoto] = useState<Blob | null>(null);

const [count, { startCountdown, stopCountdown, resetCountdown }] =
  useCountdown({
    countStart: 3,
    countStop: 1,
    intervalMs: 1000,
  });

const startCapture = () => {
  startCountdown();
};

const stopCapture = () => {
  stopCountdown();
  resetCountdown();
};

const onImageCapture = () => {
  if (canvasRef && canvasRef.current) {
    const context = canvasRef.current.getContext('2d');
    if (context) {
      // Convert the canvas to a blob and store the photo in state
      canvasRef.current.toBlob((b) => setPhoto(b), 'image/jpeg', 0.9);
    }
  }
};

useEffect(() => {
  if (count === 1) {
    onImageCapture();
  }
}, [count]);
Finally, we have the captured photo stored in the React state photo, which can be consumed as needed. It can be shown to the user for confirmation and then sent to upstream services.
Useful trick
To get an image URL from a blob, you can simply use URL.createObjectURL(photo). This returns a string that can be passed to the src attribute of an img tag.
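For example, a minimal sketch of a preview component; the component name and the revocation logic are my own additions, not part of the original project.

import { useEffect, useState } from 'react';

// Hypothetical preview component for the captured photo blob
const PhotoPreview = ({ photo }: { photo: Blob }) => {
  const [url, setUrl] = useState('');

  useEffect(() => {
    const objectUrl = URL.createObjectURL(photo);
    setUrl(objectUrl);
    // Release the object URL when the component unmounts or the photo changes
    return () => URL.revokeObjectURL(objectUrl);
  }, [photo]);

  return url ? <img src={url} alt="Captured photo" /> : null;
};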
Fine-Tuning Thresholds
While the checks above work well out of the box, they are highly customizable. You can adjust the thresholds for brightness, face distance, centering, and so on to suit your use case.
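One way to keep these knobs in one place (a sketch of my own, not how the original project organizes them) is a single config object whose values mirror the constants used earlier in this post, passed into the utilities as overrides.

// Centralised thresholds; values mirror the constants defined earlier
const CAPTURE_THRESHOLDS = {
  brightness: { tooDark: 60, tooBright: 200 },
  eyeDistancePx: { tooClose: 370, tooFar: 300 },
  centeringPx: {
    tooLeft: 600,
    tooRight: 500,
    tooFarUp: 150,
    tooFarDown: 450,
  },
} as const;

// Example: pass an override into one of the utilities from earlier
// (assuming `landmarks` and `canvas` are in scope, as in processFrame above)
const tooFar = isFaceTooFar(
  landmarks,
  canvas.width,
  canvas.height,
  CAPTURE_THRESHOLDS.eyeDistancePx.tooFar,
);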
Performance Optimization
Since the models run continuously, processing one frame after another, they can overwhelm the main thread, leaving the UI frozen and degrading the user experience. To avoid this, we can run the models asynchronously. Especially for time-consuming operations like face detection, asynchronous execution helps keep the interface responsive.
So I wrote a wrapper that turns a synchronous call into a promise (deferring it with setTimeout), and used it to run the Face Landmarker.
function asyncWrapper<T>(syncFunction: () => T): Promise<T> {
  return new Promise((resolve, reject) => {
    // Defer the synchronous work to a macrotask so the current frame can finish rendering
    setTimeout(() => {
      try {
        const result = syncFunction();
        resolve(result);
      } catch (error) {
        reject(error);
      }
    }, 0);
  });
}
const runModel = async () => {
  //...
  const faceLandmarks = await asyncWrapper(() => faceLandmarker.detect(canvas));
  //...
};
I hope this is helpful to you! If you have any questions or need further assistance, don't hesitate to reach out to me.