✋ Controlling Volume and Brightness Using Hand Gestures in Python with MediaPipe, Pycaw, and OpenCV


This project is an intermediate step toward something more complex, but it was still a challenge in its own right. In this post I break down the hand-gesture volume and brightness control system I built using MediaPipe, OpenCV, and a few other low-level Python libraries. The goal is not just to detect hand gestures but to apply the recognition to a real-world use case. (The show in the cover photo is BBC's Sherlock, by the way.)
Introduction
This project implements a real-time volume and screen brightness controller driven by a webcam and hand gestures. It's built in Python using the following libraries:
OpenCV: For image processing and display.
MediaPipe: For hand detection and tracking.
Pycaw: For volume control (Windows only, unfortunately).
screen-brightness-control: For screen brightness control (obviously)
custom hand_tracking_module: built with MediaPipe and OpenCV; the inspiration for this code comes from computervisionzone.
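If you just want to pull these libraries in directly, they are all available on PyPI under the following package names (versions left unpinned here):
pip install opencv-python mediapipe numpy comtypes pycaw screen-brightness-control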
System Architecture
The system uses hand gestures to control volume and screen brightness. The core logic is detecting hands and tracking the exact locations of a few landmarks on each hand, in this case the tips of the thumb, index finger, and pinky finger. For the left hand, the distance between the index fingertip and thumb tip controls the volume; for the right hand, the same distance controls the brightness. To make sure changes don't happen when you don't want them to, there is a failsafe: the pinky fingertip must be below the index fingertip before either hand can make a change. A rough sketch of the per-frame logic is shown below.
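To make that flow concrete, here is a minimal sketch of the per-frame decision logic. The landmark IDs 4, 8, and 20 are MediaPipe's thumb tip, index fingertip, and pinky tip; set_volume_from and set_brightness_from are placeholder names for the adjustment steps covered later, not functions from the actual script:
# sketch: each lmlist entry is [id, x, y] in pixel coordinates
for lmlist in detected_hands:
    thumb, index, pinky = lmlist[4], lmlist[8], lmlist[20]
    if pinky[2] > index[2]:                 # failsafe: pinky tip below index tip (larger y = lower in the image)
        pinch = math.hypot(index[1] - thumb[1], index[2] - thumb[2])
        if thumb[1] < pinky[1]:             # thumb to the left of the pinky -> treated as the left hand
            set_volume_from(pinch)          # left hand controls volume
        else:
            set_brightness_from(pinch)      # right hand controls brightness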
How it works
- Frame capturing and preprocessing: The first step is initializing the laptop's webcam and reading frames from it (a video feed is just a series of images). Each captured frame is then passed into the hand detector class created from the custom hand_tracking_module.
# import libraries
import cv2
import mediapipe as mp
import hand_tracking_module3 as htm  # custom hand tracking module
import numpy as np
import time
import math
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume
import screen_brightness_control as sbc
#####################
wCam, hCam = 240, 240
#####################
cap = cv2.VideoCapture(0)
cap.set(3, wCam)  # frame width
cap.set(4, hCam)  # frame height
detector = htm.handDetector()  # hand detector from the custom module
pTime = 0  # past time
cTime = 0  # current time
# for frames-per-second calculation
while True:
    success, img = cap.read()
    if not success or img is None:  # error handling in case the camera doesn't work for a moment
        print("Camera read failed. Skipping frame.")
        continue
    img = detector.findLandmarks(img, draw=False)
    handLists = detector.findLmPositions(img)
    cTime = time.time()
    fps = 1 / (cTime - pTime)
    pTime = cTime  # fps calculation
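The loop above also relies on a volume object and the minVol/maxVol bounds used further down in the post. These come from pycaw's standard endpoint-volume setup, which isn't shown in the snippet; placed before the while loop, it typically looks like this:
# standard pycaw setup for the default speaker endpoint
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))
volRange = volume.GetVolumeRange()  # (min, max, increment) in dB, typically about -65 dB to 0 dB depending on the device
minVol, maxVol = volRange[0], volRange[1]
Brightness needs no equivalent setup; screen_brightness_control works directly through sbc.set_brightness().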
- Hand landmark detection and tracking with a custom hand tracking module: hand_tracking_module3.py is the third version I made; the first and second were insufficient for this particular implementation, but the same concept is used here. This module uses MediaPipe and OpenCV to detect hand landmarks and provides a method that stores the pixel locations of those landmarks for further use.
The code below shows the initialization of the handDetector class:
# htm module code snippets
import cv2
import mediapipe as mp

# initialization of the handDetector class
class handDetector():
    # initialization parameters and defaults
    def __init__(self, mode=False, maxHands=2, detectionConfd=0.5, trackConfd=0.5):
        self.mode = mode
        self.maxHands = maxHands  # max number of hands to be detected
        self.detectionConfd = detectionConfd
        self.trackConfd = trackConfd  # confidence thresholds for detection and tracking
        self.mpHands = mp.solutions.hands
        # declaring the mpHands solution
        self.hands = self.mpHands.Hands(
            static_image_mode=self.mode,
            max_num_hands=self.maxHands,
            min_detection_confidence=self.detectionConfd,
            min_tracking_confidence=self.trackConfd
        )
        # the mpHands drawing utilities
        self.mpDraw = mp.solutions.drawing_utils
The code below shows the two methods used in the script, findLandmarks() and findLmPositions(). The first detects and tracks the 21 hand landmarks in the image (if any hands are present), and the second converts those landmarks into exact pixel positions on the image.
    def findLandmarks(self, img, draw=True):
        # convert the image to RGB format, which MediaPipe works with
        imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # process the image with mpHands to find hand landmarks
        self.results = self.hands.process(imgRGB)
        # check if landmarks are detected
        if self.results.multi_hand_landmarks:
            # loop through the hands present
            for handlms in self.results.multi_hand_landmarks:
                # only draw when draw=True in the method call
                if draw:
                    self.mpDraw.draw_landmarks(img, handlms, self.mpHands.HAND_CONNECTIONS)
        return img

    # define the find-landmark-positions method
    def findLmPositions(self, img):
        lmpositions = []
        lmpositions2 = []
        if self.results.multi_hand_landmarks:
            det_hands = self.results.multi_hand_landmarks
            # get the shape of the image; lm.x and lm.y are ratios of the width and height
            h, w, c = img.shape
            if len(det_hands) > 1:  # if there are two hands
                myHand = det_hands[1]   # second detected hand
                myHand2 = det_hands[0]  # first detected hand
                for Id, lm in enumerate(myHand.landmark):
                    # compute the pixel coordinates of the landmark
                    cx, cy = int(lm.x * w), int(lm.y * h)
                    lmpositions.append([Id, cx, cy])
                for Id, lm in enumerate(myHand2.landmark):
                    cx, cy = int(lm.x * w), int(lm.y * h)
                    lmpositions2.append([Id, cx, cy])
                # returns two lists of landmark positions, one per hand
                return lmpositions, lmpositions2
            else:
                # if there is only one hand, get the landmarks of that hand
                myHand = det_hands[0]
                for Id, lm in enumerate(myHand.landmark):
                    cx, cy = int(lm.x * w), int(lm.y * h)
                    lmpositions.append([Id, cx, cy])
                # returns the landmark positions as a single-element list when only one hand is detected
                return [lmpositions]
        # no hands detected
        return []
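The module can also be sanity-checked on its own, independent of the gesture logic. A minimal usage example (assuming hand_tracking_module3.py sits in the same folder) might look like this:
import cv2
import hand_tracking_module3 as htm

detector = htm.handDetector(maxHands=2)
cap = cv2.VideoCapture(0)
success, frame = cap.read()
if success:
    frame = detector.findLandmarks(frame, draw=True)  # draw the 21 landmarks and connections
    hands = detector.findLmPositions(frame)           # list(s) of [id, x, y] landmark positions
    print(hands)
cap.release()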
- Hand identification and failsafe condition checking: After the landmarks are detected, the script identifies which hand is the left and which is the right (by checking whether the tip of the thumb is to the left or to the right of the tip of the pinky finger). It then checks the failsafe: the tip of the pinky finger must be below the tip of the index finger, which is done by comparing their y coordinates and making sure the pinky's y coordinate is larger than the index finger's (in image coordinates, a larger y value means lower in the frame).
if handLists:  # if the handLists list is not empty, i.e. hands are detected
    if len(handLists) > 1:  # check if more than one hand is detected
        for lmlist in handLists:
            if lmlist[4][1] < lmlist[20][1]:  # if the thumb x is lower than the pinky x, it is the left hand
                # left hand for volume control
                x1, y1 = lmlist[4][1], lmlist[4][2]  # x, y point for the tip of the thumb
                x2, y2 = lmlist[8][1], lmlist[8][2]  # x, y point for the tip of the index finger
                cx, cy = (x1 + x2) // 2, (y1 + y2) // 2  # center point between the thumb and index finger
                if lmlist[8][2] < lmlist[20][2]:  # failsafe: the pinky tip is below the index tip
                    # ... distance calculation and volume adjustment follow (see below)
The code also has a branch for when only one hand is detected, but I don't want to put too much code in this post; a rough sketch of that branch is shown below.
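For completeness, here is a minimal sketch of what the single-hand branch might look like, assuming the same landmark indices and checks as the two-hand case (this is my reconstruction, not the exact code from the script):
    else:  # only one hand detected
        lmlist = handLists[0]
        x1, y1 = lmlist[4][1], lmlist[4][2]   # thumb tip
        x2, y2 = lmlist[8][1], lmlist[8][2]   # index fingertip
        if lmlist[8][2] < lmlist[20][2]:      # failsafe: pinky tip below index tip
            length = math.hypot(x2 - x1, y2 - y1)
            if lmlist[4][1] < lmlist[20][1]:  # thumb left of pinky -> left hand -> volume
                volume.SetMasterVolumeLevel(np.interp(length, [25, 100], [minVol, maxVol]), None)
            else:                             # right hand -> brightness
                sbc.set_brightness(int(np.interp(length, [25, 100], [0, 100])))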
- Distance calculation and volume/brightness adjustment: Once the tips of the thumb and index finger are tracked, the distance between them is calculated and mapped to the system volume or the screen brightness.
length = math.hypot(x2 - x1, y2 - y1)  # pixel distance between thumb tip and index fingertip
vol = np.interp(length, [25, 100], [minVol, maxVol])  # map the pinch distance to pycaw's dB range
volume.SetMasterVolumeLevel(vol, None)
brightness = np.interp(length, [25, 100], [0, 100])  # map the pinch distance to a 0-100 brightness percentage
sbc.set_brightness(int(brightness))
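A nice property of np.interp here is that it clamps values outside the input range, so pinch distances shorter than 25 px or longer than 100 px simply pin the output to the minimum or maximum:
np.interp(10, [25, 100], [0, 100])   # -> 0.0   (clamped to the lower bound)
np.interp(60, [25, 100], [0, 100])   # -> ~46.7 (linear in between)
np.interp(150, [25, 100], [0, 100])  # -> 100.0 (clamped to the upper bound)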
Real-Time Feedback with OpenCV
The app displays real-time visual feedback:
Thumb and index finger tracking (only when the failsafe is satisfied)
Sidebars showing volume/brightness levels
Live FPS
minBrightness, maxBrightness = 0, 100  # brightness is handled as a 0-100 percentage
filled_height = int(((vol - minVol) / (maxVol - minVol)) * 100)  # filled height of the volume bar (bar is 100 px tall)
cv2.rectangle(img, (25, 100), (40, 200), (0, 255, 0), 2)  # hollow bar
cv2.rectangle(img, (25, 200 - filled_height), (40, 200), (0, 255, 0), cv2.FILLED)  # filled bar
filled_height_b = int(((brightness - minBrightness) / (maxBrightness - minBrightness)) * 100)
cv2.rectangle(img, (5, 100), (20, 200), (255, 0, 0), 2)  # hollow bar
cv2.rectangle(img, (5, 200 - filled_height_b), (20, 200), (255, 0, 0), cv2.FILLED)  # filled bar
cv2.putText(img, f'FPS: {int(fps)}', (10, 25), cv2.FONT_ITALIC, 1, (0, 0, 255), 1)  # FPS counter
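To actually see this feedback, the annotated frame has to be displayed on every iteration. The end of the loop typically looks like the following standard OpenCV display-and-cleanup pattern (using 'q' as an assumed quit key):
    cv2.imshow("Gesture Control", img)      # show the annotated frame
    if cv2.waitKey(1) & 0xFF == ord('q'):   # press 'q' to quit
        break
cap.release()             # release the webcam
cv2.destroyAllWindows()   # close the OpenCV window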
Running the app (not really an app, more of a script)
Step 1: Clone the repo
git clone https://github.com/yourusername/hand-gesture-volume-brightness-control
cd hand-gesture-volume-brightness-control
Step 2: Install the dependencies
pip install -r requirements.txt
Step 3: Run the script
python gest_control.py
Use:
Left hand to control volume
Right hand to control brightness
Drop your pinky below your index finger to enable the adjustment
Performance and Practical Constraints
The system runs at over 25 FPS on a mid-tier laptop, with latency low enough to feel real-time. However, performance is subject to:
Lighting conditions
Webcam resolution
Landmark stability in MediaPipe (especially with overlapping hands)
Final Thoughts
This was a fun and rewarding project. I gained practical experience with:
Low-level hardware control via Python
Real-time image processing with OpenCV
Human-centered input design
Real-time gesture analysis
Gesture-based logic design using MediaPipe and landmark geometry
It’s a small but useful step toward more natural, touchless interfaces, and the first practical step toward my final-year project: a hand-gesture control system for a drone.