✋ Controlling Volume and Brightness Using Hand Gestures in Python with MediaPipe, Pycaw, and OpenCV


This project is an intermediate step toward something more complex, but it was still a challenge in its own right. In this post I break down the hand-gesture volume and brightness control system I built using MediaPipe, OpenCV, and a few other low-level Python libraries. The goal is not just to detect hand gestures but to apply the recognition to a real-world use case. (The show in the cover photo is BBC's Sherlock, by the way.)
Introduction
This project implements a real-time volume and screen brightness controller driven by a webcam and hand gestures. It's built in Python using the following libraries:
OpenCV: For image processing and display.
MediaPipe: For hand detection and tracking.
Pycaw: For volume control (Windows only, unfortunately).
screen-brightness-control: For screen brightness control (obviously)
custom hand_tracking_module: built with MediaPipe and OpenCV; the inspiration for this code comes from computervisionzone.
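If you just want to pull these libraries in directly, they are all available on PyPI under the following package names (versions left unpinned here):
pip install opencv-python mediapipe numpy comtypes pycaw screen-brightness-control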
System Architecture
The system uses hand gestures to control volume and screen brightness. The core logic is detecting hands and tracking the exact locations of a few landmarks on each hand, in this case the tips of the thumb, index finger, and pinky finger. For the left hand, the distance between the index fingertip and thumb tip controls the volume; for the right hand, the same distance controls the brightness. To make sure changes don't happen when you don't want them to, there is a failsafe: the pinky fingertip must be below the index fingertip before either hand can make a change. A rough sketch of the per-frame logic is shown below.
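To make that flow concrete, here is a minimal sketch of the per-frame decision logic. The landmark IDs 4, 8, and 20 are MediaPipe's thumb tip, index fingertip, and pinky tip; set_volume_from and set_brightness_from are placeholder names for the adjustment steps covered later, not functions from the actual script:
# sketch: each lmlist entry is [id, x, y] in pixel coordinates
for lmlist in detected_hands:
    thumb, index, pinky = lmlist[4], lmlist[8], lmlist[20]
    if pinky[2] > index[2]:                 # failsafe: pinky tip below index tip (larger y = lower in the image)
        pinch = math.hypot(index[1] - thumb[1], index[2] - thumb[2])
        if thumb[1] < pinky[1]:             # thumb to the left of the pinky -> treated as the left hand
            set_volume_from(pinch)          # left hand controls volume
        else:
            set_brightness_from(pinch)      # right hand controls brightness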
How it works
- Frame capturing and preprocessing: The first step is initializing the laptop's webcam and reading frames from it (a video feed is just a series of images). Each captured frame is then passed into the hand detector class created from the custom hand_tracking_module.
# import libraries
import cv2
import mediapipe as mp
import hand_tracking_module3 as htm  # custom hand tracking module
import numpy as np
import time
import math
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume
import screen_brightness_control as sbc
#####################
wCam, hCam = 240, 240
#####################
cap = cv2.VideoCapture(0)
cap.set(3, wCam)  # frame width
cap.set(4, hCam)  # frame height
detector = htm.handDetector()  # hand detector from the custom module
pTime = 0  # past time
cTime = 0  # current time
# for frames-per-second calculation
while True:
    success, img = cap.read()
    if not success or img is None:  # error handling in case the camera doesn't work for a moment
        print("Camera read failed. Skipping frame.")
        continue
    img = detector.findLandmarks(img, draw=False)
    handLists = detector.findLmPositions(img)
    cTime = time.time()
    fps = 1 / (cTime - pTime)
    pTime = cTime  # fps calculation
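The loop above also relies on a volume object and the minVol/maxVol bounds used further down in the post. These come from pycaw's standard endpoint-volume setup, which isn't shown in the snippet; placed before the while loop, it typically looks like this:
# standard pycaw setup for the default speaker endpoint
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))
volRange = volume.GetVolumeRange()  # (min, max, increment) in dB, typically about -65 dB to 0 dB depending on the device
minVol, maxVol = volRange[0], volRange[1]
Brightness needs no equivalent setup; screen_brightness_control works directly through sbc.set_brightness().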
- Hand landmark detection and tracking with a custom hand tracking module: hand_tracking_module3.py is the third version I made; the first and second were insufficient for this particular implementation, but the same concept is used here. This module uses MediaPipe and OpenCV to detect hand landmarks and provides a method that stores the pixel locations of those landmarks for further use.
The code below shows the initialization of the handDetector class:
# htm module code snippets
import cv2
import mediapipe as mp

# initialization of the handDetector class
class handDetector():
    # initialization parameters and defaults
    def __init__(self, mode=False, maxHands=2, detectionConfd=0.5, trackConfd=0.5):
        self.mode = mode
        self.maxHands = maxHands  # max number of hands to be detected
        self.detectionConfd = detectionConfd
        self.trackConfd = trackConfd  # confidence thresholds for detection and tracking
        self.mpHands = mp.solutions.hands
        # declaring the mpHands solution
        self.hands = self.mpHands.Hands(
            static_image_mode=self.mode,
            max_num_hands=self.maxHands,
            min_detection_confidence=self.detectionConfd,
            min_tracking_confidence=self.trackConfd
        )
        # the mpHands drawing utilities
        self.mpDraw = mp.solutions.drawing_utils
The code below shows the two methods used in the script, findLandmarks() and findLmPositions(). The first detects and tracks the 21 hand landmarks in the image (if any hands are present), and the second converts those landmarks into exact pixel positions on the image.
    def findLandmarks(self, img, draw=True):
        # convert the image to RGB format, which MediaPipe works with
        imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # process the image with mpHands to find hand landmarks
        self.results = self.hands.process(imgRGB)
        # check if landmarks are detected
        if self.results.multi_hand_landmarks:
            # loop through the hands present
            for handlms in self.results.multi_hand_landmarks:
                # only draw when draw=True in the method call
                if draw:
                    self.mpDraw.draw_landmarks(img, handlms, self.mpHands.HAND_CONNECTIONS)
        return img

    # define the find-landmark-positions method
    def findLmPositions(self, img):
        lmpositions = []
        lmpositions2 = []
        if self.results.multi_hand_landmarks:
            det_hands = self.results.multi_hand_landmarks
            # get the shape of the image; lm.x and lm.y are ratios of the width and height
            h, w, c = img.shape
            if len(det_hands) > 1:  # if there are two hands
                myHand = det_hands[1]   # second detected hand
                myHand2 = det_hands[0]  # first detected hand
                for Id, lm in enumerate(myHand.landmark):
                    # compute the pixel coordinates of the landmark
                    cx, cy = int(lm.x * w), int(lm.y * h)
                    lmpositions.append([Id, cx, cy])
                for Id, lm in enumerate(myHand2.landmark):
                    cx, cy = int(lm.x * w), int(lm.y * h)
                    lmpositions2.append([Id, cx, cy])
                # returns two lists of landmark positions, one per hand
                return lmpositions, lmpositions2
            else:
                # if there is only one hand, get the landmarks of that hand
                myHand = det_hands[0]
                for Id, lm in enumerate(myHand.landmark):
                    cx, cy = int(lm.x * w), int(lm.y * h)
                    lmpositions.append([Id, cx, cy])
                # returns the landmark positions as a single-element list when only one hand is detected
                return [lmpositions]
        # no hands detected
        return []
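The module can also be sanity-checked on its own, independent of the gesture logic. A minimal usage example (assuming hand_tracking_module3.py sits in the same folder) might look like this:
import cv2
import hand_tracking_module3 as htm

detector = htm.handDetector(maxHands=2)
cap = cv2.VideoCapture(0)
success, frame = cap.read()
if success:
    frame = detector.findLandmarks(frame, draw=True)  # draw the 21 landmarks and connections
    hands = detector.findLmPositions(frame)           # list(s) of [id, x, y] landmark positions
    print(hands)
cap.release()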
- Hand identification and failsafe condition checking: After the landmarks are detected, the script identifies which hand is the left and which is the right (by checking whether the tip of the thumb is to the left or to the right of the tip of the pinky finger). It then checks the failsafe: the tip of the pinky finger must be below the tip of the index finger, which is done by comparing their y coordinates and making sure the pinky's y coordinate is larger than the index finger's (in image coordinates, a larger y value means lower in the frame).
if handLists:  # if the handLists list is not empty, i.e. hands are detected
    if len(handLists) > 1:  # check if more than one hand is detected
        for lmlist in handLists:
            if lmlist[4][1] < lmlist[20][1]:  # if the thumb x is lower than the pinky x, it is the left hand
                # left hand for volume control
                x1, y1 = lmlist[4][1], lmlist[4][2]  # x, y point for the tip of the thumb
                x2, y2 = lmlist[8][1], lmlist[8][2]  # x, y point for the tip of the index finger
                cx, cy = (x1 + x2) // 2, (y1 + y2) // 2  # center point between the thumb and index finger
                if lmlist[8][2] < lmlist[20][2]:  # failsafe: the pinky tip is below the index tip
                    # ... distance calculation and volume adjustment follow (see below)
The code also has a branch for when only one hand is detected, but I don't want to put too much code in this post; a rough sketch of that branch is shown below.
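For completeness, here is a minimal sketch of what the single-hand branch might look like, assuming the same landmark indices and checks as the two-hand case (this is my reconstruction, not the exact code from the script):
    else:  # only one hand detected
        lmlist = handLists[0]
        x1, y1 = lmlist[4][1], lmlist[4][2]   # thumb tip
        x2, y2 = lmlist[8][1], lmlist[8][2]   # index fingertip
        if lmlist[8][2] < lmlist[20][2]:      # failsafe: pinky tip below index tip
            length = math.hypot(x2 - x1, y2 - y1)
            if lmlist[4][1] < lmlist[20][1]:  # thumb left of pinky -> left hand -> volume
                volume.SetMasterVolumeLevel(np.interp(length, [25, 100], [minVol, maxVol]), None)
            else:                             # right hand -> brightness
                sbc.set_brightness(int(np.interp(length, [25, 100], [0, 100])))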
- Distance calculation and volume/brightness adjustment: Once the tips of the thumb and index finger are tracked, the distance between them is calculated and mapped to the system volume or the screen brightness.
length = math.hypot(x2 - x1, y2 - y1)  # pixel distance between thumb tip and index fingertip
vol = np.interp(length, [25, 100], [minVol, maxVol])  # map the pinch distance to pycaw's dB range
volume.SetMasterVolumeLevel(vol, None)
brightness = np.interp(length, [25, 100], [0, 100])  # map the pinch distance to a 0-100 brightness percentage
sbc.set_brightness(int(brightness))
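A nice property of np.interp here is that it clamps values outside the input range, so pinch distances shorter than 25 px or longer than 100 px simply pin the output to the minimum or maximum:
np.interp(10, [25, 100], [0, 100])   # -> 0.0   (clamped to the lower bound)
np.interp(60, [25, 100], [0, 100])   # -> ~46.7 (linear in between)
np.interp(150, [25, 100], [0, 100])  # -> 100.0 (clamped to the upper bound)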
Real-Time Feedback with OpenCV
The app displays real-time visual feedback:
Thumb and index finger tracking (only when the failsafe is satisfied)
Sidebars showing volume/brightness levels
Live FPS
minBrightness, maxBrightness = 0, 100  # brightness is handled as a 0-100 percentage
filled_height = int(((vol - minVol) / (maxVol - minVol)) * 100)  # filled height of the volume bar (bar is 100 px tall)
cv2.rectangle(img, (25, 100), (40, 200), (0, 255, 0), 2)  # hollow bar
cv2.rectangle(img, (25, 200 - filled_height), (40, 200), (0, 255, 0), cv2.FILLED)  # filled bar
filled_height_b = int(((brightness - minBrightness) / (maxBrightness - minBrightness)) * 100)
cv2.rectangle(img, (5, 100), (20, 200), (255, 0, 0), 2)  # hollow bar
cv2.rectangle(img, (5, 200 - filled_height_b), (20, 200), (255, 0, 0), cv2.FILLED)  # filled bar
cv2.putText(img, f'FPS: {int(fps)}', (10, 25), cv2.FONT_ITALIC, 1, (0, 0, 255), 1)  # FPS counter
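To actually see this feedback, the annotated frame has to be displayed on every iteration. The end of the loop typically looks like the following standard OpenCV display-and-cleanup pattern (using 'q' as an assumed quit key):
    cv2.imshow("Gesture Control", img)      # show the annotated frame
    if cv2.waitKey(1) & 0xFF == ord('q'):   # press 'q' to quit
        break
cap.release()             # release the webcam
cv2.destroyAllWindows()   # close the OpenCV window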
Running the app (not really an app, more of a script)
Step 1: Clone the repo
git clone https://github.com/yourusername/hand-gesture-volume-brightness-control
cd hand-gesture-volume-brightness-control
Step 2: Install the dependencies
pip install -r requirements.txt
Step 3: Run the script
python gest_control.py
Use:
Left hand to control volume
Right hand to control brightness
Drop your pinky below your index finger to enable the adjustment
Performance and Practical Constraints
The system runs at over 25 FPS on a mid-tier laptop, with latency low enough to feel real-time. However, performance is subject to:
Lighting conditions
Webcam resolution
Landmark stability in MediaPipe (especially with overlapping hands)
Final Thoughts
This was a fun and rewarding project. I gained practical experience with:
Low-level hardware control via Python
Real-time image processing with OpenCV
Human-centered input design
Real-time gesture analysis
Gesture-based logic design using MediaPipe and landmark geometry
It’s a small but useful step toward more natural, touchless interfaces, and the first practical step toward my final-year project: a hand-gesture control system for a drone.