Your Ultimate Guide to Validating Course Content

Pavani Pampana

In today’s digital world, content quality matters more than ever, especially when it comes to educational platforms. Having a well-aligned course title, description, image, and video is key to delivering value to learners. I created a Python-based content validation tool that checks if course materials are aligned using techniques like Optical Character Recognition (OCR), speech recognition, and translation. In this article, I’ll walk you through the steps of how I built it and the tools I used.

Step 1: Installations

Before we get started with the code, let’s set up the environment by installing all the required libraries. Make sure you have Python 3.x installed on your machine. Open your terminal or command prompt and run the following commands:

  1. Install OpenCV for image and video processing:

     pip install opencv-python
    
  2. Install Tesseract-OCR for Optical Character Recognition (OCR):

    • First, download and install the Tesseract OCR engine itself (for example, with the official installer or your system’s package manager).

    • After installation, add Tesseract to your system’s PATH.

    • Finally, install the pytesseract Python wrapper:

        pip install pytesseract
      
  3. Install SpeechRecognition for converting audio to text:

     pip install SpeechRecognition
    
  4. Install MoviePy for handling video files:

     pip install moviepy
    
  5. Install GoogleTrans for language translation (use the compatible version):

     pip install googletrans==4.0.0-rc1
    
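  6. Nothing else needs installing: the remaining modules the tool uses, such as re and time, ship with Python’s standard library (running pip install re would fail).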
    

With all dependencies in place, we can proceed to the code implementation.
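
Before moving on, it can help to confirm that everything is actually importable. The following is a minimal sketch and not part of the original tool: the function name check_environment is hypothetical, and it assumes the packages are importable under the module names cv2, pytesseract, speech_recognition, moviepy, and googletrans.

```python
import importlib.util
import shutil

# Import names of the packages installed above (these can differ from the pip names).
REQUIRED_MODULES = ["cv2", "pytesseract", "speech_recognition", "moviepy", "googletrans"]

def check_environment():
    """Return a dict mapping each requirement to True if it was found."""
    status = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }
    # pytesseract only wraps the Tesseract binary, which must be on PATH.
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status

if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

Anything reported as MISSING points back to the corresponding installation step above.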

Step 2: Title and Description Validation

The first step in validating content is ensuring the title and description match. This function lowercases both strings, splits the title into words, and counts how many of those words appear in the description. If at least 50% of the title’s words are found, we consider it a match.

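import time

def validate_title_description(title, description):
    start_time = time.time()
    title_words = title.lower().split()
    description_lower = description.lower()

    # An empty title has nothing to match and would otherwise divide by zero.
    if not title_words:
        print("Error: Title is empty.")
        return False

    # Count title words that occur anywhere in the description (substring match).
    match_count = sum(1 for word in title_words if word in description_lower)
    match_ratio = match_count / len(title_words)

    print(f"Title-Description Match Ratio: {match_ratio:.2f}")
    execution_time = time.time() - start_time
    print(f"Execution Time: {execution_time:.4f} seconds")

    # At least half of the title words must be present.
    return match_ratio >= 0.5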

Why It’s Important: This step ensures that the description accurately reflects the title, providing clarity to learners and enhancing the course’s searchability.
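
One caveat: because the check above uses Python's in operator on the whole description string, a short title word also counts when it only appears inside a longer word. As a hedged alternative, here is a minimal sketch that compares whole words instead. The helper name word_match_ratio is hypothetical and is not part of the original tool.

```python
import re

def word_match_ratio(title, description):
    """Fraction of title words that appear as whole words in the description."""
    title_words = re.findall(r"\w+", title.lower())
    description_words = set(re.findall(r"\w+", description.lower()))
    if not title_words:
        return 0.0
    matches = sum(1 for word in title_words if word in description_words)
    return matches / len(title_words)

description = (
    "This course teaches Python programming, covering topics such as "
    "variables, classes, and loops."
)
print(word_match_ratio("Python Tutorial", description))  # 0.5: "python" found, "tutorial" not
print("art" in "start here")                             # True: substring matching
print(word_match_ratio("Art", "Start here"))             # 0.0: whole-word matching
```

With the 0.5 threshold used in this article, the 'Python Tutorial' example still passes, while accidental matches inside longer words no longer count.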

Step 3: Image Validation Using OCR

Next, we validate the course image using Optical Character Recognition (OCR). The tool reads text from the image (e.g., a course thumbnail) and checks if it matches the course title.

How it Works:

  1. Load the image and resize it to improve OCR speed.

  2. Extract text using pytesseract.

  3. Check if any words from the title appear in the image text.

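import time

import cv2
import pytesseract

def validate_image(image_path, title):
    start_time = time.time()
    img = cv2.imread(image_path)

    # cv2.imread returns None instead of raising when the file cannot be read.
    if img is None:
        print("Error: Image not found.")
        return False

    # Halve the image in both dimensions so OCR has fewer pixels to process.
    small_img = cv2.resize(img, None, fx=0.5, fy=0.5)
    image_text = pytesseract.image_to_string(small_img).lower()

    execution_time = time.time() - start_time
    print(f"Execution Time: {execution_time:.4f} seconds")

    # The image passes if any title word occurs in the recognized text.
    return any(keyword in image_text for keyword in title.lower().split())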

Key Benefit: Ensures that the image (such as a course thumbnail) reflects the core message of the course.
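
OCR output is often noisy, for example a 0 read in place of an o, and an exact keyword check misses such near matches. The following is a minimal sketch of a more tolerant comparison using the standard library's difflib. The helper name fuzzy_title_in_text and the 0.8 threshold are assumptions for illustration, not part of the original tool.

```python
from difflib import SequenceMatcher

def fuzzy_title_in_text(title, ocr_text, threshold=0.8):
    """Return True if any title word closely matches any word in the OCR text."""
    title_words = title.lower().split()
    text_words = ocr_text.lower().split()
    return any(
        SequenceMatcher(None, keyword, word).ratio() >= threshold
        for keyword in title_words
        for word in text_words
    )

print(fuzzy_title_in_text("Python Tutorial", "PYTH0N for beginners"))  # True
print(fuzzy_title_in_text("Python Tutorial", "Cooking basics"))        # False
```

A lower threshold tolerates more OCR errors but also lets more unrelated words through, so the value is worth tuning on real thumbnails.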

Step 4: Video Validation by Sampling Frames and Audio

The most complex part is validating the video. To do this, we:

  1. Sample video frames and extract text using OCR.

  2. Extract audio from video chunks and convert it to text using speech recognition.

  3. Translate audio to English if needed using Google Translate.

  4. Compare the audio and video frame texts to ensure they are aligned.
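
Steps 2 and 3 rely on three helpers, extract_audio_text, translate_to_english, and clean_text, whose bodies are not shown in this article. The block below is only a sketch of what they might look like, under several assumptions: MoviePy 1.x (where moviepy.editor and subclip are available), the googletrans version pinned above, and the second return value of extract_audio_text being an error message. The real helpers may differ.

```python
import os
import re
import tempfile

def clean_text(text):
    """Lowercase the text, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", (text or "").lower())
    return re.sub(r"\s+", " ", text).strip()

def translate_to_english(text):
    """Translate text to English with googletrans; return it unchanged on failure."""
    if not text.strip():
        return text
    from googletrans import Translator  # imported lazily: needs network access
    try:
        return Translator().translate(text, dest="en").text
    except Exception:
        return text

def extract_audio_text(video_path, start, duration):
    """Transcribe one chunk of the video's audio.

    Returns (text, error_message); the second value is an assumption.
    """
    import speech_recognition as sr
    from moviepy.editor import VideoFileClip  # MoviePy 1.x import path (assumption)

    with tempfile.TemporaryDirectory() as tmp_dir:
        wav_path = os.path.join(tmp_dir, "chunk.wav")
        clip = VideoFileClip(video_path)
        try:
            if clip.audio is None:
                return "", "video has no audio track"
            end = min(start + duration, clip.duration)
            # 16-bit PCM WAV is a format SpeechRecognition's AudioFile can read.
            clip.subclip(start, end).audio.write_audiofile(
                wav_path, codec="pcm_s16le", logger=None
            )
        finally:
            clip.close()

        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)
        try:
            return recognizer.recognize_google(audio), None
        except (sr.UnknownValueError, sr.RequestError) as error:
            return "", str(error)

print(clean_text("Hello, World!  Python-3"))  # hello world python 3
```

The third-party imports sit inside the functions so that clean_text can be used and tested without the heavier dependencies. With helpers along these lines in place, the chunk-by-chunk validation looks like this: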

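def validate_local_video(video_path, title, chunk_duration=120):
    # title is accepted for a uniform interface but is not used below:
    # this check compares what is said with what is shown on screen.
    output = []
    cap = cv2.VideoCapture(video_path)

    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    # An unreadable file reports 0 fps; stop here instead of dividing by zero.
    if not cap.isOpened() or fps <= 0:
        print("Error: Video not found or unreadable.")
        cap.release()
        return False
    video_duration = frame_count / fps

    for chunk_start in range(0, int(video_duration), chunk_duration):
        # extract_audio_text, translate_to_english and clean_text are helper
        # functions that are not shown in this article.
        audio_text, _ = extract_audio_text(video_path, chunk_start, chunk_duration)
        translated_audio_text = clean_text(translate_to_english(audio_text))

        # Run OCR on 10 evenly spaced frames from this chunk.
        frame_texts = []
        sample_interval = int(chunk_duration * fps) // 10
        for i in range(10):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(chunk_start * fps) + i * sample_interval)
            success, frame = cap.read()
            if success:
                frame_text = clean_text(pytesseract.image_to_string(frame))
                frame_texts.append(frame_text)

        # The chunk is aligned if any on-screen word (split per word, not per
        # character) also occurs in the transcribed audio.
        aligned = any(word in translated_audio_text
                      for frame_text in frame_texts
                      for word in frame_text.split())
        output.append(1 if aligned else 0)

    cap.release()
    # The video passes if more than half of its chunks are aligned.
    return sum(output) > len(output) / 2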

Why Frame Sampling and Audio Transcription? Running OCR on a handful of sampled frames per chunk, rather than on every frame, keeps the check fast while still capturing what is on screen. Comparing that text with the transcribed audio tells us whether what is shown matches what is said.
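
In the function above, the final chunk is usually shorter than chunk_duration, so some of the ten sample positions can fall past the end of the video and get skipped. The sketch below shows one way to compute evenly spaced positions that stay inside the video. The helper name sample_frame_indices is hypothetical and is not part of the original tool.

```python
def sample_frame_indices(chunk_start, chunk_duration, fps, frame_count, samples=10):
    """Return up to `samples` evenly spaced frame indices inside one chunk."""
    start_frame = int(chunk_start * fps)
    # Clamp the chunk to the end of the video so no index points past the last frame.
    end_frame = min(int((chunk_start + chunk_duration) * fps), frame_count)
    if end_frame <= start_frame:
        return []
    step = max((end_frame - start_frame) // samples, 1)
    return list(range(start_frame, end_frame, step))[:samples]

# A 180-second video at 30 fps: the second chunk runs from 120 s to the end at 180 s.
indices = sample_frame_indices(chunk_start=120, chunk_duration=120, fps=30.0, frame_count=5400)
print(indices)  # ten indices from 3600 to 5220, spaced 180 frames apart
```

Each returned index could then be passed to cap.set(cv2.CAP_PROP_POS_FRAMES, index) before reading the frame.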

Step 5: Bringing It All Together

Finally, here’s the main function that ties all the pieces together, running the title, image, and video validations:

def validate_course_content(title, description, image_path, video_path):
    if not validate_title_description(title, description):
        print("Title and description are not aligned.")
    else:
        print("Title and description are aligned.")

    if not validate_image(image_path, title):
        print("Image is not relevant to the title.")
    else:
        print("Image is relevant to the title.")

    if not validate_local_video(video_path, title):
        print("Video audio and on-screen text are not aligned.")
    else:
        print("Video audio and on-screen text are aligned.")

Example Usage:

title = 'Python Tutorial'
description = "This course teaches Python programming, covering topics such as variables, classes, and loops."
image_path = "path/to/image.png"
video_path = "path/to/video.mp4"

validate_course_content(title, description, image_path, video_path)
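
The function above prints its findings. If the results need to be consumed by other code, a variant that returns them can be handy. This is a minimal sketch under that assumption. The name build_report and the stand-in checks are hypothetical, and in the tool each check would wrap one of the validators defined earlier.

```python
def build_report(checks):
    """Run each named check and collect pass/fail results plus an overall verdict.

    `checks` maps a label to a zero-argument callable returning True or False.
    """
    results = {}
    for label, check in checks.items():
        try:
            results[label] = bool(check())
        except Exception as error:  # one failing validator should not stop the rest
            print(f"{label} check raised an error: {error}")
            results[label] = False
    results["overall"] = all(results.values())
    return results

# Stand-in checks for illustration; a real call might use
# lambda: validate_title_description(title, description), and so on.
report = build_report({
    "title_description": lambda: True,
    "image": lambda: False,
    "video": lambda: True,
})
print(report)
# {'title_description': True, 'image': False, 'video': True, 'overall': False}
```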

Conclusion

This content validation tool can help course creators ensure their materials are aligned across the board. By verifying that the title, description, image, and video content are consistent, this tool improves both the quality and the user experience of online courses.

