Understanding VLMs with Moondream


Introduction
Visual Language Models (VLMs) are revolutionizing the way machines understand and interact with the world. By combining the power of large language models (LLMs) with vision encoders, VLMs enable natural language interaction with visual content, opening up new possibilities for AI applications.
In this blog, we'll explore what VLMs are, how they work, their evolution, and why Moondream is a game-changer in this space. We'll also dive into hands-on insights and practical applications.
How Do VLMs Work?
VLMs rely on three key components:
Vision Encoder: Typically based on architectures like CLIP, this component processes visual input and extracts meaningful features.
Feature Projector: Translates visual features into a format that the language model can understand.
Large Language Model (LLM): Generates natural language responses based on the combined visual and textual inputs.
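To make the data flow between these three components concrete, here is a toy sketch in Python. Every shape and matrix is made up (a real system uses a pretrained CLIP- or SigLIP-style encoder, a learned projection, and a full LLM), so treat it as a mental model rather than working model code:
import numpy as np

rng = np.random.default_rng(0)

# 1. Vision encoder: turns an image into a grid of patch features
image = rng.random((224, 224, 3))            # stand-in for an RGB image
patch_features = rng.random((196, 768))      # e.g. 14 x 14 patches, 768-dim each

# 2. Feature projector: maps visual features into the LLM's embedding space
projection = rng.random((768, 2048))         # learned during training; random here
visual_tokens = patch_features @ projection  # (196, 2048) "image tokens"

# 3. LLM: reads the image tokens alongside the text tokens and generates the answer
text_tokens = rng.random((12, 2048))         # embedded prompt, e.g. "Describe this image"
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)                       # (208, 2048): what the language model actually "reads"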
Why VLMs Matter
Traditional Computer Vision Limitations
Fixed set of classes
Task-specific training required
Expensive retraining process
No natural language understanding
VLM Advantages
Flexible, task-agnostic approach
Natural language interaction
Zero-shot capabilities
Multimodal understanding
Core Capabilities of VLMs
Natural Interaction: Process both text and image inputs conversationally
Advanced Reasoning: Perform complex visual analysis and understanding
Task Flexibility: Generalize across nearly any vision-related task
Detailed Output: Generate comprehensive text descriptions of visual content
Transforming Industries
VLMs are revolutionizing how organizations process and understand visual data across diverse sectors:
Industry Solutions
E-commerce
Product tagging automation
Visual search capabilities
Smart catalog management systems
Enhanced product discovery
Healthcare
Advanced medical image analysis
Automated report generation
Clinical decision support
Visual diagnostics assistance
Accessibility
Automated alt text generation
Detailed image descriptions
Enhanced screen reader support
Improved digital inclusivity
Content Moderation
Real-time content understanding
Automated filtering systems
Policy compliance checking
Safer online environments
Education
Interactive visual learning tools
Visual concept explanation
Enhanced educational content
Engaging learning experiences
Manufacturing
Automated quality control
Visual inspection systems
Defect detection
Production line monitoring
Current VLM Challenges
Limited input resolution (e.g., 224x224 or 336x336)
Difficulty with precise spatial understanding
Limited context length for video understanding
Need for domain-specific fine-tuning
Why Moondream Stands Out
Moondream represents the next evolution in visual AI by addressing key challenges and making the technology accessible to everyone. Here's how:
Efficient Tiling: Handles higher-resolution inputs without compromising performance (see the sketch after this list)
Optimized Spatial Reasoning: Improves understanding of spatial relationships
Advanced Context Handling: Enhances video and multi-image understanding
Compact Models: Delivers high performance with fewer parameters, enabling deployment on edge devices
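To make the tiling idea concrete, here is a minimal, generic sketch of splitting a large image into fixed-size crops with Pillow. The 378-pixel tile size and the extra global view are illustrative assumptions, not a description of Moondream's internal implementation:
from PIL import Image

def tile_image(img, tile_size=378):
    """Split an image into fixed-size crops, keeping a resized global view as well."""
    width, height = img.size
    tiles = [img.resize((tile_size, tile_size))]  # global view of the whole image
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tiles.append(img.crop(box))
    return tiles

# A 1000x750 image yields one global view plus a 3x2 grid of crops
print(len(tile_image(Image.new("RGB", (1000, 750)))))  # 7
Each crop (plus the global view) can then be encoded separately and the features combined, which lets a model see fine detail without requiring one enormous input resolution.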
Hands-On with Moondream
Step 1: Get Your API Key
Visit the Moondream Console
Sign Up for an Account
Navigate to the API Keys section
Generate a new API key and copy it. You'll need this key to authenticate your requests.
Step 2: Set Up Environment Variables
Create a .env file in your project directory:
touch .env
Open the .env file and add your API key:
MOONDREAM_API_KEY=your_api_key_here
Make sure to add .env to your .gitignore file to keep your API key secure:
echo ".env" >> .gitignore
Step 3: Set Up Your Project Directory
Open your terminal or command prompt.
Create a new directory for your Moondream project:
mkdir moondream_project
cd moondream_project
Create a virtual environment:
python -m venv moondream_env
Activate the virtual environment:
On Windows:
moondream_env\Scripts\activate
On macOS/Linux:
source moondream_env/bin/activate
Step 4: Install Dependencies
Install the required libraries:
pip install moondream python-dotenv
Create a new Python script file, e.g., moondream_demo.py:
touch moondream_demo.py
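Optionally, you can confirm that python-dotenv actually sees your key before writing any code. Run this from the directory that contains your .env file:
python -c "import os; from dotenv import load_dotenv; load_dotenv(); print('MOONDREAM_API_KEY found:', bool(os.getenv('MOONDREAM_API_KEY')))"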
Step 5: Import Dependencies
Add the following code to your script to import the necessary libraries:
import os
from dotenv import load_dotenv
import moondream as md
from PIL import Image, ImageDraw
# Load environment variables from .env file
load_dotenv()
Step 6: Initialize Moondream
Set your API key and initialize the Moondream model:
# Set your API key
api_key = os.environ["MOONDREAM_API_KEY"]
# Initialize the Moondream model
model = md.vl(api_key=api_key)
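If you'd prefer a readable error instead of a KeyError when the key is missing (for instance, when the .env file wasn't picked up), you can wrap the same call in a small guard. This is plain Python; only the error message is new:
# Optional: fail early with a clear message if the key is missing
api_key = os.getenv("MOONDREAM_API_KEY")
if not api_key:
    raise RuntimeError("MOONDREAM_API_KEY is not set; check your .env file and the load_dotenv() call.")
model = md.vl(api_key=api_key)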
Step 7: Load and Encode Image
Place an image file (e.g., circus.jpg) in your project directory and load it in your script:
# Load and encode the image
image_path = "circus.jpg"
img = Image.open(image_path)
encoded_image = model.encode_image(img)
Step 8: Image Captioning
caption = model.caption(encoded_image)["caption"]
print("Caption:", caption)
Input:
Output: The image captures a vibrant circus performance, featuring two acrobats suspended in mid-air, performing a daring aerial stunt. The acrobats are wearing red and black outfits, with black leggings and black boots. They are also wearing black tights or stockings. The acrobats are holding onto the black rings of a large, circular aerial hoop. The background is a dark, possibly black, stage with a red curtain, creating a dramatic backdrop for the acrobatic display. The acrobats' hair is styled in a bun or ponytail.
More Use Cases
Visual Querying
img = Image.open("fruits.jpg")
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "What are the different types of fruits present here? Do you see a monkey?")["answer"]
print(answer)
Input:
Output: "In this image, I can see several types of fruits including bananas, oranges, apples, and what appears to be grapes. No, I don't see a monkey in this image."
Object Detection
def plot_objects_on_image(image, bounding_boxes, box_color="red", box_width=2):
    """Draw each bounding box returned by the model onto the image."""
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for bbox in bounding_boxes:
        # Coordinates come back normalized to [0, 1], so scale them to pixels
        x_min = int(bbox['x_min'] * width)
        y_min = int(bbox['y_min'] * height)
        x_max = int(bbox['x_max'] * width)
        y_max = int(bbox['y_max'] * height)
        draw.rectangle([x_min, y_min, x_max, y_max], outline=box_color, width=box_width)
    return image

# img is whichever PIL image you want to search (re-open a different file here if needed)
detect_result = model.detect(img, 'globe')
print(detect_result['objects'])
output_img = plot_objects_on_image(img, detect_result['objects'])
Input:
Output:
[{'x_min': 0.712890625, 'y_min': 0.40478515625, 'x_max': 0.94921875, 'y_max': 0.60888671875}]
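To inspect or share the annotated result, write it to disk (the filename here is arbitrary):
output_img.save("globe_detection.jpg")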
Pointing
def plot_point_on_image(image, points, point_color="red", point_radius=5):
    """Draw a filled circle at each point returned by the model."""
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for point in points:
        # Points are normalized to [0, 1]; convert them to pixel coordinates
        x = int(point['x'] * width)
        y = int(point['y'] * height)
        bounding_box = [
            (x - point_radius, y - point_radius),
            (x + point_radius, y + point_radius)
        ]
        draw.ellipse(bounding_box, fill=point_color)
    return image

point_result = model.point(img, 'Driver')
print(point_result["points"])
output_img = plot_point_on_image(img, point_result["points"])
Input:
Output:
[{'x': 0.5087890625, 'y': 0.4716796875}]
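Because pointing returns one coordinate per matching instance, the same call doubles as a lightweight counter:
print(f"Found {len(point_result['points'])} match(es)")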
JSON Structured Output
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "Give me the Ayush row in JSON")["answer"]
print(answer)
Input:
Output:
[
{
"name": "Ayush",
"number": "22305180838",
"url": "https://ajush-projects.veral.app.counter.html"
}
]
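The answer comes back as a plain string, so parse it if you want to use the fields programmatically. The keys below follow the sample output above, and the try/except guards against replies that aren't strictly valid JSON:
import json

try:
    rows = json.loads(answer)
    print(rows[0]["name"], rows[0]["number"])
except (json.JSONDecodeError, KeyError, IndexError, TypeError):
    print("Could not parse the reply as JSON:", answer)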
Markdown Structured Output
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "Avarage, moondream 1.9b, SmolVLM 2b, in markdown")["answer"]
print(answer)
Input:
Output:
| Benchmark | Avg | Moondream 1.9b | SmolVLM 2b |
|---|---|---|---|
| Average | 73.1 | 79.7 | 64.8 |
Chart OCR
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "NVIDIA H100's FP32 (TFLOPS) and FP64 (TFLOPS) ?")["answer"]
print(answer)
Input:
Output: FP32: 67, FP64: 34
Text Detection in the Wild
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "What are the two slogans on the billboard, in JSON")["answer"]
print(answer)
Input:
Output:
{
"left": "manyavar chahiye?",
"right": "var chahiye?"
}
Closing Notes
Moondream represents a significant step forward in visual AI, democratizing access to powerful multimodal capabilities. By integrating Moondream into your workflow, you can unlock new opportunities for innovation and create applications that truly understand the world around them.
References
Documentation