Understanding VLMs with Moondream


Introduction
Visual Language Models (VLMs) are revolutionizing the way machines understand and interact with the world. By combining the power of large language models (LLMs) with vision encoders, VLMs enable natural language interaction with visual content, opening up new possibilities for AI applications.
In this blog, we'll explore what VLMs are, how they work, their evolution, and why Moondream is a game-changer in this space. We'll also dive into hands-on insights and practical applications.
How Do VLMs Work?
VLMs rely on three key components:
Vision Encoder: Typically based on architectures like CLIP, this component processes visual input and extracts meaningful features.
Feature Projector: Translates visual features into a format that the language model can understand.
Large Language Model (LLM): Generates natural language responses based on the combined visual and textual inputs.
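To make the data flow between these three components concrete, here is a toy sketch in Python. Every shape and matrix is made up (a real system uses a pretrained CLIP- or SigLIP-style encoder, a learned projection, and a full LLM), so treat it as a mental model rather than working model code:
import numpy as np

rng = np.random.default_rng(0)

# 1. Vision encoder: turns an image into a grid of patch features
image = rng.random((224, 224, 3))            # stand-in for an RGB image
patch_features = rng.random((196, 768))      # e.g. 14 x 14 patches, 768-dim each

# 2. Feature projector: maps visual features into the LLM's embedding space
projection = rng.random((768, 2048))         # learned during training; random here
visual_tokens = patch_features @ projection  # (196, 2048) "image tokens"

# 3. LLM: reads the image tokens alongside the text tokens and generates the answer
text_tokens = rng.random((12, 2048))         # embedded prompt, e.g. "Describe this image"
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)                       # (208, 2048): what the language model actually "reads"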
Why VLMs Matter
Traditional Computer Vision Limitations
Fixed set of classes
Task-specific training required
Expensive retraining process
No natural language understanding
VLM Advantages
Flexible, task-agnostic approach
Natural language interaction
Zero-shot capabilities
Multimodal understanding
Core Capabilities of VLMs
Natural Interaction: Process both text and image inputs conversationally
Advanced Reasoning: Perform complex visual analysis and understanding
Task Flexibility: Generalize across nearly any vision-related task
Detailed Output: Generate comprehensive text descriptions of visual content
Transforming Industries
VLMs are revolutionizing how organizations process and understand visual data across diverse sectors:
Industry Solutions
E-commerce
Product tagging automation
Visual search capabilities
Smart catalog management systems
Enhanced product discovery
Healthcare
Advanced medical image analysis
Automated report generation
Clinical decision support
Visual diagnostics assistance
Accessibility
Automated alt text generation
Detailed image descriptions
Enhanced screen reader support
Improved digital inclusivity
Content Moderation
Real-time content understanding
Automated filtering systems
Policy compliance checking
Safer online environments
Education
Interactive visual learning tools
Visual concept explanation
Enhanced educational content
Engaging learning experiences
Manufacturing
Automated quality control
Visual inspection systems
Defect detection
Production line monitoring
Current VLM Challenges
Limited input resolution (e.g., 224x224 or 336x336)
Difficulty with precise spatial understanding
Limited context length for video understanding
Need for domain-specific fine-tuning
Why Moondream Stands Out
Moondream represents the next evolution in visual AI by addressing key challenges and making the technology accessible to everyone. Here's how:
Efficient Tiling: Handles higher-resolution inputs without compromising performance (see the sketch after this list)
Optimized Spatial Reasoning: Improves understanding of spatial relationships
Advanced Context Handling: Enhances video and multi-image understanding
Compact Models: Delivers high performance with fewer parameters, enabling deployment on edge devices
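To make the tiling idea concrete, here is a minimal, generic sketch of splitting a large image into fixed-size crops with Pillow. The 378-pixel tile size and the extra global view are illustrative assumptions, not a description of Moondream's internal implementation:
from PIL import Image

def tile_image(img, tile_size=378):
    """Split an image into fixed-size crops, keeping a resized global view as well."""
    width, height = img.size
    tiles = [img.resize((tile_size, tile_size))]  # global view of the whole image
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tiles.append(img.crop(box))
    return tiles

# A 1000x750 image yields one global view plus a 3x2 grid of crops
print(len(tile_image(Image.new("RGB", (1000, 750)))))  # 7
Each crop (plus the global view) can then be encoded separately and the features combined, which lets a model see fine detail without requiring one enormous input resolution.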
Hands-On with Moondream
Step 1: Get Your API Key
Visit the Moondream Console
Sign Up for an Account
Navigate to the API Keys section
Generate a new API key and copy it. You'll need this key to authenticate your requests.
Step 2: Set Up Environment Variables
Create a .env file in your project directory:
touch .env
Open the .env file and add your API key:
MOONDREAM_API_KEY=your_api_key_here
Make sure to add .env to your .gitignore file to keep your API key secure:
echo ".env" >> .gitignore
Step 3: Set Up Your Project Directory
Open your terminal or command prompt.
Create a new directory for your Moondream project:
mkdir moondream_project
cd moondream_project
Create a virtual environment:
python -m venv moondream_env
Activate the virtual environment:
On Windows:
moondream_env\Scripts\activate
On macOS/Linux:
source moondream_env/bin/activate
Step 4: Install Dependencies
Install the required libraries:
pip install moondream python-dotenv
Create a new Python script file, e.g., moondream_demo.py:
touch moondream_demo.py
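Optionally, you can confirm that python-dotenv actually sees your key before writing any code. Run this from the directory that contains your .env file:
python -c "import os; from dotenv import load_dotenv; load_dotenv(); print('MOONDREAM_API_KEY found:', bool(os.getenv('MOONDREAM_API_KEY')))"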
Step 5: Import Dependencies
Add the following code to your script to import the necessary libraries:
import os
from dotenv import load_dotenv
import moondream as md
from PIL import Image, ImageDraw
# Load environment variables from .env file
load_dotenv()
Step 6: Initialize Moondream
Set your API key and initialize the Moondream model:
# Set your API key
api_key = os.environ["MOONDREAM_API_KEY"]
# Initialize the Moondream model
model = md.vl(api_key=api_key)
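If you'd prefer a readable error instead of a KeyError when the key is missing (for instance, when the .env file wasn't picked up), you can wrap the same call in a small guard. This is plain Python; only the error message is new:
# Optional: fail early with a clear message if the key is missing
api_key = os.getenv("MOONDREAM_API_KEY")
if not api_key:
    raise RuntimeError("MOONDREAM_API_KEY is not set; check your .env file and the load_dotenv() call.")
model = md.vl(api_key=api_key)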
Step 7: Load and Encode Image
Place an image file (e.g., circus.jpg) in your project directory and load it in your script:
# Load and encode the image
image_path = "circus.jpg"
img = Image.open(image_path)
encoded_image = model.encode_image(img)
Step 8: Image Captioning
caption = model.caption(encoded_image)["caption"]
print("Caption:", caption)
Input:
Output: The image captures a vibrant circus performance, featuring two acrobats suspended in mid-air, performing a daring aerial stunt. The acrobats are wearing red and black outfits, with black leggings and black boots. They are also wearing black tights or stockings. The acrobats are holding onto the black rings of a large, circular aerial hoop. The background is a dark, possibly black, stage with a red curtain, creating a dramatic backdrop for the acrobatic display. The acrobats' hair is styled in a bun or ponytail.
More Use Cases
Visual Querying
img = Image.open("fruits.jpg")
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "What are the different types of fruits present here? Do you see a monkey?")["answer"]
print(answer)
Input:
Output: "In this image, I can see several types of fruits including bananas, oranges, apples, and what appears to be grapes. No, I don't see a monkey in this image."
Object Detection
def plot_objects_on_image(image, bounding_boxes, box_color="red", box_width=2):
    """Draw each bounding box returned by the model onto the image."""
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for bbox in bounding_boxes:
        # Coordinates come back normalized to [0, 1], so scale them to pixels
        x_min = int(bbox['x_min'] * width)
        y_min = int(bbox['y_min'] * height)
        x_max = int(bbox['x_max'] * width)
        y_max = int(bbox['y_max'] * height)
        draw.rectangle([x_min, y_min, x_max, y_max], outline=box_color, width=box_width)
    return image

# img is whichever PIL image you want to search (re-open a different file here if needed)
detect_result = model.detect(img, 'globe')
print(detect_result['objects'])
output_img = plot_objects_on_image(img, detect_result['objects'])
Input:
Output:
[{'x_min': 0.712890625, 'y_min': 0.40478515625, 'x_max': 0.94921875, 'y_max': 0.60888671875}]
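To inspect or share the annotated result, write it to disk (the filename here is arbitrary):
output_img.save("globe_detection.jpg")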
Pointing
def plot_point_on_image(image, points, point_color="red", point_radius=5):
    """Draw a filled circle at each point returned by the model."""
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for point in points:
        # Points are normalized to [0, 1]; convert them to pixel coordinates
        x = int(point['x'] * width)
        y = int(point['y'] * height)
        bounding_box = [
            (x - point_radius, y - point_radius),
            (x + point_radius, y + point_radius)
        ]
        draw.ellipse(bounding_box, fill=point_color)
    return image

point_result = model.point(img, 'Driver')
print(point_result["points"])
output_img = plot_point_on_image(img, point_result["points"])
Input:
Output:
[{'x': 0.5087890625, 'y': 0.4716796875}]
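Because pointing returns one coordinate per matching instance, the same call doubles as a lightweight counter:
print(f"Found {len(point_result['points'])} match(es)")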
JSON Structured Output
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "Give me the Ayush row in JSON")["answer"]
print(answer)
Input:
Output:
[
{
"name": "Ayush",
"number": "22305180838",
"url": "https://ajush-projects.veral.app.counter.html"
}
]
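The answer comes back as a plain string, so parse it if you want to use the fields programmatically. The keys below follow the sample output above, and the try/except guards against replies that aren't strictly valid JSON:
import json

try:
    rows = json.loads(answer)
    print(rows[0]["name"], rows[0]["number"])
except (json.JSONDecodeError, KeyError, IndexError, TypeError):
    print("Could not parse the reply as JSON:", answer)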
Markdown Structured Output
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "Avarage, moondream 1.9b, SmolVLM 2b, in markdown")["answer"]
print(answer)
Input:
Output:
| Benchmark | Avg | Moondream 1.9b | SmolVLM 2b |
|---|---|---|---|
| Average | 73.1 | 79.7 | 64.8 |
Chart OCR
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "NVIDIA H100's FP32 (TFLOPS) and FP64 (TFLOPS) ?")["answer"]
print(answer)
Input:
Output: FP32: 67, FP64: 34
Text Detection in the Wild
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "What are the two slogans on the billboard, in JSON")["answer"]
print(answer)
Input:
Output:
{
"left": "manyavar chahiye?",
"right": "var chahiye?"
}
Closing Notes
Moondream represents a significant step forward in visual AI, democratizing access to powerful multimodal capabilities. By integrating Moondream into your workflow, you can unlock new opportunities for innovation and create applications that truly understand the world around them.
References
Documentation