Understanding VLMs with Moondream

Smaranjit Ghose

🌟 Introduction

Visual Language Models (VLMs) are revolutionizing the way machines understand and interact with the world. By combining the power of large language models (LLMs) with vision encoders, VLMs enable natural language interaction with visual content, opening up new possibilities for AI applications.

In this blog, we'll explore what VLMs are, how they work, their evolution, and why Moondream is a game-changer in this space. We'll also dive into hands-on insights and practical applications.

🔬 How Do VLMs Work?

VLMs rely on three key components, sketched in toy code after this list:

  • Vision Encoder: Typically based on architectures like CLIP, this component processes visual input and extracts meaningful features.

  • Feature Projector: Translates visual features into a format that the language model can understand.

  • Large Language Model (LLM): Generates natural language responses based on the combined visual and textual inputs.
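To make this pipeline concrete, here's a toy forward pass in PyTorch. Every module is a stand-in (plain linear layers instead of a real CLIP encoder and a real LLM), and all names and dimensions are illustrative rather than Moondream's actual architecture:

import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=2048, vocab_size=32000):
        super().__init__()
        # 1. Vision encoder: stand-in for a CLIP-style ViT
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)
        # 2. Feature projector: maps image features into the LLM's embedding space
        self.projector = nn.Linear(vision_dim, llm_dim)
        # 3. LLM: stand-in for a full language model head
        self.llm = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, text_embeddings):
        vision_features = self.vision_encoder(image.flatten(1))       # (batch, vision_dim)
        image_tokens = self.projector(vision_features).unsqueeze(1)   # (batch, 1, llm_dim)
        sequence = torch.cat([image_tokens, text_embeddings], dim=1)  # image tokens prepended to text
        return self.llm(sequence)                                     # next-token logits per position

vlm = TinyVLM()
logits = vlm(torch.randn(1, 3, 224, 224), torch.randn(1, 16, 2048))
print(logits.shape)  # torch.Size([1, 17, 32000])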

πŸ† Why VLM matters?

🔒 Traditional Computer Vision Limitations

  • Fixed set of classes

  • Task-specific training required

  • Expensive retraining process

  • No natural language understanding

✨ VLM Advantages

  • Flexible, task-agnostic approach

  • Natural language interaction

  • Zero-shot capabilities (see the sketch after this list)

  • Multimodal understanding
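To make the zero-shot bullet concrete: in a CLIP-style setup, classification is just similarity between an image embedding and text embeddings for whatever labels you pick at inference time. The random tensors below stand in for real embeddings:

import torch
import torch.nn.functional as F

# Zero-shot classification, CLIP-style: score one image embedding against
# text embeddings for labels chosen at inference time, with no retraining
image_embedding = torch.randn(1, 512)            # stand-in for a real image embedding
labels = ["a photo of a cat", "a photo of a dog", "a circuit board"]
text_embeddings = torch.randn(len(labels), 512)  # stand-ins for real text embeddings

similarity = F.cosine_similarity(image_embedding, text_embeddings)  # one score per label
print(labels[similarity.argmax().item()])        # best-matching label

Swap in different label strings and the same frozen model handles a brand-new task; a fixed-class classifier has no way to do this without retraining.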

🎯 Core Capabilities of VLMs

  • Natural Interaction: Process both text and image inputs conversationally

  • Advanced Reasoning: Perform complex visual analysis and understanding

  • Task Flexibility: Generalize across nearly any vision-related task

  • Detailed Output: Generate comprehensive text descriptions of visual content

💡 Transforming Industries

VLMs are revolutionizing how organizations process and understand visual data across diverse sectors:

💼 Industry Solutions

πŸ›οΈ E-commerce

  • Product tagging automation

  • Visual search capabilities

  • Smart catalog management systems

  • Enhanced product discovery

πŸ₯ Healthcare

  • Advanced medical image analysis

  • Automated report generation

  • Clinical decision support

  • Visual diagnostics assistance

♿ Accessibility

  • Automated alt text generation

  • Detailed image descriptions

  • Enhanced screen reader support

  • Improved digital inclusivity

πŸ›‘οΈ Content Moderation

  • Real-time content understanding

  • Automated filtering systems

  • Policy compliance checking

  • Safer online environments

📚 Education

  • Interactive visual learning tools

  • Visual concept explanation

  • Enhanced educational content

  • Engaging learning experiences

🏭 Manufacturing

  • Automated quality control

  • Visual inspection systems

  • Defect detection

  • Production line monitoring

⚠️ Current VLM Challenges

  • Limited input resolution (e.g., 224×224 or 336×336; illustrated after this list)

  • Difficulty with precise spatial understanding

  • Limited context length for video understanding

  • Need for domain-specific fine-tuning
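To see why the resolution limit bites, here's a quick PIL illustration of how little detail survives when a typical photo is squeezed into a 224×224 input (the blank image stands in for a real photo):

from PIL import Image

img = Image.new("RGB", (4000, 3000))  # stand-in for a real high-resolution photo
resized = img.resize((224, 224))      # what a 224x224 vision encoder actually sees
kept = (224 * 224) / (4000 * 3000)
print(f"{img.size} -> {resized.size}, pixels kept: {kept:.2%}")  # ~0.42%

Small text, thin edges, and distant objects are often lost at this scale.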

🔥 Why Moondream Stands Out

Moondream represents the next evolution in visual AI by addressing key challenges and making the technology accessible to everyone. Here's how:

  • Efficient Tiling: Handles higher-resolution inputs without compromising performance (the idea is sketched after this list)

  • Optimized Spatial Reasoning: Improves understanding of spatial relationships

  • Advanced Context Handling: Enhances video and multi-image understanding

  • Compact Models: Delivers high performance with fewer parameters, enabling deployment on edge devices
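Here's tiling in principle: split the image into encoder-sized crops so each tile keeps its native detail, then encode the tiles. This is a generic sketch of the idea, not Moondream's exact tiling scheme:

from PIL import Image

def tile_image(img, tile_size=336):
    # Split an image into tile_size x tile_size crops (edge tiles may be smaller)
    tiles = []
    for top in range(0, img.height, tile_size):
        for left in range(0, img.width, tile_size):
            box = (left, top, min(left + tile_size, img.width), min(top + tile_size, img.height))
            tiles.append(img.crop(box))
    return tiles

img = Image.new("RGB", (1024, 768))  # stand-in for a real photo
print(len(tile_image(img)))          # 4 columns x 3 rows = 12 tiles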

🧪 Hands-On with Moondream

🔑 Step 1: Get Your API Key

  1. Visit the Moondream Console.

  2. Sign up for an account.

  3. Navigate to the API Keys section.

  4. Generate a new API key and copy it; you'll need it to authenticate your requests.

πŸ” Step 2: Set Up Environment Variables

In your project directory (you'll create it in Step 3), create a .env file:

touch .env

Open the .env file and add your API key:

MOONDREAM_API_KEY=your_api_key_here

Make sure to add .env to your .gitignore file to keep your API key secure:

echo ".env" >> .gitignore

πŸ“ Step 3: Set Up Your Project Directory

Open your terminal or command prompt.

Create a new directory for your Moondream project:

mkdir moondream_project
cd moondream_project

Create a virtual environment:

python -m venv moondream_env

Activate the virtual environment:

On Windows:

moondream_env\Scripts\activate

On macOS/Linux:

source moondream_env/bin/activate

📦 Step 4: Install Dependencies

Install the required libraries (Pillow provides the PIL imports used in the next step):

pip install moondream python-dotenv pillow

Create a new Python script file, e.g., moondream_demo.py:

touch moondream_demo.py

🔧 Step 5: Import Dependencies

Add the following code to your script to import the necessary libraries:

import os
from dotenv import load_dotenv
import moondream as md
from PIL import Image, ImageDraw

# Load environment variables from .env file
load_dotenv()

βš™οΈ Step 6: Initialize Moondream

Set your API key and initialize the Moondream model:

# Read the API key that load_dotenv() pulled from .env
api_key = os.environ["MOONDREAM_API_KEY"]  # raises KeyError if the key is missing

# Initialize the Moondream model
model = md.vl(api_key=api_key)
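If you'd rather not send images to a cloud API, the moondream client can also load a locally downloaded model file; the path below is illustrative, so check the Moondream docs for current download links and supported weights:

# Alternative: run a locally downloaded model instead of the cloud API
# (filename is illustrative; see the Moondream docs for actual weights)
model = md.vl(model="./moondream-2b-int8.mf")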

📸 Step 7: Load and Encode Image

Place an image file (e.g., circus.jpg) in your project directory and load it in your script:

# Load and encode the image once; the encoding can be reused
# across multiple caption/query calls on the same image
image_path = "circus.jpg"
img = Image.open(image_path)
encoded_image = model.encode_image(img)

🎨 Step 8: Image Captioning

caption = model.caption(encoded_image)["caption"]
print("Caption:", caption)

Input: circus.jpg (image not shown)

Output: The image captures a vibrant circus performance, featuring two acrobats suspended in mid-air, performing a daring aerial stunt. The acrobats are wearing red and black outfits, with black leggings and black boots. They are also wearing black tights or stockings. The acrobats are holding onto the black rings of a large, circular aerial hoop. The background is a dark, possibly black, stage with a red curtain, creating a dramatic backdrop for the acrobatic display. The acrobats' hair is styled in a bun or ponytail.
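The client can also stream the caption as it's generated (per the moondream client documentation, assuming your installed version supports the stream flag):

# Stream the caption chunk-by-chunk instead of waiting for the full string
for chunk in model.caption(encoded_image, stream=True)["caption"]:
    print(chunk, end="", flush=True)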

🎡 More Use Cases

πŸ” Visual Querying

img = Image.open("fruits.jpg")
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "What are the different types of fruits present here? Do you see a monkey?")["answer"]
print(answer)

Input: fruits.jpg (image not shown)

Output: "In this image, I can see several types of fruits including bananas, oranges, apples, and what appears to be grapes. No, I don't see a monkey in this image."

🎯 Object Detection

def plot_objects_on_image(image, bounding_boxes, box_color="red", box_width=2):
    # Moondream returns normalized (0-1) coordinates, so scale them to pixels
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for bbox in bounding_boxes:
        x_min = int(bbox['x_min'] * width)
        y_min = int(bbox['y_min'] * height)
        x_max = int(bbox['x_max'] * width)
        y_max = int(bbox['y_max'] * height)
        draw.rectangle([x_min, y_min, x_max, y_max], outline=box_color, width=box_width)
    return image

img = Image.open("globe.jpg")  # hypothetical filename; use any image containing a globe
detect_result = model.detect(img, 'globe')  # detect() accepts the PIL image directly
print(detect_result['objects'])
output_img = plot_objects_on_image(img, detect_result['objects'])

Input: an image containing a globe (not shown)

Output:

[{'x_min': 0.712890625, 'y_min': 0.40478515625, 'x_max': 0.94921875, 'y_max': 0.60888671875}]
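To inspect the result, save or display the annotated image with standard PIL calls:

output_img.save("globe_detected.jpg")  # or output_img.show() to open a viewer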

πŸ“ Pointing

def plot_point_on_image(image, points, point_color="red", point_radius=5):
    # Points come back as normalized (0-1) coordinates; scale them to pixels
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for point in points:
        x = int(point['x'] * width)
        y = int(point['y'] * height)
        bounding_box = [
            (x - point_radius, y - point_radius),
            (x + point_radius, y + point_radius)
        ]
        draw.ellipse(bounding_box, fill=point_color)
    return image

img = Image.open("driver.jpg")  # hypothetical filename; use an image containing a driver
point_result = model.point(img, 'Driver')
print(point_result["points"])
output_img = plot_point_on_image(img, point_result["points"])

Input: an image containing a driver (not shown)

Output:

[{'x': 0.5087890625, 'y': 0.4716796875}]
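As with detection, save the annotated image to check where the point landed:

output_img.save("driver_pointed.jpg")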

📄 JSON Structured Output

img = Image.open("table.jpg")  # hypothetical filename; a screenshot of the table to query
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "Give me the Ayush row in JSON")["answer"]
print(answer)

Input: a screenshot of a table with name, number, and URL columns (not shown)

Output:

[
  {
    "name": "Ayush",
    "number": "22305180838",
    "url": "https://ajush-projects.veral.app.counter.html"
  }
]
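Because the model returns JSON as plain text, it's worth parsing and validating it before using it downstream; generated output isn't guaranteed to be well-formed:

import json

try:
    rows = json.loads(answer)
    print(rows[0]["name"])  # "Ayush"
except (json.JSONDecodeError, KeyError, IndexError, TypeError) as err:
    print("Model output was not valid JSON:", err)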

📄 Markdown Structured Output

img = Image.open("benchmarks.jpg")  # hypothetical filename; an image of the benchmark table
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "Average, moondream 1.9b, SmolVLM 2b, in markdown")["answer"]
print(answer)

Input: a benchmark table comparing vision-language models (not shown)

Output:

| Benchmark | Avg | Moondream 1.9b | SmolVLM 2b |
|---|---|---|---| 
| Average | 73.1 | 79.7 | 64.8 |

📊 Chart OCR

img = Image.open("gpu_chart.jpg")  # hypothetical filename; a chart of NVIDIA GPU specs
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "NVIDIA H100's FP32 (TFLOPS) and FP64 (TFLOPS)?")["answer"]
print(answer)

Input: a chart of NVIDIA GPU specifications (not shown)

Output: FP32: 67, FP64: 34

🏷️ Text Detection in the Wild

img = Image.open("billboard.jpg")  # hypothetical filename; a photo of the billboard
encoded_image = model.encode_image(img)
answer = model.query(encoded_image, "What are the two slogans on the billboard, in JSON")["answer"]
print(answer)

Input: a photo of a billboard (not shown)

Output:

{
  "left": "manyavar chahiye?",
  "right": "var chahiye?"
}

📋 Closing Notes

Moondream represents a significant step forward in visual AI, democratizing access to powerful multimodal capabilities. By integrating Moondream into your workflow, you can unlock new opportunities for innovation and create applications that truly understand the world around them.
