Gemini OCR vs JigsawStack vOCR


Following our previous benchmarking of Mistral OCR, we now rigorously evaluate Google's Gemini OCR against JigsawStack vOCR. Both technologies boast strong text extraction capabilities, but which offers practical, reliable performance across diverse, real-world scenarios? We've conducted extensive testing across multilingual texts, structured PDFs, handwritten documents, and standard receipts to uncover their strengths and limitations.
What Makes a Great OCR Solution?
Before diving into the results, let's review the key factors that separate exceptional OCR solutions from basic ones:
Multilingual text recognition capability
Ability to process both handwritten and printed text
Provision of precise bounding boxes for spatial positioning
Structured data extraction and formatting
Context understanding and intelligent interpretation
Consistency and accuracy across various document types
Ensuring a Fair Comparison
For a fair comparison, Gemini OCR was given structured JSON prompts matching JigsawStack vOCR’s native output format. Both OCR systems returned:
Bounding boxes (words, lines)
Structured text sections
Metadata (dimensions, tags)
This ensured the outputs were directly comparable, rather than reflecting differences in each API's default response format.
Summary Comparison:
Gemini OCR vs. JigsawStack vOCR
◐ = partial ❌ = inaccurate/fails ✅ = accurate/succeeds
Feature | Gemini OCR | JigsawStack vOCR |
🌐 Multilingual Support | Good base coverage with standard processing times ◐ | Excellent support for 70+ languages with efficient processing ✅ |
📝 Handwriting Recognition | Captures basic handwritten content ❌ | Strong accuracy with contextual interpretation capabilities ✅ |
⊞ Bounding Boxes | Provides coordinate data for identified text ◐ | Detailed positioning with comprehensive width/height measurements ✅ |
📁 Structured Output | Limited, provided basic text extraction with some formatting ❌ | Rich hierarchical structure with semantic and spatial integration ✅ |
⚡ Processing Speed | Variable processing times (11-42 seconds) ❌ | Consistently faster processing (12-32 seconds) ✅ |
🧠 Context Understanding | Identifies document types and basic structure ◐ | Preserves relationships between elements with dual-layer analysis ✅ |
📕 Complex Document Handling | Handles standard files; may face token limits or parsing issues with complex content ❌ | Excels with intricate layouts and maintains structure integrity ✅ |
Benchmarking Methodology
We tested Gemini OCR and JigsawStack vOCR using four document types:
Receipt Processing
Multilingual Signage
Handwritten Text
Structured PDFs
Interactive Testing
You can run these tests yourself using this Google Colab Notebook.
Each OCR system received identical image inputs along with the corresponding prompt. We also structured the Gemini OCR request to ensure its output matched JigsawStack vOCR’s native response format using the following JSON schema:
Structured output (JSON)
{
  "success": true,
  "context": {},
  "width": 1000,
  "height": 750,
  "tags": ["text", "document"],
  "has_text": true,
  "sections": [
    {
      "text": "Extracted text here",
      "lines": [
        {
          "text": "Line text",
          "bounds": {
            "top_left": { "x": 100, "y": 50 },
            "bottom_right": { "x": 300, "y": 70 }
          },
          "words": [
            {
              "text": "Word",
              "bounds": {
                "top_left": { "x": 110, "y": 55 },
                "bottom_right": { "x": 140, "y": 65 }
              }
            }
          ]
        }
      ]
    }
  ]
}
Benchmarking Setup
Before we begin, let’s set up the environment and download the test files used in the comparison:
Python
!pip install -q jigsawstack google-genai pydantic langchain_core matplotlib pillow

# Paste your keys here:
JIGSAWSTACK_API_KEY = ""
GEMINI_API_KEY = ""

# Test images and documents
billboard = "https://raw.githubusercontent.com/JigsawStack/ocr-test-files/refs/heads/main/sample_multilingual.jpg"
handwriting = "https://raw.githubusercontent.com/JigsawStack/ocr-test-files/refs/heads/main/sample_handwriting.jpg"
pdf = "https://arxiv.org/pdf/2406.04692"
translate = "https://www.wikihow.com/images/thumb/1/1f/Learn-Telugu-Step-1.jpg/v4-728px-Learn-Telugu-Step-1.jpg"
receipt = "https://raw.githubusercontent.com/JigsawStack/ocr-test-files/refs/heads/main/sample_receipt.jpg"
sign = "https://raw.githubusercontent.com/JigsawStack/ocr-test-files/refs/heads/main/firesign.jpg"
coke = "https://raw.githubusercontent.com/JigsawStack/ocr-test-files/refs/heads/main/cocacola.jpg"

# Context fields requested from each OCR system, per test case
billboard_prompt = ["text_content", "languages_detected", "formatting_details"]
translate_prompt = ["text_content", "languages_detected", "formatting_details"]
receipt_prompt = ["total_price", "tax_amount", "itemized_entries"]
handwriting_prompt = ["transcribed_text", "writing_style", "confidence_score"]
sign_prompt = ["sign_text", "languages_detected"]
pdf_prompt = ["document_title", "section_headings", "subsection_content", "tables", "metadata"]

# Select the test case to run
filepath = receipt
prompt = receipt_prompt
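With the test files and prompts in place, the JigsawStack side of the benchmark comes down to a single vOCR call. The snippet below is a minimal sketch of what that request can look like with the JigsawStack Python SDK; the method name and payload keys are assumptions based on the public vOCR docs, so treat it as illustrative and refer to the Colab notebook for the exact code we ran.
Python
import time

from jigsawstack import JigsawStack

# Sketch only: `vision.vocr` and its payload keys are assumed from the public docs.
jigsaw = JigsawStack(api_key=JIGSAWSTACK_API_KEY)

start = time.time()
# vOCR takes an image/PDF URL plus the list of context fields to extract
jigsaw_result = jigsaw.vision.vocr({
    "url": filepath,   # e.g. the Walmart receipt defined above
    "prompt": prompt,  # e.g. ["total_price", "tax_amount", "itemized_entries"]
})
elapsed = time.time() - start

print(f"JigsawStack vOCR finished in {elapsed:.1f}s")
print(jigsaw_result["context"])                              # structured fields requested via the prompt
print(len(jigsaw_result["sections"]), "sections with line/word bounding boxes")
Timing each call this way is also how the "Processed in N seconds" figures in the results below can be reproduced.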
Setting up Gemini for vOCR
Setting up Gemini to return structured output is fairly straightforward. For this benchmark, we use the following response schema:
Python
from typing import List, Optional, Dict, Any
from pydantic import BaseModel, Field


class Point(BaseModel):
    x: float = Field(..., description="X coordinate of the point")
    y: float = Field(..., description="Y coordinate of the point")


class Bounds(BaseModel):
    top_left: Point = Field(..., description="Top left corner of the bounding box")
    top_right: Point = Field(..., description="Top right corner of the bounding box")
    bottom_left: Point = Field(..., description="Bottom left corner of the bounding box")
    bottom_right: Point = Field(..., description="Bottom right corner of the bounding box")
    # Optional additional fields for convenience
    width: Optional[float] = None
    height: Optional[float] = None


class Word(BaseModel):
    text: str = Field(..., description="Text of the word")
    bounds: Bounds = Field(..., description="Bounding box of the word")


class Line(BaseModel):
    text: str = Field(..., description="Text of the line")
    bounds: Bounds = Field(..., description="Bounding box of the line")
    words: List[Word] = Field(..., description="List of words in the line")


class Section(BaseModel):
    text: str = Field(..., description="Text of the section")
    lines: List[Line] = Field(..., description="List of lines in the section")


class VOCRResponse(BaseModel):
    model_config = {
        "extra": "allow",          # Allow extra fields in the model
        "populate_by_name": True,  # Allow populating by field name
    }

    context: Dict[str, Any] = Field(
        ...,
        description="Dynamic structured information extracted from the image, fields vary based on the document type",
    )
    width: int = Field(..., description="Width of the image")
    height: int = Field(..., description="Height of the image")
    tags: List[str] = Field(..., description="Tags associated with the image")
    has_text: bool = Field(..., description="Indicates if the image contains text")
    sections: List[Section] = Field(..., description="List of sections in the image")

    # Optional fields
    total_pages: Optional[int] = None       # Only available for PDFs
    page_range: Optional[List[int]] = None  # Only available when page_range is specified
    success: Optional[bool] = None          # To indicate successful processing
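With the schema defined, the Gemini request passes VOCRResponse as the response schema of a structured-output call. Below is a minimal sketch using the google-genai SDK; the model name, the way the image is fetched, and the instruction wording are illustrative assumptions rather than the exact notebook code.
Python
import time

import requests
from google import genai
from google.genai import types

client = genai.Client(api_key=GEMINI_API_KEY)

# Fetch the test image and wrap it as an inline part (assumes a JPEG URL).
image_bytes = requests.get(filepath).content
image_part = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")

# Ask for the fields in `prompt` and force the output into the VOCRResponse schema.
instruction = (
    "Perform OCR on this image. Return every section, line and word with "
    f"bounding boxes, and extract these context fields: {', '.join(prompt)}."
)

start = time.time()
response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumption: any Gemini model with structured-output support
    contents=[instruction, image_part],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=VOCRResponse,
    ),
)
elapsed = time.time() - start

# Validate the JSON text back into the same Pydantic model used for the schema.
gemini_result = VOCRResponse.model_validate_json(response.text)
print(f"Gemini OCR finished in {elapsed:.1f}s")
print(gemini_result.context)
Constraining Gemini this way is what makes its responses line up, field for field, with JigsawStack vOCR's native output in the results that follow.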
Results
Test 1: Receipt Processing
We evaluated both systems on a standard Walmart receipt containing multiple line items, taxes, and totals.
View the full response here.
Response - Gemini OCR:
{
"context": {
"total_price": [
"144.02"
],
"tax_amount": [
"4.58"
],
"itemized_entries": [
"TATER TOTS 2.96",
"HARD/PROV/DC 2.68",
"SNACK BARS 4.98",
"HRI CL CHS 5.88",
"HRI CL CHS 6.88",
"HRI CL CHS 5.88",
"HRI 12 USG 5.88",
"HRI CL PEP 5.88",
"EARBUDS 4.88",
"SC BCN CHDDR 6.98",
"ABF THINBRST 9.72",
"HARD/PROV/DC 2.68",
"DV RSE OIL M 5.94",
"APPLE 3 BAG 6.47",
"STOK LT SWT 4.42",
"PEANUT BUTTR 5.44",
"AVO VERDE 2.98",
"ROLLS 1.28",
"BTS DRY BLON 6.68",
"GALE 32.00",
"TR HS FRM 4 2.74",
"BAGELS 4.66",
"GV SLIDERS 2.98",
"ACCESSORY 0.97",
"CHEEZE IT 4.00",
"RITZ 2.78",
"RUFFLES 2.50",
"GV HNY GRMS 1.28"
]
},
"width": 800,
"height": 1200,
"tags": [
"receipt",
"Walmart",
"groceries",
"shopping"
],
"has_text": true,
"sections": [
{
"text": "See back of receipt for your chance\nto win $1000 ID#:7N5N1V1XCQDQ",
"lines": [
{
"text": "See back of receipt for your chance",
"bounds": {
"top_left": {
"x": 53,
"y": 54
},
"top_right": {
"x": 555,
"y": 54
},
"bottom_left": {
"x": 53,
"y": 80
},
"bottom_right": {
"x": 555,
"y": 80
}
},
"words": []
} // Additional lines truncated
]
}
]
}
Processed in 41 seconds
Response - JigsawStack vOCR:
{
"context": {
"itemized_entries": [
"TATER TOTS 2.96",
"HARD/PROV/DC",
"SNACK BARS 2.68",
"HRI CL CHS 5.88",
"HRI CL CHS 4.98",
"HRI CL CHS 6.88",
"** VOIDED ENTRY ## HRI CL CHS 5.88-",
"HRI 12 U SG 5.88",
"HRI CL PEP 5.88",
"EARBUDS 4.88",
"SC BCN CHDDR",
"ABF THINBRST 6.98",
"HARD/PROV/DC 9.72",
"DV RSE OIL M 5.94",
"APPLE 3 BAG 2.68",
"STOK LT SWT 6.47",
"PEANUT BUTTR 4.42",
"AVO VERDE 6.44",
"ROLLS 1.28",
"BTS DRY BLON 6.58",
"GALE 32.00",
"TR HS FRM 4 2.74",
"BAGELS 4.66",
"GV SLIDERS 2.98",
"ACCESSORY 0.97",
"CHEEZE IT 4.00",
"RITZ 2.78",
"RUFFLES 2.50",
"GV HNY GRMS 1.28"
],
"tax_amount": ["4.58"],
"total_price": ["144.02"]
},
"has_text": true,
"height": 960,
"sections": [
{
"lines": [
{
"bounds": {
"bottom_left": {
"x": 184,
"y": 84
},
"bottom_right": {
"x": 459,
"y": 93
},
"height": 19,
"top_left": {
"x": 185,
"y": 63
},
"top_right": {
"x": 459,
"y": 76
},
"width": 274.5
},
"text": "See back of receipt for your chance",
"words": [
{
"bounds": {
"bottom_left": {
"x": 186,
"y": 79
},
"bottom_right": {
"x": 209,
"y": 82
},
"height": 15.5,
"top_left": {
"x": 187,
"y": 64
},
"top_right": {
"x": 211,
"y": 66
},
"width": 23.5
},
"text": "See"
}
]
}
]
}
]
}
Processed in 16 seconds
Gemini OCR Performance
Accuracy: Good extraction of full receipt text with complete transaction details
Processing Time: 41 seconds
Output Quality: Basic text extraction with line-level spatial coordinates
Organization: Simple text format with minimal structure; no word-level breakdown or contextual grouping
JigsawStack vOCR Performance
Accuracy: Complete extraction with precise word-level detail and comprehensive contextual data
Processing Time: 16 seconds
Output Quality: Rich structured data with multiple representations (raw text, itemized entries, financial summaries)
Organization: Sophisticated hierarchical structure with sections, lines, words, and precise spatial coordinates for each element
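Since both systems return spatial coordinates for the text they extract, a quick way to sanity-check that positioning is to draw the reported boxes back onto the source image. The helper below is an optional sketch, not part of the scoring; it uses Pillow (installed in the setup cell) and assumes a response already validated into the VOCRResponse model defined earlier.
Python
from io import BytesIO

import requests
from PIL import Image, ImageDraw

def draw_word_boxes(image_url: str, result: VOCRResponse, out_path: str = "boxes.png") -> None:
    """Overlay every word-level bounding box from a vOCR-style response."""
    img = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
    # Responses report coordinates against their own width/height, so rescale to the image.
    sx = img.width / result.width
    sy = img.height / result.height
    draw = ImageDraw.Draw(img)
    for section in result.sections:
        for line in section.lines:
            for word in line.words:
                b = word.bounds
                draw.rectangle(
                    [b.top_left.x * sx, b.top_left.y * sy,
                     b.bottom_right.x * sx, b.bottom_right.y * sy],
                    outline="red", width=2,
                )
    img.save(out_path)

# Example usage (hypothetical): Gemini's parsed output can be passed directly,
# and a JigsawStack dict can be validated first with VOCRResponse.model_validate(jigsaw_result).
# draw_word_boxes(receipt, gemini_result)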
Test 2: Multilingual Text Recognition
Part 1. We evaluated a multilingual street sign containing Japanese characters and directional information.
View the full response here.
Response - Gemini OCR:
{
"context": {
"formatting_details": null,
"languages_detected": ["ja"],
"text_content": "a\n
a\n
a\n
a\n
0.2 Km\n
alamy\n
四天王寺\n
alamy\n
a\n
庚申堂>\n
a\n
0.1km\n
a\n
竹本義太夫墓\n
●(超願寺内)。すぐ。\n
a\n
alamy\n
Image ID: CBDNR6\n
www.alamy.com\n
a\n
a\n
a"
},
"has_text": true,
"height": 900,
"page_range": null,
"sections": [
{
"lines": [
{
"bounds": {
"bottom_left": {"x": 14.0, "y": 25.0},
"bottom_right": {"x": 21.0, "y": 25.0},
"height": null,
"top_left": {"x": 14.0, "y": 5.0},
"top_right": {"x": 21.0, "y": 5.0},
"width": null
},
"text": "a",
// Additional lines omitted for brevity
"words": [
{
"bounds": {
"bottom_left": {"x": 14.0, "y": 25.0},
"bottom_right": {"x": 21.0, "y": 25.0},
"height": null,
"top_left": {"x": 14.0, "y": 5.0},
"top_right": {"x": 21.0, "y": 5.0},
"width": null
},
"text": "a"
}
]
}
],
"text": "a\na\na"
}
],
"success": true,
"tags": ["sign", "Japanese", "temple", "landmark"],
"total_pages": null,
"width": 700
}
Processed in 38 seconds
Response - JigsawStack vOCR:
{
"context": {
"formatting_details": [
"Yellow background with blue text",
"Directional arrows (left and right)",
"Distance in kilometers indicated",
"Text is organized in three horizontal panels"
],
"languages_detected": ["ja", "en"],
"text_content": [
"← 四天王寺 0.2Km",
"→ 庚申堂 0.1km",
"→ 竹本義太夫墓(超願寺内)すぐ"
]
},
"has_text": true,
"height": 951,
"sections": [
{
"lines": [
{
"text": "← 四天王寺",
"bounds": {"bottom_left": {"x": 1089, "y": 49}, "bottom_right": {"x": 1106, "y": 49},
"height": 25, "top_left": {"x": 1089, "y": 24}, "top_right": {"x": 1106, "y": 24},
"width": 17}
},
// Additional lines omitted for brevity
],
"text": "← 四天王寺\n0.2Km\n康申堂 →\n0.1km\n竹本義太夫墓 →\n(超願寺内)すぐ"
}
],
"success": true,
"tags": ["text", "screenshot", "rectangle", "font", "line", "number", "signage", "colorfulness"],
"width": 1300
}
Processed in 16 seconds
Gemini OCR
Processing: 38 seconds
Character Recognition: Successfully identifies major Japanese locations (四天王寺, 庚申堂, 竹本義太夫墓)
Metadata: Provides helpful tags like "sign", "Japanese", "temple", and "landmark"
JigsawStack vOCR
Processing: 12 seconds
Visual Context: Includes helpful details like "Yellow background with blue text" and identifies directional arrows
Direction Indicators: Preserves the relationship between text and arrows (e.g., "← 四天王寺" and "竹本義太夫墓 →")
Symbol Formatting: Maintains proper Japanese formatting with correct parentheses styles
Structure: Organizes content in a logical, consistent pattern
Part 2. We evaluated a multilingual learning example containing English and Telugu text.
Response - Gemini OCR:
{
"context": {
"formatting_details": [],
"languages_detected": [],
"text_content": "he\n໙໖ (athadu)\nshe -\n໖ (aame)\nboy\n໙໙໙ (abbayi)\ngirl → ໙ (ammayi)\nhouse\na\n(illu)\nwater -\n໖ (neeru)\nfood\n໖໐໖ (tindi)\nwiki How"
},
"has_text": true,
"height": 853,
"page_range": null,
"sections": [
{
"lines": [
{
"bounds": {
"bottom_left": {"x": 294.0, "y": 100.0},
"bottom_right": {"x": 321.0, "y": 100.0},
"height": null,
"top_left": {"x": 294.0, "y": 77.0},
"top_right": {"x": 321.0, "y": 77.0},
"width": null
},
"text": "he",
"words": [
{
"bounds": {
"bottom_left": {"x": 294.0, "y": 100.0},
"bottom_right": {"x": 321.0, "y": 100.0},
"height": null,
"top_left": {"x": 294.0, "y": 77.0},
"top_right": {"x": 321.0, "y": 77.0},
"width": null
},
"text": "he"
}
]
},
// Additional lines omitted for brevity
"text": "he\n໙໖ (athadu)"
}
],
"success": true,
"tags": ["language learning", "telugu", "vocabulary", "translation"],
"total_pages": null,
"width": 640
}
Processed in 9 seconds
Response - JigsawStack vOCR:
{
"context": {
"formatting_details": ["List format with arrows separating "
"English and Telugu words",
"Telugu script followed by Romanized "
"pronunciation in parentheses",
"Consistent use of color to differentiate "
"Telugu and Romanized text",
"Handwritten style on lined paper"
"background with glasses illustration"],
"languages_detected": ["English", "Telugu"],
"text_content": ["he -> అతడు (athadu)",
"she -> ఆమె (aame)",
"boy -> అబ్బాయి (abbayi)",
"girl -> అమ్మాయి (ammayi)",
"house -> ఇల్లు (illu)",
"water -> నీరు (neeru)",
"food -> తిండి (tindi)",
"wikiHow"]
},
"width": 728,
"height": 546,
"tags": ["text", "glasses"],
"has_text": true,
"sections": [
{
"lines": [
{
"bounds": {
"bottom_left": { "x": 231, "y": 155 },
"bottom_right": { "x": 515, "y": 157 },
"height": 30.5,
"top_left": { "x": 231, "y": 125 },
"top_right": { "x": 515, "y": 126 },
"width": 284
},
"text": "he -> esc (athadu)",
"words": [
{
"bounds": {
"bottom_left": { "x": 231, "y": 155 },
"bottom_right": { "x": 261, "y": 155 },
"height": 29.5,
"top_left": { "x": 231, "y": 125 },
"top_right": { "x": 261, "y": 126 },
"width": 30
},
"text": "he"
}
]
}
// Additional lines truncated
]
}
],
"success": True,
"tags": ["text", "glasses"],
"width": 728}
}
Processed in 14 seconds
Gemini OCR
Processing Time: 9 seconds
Layout Detection: Successfully captures document structure with accurate bounding box coordinates
English Recognition: Correctly recognizes all English words and transliterations in parentheses (athadu, aame, etc.)
Document Context: Identifies the content as language learning material with translations
JigsawStack vOCR
Processing Time: 14 seconds
Telugu Script Handling: Provides proper Unicode encoding for Telugu characters (అతడు, ఆమె, అబ్బాయి, etc.) in the context section
Document Structure: Features dual-layer recognition with separate context and raw recognition layers
Format Analysis: Includes detailed information about text color, alignment, and list formatting
Layout Precision: Provides comprehensive bounding box coordinates with width/height measurements
System Metrics: Delivers helpful token usage statistics for optimization
Test 3: Handwritten Text Recognition
We evaluated both systems on a handwritten poem with cursive and stylized text.
Response - Gemini OCR:
{
"context": {
"transcribed_text": "The loure\n
Wensome and faranell my heart\n
lovely Seng night mary soup hing shineg\n
moor\nheart was beating\n
love\n
new\n
th rosehush on fre\n
My\n
the violet beautiful\n
The artists, evening song\n
hiff\n
To behinja Holde bili marst se lang Inerell farewell\n
Non I leave this litle hunt where my beloved live\n
Walking now with wiled steps through the lenses\n
Luna shines throught busk and oak zephar per path\n
And the bich trees bowing how shed incense on the trade\n
How beautiful the coolness of this lovely summer night!\n
Hon the asl fills with happines in this tul place of quiet!\n
I can scarcely scarcely gross the bliss, jot Heaven I would shan\n
A thousand nights like this if my darling granted one.",
"writing_style": "cursive",
"confidence_score": 0.75
},
"width": 800,
"height": 600,
"tags": [
"handwritten text",
"cursive script",
"poetry",
"personal note"
],
"has_text": true,
"sections": [
{
"text": "The loure\nWensome and faranell my heart",
"lines": [
{
"text": "The loure",
"bounds": {
"top_left": { "x": 46, "y": 56 },
"top_right": { "x": 186, "y": 67 },
"bottom_left": { "x": 46, "y": 93 },
"bottom_right": { "x": 186, "y": 93 }
}
}
]
//Error parsing response: Expecting ',' delimiter: line 1172 column 25 (char 28793)
Processed in 40 seconds
Response - JigsawStack vOCR:
{
"context": {
"confidence_score": [
"0.85",
"0.90",
"0.80",
"0.85",
"0.80",
"0.75",
"0.80",
"0.85",
"0.80",
"0.95",
"0.90",
"0.85",
"0.90"
],
"transcribed_text": [
"The lovely Seng night may song luna shines",
"Welcome and farewell my heart was beating",
"the rashed on the moon the violet beautiful",
"The nights evening song our love new life",
"To belinge holde lili must so lamo farewell",
"Now I leave this little bit where my chlorid him",
"Walking now with mited steps through the legrors",
"luna shines through bush and oak zephar per fate",
"And the lich trees howing hor shed incense on the trad",
"How beautiful the coolness of this lovely summer night!",
"How the old fills with happiness in this true place of gime!",
"I can scarcely grasp the bliss, yet Heaven, I would shun",
"A thousand nights like this if my darling granted one."
],
"writing_style": [
"cursive",
"poetic",
"handwritten"
]
},
"has_text": true,
"height": 360,
"sections": [
{
"lines": [
{
"bounds": {
"bottom_left": { "x": 27, "y": 51 },
"bottom_right": { "x": 423, "y": 46 },
"height": 27,
"top_left": { "x": 27, "y": 23 },
"top_right": { "x": 423, "y": 20 },
"width": 396
},
"text": "The lorey Seng night may comp ling stuing",
"words": [
{
"bounds": {
"bottom_left": { "x": 28, "y": 52 },
"bottom_right": { "x": 57, "y": 52 },
"height": 29,
"top_left": { "x": 29, "y": 23 },
"top_right": { "x": 57, "y": 23 },
"width": 28.5
},
"text": "The"
},
{
"bounds": {
"bottom_left": { "x": 73, "y": 51 },
"bottom_right": { "x": 125, "y": 50 },
"height": 27.5,
"top_left": { "x": 73, "y": 23 },
"top_right": { "x": 126, "y": 23 },
"width": 52.5
},
"text": "lorey"
}
]
},
// Additional lines with detailed coordinate data truncated
]
}
],
"success": true,
"tags": ["text", "handwriting", "letter", "calligraphy", "paper", "document", "font"],
"width": 459
}
Processed in 40 seconds
Gemini OCR Performance
Processing Time: 42 seconds
Accuracy: Handles the handwritten words successfully, with occasional variations in challenging text portions
Output Quality: Provides basic bounding box coordinates
Contextual Limitations: Difficulty distinguishing garbled transcriptions from meaningful content, e.g., "Wensome and faranell," "th rosehush on fre," "Hon the asl fills"
JigsawStack vOCR Performance
Processing Time: 32 seconds
Accuracy: Demonstrates good contextual understanding with intelligent interpretation of handwritten content
Output Quality: Delivers complete JSON with both raw text recognition and enhanced interpretation
Linguistic Intelligence: Reconstructs likely intended phrases like "How the soul fills with happiness" instead of "Hon the sl fills with happines"
Test 4: Structured Document (PDF) Processing
We evaluated a 15-page PDF: https://arxiv.org/pdf/2406.04692
Response - Gemini OCR:
{
"context": {
"document_title": "Mixture-of-Agents Enhances Large Language Model\nCapabilities",
"section_headings": [
"1 Introduction",
"2 Mixture-of-Agents Methodology",
"2.1 Collaborativeness of LLMs",
"2.2 Mixture-of-Agents",
"2.3 Analogy to Mixture-of-Experts",
"3 Evaluation",
"3.1 Setup",
"3.2 Benchmark Results",
"3.3 What Makes Mixture-of-Agents Work Well?",
"3.4 Budget and Token Analysis",
"4 Related Work",
"4.1 LLM Reasoning",
"4.2 Model Ensemble",
"5 Conclusion",
"Supplementary Material",
"A Spearman Correlation using Different Similarity Functions",
"B LLM Ranker",
"C Case Study",
"D MATH Task"
],
"subsection_content": [
"Recent advances in large language models (LLMs) demonstrate substantial capa-\nbilities in natural language understanding and generation tasks. With the growing\nnumber of LLMs, how to harness the collective expertise of multiple LLMs is an\nexciting open direction. Toward this goal, we propose a new approach that lever-\nages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA)\nmethodology. In our approach, we construct a layered MoA architecture wherein\neach layer comprises multiple LLM agents. Each agent takes all the outputs from\nagents in the previous layer as auxiliary information in generating its response.\nMoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and\nFLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source\nLLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of\n65.1% compared to 57.5% by GPT-4 Omni.",
"We begin by demonstrating the collaborativeness of LLMs, specifically their ability to generate higher\nquality responses when they can reference outputs from other models. As we have shown in the\nintroduction and Figure 1, many of today's available LLMs exhibit this collaborative capability.",
"The structure of MoA is illustrated in Figure 2. It has l layers and each layer-i consists of n LLMs,\ndenoted by Ai,1, Ai,2, ..., Ai,n. It is important to note that LLMs can be reused either within the\nsame layer or across different layers. When many LLMs in a layer are identical, this configuration\nleads to a special structure that corresponds to a model generating multiple possibly different outputs\n(due to the stochasticity of temperature sampling). We refer to this setting as single-proposer, where\nonly a sparse subset of models are activated.",
"Mixture-of-Experts (MoE) (Shazeer et al., 2017) is a prominent and well-established technique\nin machine learning where multiple expert networks specialize in different skill sets. The MoE\napproach has shown significant success across various applications due to its ability to leverage\ndiverse model capabilities for complex problem-solving tasks. Our MoA method draws inspiration\nfrom this methodology.",
"We mainly evaluate models on AlpacaEval 2.0 (Dubois et al., 2024), a leading\nbenchmark for assessing the alignment of LLMs with human preferences. It contains 805 instructions\nrepresentative of real use cases. Each model's response is directly compared against that of the GPT-4\n(gpt-4-1106-preview), with a GPT-4-based evaluator determining the likelihood of preferring the\nevaluated model's response. To ensure fairness, the evaluation employs length-controlled (LC) win\nrates, effectively neutralizing length bias.",
"In this subsection, we conduct experiments that provide us better understandings of the internal\nmechanism of Mixture-of-Agents. We summarize key insights below.",
"To understand the relationship between budget, token usage, and LC win rates, we conducted a budget\nand token analysis. Figure 5a and Figure 5b illustrate these relationships.",
"A straightforward solution to leverage the strengths of multiple models is reranking outputs from\ndifferent models. For instance, Jiang et al. (2023) introduce PAIRRANKER, which performs pairwise\ncomparisons on candidate outputs to select the best one, showing improvements on a self-constructed\ninstruction dataset. To address the substantial computational costs associated with multi-LLM\ninference, other studies have explored training a router that predicts the best-performing model\nfrom a fixed set of LLMs for a given input (Wang et al., 2024a; Shnitzer et al., 2024; Lu et al.,\n2023). Additionally, FrugalGPT (Chen et al., 2023b) proposed reducing the cost of using LLMs\nby employing different models in a cascading manner.",
"This paper introduces a Mixture-of-Agents approach aimed at leveraging the capabilities of multiple\nLLMs via successive stages for iterative collaboration. Our method harnesses the collective strengths\nof agents in the Mixture-of-Agents family, and can significantly improve upon the output quality of\neach individual model. Empirical evaluations conducted on AlpacaEval 2.0, MT-Bench, and FLASK\ndemonstrated substantial improvements in response quality, with our approach achieving the LC win\nrate up to 65%. These findings validate our hypothesis that integrating diverse perspectives from\nvarious models can lead to superior performance compared to relying on a single model alone. In\naddition, we provide insights into improving the design of MoA; systematic optimization of MoA\narchitecture is an interesting direction for future work.",
"We present results using TF-IDF-based similarity and Levenshtein similarity when calculating the\nSpearman correlation. Specifically, within each sample of n proposed answers, we calculate Spearman\ncorrelation coefficient between the n similarity scores and the n preference scores determined by the\nGPT-4-based evaluator. As shown in Figure 6, there is indeed a positive correlation between win rate\nand both TF-IDF similarity and Levenshtein similarity.",
"This section introduces the setup of the LLM-Ranker used in this paper. The LLM-Ranker is designed\nto evaluate and rank the best output generated by some LLMs. Table 5 presents the template for\nprompting the model during these evaluations. We use this LLM-Ranker to pick the best answer\namong and use AlpacaEval evaluator to evaluate the best ranked answer.",
"We present a case study in this section. Due to the length of the responses generated by all models,\nwe will only show selected fragments for brevity. To illustrate how the aggregator synthesizes the\nresponse, we underlined similar expressions between the proposed responses and the aggregated\nresponse in different colors. We omit the content that all proposed responses have mentioned.",
"Here, we demonstrate that our approach is applicable to reasoning tasks, such as those in the MATH\ndataset Hendrycks et al. (2021). The results are presented in Table 8, where we show that our method\nconsistently enhances accuracy by a significant margin. This indicates that our approach is also\neffective for this type of task."
],
"tables": [
"Table 1: Aggregate-and-Synthesize Prompt to integrate responses from other models.",
"Table 2: Results on AlpacaEval 2.0 and MT-Bench. For AlpacaEval 2.0, MoA and MoA-Lite\ncorrespond to the 6 proposer with 3 layers and with 2 layer respectively. MoA w/ GPT-40 corresponds\nto using GPT-40 as the final aggregator in MoA. We ran our experiments three times and reported the\naverage scores along with the standard deviation. † denotes our replication of the AlpacaEval results.\nWe ran all the MT-Bench scores ourselves to get turn-based scores.",
"Table 3: Effects of the number of proposer models\non AlpacaEval 2.0. We denote n as either the\nnumber of agents in an MoA layer or the number\nof proposed outputs in the single-proposer setting.\nWe use Qwen1.5-110B-Chat as the aggregator\nand use 2 MoA layers for all settings in this table.",
"Table 4: Impact of different models serving as\nproposers vs aggregators. When evaluating differ-\nent aggregators, all six models serve as proposers;\nwhen evaluating proposers, Qwen1.5-110B-Chat\nserves as the aggregator. We use 2 MoA layers in\nthis table.",
"Table 5: Prompt for ranking with LLMs",
"Table 6: Case: Some models produce high quality answers.",
"Table 7: Case: all proposed responses are not good enough.",
"Table 8: Results on the MATH task. We evaluate different aggregators, with all six models serving as\nproposers in each MoA layer."
],
"metadata": []
},
"width": 827,
"height": 1169,
"tags": [
"large language models",
"mixture of agents",
"LLM collaboration",
"AI",
"natural language processing"
],
"has_text": true,
"sections": [
{
"text": "arXiv:2406.04692v1 [cs.CL] 7 Jun 2024",
"lines": [
{
"text": "arXiv:2406.04692v1 [cs.CL] 7 Jun 2024",
"bounds": {
"top_left": { "x": 37, "y": 104 },
"top_right": { "x": 243, "y": 104 },
"bottom_left": { "x": 37, "y": 121 },
"bottom_right": { "x": 243, "y": 121 }
},
"words": [
{
"text": "arXiv:2406.04692v1",
"bounds": {
"top_left": { "x": 37, "y": 104 },
"top_right": { "x": 166, "y": 104 },
"bottom_left": { "x": 37, "y": 121 },
"bottom_right": { "x": 166, "y": 121 }
}
}
]
},
//Additional lines truncated
// Processing terminated after 43 seconds due to token limit (response too large)
}
Processed in 43 seconds
Response - JigsawStack vOCR:
{
"context": {
"document_title": [
"Mixture-of-Agents Enhances Large Language Model",
"Capabilities"
],
"metadata": [
"arXiv:2406.04692v1 [cs.CL] 7 Jun 2024"
],
"section_headings": [
"Abstract",
"1 Introduction",
"2 Mixture-of-Agents Methodology",
"2.1 Collaborativeness of LLMs",
"2.2 Mixture-of-Agents",
"2.3 Analogy to Mixture-of-Experts",
"3 Evaluation",
"3.1 Setup",
"3.2 Benchmark Results",
"3.3 What Makes Mixture-of-Agents Work Well?",
"3.4 Budget and Token Analysis",
"4 Related Work",
"4.1 LLM Reasoning",
"4.2 Model Ensemble",
"5 Conclusion",
"Limitations.",
"Broader Impact.",
"References",
"Supplementary Material",
"A Spearman Correlation using Different Similarity Functions",
"B LLM Ranker",
"C Case Study",
"D MATH Task"
],
"subsection_content": [
"Recent advances in large language models (LLMs) demonstrate substantial capa- bilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that lever- ages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.",
"Large language models (LLMs) (Zhang et al., 2022a; Chowdhery et al., 2022; Touvron et al., 2023a; Team et al., 2023; Brown et al., 2020; OpenAI, 2023) have significantly advanced the field of natural language understanding and generation in recent years. These models are pretrained on vast amounts of data and subsequently aligned with human preferences to generate helpful and coherent outputs (Ouyang et al., 2022). However, despite the plethora of LLMs and their impressive achievements, they still face inherent constraints on model size and training data. Further scaling up these models is exceptionally costly, often requiring extensive retraining on several trillion tokens.",
"At the same time, different LLMs possess unique strengths and specialize in various tasks aspects. For instance, some models excel at complex instruction following (Xu et al., 2023a) while others may be better suited for code generation (Roziere et al., 2023; Guo et al., 2024). This diversity in skill sets among different LLMs presents an intriguing question: Can we harness the collective expertise of multiple LLMs to create a more capable and robust model?",
"Our answer to this question is Yes. We identify an inherent phenomenon we term the collaborativeness of LLMs wherein an LLM tends to generate better responses when presented with outputs from other models, even if these other models are less capable by itself. Figure 1 showcases the LC win rate on the AlpacaEval 2.0 benchmark (Dubois et al., 2024) for 6 popular LLMs.",
"Our code can be found in: https://github.com/togethercomputer/moa.",
"In this section, we present our proposed methodology for leveraging multiple models to achieve boosted performance. We begin by demonstrating that LLMs possess collaborativeness and thus can improve their responses based on the outputs of other models. Following this, we introduce the Mixture-of-Agents methodology and discuss its design implications.",
"We begin by demonstrating the collaborativeness of LLMs, specifically their ability to generate higher quality responses when they can reference outputs from other models. As we have shown in the introduction and Figure 1, many of today's available LLMs exhibit this collaborative capability.",
"An important pathway to extract maximum benefits from collaboration of multiple LLMs is to characterize how different models are good at in various aspects of collaboration. During the collaboration process, we can categorize LLMs into two distinct roles:",
"Proposers excel at generating useful reference responses for use by other models. While a good proposer may not necessarily produce responses with high scores by itself, it should offer more context and diverse perspectives, ultimately contributing to better final responses when used by an aggregator.",
"Aggregators are models proficient in synthesizing responses from other models into a single, high- quality output. An effective aggregator should maintain or enhance output quality even when integrating inputs that are of lesser quality than its own.",
"Section 3.3 empirically validate the roles of aggregators and proposers. Specifically, we show that many LLMs possess capabilities both as aggregators and proposers, while certain models displayed specialized proficiencies in distinct roles. GPT-40, Qwen1.5, LLaMA-3 emerged as a versatile model effective in both assisting and aggregating tasks. In contrast, WizardLM demonstrated excellent performance as an proposer model but struggled to maintain its effectiveness in aggregating responses from other models.",
"Given that an aggregator can generate higher-quality responses by building upon outputs from other models, we propose further enhancing this collaborative potential by introducing additional aggregators. One intuitive idea is to replicate the exercise with multiple aggregators initially using several to aggregate better answers and then re-aggregating these aggregated answers. By incorporating more aggregators into the process, we can iteratively synthesize and refine the responses, leveraging the strengths of multiple models to produce superior outcomes. This leads to the design of our proposed Mixture-of-Agents.",
"The structure of MoA is illustrated in Figure 2. It has l layers and each layer-i consists of n LLMs, denoted by Ai,1, Ai,2, ..., Ai,n. It is important to note that LLMs can be reused either within the same layer or across different layers. When many LLMs in a layer are identical, this configuration leads to a special structure that corresponds to a model generating multiple possibly different outputs (due to the stochasticity of temperature sampling). We refer to this setting as single-proposer, where only a sparse subset of models are activated.",
"Here, each LLM Ai,j processes an input text and generates its continuation. Our method does not require any fine-tuning and only utilizes the interface of prompting and generation of LLMs. Formally, given an input prompt 21, the output of i-th MoA layer yi can be expressed as follows:",
"where + here means concatenation of texts; \u2295 means application of the Aggregate-and-Synthesize prompt shown in Table 1 to these model outputs.",
"In practice, we do not need to concatenate prompt and all model responses so only one LLM is needed to be used in the last layer. Therefore, we use the output of an LLM from the l-th layer (A1,1(x1)) as the final output and evaluate the metrics based on it.",
"Mixture-of-Experts (MoE) (Shazeer et al., 2017) is a prominent and well-established technique in machine learning where multiple expert networks specialize in different skill sets. The MoE approach has shown significant success across various applications due to its ability to leverage diverse model capabilities for complex problem-solving tasks. Our MoA method draws inspiration from this methodology.",
"A typical MoE design consists of a stack of layers known as MoE layers. Each layer comprises a set of n expert networks alongside a gating network and includes residual connections for improved gradient flow. Formally, for layer i, this design can be expressed as follows:",
"where Gij represents the output from the gating network corresponding to expert j, and Eij denotes the function computed by expert network j. The leverage of multiple experts allows the model to learn different skill sets and focus on various aspects of the task at hand.",
"From a high-level perspective, our proposed MoA framework extends the MoE concept to the model level by operating at the model level rather than at the activation level. Specifically, our MoA approach leverages LLMs and operates entirely through the prompt interface rather than requiring modifications to internal activations or weights. This means that instead of having specialized sub-networks within a single model like in MoE, we utilize multiple full-fledged LLMs across different layers. Note that in our approach, we consolidate the roles of the gating network and expert networks using a LLM, as the intrinsic capacity of LLMs allows them to effectively regularize inputs by interpreting prompts and generating coherent outputs without needing external mechanisms for coordination.",
"Moreover, since this method relies solely on prompting capabilities inherent within off-the-shelf models: (1) It eliminates computational overhead associated with fine-tuning; (2) It provides flexibility and scalability: our method can be applied to the latest LLMs regardless of their size or architecture.",
"This section presents a comprehensive evaluation of our proposed MoA. Our findings show that:",
"We achieve significant improvements on AlpacaEval 2.0, MT-Bench, and FLASK bench- marks. Notably, with open-source models only, our approach outperforms GPT-40 on AlpacaEval 2.0 and FLASK.",
"We conduct extensive experiments to provide better understandings of the internal mecha- nism of MoA.",
"Through a detailed budget analysis, several implementations of MoA can deliver perfor- mance comparable to GPT-4 Turbo while being 2\u00d7 more cost-effective.",
"Benchmarks We mainly evaluate models on AlpacaEval 2.0 (Dubois et al., 2024), a leading benchmark for assessing the alignment of LLMs with human preferences. It contains 805 instructions representative of real use cases. Each model's response is directly compared against that of the GPT-4 (gpt-4-1106-preview), with a GPT-4-based evaluator determining the likelihood of preferring the evaluated model's response. To ensure fairness, the evaluation employs length-controlled (LC) win rates, effectively neutralizing length bias.2",
"Additionally, we also evaluate on MT-Bench (Zheng et al., 2023) and FLASK (Ye et al., 2023). MT-Bench uses GPT-4 to grade and give a score to model's answer. FLASK, on the other hand, offers a more granular evaluation with 12 skill-specific scores.",
"Models In our study, we constructed our default MoA by using only open-source models to achieve competitive performance. The models included are: Qwen1.5-110B-Chat (Bai et al., 2023), Qwen1.5- 72B-Chat, WizardLM-8x22B (Xu et al., 2023a), LLaMA-3-70B-Instruct (Touvron et al., 2023b), Mixtral-8x22B-v0.1 (Jiang et al., 2024), dbrx-instruct (The Mosaic Research Team, 2024). We construct 3 MoA layers and use the same set of models in each MoA layer. We use Qwen1.5-110B- Chat as the aggregator in the last layer. We also developed a variant called MoA w/ GPT-40, which prioritizes high-quality outputs by using GPT-40 as the aggregator in the final MoA layer. Another variant, MoA-Lite, emphasizes cost-effectiveness. It uses the same set of models as proposers but includes only 2 MoA layers and employs Qwen1.5-72B-Chat as the aggregator. This makes it more cost-effective than GPT-40 while achieving a 1.8% improvement in quality on AlpacaEval 2.0. We ensure strict adherence to the licensing terms of all models utilized in this research. For open-source models, all inferences were ran through Together Inference Endpoint.3",
"In this subsection, we present our evaluation results on three standard benchmarks: AlpacaEval 2.0, MT-Bench, and FLASK. These benchmarks were chosen to comprehensively assess the performance of our approach and compare with the state-of-the-art LLMs.",
"AlpacaEval 2.0 We conducted comparisons against leading models such as GPT-4 and other state-of-the-art open-source models. The detailed results are presented in Table 2a where our MoA methodology achieved top positions on the AlpacaEval 2.0 leaderboard, demonstrating a remarkable 8.2% absolute improvement over the previous top model, GPT-40. Moreover, it is particularly noteworthy that our model outperformed GPT-40 using solely open-source models, achieving a margin of 7.6% absolute improvement from 57.5% (GPT-40) to 65.1% (MoA). Our MoA-Lite setup uses less layers and being more cost-effective. Even with this lighter approach, we still outperform the best model by 1.8%, improving from 57.5% (GPT-40) to 59.3% (MoA-Lite). This further highlights the effectiveness of our method in leveraging open-source models capabilities with varying compute budget to their fullest potential.",
"MT-Bench Though improvements over individual models on the MT-Bench are rel- atively incremental, this is understandable given that current models already perform exceptionally well on this benchmark, as a single model alone can achieve scores greater than 9 out of 10. Despite the marginal enhancements, our approach still secures the top position on the leaderboard. This demonstrates that even with already highly optimized benchmarks, our method can push the boundaries further, maintain- ing the leadership.",
"FLASK FLASK provides fine-grained evaluation of models. Among those met- rics, MoA excels in several key aspects. Specifically, our methodology shows signif- icant improvement in robustness, correct- ness, efficiency, factuality, commonsense, insightfulness, completeness, compared to the single model score of the aggregator, Qwen-110B-Chat. Additionally, MoA also outperforms GPT-4 Omni in terms of correctness, factuality, insightfulness, completeness, and metacognition. One metric where MoA did not do as well was conciseness; the model produced outputs that were marginally more verbose.",
"In this subsection, we conduct experiments that provide us better understandings of the internal mechanism of Mixture-of-Agents. We summarize key insights below.",
"Mixture-of-Agents significantly outperforms LLM rankers. First, we compare Mixture-of- Agents with an LLM-based ranker which uses the aggregator model to select one of the answers that are generated by the proposers, instead of generating a new output. The results are shown in Figure 4, where we can observe that the MoA approach significantly outperforms an LLM-ranker baseline. The fact that MoA outperforms the ranking approach suggests that the aggregator does not simply select one of the generated answers by the proposers, but potentially performs sophisticated aggregation over all proposed generations.",
"MoA tends to incorporate the best proposed answers. We also compare the aggregator's response with the proposers' responses via similarity scores such as BLEU (Papineni et al., 2002) which reflects n-gram overlaps. Within each sample, given n proposed answers by the proposers, we calculate the the Spearman's rank correlation coefficient between n similar scores and n preference scores determined by the GPT-4 based evaluator. The results in Figure 4 indeed confirms a positive correlation between the win rate and the BLEU score. We also provide results with Levenshtein similarity (RapidFuzz, 2023) or TF-IDF as opposed to BLEU scores in Appendix A. where both alternative approaches for textual similarities also yield positive correlation with the preference scores.",
"To understand the relationship between budget, token usage, and LC win rates, we conducted a budget and token analysis. Figure 5a and Figure 5b illustrate these relationships.",
"In order to improve generation quality of LLMs, recent researches have experienced great progresses in optimizing LLMs to various downstream tasks through prompt engineering. Chain of Thought (CoT) (Wei et al., 2022; Kojima et al., 2022) prompting techniques represent a linear problem- solving approach where each step builds upon the previous one. Fu et al. (2022) applied CoT to multi-step reasoning tasks. To automate CoT prompting, Auto-CoT (Zhang et al., 2022b) constructs demonstrations by sampling diverse questions and generating reasoning chains. Active-Prompt (Diao",
"A straightforward solution to leverage the strengths of multiple models is reranking outputs from different models. For instance, Jiang et al. (2023) introduce PAIRRANKER, which performs pairwise comparisons on candidate outputs to select the best one, showing improvements on a self-constructed instruction dataset. To address the substantial computational costs associated with multi-LLM inference, other studies have explored training a router that predicts the best-performing model from a fixed set of LLMs for a given input (Wang et al., 2024a; Shnitzer et al., 2024; Lu et al., 2023). Additionally, FrugalGPT (Chen et al., 2023b) proposed reducing the cost of using LLMs by employing different models in a cascading manner. In order to better leverage the responses of multiple models, Jiang et al. (2023) trained a GENFUSER, a model that was trained to generate an improved response to capitalize on the strengths of multiple candidates. Huang et al. (2024) proposed to fuse the outputs of different models by averaging their output probability distributions.",
"Another line of work is multi-agent collaboration. Several studies explore using multiple large language models as agents that collectively discuss and reason through given problems interactively. Du et al. (2023) establishes a mechanism for symmetric discussions among agents. Around the same time, MAD (Liang et al., 2023) introduces an asymmetric mechanism design, with different roles, i.e., debater and judge. Other similar works include (Chan et al., 2023). Moreover, ReConcile (Chen et al., 2023a) exemplifies an asymmetric discussion involving weighted voting. To understand discussion more deeply, Zhang et al. (2023) aim to explain such collaboration mechanism in a social psychology view. Wang et al. (2024b) systematically compared multi-agent approaches and found a single agent with a strong prompt including detailed demonstrations can achieve comparable response quality to multi-agent approaches.",
"This paper introduces a Mixture-of-Agents approach aimed at leveraging the capabilities of multiple LLMs via successive stages for iterative collaboration. Our method harnesses the collective strengths of agents in the Mixture-of-Agents family, and can significantly improve upon the output quality of each individual model. Empirical evaluations conducted on AlpacaEval 2.0, MT-Bench, and FLASK demonstrated substantial improvements in response quality, with our approach achieving the LC win rate up to 65%. These findings validate our hypothesis that integrating diverse perspectives from various models can lead to superior performance compared to relying on a single model alone. In addition, we provide insights into improving the design of MoA; systematic optimization of MoA architecture is an interesting direction for future work.",
"Our proposed method requires iterative aggregation of model responses, which means the model cannot decide the first token until the last MoA layer is reached. This potentially results in a high Time to First Token (TTFT), which can negatively impact user experience. To mitigate this issue, we can limit the number of MoA layers, as the first response aggregation has the most significant boost on generation quality. Future work could explore chunk-wise aggregation instead of aggregating entire responses at once, which can reduce TTFT while maintaining response quality.",
"This study holds the potential to enhance the effectiveness of LLM-driven chat assistants, thereby making AI more accessible. Moreover, since the intermediate outputs that are expressed in natural language, MoA presented improves the interpretability of models. This enhanced interpretability facilitates better alignment with human reasoning.",
"We present results using TF-IDF-based similarity and Levenshtein similarity when calculating the Spearman correlation. Specifically, within each sample of n proposed answers, we calculate Spearman correlation coefficient between the n similarity scores and the n preference scores determined by the GPT-4-based evaluator. As shown in Figure 6, there is indeed a positive correlation between win rate and both TF-IDF similarity and Levenshtein similarity.",
"This section introduces the setup of the LLM-Ranker used in this paper. The LLM-Ranker is designed to evaluate and rank the best output generated by some LLMs. Table 5 presents the template for prompting the model during these evaluations. We use this LLM-Ranker to pick the best answer among and use AlpacaEval evaluator to evaluate the best ranked answer.",
"We present a case study in this section. Due to the length of the responses generated by all models, we will only show selected fragments for brevity. To illustrate how the aggregator synthesizes the response, we underlined similar expressions between the proposed responses and the aggregated response in different colors. We omit the content that all proposed responses have mentioned.",
"Table 6 showcases the responses generated by different proposers. The aggregated response generated by Qwen1.5-110B-Chat reflects a high preference for its own content but also incorporates key points from Llama-3-70B-Instruct and WizardLM 8x22B. Notably, GPT-4's preference score for WizardLM 8x22B's response is 0.99, and the final aggregated answer also achieves a preference score of 0.99.",
"Meanwhile, Table 7 presents another case where none of the proposed responses achieve a high GPT-4 preference score. Despite this, the aggregator successfully identifies and incorporates the strong points from these responses, achieving a preference score of 0.33.",
"Here, we demonstrate that our approach is applicable to reasoning tasks, such as those in the MATH dataset Hendrycks et al. (2021). The results are presented in Table 8, where we show that our method consistently enhances accuracy by a significant margin. This indicates that our approach is also effective for this type of task. Notably, our method is complementary to existing reasoning techniques such as Chain of Thought Wei et al. (2022) and Self-consistency Wang et al. (2022)."
],
"tables": [
"Table 2: Results on AlpacaEval 2.0 and MT-Bench. For AlpacaEval 2.0, MoA and MoA-Lite correspond to the 6 proposer with 3 layers and with 2 layer respectively. MoA w/ GPT-40 corresponds to using GPT-40 as the final aggregator in MoA. We ran our experiments three times and reported the average scores along with the standard deviation. \u2020 denotes our replication of the AlpacaEval results. We ran all the MT-Bench scores ourselves to get turn-based scores.",
"Table 1: Aggregate-and-Synthesize Prompt to integrate responses from other models.",
"Table 5: Prompt for ranking with LLMs",
"Table 6: Case: Some models produce high quality answers.",
"Table 7: Case: all proposed responses are not good enough.",
"Table 8: Results on the MATH task. We evaluate different aggregators, with all six models serving as proposers in each MoA layer."
]
},
"total_pages": 15,
"width": 612,
"height": 11880,
"tags": [
"text",
"font",
"screenshot",
"letter",
"paper",
"document",
"black and white",
"printing",
"circle",
"parallel",
"diagram",
"number",
"ink"
],
"has_text": true,
"sections": [
{
"text": "Mixture-of-Agents Enhances Large Language Model\nCapabilities\nJunlin Wang Jue Wang Ben Athiwaratkun\nDuke University Together AI Together AI\nTogether AI jue@together.ai ben@together.ai\njunlin.wang2@duke.edu\nCe Zhang James Zou\nUniversity of Chicago Stanford University\nTogether Al Together Al\ncez@uchicago.edu jamesz@stanford.edu\nAbstract\nRecent advances in large language models (LLMs) demonstrate substantial capa-\nbilities in natural language understanding and generation tasks. With the growing\nnumber of LLMs, how to harness the collective expertise of multiple LLMs is an\nexciting open direction. Toward this goal, we propose a new approach that lever-\nages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA)\nmethodology. In our approach, we construct a layered MoA architecture wherein\neach layer comprises multiple LLM agents. Each agent takes all the outputs from\nagents in the previous layer as auxiliary information in generating its response,\nMoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and\nFLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source\nLLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of\n65.1% compared to 57.5% by GPT-4 Omni.1\n1 Introduction\nLarge language models (LLMs) (Zhang et al., 2022a; Chowdhery et al., 2022; Touvron et al., 2023a;\nTeam et al., 2023; Brown et al., 2020; OpenAI, 2023) have significantly advanced the field of natural\nlanguage understanding and generation in recent years. These models are pretrained on vast amounts\nof data and subsequently aligned with human preferences to generate helpful and coherent outputs\narXiv:2406.04692v1 [cs.CL] 7 Jun 2024\n(Ouyang et al., 2022). However, despite the plethora of LLMs and their impressive achievements,\nthey still face inherent constraints on model size and training data. Further scaling up these models is\nexceptionally costly, often requiring extensive retraining on several trillion tokens.\nAt the same time, different LLMs possess unique strengths and specialize in various tasks aspects.\nFor instance, some models excel at complex instruction following (Xu et al., 2023a) while others may\nbe better suited for code generation (Roziere et al., 2023; Guo et al., 2024). This diversity in skill sets\namong different LLMs presents an intriguing question: Can we harness the collective expertise of\nmultiple LLMs to create a more capable and robust model?\nOur answer to this question is Yes. We identify an inherent phenomenon we term the collaborativeness\nof LLMs - wherein an LLM tends to generate better responses when presented with outputs\nfrom other models, even if these other models are less capable by itself. Figure I showcases\nthe LC win rate on the AlpacaEval 2.0 benchmark (Dubois et al., 2024) for 6 popular LLMs.\n'Our code can be found in: https://github.com/togethercomputer/noa.\nPreprint Under review.",
"lines": [
{
"text": "Mixture-of-Agents Enhances Large Language Model",
"bounds": {
"top_left": {
"x": 109,
"y": 97
},
"top_right": {
"x": 502,
"y": 97
},
"bottom_right": {
"x": 502,
"y": 117
},
"bottom_left": {
"x": 109,
"y": 117
},
"width": 393,
"height": 20
},
"words": [
{
"text": "Mixture-of-Agents",
"bounds": {
"top_left": {
"x": 109,
"y": 98
},
"top_right": {
"x": 247,
"y": 98
},
"bottom_right": {
"x": 247,
"y": 118
},
"bottom_left": {
"x": 109,
"y": 118
},
"width": 138,
"height": 20
}
},
{
// Additional words and lines with detailed coordinate data truncated
}
]
}
]
}
]
}
Processed in 37 seconds
Gemini OCR Performance
Processing Time: 43 seconds
Accuracy: Incomplete extraction due to running out of tokens, processing only a fraction of the document
Output Quality: Limited to extracting metadata and first page elements before encountering errors
Coordinate Precision: High precision for elements it processed but failed to maintain throughout
Reliability: Encountered processing limitations leading to incomplete output
JigsawStack vOCR Performance
Processing Time: 37 seconds
Accuracy: Comprehensive extraction of all 15 pages with complete contextual information
Output Quality: Well-structured JSON with hierarchical organization of document elements
Coordinate Precision: Detailed bounding box coordinates with width/height measurements for every text element
Reliability: Successfully processed over 350,000 tokens of content with no degradation
Key Findings
Processing Efficiency: JigsawStack vOCR consistently delivers faster processing times across various document types while maintaining high-quality results
Structured Data Organization: JigsawStack provides comprehensive output structures with hierarchical formatting that makes information immediately actionable
Multilingual Capabilities: JigsawStack shows particular strength in handling non-Latin scripts like Japanese and Telugu with proper Unicode encoding
Contextual Understanding: JigsawStack offers dual-layer recognition that provides both raw text and enhanced interpretations for challenging content
Document Intelligence: JigsawStack includes valuable metadata about document formatting, language detection, and visual presentation
Conclusion
Our benchmarking shows that both systems offer effective OCR capabilities with different strengths. Gemini OCR provides good basic text recognition with solid performance for straightforward content. JigsawStack vOCR delivers enhanced functionality through its structured output formats, superior multilingual support, and comprehensive document analysis.
Run these tests yourself here: Google Colab Notebook.
Recommended Use:
Basic Text Recognition: Both systems perform effectively for simple English text extraction
Detailed Document Analysis: JigsawStack vOCR offers more comprehensive structured data with spatial positioning and formatting details
Multilingual Processing: JigsawStack demonstrates notable advantages for non-Latin scripts and complex language handling
Time-Sensitive Applications: JigsawStack's consistently faster processing times provide efficiency benefits for high-volume document processing