How to Build Privacy-Focused AI Agents Locally with Ollama


While the tech world focuses on the impressive capabilities of cloud-based AI agents like ChatGPT and Claude, we're exploring a different question: Can we build truly intelligent AI agents that run entirely on users' local devices?
The appeal is obvious: complete data privacy, zero network latency, freedom from service limitations, and genuinely personalized experiences. But the technical challenges are equally significant: limited local model capabilities, complex tool calling mechanisms, and ensuring a consistent user experience.
After extensive exploration, we've completed a major upgrade to NativeMind's conversational architecture, taking our first significant step toward local AI agents.
Core Challenges of Local AI Agents
Cloud models operate with hundreds of billions of parameters, while models that run smoothly on typical consumer devices usually have only a few billion parameters. This capability gap is particularly evident in agent tasks: complex reasoning, tool selection, and task planning all demand high model performance.
Local models often struggle with tool calling format accuracy compared to large cloud models. A single format error can break the entire agent workflow, which is unacceptable for user experience. Users expect agents to be both responsive and intelligent.
Achieving this balance with limited local computational resources presents an extraordinarily difficult challenge.
Building AI Agents on Ollama
Prompt-Based Tool Calling
We chose not to use Ollama's native tool calling API. While the native API can be more concise in certain scenarios, it has clear limitations: it only supports specific models, and its behavior varies significantly from model to model.
Instead, we implemented a completely prompt-based tool calling system combined with multi-layer parsing. This approach has been validated by successful products like Cline. Although more challenging to implement, it delivers greater value by providing users with a consistent agent experience regardless of model limitations, while allowing us to continuously optimize parsing accuracy.
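To make "prompt-based" concrete, here is a minimal sketch of how a tool can be described to the model in the system prompt; the wording and tool set are illustrative rather than NativeMind's actual prompt.
// Illustrative system-prompt fragment: tools are described in plain text, so
// any instruction-following model can use them, whether or not it supports a
// native tool-calling API.
const toolPrompt = `
You have access to the following tools.

## search_online
Search the web and return structured results (title, URL, snippet).
Parameters:
- query (required): the search terms
- max_results (optional): how many results to return

To call a tool, reply with a <tool_calls> block containing one element named
after the tool and one child element per parameter.
`;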
In this system, AI calls tools by outputting specific XML formats. When searching for information, the AI outputs:
<tool_calls>
  <search_online>
    <query>machine learning development trends</query>
    <max_results>5</max_results>
  </search_online>
</tool_calls>
The system detects this format, parses and executes the corresponding search operation, then returns results to the AI for continued processing.
Considering local model limitations, we designed a multi-layer parsing system to handle tool calling reliability: a standard parsing layer handles well-formatted tool calls, while a fault-tolerant parsing layer processes incomplete but clearly intentioned calls. This multi-layer fault tolerance ensures that even when the model output isn't perfect, user intent can still be correctly understood and executed.
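A minimal sketch of this layered parsing, assuming a browser DOMParser is available (as it is in an extension context) and using illustrative tool names; the actual implementation covers more tools and edge cases.
interface ToolCall {
  tool: string;
  params: Record<string, string>;
}

// Layer 1: strict parsing of a well-formed <tool_calls> block.
function parseStrict(output: string): ToolCall | null {
  const match = output.match(/<tool_calls>([\s\S]*?)<\/tool_calls>/);
  if (!match) return null;
  const doc = new DOMParser().parseFromString(`<root>${match[1]}</root>`, "application/xml");
  if (doc.querySelector("parsererror")) return null;
  const toolEl = doc.documentElement.firstElementChild;
  if (!toolEl) return null;
  const params: Record<string, string> = {};
  for (const child of Array.from(toolEl.children)) {
    params[child.tagName] = child.textContent?.trim() ?? "";
  }
  return { tool: toolEl.tagName, params };
}

// Layer 2: fault-tolerant parsing for incomplete output (e.g. a missing
// closing tag), as long as the intended tool and parameters are recognizable.
function parseTolerant(output: string): ToolCall | null {
  const toolMatch = output.match(/<(search_online|fetch_page)>/); // known tool names
  if (!toolMatch) return null;
  const tool = toolMatch[1];
  const params: Record<string, string> = {};
  const paramRegex = /<(\w+)>([^<]*)/g;
  let m: RegExpExecArray | null;
  while ((m = paramRegex.exec(output)) !== null) {
    if (m[1] !== tool && m[1] !== "tool_calls") params[m[1]] = m[2].trim();
  }
  return Object.keys(params).length > 0 ? { tool, params } : null;
}

// Try the strict layer first, then fall back to the tolerant layer.
function parseToolCall(output: string): ToolCall | null {
  return parseStrict(output) ?? parseTolerant(output);
}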
We also redesigned our tools with clear responsibility boundaries. For example, search tools focus purely on information retrieval, returning only structured search results without actual content, while dedicated content-fetching tools handle complete webpage retrieval.
This modular design allows agents to flexibly call tools based on task requirements. For instance, of five search results, only three pages might actually be valuable, or sometimes just titles and descriptions provide sufficient information without needing a detailed content review. This approach also reduces individual tool call complexity and overall token consumption.
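As an illustration of these boundaries (the names and fields are assumptions, not NativeMind's exact API), the search tool returns only lightweight metadata while a separate tool fetches full content:
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// search_online: information retrieval only; no page bodies are downloaded.
interface SearchOnlineTool {
  name: "search_online";
  run(query: string, maxResults?: number): Promise<SearchResult[]>;
}

// fetch_page: full webpage retrieval, called only for results worth reading.
interface FetchPageTool {
  name: "fetch_page";
  run(url: string): Promise<string>;
}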
To prevent agents from falling into infinite loops during complex tasks, and to limit context pressure on local devices, we implemented iteration control, capping each session at 5 rounds of tool calls.
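A minimal sketch of that loop, reusing ToolCall and parseToolCall from the parsing sketch above; chatWithLocalModel and executeTool are hypothetical stand-ins for the model call and tool dispatch.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Stand-ins, not NativeMind's actual functions.
declare function chatWithLocalModel(messages: ChatMessage[]): Promise<string>;
declare function executeTool(call: ToolCall): Promise<string>;

const MAX_TOOL_ROUNDS = 5; // cap per session to avoid runaway loops

async function runAgent(messages: ChatMessage[]): Promise<string> {
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const reply = await chatWithLocalModel(messages);
    const call = parseToolCall(reply);
    if (!call) return reply; // no tool call: this is the final answer
    const result = await executeTool(call);
    messages.push({ role: "assistant", content: reply });
    messages.push({ role: "user", content: `<tool_result>${result}</tool_result>` });
  }
  // Budget exhausted: ask the model to answer with what it has gathered so far.
  messages.push({ role: "user", content: "Tool budget reached. Answer using the information collected so far." });
  return chatWithLocalModel(messages);
}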
Environment Awareness System
To leverage local agents' advantages in information integration, we designed a progressive workflow: instead of providing all information to the agent at once, we dynamically acquire information based on task needs, automatically select the most appropriate information sources based on question types, and decide next actions based on already obtained information.
This design enables capability-limited local models to handle complex environments while reducing token consumption.
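In practice this progressive behavior is steered through prompt guidance rather than hard-coded routing; a hypothetical fragment of such guidance might look like this.
// Hypothetical prompt guidance nudging the model toward progressive
// information acquisition instead of pulling everything in up front.
const workflowGuidance = `
Work step by step:
1. Check environment_details first; prefer already-open tabs and loaded PDFs.
2. Call search_online only when the local context cannot answer the question.
3. After a search, fetch full page content only for results that look relevant.
4. Stop gathering as soon as you have enough information to answer.
`;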
Our dynamic environment context system (environment_details) builds real-time comprehensive environment descriptions, including current time, available tabs, PDF documents, images, etc., using a structured XML format for easy AI comprehension.
For example, when a user asks "analyze the correlation between this webpage and that report," the AI accurately understands that "this webpage" refers to the currently selected tab and "that report" refers to the loaded PDF file.
This environmental awareness enables AI to better understand users' current context and make more intelligent decisions.
<user_message>
Analyze the correlation between this webpage and that report
</user_message>
<environment_details>
Current Time: 2024-07-24 14:30:25
Available Tabs: [
- Tab 1: React Documentation (current)
- Tab 2: Vue.js Guide
]
Available PDFs: [
- PDF 1: Frontend_Framework_Analysis.pdf (24 pages)
]
</environment_details>
To avoid context redundancy, we implemented differential update mechanisms. In multi-turn conversations, environment update information is only sent when the environment changes, maintaining context accuracy while controlling resource consumption.
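A sketch of how such a context block could be assembled and sent only when it changes; the field names mirror the example above, while the real builder covers more resource types.
interface TabInfo { title: string; isCurrent: boolean }
interface PdfInfo { name: string; pages: number }

// Build the body without the timestamp so it can be compared across turns.
function buildEnvironmentBody(tabs: TabInfo[], pdfs: PdfInfo[]): string {
  const tabLines = tabs
    .map((t, i) => `- Tab ${i + 1}: ${t.title}${t.isCurrent ? " (current)" : ""}`)
    .join("\n");
  const pdfLines = pdfs
    .map((p, i) => `- PDF ${i + 1}: ${p.name} (${p.pages} pages)`)
    .join("\n");
  return `Available Tabs: [\n${tabLines}\n]\nAvailable PDFs: [\n${pdfLines}\n]`;
}

// Differential update: emit the block only when tabs or PDFs changed since
// the last turn; the timestamp alone never triggers a resend.
let lastBody = "";
function environmentUpdateIfChanged(tabs: TabInfo[], pdfs: PdfInfo[]): string | null {
  const body = buildEnvironmentBody(tabs, pdfs);
  if (body === lastBody) return null;
  lastBody = body;
  const now = new Date().toISOString().replace("T", " ").slice(0, 19);
  return `<environment_details>\nCurrent Time: ${now}\n${body}\n</environment_details>`;
}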
Model Adaptation and Performance
We conducted comprehensive testing on mainstream local models across scenarios, including basic search responses, multi-resource integration analysis, systematic information collection and comparison, and comprehensive text-image-PDF processing. Evaluation dimensions covered answer relevance, tool calling effectiveness, language consistency, and other key metrics. Testing included cloud models as comparison benchmarks to validate our local agent architecture's effectiveness.
Results show that local models demonstrate promising potential under our agent architecture. More importantly, even weaker models perform better under the new architecture than traditional approaches, demonstrating our technical solution's effectiveness.
User experience has seen qualitative improvements. Transparent execution and immediate feedback significantly improved user satisfaction. Users can now see the agent's complete workflow and understand the logic behind each step. This transparency not only enhances the experience but also builds trust in the agent's results.
Local vs Cloud Model Comparison
Testing reveals that current local small models already show decent performance in basic operations, with excellent local models approaching cloud model effectiveness in certain scenarios.
These results demonstrate the feasibility of local agents at the current stage, particularly in tool calling and task execution, where excellent local models can already provide near-cloud-model experiences in specific scenarios.
| Rank | Model | Success Rate | Analysis |
| --- | --- | --- | --- |
| 1 | GPT-4.1 mini | 82.6% | Best overall performance, occasional language switching issues |
| 2 | Qwen3 8B | 65.2% | Balanced performance in tool calling and reasoning |
| 3 | GPT-4o mini | 65.2% | Strong capabilities but language consistency below expectations |
| 4 | Qwen3 4B | 65.2% | Best value proposition, recommended for daily use |
| 5 | Qwen3 1.7B | 43.4% | Outstanding among lightweight models |
| 6 | Qwen2.5 VL 7B | 39.1% | Excellent image analysis, but unstable intent recognition and tool calling |
| 7 | Qwen2.5 VL 3B | 34.8% | Strong image analysis, but inconsistent intent recognition and tool calling |
| 8 | Gemma3 4B | 30.4% | Good image analysis, but intent recognition and tool calling need improvement; uses only ~260 tokens per image vs Qwen2.5 VL's 1260 |
| 9 | DeepSeek R1 8B | 26.1% | Failures mainly due to tool calling format errors |
| 10 | Qwen3 0.6B | 26.0% | Ultra-lightweight, suitable for extremely resource-constrained environments |
| 11 | Phi4 mini | 4.3% | Currently not recommended for agent scenarios |
Qwen3 4B achieves the optimal balance between performance and efficiency, making it our top recommendation for local models. Under our new architecture, it achieves a 65% task success rate, matching GPT-4o mini's performance.
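To try the recommended model yourself, Ollama exposes a local HTTP API on port 11434; a minimal (non-streaming) chat request might look like this. The model tag is an assumption; use whatever ollama list reports on your machine.
// Minimal chat request against a local Ollama instance.
async function askLocalModel(question: string): Promise<string> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3:4b", // assumed tag for the recommended model
      messages: [{ role: "user", content: question }],
      stream: false, // one JSON response instead of a token stream
    }),
  });
  const data = await response.json();
  return data.message.content;
}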
For users seeking ultimate performance, Qwen3 8B provides stronger reasoning capabilities. For resource-constrained environments, Qwen3 1.7B and 0.6B still deliver basically usable experiences.
Notably, local models currently struggle with language consistency, with the Qwen series performing relatively better, especially with more stable Chinese language support. The multimodal Qwen2.5 VL series shows unique advantages in processing image content, though there's still room for improvement in tool calling stability.
Unique Value of Local AI Agents
Through NativeMind's implementation, we've explored the possibility of building intelligent agents on local devices. This approach delivers unique value that cloud solutions cannot match.
Privacy protection is the most significant advantage, with all data processing completed locally, giving users complete control over their information. Instant response provides zero-network-latency interaction experiences, with particularly obvious advantages in poor network conditions.
We also see promising development potential for local AI agents: new-generation local models keep advancing, stronger local computing power provides a hardware foundation for complex agents, and growing user emphasis on data privacy and localized experiences drives market demand.
Based on this agent system exploration, we're planning further product evolution: supporting more types of local tools and services, browser automation capabilities, MCP support, enhanced complex task planning abilities, and more personalized experiences based on user habits.
The new agent capabilities have launched in our latest version. Install NativeMind now and follow the setup guide to experience it immediately.
Written by NativeMind
Your fully private, open-source, on-device AI assistant. By connecting to Ollama local LLMs, NativeMind delivers the latest AI capabilities right inside your favourite browser — without sending a single byte to cloud servers.