Meet Lucy - Part 1


In the era of MCP servers, tool support (or function calling) in LLMs has become essential. We read a lot about small local models (SLMs) and their inability to correctly detect "function calling" in prompts. If that's indeed the case, it's truly problematic, as it would rule out many use cases on small machines. And when I say small machine, I'm not just talking about a Raspberry Pi 5; I'm also thinking of laptops like my MacBook Air which, although equipped with an M4 chip, starts to struggle with models over 7B parameters (and sometimes even a bit before that).
After some introductory elements, I'm going to talk to you about a local model that seems very promising in this area: Lucy.
All my tests were done using Docker Model Runner, but you can easily adapt the source code to work with Ollama or Llama.cpp since I use the OpenAI Go SDK.
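To give an idea of what that adaptation looks like, here's a minimal, hypothetical sketch (not the code from my repository), assuming the openai-go v1 SDK and Docker Model Runner exposing its OpenAI-compatible API on port 12434; swap the base URL for Ollama (http://localhost:11434/v1) or for your Llama.cpp server if needed.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

func main() {
	// Assumption: Docker Model Runner with TCP host access enabled on port 12434.
	// For Ollama, use http://localhost:11434/v1; for Llama.cpp, your server's URL.
	client := openai.NewClient(
		option.WithBaseURL("http://localhost:12434/engines/llama.cpp/v1/"),
		option.WithAPIKey(""), // local runtimes usually ignore the API key
	)

	completion, err := client.Chat.Completions.New(context.Background(), openai.ChatCompletionNewParams{
		Model: "hf.co/menlo/lucy-gguf:q8_0",
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage("Say hello in one short sentence."),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(completion.Choices[0].Message.Content)
}
```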
Quick Recap on Function Calling
"Function calling" for LLMs that support it is the ability for an LLM to detect/recognize in a user prompt the intention of wanting to "execute something," such as doing an addition, wanting to know the weather in a particular city, or wanting to search the Internet.
LLMs that "support tools," if you provide them upstream with a catalog of tools in JSON, will be able to make the connection between the user's intention and the corresponding tool. The LLM will then generate a JSON response with the tool name and the parameters to call it.
And then it's up to you to implement the tools in question and call them with the parameters provided by the LLM.
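To make this concrete, here's a hedged sketch of the round trip (reusing the client and imports from the snippet above, still assuming the openai-go v1 SDK, and with a purely illustrative get_weather tool): you send the tool catalog with the request, then read back the tool call the model generated.

```go
// askWeather sends a catalog with a single (illustrative) get_weather tool,
// then prints the tool call(s) the model generated: the function name and its
// arguments as a JSON string. Executing the real function is up to the caller.
func askWeather(ctx context.Context, client openai.Client) error {
	tools := []openai.ChatCompletionToolParam{{
		Function: openai.FunctionDefinitionParam{
			Name:        "get_weather",
			Description: openai.String("Get the current weather in a given city"),
			Parameters: openai.FunctionParameters{
				"type": "object",
				"properties": map[string]any{
					"city": map[string]any{"type": "string"},
				},
				"required": []string{"city"},
			},
		},
	}}

	completion, err := client.Chat.Completions.New(ctx, openai.ChatCompletionNewParams{
		Model: "hf.co/menlo/lucy-gguf:q8_0",
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage("What's the weather in Lyon?"),
		},
		Tools: tools,
	})
	if err != nil {
		return err
	}

	for _, toolCall := range completion.Choices[0].Message.ToolCalls {
		// e.g. get_weather {"city":"Lyon"}
		fmt.Println(toolCall.Function.Name, toolCall.Function.Arguments)
	}
	return nil
}
```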
Improving SLMs in Function Calling
There are several methods to improve SLMs at function calling, such as training on higher-quality and more varied data. For example, Salesforce AI Research developed xLAM, a family of models called "Large Action Models" (LAMs), specifically designed to improve function calling, reasoning, and planning.
There's targeted fine-tuning (on specific data), which reportedly can radically transform the performance of certain models.
There's also instruction tuning with reasoning: the model will first expose its reasoning in <think>…</think> tags, and then generate structured "function calls" in <tool_call>…</tool_call> tags.
There are certainly other methods, but I'm not an expert, so I've only cited those I understood.
Despite what these studies claim, I find that in practice the results are often not there.
Nevertheless, some models are better than others. For example: ai/qwen3:8B-Q4_K_M
You can also find it on Hugging Face: Qwen/Qwen3-8B-GGUF.
Lucy, the Model that Rivals the Giants
Very recently, Menlo.ai published a model that seems very promising: Lucy. It's a 1.7-billion-parameter model, optimized for mobile devices, that is said to rival much larger models.
There are two versions of Lucy:
Lucy-gguf: the standard model, with a 32k context
Lucy-128k-gguf: an extended version with a 128k context
✋ For my tests while writing this article, I used the standard version: lucy-gguf:q8_0.
From what I understand, Lucy is natively optimized for MCP and is said to have been trained to recognize standardized function calls compatible with the MCP ecosystem.
Lucy's main innovation lies in its way of "reasoning". Instead of treating <think> and </think> tags as simple thinking traces, Lucy uses them as a "dynamic task vector" machine. The model actively builds and refines its own task representations during inference.
💡 Put more simply: Lucy doesn't think with rigid logic, but builds at each step a sort of "living memory" of the task to accomplish.
📝 If you want to go further, here's some documentation on the subject https://arxiv.org/html/2508.00360v1
The main thing I retained is "natively optimized for MCP", so I had to verify how effective Lucy really is at function calling, and whether it would finally allow me to do "quality" function calling with small local models.
☕️ Get yourself some coffee, this is where it begins 🚀.
My Small Test Protocol: "Simple" Function Calling
✋ Simple Function Calling == there's only one tool to detect in the prompt.
I wrote a small piece of Go code to quickly check whether a model has reasonably correct tool support.
You can find the source code at 01-simple-function-calling/main.go
Let's see what this test consists of. I have a catalog of 2 tools (it's a simple and short test, only to allow an initial selection of models):
1. Available Test Tools
say_hello: Says hello to a person (parameter: name)
add_two_numbers: Adds two numbers (parameters: number1, number2)
2. Three Test Scenarios
Test 1 - Correct Tool Detection:
Question: "Tell me why the sky is blue and then say hello to Jean-Luc Picard. I love pineapple pizza!"
Expected: Call to the say_hello function
Goal: Check if the model correctly identifies the required action among "noisy" text
Test 2 - Mathematical Tool Detection:
Question: "Where is Bob? Add 2 and 3. What is the capital of France?"
Expected: Call to the add_two_numbers function
Goal: Test the ability to identify a mathematical operation in a mixed context
Test 3 - No Tool Call:
Question: "The best pizza topping is pineapple. What is the capital of France? I love cooking."
Expected: No function call
Goal: Verify that the model doesn't make false positives
3. Scoring System
Each successful test = +1 point
10 iterations × 3 tests = maximum score of 30
Final score converted to percentage
4. Evaluation Criteria
✅ Correct function called
✅ No function called when not needed
❌ Wrong function called
❌ No function called when needed
This system tests the precision, selectivity, and robustness of the model's ability to do "function calling".
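To illustrate the protocol, here's a hedged sketch of what a single scoring check could look like (this is not the exact code of 01-simple-function-calling/main.go; the scoreScenario helper is an assumption, and tools would be the catalog with say_hello and add_two_numbers built as in the earlier snippet):

```go
// scoreScenario runs one test question and returns 1 point if the model's
// behaviour matches the expectation: expected == "" means "no tool call at
// all"; otherwise exactly that function must be detected.
func scoreScenario(ctx context.Context, client openai.Client,
	tools []openai.ChatCompletionToolParam, question, expected string) int {

	completion, err := client.Chat.Completions.New(ctx, openai.ChatCompletionNewParams{
		Model:    "hf.co/menlo/lucy-gguf:q8_0",
		Messages: []openai.ChatCompletionMessageParamUnion{openai.UserMessage(question)},
		Tools:    tools,
	})
	if err != nil {
		return 0
	}
	toolCalls := completion.Choices[0].Message.ToolCalls

	switch {
	case expected == "" && len(toolCalls) == 0:
		return 1 // no false positive: nothing was called
	case expected != "" && len(toolCalls) > 0 && toolCalls[0].Function.Name == expected:
		return 1 // the right function was detected
	default:
		return 0 // wrong function, or a call where none was needed
	}
}
```

Summing this over 10 iterations of the three questions gives the score out of 30, which I then convert to a percentage.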
Test Results
Here are the results of the tests performed on the two models: Lucy (lucy-gguf:q8_0) and Qwen3 (qwen3:8B-Q4_K_M):
✋ During my tests, I also tried the lucy-gguf:q4_k_m model, but the results were not as good.
Test Environment
Hardware: MacBook Air M4, 32GB RAM
Iterations: 10 per model (0-9)
Functions Tested: say_hello, add_two_numbers, and handling of non-cataloged functions
Model Performance Summary
| Model | Final Score | Success Rate | Total Duration | Avg per Completion | Performance Index |
| --- | --- | --- | --- | --- | --- |
| Lucy (hf.co/menlo/lucy-gguf:q8_0) | 30/30 | 100.0% | 181.32s | 6.04s | 3.30x faster |
| Qwen3 (ai/qwen3:8B-Q4_K_M) | 30/30 | 100.0% | 597.22s | 19.91s | 1.00x baseline |
Accuracy
Both models achieved perfect accuracy (100% success rate)
Both models correctly handled all function calling scenarios
Consistent scoring of 3 points per iteration across all tests
Speed Comparison
Lucy is 3.30x faster than Qwen3 on average
Lucy's total time is 69.6% lower than Qwen3's
So Lucy's performance seems very promising. Now it would be interesting to see how it behaves in more complex scenarios: that is, with multiple tool calls in the same prompt.
My 2nd Small Test Protocol: "Loop" Function Calling
This time, the source code is at 02-function-calling-with-loop/main.go
So, what do I mean by "Loop" Function Calling? The term "loop" refers to the iterative conversational cycle between the program and the language model.
This time I have a user prompt that looks like this:
Make the sum of 40 and 2,
then say hello to Bob and to Sam,
make the sum of 5 and 37
Say hello to Alice
So in theory, the model should detect 5 different function calls and execute them in order:
calculate_sum with arguments {"a":40,"b":2}
say_hello with argument {"name":"Bob"}
say_hello with argument {"name":"Sam"}
calculate_sum with arguments {"a":5,"b":37}
say_hello with argument {"name":"Alice"}
To test the model I'm going to use the following algorithm:
Main loop: The program keeps making API requests until the model decides to stop; at each turn it checks completion.Choices[0].FinishReason, which is either "tool_calls" (keep going) or "stop" (done)
Question-answer-action cycle: At each iteration:
The model analyzes the request and history
It decides which functions to call (or stop)
The program executes the requested functions
The results are added to the history
The cycle starts again with the enriched context
Difference from a single call: Unlike a system that makes a single function call, this approach allows, among other things:
Complex tasks requiring multiple steps
Conditional decisions based on previous results
```mermaid
sequenceDiagram
    participant U as User
    participant M as Main Program
    participant API as OpenAI API
    participant F as Function Executor
    U->>M: Initial message
    M->>API: Request with available tools
    loop Until "stop" response
        API-->>M: Response with tool_calls
        M->>M: Add assistant message
        loop For each tool_call
            M->>F: Execute function(name, args)
            F-->>M: Result
            M->>M: Add result to history
            M->>M: Record in functionCallHistory
        end
        M->>API: New request with complete history
    end
    API-->>M: "stop" response
    M->>U: Display final summary
```
Key Point: Maintaining History
The complete conversation history is critical for the proper functioning of the system:
Conversational context: The model must know all previous function calls and their results to make coherent decisions
Logical continuity: Without history, the model would lose track of the conversation and could repeat actions or ignore important results
Dependency management: Some functions may depend on the results of previous calls
Technical implementation:
Each assistant message with tool_calls is added to the history BEFORE execution
Each function result is added as a tool message (openai.ToolMessage) with the corresponding tool call ID
The complete history is sent with each new API request
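Putting it all together, a hedged sketch of this loop with the openai-go v1 SDK might look like the following (executeFunction is a hypothetical dispatcher that runs calculate_sum or say_hello and returns the result as a JSON string; this is not a copy of 02-function-calling-with-loop/main.go):

```go
// runLoop keeps calling the model until it stops requesting tools, executing
// each requested function and feeding its result back into the history.
func runLoop(ctx context.Context, client openai.Client,
	tools []openai.ChatCompletionToolParam, userPrompt string) error {

	params := openai.ChatCompletionNewParams{
		Model:    "hf.co/menlo/lucy-gguf:q8_0",
		Messages: []openai.ChatCompletionMessageParamUnion{openai.UserMessage(userPrompt)},
		Tools:    tools,
	}

	for {
		completion, err := client.Chat.Completions.New(ctx, params)
		if err != nil {
			return err
		}
		message := completion.Choices[0].Message

		// The model no longer requests any tool: the conversation is over.
		if len(message.ToolCalls) == 0 || completion.Choices[0].FinishReason == "stop" {
			fmt.Println(message.Content)
			return nil
		}

		// 1. Add the assistant message (with its tool_calls) to the history
		//    BEFORE executing anything.
		params.Messages = append(params.Messages, message.ToParam())

		// 2. Execute each requested function and append its result as a tool
		//    message keyed by the tool call ID (argument order is
		//    (content, toolCallID) in openai-go v1).
		for _, toolCall := range message.ToolCalls {
			result := executeFunction(toolCall.Function.Name, toolCall.Function.Arguments) // hypothetical dispatcher
			params.Messages = append(params.Messages, openai.ToolMessage(result, toolCall.ID))
		}
		// 3. Loop: the next request carries the complete, enriched history.
	}
}
```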
New Test Results
This time too, I ran the tests on the same two models, Lucy and Qwen3, in the same test environment, with the functions calculate_sum and say_hello:
Main Report
| Model | Total Calls | Success Rate | Total Duration | Avg Duration | Efficiency |
| --- | --- | --- | --- | --- | --- |
| Lucy (hf.co/menlo/lucy-gguf:q8_0) | 5 | 100% | 24.26s | 4.85s | 4.46x faster |
| Qwen3 (ai/qwen3:latest) | 3 | 100% | 86.16s | 28.72s | 1.00x baseline |
Once again, in terms of execution speed, Lucy outperforms Qwen3.
And very importantly, if you take a look at the detailed reports of function calls made by each model, Qwen3 made only 3 function calls while Lucy properly made all 5 calls.
Detailed Function Call Analysis
Lucy
| Call # | Function | Arguments | Duration | Result | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":40,"b":2} | 5.54s | {"result": 42} | 7XQ9sMGL... |
| 2 | say_hello | {"name":"Bob"} | 5.00s | "👋 Hello, Bob!🙂" | yvdQOOeu... |
| 3 | say_hello | {"name":"Sam"} | 5.24s | "👋 Hello, Sam!🙂" | vIUWhXch... |
| 4 | calculate_sum | {"a":5,"b":37} | 5.52s | {"result": 42} | 0OaylxCQ... |
| 5 | say_hello | {"name":"Alice"} | 2.94s | "👋 Hello, Alice!🙂" | fCPjjqcO... |
Qwen3
| Call # | Function | Arguments | Duration | Result | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":40,"b":2} | 14.12s | {"result": 42} | 2gbfOz6B... |
| 2 | say_hello | {"name":"Bob"} | 48.37s | "👋 Hello, Bob!🙂" | YutmYuOw... |
| 3 | calculate_sum | {"a":5,"b":37} | 23.67s | {"result": 42} | gQqDdzME... |
So for this type of prompt, Lucy outperforms Qwen3 both in terms of speed and precision.
A 3rd test: Conditional Function Calling
I did one last test to see if the model could handle conditional function calls, with a prompt like this:
Make the sum of 30 and 2,
If the result is higher than 40
Then say hello to Bob Else to Sam
And the results were:
| Call # | Function | Arguments | Result | Duration | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":30,"b":2} | {"result": 32} | 6.85s | mb8nWkux... |
| 2 | say_hello | {"name":"Sam"} | "👋 Hello, Sam!🙂" | 6.53s | Wh5D3bXQ... |
So I modified the prompt to test the else part of the condition:
Make the sum of 40 and 2,
If the result is higher than 40
Then say hello to Bob Else to Sam
And once again, Lucy detected the two function calls and applied the condition correctly:
| Call # | Function | Arguments | Result | Duration | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":40,"b":2} | {"result": 42} | 3.39s | LyCfTlKv... |
| 2 | say_hello | {"name":"Bob"} | "👋 Hello, Bob!🙂" | 3.68s | b5kUNUm6... |
Of course I'll need to do more advanced tests in terms of prompt complexity, but for now I think I have a crush on Lucy. Next step: make a useful and complete example with Lucy and take the opportunity to test its other capabilities.