Meet Lucy - Part 1

In the era of MCP servers, tool support (or function calling) in models (LLMs) has become essential. We read a lot about small local models (SLMs) and their inability to correctly detect "function calling" in prompts. If that's indeed the case, it's truly problematic, as it would rule out small machines for many use cases. And by small machine, I don't just mean a Raspberry Pi 5; I'm also thinking of laptops like my MacBook Air which, although equipped with an M4 chip, starts to struggle with models over 7B parameters (it might even struggle a bit before that).

After some introductory elements, I'm going to talk to you about a local model that seems very promising in this area: Lucy.

All my tests were done using Docker Model Runner, but you can easily adapt the source code to work with Ollama or Llama.cpp since I use the OpenAI Go SDK.
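For example, here is roughly how to point the OpenAI Go SDK at a local OpenAI-compatible endpoint. This is a minimal sketch assuming openai-go v1; the base URL below is the usual Docker Model Runner endpoint, and Ollama typically listens on http://localhost:11434/v1 instead, so adjust it to your setup.

```go
package main

import (
	"context"
	"fmt"

	"github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

func main() {
	ctx := context.Background()

	// Docker Model Runner usually exposes an OpenAI-compatible API here
	// (for Ollama you would typically use http://localhost:11434/v1 instead).
	baseURL := "http://localhost:12434/engines/v1"

	client := openai.NewClient(
		option.WithBaseURL(baseURL),
		option.WithAPIKey(""), // local engines generally don't check the key
	)

	completion, err := client.Chat.Completions.New(ctx, openai.ChatCompletionNewParams{
		Model: "hf.co/menlo/lucy-gguf:q8_0",
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage("Say hello to Bob"),
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(completion.Choices[0].Message.Content)
}
```

With that in place, switching between Docker Model Runner, Ollama, or a Llama.cpp server is mostly a matter of changing the base URL and the model name.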

Quick Recap on Function Calling

"Function calling" for LLMs that support it is the ability for an LLM to detect/recognize in a user prompt the intention of wanting to "execute something," such as doing an addition, wanting to know the weather in a particular city, or wanting to search the Internet.

LLMs that "support tools," if you provide them upstream with a catalog of tools in JSON, will be able to make the connection between the user's intention and the corresponding tool. The LLM will then generate a JSON response with the tool name and the parameters to call it.

And then it's up to you to implement the tools in question and call them with the parameters provided by the LLM.
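For example, here is a minimal, hypothetical dispatcher for the two test tools used later in this article (say_hello and add_two_numbers); runTool is purely illustrative, not the code from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runTool is a hypothetical dispatcher: the model only returns a tool name and
// a JSON string of arguments; actually executing the tool is your program's job.
func runTool(name, arguments string) (string, error) {
	switch name {
	case "add_two_numbers":
		var args struct {
			Number1 float64 `json:"number1"`
			Number2 float64 `json:"number2"`
		}
		if err := json.Unmarshal([]byte(arguments), &args); err != nil {
			return "", err
		}
		return fmt.Sprintf(`{"result": %v}`, args.Number1+args.Number2), nil
	case "say_hello":
		var args struct {
			Name string `json:"name"`
		}
		if err := json.Unmarshal([]byte(arguments), &args); err != nil {
			return "", err
		}
		return fmt.Sprintf("👋 Hello, %s!🙂", args.Name), nil
	default:
		return "", fmt.Errorf("unknown tool: %q", name)
	}
}

func main() {
	// The arguments arrive exactly as the model produced them: a JSON string.
	result, err := runTool("add_two_numbers", `{"number1": 2, "number2": 3}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(result) // {"result": 5}
}
```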

Improving SLMs in Function Calling

There are several methods to improve SLMs at function calling, such as using higher-quality and more varied training data. For example, Salesforce AI Research developed xLAM, a family of models called "Large Action Models" (LAMs), specifically designed to improve function calling, reasoning, and planning.

There's targeted fine-tuning (on specific data) that "would seem" to radically transform the performance of certain models.

There's also instruction tuning with reasoning: the model first exposes its reasoning in <think>…</think> tags, and then generates structured "function calls" in <tool_call>…</tool_call> tags (for example, something like <tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>).

There are certainly other methods, but I'm not an expert, so I've only cited those I understood.

Despite what these studies claim, I find that in practice the results often fall short.

Nevertheless, some models are better than others. For example: ai/qwen3:8B-Q4_K_M

You can also find it on Hugging Face: Qwen/Qwen3-8B-GGUF.

Lucy, the Model that Rivals the Giants

Very recently, Menlo.ai published a model that looks very promising: Lucy. It's a 1.7-billion-parameter model, optimized for mobile devices, that is said to rival much larger models.

There are two versions of Lucy:

✋ For my tests while writing this article, I used the standard version: lucy-gguf:q8_0.

From what I understand, Lucy is natively optimized for MCP and is said to have been trained to recognize standardized function calls compatible with the MCP ecosystem.

Lucy's main innovation lies in its way of "reasoning". Instead of treating <think> and </think> tags as simple thinking traces, Lucy uses them as a kind of "dynamic task vector" mechanism: the model actively builds and refines its own task representations during inference.

💡 Put more simply: Lucy doesn't think with rigid logic, but builds at each step a sort of "living memory" of the task to accomplish.

📝 If you want to go further, here's some documentation on the subject https://arxiv.org/html/2508.00360v1

The only thing I really retained is "natively optimized for MCP", so I had to verify how effective Lucy actually is at function calling, and whether it would finally allow me to do "quality" function calling with small local models.

☕️ Get yourself some coffee, this is where it begins 🚀.

My Small Test Protocol: "Simple" Function Calling

✋ Simple Function Calling == there's only one tool to detect in the prompt.

I wrote a small piece of Go code to quickly check whether a model appears to have correct tool support.

You can find the source code at 01-simple-function-calling/main.go

Let's see what this test consists of. I have a catalog of 2 tools (it's a simple and short test, only to allow an initial selection of models):

1. Available Test Tools

  • say_hello: Says hello to a person (parameter: name)

  • add_two_numbers: Adds two numbers (parameters: number1, number2)

2. Three Test Scenarios

Test 1 - Correct Tool Detection:

  • Question: "Tell me why the sky is blue and then say hello to Jean-Luc Picard. I love pineapple pizza!"

  • Expected: Call to say_hello function

  • Goal: Check if the model correctly identifies the required action among "noisy" text

Test 2 - Mathematical Tool Detection:

  • Question: "Where is Bob? Add 2 and 3. What is the capital of France?"

  • Expected: Call to add_two_numbers function

  • Goal: Test the ability to identify a mathematical operation in a mixed context

Test 3 - No Tool Call:

  • Question: "The best pizza topping is pineapple. What is the capital of France? I love cooking."

  • Expected: No function call

  • Goal: Verify that the model doesn't make false positives

3. Scoring System

  • Each successful test = +1 point

  • 10 iterations × 3 tests = maximum score of 30

  • Final score converted to percentage

4. Evaluation Criteria

  • ✅ Correct function called

  • ✅ No function called when not needed

  • ❌ Wrong function called

  • ❌ No function called when needed

This system tests the precision, selectivity, and robustness of the model's ability to do "function calling".
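To make the scoring concrete, here is a minimal sketch of how one scenario can be evaluated. The scenario and scoreScenario names are illustrative, not the actual code from 01-simple-function-calling/main.go.

```go
package main

import "fmt"

// scenario describes one test case of the protocol: a prompt and the single
// tool it is expected to trigger ("" means no tool call expected).
type scenario struct {
	prompt       string
	expectedTool string
}

// scoreScenario returns 1 point when the model's behaviour matches the
// expectation, 0 otherwise (wrong tool, missing call, or false positive).
func scoreScenario(s scenario, calledTools []string) int {
	switch {
	case s.expectedTool == "" && len(calledTools) == 0:
		return 1 // correctly made no tool call
	case len(calledTools) == 1 && calledTools[0] == s.expectedTool:
		return 1 // correct tool detected
	default:
		return 0
	}
}

func main() {
	scenarios := []scenario{
		{"Tell me why the sky is blue and then say hello to Jean-Luc Picard. I love pineapple pizza!", "say_hello"},
		{"Where is Bob? Add 2 and 3. What is the capital of France?", "add_two_numbers"},
		{"The best pizza topping is pineapple. What is the capital of France? I love cooking.", ""},
	}

	// Fake model answers, just to show the scoring; in the real test the
	// called tools come from the model's tool_calls for each prompt.
	answers := [][]string{{"say_hello"}, {"add_two_numbers"}, {}}

	iterations := 10
	score, maxScore := 0, iterations*len(scenarios) // 10 × 3 = 30
	for i := 0; i < iterations; i++ {
		for j, s := range scenarios {
			score += scoreScenario(s, answers[j])
		}
	}
	fmt.Printf("score: %d/%d (%.1f%%)\n", score, maxScore, float64(score)/float64(maxScore)*100)
}
```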

Test Results

Here are the test results performed on two models: Lucy (lucy-gguf:q8_0) and Qwen3 (qwen3:8B-Q4_K_M):

✋ During my tests, I also tried the lucy-gguf:q4_k_m model, but the results were not as good.

Test Environment

  • Hardware: MacBook Air M4, 32GB RAM

  • Iterations: 10 per model (0-9)

  • Functions Tested: say_hello, add_two_numbers, plus the case where no function should be called

Model Performance Summary

| Model | Final Score | Success Rate | Total Duration | Avg per Completion | Performance Index |
| --- | --- | --- | --- | --- | --- |
| Lucy (hf.co/menlo/lucy-gguf:q8_0) | 30/30 | 100.0% | 181.32s | 6.04s | 3.30x faster |
| Qwen3 (ai/qwen3:8B-Q4_K_M) | 30/30 | 100.0% | 597.22s | 19.91s | 1.00x baseline |

Accuracy

  • Both models achieved perfect accuracy (100% success rate)

  • Both models correctly handled all function calling scenarios

  • Consistent scoring of 3 points per iteration across all tests

Speed Comparison

  • Lucy is 3.30x faster than Qwen3 on average

  • Lucy's total run time is 69.6% shorter than Qwen3's

So Lucy's performance seems very promising. Now it would be interesting to see how it behaves in more complex scenarios: that is, with multiple tool calls in the same prompt.

My 2nd Small Test Protocol: "Loop" Function Calling

This time the source code is here 02-function-calling-with-loop/main.go

So, what do I mean by "Loop" Function Calling? The term "loop" refers to the iterative conversational cycle between the program and the language model.

This time I have a user prompt that looks like this:

Make the sum of 40 and 2, 
then say hello to Bob and to Sam, 
make the sum of 5 and 37
Say hello to Alice

So in theory, the model should detect 5 different function calls and execute them in order:

  1. calculate_sum with arguments {"a":40,"b":2}

  2. say_hello with argument {"name":"Bob"}

  3. say_hello with argument {"name":"Sam"}

  4. calculate_sum with arguments {"a":5,"b":37}

  5. say_hello with argument {"name":"Alice"}

To test the model I'm going to use the following algorithm:

  1. Main loop: The program keeps making API requests as long as the model returns tool calls (completion.Choices[0].FinishReason == "tool_calls") and stops once the model decides it's done (completion.Choices[0].FinishReason == "stop")

  2. Question-answer-action cycle: At each iteration:

    • The model analyzes the request and history

    • It decides which functions to call (or stop)

    • The program executes the requested functions

    • The results are added to the history

    • The cycle starts again with the enriched context

  3. Difference from a single call: Unlike a system that would make a single function call, this approach allows among other things:

    • Complex tasks requiring multiple steps

    • Conditional decisions based on previous results

sequenceDiagram
    participant U as User
    participant M as Main Program
    participant API as OpenAI API
    participant F as Function Executor

    U->>M: Initial message
    M->>API: Request with available tools

    loop Until "stop" response
        API-->>M: Response with tool_calls
        M->>M: Add assistant message

        loop For each tool_call
            M->>F: Execute function(name, args)
            F-->>M: Result
            M->>M: Add result to history
            M->>M: Record in functionCallHistory
        end

        M->>API: New request with complete history
    end

    API-->>M: "stop" response
    M->>U: Display final summary

Key Point: Maintaining History

The complete conversation history is critical for the proper functioning of the system:

  • Conversational context: The model must know all previous function calls and their results to make coherent decisions

  • Logical continuity: Without history, the model would lose track of the conversation and could repeat actions or ignore important results

  • Dependency management: Some functions may depend on the results of previous calls

  • Technical implementation:

    • Each assistant message with tool_calls is added to history BEFORE execution

    • Each function result is added as a tool message (openai.ToolMessage) with the corresponding ID

    • The complete history is sent with each new API request
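Putting it together, here is a condensed sketch of this loop. It assumes openai-go v1 (helper and field names such as ToParam and ToolMessage vary between SDK versions), and executeFunction is just a hypothetical stand-in for the real calculate_sum / say_hello implementations; the actual code is in 02-function-calling-with-loop/main.go.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/openai/openai-go"
)

// executeFunction is a stand-in for the real calculate_sum / say_hello
// implementations (the real ones live in 02-function-calling-with-loop/main.go).
func executeFunction(name, arguments string) string {
	return fmt.Sprintf(`{"called": %q, "with": %s}`, name, arguments)
}

// chatLoop is a condensed sketch of the iterative cycle described above,
// assuming openai-go v1 (field and helper names vary between SDK versions).
func chatLoop(ctx context.Context, client openai.Client, params openai.ChatCompletionNewParams) {
	for {
		completion, err := client.Chat.Completions.New(ctx, params)
		if err != nil {
			log.Fatal(err)
		}
		choice := completion.Choices[0]

		// The assistant message (with its tool_calls) is added to the history
		// BEFORE the functions are executed.
		params.Messages = append(params.Messages, choice.Message.ToParam())

		// The model decided to stop: nothing left to call.
		if choice.FinishReason == "stop" || len(choice.Message.ToolCalls) == 0 {
			fmt.Println(choice.Message.Content)
			return
		}

		// Execute each requested function and append its result as a tool
		// message tied to the corresponding call ID.
		for _, toolCall := range choice.Message.ToolCalls {
			result := executeFunction(toolCall.Function.Name, toolCall.Function.Arguments)
			// (content, toolCallID) in openai-go v1; older SDK versions reverse the order.
			params.Messages = append(params.Messages, openai.ToolMessage(result, toolCall.ID))
		}
		// Next iteration: the complete, enriched history is sent again.
	}
}

func main() {
	client := openai.NewClient() // point it at your local endpoint as shown earlier
	chatLoop(context.Background(), client, openai.ChatCompletionNewParams{
		Model: "hf.co/menlo/lucy-gguf:q8_0",
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage("Make the sum of 40 and 2, then say hello to Bob"),
		},
		// Tools: add the calculate_sum and say_hello definitions here.
	})
}
```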

New Test Results

This time too, I ran the tests on the same two models, Lucy and Qwen3, in the same test environment, with two functions (calculate_sum, say_hello):

Main Report

| Model | Total Calls | Success Rate | Total Duration | Avg Duration | Efficiency |
| --- | --- | --- | --- | --- | --- |
| Lucy (hf.co/menlo/lucy-gguf:q8_0) | 5 | 100% | 24.26s | 4.85s | 4.46x faster |
| Qwen3 (ai/qwen3:latest) | 3 | 100% | 86.16s | 28.72s | 1.00x baseline |

Once again, in terms of execution speed, Lucy outperforms Qwen3.

And very importantly, if you take a look at the detailed reports of function calls made by each model, Qwen3 made only 3 function calls while Lucy properly made all 5 calls.

Detailed Function Call Analysis

Lucy

| Call # | Function | Arguments | Duration | Result | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":40,"b":2} | 5.54s | {"result": 42} | 7XQ9sMGL... |
| 2 | say_hello | {"name":"Bob"} | 5.00s | "👋 Hello, Bob!🙂" | yvdQOOeu... |
| 3 | say_hello | {"name":"Sam"} | 5.24s | "👋 Hello, Sam!🙂" | vIUWhXch... |
| 4 | calculate_sum | {"a":5,"b":37} | 5.52s | {"result": 42} | 0OaylxCQ... |
| 5 | say_hello | {"name":"Alice"} | 2.94s | "👋 Hello, Alice!🙂" | fCPjjqcO... |

Qwen3

| Call # | Function | Arguments | Duration | Result | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":40,"b":2} | 14.12s | {"result": 42} | 2gbfOz6B... |
| 2 | say_hello | {"name":"Bob"} | 48.37s | "👋 Hello, Bob!🙂" | YutmYuOw... |
| 3 | calculate_sum | {"a":5,"b":37} | 23.67s | {"result": 42} | gQqDdzME... |

So for this type of prompt, Lucy outperforms Qwen3 both in terms of speed and precision.

A 3rd test: Conditional Function Calling

I did one last test to see if the model could handle conditional function calls, with a prompt like this:

Make the sum of 30 and 2,
If the result is higher than 40
Then say hello to Bob Else to Sam

And the results were:

| Call # | Function | Arguments | Result | Duration | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":30,"b":2} | {"result": 32} | 6.85s | mb8nWkux... |
| 2 | say_hello | {"name":"Sam"} | "👋 Hello, Sam!🙂" | 6.53s | Wh5D3bXQ... |

So I modified the prompt to test the else part of the condition:

Make the sum of 40 and 2,
If the result is higher than 40
Then say hello to Bob Else to Sam

And once again, Lucy correctly applied the condition and detected both function calls:

| Call # | Function | Arguments | Result | Duration | Call ID |
| --- | --- | --- | --- | --- | --- |
| 1 | calculate_sum | {"a":40,"b":2} | {"result": 42} | 3.39s | LyCfTlKv... |
| 2 | say_hello | {"name":"Bob"} | "👋 Hello, Bob!🙂" | 3.68s | b5kUNUm6... |

Of course I'll need to do more advanced tests in terms of prompt complexity, but for now I think I have a crush on Lucy. Next step: make a useful and complete example with Lucy and take the opportunity to test its other capabilities.
