Nano-Models: A Recent Breakthrough as the Pieces Team Brings LTM‑2.5 to Life 🎉

Tsavo Knott

Introduction

In the pursuit of building long-term Artificial Memory at the OS level, understanding when a user wants to retrieve information is just as crucial as what they want. In the early days, every step of that retrieval pipeline—from intent classification through span extraction, normalization, enrichment, relevance scoring, formatting, and upload—ran through cloud-hosted LLMs. That meant 8–11 preprocessing tasks before touching the memory store, another 2–4 post-processing tasks afterward, and finally a round-trip to a remote model to compose the answer. The result? Cumulative latency that drags time-to-first-token into the seconds, accuracy hurdles at each stage, user data exposed in transit, and token bills that balloon with every query.

Our breakthrough with LTM-2.5: two purpose-built on-device nano-models that offload temporal understanding entirely to local hardware—one for interpreting the user's temporal intent, the other for extracting the precise time span(s) implied by their language. These specialized models are the result of extensive knowledge distillation from larger foundation models, quantized and pruned to run efficiently on consumer hardware. Now, the entire 10–15 step pipeline lives on-device, preserving privacy, slashing costs, and taking the deterministic retrieval of long-term context down from seconds to milliseconds in latency.

When to Leverage the Temporal Model

Our pipeline depends on two critical steps: determining intent first, then generating one or more time ranges representative of the user's natural-language query:

| Use Case | Description |
| --- | --- |
| Content Retrieval | Fetching past events ("What was I working on just now?") |
| Action / Scheduling | Setting reminders or appointments ("Remind me in two hours") |
| Future Information / Planning | Forecasting or "next week" inquiries ("What am I doing tomorrow afternoon?") |
| Current Status | Real-time checks ("What am I doing right now?") |
| Temporal – General | Ambiguous or loosely specified time references ("Show me last week around Friday evening") |
| Non-Temporal | Queries without a time component ("Explain the concept of recursion.") |
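
To make the routing concrete, here is a minimal sketch of how a classifier label could gate whether the span predictor runs at all. This is not the actual Pieces implementation; the enum values, set membership, and function names are illustrative assumptions.

```python
from enum import Enum

class TemporalIntent(Enum):
    CONTENT_RETRIEVAL = "content_retrieval"
    ACTION_SCHEDULING = "action_scheduling"
    FUTURE_PLANNING = "future_planning"
    CURRENT_STATUS = "current_status"
    TEMPORAL_GENERAL = "temporal_general"
    NON_TEMPORAL = "non_temporal"

# Intents that should trigger a lookup in the long-term memory store.
RETRIEVAL_INTENTS = {
    TemporalIntent.CONTENT_RETRIEVAL,
    TemporalIntent.CURRENT_STATUS,
    TemporalIntent.TEMPORAL_GENERAL,
}

def route(intent: TemporalIntent) -> str:
    """Decide which branch of the pipeline a classified query follows."""
    if intent in RETRIEVAL_INTENTS:
        return "span prediction -> memory search"    # run the second nano-model
    if intent in (TemporalIntent.ACTION_SCHEDULING, TemporalIntent.FUTURE_PLANNING):
        return "scheduling / planning"               # no past time-range needed
    return "answer directly"                         # non-temporal, skip memory
```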

Temporal Range Generation

Once we've determined that a query requires temporal memory access, we need to precisely identify when to search in the user's activity timeline. This is where our second nano-model comes into play:

Range Types and Boundaries

The temporal span predictor handles several distinct types of time references:

| Range Type | Example Query | Generated Span | Search Strategy |
| --- | --- | --- | --- |
| Point-in-time | "Show me what I was doing at 2pm yesterday" | Single timestamp with narrow context window | Precise timestamp lookup with small buffer |
| Explicit period | "What emails did I receive between Monday and Wednesday?" | Clearly defined start/end boundaries | Bounded range search with exact limits |
| Implicit period | "What was I working on last week?" | Inferred start/end based on cultural/contextual norms | Automatically expanded to appropriate calendar boundaries |
| Relative recent | "What was I just doing?" | Short window counting backward from current time | Recency-prioritized retrieval with adaptive timespan |
| Fuzzy historical | "Show me that article I read about quantum computing last summer" | Broad date range with lower confidence boundaries | Expanded search space with relevance decay at boundaries |
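
As an illustration only (the field names and buffer sizes are assumptions, not the model's actual output schema), a predicted span can be thought of as a start/end pair plus a confidence, resolved relative to the user's current time:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class TimeSpan:
    start: datetime      # inclusive lower bound of the search window
    end: datetime        # inclusive upper bound of the search window
    confidence: float    # how certain the model is about the boundaries

now = datetime(2025, 4, 17, 16, 43, 2, tzinfo=timezone.utc)

# "Show me what I was doing at 2pm yesterday": a point-in-time reference
# becomes a narrow window (small buffer) around the resolved timestamp.
two_pm_yesterday = now.replace(hour=14, minute=0, second=0, microsecond=0) - timedelta(days=1)
point_in_time = TimeSpan(
    start=two_pm_yesterday - timedelta(minutes=15),
    end=two_pm_yesterday + timedelta(minutes=15),
    confidence=0.95,
)

# "What was I just doing?": a relative-recent reference becomes a short
# window counting backward from the current time.
relative_recent = TimeSpan(start=now - timedelta(minutes=6), end=now, confidence=0.9)
```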

Optimizing the Temporal Search Space

The model doesn't just identify time boundaries—it also generates crucial metadata about search strategy:

  • Confidence scores for timespan boundaries (enabling better retrieval when dates are ambiguous)

  • Periodicity hints for recurring events (distinguishing "my Monday meeting" from "last Monday's meeting")

  • Time-zone awareness for properly interpreting references when users travel

  • Contextual weighting that prioritizes activity density over raw timestamps (e.g., for "when I was working on the Smith project")

This specialized temporal range extraction eliminates the need to scan the entire memory corpus for each query, dramatically reducing both computational load and latency while improving retrieval precision.
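
A rough sketch of how such metadata could narrow the candidate set before ranking is shown below; the field names and the pruning heuristic are illustrative assumptions, not the shipped logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanMetadata:
    boundary_confidence: float            # how firm the start/end boundaries are
    periodicity: Optional[str] = None     # e.g. "weekly" for "my Monday meeting"
    timezone_name: str = "UTC"            # resolved zone for the user's reference
    weight_by_activity: bool = False      # prefer dense activity clusters over raw time

def candidate_events(events, span, meta: SpanMetadata):
    """Filter the memory corpus down to events inside (or near) the predicted span."""
    # Low-confidence boundaries widen the window instead of forcing a full scan.
    slack = (span.end - span.start) * (1.0 - meta.boundary_confidence)
    lo, hi = span.start - slack, span.end + slack
    hits = [e for e in events if lo <= e.timestamp <= hi]
    if meta.weight_by_activity:
        # Rank by how many neighboring events fall within the same hour,
        # approximating "activity density" for project-style queries.
        def density(event):
            return sum(abs((event.timestamp - other.timestamp).total_seconds()) < 3600
                       for other in hits)
        hits.sort(key=density, reverse=True)
    return hits
```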

Intention Differentiation & Edge Cases

Ensuring we route queries correctly between retrieval and planning:

  • Retrieval vs. Planning

    • "What was I working on just now?" → Content Retrieval

    • "What am I doing tomorrow afternoon?" → Future Information / Planning

  • Broad vs. Specific

    • "Show me last week around Friday evening" → Content Retrieval with a loose span

    • "Plan my weekend for next Friday evening" → Future Information / Planning

  • Temporal vs. Non-Temporal

    • "What was the website I was just looking at?" → Content Retrieval

    • "Explain the concept of recursion." → Non-Temporal (no memory lookup)

By clearly distinguishing temporal retrieval (pulling historical context) from temporal reference (scheduling or future-oriented intent), our on-device pipeline avoids misrouted cloud calls, cuts latency to the millisecond level, and maintains top-tier accuracy without sacrificing privacy or incurring hidden costs.
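
A handful of illustrative assertions capture these distinctions as a quick regression check; the labels and the `classify` helper are hypothetical stand-ins for the on-device intent nano-model, not its real API.

```python
# Hypothetical regression checks for the routing rules above.
EXPECTED = {
    "What was I working on just now?": "content_retrieval",
    "What am I doing tomorrow afternoon?": "future_planning",
    "Show me last week around Friday evening": "content_retrieval",
    "Plan my weekend for next Friday evening": "future_planning",
    "What was the website I was just looking at?": "content_retrieval",
    "Explain the concept of recursion.": "non_temporal",
}

def check_routing(classify):
    """Run every example query through `classify` and flag any misrouted intent."""
    for query, expected_label in EXPECTED.items():
        assert classify(query) == expected_label, f"misrouted: {query!r}"
```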

Examples & Scenarios

Below are representative user queries, each fed into our pipeline along with the user's current local time normalized to UTC (e.g. 2025-04-17T16:43:02.151857+00:00):

  1. Recent Activity Retrieval

    • Query: "Could you tell me what I was just doing?"

    • Classifier (23 ms): Content Retrieval

    • Span Predictor (102 ms): 2025-04-17T16:37:05.603Z – 2025-04-17T16:43:02.151857Z

    • Showcases: precise on-device extraction of the last few minutes of activity

  2. Future Planning (Nuanced Task)

    • Query: "I will go to the store tomorrow."

    • Classifier (21 ms): Temporal – General

    • Span Predictor: N/A

    • Showcases: correctly not generating a past time-range for future intentions—an essential nuance

  3. "Just" Retrieval Consistency

    • Query: "What was the website I was just looking at?"

    • Classifier (22 ms): Content Retrieval

    • Span Predictor (108 ms): 2025-04-17T16:37:05.603Z – 2025-04-17T16:43:02.151857Z

    • Showcases: consistent span output across semantically similar "just" queries

  4. Long-Range Historical Query

    • Query: "What was I working on last year around Thanksgiving?"

    • Classifier (22 ms): Content Retrieval

    • Span Predictor (88 ms): 2024-11-01T00:00:00Z – 2024-11-30T23:59:59.999999Z

    • Showcases: broad date-range generation for loosely specified historical periods
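
The spans in examples 1 and 3 above are simply short windows counted backward from the reference timestamp supplied with the query. A minimal reproduction looks like this; the roughly six-minute lookback is an assumption inferred from the example output, not a documented constant.

```python
from datetime import datetime, timedelta

def recent_window(reference_time: datetime, lookback: timedelta = timedelta(minutes=6)):
    """Resolve a 'just now' style query into a (start, end) pair ending at the reference time."""
    return reference_time - lookback, reference_time

# Using the reference time from the examples above:
ref = datetime.fromisoformat("2025-04-17T16:43:02.151857+00:00")
start, end = recent_window(ref)
print(start.isoformat(), "->", end.isoformat())
```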

Benchmarks

Tested on an Apple M1 Max (32 GB) under heavy load (30+ tabs, video, IDEs, messaging) to simulate real-world conditions:

Classification Results

This table compares how well each model identifies the correct temporal intent label for a given sample.

| Model Name | Accuracy | F1 (W) | Prec (W) | Recall (W) | Samples/Sec |
| --- | --- | --- | --- | --- | --- |
| nano-temporal-intent (TIME Intent) | 0.9930 | 0.9930 | 0.9931 | 0.9930 | 544.41 |
| gemini-1.5-flash-002 | 0.8241 | 0.8384 | 0.8834 | 0.8241 | 9.14 |
| gpt-4o | 0.8634 | 0.8470 | 0.8698 | 0.8634 | 9.40 |
| meta-llama/Llama-3.2-3B-Instruct | 0.4604 | 0.4094 | 0.4080 | 0.4604 | 92.43 |

Legend: Classification Models

  • nano-temporal-intent (TIME Intent): Our on-device nano-model for intent classification—ultra-lightweight and lightning-fast inference.

  • gemini-1.5-flash-002: Google's mid-tier large language model via API; good accuracy but higher latency and cost.

  • gpt-4o: OpenAI's flagship multimodal LLM; strong performance at premium compute and pricing.

  • meta-llama/Llama-3.2-3B-Instruct: A 3-billion-parameter open-weights LLM; lower accuracy but faster inference than the cloud LLMs.

Legend: Classification Metrics

  • Accuracy: Proportion of samples for which the top prediction matches the true class.

  • F1 (W), Prec (W), Recall (W): Weighted F1-score, precision, and recall across all intent classes (accounts for class imbalances).

  • Samples/Sec: Number of inference calls the model can process per second (higher is better).
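
For reference, the weighted metrics above follow the standard weighted-averaging definitions; assuming the scikit-learn implementations of those definitions, they can be reproduced from raw predictions like this (a sketch, not our benchmark harness):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred):
    """Compute Accuracy and weighted Precision / Recall / F1, as reported in the table."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_w": precision,
        "recall_w": recall,
        "f1_w": f1,
    }
```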

Span Prediction Results

This table measures how precisely each model extracts the correct time-span from text.

| Model Name | E.C.O. Rate | Avg IoU | Exact Match | Samples/Sec |
| --- | --- | --- | --- | --- |
| nano-temporal-span-pred (TIME Range) | 0.9450 | 0.9201 | 0.8659 | 785.39 |
| gemini-1.5-pro-002 | 0.2065 | 0.1865 | 0.1684 | 9.35 |
| gpt-4o | 0.1767 | 0.1611 | 0.1535 | 9.47 |
| meta-llama/Llama-3.2-3B-Instruct | 0.1725 | 0.1640 | 0.1517 | 62.02 |

Legend: Span Models

  • nano-temporal-span-pred (TIME Range): On-device span extractor optimized for low latency and high IoU.

  • gemini-1.5-pro-002, gpt-4o, meta-llama/Llama-3.2-3B-Instruct: LLMs & SLMs performing span extraction via API calls.

Legend: Span Metrics

  • E.C.O. Rate (Exact Coverage Overlap): Fraction of predicted spans that exactly match the gold span boundaries.

  • Avg IoU (Intersection-over-Union): Average overlap ratio between predicted and true spans.

  • Exact Match: Strict percentage of samples where predicted span text equals ground truth.

  • Samples/Sec: Span-prediction throughput on the benchmark hardware (higher is better).
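
As a concrete reading of Avg IoU for time spans, the generic interval intersection-over-union formula is sketched below; this is the standard definition, not necessarily the exact benchmark harness.

```python
from datetime import datetime

def span_iou(pred_start: datetime, pred_end: datetime,
             true_start: datetime, true_end: datetime) -> float:
    """Intersection-over-union of two time intervals, measured in seconds."""
    intersection = (min(pred_end, true_end) - max(pred_start, true_start)).total_seconds()
    union = (max(pred_end, true_end) - min(pred_start, true_start)).total_seconds()
    if union <= 0:
        return 0.0
    return max(intersection, 0.0) / union
```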

We observed that SLMs running in the cloud on an H100 GPU with vLLM incur $0.018–$1.90 per run and take 15–25 minutes of compute time—our cascade delivers structured time-spans offline in milliseconds, with zero API cost and full data privacy.

Why It Matters

🏗️ Architectural Specialization
Breaking monolithic LLMs into nano-models for classification vs. span prediction yields massive gains in both accuracy and speed.

🌐 Edge-First AI
Offline inference keeps sensitive data on-device—critical for medical, defense, and privacy-focused applications.

💡 Energy & Cost Efficiency
Eliminate token fees and slash compute budgets. This is the future of sustainable, scaled AI on laptops, wearables, and IoT.

🔬 Research Frontiers

  • Task-specific distillation, quantization, and pruning for modular pipelines

  • Adaptive orchestration: dynamic model selection based on compute availability

  • Hardware/software co-design for ultra-efficient inference

Conclusion

This nano-temporal pipeline is one of approximately 11 nano-models we're weaving into LTM-2.5 to make long-term memory formation and retrieval across your entire OS blazingly fast, highly accurate, and privacy-first.

Innovation isn't about bigger models—it's about smarter, specialized models that deliver tangible benefits in real-world applications. By focusing on modular, purpose-built AI systems that run entirely on-device, we're redefining what's possible for intelligent, responsive computing that respects user privacy while dramatically reducing cost and latency.

We can't wait to share more as we push the boundaries of on-device AI in the world of OS-level Long-Term Memory.

Lastly, I would be remiss if I didn't mention the obvious: none of this would be possible without the incredible creativity, dedication, and perseverance of the team behind Pieces. I'll close with a special shout-out to our ML team, and an extra special shout-out to Antreas Antoniou and Sam Jones for believing in the approach and turning these first-principles theories into breakthroughs ✨
