Managing Complex Two-Way Conversations inside a 'Core Loop' in an Android App

Richard Thompson

In building voice-enabled AI assistants, one of the most challenging aspects is how to manage the conversation lifecycle.

I created my Vox Manifestor app to help users manifest their wishes through structured reflection practices. The app centers around a voice-controlled, AI-powered agent (that I call a “Genie”) whose job is to guide users through a process of articulating, refining, and affirming their wishes.

Sounds simple, right? Humans converse with each other every day, and now we’ve designed AIs to do the same. Sure, fine. The latest LLMs have all been fine-tuned to be conversational in nature, to reflect on what the user has just said, and even to finish their reply with a question that encourages the user to continue chatting.

But this Genie’s capabilities will be going far beyond that. I’m building it to:

  1. Hold a conversation for as long as the user is using the app (maintain the “Core Loop”).

  2. Be goal-oriented, and achieve multiple, staged, and progressively deeper goals through these conversations, including:

    • Eliciting and storing the user’s 5 most valued desires (“Wishes”)

    • Gathering daily meta-goal information (for example, how the user feels about their progress towards these wishes) and getting the user to engage at a deeper level with each of them in turn.

    • Using questioning skills to establish detailed descriptions of the present and desired states for each of these wishes:

      • Determining sticking points and highlighting these in the framework

      • Guiding detailed explorations of the present state and visualisations of desired states

  3. Determine when using a particular tool (e.g. an affirmation tool) could be helpful, and work out from past conversations what content those affirmations should have!

This was certainly a challenging-enough starting point. I thought that if I could accomplish the beginnings of these three goals, that would serve as a Minimum Lovable Product (elsewhere known as a Minimum Viable Product), and I could then start putting the app in front of potential users!

Core Loop v.1.0: Creating a seamless, ongoing conversation

After an initial design period of some months, I wasn’t happy with the way Vox Manifestor was turning out. It seemed too basic, too clunky. The visual interface was boring and static. The user had to issue commands for everything they wanted to happen.

I wanted to develop a Core Value Proposition of what this app could be. The process was painful. I had to confront the inadequacy of what I’d already spent months developing. I came up with some important realisations that spoke not just to the functioning of this app, but to the future of AI-assisted voice interfaces in the very operating systems we use every day. From these realisations emerged…

Midjourney 7: “An advanced conversational interface”

Some Principles for Advanced Conversational Interfaces

  1. The conversation with the AI must be THE PRIMARY INTERFACE.

  2. The AI must be able to navigate a sequence of evolving CONVERSATION CONTEXTS, each of which could involve different goals and skills, and draw on different knowledge and memories.

  3. The agent must maintain a sense of consistency (and therefore competence) within each session and across sessions (it must have a Short-Term Memory and a Long-Term Memory), and it must be able to use these memories effectively to keep the conversation relevant and goal-directed!

  4. UI elements should be CONTEXTUAL and appear only when relevant to the ongoing conversation.

  5. Different data sources (wishes, states, etc.) must be loaded based on conversation context

  6. All Navigation must happen through conversation. The agent progresses the context, persists across screen changes, and maintains control of the conversation.**

  7. The interface maintains a clean, focused design centered on the agent avatar and the conversation

  8. Information like wish lists, concept details, process and status information become available in collapsible & expandable panels: This “Interface of the Future” must be Dynamic, Fluid and Adaptable.

[Black Mirror / Apocalypse Caveat]

** Side note: ultimately the voice agent (our future voice OSs) will persist across all applications and store everything we do in a way that allows it to understand us and anticipate our next steps. This, of course, could end up being the stuff of science-fiction nightmares. The AIs end up knowing us better than we know ourselves, and so become the perfect instrument of our self-destruction. This is an ongoing issue to address in AI development: HOW DO WE STOP THEM KILLING OR ENSLAVING US? OR BECOMING THE MACHINES THAT FACILITATE OTHER HUMANS TO DO SO? Yes, this question needs answering. More immediately than that, however, we will certainly run into significant SECURITY RISKS by willingly handing every detail of our online and offline lives over to intelligently snooping agents, at least until improved privacy and security for AI tools are seriously developed. Clearly the development of AI tools such as these must go hand in hand with the development of improved privacy and security measures. I’d like to think that every future AI developer will need to be well-versed in these concerns and occupied by them for at least some of their time.

Coming Back To First Principles

I’d spent a number of months “in the trenches” of the code implementation, establishing a basic architecture for voice recognition, speech generation, data and state management, UI and screen layouts and navigation, and finally the integration of a language model as the decider and driver of conversations.

I’d wanted to do this as part of my Year of Technical Re-Specialization, to get back into the swing of programming by learning one of the modern languages. Even though my original interest was in AI, I knew I needed a platform to form a foundation for any AI application, and to me, mobile seemed like the best format. I think voice-based OSs are going to take over mobile platforms first.

So it took a very necessary effort to pull myself out of the trenches of the code, and come back to the above — newly established — interface design principles.

I apparently wasn’t going to be able to continue getting away with designing as I was building! I’d have to go back to the drawing board and start from first principles to develop this new style of interface.

Midjourney 7: “An advanced conversational interface"

  1. The Conversation Is The Primary Interface

At its core this means that the conversational agent must always have somewhere to go in the conversation. The conversation must be structured, and the agent must navigate through this structure, creating a sense of systematic progress for the user.

On the one hand, the agent must not suffer from issues such as repeating the same question, getting stuck on a particular outcome, or falling back on the same process again and again. There must always be forward momentum, and the sense that the agent has direction; otherwise the user can quickly lose faith in the integrity of the interface. At the same time, the agent should be able to respond to the user's needs when they are expressed, for example a request to focus on a specific wish.

  2. The Agent Navigates Conversational Contexts

This follows on naturally from the process-oriented nature of the conversation we discussed above.

A process-oriented conversation necessarily involves moving sequentially between a pre-planned series of sub-conversations.

The goal of the “Core Loop” is to navigate this sequence from beginning to end, repeatedly, ad infinitum (with variations that keep the process interesting, entertaining, engaging, etc.)

The goal of each individual conversational context or sub-conversation will be different according to that context.

The agent will need to maintain awareness of the Core Loop, while being able to descend a logical level into more detailed Conversation Contexts, accomplish the goal of that context to its satisfaction, and then return to the higher level Core Loop, to continue the process.

Mapping the Core Loop: What Data & Which Algorithms?

I was having trouble working out exactly what belonged inside the Core Loop and what would be needed to bring the process to life, so I created the information and process flow chart above. There are some implementation questions around the idea that the agent should have a list of questions it wants to ask the user. These questions operate on two levels: meta-goal questions about the goals themselves, and specific questions that help the user elaborate their present and desired states.

The two main processes contained inside the basic Core Loop are (1) to engage the user in a general daily update on the main screen and (2) to drill into any of the specific goals.

Note

This article is unfinished! I need to go through each of the principles in turn and explain how they were applied to this newly evolving Core Loop.

Below are some in-the-trenches implementation details, for those interested in the foundations I’d developed and was building upon.

Nuts and Bolts: The Architecture and Flow

Midjourney 7: “Managing Complex Two-Way Conversations inside a 'Core Loop' in an Android App, make it bright and graphical --ar 40:21 --v 7.0 --s 50”

To build an app that could conduct multi-turn spoken dialogues with a user, make API calls to large language models, execute tools, and listen for responses, all while allowing the conversation to be interrupted at any point, I needed a robust architecture to handle this complexity.

At its heart, Vox Manifestor employs a sophisticated multi-layered architecture that combines voice technology with AI-powered conversation management. The main entry point presents users with a simple interface: a Genie avatar surrounded by wish "slots" that can be populated and explored. This minimalist design belies the complex machinery operating beneath.

When a user speaks to the app, their voice is captured through a streaming recognition system implemented in the VoiceRecognitionRepository. This component connects to Google's Speech API, continuously processing audio and emitting recognized text through Kotlin Flows. The timestamp-paired text passes through a filtering mechanism to prevent the system from responding to its own speech output—a common pitfall in voice-controlled applications.
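To make this concrete, here is a minimal sketch of the self-speech filtering idea, assuming recognized text arrives as a Kotlin Flow of timestamped utterances. The class and data shapes below are illustrative, not the app's actual VoiceRecognitionRepository.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.filterNot

// Hypothetical shape of the recognition stream: recognized text paired with a timestamp.
data class RecognizedUtterance(val text: String, val timestampMillis: Long)

class VoiceInputFilter {
    // The agent registers what it has just spoken so echoes of it can be ignored.
    private val recentAgentSpeech = MutableStateFlow<String?>(null)

    fun noteAgentSpeech(text: String) {
        recentAgentSpeech.value = text
    }

    // Drop any recognized utterance that merely repeats the agent's own output.
    fun filterSelfSpeech(recognized: Flow<RecognizedUtterance>): Flow<RecognizedUtterance> =
        recognized.filterNot { utterance ->
            val spoken = recentAgentSpeech.value ?: return@filterNot false
            utterance.text.equals(spoken, ignoreCase = true)
        }
}
```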

A central component called the ConversationAgent serves as the system's cognitive center. This agent receives processed voice input, interprets commands through a pattern-matching system, and orchestrates appropriate responses. The agent maintains conversational state using a state machine approach, tracking whether it's in a wish collection flow, a wish selection flow, or a concept-building flow. This state-based architecture allows the agent to maintain context across multiple turns of conversation.
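A heavily simplified sketch of such a state machine might look like this. The state names mirror the flows mentioned above; everything else, including the transition logic, is an assumption for illustration rather than the app's real code.

```kotlin
// Simplified sketch of the conversational flows the agent can be in.
sealed interface AgentFlowState {
    object Idle : AgentFlowState
    object WishSelection : AgentFlowState
    data class WishCollection(val slotIndex: Int) : AgentFlowState
    data class ConceptBuilding(val wishId: Long, val step: Int) : AgentFlowState
}

class ConversationStateMachine {
    var state: AgentFlowState = AgentFlowState.Idle
        private set

    // Advance the state based on what just happened in the conversation.
    fun onUserInput(input: String) {
        state = when (val current = state) {
            AgentFlowState.Idle ->
                if (input.contains("wish", ignoreCase = true)) AgentFlowState.WishSelection else current
            AgentFlowState.WishSelection ->
                AgentFlowState.ConceptBuilding(wishId = 1L, step = 0) // wish lookup elided
            is AgentFlowState.WishCollection ->
                current.copy(slotIndex = current.slotIndex + 1)
            is AgentFlowState.ConceptBuilding ->
                current.copy(step = current.step + 1)
        }
    }
}
```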

Creating a Brain and Giving it Tools

What makes the system particularly interesting is its approach to reasoning. Rather than embedding all decision logic in code, the agent delegates complex decisions to an external LLM (Gemini) through the BrainService. This "brain" receives contextual information about the current conversation, the user's wishes, and available tools, then returns structured decisions about what the agent should do next.

The tool-based architecture is key to the system's flexibility. Each capability the agent offers is encapsulated in a tool—QuestionTool for asking questions, SaveConceptItemTool for persisting concept data, AffirmationTool for generating affirmations. The LLM doesn't directly execute actions; instead, it specifies which tool to use with what parameters. This creates a clean separation between decision-making and execution.
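As a rough sketch of that separation, assuming a simple string-keyed tool registry: the tool names come from the text above, but the BrainDecision shape and the dispatch code are illustrative, not the app's actual interfaces.

```kotlin
// The LLM returns a structured decision rather than executing anything itself.
data class BrainDecision(val toolName: String, val parameters: Map<String, String>)

// Every capability the agent offers implements a common tool interface.
interface AgentTool {
    val name: String
    suspend fun execute(parameters: Map<String, String>): String
}

class QuestionTool : AgentTool {
    override val name = "QuestionTool"
    override suspend fun execute(parameters: Map<String, String>): String =
        "Asking: ${parameters["question"]}"
}

class AffirmationTool : AgentTool {
    override val name = "AffirmationTool"
    override suspend fun execute(parameters: Map<String, String>): String =
        "Affirming: ${parameters["text"]}"
}

// The agent looks up the tool the brain chose and executes it with the given parameters.
class ToolDispatcher(tools: List<AgentTool>) {
    private val registry = tools.associateBy { it.name }

    suspend fun dispatch(decision: BrainDecision): String =
        registry[decision.toolName]?.execute(decision.parameters)
            ?: "Unknown tool: ${decision.toolName}"
}
```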

Establishing some Basic Architectural Foundations

Conversations in Vox Manifestor aren't just simple command-response pairs—they're multi-turn, goal-oriented sequences. When a user enters the concept building screen to define the present and desired states of their wish, the agent initiates an ongoing dialogue guided by a conversation plan. Each plan includes a goal, steps, and progress tracking, giving the LLM a framework for guiding the user through the process.
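A conversation plan of this kind could be modelled along these lines. This is a sketch only, with hypothetical field names; the app's real plan structure may differ.

```kotlin
// Sketch of a plan the agent can hand to the LLM as scaffolding for a multi-turn dialogue.
data class PlanStep(val description: String, var completed: Boolean = false)

data class ConversationPlan(
    val goal: String,
    val steps: List<PlanStep>,
) {
    val progress: Int
        get() = steps.count { it.completed }

    val isComplete: Boolean
        get() = steps.all { it.completed }
}

fun presentStatePlan(wishTitle: String) = ConversationPlan(
    goal = "Define the present state for the wish: $wishTitle",
    steps = listOf(
        PlanStep("Ask how things stand today"),
        PlanStep("Probe for concrete details and feelings"),
        PlanStep("Summarise and confirm with the user"),
    ),
)
```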

The AgentCortex serves as the system's central nervous system, maintaining state that's observable by UI components through reactive flows. When users interact with buttons or voice commands, their intents flow through the cortex to the agent, which then makes decisions and updates shared state. This decoupled communication pattern ensures that UI components remain responsive while complex operations occur in the background.

I asked Claude 3.7 to draw a little diagram of how the AgentCortex mediates these interactions, and here it is.

The AgentCortex and attached architecture.

AgentCortex: The State Management Centre

The diagram above illustrates how the AgentCortex functions as the ‘pre-frontal cortex’ of the Vox Manifestor app.

In neuroscience, the prefrontal cortex has long been established as the part of the brain that enables us to maintain task focus, formulate executive decisions, and switch between different collections of behavioural schemas according to the context we’re presented with. While the AgentCortex state-management structure doesn’t exactly mirror this (yet), it’s a suitable name given its mediating role between the agent and its ‘reality’: the Android app screens it interfaces with.

Let me explain its key components and interactions:

Central State Management

The AgentCortex serves as a centralized state container that maintains all conversation-related states as observable flows. These include:

  • dialogueState: Tracks if the agent is speaking, listening, or idle

  • displayState: Controls UI overlays like affirmation screens

  • conversationHistory: Maintains the dialogue transcript

  • commandState: Tracks detected voice commands

  • coreLoopState: Manages the overarching conversation journey

Communication Patterns

The AgentCortex uses two primary communication patterns:

  1. UI Intent Flow: ViewModels send user intentions (like "interrupt speech" or "toggle conversation") to the AgentCortex via a uiIntentFlow. This creates a unidirectional data flow from UI to agent.

  2. State Observation: UI components observe various state flows to update their appearance and behavior. For example, the speech indicator observes dialogueState to show when the agent is speaking.

External System Integration

The AgentCortex coordinates with three external systems:

  • Voice Systems: Voice recognition events flow through the AgentCortex to be processed

  • Data Systems: Repositories update their data based on agent actions

  • Brain & Tools: LLM decisions influence state changes in the AgentCortex

CoreLoopState Focus

The CoreLoopState is a particularly important component that:

  • Tracks which phase of the conversation journey the user is in

  • Maintains which wish is being worked on

  • Records completion status of different phases

  • Provides a structured way to guide users through the manifestation process

This architecture creates a clean separation between UI concerns and agent logic while maintaining a reactive system that can respond immediately to user interactions. The AgentCortex effectively decouples various components of the system, allowing them to communicate through a shared state container rather than direct references.
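To make the shared-state-container idea concrete, here is a minimal sketch of what such a cortex could look like in Kotlin, using the flow names from the lists above. The exact types, the UiIntent variants, and the CoreLoopState fields are illustrative assumptions rather than the app's real signatures.

```kotlin
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.SharedFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asSharedFlow
import kotlinx.coroutines.flow.asStateFlow

enum class DialogueState { IDLE, LISTENING, SPEAKING }

// Illustrative subset of intents a ViewModel might send to the agent.
sealed interface UiIntent {
    object InterruptSpeech : UiIntent
    object ToggleConversation : UiIntent
    data class SelectWish(val slotIndex: Int) : UiIntent
}

// Illustrative shape of the overarching journey state.
data class CoreLoopState(
    val phase: String = "daily_update",
    val activeWishId: Long? = null,
    val completedPhases: Set<String> = emptySet(),
)

class AgentCortex {
    // Observable state the UI reacts to.
    private val _dialogueState = MutableStateFlow(DialogueState.IDLE)
    val dialogueState: StateFlow<DialogueState> = _dialogueState.asStateFlow()

    private val _conversationHistory = MutableStateFlow<List<String>>(emptyList())
    val conversationHistory: StateFlow<List<String>> = _conversationHistory.asStateFlow()

    private val _coreLoopState = MutableStateFlow(CoreLoopState())
    val coreLoopState: StateFlow<CoreLoopState> = _coreLoopState.asStateFlow()

    // Unidirectional channel from UI to agent.
    private val _uiIntentFlow = MutableSharedFlow<UiIntent>(extraBufferCapacity = 16)
    val uiIntentFlow: SharedFlow<UiIntent> = _uiIntentFlow.asSharedFlow()

    suspend fun sendIntent(intent: UiIntent) = _uiIntentFlow.emit(intent)

    // Called by the agent as the conversation progresses.
    fun setDialogueState(state: DialogueState) { _dialogueState.value = state }
    fun appendToHistory(line: String) { _conversationHistory.value = _conversationHistory.value + line }
    fun updateCoreLoop(transform: (CoreLoopState) -> CoreLoopState) {
        _coreLoopState.value = transform(_coreLoopState.value)
    }
}
```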

Navigating Between Screens

The app's navigation system is tightly integrated with the conversation flow. When a user says "define wish three," the agent recognizes this as a navigation command, selects the appropriate wish from the database, and initiates a transition to the concept screen. The agent persists across this navigation, maintaining conversation context and immediately resuming dialogue when the new screen appears.
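Here is a rough sketch of how a spoken command like "define wish three" could be parsed and handed off to navigation. The regex, the number-word map, and the navigateToConcept callback are all hypothetical names for illustration.

```kotlin
// Very small illustration of turning "define wish three" into a navigation action.
private val wishWords = mapOf("one" to 1, "two" to 2, "three" to 3, "four" to 4, "five" to 5)

fun parseDefineWishCommand(utterance: String): Int? {
    val match = Regex("define wish (\\w+)", RegexOption.IGNORE_CASE).find(utterance) ?: return null
    val word = match.groupValues[1].lowercase()
    return wishWords[word] ?: word.toIntOrNull()
}

fun handleUtterance(utterance: String, navigateToConcept: (wishIndex: Int) -> Unit) {
    parseDefineWishCommand(utterance)?.let { index ->
        // The agent keeps running across this navigation; only the screen changes.
        navigateToConcept(index)
    }
}
```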

This architecture allows for a coherent experience across different screens while maintaining separation of concerns. Each screen has its own ViewModel that observes agent state and responds accordingly, but the agent itself remains the consistent driver of the conversation.

Resource Management Challenges

One of the most nuanced aspects of the system is managing voice resources and asynchronous processes. Speech synthesis, recognition, and LLM calls all operate asynchronously, creating potential race conditions and resource leaks. The system uses coroutine scopes tied to conversation lifecycles to ensure that when a conversation ends or is interrupted, all related processes are properly canceled.
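The lifecycle-scoped cancellation idea can be sketched like this, using a CoroutineScope owned by the conversation session. This is illustrative only; the real app's scoping and naming will differ.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

class ConversationSession {
    // Every asynchronous job in this conversation is launched in this one scope.
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    fun start() {
        scope.launch { listenForSpeech() }
        scope.launch { synthesizeGreeting() }
    }

    // Cancelling the scope cancels recognition, synthesis, and any pending LLM calls together.
    fun interrupt() = scope.cancel("Conversation interrupted by user")

    private suspend fun listenForSpeech() { delay(Long.MAX_VALUE) } // stand-in for real recognition
    private suspend fun synthesizeGreeting() { delay(1_000) }       // stand-in for real TTS
}
```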

The current implementation, while functional, lacks a unified concept of a continuous "core loop" that persists across the entire application experience. Instead, it operates with discrete conversation flows that start and end based on specific user actions or commands. This creates a somewhat fragmented experience where users must explicitly initiate different types of conversations rather than experiencing a natural, ongoing dialogue.

This is precisely the limitation that the planned Core Conversation Loop aims to address—moving from a collection of state-specific conversations to a single, coherent conversational journey that guides users through the entire manifestation process.

Conversations as Complex Processes

In our voice-first application, our agent facilitates dialogues between users and an AI "brain" that helps them define their goals. These conversations are complex for several reasons:

  1. Asynchronous Operations: The agent performs many asynchronous tasks (speech recognition, LLM API calls, speech synthesis)

  2. Multi-step Processes: Conversations involve multiple turns and state transitions

  3. Resource Management: We need to properly handle speech resources and network connections

  4. Interruptibility: Users need to be able to stop a conversation at any time

The most critical issue we faced was around interruption. When a user says "stop" or presses a cancel button, we needed to ensure that all conversation-related processes terminated immediately - no lingering network calls, no background processes continuing to execute.

The Agent-Brain-Tools Architecture

Before diving into the solution, let's understand how our agent is structured:

Agent Architecture

  1. ConversationAgent: Manages the overall conversation flow, handles user input, and coordinates the interaction

  2. Brain Service: A wrapper around the Gemini LLM that makes decisions about next actions

  3. Tool Library: A collection of functions the LLM can "call" to perform actions like asking questions or saving data

The flow typically works like this:

  • The agent receives user input

  • The agent sends context to the brain

  • The brain decides what to do next (which tool to use)

  • The agent executes the selected tool

  • The cycle repeats

This architecture works well, but it creates a challenge: when a conversation is interrupted, we need to cancel not just the immediate operation but all dependent processes that have been spawned.
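Putting the loop together, here is a sketch of how the cycle might run inside a single cancellable coroutine, so that interrupting the conversation tears down every dependent step. The brain call and tool execution are stubbed, and none of these names are the app's real ones.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.ensureActive
import kotlinx.coroutines.launch

// Stubs standing in for the real brain and tools.
data class Decision(val tool: String, val params: Map<String, String>, val finished: Boolean)

suspend fun askBrain(context: String): Decision =
    Decision("QuestionTool", mapOf("question" to "How do you feel about this wish today?"), finished = true)

suspend fun executeTool(decision: Decision): String = "executed ${decision.tool}"

class CoreConversationLoop {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    fun run(initialContext: String) = scope.launch {
        var context = initialContext
        while (true) {
            ensureActive()                     // stop promptly if the user interrupted
            val decision = askBrain(context)   // the brain chooses the next tool
            val result = executeTool(decision) // the agent executes it
            context += "\n$result"             // fold the result back into the context
            if (decision.finished) break
        }
    }

    fun interrupt() = scope.cancel("User stopped the conversation")
}
```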


This blog post describes an implementation pattern from my experimental "Vox Manifestor" application, which uses Kotlin, Jetpack Compose, and the Gemini API to create an intelligent voice assistant for goal setting and manifestation. The code and other implementation details shown have been simplified for clarity.
