Managing Complex Two-Way Conversations inside a 'Core Loop' in an Android App

Richard Thompson

In building voice-enabled AI assistants, one of the most challenging aspects is how to manage the conversation lifecycle.

I created my Vox Manifestor app to help users manifest their wishes through structured reflection practices. The app centers around a voice-controlled, AI-powered agent (that I call a “Genie”) whose job is to guide users through a process of articulating, refining, and affirming their wishes.

Sounds simple, right? Humans converse with each other every day, and now we’ve designed AIs to do the same. Sure, fine. The latest LLMs have all been fine-tuned to be conversational in nature, to reflect on what the user has just said, and even to finish their reply with a question that encourages the user to continue chatting.

But this Genie’s capabilities will be going far beyond that. I’m building it to:

  1. Hold a conversation for as long as the user is using the app (maintain the “Core Loop”).

  2. Be goal-oriented, and achieve multiple, staged, and progressively deeper goals through these conversations, including:

    • Eliciting and storing the user’s 5 most valued desires (“Wishes”)

    • Gathering daily meta-goal information (for example, how the user feels about their progress towards these wishes) and getting the user to engage at a deeper level with each of them in turn.

    • Using questioning skills to establish detailed descriptions of the present and desired states for each of these wishes:

      • Determining sticking points and highlighting these in the framework

      • Guiding detailed explorations of the present state and visualisations of desired states

  3. Determine when using a particular tool (e.g. an affirmation tool) could be helpful, and work out from past conversations what content those affirmations should have!

This was certainly a challenging-enough starting point. I thought that if I could accomplish the beginnings of these three goals, that would serve as a Minimum Lovable Product (elsewhere known as a Minimum Viable Product), and I could then start putting the app in front of potential users!

Core Loop v.1.0: Creating a seamless, ongoing conversation

After an initial design period of some months, I wasn’t happy with the way Vox Manifestor was turning out. It seemed too basic, too clunky. The visual interface was boring and static. The user had to issue commands for everything they wanted to happen.

I wanted to develop a Core Value Proposition of what this app could be. The process was painful. I had to confront the inadequacy of what I’d already spent months developing. I came up with some important realisations that spoke not just to the functioning of this app, but to the future of AI-assisted voice interfaces in the very operating systems we use every day. From these realisations emerged…

Midjourney 7: “An advanced conversational interface”

Some Principles for Advanced Conversational Interfaces

  1. The conversation with the AI must be THE PRIMARY INTERFACE.

  2. The AI must be able to navigate a sequence of evolving CONVERSATION CONTEXTS, each of which could involve different goals and skills, and draw on different knowledge and memories.

  3. The agent must maintain a sense of consistency (and therefore competence) within each session and across sessions (it must have a Short-Term Memory and a Long-Term Memory), and it must be able to use these memories effectively to keep the conversation relevant and goal-directed!

  4. UI elements should be CONTEXTUAL and appear only when relevant to the ongoing conversation.

  5. Different data sources (wishes, states, etc.) must be loaded based on conversation context

  6. All Navigation must happen through conversation. The agent progresses the context, persists across screen changes, and maintains control of the conversation.**

  7. The interface maintains a clean, focused design centered on the agent avatar and the conversation

  8. Information like wish lists, concept details, process and status information become available in collapsible & expandable panels: This “Interface of the Future” must be Dynamic, Fluid and Adaptable.

[Black Mirror / Apocalypse Caveat]

** Side note: ultimately the voice agent (our future voice OSs) will persist across all applications and store everything we do in a way that allows it to understand us and anticipate our next steps. This, of course, could end up being the stuff of science-fiction nightmares. The AIs end up knowing us better than we know ourselves, and so become the perfect instrument of our self-destruction. This is an ongoing issue to address in AI development: HOW DO WE STOP THEM KILLING OR ENSLAVING US? OR BECOMING THE MACHINES THAT FACILITATE OTHER HUMANS TO DO SO? Yes, this question needs answering. More immediately than that, however, we will certainly run into significant SECURITY RISKS by willingly handing every detail of our online and offline lives over to intelligently snooping agents, at least until improved privacy and security for AI tools are seriously developed. Clearly the development of AI tools such as these must go hand in hand with the development of improved privacy and security measures. I’d like to think that every future AI developer will need to be well-versed in these concerns and occupied by them for at least some of their time.

Coming Back To First Principles

I’d spent a number of months “in the trenches” of the code implementation, establishing a basic architecture for voice recognition, speech generation, data and state management, UI and screen layouts and navigation, and finally the integration of a language model as the decider and driver of conversations.

I’d wanted to do this as part of my Year of Technical Re-Specialization, to get back into the swing of programming by learning one of the modern languages. Even though my original interest was in AI, I knew I needed a platform to form a foundation for any AI application, and to me, mobile seemed like the best format. I think voice-based OSs are going to take over mobile platforms first.

So it took a very necessary effort to pull myself out of the trenches of the code, and come back to the above — newly established — interface design principles.

I apparently wasn’t going to be able to continue getting away with designing as I was building! I’d have to go back to the drawing board and start from first principles to develop this new style of interface.

Midjourney 7: “An advanced conversational interface"

  1. The Conversation Is The Primary Interface

At its core this means that the conversational agent must always have somewhere to go in the conversation. The conversation must be structured, and the agent must navigate through this structure, creating a sense of systematic progress for the user.

On the one hand, the agent must not suffer from issues such as repeating the same question, getting stuck on a particular outcome, or falling back on the same process again and again. There must always be forward momentum, and the sense that the agent has direction; otherwise the user can quickly lose faith in the integrity of the interface. At the same time, the agent should be able to respond to the user's needs when they are expressed, for example a request to focus on a specific wish.

  2. The Agent Navigates Conversational Contexts

This follows on naturally from the process-oriented nature of the conversation we discussed above.

A process-oriented conversation necessarily involves moving sequentially between a pre-planned series of sub-conversations.

The goal of the “Core Loop” is to navigate this sequence from beginning to end, repeatedly, ad infinitum (with variations that keep the process interesting, entertaining, engaging, etc.)

The goal of each individual conversational context or sub-conversation will be different according to that context.

The agent will need to maintain awareness of the Core Loop, while being able to descend a logical level into more detailed Conversation Contexts, accomplish the goal of that context to its satisfaction, and then return to the higher level Core Loop, to continue the process.

Mapping the Core Loop: What Data & Which Algorithms?

I was having trouble working out exactly what belonged inside the Core Loop and what would be needed to bring the process to life, so I created the information and process flow chart above. There are some implementation questions around the idea that the agent should have a list of questions it wants to ask the user. These questions operate on two levels: meta-goal questions about the goals themselves, and specific questions that help the user elaborate their present and desired states.

The two main processes contained inside the basic Core Loop are (1) to engage the user in a general daily update on the main screen and (2) to drill into any of the specific goals.

Note

This article is unfinished! I need to go through each of the principles in turn and explain how they were applied to this newly evolving Core Loop.

Below are some in-the-trenches implementation details, for those interested in the foundations I’d developed and was building upon.

Nuts and Bolts: The Architecture and Flow

Midjourney 7: “Managing Complex Two-Way Conversations inside a 'Core Loop' in an Android App, make it bright and graphical --ar 40:21 --v 7.0 --s 50”

To build an app that could conduct multi-turn spoken dialogues with a user, make API calls to large language models, execute tools, and listen for responses, all while allowing the conversation to be interrupted at any point, I needed a robust architecture to handle this complexity.

At its heart, Vox Manifestor employs a sophisticated multi-layered architecture that combines voice technology with AI-powered conversation management. The main entry point presents users with a simple interface: a Genie avatar surrounded by wish "slots" that can be populated and explored. This minimalist design belies the complex machinery operating beneath.

When a user speaks to the app, their voice is captured through a streaming recognition system implemented in the VoiceRecognitionRepository. This component connects to Google's Speech API, continuously processing audio and emitting recognized text through Kotlin Flows. The timestamp-paired text passes through a filtering mechanism to prevent the system from responding to its own speech output—a common pitfall in voice-controlled applications.
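To make this concrete, here is a minimal sketch of the self-speech filtering idea, assuming recognized text arrives as a Kotlin Flow of timestamped utterances. The class and data shapes below are illustrative, not the app's actual VoiceRecognitionRepository.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.filterNot

// Hypothetical shape of the recognition stream: recognized text paired with a timestamp.
data class RecognizedUtterance(val text: String, val timestampMillis: Long)

class VoiceInputFilter {
    // The agent registers what it has just spoken so echoes of it can be ignored.
    private val recentAgentSpeech = MutableStateFlow<String?>(null)

    fun noteAgentSpeech(text: String) {
        recentAgentSpeech.value = text
    }

    // Drop any recognized utterance that merely repeats the agent's own output.
    fun filterSelfSpeech(recognized: Flow<RecognizedUtterance>): Flow<RecognizedUtterance> =
        recognized.filterNot { utterance ->
            val spoken = recentAgentSpeech.value ?: return@filterNot false
            utterance.text.equals(spoken, ignoreCase = true)
        }
}
```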

A central component called the ConversationAgent serves as the system's cognitive center. This agent receives processed voice input, interprets commands through a pattern-matching system, and orchestrates appropriate responses. The agent maintains conversational state using a state machine approach, tracking whether it's in a wish collection flow, a wish selection flow, or a concept-building flow. This state-based architecture allows the agent to maintain context across multiple turns of conversation.
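A heavily simplified sketch of such a state machine might look like this. The state names mirror the flows mentioned above; everything else, including the transition logic, is an assumption for illustration rather than the app's real code.

```kotlin
// Simplified sketch of the conversational flows the agent can be in.
sealed interface AgentFlowState {
    object Idle : AgentFlowState
    object WishSelection : AgentFlowState
    data class WishCollection(val slotIndex: Int) : AgentFlowState
    data class ConceptBuilding(val wishId: Long, val step: Int) : AgentFlowState
}

class ConversationStateMachine {
    var state: AgentFlowState = AgentFlowState.Idle
        private set

    // Advance the state based on what just happened in the conversation.
    fun onUserInput(input: String) {
        state = when (val current = state) {
            AgentFlowState.Idle ->
                if (input.contains("wish", ignoreCase = true)) AgentFlowState.WishSelection else current
            AgentFlowState.WishSelection ->
                AgentFlowState.ConceptBuilding(wishId = 1L, step = 0) // wish lookup elided
            is AgentFlowState.WishCollection ->
                current.copy(slotIndex = current.slotIndex + 1)
            is AgentFlowState.ConceptBuilding ->
                current.copy(step = current.step + 1)
        }
    }
}
```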

Creating a Brain and Giving it Tools

What makes the system particularly interesting is its approach to reasoning. Rather than embedding all decision logic in code, the agent delegates complex decisions to an external LLM (Gemini) through the BrainService. This "brain" receives contextual information about the current conversation, the user's wishes, and available tools, then returns structured decisions about what the agent should do next.

The tool-based architecture is key to the system's flexibility. Each capability the agent offers is encapsulated in a tool—QuestionTool for asking questions, SaveConceptItemTool for persisting concept data, AffirmationTool for generating affirmations. The LLM doesn't directly execute actions; instead, it specifies which tool to use with what parameters. This creates a clean separation between decision-making and execution.
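As a rough sketch of that separation, assuming a simple string-keyed tool registry: the tool names come from the text above, but the BrainDecision shape and the dispatch code are illustrative, not the app's actual interfaces.

```kotlin
// The LLM returns a structured decision rather than executing anything itself.
data class BrainDecision(val toolName: String, val parameters: Map<String, String>)

// Every capability the agent offers implements a common tool interface.
interface AgentTool {
    val name: String
    suspend fun execute(parameters: Map<String, String>): String
}

class QuestionTool : AgentTool {
    override val name = "QuestionTool"
    override suspend fun execute(parameters: Map<String, String>): String =
        "Asking: ${parameters["question"]}"
}

class AffirmationTool : AgentTool {
    override val name = "AffirmationTool"
    override suspend fun execute(parameters: Map<String, String>): String =
        "Affirming: ${parameters["text"]}"
}

// The agent looks up the tool the brain chose and executes it with the given parameters.
class ToolDispatcher(tools: List<AgentTool>) {
    private val registry = tools.associateBy { it.name }

    suspend fun dispatch(decision: BrainDecision): String =
        registry[decision.toolName]?.execute(decision.parameters)
            ?: "Unknown tool: ${decision.toolName}"
}
```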

Establishing some Basic Architectural Foundations

Conversations in Vox Manifestor aren't just simple command-response pairs—they're multi-turn, goal-oriented sequences. When a user enters the concept building screen to define the present and desired states of their wish, the agent initiates an ongoing dialogue guided by a conversation plan. Each plan includes a goal, steps, and progress tracking, giving the LLM a framework for guiding the user through the process.
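A conversation plan of this kind could be modelled along these lines. This is a sketch only, with hypothetical field names; the app's real plan structure may differ.

```kotlin
// Sketch of a plan the agent can hand to the LLM as scaffolding for a multi-turn dialogue.
data class PlanStep(val description: String, var completed: Boolean = false)

data class ConversationPlan(
    val goal: String,
    val steps: List<PlanStep>,
) {
    val progress: Int
        get() = steps.count { it.completed }

    val isComplete: Boolean
        get() = steps.all { it.completed }
}

fun presentStatePlan(wishTitle: String) = ConversationPlan(
    goal = "Define the present state for the wish: $wishTitle",
    steps = listOf(
        PlanStep("Ask how things stand today"),
        PlanStep("Probe for concrete details and feelings"),
        PlanStep("Summarise and confirm with the user"),
    ),
)
```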

The AgentCortex serves as the system's central nervous system, maintaining state that's observable by UI components through reactive flows. When users interact with buttons or voice commands, their intents flow through the cortex to the agent, which then makes decisions and updates shared state. This decoupled communication pattern ensures that UI components remain responsive while complex operations occur in the background.

I asked Claude 3.7 to draw a little diagram of how the AgentCortex mediates these interactions, and here it is.

The AgentCortex and attached architecture.

AgentCortex: The State Management Centre

The diagram above illustrates how the AgentCortex functions as the ‘pre-frontal cortex’ of the Vox Manifestor app.

In neuroscience, the prefrontal cortex has long been established as the part of the brain that enables us to maintain task focus, formulate executive decisions, and switch between different collections of behavioural schemas according to the context we’re presented with. While the AgentCortex state-management structure doesn’t exactly mirror this (yet), it’s a suitable name given its mediating role between the agent and its ‘reality’: the Android app screens it interfaces with.

Let me explain its key components and interactions:

Central State Management

The AgentCortex serves as a centralized state container that maintains all conversation-related states as observable flows. These include:

  • dialogueState: Tracks if the agent is speaking, listening, or idle

  • displayState: Controls UI overlays like affirmation screens

  • conversationHistory: Maintains the dialogue transcript

  • commandState: Tracks detected voice commands

  • coreLoopState: Manages the overarching conversation journey

Communication Patterns

The AgentCortex uses two primary communication patterns:

  1. UI Intent Flow: ViewModels send user intentions (like "interrupt speech" or "toggle conversation") to the AgentCortex via a uiIntentFlow. This creates a unidirectional data flow from UI to agent.

  2. State Observation: UI components observe various state flows to update their appearance and behavior. For example, the speech indicator observes dialogueState to show when the agent is speaking.

External System Integration

The AgentCortex coordinates with three external systems:

  • Voice Systems: Voice recognition events flow through the AgentCortex to be processed

  • Data Systems: Repositories update their data based on agent actions

  • Brain & Tools: LLM decisions influence state changes in the AgentCortex

CoreLoopState Focus

The CoreLoopState is a particularly important component that:

  • Tracks which phase of the conversation journey the user is in

  • Maintains which wish is being worked on

  • Records completion status of different phases

  • Provides a structured way to guide users through the manifestation process

This architecture creates a clean separation between UI concerns and agent logic while maintaining a reactive system that can respond immediately to user interactions. The AgentCortex effectively decouples various components of the system, allowing them to communicate through a shared state container rather than direct references.
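To make the shared-state-container idea concrete, here is a minimal sketch of what such a cortex could look like in Kotlin, using the flow names from the lists above. The exact types, the UiIntent variants, and the CoreLoopState fields are illustrative assumptions rather than the app's real signatures.

```kotlin
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.SharedFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asSharedFlow
import kotlinx.coroutines.flow.asStateFlow

enum class DialogueState { IDLE, LISTENING, SPEAKING }

// Illustrative subset of intents a ViewModel might send to the agent.
sealed interface UiIntent {
    object InterruptSpeech : UiIntent
    object ToggleConversation : UiIntent
    data class SelectWish(val slotIndex: Int) : UiIntent
}

// Illustrative shape of the overarching journey state.
data class CoreLoopState(
    val phase: String = "daily_update",
    val activeWishId: Long? = null,
    val completedPhases: Set<String> = emptySet(),
)

class AgentCortex {
    // Observable state the UI reacts to.
    private val _dialogueState = MutableStateFlow(DialogueState.IDLE)
    val dialogueState: StateFlow<DialogueState> = _dialogueState.asStateFlow()

    private val _conversationHistory = MutableStateFlow<List<String>>(emptyList())
    val conversationHistory: StateFlow<List<String>> = _conversationHistory.asStateFlow()

    private val _coreLoopState = MutableStateFlow(CoreLoopState())
    val coreLoopState: StateFlow<CoreLoopState> = _coreLoopState.asStateFlow()

    // Unidirectional channel from UI to agent.
    private val _uiIntentFlow = MutableSharedFlow<UiIntent>(extraBufferCapacity = 16)
    val uiIntentFlow: SharedFlow<UiIntent> = _uiIntentFlow.asSharedFlow()

    suspend fun sendIntent(intent: UiIntent) = _uiIntentFlow.emit(intent)

    // Called by the agent as the conversation progresses.
    fun setDialogueState(state: DialogueState) { _dialogueState.value = state }
    fun appendToHistory(line: String) { _conversationHistory.value = _conversationHistory.value + line }
    fun updateCoreLoop(transform: (CoreLoopState) -> CoreLoopState) {
        _coreLoopState.value = transform(_coreLoopState.value)
    }
}
```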

Navigating Between Screens

The app's navigation system is tightly integrated with the conversation flow. When a user says "define wish three," the agent recognizes this as a navigation command, selects the appropriate wish from the database, and initiates a transition to the concept screen. The agent persists across this navigation, maintaining conversation context and immediately resuming dialogue when the new screen appears.
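Here is a rough sketch of how a spoken command like "define wish three" could be parsed and handed off to navigation. The regex, the number-word map, and the navigateToConcept callback are all hypothetical names for illustration.

```kotlin
// Very small illustration of turning "define wish three" into a navigation action.
private val wishWords = mapOf("one" to 1, "two" to 2, "three" to 3, "four" to 4, "five" to 5)

fun parseDefineWishCommand(utterance: String): Int? {
    val match = Regex("define wish (\\w+)", RegexOption.IGNORE_CASE).find(utterance) ?: return null
    val word = match.groupValues[1].lowercase()
    return wishWords[word] ?: word.toIntOrNull()
}

fun handleUtterance(utterance: String, navigateToConcept: (wishIndex: Int) -> Unit) {
    parseDefineWishCommand(utterance)?.let { index ->
        // The agent keeps running across this navigation; only the screen changes.
        navigateToConcept(index)
    }
}
```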

This architecture allows for a coherent experience across different screens while maintaining separation of concerns. Each screen has its own ViewModel that observes agent state and responds accordingly, but the agent itself remains the consistent driver of the conversation.

Resource Management Challenges

One of the most nuanced aspects of the system is managing voice resources and asynchronous processes. Speech synthesis, recognition, and LLM calls all operate asynchronously, creating potential race conditions and resource leaks. The system uses coroutine scopes tied to conversation lifecycles to ensure that when a conversation ends or is interrupted, all related processes are properly canceled.
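The lifecycle-scoped cancellation idea can be sketched like this, using a CoroutineScope owned by the conversation session. This is illustrative only; the real app's scoping and naming will differ.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

class ConversationSession {
    // Every asynchronous job in this conversation is launched in this one scope.
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    fun start() {
        scope.launch { listenForSpeech() }
        scope.launch { synthesizeGreeting() }
    }

    // Cancelling the scope cancels recognition, synthesis, and any pending LLM calls together.
    fun interrupt() = scope.cancel("Conversation interrupted by user")

    private suspend fun listenForSpeech() { delay(Long.MAX_VALUE) } // stand-in for real recognition
    private suspend fun synthesizeGreeting() { delay(1_000) }       // stand-in for real TTS
}
```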

The current implementation, while functional, lacks a unified concept of a continuous "core loop" that persists across the entire application experience. Instead, it operates with discrete conversation flows that start and end based on specific user actions or commands. This creates a somewhat fragmented experience where users must explicitly initiate different types of conversations rather than experiencing a natural, ongoing dialogue.

This is precisely the limitation that the planned Core Conversation Loop aims to address—moving from a collection of state-specific conversations to a single, coherent conversational journey that guides users through the entire manifestation process.

Conversations as Complex Processes

In our voice-first application, our agent facilitates dialogues between users and an AI "brain" that helps them define their goals. These conversations are complex for several reasons:

  1. Asynchronous Operations: The agent performs many asynchronous tasks (speech recognition, LLM API calls, speech synthesis)

  2. Multi-step Processes: Conversations involve multiple turns and state transitions

  3. Resource Management: We need to properly handle speech resources and network connections

  4. Interruptibility: Users need to be able to stop a conversation at any time

The most critical issue we faced was around interruption. When a user says "stop" or presses a cancel button, we needed to ensure that all conversation-related processes terminated immediately - no lingering network calls, no background processes continuing to execute.

The Agent-Brain-Tools Architecture

Before diving into the solution, let's understand how our agent is structured:

Agent Architecture

  1. ConversationAgent: Manages the overall conversation flow, handles user input, and coordinates the interaction

  2. Brain Service: A wrapper around the Gemini LLM that makes decisions about next actions

  3. Tool Library: A collection of functions the LLM can "call" to perform actions like asking questions or saving data

The flow typically works like this:

  • The agent receives user input

  • The agent sends context to the brain

  • The brain decides what to do next (which tool to use)

  • The agent executes the selected tool

  • The cycle repeats

This architecture works well, but it creates a challenge: when a conversation is interrupted, we need to cancel not just the immediate operation but all dependent processes that have been spawned.
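Putting the loop together, here is a sketch of how the cycle might run inside a single cancellable coroutine, so that interrupting the conversation tears down every dependent step. The brain call and tool execution are stubbed, and none of these names are the app's real ones.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.ensureActive
import kotlinx.coroutines.launch

// Stubs standing in for the real brain and tools.
data class Decision(val tool: String, val params: Map<String, String>, val finished: Boolean)

suspend fun askBrain(context: String): Decision =
    Decision("QuestionTool", mapOf("question" to "How do you feel about this wish today?"), finished = true)

suspend fun executeTool(decision: Decision): String = "executed ${decision.tool}"

class CoreConversationLoop {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    fun run(initialContext: String) = scope.launch {
        var context = initialContext
        while (true) {
            ensureActive()                     // stop promptly if the user interrupted
            val decision = askBrain(context)   // the brain chooses the next tool
            val result = executeTool(decision) // the agent executes it
            context += "\n$result"             // fold the result back into the context
            if (decision.finished) break
        }
    }

    fun interrupt() = scope.cancel("User stopped the conversation")
}
```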


This blog post describes an implementation pattern from my experimental "Vox Manifestor" application, which uses Kotlin, Jetpack Compose, and the Gemini API to create an intelligent voice assistant for goal setting and manifestation. The code and other implementation details shown have been simplified for clarity.
