Mounty Python's Argument Clinic

Jason Vertrees
15 min read

A few recent projects of mine required voice chat systems that go beyond simple Q&A. These weren't support bots or information retrieval tools—they were conversational systems with goals, personality, and the ability to steer interactions toward specific outcomes while maintaining character. The technical challenges fascinated me: How do you build a system that remembers what it just said? How do you make it feel human without breaking the illusion? How do you model the escalating absurdity of an actual argument? How do you steer the conversation?

I decided to find out by building something that would make me laugh.

Here’s a quick demo: https://youtu.be/TXXUM5cWMgE?si=psQBRQmxzhw_v0w0

The Architecture of Disagreement

The result was Mounty Python's Argument Clinic—a voice-enabled chatbot that channels Mr. Barnard himself, complete with the sketch's escalating structure and that distinctly British blend of pedantry and ennui. Oh, and why "Mounty Python" you ask?

No. No, I don’t.

Moving on. Before I could tackle voice synthesis or browser latency, I needed to solve a more fundamental problem: How do you model a conversation that's simultaneously structured and chaotic, which is to say, human?

The answer, it turns out, is state machines.

Most real-world systems—especially those involving conversation, UI flows, or interactive logic—are best modeled not as linear scripts but as state machines. A state machine is a conceptual model where the system exists in one of a fixed set of states, and transitions between them based on inputs. Each state defines specific behavior, and the logic of the system emerges from how those states are connected.

This approach is ideal for modeling dialog systems, where the flow isn't strictly linear, and the system needs to behave differently depending on prior context. In the Argument Clinic, for example, the conversation clearly progresses through recognizable stages: greeting, contradiction, philosophical debate, and finally resolution where payment is demanded. A state machine allows us to encode these stages as discrete states—each with their own tone and rules—and to transition between them in response to user intent.

Traditional state machines rely on deterministic transitions: clear, rule-based conditions that define exactly when and how the system moves from one state to another. For example, a vending machine might move from "waiting" to "dispensing" when the input is exactly $1.00. This approach is simple and predictable—but brittle in messy, human-centered domains like conversation. In a pure deterministic chatbot, you'd need to hand-code every possible variation of "I want to argue" or "Can I pay now?"—an impossible task when dealing with natural language.
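To make the contrast concrete, here is the vending-machine example as a deterministic transition function. The names are illustrative, not from the project:

```python
from enum import Enum

class VendingState(str, Enum):
    WAITING = "waiting"
    DISPENSING = "dispensing"

def transition(state: VendingState, amount_inserted: float) -> VendingState:
    # Deterministic rule: one exact condition moves the machine forward.
    if state is VendingState.WAITING and amount_inserted == 1.00:
        return VendingState.DISPENSING
    return state
```

Every transition is a hard-coded predicate; there is no room for "close enough," which is exactly why this style breaks down on natural language.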

That's why I built this system using probabilistic transitions, inferred by large language models. Instead of checking inputs against a strict rule table, the chatbot uses an intention agent to classify user messages as argumentative, transactional, meta, or confused, then feeds that intent to a transition agent, which selects the next state based on prior context. This lets the system adapt fluidly, respond naturally, and even surprise the user—while still operating within a structured, formal state graph.

Here’s a rendering of the state graph. Oh, and pydantic-graph creates this for free:

print(argument_clinic_graph.mermaid_code())

Amazing! This says the flow can start in the EntryNode, then transition to SimpleContradiction, then to Resolution or Argumentation. (Note: each node also has a self-loop, which I left out to simplify the image.)

We now have a general flow diagram and can influence the state of the conversation. The result is a system that feels absurdly human because it's unpredictably predictable, just like an actual argument.

Five States of Professional Disagreement

But, how does one model the original skit? Easy. I just started by feeding the original sketch dialog into ChatGPT and asking it to create a state machine. The results were surprisingly good—AI understanding comedy structure, who would have thought? But the full sketch includes room transitions, abuse departments, and other bureaucratic absurdities that were beyond what I wanted to prove out. I distilled it down to the heart of the experience: a five-state graph that captures the essence of professional argumentation.

  • ENTRY: The polite beginning. "Is this the right room for an argument?" The system is welcoming, almost helpful.

  • SIMPLE_CONTRADICTION: Where the real fun begins. Statements like "No it isn't!" followed by "Yes it is!" in endless loops. This is contradiction masquerading as argument, and the system will happily contradict anything you say with the flat certainty of someone who's done this a thousand times. Straight from the sketch, but a little more verbose. We can tighten that up a bit.

  • ARGUMENTATION: Brings more sophisticated contradictory logic. Here, Mr. Barnard doesn't just say "No"—he provides reasoning, examples, and elaborate justifications for why you're wrong. It's contradiction with intellectual pretensions. Given LLMs have memorized the internet, he’s pretty good at arguing.

  • META_COMMENTARY: Emerges when users complain about the nature of the argument itself. This triggers a lecture on what an argument actually is—delivered, naturally, in an argumentative tone that proves the point while missing it entirely.

  • RESOLUTION: Where the bureaucracy shows its teeth. Time's up. Payment is demanded. No argument without it. The system will refuse to argue until you "pay" again by saying something like..."Here's five pounds." Only then does it return you to the contradiction loop, as if nothing happened.
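Because the LLM proposes transitions probabilistically, it helps to keep a guard rail: an explicit map of legal edges. The edges below are my approximate reading of the graph above (self-loops included); in the real project the constraints live in each node's return type annotation, as shown later:

```python
# Approximate legal edges of the five-state graph; keys and values
# mirror the ArgumentState enum values used in the project.
ALLOWED_TRANSITIONS = {
    "entry": {"entry", "simple_contradiction"},
    "simple_contradiction": {"simple_contradiction", "argumentation", "resolution"},
    "argumentation": {"argumentation", "meta_commentary", "resolution"},
    "meta_commentary": {"meta_commentary", "argumentation", "resolution"},
    "resolution": {"resolution", "simple_contradiction"},
}

def next_state(current: str, proposed: str) -> str:
    # Accept the proposed transition only if it is a legal edge; otherwise stay put.
    return proposed if proposed in ALLOWED_TRANSITIONS[current] else current
```

The LLM gets creative freedom inside each state, but it can never teleport the conversation somewhere the sketch wouldn't go.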

Here's what the implementation looks like in practice. Again, in a typical state machine the transitions are deterministic control structures. Here, we read the user's text, infer their intent, and route the conversation based on what we think the user intends.

class ArgumentState(str, Enum):
    """Simplified argument-focused states"""
    ENTRY = "entry"
    SIMPLE_CONTRADICTION = "simple_contradiction"
    ARGUMENTATION = "argumentation"
    META_COMMENTARY = "meta_commentary"
    RESOLUTION = "resolution"

class UserIntent(str, Enum):
    """User intention categories"""
    ARGUMENTATIVE = "argumentative"  # User wants to argue/debate
    TRANSACTIONAL = "transactional"  # User wants to pay, restart, or perform action
    META = "meta"                    # User wants to discuss the nature of arguing
    CONFUSED = "confused"            # User is confused or asking for clarification

I'm using pydantic-ai for the LLM orchestration—that package solves a lot of my problems elegantly. The team also released pydantic-graph, which implements state machines beautifully. You can use it deterministically or add non-deterministic transition logic, which is exactly what I needed here.

Each state is navigated by LLM-powered agents working in concert. The intention_agent infers whether the user is being argumentative, transactional, meta, or confused:

intention_agent = Agent(
    model=OpenAIModel("gpt-4o-mini"),
    system_prompt="""You analyze user input to determine their intention in the Argument Clinic context.

    Intention categories:
    - ARGUMENTATIVE: User wants to argue, debate, or make a point to be contradicted
    - TRANSACTIONAL: User wants to pay money, restart, continue, or perform an action
    - META: User wants to discuss what arguing is, complain about the process, or talk about the clinic itself
    - CONFUSED: User is confused, asking for help, or doesn't understand what's happening

    Examples:
    - "That's not true!" → ARGUMENTATIVE
    - "Fine, here's 5 pounds" → TRANSACTIONAL
    - "This isn't an argument!" → META
    - "I don't understand" → CONFUSED
    - "I want to pay to continue" → TRANSACTIONAL
    - "The sky is blue" → ARGUMENTATIVE
    - "What is this place?" → CONFUSED

    Return one of: argumentative, transactional, meta, confused""",
    result_type=UserIntent,
)

Here we create a pydantic-ai Agent. To keep things fast and cheap, I’m using GPT-4o mini.
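For resilience (say, when the LLM is unreachable), the intention agent could be backed by a crude keyword classifier. This is a hypothetical sketch of my own, not part of the project, but it shows what a degenerate deterministic version of the same four-way classification looks like:

```python
import re
from enum import Enum

class UserIntent(str, Enum):
    ARGUMENTATIVE = "argumentative"
    TRANSACTIONAL = "transactional"
    META = "meta"
    CONFUSED = "confused"

# Keyword patterns are illustrative; a real fallback would need tuning.
TRANSACTIONAL_HINTS = re.compile(r"\b(pay|pounds?|money|restart|continue)\b", re.I)
CONFUSED_HINTS = re.compile(r"\b(help|understand|what is|confused)\b", re.I)
META_HINTS = re.compile(r"\b(argument|arguing|clinic)\b", re.I)

def classify_offline(text: str) -> UserIntent:
    # Order matters: transactional phrasing wins, then confusion, then meta;
    # everything else is treated as a statement to be contradicted.
    if TRANSACTIONAL_HINTS.search(text):
        return UserIntent.TRANSACTIONAL
    if CONFUSED_HINTS.search(text):
        return UserIntent.CONFUSED
    if META_HINTS.search(text):
        return UserIntent.META
    return UserIntent.ARGUMENTATIVE
```

The LLM version handles phrasings no keyword list ever could, which is the whole point of probabilistic transitions; the fallback just keeps the clinic open when the model is down.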

The arguer_agent channels Mr. Barnard himself, crafting responses that feel authentically pedantic:

arguer_agent = Agent(
    model=OpenAIModel("gpt-4o-mini"),
    system_prompt="""You are Mr. Barnard from Monty Python's Argument Clinic.
    Your responses will be guided by the current argument state and user intention provided.

    Keep responses short, punchy, and in character.
    Always contradict or argue with whatever the user says UNLESS they have transactional intent.
    Be pedantic and argumentative but stay professional. Consider previous messages to understand the ongoing argument if one exists.

    IMPORTANT: In RESOLUTION state, refuse to argue until payment is received!""",
    result_type=str,
)

In pydantic-graph, each state is a "Node" that derives from BaseNode. A Node takes some input and context, then returns the next state. Here's the ArgumentationNode, which handles sophisticated contradictory arguments:

@dataclass
class ArgumentationNode(BaseNode[ArgumentClinicContext]):
    """Sophisticated contradictory arguments with reasoning and examples"""

    async def run(
        self, ctx: GraphRunContext[ArgumentClinicContext]
    ) -> MetaCommentaryNode | ResolutionNode | ArgumentationNode | End[str]:

        user_input = ctx.state.conversation_history[-1] if ctx.state.conversation_history else ""

        # Infer user intention first
        user_intention = await infer_user_intention(user_input, ctx.state.conversation_history)

        # Generate response with self-contained prompt
        prompt = f"""
        User intention: {user_intention.value}
        If ARGUMENTATIVE: Provide sophisticated contradictory arguments
        If TRANSACTIONAL: Handle their request appropriately
        If META: Engage with their meta-discussion about arguing
        If CONFUSED: Argue but provide some guidance

        User says: "{user_input}"
        Respond in character as Mr. Barnard with a sophisticated argument.
        """
        result = await arguer_agent.run(prompt, message_history=ctx.state.arguer_messages)

        # Update state
        ctx.state.arguer_messages.extend(result.new_messages())
        ctx.state.last_response = result.data
        ctx.state.turn_count += 1

        # Type-constrained transitions with colocated logic
        if ctx.state.turn_count >= 10:
            return ResolutionNode()  # Time's up - demand payment
        elif user_intention == UserIntent.META and "argument" in user_input.lower():
            return MetaCommentaryNode()  # Explain what arguing is
        else:
            return ArgumentationNode()  # Continue sophisticated argumentation

Notice that this node can only exit to four potential states: MetaCommentary, Resolution, End or back to itself. The transitions are type-constrained and the logic is colocated with the state definition. Some states have loops, so you can continue arguing indefinitely—until the system decides your time is up.

As you execute the logic, moving from node to node, you need to know what was said and keep track of a few other things — like the turn counter that triggers the payment demand. You do this by passing pydantic-graph's GraphRunContext to each node. It carries conversation state as we navigate between nodes. I created a simple dataclass to hold session information, conversation history, turn counts, and payment status. The beauty of this approach is that each state knows exactly what it can become, and the system maintains coherent character throughout the conversation.
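That shared state dataclass looks roughly like this. Field names here are reconstructed from the node code above and the description of what the context holds; see the repository for the exact definition:

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentClinicContext:
    """Shared state carried through every node via GraphRunContext.

    Field names are illustrative; the real definition lives in the repo.
    """
    session_id: str = ""
    conversation_history: list[str] = field(default_factory=list)
    arguer_messages: list = field(default_factory=list)  # pydantic-ai message history
    last_response: str = ""
    turn_count: int = 0
    has_paid: bool = False
```

Each node reads from and mutates this one object, so the argument stays coherent no matter which state you wander into.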

The UI

After seeing the app/demo, you’d be forgiven for thinking I’m a frontend wizard. The reality is that DaisyUI makes rapid development of beautiful experiences easy. I whipped up a simple UI and, to nail the old-timey feel, chose Daisy’s retro theme. You get polished, cohesive styling essentially for free.

Just set up your project and tell Daisy which theme you want:

@import "tailwindcss";
@plugin "daisyui" {
  themes:
    retro --default,
    dark --prefersdark;
}

The Sound of Professional Ennui

Text-based argumentation was working beautifully, but the real magic would come from voice. I needed a system that could capture speech, process it in real-time, and respond with the kind of voice that makes you believe you're actually arguing with a bored British bureaucrat who went to Oxford, took a wrong turn, and now argues professionally out of spite.

As mentioned, the frontend is built in React with DaisyUI—I love DaisyUI for its sensible defaults and clean aesthetics. My original approach was simple: click "Record," speak, click "Stop." It worked but felt mechanical. Real conversations don't have buttons. So I built a custom VoiceRecorder component with continuous voice activity detection (VAD). Now you simply speak into your microphone and the system automatically starts and stops recording based on whether you're talking.

My Voice Activity Detection implementation is so naive. The microphone picks up sounds and measures the "level" in real time. My background ambient level hovers around 0.04 somethings per something—I have no idea what the units are, but consistency matters more than precision. When I talk, it jumps to 0.1 and higher. So the VAD simply says: if level > BACKGROUND_NOISE, then talking; else silence. I add a natural-feeling 250ms delay before sending audio to the backend, which prevents cutting off words and gives the interaction a more human rhythm. Yes, I could have built a smart window-based smoothing algorithm to improve dynamic background noise detection but I prefer delivering something quick and this worked well out of the box. No need to optimize this yet.
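The threshold-plus-hangover logic is small enough to show in full. The real version lives in the React VoiceRecorder component; this is the same algorithm sketched in Python for clarity, with an illustrative tick size:

```python
BACKGROUND_NOISE = 0.04  # measured ambient level; units are whatever the audio API reports
HANGOVER_MS = 250        # keep recording briefly after the level drops, so words aren't clipped

def vad_step(level: float, talking: bool, silent_ms: float,
             tick_ms: float = 50.0) -> tuple[bool, float]:
    """One tick of the naive VAD: a fixed threshold plus a 250 ms hangover."""
    if level > BACKGROUND_NOISE:
        return True, 0.0            # speech detected; reset the silence clock
    silent_ms += tick_ms
    if talking and silent_ms < HANGOVER_MS:
        return True, silent_ms      # below threshold, but still inside the hangover
    return False, silent_ms         # genuine silence; stop recording
```

Call it once per audio frame; when it flips from talking to silent, ship the buffered audio to the backend.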

The captured voice streams via websockets to a FastAPI backend where it's transcribed using OpenAI Whisper as the primary service, with Google Cloud STT as fallback. Once the state machine generates a response, it's converted back to voice using a cascade of text-to-speech services: ElevenLabs for the primary voice, Google Cloud TTS as backup, and OpenAI TTS as a last resort.
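The cascade itself is just an ordered try/except over providers. A minimal sketch, with hypothetical async callables standing in for the real ElevenLabs, Google, and OpenAI clients:

```python
import asyncio

async def synthesize_with_fallback(text: str, providers) -> tuple[str, bytes]:
    """Try each TTS provider in order and return the first audio that succeeds.

    `providers` is a list of (name, async_callable) pairs, tried in priority
    order. The callables here are hypothetical; each real one would wrap
    its provider's SDK call.
    """
    last_error = None
    for name, synth in providers:
        try:
            return name, await synth(text)
        except Exception as err:  # network errors, rate limits, auth failures...
            last_error = err
    raise RuntimeError("all TTS providers failed") from last_error
```

The same pattern covers the STT side (Whisper primary, Google Cloud STT fallback).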

Enter ElevenLabs

But the real breakthrough was creating a convincing voice. I knew I needed a British male voice that sounded both annoyed and drowning in ennui. Since this sketch was recorded decades ago, I wanted the quality to feel like it was captured on older equipment. I had no clue where I'd find such a thing—I even considered going to Cameo and paying John Cleese to record custom responses, but stopped short of that particular rabbit hole. Again, I prefer delivering over premature optimization.

ElevenLabs turned out to be the answer. They're clearly leading the custom voice model space right now, and no one else seems close. I signed up, clicked around their library of stock voices, but nothing quite captured the right mix of education, irritation, and professional boredom. Some were too deep. Others were too polite. I needed something between their Drill Instructor and Bradford voices, but faster and more clipped.

Then I discovered ElevenLabs' custom voice design feature: you can create a custom voice simply by providing a descriptive prompt! But how do you describe Mr. Barnard? Hell if I know, but ChatGPT did; I asked it to help me craft something under 500 characters:

A fast-talking, middle-aged British man with a clipped, upper-class London accent. He sounds like he went to Oxford, took a wrong turn, and now argues professionally out of spite. His tone is sharp, dry, and bored—like he's done this a thousand times and finds it beneath him, yet takes pride in being technically correct. He never raises his voice. Every contradiction is flat, like you're wasting his time.

This. Nailed. It.

I named the voice "Mounty Python" and copied its custom voice ID for use in my application. ElevenLabs gives you free credits to start, which I burned through in a single day of excited experimentation. :-) Now I pay them $22/month, which feels entirely reasonable for what I'm getting.

The full voice pipeline works like this: record the user, send the recording to the backend, convert it to text, infer the user's intent, apply that to the current state and transition accordingly, create a response, convert that to the custom British voice, and send the audio back to the UI. The entire round trip takes up to about 5 seconds—longer than the sketch's rapid-fire exchanges, but the payoff is worth it. You feel like you're in a surreal bureaucratic argument, not a chatbot session.

Watching the Argument Unfold

Building conversational AI systems taught me that observability isn't optional—it's essential. I needed to understand what the system was sending to the LLMs, how the state transitions were working, and where latency was creeping in. For this project, I used Jaeger for distributed tracing, though I'm usually partial to Arize Phoenix. Both support the OpenTelemetry standard, which makes instrumentation straightforward.

I also created an expandable "Show Debug Information" panel that reveals the current pydantic graph state, voice recording levels, average and p95 response times, and the number of requests sent in the session. This serves dual purposes: it helps me monitor system performance in real-time, and it demonstrates to users how such a system works under the hood.

There's something delightfully meta about debugging an argument system while arguing with it. The debug panel shows you exactly which state you're in ("SIMPLE_CONTRADICTION"), how long the system is taking to contradict you ("Average: 5.4s"), and how many times you've been contradicted this session ("Requests: 2"). It's like having a referee for your argument who's also keeping score.
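Computing those panel numbers takes a couple of lines with the standard library. This is a sketch of the calculation, not the project's frontend code:

```python
import statistics

def latency_summary(samples_s: list[float]) -> tuple[float, float]:
    """Average and p95 over the session's round-trip times, in seconds."""
    avg = statistics.fmean(samples_s)
    # n=20 yields 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(samples_s, n=20, method="inclusive")[18]
    return avg, p95
```

Tracking p95 alongside the average matters because voice latency is spiky: one slow TTS call barely moves the mean but is exactly what the user feels.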

The Argument in Practice

Here's how a typical session unfolds: you ask whether this is the right room for an argument, get flatly contradicted, push back, and find yourself in increasingly elaborate disputes about whether you're even having an argument.

And so on, until—ding!—your five minutes are up and payment is demanded. If you say "Here's five pounds," he happily resumes: "Ah, thank you. Now then, you were completely wrong about everything."

The system attempts to faithfully recreate the sketch's escalating structure while feeling genuinely conversational. Users often forget they're talking to a machine until they notice the debug panel or remember that real humans don't usually demand payment mid-argument. The state machine ensures that conversations follow the sketch's logic while the LLM agents provide natural language flexibility within each state.

I knew it was working as designed when I started laughing at the conversation. The system would argue about arguing while arguing, creating recursive loops of disagreement that felt both absurd and oddly satisfying.

The Inevitable Latency of Disagreement

Even though the backend itself adds almost no overhead, voice introduces unavoidable latency that no amount of optimization can eliminate. Transcription takes 1-2 seconds, text-to-speech synthesis takes 2-3 seconds depending on the provider, and the total round trip lands at 1-5 seconds. In practice, this makes the timing looser than the original sketch's rapid-fire exchanges.

There are improvements I'd like to make: client-side speech-to-text using WebAssembly Whisper could reduce transcription latency, more advanced emotional steering based on user frustration levels could make the interactions even more dynamic, and local session playback with shareable "argument replays" could turn good arguments into social content. But these are refinements, not fixes. The core experience already works.

What I Learned About Modeling Human Absurdity

This project was a joyfully productive use of my time. It brought together structured AI workflows, voice interaction, websockets, FastAPI, and one of the funniest sketches in television history. But more than that, it demonstrated something important about building conversational systems: even the most chaotic, human behaviors can be modeled in code when you find the right abstractions.

The key insight was recognizing that arguments, despite feeling spontaneous and unpredictable, actually follow patterns. They have structure, escalation, and resolution. By encoding these patterns as states and using AI to handle the transitions between them, I could create a system that feels genuinely argumentative while remaining technically manageable. I was able to naturally steer the discussion as intended. The probabilistic state machine was key.

State machines are powerful tools for modeling any system where context matters and behavior changes based on history. Conversational AI, user interfaces, game logic, workflow automation—anywhere you need to remember where you've been and decide where to go next, state machines provide the structure that makes complexity manageable. You might not realize it but state machines are all around us.

The probabilistic transitions were equally important. Pure deterministic state machines are too rigid for natural language, but pure LLM conversations are too chaotic for structured experiences. The combination gives you the best of both: the reliability of formal state management with the flexibility of AI-powered interpretation.

If you're building something technical and need a partner who can cut through the chaos and deliver clean, working software fast—someone who won't argue with your requirements unless you pay them five pounds first—get in touch. I promise not to contradict everything you say.

Unless, of course, you want me to.

My site: https://heavychain.org/

Project Source Code: https://github.com/inchoate/argument-clinic - the repository is the reference implementation, since some of the code in this post may go stale.

Note: The links to ElevenLabs are referral links.


Written by

Jason Vertrees

I'm a CTO and founder with nearly two decades of experience driving growth and transformation through technology. At Stronghold Investment Management, I led the development of a systematic real asset trading platform and modernized everything from Salesforce strategy to custom cloud-native infrastructure. My background spans commercial real estate, e-commerce, and private markets — always focused on delivering innovation, velocity, and meaningful business outcomes. I hold a PhD in Theoretical & Computational Biophysics and was recognized as a Google Developer Expert in Cloud. I build high-trust, high-output teams. I’ve rebuilt broken cultures, hired top-tier engineers, and helped early-stage and PE-backed companies scale with confidence. System modernization is my specialty — not just upgrading software, but aligning teams and infrastructure with what the business actually needs. Currently, I lead client engagements through Heavy Chain Engineering and am building Newroots.ai, an AI-driven relocation advisory platform.