Why make someone operate a remote when they can just talk to the TV?


Imagine stepping into a world where your software doesn’t hand you a clipboard full of forms. Instead, it greets you, asks what you need, listens as you speak, glances at your sketches or screen recording, and jumps into action.

We built today’s interfaces around the limits of yesterday’s machines: the keyboard and mouse. But those constraints no longer define what's possible. Modern AI can already see, hear, read, and even sketch alongside us (and it’s only getting better). So why are we still forcing users through click-by-click gauntlets?

We have taught machines to hear us, see us, even interpret our sketches, yet we still build interfaces that treat users like data-entry clerks: form fields, dropdowns, “Next” buttons, multi-step wizards.

With multi-modal AI capabilities now within reach, we can design interfaces that naturally accept and understand human input: voice, video, sketches, and natural language, without forcing people into rigid flows built for the keyboard-and-mouse era.


Designing for constraints that no longer exist

Old reality: Users point, click, type into fields, hit “Next,” rinse and repeat.
New reality: Users can simply speak, draw, record video, or share a link, and AI handles the rest.

Some examples below:

  • It’s like printing turn‑by‑turn directions from MapQuest in 2025 instead of asking your phone, “Hey, navigate me home.”

  • Modern smart TVs let you change channels by saying “Next show,” but many TV guide apps still require you to tap through eight menus.

  • Customer support lines and chatbots still funnel you through menus (“Press 1 for billing, 2 for technical”), even though your smartphone’s camera and microphone could let you show the broken product and explain the issue in your own words.

When AI can transcribe your speech, parse a PDF or video demo, and extract insights in seconds, every extra click feels like punishment.


Multi-modal intelligence demands multi-modal expression

Humans don’t think in dropdowns and text boxes. We think in stories, feelings, questions, and evidence.

  • Voice: “Show me the top three issues from yesterday’s usability test.”

  • Video: A quick screen recording highlighting confusion.

  • Sketch: A hand‑drawn flowchart of an onboarding idea.

  • Document: A PowerPoint deck with user quotes.

Why ask someone to fill out a written survey when they could record a 30‑second reflection on their phone?

Example: In telemedicine, patients can now send video descriptions of symptoms rather than filling out rigid checkboxes, yet many healthcare portals still rely on outdated forms.
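
To make this concrete, here is a minimal Python sketch of what it might look like to treat these modalities as first-class inputs rather than form fields. All names here are hypothetical (not from any particular framework); the point is only that voice, video, sketches, and documents can flow into one structure that downstream models then transcribe, parse, or describe.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Modality(Enum):
    VOICE = "voice"
    VIDEO = "video"
    SKETCH = "sketch"
    DOCUMENT = "document"


@dataclass
class UserInput:
    """One piece of user input, in whatever form it arrived."""
    modality: Modality
    payload: bytes                    # raw audio, video, image, or file bytes
    mime_type: str                    # e.g. "audio/wav", "application/pdf"
    transcript: Optional[str] = None  # filled in later by speech-to-text, OCR, or a vision model


def normalize(inputs: list[UserInput]) -> str:
    """Collapse multi-modal inputs into one prompt-ready description.

    In a real system each branch would call a speech-to-text, vision,
    or document-parsing model; only the routing shape is shown here.
    """
    parts = []
    for item in inputs:
        if item.transcript:
            parts.append(f"[{item.modality.value}] {item.transcript}")
        else:
            parts.append(f"[{item.modality.value}] ({item.mime_type}, not yet processed)")
    return "\n".join(parts)
```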


Building partners, not transactional UIs

Traditional interfaces treat users like clerks. AI lets us build conversational partners:

  • Context‑aware guidance: “Last week, you flagged mobile checkout confusion. Want me to pull that report now?” The system guides, adapts, and assists.

  • Adaptive workflows: The system adjusts follow‑up questions based on your answers. Workflows become fluid, goal-driven journeys, not linear click paths.

  • Collaborative journeys: You and the AI team up to reach goals, not just fill fields. Users collaborate with an agent, not operate a UI.

Imagine applying for a mortgage by sitting with an advisor agent who asks you a few questions, scans your documents, and then says, “You’re all set,” instead of wrestling with twenty pages of forms.

Example: Imagine logging into an analytics dashboard that doesn’t show a million buttons. Instead, your AI partner greets you:

“Welcome back. Shall I pull last week’s churn insights or start a deep dive on new user behavior?”

That isn’t UX as interface; it’s UX as a living partner.
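
As a rough illustration of what separates that partner from a wizard, here is a toy Python sketch (everything in it is hypothetical, not a real product API): the greeting is driven by remembered context, and the follow-up question is chosen from the user’s answer rather than from a fixed step list. A production version would route both through an LLM instead of keyword checks.

```python
from dataclasses import dataclass, field


@dataclass
class SessionContext:
    """What the assistant remembers about a user between visits."""
    last_topics: list[str] = field(default_factory=list)


class AnalyticsPartner:
    """A toy conversational partner for the dashboard example above."""

    def __init__(self, context: SessionContext):
        self.context = context

    def greet(self) -> str:
        # Context-aware guidance: lead with what the user cared about last time.
        if self.context.last_topics:
            topic = self.context.last_topics[-1]
            return (f"Welcome back. Last time you looked at {topic}. "
                    "Shall I pull the latest on that, or start something new?")
        return "Welcome. What would you like to explore today?"

    def follow_up(self, answer: str) -> str:
        # Adaptive workflow: the next question depends on the answer,
        # not on a fixed wizard step.
        if "churn" in answer.lower():
            self.context.last_topics.append("churn")
            return "Got it. Weekly churn trend, or a cohort breakdown?"
        if "new user" in answer.lower():
            self.context.last_topics.append("new user behavior")
            return "Sure. Should we look at activation or at drop-off points?"
        return "Tell me a bit more about what you want to learn."


# Example exchange
partner = AnalyticsPartner(SessionContext(last_topics=["mobile checkout confusion"]))
print(partner.greet())
print(partner.follow_up("let's look at churn"))
```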


Tools to prototype the next‑gen experience

Do we have all the tools to try this idea out? More or less, I think (and even if not, we need to start thinking in that direction). It may not work for every use case, but it can certainly be explored for many. What use cases can you think of? What tools would you suggest exploring for an MVP? Here is a rough sketch of how small the core wiring could be.
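
In the Python sketch below, every function is a placeholder for whichever speech-to-text, multi-modal LLM, and text-to-speech services you choose; none of the names refer to a real API. The whole loop for one conversational turn is three calls: voice in, answer out, no form in between.

```python
# A deliberately thin MVP loop. Every function below is a placeholder to be
# backed by real services (speech-to-text, a multi-modal LLM, text-to-speech);
# none of these names refer to an actual API. Only the wiring is the point.

def transcribe(audio: bytes) -> str:
    """Placeholder: swap in any speech-to-text service or local model."""
    raise NotImplementedError


def ask_model(prompt: str, attachments: list[bytes]) -> str:
    """Placeholder: swap in the multi-modal LLM API of your choice."""
    raise NotImplementedError


def speak(text: str) -> bytes:
    """Placeholder: swap in any text-to-speech service."""
    raise NotImplementedError


def handle_turn(audio: bytes, attachments: list[bytes]) -> bytes:
    """One conversational turn: voice (plus optional files) in, voice out."""
    user_said = transcribe(audio)
    answer = ask_model(user_said, attachments)
    return speak(answer)
```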


Closing thoughts

We have trained AI to beat humans at Go, compose symphonies, and diagnose diseases, and we use it to… auto‑complete address fields in keyboard-era forms.

When we could be designing interfaces that absorb entire conversations, process visual mockups, and even detect frustration in a user’s tone, it’s almost comical that we still patch AI onto age‑old forms.

We are standing at the crossroads of “click‑driven relics” and “AI‑powered collaboration.” The only question left is: will your next product treat users like data‑entry clerks… or intelligent partners?

It’s time to build (or at least explore) truly multi-modal, AI-native experiences for users. That is how we move forward.

“The pursuit has always been about how to build computers that understand us, instead of us having to understand computers, and I feel like we really are close to that real breakthrough. … We're entering this new era where computers not only understand us but can actually anticipate what we want and our intents.” ~ Satya Nadella (Microsoft CEO Nadella: Copilot+ PC Gets Closer To ‘Computers That Understand Us’)
