Building an AI Agent for Mobile Automation with LangGraph + LLMs


Writing Appium scripts is fun, until it isn’t.
Hard-coded XPaths break, swipes go rogue, and suddenly your “automation” is the thing that needs babysitting.

So instead of wrestling with brittle test scripts, I built something different:
👉 A sequential LangGraph workflow that reads a plain .txt file of instructions and executes them step by step, with screenshots, vision, reasoning, and action fallback chains baked right in.

This isn’t just automation.
This is automation with a brain. 🧠


🧩 The Big Idea

The project stitches together three layers:

  1. Vision Layer → captures screenshots, OCR, and a unified JSON schema of the screen (XML + OCR + icons).

  2. Reasoning Layer → an LLM decides what to do next, but with strict enforcement so it doesn’t hallucinate.

  3. Action Layer → taps, types, swipes, and retries until success, or fails gracefully.

The workflow runs as a graph:

START → bootstrap → observe → reason → act → next_step → END
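
Wiring that up with LangGraph looks roughly like the sketch below. The node functions (bootstrap, observe, reason, act) and the exact State keys are assumptions pieced together from the snippets later in this post; treat it as a skeleton, not the project’s actual workflow.py:

from typing import Any, TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict, total=False):
    steps: list[str]         # parsed lines of the .txt task file
    step: int                # index of the current instruction
    unified: dict[str, Any]  # merged XML + OCR + icon JSON
    caption: str
    history: list[str]
    action: dict[str, Any]   # action chosen by the reason node


def next_step(state: State) -> dict:
    # Advance the step pointer; routing happens in the conditional edge below.
    return {"step": state.get("step", 0) + 1}


def should_continue(state: State) -> str:
    # Loop back to observe until every instruction has been executed.
    return "observe" if state["step"] < len(state["steps"]) else END


graph = StateGraph(State)
graph.add_node("bootstrap", bootstrap)   # launch the app, reset the driver
graph.add_node("observe", observe)       # screenshot + vision -> unified JSON
graph.add_node("reason", reason)         # LLM picks the next action
graph.add_node("act", act)               # execute with fallbacks
graph.add_node("next_step", next_step)

graph.add_edge(START, "bootstrap")
graph.add_edge("bootstrap", "observe")
graph.add_edge("observe", "reason")
graph.add_edge("reason", "act")
graph.add_edge("act", "next_step")
graph.add_conditional_edges("next_step", should_continue)

app = graph.compile()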

πŸ” Observe: Seeing the Screen

Each step starts with a screenshot + vision pipeline.
We merge:

  • page_source.xml (classic Appium dump)

  • OCR text fragments

  • Icon classifier outputs

The result is a unified JSON that describes the UI in a consistent way.
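
To make that concrete, a single element in the unified JSON might look like the dict below. The field names are illustrative guesses, not the exact schema observe_screen emits:

# Hypothetical shape of one entry in the unified JSON -- field names are illustrative.
unified_element = {
    "id": "btn_find_cars",
    "text": "Find Cars",                # from the XML dump and/or OCR
    "sources": ["xml", "ocr"],          # which detectors saw this element
    "icon_label": None,                 # filled in when the icon classifier fires
    "bounds": [540, 1630, 980, 1750],   # left, top, right, bottom in pixels
    "clickable": True,
}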


🧠 Reason: Keeping the LLM on a Leash

Here’s a simplified snippet from the reason node:

def reason(state: State):
    curr_idx = state.get("step", 0)
    curr_instruction = state["steps"][curr_idx].strip()

    prompt = REACT_TEMPLATE.format(
        system=SYSTEM_PROMPT,
        unified=json.dumps(state.get("unified", {}), indent=2),
        caption=state.get("caption", ""),
        history="\n".join(state.get("history", [])[-5:]),
        goal=f"Your current step is:\n{curr_instruction}"
    ) + "\n\nIMPORTANT: Execute ONLY the current step."

    messages = [{"role": "user", "content": prompt}]
    action_dict = chat(messages)

    # Force alignment if the step says "tap", "type", etc.
    if curr_instruction.lower().startswith("tap on"):
        # "tap on 'Find Cars'" -> target text "Find Cars" (surrounding quotes stripped)
        target_text = curr_instruction.split("on", 1)[-1].strip().strip("'\"")
        action_dict = {"action": "tap", "parameters": {"text": target_text}}

    # Hand the chosen action to the act node via the graph state
    return {"action": action_dict}

πŸ”‘ What’s happening here?

  • The LLM gets the current step only (no peeking ahead).

  • If the instruction starts with "tap on", it forces a TAP action.

  • Same enforcement exists for "type" and "swipe_until" (see the sketch below).

This way, the agent cannot drift into actions you never asked for.
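
The "type" and "swipe_until" branches could follow the same pattern. Here’s a hedged sketch; the "field" parameter name and the exact regexes are assumptions, not the project’s actual parsing:

import re

# Continuing inside reason(), after the "tap on" check above.
if curr_instruction.lower().startswith("type "):
    # e.g. "type 'Mumbai' in 'Location'"
    m = re.match(r"type\s+'(.+?)'\s+in\s+'(.+?)'", curr_instruction, re.IGNORECASE)
    if m:
        action_dict = {"action": "type",
                       "parameters": {"text": m.group(1), "field": m.group(2)}}

elif curr_instruction.lower().startswith("swipe_until"):
    # e.g. "swipe_until text 'Kia Carens Clavis'"
    m = re.match(r"swipe_until\s+text\s+'(.+?)'", curr_instruction, re.IGNORECASE)
    if m:
        action_dict = {"action": "swipe_until", "parameters": {"text": m.group(1)}}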


🎬 Act: Doing the Thing

Actions aren’t just executed once and prayed over.
Each comes with a fallback chain:

def act(state: State) -> dict | None:
    action = state["action"]
    cmd = action["action"].strip().lower()
    params = action["parameters"]

    if cmd == "tap":
        success = False
        # 1) Try tapping by bounds
        if params.get("bounds"):
            success = tap(",".join(map(str, params["bounds"])), state)
        # 2) If that failed (or there were no bounds), try tapping by visible text
        if not success and params.get("text"):
            success = tap(params["text"], state)
        # 3) Last resort: look the element up in the unified JSON
        if not success:
            bounds = find_bounds_from_unified(state["unified"], params.get("text"))
            if bounds:
                success = tap(",".join(map(str, bounds)), state)

Here’s the fallback chain in plain words:

  1. If we know bounds → tap directly.

  2. If we know text → tap by label.

  3. If both fail → search the unified JSON for a matching element (see the lookup sketch below).

βœ… The workflow never gives up easily.


πŸ“œ Example Run

Take a simple .txt task file:

1. swipe_until text 'Kia Carens Clavis'
2. tap on 'Find Cars'
3. type 'Mumbai' in 'Location'
4. click on 'See Detailed View'

The agent runs step by step, enforcing the correct action type at each stage, retrying if needed, and finishes cleanly after the last instruction.
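
Kicking off a run is then just a matter of loading the file into the initial state and invoking the compiled graph from the earlier sketch. The file name and state keys here are illustrative:

# Load the plain-text task file into the initial state and run the graph.
with open("tasks/find_car.txt") as f:
    steps = [line.strip() for line in f if line.strip()]

final_state = app.invoke({"steps": steps, "step": 0, "history": []})
print(final_state.get("history", []))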


πŸ—‚ Folder Structure

Here’s how it all fits together:

└── 📁src
    ├── 📁device
    │   ├── app_control.py   # app launch helpers
    │   ├── driver.py        # Appium driver glue
    │   └── test.py
    ├── 📁graph
    │   └── workflow.py      # LangGraph workflow
    ├── 📁llm
    │   ├── dummy_llm_old.py # placeholder LLM
    │   └── prompts.py       # system + ReAct templates
    ├── 📁logs
    │   └── actions_log.json
    ├── 📁tmp                # screenshots, vision artifacts
    ├── 📁vision
    │   ├── mock_vision.py   # OCR + mock vision
    │   └── yolov8s-worldv2.pt
    ├── config.py
    └── yolov8s-worldv2.pt
Api.py

Notice the imports inside workflow.py:

from device.driver import take_screenshot, tap, type_text, swipe, swipe_until_found, restart_driver
from vision.mock_vision import observe_screen
from llm.dummy_llm_old import chat
from llm.prompts import SYSTEM_PROMPT, REACT_TEMPLATE
from device.app_control import launch_app

πŸ‘‰ That means you can write your own implementations of these functions in their respective folders. For example:

  • Replace mock_vision.py with a real OCR + icon detector.

  • Swap dummy_llm_old.py with a real LLM (Qwen, GPT, or your own fine-tuned model); see the chat() sketch below.

  • Extend driver.py with richer Appium actions.

Think of it like LEGO: the LangGraph workflow is the skeleton, and you plug in your own muscles. 💪
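
For example, the only contract the workflow seems to assume of chat() (judging by the reason node) is: take a list of OpenAI-style messages, return an action dict. Here is a hedged sketch of a drop-in replacement using the OpenAI client; the module path and model name are illustrative:

# llm/my_llm.py -- a possible drop-in replacement for dummy_llm_old.chat
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chat(messages: list[dict]) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                       # illustrative model name
        messages=messages,
        response_format={"type": "json_object"},   # ask the model for JSON only
    )
    # Expected shape: {"action": "tap", "parameters": {"text": "Find Cars"}}
    return json.loads(response.choices[0].message.content)

Then point workflow.py at it by changing the import to from llm.my_llm import chat.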


🌱 Future Version: Where This is Headed

Right now, the workflow relies on forced alignments + fallbacks to keep the LLM in check.
But the next version of this project goes beyond that:

  • Data Collection: Use Appium-recorded trajectories to build a dataset of (state_before, action, state_after) triples (a sample record is sketched at the end of this section).

  • Unified JSON Schema: Merge XML + OCR + icon detections into a normalized representation of every screen.

  • Fine-tuned Policy Model: Train a compact multimodal model that directly maps (screenshot + UI-JSON) → action.

  • Low-latency Execution: Keep the model small enough to run locally, so LangGraph can call it as a “brain plug-in”.

In other words: today’s workflow enforces “rules of the road” for the LLM.
Tomorrow’s version will have a native driver model that doesn’t need as many guardrails; it just knows how to tap, type, and swipe correctly.
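
For concreteness, one record in that (state_before, action, state_after) dataset could be as simple as the sketch below; the paths and field names are illustrative:

# One hypothetical trajectory record -- paths and field names are illustrative.
trajectory_record = {
    "state_before": {
        "screenshot": "tmp/step_003_before.png",
        "unified": {},   # unified JSON of the screen before acting (truncated here)
    },
    "action": {"action": "tap", "parameters": {"text": "Find Cars"}},
    "state_after": {
        "screenshot": "tmp/step_003_after.png",
        "unified": {},   # unified JSON after the tap (truncated here)
    },
}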


✨ Why This Works Better

  • Enforced determinism (today) → no LLM wandering.

  • Fallbacks everywhere → text, bounds, JSON; something will work.

  • Customizable → swap in your own functions for vision, LLM, or Appium.

  • A fine-tuned policy model (tomorrow) → a multimodal model that learns from real trajectories, reducing hand-coded rules.


🏁 Closing Thoughts

Instead of brittle Appium scripts, I now have a LangGraph-powered workflow agent that:

  • Reads plain-text instructions,

  • Sees the screen,

  • Thinks (lightly),

  • Acts (deterministically), and

  • Will eventually evolve into a fine-tuned multimodal control model.

Now it’s your turn → clone the structure, fill in your own device/, vision/, and llm/ modules, and build automation with a brain. 🧠⚡
