Building an AI Agent for Mobile Automation with LangGraph + LLMs


Writing Appium scripts is fun, until it isn't.
Hard-coded XPaths break, swipes go rogue, and suddenly your "automation" is the thing that needs babysitting.
So instead of wrestling with brittle test scripts, I built something different:
A sequential LangGraph workflow that reads a plain .txt file of instructions and executes them step by step, with screenshots, vision, reasoning, and action fallback chains baked right in.
This isn't just automation.
This is automation with a brain.
The Big Idea
The project stitches together three layers:
Vision Layer: captures screenshots, runs OCR, and builds a unified JSON schema of the screen (XML + OCR + icons).
Reasoning Layer: an LLM decides what to do next, but with strict enforcement so it doesn't hallucinate.
Action Layer: taps, types, swipes, and retries until success, or gracefully fails.
The workflow runs as a graph:
START → bootstrap → observe → reason → act → next_step → END
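For orientation, here is a minimal sketch of how such a graph could be wired with LangGraph's StateGraph. The node names mirror the workflow above, but the State fields, the stub node bodies, and the conditional loop back to observe are illustrative assumptions rather than the exact code in workflow.py:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict, total=False):
    steps: list[str]   # parsed lines of the .txt task file
    step: int          # index of the current instruction
    unified: dict      # merged XML + OCR + icon view of the screen
    caption: str
    history: list[str]
    action: dict

# Stub node functions so the sketch runs on its own; the real ones live in src/.
def bootstrap(state: State) -> dict: return {"steps": state.get("steps", []), "step": 0}
def observe(state: State) -> dict: return {"unified": {}, "caption": ""}
def reason(state: State) -> dict: return {"action": {"action": "tap", "parameters": {"text": "OK"}}}
def act(state: State) -> dict: return {}
def next_step(state: State) -> dict: return {"step": state["step"] + 1}

def build_workflow():
    g = StateGraph(State)
    g.add_node("bootstrap", bootstrap)
    g.add_node("observe", observe)
    g.add_node("reason", reason)
    g.add_node("act", act)
    g.add_node("next_step", next_step)

    g.add_edge(START, "bootstrap")
    g.add_edge("bootstrap", "observe")
    g.add_edge("observe", "reason")
    g.add_edge("reason", "act")
    g.add_edge("act", "next_step")
    # Loop back to observe until every instruction has been executed.
    g.add_conditional_edges(
        "next_step",
        lambda s: "observe" if s["step"] < len(s["steps"]) else END,
    )
    return g.compile()

Calling build_workflow().invoke({"steps": [...]}) would then push each instruction through observe → reason → act before advancing to the next one.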
π Observe: Seeing the Screen
Each step starts with a screenshot + vision pipeline.
We merge:
page_source.xml
(classic Appium dump)OCR text fragments
Icon classifier outputs
The result is a unified JSON that describes the UI in a consistent way.
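The post doesn't pin down the exact schema, so treat the following as a rough illustration of what one entry in that unified JSON could look like; every field name here is an assumption, not the repo's actual format:

# Illustrative shape of the unified JSON produced by the vision layer.
# The real output of observe_screen() may use different field names.
unified_example = {
    "screen": {"width": 1080, "height": 2400},
    "elements": [
        {
            "source": ["xml", "ocr"],          # which pipelines saw this element
            "text": "Find Cars",               # label from the XML dump and/or OCR
            "bounds": [72, 1510, 1008, 1640],  # left, top, right, bottom in pixels
            "clickable": True,
            "icon_label": None,                # filled in by the icon classifier
        }
    ],
}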
π§ Reason: Keeping the LLM on a Leash
Hereβs a simplified snippet from the reason node:
import json

def reason(state: State):
    # Work on the current instruction only (no peeking ahead).
    curr_idx = state.get("step", 0)
    curr_instruction = state["steps"][curr_idx].strip()

    prompt = REACT_TEMPLATE.format(
        system=SYSTEM_PROMPT,
        unified=json.dumps(state.get("unified", {}), indent=2),
        caption=state.get("caption", ""),
        history="\n".join(state.get("history", [])[-5:]),
        goal=f"Your current step is:\n{curr_instruction}"
    ) + "\n\nIMPORTANT: Execute ONLY the current step."

    messages = [{"role": "user", "content": prompt}]
    action_dict = chat(messages)

    # Force alignment if the step says "tap", "type", etc.
    if curr_instruction.lower().startswith("tap on"):
        target_text = curr_instruction.split("on", 1)[-1].strip()
        action_dict = {"action": "tap", "parameters": {"text": target_text}}

    return {"action": action_dict}  # hand the chosen action to the act node
π Whatβs happening here?
The LLM gets the current step only (no peeking ahead).
If the instruction starts with
"tap on"
, it forces a TAP action.Same enforcement exists for
"type"
and"swipe_until"
.
This way, the agent cannot drift into actions you never asked for.
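The post only shows the tap branch, so here is a hedged sketch of how all three enforcement branches could be factored into one helper. The parsing details (quote stripping, the " in " separator for type steps) are my assumptions, not the repo's exact code:

def enforce_action(curr_instruction: str, action_dict: dict) -> dict:
    """Override the LLM's proposal when the instruction names the action type.

    Sketch only: the real reason node inlines this logic, and its parsing
    rules may differ from the assumptions below.
    """
    step = curr_instruction.lower()
    if step.startswith("tap on"):
        target = curr_instruction.split("on", 1)[-1].strip(" '")
        return {"action": "tap", "parameters": {"text": target}}
    if step.startswith("type"):
        # e.g. "type 'Mumbai' in 'Location'"
        text, _, field = curr_instruction[len("type"):].partition(" in ")
        return {"action": "type",
                "parameters": {"text": text.strip(" '"), "field": field.strip(" '")}}
    if step.startswith("swipe_until"):
        # e.g. "swipe_until text 'Kia Carens Clavis'"
        target = curr_instruction.split("text", 1)[-1].strip(" '")
        return {"action": "swipe_until", "parameters": {"text": target}}
    return action_dict  # no keyword matched: trust the LLM's proposal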
π¬ Act: Doing the Thing
Actions arenβt just executed once and prayed over.
Each comes with a fallback chain:
def act(state: State) -> dict | None:
    action = state["action"]
    cmd = action["action"].strip().lower()
    params = action["parameters"]

    if cmd == "tap":
        success = False
        # Try tapping by bounds first
        if params.get("bounds"):
            success = tap(",".join(map(str, params["bounds"])), state)
        # Otherwise try tapping by text
        elif params.get("text"):
            success = tap(params["text"], state)
        # Fallback: look the target up in the unified JSON
        if not success and params.get("text"):
            bounds = find_bounds_from_unified(state["unified"], params["text"])
            if bounds:
                success = tap(",".join(map(str, bounds)), state)
Here's the chain of thought:
If we know bounds → tap directly.
If we know text → tap by label.
If both fail → search the unified JSON for a matching element (via find_bounds_from_unified, sketched below).
The workflow never gives up easily.
π Example Run
Take a simple .txt
task file:
1. swipe_until text 'Kia Carens Clavis'
2. tap on 'Find Cars'
3. type 'Mumbai' in 'Location'
4. click on 'See Detailed View'
The agent runs step by step, enforcing the correct action type at each stage, retrying if needed, and finishes cleanly after the last instruction.
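For completeness, here is a minimal sketch of how the bootstrap node could load that file. The file name, the one-instruction-per-line convention, and the numbering-stripping are assumptions about the repo, not its actual loader:

def load_steps(task_file: str = "task.txt") -> list[str]:
    """Parse the plain-text task file into a list of instructions,
    stripping the leading '1.', '2.', ... numbering so reason() only
    sees the instruction text. Sketch only; the file name is assumed."""
    steps = []
    with open(task_file, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Drop a leading "N." if the line is numbered.
            steps.append(line.split(".", 1)[-1].strip() if line[0].isdigit() else line)
    return steps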
π Folder Structure
Hereβs how it all fits together:
βββ πsrc
βββ πdevice
βββ app_control.py # app launch helpers
βββ driver.py # Appium driver glue
βββ test.py
βββ πgraph
βββ workflow.py # LangGraph workflow
βββ πllm
βββ dummy_llm_old.py # placeholder LLM
βββ prompts.py # system + ReAct templates
βββ πlogs
βββ actions_log.json
βββ πtmp # screenshots, vision artifacts
βββ πvision
βββ mock_vision.py # OCR + mock vision
βββ yolov8s-worldv2.pt
βββ config.py
βββ yolov8s-worldv2.pt
Api.py
Notice the imports inside workflow.py:
from device.driver import take_screenshot, tap, type_text, swipe, swipe_until_found, restart_driver
from vision.mock_vision import observe_screen
from llm.dummy_llm_old import chat
from llm.prompts import SYSTEM_PROMPT, REACT_TEMPLATE
from device.app_control import launch_app
π That means you can write your own implementations of these functions in their respective folders. For example:
Replace
mock_
vision.py
with a real OCR + icon detector.Swap
dummy_llm_
old.py
with a real LLM (Qwen, GPT, or your own fine-tuned model).Extend
driver.py
with richer Appium actions.
Think of it like LEGO β the LangGraph workflow is the skeleton, and you plug in your own muscles. πͺ
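As an example of the LLM swap, here is a minimal sketch of a chat() replacement that talks to an OpenAI model instead of dummy_llm_old.py. The function signature (OpenAI-style messages in, an action dict out) is inferred from the reason node above; the model name and the expect-JSON-output convention are my assumptions:

import json
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(messages: list[dict]) -> dict:
    """Drop-in sketch for llm/dummy_llm_old.py's chat(): send the prompt and
    parse the reply as an action dict such as
    {"action": "tap", "parameters": {"text": "Find Cars"}}."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any chat-capable model works here
        messages=messages,
        temperature=0,
    )
    content = resp.choices[0].message.content or "{}"
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Let the enforcement and fallback layers deal with a malformed reply.
        return {"action": "noop", "parameters": {}}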
π± Future Version: Where This is Headed
Right now, the workflow relies on forced alignments + fallbacks to keep the LLM in check.
But the next version of this project goes beyond that:
Data Collection: Use Appium-recorded trajectories to build a dataset of
(state_before, action, state_after)
.Unified JSON Schema: Merge
XML + OCR + icon detections
into a normalized representation of every screen.Fine-tuned Policy Model: Train a compact multimodal model that directly maps
(screenshot + UI-JSON) β action
.Low-latency Execution: Keep the model small enough to run locally, so LangGraph can call it as a βbrain plug-inβ.
In other words: todayβs workflow enforces βrules of the roadβ for the LLM.
Tomorrowβs version will have a native driver model that doesnβt need as many guardrails β it just knows how to tap, type, and swipe correctly.
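To make the data-collection idea concrete, one recorded trajectory sample could be as simple as the record below; every field name is illustrative, not the project's actual format:

from dataclasses import dataclass

@dataclass
class TrajectorySample:
    """One (state_before, action, state_after) record for the future dataset.
    Sketch only: field names and types are assumptions."""
    step_instruction: str      # the plain-text step that produced this action
    screenshot_before: str     # path to the screenshot taken in observe()
    unified_before: dict       # unified JSON of the screen before acting
    action: dict               # e.g. {"action": "tap", "parameters": {"text": "Find Cars"}}
    screenshot_after: str
    unified_after: dict
    success: bool              # did the fallback chain report success?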
β¨ Why This Works Better
Enforced determinism (today) β no LLM wanderings.
Fallbacks everywhere β text, bounds, JSON β something will work.
Customizable β swap in your own functions for vision, LLM, or Appium.
Fine-tuning vision (tomorrow) β a multimodal model that learns from real trajectories, reducing hand-coded rules.
π Closing Thoughts
Instead of brittle Appium scripts, I now have a LangGraph-powered workflow agent that:
Reads plain-text instructions,
Sees the screen,
Thinks (lightly),
Acts (deterministically), and
Will eventually evolve into a fine-tuned multimodal control model.
Now itβs your turn β clone the structure, fill in your own device/
, vision/
, and llm/
modules, and build automation with a brain. π§ β‘