The web is still too complex for autonomous AI agents


AI and automation are currently the hottest trends in tech. Everyone — from solo developers to the largest corporations — is racing to release new tools designed to crush competitors. Startups are popping up faster than ever before, trying to implement AI in practically every sector. One particularly interesting area is web/browser automation.
Almost everyone has tasks they repeat daily in the browser: searching through endless information, online shopping, trading, and more. Often, timing is crucial — acting quickly can mean buying something at a bargain, gaining a new client, or staying ahead of competitors. Being first frequently determines success; after all, nobody listens to news reported a day late.
Let’s talk about time — we all have the same limited 24 hours. Work shouldn’t consume most of our lives. Wouldn’t it be amazing if we could slow down, switch our tasks to autopilot, and let machines handle tedious chores? This means instructing AI programs on what and when to do something online on our behalf, often requiring nothing more than a browser.
There are many tools labeled as AI agents that claim to handle everything for you: ChatGPT Operator, Manus, or open-source projects like BrowserUse or Midscene come to mind first. I’ve worked extensively with automation as a software engineer, and even more so recently as the creator of monity.ai.
So, how do AI agents interact with the web, and why do they sometimes fail?
Browser-based AI agents usually depend on two types of data: HTML code interpreted by large language models, and screenshots interpreted by vision models. When a bot visits a page, it builds an abstraction of it, typically a screenshot annotated with bounding boxes marking the locations of HTML elements. Agents rely on this combination because HTML alone isn’t enough to act on, and a screenshot without the underlying HTML context is just as limiting. Developers typically use tools like Puppeteer, Playwright, or the Chrome DevTools Protocol directly to trigger browser actions on specific elements.
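To make that concrete, here is a minimal sketch of the observe-and-act step using Playwright and TypeScript. The selectors and the "model picks an element" part are placeholder assumptions; a real agent sends both abstractions to an LLM or vision model and parses its decision.

```typescript
import { chromium } from 'playwright';

async function observeAndAct(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Visual abstraction: a screenshot for the vision model.
  const screenshot = await page.screenshot();

  // Structural abstraction: bounding boxes plus minimal HTML context for the LLM.
  const elements = await page.$$eval('a, button, input, select', nodes =>
    nodes.map((el, index) => ({
      index,
      tag: el.tagName.toLowerCase(),
      text: (el.textContent ?? '').trim().slice(0, 80),
      box: el.getBoundingClientRect().toJSON(),
    }))
  );

  // A real agent sends both abstractions to the model and acts on its choice;
  // here we simply pretend the model picked the first button on the page.
  if (elements.some(e => e.tag === 'button')) {
    await page.locator('button').first().click();
  }

  await browser.close();
  return { screenshot, elements };
}
```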
Each interaction with a webpage (like clicking a button or navigating to another page) repeats this process of generating new abstractions, which quickly gets expensive: every step means another round of model calls plus more time on the cloud resources that virtual browsers require. Slow websites stretch execution times even further. But assuming cost isn’t your main concern, let’s discuss why interacting with webpages can still be challenging.
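The loop below sketches why the cost adds up: after every action the previous abstraction is stale, so the page has to be re-captured and the model re-queried. decideNextAction is a hypothetical placeholder for that model call, stubbed out here so the sketch stays self-contained.

```typescript
import { Page } from 'playwright';

type Action = { kind: 'click'; selector: string } | { kind: 'done' };

// Hypothetical placeholder for the LLM / vision-model call; stubbed so the
// sketch stays self-contained. This is where most of the cost accumulates.
async function decideNextAction(screenshot: Buffer, html: string): Promise<Action> {
  return { kind: 'done' };
}

async function runAgentLoop(page: Page, maxSteps = 10) {
  for (let step = 0; step < maxSteps; step++) {
    // Every action invalidates the previous abstraction, so re-capture it from scratch.
    const screenshot = await page.screenshot();
    const html = await page.content();

    const action = await decideNextAction(screenshot, html);
    if (action.kind === 'done') break;

    await page.click(action.selector);
    await page.waitForLoadState('networkidle'); // slow sites stretch every iteration further
  }
}
```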
1. Every webpage is different.
The days of simple HTML and basic CSS are long gone. Today’s pages are full of animations, scroll-driven interactions, lazy-loaded elements, and sometimes even fake scroll behaviors. I’m not here to judge usability or accessibility, but if you’ve seen websites showcased on platforms like Awwwards.com, you know even human users can get confused — imagine how difficult it is for AI agents. Tasks like capturing a full-page screenshot with all the details or submitting a complex form often become unexpectedly challenging.
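As an example of how a seemingly simple task gets tricky, here is a sketch of one common workaround for lazy-loaded content before taking a full-page screenshot, again with Playwright. The scroll step and the 200 ms pause are arbitrary assumptions, and the approach still breaks on pages with transform-based "fake" scrolling.

```typescript
import { Page } from 'playwright';

async function fullPageScreenshot(page: Page): Promise<Buffer> {
  // Walk down the page so lazy-loaded images and sections actually render.
  await page.evaluate(async () => {
    const step = window.innerHeight;
    for (let y = 0; y < document.body.scrollHeight; y += step) {
      window.scrollTo(0, y);
      await new Promise(resolve => setTimeout(resolve, 200)); // arbitrary settle time
    }
    window.scrollTo(0, 0); // return to the top before capturing
  });

  // Still fails on "fake" scrolling (transform-based scroll containers) and
  // animations that only play while an element is in the viewport.
  return page.screenshot({ fullPage: true });
}
```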
2. Semantic HTML has been overshadowed by JavaScript frameworks.
I’m not saying it’s impossible to have good markup with modern frameworks, but often it’s just not the reality. Many websites are overengineered. For example, an AI agent can easily handle a traditional dropdown built with plain HTML tags, but struggles with a JavaScript-driven dropdown that lacks proper accessibility. Opaque naming conventions, machine-generated CSS classes that change frequently, isolated Shadow DOM components, iframes, and canvas elements (e.g., WebGL) add further complexity. Even automated end-to-end tests are challenging for developers, so it’s no surprise AI agents struggle too.
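The contrast between the two dropdown styles looks roughly like this in Playwright. The selectors (select#country, .country-dropdown) are made up for illustration; the point is that the semantic control is one well-defined call, while the JavaScript-driven widget needs a hand-crafted, fragile click sequence.

```typescript
import { Page } from 'playwright';

async function pickCountry(page: Page, country: string) {
  // Case 1: semantic HTML. One well-defined call, independent of styling.
  await page.selectOption('select#country', { label: country });

  // Case 2: a div-based dropdown with no ARIA roles. The agent has to guess
  // what opens it, wait for the options to render, and match by visible text.
  await page.click('.country-dropdown .dropdown-trigger');
  await page.getByText(country, { exact: true }).click();

  // Playwright pierces open Shadow DOM automatically, but closed shadow roots,
  // iframes, and canvas-rendered UIs (e.g., WebGL) still need special handling.
}
```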
3. Modern CSS is extremely complicated.
Front-end developers know CSS can seem almost magical. A single line of CSS can break automation: a small bug might render a clickable element non-clickable. Complex specificity rules and pseudo-elements (::before, ::after) often interact badly with JavaScript and browser automation scripts.
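A sketch of how this shows up in practice: when a stray overlay, say a parent’s ::after stretched across a button or any element with a higher z-index, covers the target, Playwright’s actionability checks refuse the click. The helper below is illustrative, not a standard API; it just reports what is actually sitting at the element’s center.

```typescript
import { Page } from 'playwright';

async function clickWithDiagnosis(page: Page, selector: string) {
  try {
    await page.click(selector, { timeout: 3000 });
  } catch (err) {
    // Ask the page what is actually sitting at the element's center point.
    const covering = await page.$eval(selector, el => {
      const { x, y, width, height } = el.getBoundingClientRect();
      const top = document.elementFromPoint(x + width / 2, y + height / 2);
      return top ? `${top.tagName.toLowerCase()}.${top.className}` : 'nothing';
    });
    throw new Error(`Click on ${selector} was blocked; topmost element there is ${covering}`);
  }
}
```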
4. Bot protections and CAPTCHAs.
Today’s bot detection tools are highly sophisticated, creating an endless arms race between aggressive web scraping and defensive anti-bot measures. Using proxies and other techniques can reduce detection, but the reality is clear: if a website owner truly wants to block bots, they can make it extremely difficult or nearly impossible. Period.
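For completeness, here is roughly what the proxy approach looks like with Playwright’s built-in proxy option and a less conspicuous browser context. The proxy URL, credentials, and user agent string are placeholders; this only removes the most obvious signals and does nothing against serious fingerprinting or CAPTCHAs.

```typescript
import { chromium } from 'playwright';

async function launchWithProxy() {
  const browser = await chromium.launch({
    headless: false, // headless mode is one of the easiest signals to detect
    proxy: {
      server: 'http://proxy.example.com:8080', // placeholder
      username: 'user',                        // placeholder
      password: 'pass',                        // placeholder
    },
  });

  const context = await browser.newContext({
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1366, height: 768 },
    locale: 'en-US',
  });

  return { browser, context };
}
```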
The bottom line: Despite rapid advancements, autonomous web interactions by AI agents remain far from perfect, mostly because the modern web itself is increasingly complex. Predicting whether AI agents will ever be able to reliably perform the majority of tasks humans can do online remains challenging. What are your experiences with browser automation — are you satisfied with the current state of AI agents, or do you see significant room for improvement?