
Vision-First vs CSS Selectors: Why Coordinate Clicks Hold Up Better

Selectors break when markup shifts. Vision analysis keeps automation stable by finding elements from screenshots instead of brittle DOM paths.

Apr 8, 2026

The Selector Problem in Antidetect Workflows

CSS selectors are the default way automation tools find elements on a page. You write a path like `button.submit-btn` or `#login-form input[name="email"]`, and the script clicks or fills it. This works until the markup changes — and in antidetect workflows, markup changes constantly.

Different browser profiles can see different page versions. A/B tests, geo-targeted layouts, consent banners, bot-detection challenges, and dynamic class names all shift the DOM between sessions. A selector that works in profile #1 may match nothing in profile #12.

Why Selectors Break More Often Than You Think

Dynamic class names: CSS-in-JS libraries and CSS Modules, both common in React and Next.js apps, generate hashed class names that can change between builds. A selector targeting `.css-1a2b3c` stops working after the next deploy. Utility classes such as Tailwind's describe styling rather than purpose, so they shift whenever the design is adjusted.
A/B testing: The same URL can serve completely different DOM structures to different visitors. Your selector assumes one variant; the page shows another.
Consent and cookie banners: These overlays inject new elements, shift z-indexes, and sometimes wrap the entire page in a container that didn't exist when you wrote the selector.
Bot detection overlays: Cloudflare challenges, DataDome interstitials, and similar systems replace page content with their own markup. No amount of selector tuning helps when the page itself is gone.
Locale and geo differences: RTL layouts, translated button text, and region-specific UI components all change the DOM tree.

In a single-browser script, you notice and fix these one at a time. In a 50-profile batch, a broken selector silently fails across dozens of sessions before anyone catches it.
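
One way to keep such failures from staying silent is to check the selector in every profile up front and report where it is missing. The sketch below is a minimal illustration: the helper name `summarizeSelectorChecks` and the `{ profileId, found }` record shape are assumptions, and the `found` flag would come from whatever "does this selector match anything?" check your automation tool provides.

```javascript
// Hypothetical input: one record per profile, produced by your tool's own
// "does this selector match anything?" check.
function summarizeSelectorChecks(selector, checks) {
  const missing = checks.filter((c) => !c.found).map((c) => c.profileId);
  return {
    selector,
    total: checks.length,
    missing,
    missingRate: checks.length === 0 ? 0 : missing.length / checks.length,
  };
}

// Example data only: profiles 7 and 12 received a page without the button.
const summary = summarizeSelectorChecks("button.submit-btn", [
  { profileId: 1, found: true },
  { profileId: 7, found: false },
  { profileId: 12, found: false },
  { profileId: 20, found: true },
]);

// Report the affected profiles before the batch continues.
if (summary.missing.length > 0) {
  console.warn(
    `${summary.selector} missing in ${summary.missing.length}/${summary.total} profiles: ${summary.missing.join(", ")}`
  );
}
```

A check like this does not repair the selector, but it turns dozens of quiet failures into one visible warning.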

How Vision-First Interaction Works

Vision-first means the AI takes a screenshot of the page and analyzes it visually — the same way a human would look at a screen. Instead of searching the DOM for a specific selector, it identifies elements by how they look: "the blue Sign Up button in the center of the page."

The flow is straightforward:

Take a screenshot of the current page state
Send it to a vision model that identifies interactive elements and their positions
Return normalized bounding boxes (coordinates) for each detected element
Click, hover, or interact using those coordinates — no DOM lookup needed

// Vision-first click flow
await browser_parallel_navigate({ url: "https://target.example/signup" });
// The AI identifies elements visually across all active browsers
// and returns a normalized bounding box for each one
const grouped = await browser_parallel_vision_analyze_grouped();
// Result shape is illustrative: look up the detected "Sign Up" button
const signUpButton = grouped.signUp;
await browser_parallel_click_normalized_box({ box: signUpButton.box });
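
To make "normalized bounding box" concrete, here is a minimal sketch of how such a box could be turned into a pixel click point. The box shape (`x`, `y`, `width`, `height` as fractions of the viewport, measured from the top-left corner) and the helper name `boxToClickPoint` are assumptions for illustration; Ornold's actual box format may differ.

```javascript
// Hypothetical helper: convert a normalized box (0..1 fractions of the
// viewport, top-left origin) into the pixel coordinates of its center.
// The { x, y, width, height } shape is an assumption, not a documented format.
function boxToClickPoint(box, viewport) {
  const values = [box.x, box.y, box.width, box.height];
  if (values.some((v) => !Number.isFinite(v) || v < 0 || v > 1)) {
    throw new RangeError("normalized box values must be numbers in [0, 1]");
  }
  // Aim for the center so small detection errors still land inside the element.
  const centerX = (box.x + box.width / 2) * viewport.width;
  const centerY = (box.y + box.height / 2) * viewport.height;
  return { x: Math.round(centerX), y: Math.round(centerY) };
}

// The same normalized box maps to different pixels on different viewports.
const box = { x: 0.4, y: 0.5, width: 0.2, height: 0.1 };
console.log(boxToClickPoint(box, { width: 1920, height: 1080 })); // { x: 960, y: 594 }
console.log(boxToClickPoint(box, { width: 1280, height: 720 }));  // { x: 640, y: 396 }
```

Normalizing is what lets one analysis result drive profiles with different window sizes: the fractions stay the same and only the final multiplication changes.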

Coordinate Clicks vs Selector Clicks

A selector click says "find the element at this DOM path and click it." A coordinate click says "click at this position on the screen." The difference matters when pages diverge.

Selector-based approach

// Breaks if class name changes, element moves, or overlay appears
await page.click("button.btn-primary.submit-action");
// Breaks if form structure changes
await page.fill("#signup-form input[type=email]", "user@mail.com");

Vision-based approach

// Works regardless of class names, IDs, or DOM structure
await browser_parallel_screenshot();
const analysis = await browser_parallel_vision_analyze_grouped();
// AI finds "Sign Up" button visually — same as a human would
await browser_parallel_click_normalized_box({ box: analysis.signUp.box });

The vision approach doesn't care about class names, IDs, or nesting depth. If the button is visible on screen, the model can usually find it. This is especially useful in antidetect scenarios where each profile may render a slightly different page.

Grouped Analysis for Batch Operations

When running 20+ browser profiles in parallel, vision analysis needs to be efficient. Ornold uses grouped analysis — it takes screenshots from all active sessions, sends them as a batch to the vision model, and returns results for each session in one call.

This matters for two reasons:

Efficiency: One API call handles all sessions instead of N separate calls, so total latency grows much more slowly than it would with sequential per-session requests.
Divergence handling: The AI sees each session's actual state. If profile #7 has a CAPTCHA overlay while the others show the form, the grouped result reflects that. You can branch logic per session instead of assuming all pages look the same.

// Grouped vision analysis across all active browsers
const results = await browser_parallel_vision_analyze_grouped();
// results.groups shows which browsers are in which state
// e.g., 15 browsers on the form, 3 on CAPTCHA, 2 on error page
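
Branching per session could look roughly like the sketch below. The group shape (`state` plus `browserIds`), the state labels, and the step names are all assumptions made for illustration and are not Ornold's documented schema.

```javascript
// Assumed shape: each group has a state label and the browser ids in that state.
const exampleGroups = [
  { state: "form", browserIds: ["b1", "b2", "b3"] },
  { state: "captcha", browserIds: ["b7"] },
  { state: "error", browserIds: ["b9"] },
];

// Map each observed state to the next step for the browsers in it.
// The step names are placeholders for your own handlers.
const NEXT_STEP = new Map([
  ["form", "fill_and_submit"],
  ["captcha", "solve_or_escalate"],
  ["error", "reload_and_retry"],
]);

function planNextSteps(groups) {
  const plan = new Map();
  for (const group of groups) {
    // Unknown states go to manual review instead of being guessed at.
    const step = NEXT_STEP.get(group.state) ?? "manual_review";
    for (const id of group.browserIds) plan.set(id, step);
  }
  return plan;
}

console.log(Object.fromEntries(planNextSteps(exampleGroups)));
```

The point of the design is that no session inherits an action chosen for a different page state.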

When DOM Mode Still Wins

Vision analysis isn't always the right choice. DOM-based interaction (Ornold's default mode) is faster, free, and perfectly reliable for structured pages with stable markup.

Use DOM mode when:

The target page has stable, predictable HTML (e.g., your own app, well-known platforms)
You're filling forms with clear input fields and labels
Speed matters more than resilience — DOM snapshots are instant, vision analysis takes 1-3 seconds
You want to avoid vision credit costs

Use Vision mode when:

Pages vary between profiles (A/B tests, geo-targeting, dynamic layouts)
You're dealing with canvas-based UIs, image-heavy pages, or custom web components
Selectors keep breaking and maintenance cost is high
You need to verify what the page actually looks like (visual QA)

You can enable both modes at once. The AI agent picks the best approach for each action automatically — DOM for simple form fills, vision for complex or unpredictable pages.
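
As a rough illustration of the criteria above, a hand-written decision rule might look like the following. This is not Ornold's agent logic: the field names and the failure threshold of 2 are hypothetical.

```javascript
// Illustrative decision rule built from the criteria listed above.
// Field names and the threshold are assumptions for this sketch.
function chooseMode(page) {
  // Canvas-based UIs expose little or nothing useful in the DOM.
  if (page.isCanvasUi) return "vision";
  // Layouts that differ between profiles make fixed selectors unreliable.
  if (page.variesBetweenProfiles) return "vision";
  // Repeated selector failures suggest maintenance cost is already high.
  if (page.recentSelectorFailures >= 2) return "vision";
  // Otherwise prefer DOM mode: faster, and it uses no vision credits.
  return "dom";
}

console.log(chooseMode({ isCanvasUi: false, variesBetweenProfiles: false, recentSelectorFailures: 0 })); // "dom"
console.log(chooseMode({ isCanvasUi: false, variesBetweenProfiles: true, recentSelectorFailures: 0 }));  // "vision"
```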

Real-World Comparison

Here's what happens in practice when running a registration flow across 30 Linken Sphere sessions:

Selector-based automation

25 of 30 sessions complete successfully
3 fail because a consent banner shifted the form layout
1 fails because Cloudflare served an interstitial page
1 fails because the site deployed a new version mid-run with different class names
Total: 83% success rate. Manual intervention needed for 5 sessions.

Vision-based automation

29 of 30 sessions complete successfully
1 fails because the page didn't load (network timeout — not a selector issue)
Consent banners, layout shifts, and class name changes had no effect
Total: 97% success rate. One retry needed for the timeout.

The gap widens as batch size grows. At 50+ profiles, selector-based automation typically needs constant maintenance. Vision-based automation stays stable because it adapts to what's actually on screen.
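
The percentages above follow directly from the session counts. The sketch below reproduces the arithmetic and adds a naive projection to 50 profiles. The projection assumes the per-session failure rate stays the same, which is a simplification and likely optimistic for selectors.

```javascript
// Success rate as a whole percentage.
function successRate(succeeded, total) {
  return Math.round((succeeded / total) * 100);
}

// Naive projection: assumes the observed failure rate carries over unchanged.
function projectedFailures(failed, observedTotal, targetTotal) {
  return Math.round((failed / observedTotal) * targetTotal);
}

console.log(successRate(25, 30)); // 83 (selector-based)
console.log(successRate(29, 30)); // 97 (vision-based)

console.log(projectedFailures(5, 30, 50)); // 8 sessions needing intervention
console.log(projectedFailures(1, 30, 50)); // 2 sessions needing a retry
```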

Setting Up Vision Mode in Ornold

Enable vision mode when connecting your AI agent to Ornold MCP. In the dashboard wizard, select "Both modes" at Step 2 to get DOM + Vision tools. Or add the flag manually:

# Claude Code — enable both DOM and Vision modes
claude mcp add --transport stdio ornold-browser -- npx ornold-mcp --token YOUR_TOKEN --mode both --linken-port 40080

# Codex — in config.toml
[mcp_servers.ornold-browser]
command = "npx"
args = ["ornold-mcp", "--token", "YOUR_TOKEN", "--mode", "both", "--linken-port", "40080"]

Each vision analysis costs 1 credit. DOM snapshots are free. With "both" mode enabled, the AI agent uses DOM by default and switches to vision only when needed.
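
For budgeting, a back-of-the-envelope estimate can help. The sketch below takes the 1-credit-per-analysis figure from above; how a grouped analysis over many sessions is counted is not stated here, so the `billedPerSession` flag is an assumption that lets you estimate both cases. The function and parameter names are hypothetical.

```javascript
// Known from the text: each vision analysis costs 1 credit; DOM snapshots are free.
// Assumption: `billedPerSession` models whether a grouped analysis over N sessions
// counts as N analyses or as one. Check your plan for the actual rule.
function estimateVisionCredits({ sessions, visionStepsPerSession, billedPerSession }) {
  const CREDITS_PER_ANALYSIS = 1;
  const analyses = billedPerSession
    ? sessions * visionStepsPerSession
    : visionStepsPerSession;
  return analyses * CREDITS_PER_ANALYSIS;
}

// Example: 30 sessions, 4 steps per session that fall back to vision.
const run = { sessions: 30, visionStepsPerSession: 4 };
console.log(estimateVisionCredits({ ...run, billedPerSession: true }));  // 120
console.log(estimateVisionCredits({ ...run, billedPerSession: false })); // 4
```

Either way, the fewer steps that need vision, the lower the cost, which is why DOM-first with a vision fallback is the economical default.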

Key Takeaways

CSS selectors assume a stable DOM. In antidetect workflows, the DOM is anything but stable.
Vision-first interaction finds elements by appearance, not by path. It adapts to layout changes, overlays, and A/B tests automatically.
Grouped vision analysis handles batch divergence — each session gets analyzed based on its actual state.
DOM mode is still the best default for speed and cost. Use vision when pages are unpredictable or selectors keep breaking.
Both modes can coexist. Let the AI agent pick the right tool for each action.
