
I Gave AI a Smartphone. It Did Everything a QA Engineer Does.

We gave an AI agent the ability to see and touch a smartphone screen over USB, then had it analyze three competitor apps. Work that would take a human weeks was done in a day.
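For a sense of what "see and touch over USB" means mechanically: on Android, both halves of the loop reduce to two adb commands. Below is a minimal Python sketch of that observe-decide-act loop, assuming an attached device; the decide() stub stands in for the LLM call, and everything beyond the adb invocations is hypothetical illustration, not the repo's actual code.

```python
import os
import subprocess

def see(path: str) -> bytes:
    """Capture the current screen as PNG over USB (adb exec-out screencap -p)."""
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         capture_output=True, check=True).stdout
    with open(path, "wb") as f:
        f.write(png)
    return png

def touch(x: int, y: int) -> None:
    """Simulate a finger tap at pixel coordinates (adb shell input tap)."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def decide(png: bytes) -> tuple[int, int] | None:
    """Stand-in for the LLM: given a screenshot, return the next tap target,
    or None when exploration is complete. The real agent sends the image to
    a vision model; returning None here ends the sketch after one capture."""
    return None

os.makedirs("shots", exist_ok=True)
step = 0
while True:
    png = see(f"shots/step_{step:03d}.png")
    target = decide(png)
    if target is None:
        break
    touch(*target)
    step += 1
```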

What we did

  • Treated each app as a graph: screens are nodes, button taps are edges. DFS visits every screen without exception (see the DFS sketch after this list)
  • 291 screenshots across 3 apps, covering every dropdown, toggle, and scroll-to-bottom. A per-screen checklist confirmed zero coverage gaps
  • Executed 5 real transactions (swaps, bridges, futures) and captured confirmation, success, and error screens plus actual fee structures
  • Auto-generated a 37-axis comparison and an HTML report, with every claim linked to its screenshot evidence (see the report sketch below)
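The exploration behind the first two bullets is ordinary depth-first search; only the graph is discovered as the agent goes. A minimal self-contained sketch follows: the toy APP graph is hard-coded so the code runs standalone, where the real run builds it screen by screen through the vision model.

```python
# Toy screen graph: screen -> {button label: destination screen}.
APP = {
    "home":        {"Swap": "swap", "Bridge": "bridge", "Settings": "settings"},
    "swap":        {"Confirm": "swap_done"},
    "swap_done":   {},
    "bridge":      {"Confirm": "bridge_done"},
    "bridge_done": {},
    "settings":    {"Fees": "fee_table"},
    "fee_table":   {},
}

def dfs(screen: str, visited: set[str], checklist: dict[str, set[str]]) -> None:
    """Visit every screen; track every button so coverage can be audited."""
    if screen in visited:
        return
    visited.add(screen)
    checklist[screen] = set(APP[screen])    # buttons that must be exercised
    for button, dest in APP[screen].items():
        dfs(dest, visited, checklist)       # tapping a button = following an edge
        checklist[screen].discard(button)   # mark this edge covered

visited: set[str] = set()
checklist: dict[str, set[str]] = {}
dfs("home", visited, checklist)

# "Zero gaps" means both asserts hold: every screen reached, every button tapped.
assert visited == set(APP)
assert all(not remaining for remaining in checklist.values())
print(f"{len(visited)} screens visited, 0 uncovered buttons")
```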
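And a sketch of the evidence-linked report from the last bullet: every cell of the comparison table carries the observed value plus a link to the screenshot backing it, so no claim stands without evidence. Axis names, values, and file paths are invented for illustration.

```python
import html

# Hypothetical comparison data: (axis, [(app, observed value, screenshot)]).
AXES = [
    ("Swap fee",    [("AppA", "0.4%",  "shots/a_fee.png"),
                     ("AppB", "0.1%",  "shots/b_fee.png")]),
    ("Bridge time", [("AppA", "4 min", "shots/a_bridge.png"),
                     ("AppB", "9 min", "shots/b_bridge.png")]),
]

apps = [app for app, _, _ in AXES[0][1]]
rows = ["<tr><th>Axis</th>"
        + "".join(f"<th>{html.escape(a)}</th>" for a in apps) + "</tr>"]
for axis, cells in AXES:
    tds = "".join(
        f'<td>{html.escape(value)} <a href="{html.escape(shot)}">evidence</a></td>'
        for _, value, shot in cells
    )
    rows.append(f"<tr><th>{html.escape(axis)}</th>{tds}</tr>")

with open("report.html", "w") as f:
    f.write("<table>" + "".join(rows) + "</table>")
```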

What we found

  • The AI caught what QA passed. A 0.4% fee was within spec, so QA approved it; the AI flagged it as high for the industry and benchmarked it against 5 competitors
  • Humans see about 30%. You think you checked everything, but next to the AI's checklist-based exhaustive exploration the coverage gap is overwhelming
  • It makes domain judgments. It detected anomalies no spec defines: a model trained on thousands of app patterns knows "this number is unusual" before any human does

Why this matters

  • Observe, judge, report: there are roles where these three things are the entire job. QA engineers, researchers, analysts, consultants, auditors. The AI did all three
  • The cost structure changes. Weeks of human work become a day of AI time. And the AI holds all 291 screenshots in memory while comparing consistently across 37 axes; humans can't

We open-sourced the methodology. Give this document to an LLM and it starts exploring on its own.

github.com/ForrestKim42/llm-mobile-testing

© 2025 Forrest Kim.

contact: humblefirm@gmail.com