
Why We Open Sourced Finalrun

4 min read

https://github.com/final-run/finalrun-agent

We built Finalrun because every mobile test suite we'd worked on had the same lifecycle: someone writes it, it works for a sprint, and then a UI tweak breaks half the selectors. You spend more time babysitting the tests than writing features. We got tired of it and went looking for something fundamentally different — and what we found felt too useful to keep closed.

The Idea That Actually Worked

What if tests could just look at the screen the way a person does?

We started building a vision-based agent that uses screen understanding instead of brittle locators like XPath or accessibility IDs. You describe what you want in plain English — "tap the login button," "scroll to the pricing section," "verify the confirmation message" — and the agent figures out where things are visually.

That part worked surprisingly well. The agent could interpret intent, find elements on screen, and execute actions reliably across both Android and iOS. No element trees. No platform-specific selector syntax. Just natural language and a screenshot.
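The loop this describes — screenshot in, intent out, action performed — can be sketched roughly as follows. This is an illustrative stub, not Finalrun's actual code: the `locate` function stands in for a call to a vision-language model, and the device driver is reduced to a callback.

```python
# Rough sketch of a vision-driven action loop (illustration only;
# locate() is a stand-in for a real vision-model call, and the
# device interaction is a callback, not Finalrun's actual API).

from dataclasses import dataclass

@dataclass
class Action:
    kind: str   # "tap", "scroll", "verify", ...
    x: int = 0  # screen coordinates, filled in for taps
    y: int = 0

def locate(instruction: str, screenshot: bytes) -> Action:
    # In a real agent this would send the screenshot plus the plain-English
    # instruction to a vision model and parse its answer. Here we return a
    # canned result so the sketch is runnable.
    if "login" in instruction:
        return Action("tap", x=540, y=1200)
    return Action("scroll")

def run_step(instruction: str, take_screenshot, perform) -> Action:
    # One step of the loop: look at the screen, decide, act.
    action = locate(instruction, take_screenshot())
    perform(action)
    return action
```

The point of the shape is that nothing in it references an element tree or a selector: the only inputs are pixels and a sentence.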

But the real problem wasn't execution. It was everything that happens before the test runs.

Where Test Flows Actually Break Down

The typical approach is to define test flows outside your codebase. Maybe a QA engineer writes them manually. Maybe they get generated from a PRD. Either way, they live in a separate world from your source code.

This works until it doesn't. The app changes. A screen gets renamed. A flow gets restructured. The tests don't know about any of it. Before long, you're maintaining two sources of truth that are drifting further apart every week.

We tried a second approach: generating tests directly from the codebase using MCP, pulling in component structure, navigation flows, and screen definitions to build test cases that actually reflect the current state of the app. This improved sync significantly — tests knew what the app looked like right now — but the tradeoffs were real. Token usage was high, and generation was slow.

The Shift: Tests Should Live With the Code

The breakthrough wasn't a technical trick. It was a framing change.

Test generation shouldn't be a one-off step that produces artifacts you then manage separately. Tests need to live alongside the codebase so they have continuous access to context and stay in sync as the app evolves. When a developer changes a screen, the tests should know about it — not because someone remembered to update a spec, but because the test definitions are rooted in the same repo.

So we kept the vision-based execution (no selectors, no fragile locators) and moved test generation closer to the repository. The result is a system where tests are generated from codebase context, defined as YAML-based flows, and executed visually.

What This Looks Like in Practice

The workflow breaks down into three pieces:

Generate from context. Tests are produced from actual codebase structure — routes, components, screen definitions — not guesswork or stale documentation.

Define as YAML. Test flows are expressed in a simple, readable YAML format. Easy to version, easy to review in a PR, easy to extend.

Execute with vision. The agent runs tests by looking at the screen, not by querying an element tree. This works across Android and iOS without platform-specific test code.
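To make the middle piece concrete, a flow along these lines might look like the snippet below. This is a hypothetical illustration, not Finalrun's actual schema — the field names are invented for the example.

```yaml
# Hypothetical flow definition, for illustration only — not Finalrun's real schema.
name: login-flow
platforms: [android, ios]
steps:
  - action: tap
    target: "the login button"
  - action: type
    target: "the email field"
    text: "demo@example.com"
  - action: verify
    target: "the confirmation message"
```

Because the targets are plain-English descriptions rather than selectors, a diff to a file like this reads like a test plan, which is what makes it easy to review in a PR.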

The most interesting scenario is what we call the post-development handoff. An AI builds a feature inside an IDE. Immediately after, the agent generates a test from the updated codebase and executes it visually — verifying that the feature the AI just wrote actually works on a real device. No human in the loop for the testing step.

Why Open Source

We could have kept this closed and built a product around it. But the problem — flaky, out-of-sync mobile tests — is too widespread, and the approach is too early to develop in a vacuum. Vision-based testing needs to be stress-tested across real codebases, real devices, and real workflows. The fastest way to get there is to let others break it, fix it, and push it further.

The core pieces — context-based generation, YAML flow definitions, and the vision execution agent — are all in the repo.

The demo walks through the full cycle — AI builds a feature, Finalrun generates a vision-based test, and the test runs on device. It's the workflow we wanted when we started down this path: tests that keep up with the code, execute like a human would, and don't fall apart the moment someone moves a button three pixels to the left.


If you've been fighting flaky mobile test suites, we'd love to hear what you've tried. Open an issue on the repo or reach out — this is very much a work in progress.