RunAgain traces every run of your Vercel AI SDK and Claude Agent SDK agents, replays them in mocked environments, tests and evaluates them — and alerts you when behaviour drifts, before your customers notice.
An LLM call is one prompt in, one answer out — you can score that with a benchmark. An agent plans, calls tools, mutates state and loops for twenty steps. One bad decision at step three silently poisons everything after it.
Static benchmarks can't see trajectories. The answer can look fine while the agent took a $4, nine-tool detour — or got lucky on a path that fails tomorrow.
Today's AI evaluation platforms are too complex and too expensive — built for ML research teams, weeks to wire up, and mostly blind to multi-step tool use.
Without tests, your first regression report is a churned customer. RunAgain flips the order: you find out first, fix it, and run again.
Log every agent run — every step, tool call, token and millisecond, captured automatically from production. Nothing to instrument by hand.
Replay any run in a mocked environment — from the dashboard, the CLI, or straight from Claude over MCP. Swap the model, rewrite the prompt, change a tool; tool responses are simulated from previous runs, so iterations are fast and free.
Turn real runs into deterministic tests. Recorded mocks freeze flaky and paid APIs, so the same input always exercises the same path.
Score with whatever technique fits: LLM-as-judge, rubrics, assertions, trajectory scoring, pairwise comparison, dataset regression, human review.
Ship the fix, watch the score climb — export your best runs as fine-tuning datasets, and get alerted the moment quality, cost or latency drifts, before your customers notice and churn.
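The record-and-replay idea behind these steps can be sketched in a few lines of Python. This is an illustration only, not RunAgain's implementation or API: `ReplayTool`, `flaky_search`, and the choice to key recordings by a hash of the call arguments are all assumptions made for the example.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def _key(args: Dict[str, Any]) -> str:
    """Stable hash of the call arguments, used as the recording key."""
    payload = json.dumps(args, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ReplayTool:
    """Hypothetical wrapper: records a tool's real responses, then replays them."""

    def __init__(self, tool: Callable[..., Any]) -> None:
        self._tool = tool
        self._recordings: Dict[str, Any] = {}
        self.replaying = False
        self.live_calls = 0

    def __call__(self, **args: Any) -> Any:
        key = _key(args)
        if self.replaying:
            # Replay mode never touches the real tool.
            if key not in self._recordings:
                raise KeyError(f"no recording for arguments {args!r}")
            return self._recordings[key]
        # Record mode: call the real tool once and keep its response.
        self.live_calls += 1
        result = self._tool(**args)
        self._recordings[key] = result
        return result


def flaky_search(query: str) -> Dict[str, Any]:
    """Stand-in for a paid, unreliable API (hypothetical)."""
    return {"query": query, "hits": [f"doc about {query}"]}


search = ReplayTool(flaky_search)
recorded = search(query="refund policy")  # live call, response is recorded

search.replaying = True
replayed = search(query="refund policy")  # served from the recording
```

Because replay mode raises on any call it has no recording for, a changed prompt or model that wanders onto a new path fails loudly instead of silently hitting the paid API.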
Every agent run — production or local — streams into RunAgain as a structured trace: every step, prompt, tool call, token and millisecond. Search across runs, diff any two, and replay the weird ones. Yesterday's incident becomes tomorrow's test case.
RunAgain watches your traces and drafts the boring parts automatically — tests from real runs, mocks from recorded tool calls, eval suites from failure patterns. You just tune the taste: review each suggestion and approve. Try it below.
12 production runs looped on duplicate refund requests. Drafted a regression test from run c9e0 with mocked stripe.refund.
Answers started citing docs that weren't retrieved. Drafted a judge prompt scoring answer-to-context faithfulness on every run.
The search API is flaky (7% timeouts) and paid. Recorded 214 real responses so tests and experiments replay them for free.
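As a rough illustration of what a regression test for the duplicate-refund loop might assert, here is a minimal trajectory check in Python. The `ToolCall` shape, the `duplicate_calls` helper, and the sample arguments are hypothetical and are not RunAgain's trace format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToolCall:
    """One step of a run (assumed shape, for illustration only)."""
    name: str
    args: Tuple[Tuple[str, str], ...]  # (key, value) pairs, kept hashable


def duplicate_calls(trajectory: List[ToolCall]) -> List[ToolCall]:
    """Return each tool call that appears more than once, in first-seen order."""
    counts = Counter(trajectory)
    return [call for call in dict.fromkeys(trajectory) if counts[call] > 1]


# Hypothetical sample data: a looping run and a clean one.
refund = ToolCall("stripe.refund", (("charge_id", "ch_123"),))
lookup = ToolCall("orders.lookup", (("order_id", "o_9"),))

looping_run = [lookup, refund, refund, refund]
clean_run = [lookup, refund]

looping_duplicates = duplicate_calls(looping_run)
clean_duplicates = duplicate_calls(clean_run)
```

A check like this scores the path rather than the final answer, so it catches a run that refunded three times even when the closing message to the customer looks fine.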
Full-fidelity traces of every run, step, tool call, token and latency — streamed live from production.
Run agents against tool responses simulated from previous runs — deterministic, fast, and no burned API credits.
LLM-as-judge, rubrics, assertions, trajectory scoring, pairwise diffs, dataset regression, human review — or bring your own.
Tests, mocks and eval suites drafted from your real traces. You tune the taste — review and approve.
Get paged when quality, cost or latency drifts from baseline — before customers churn over bad behaviour.
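One simple way to think about drift from baseline is a relative-change threshold per metric. The sketch below is a simplified Python illustration under that assumption; the `drifted_metrics` function, the 20% tolerance, and the sample numbers are made up for the example and say nothing about how RunAgain actually computes alerts.

```python
from statistics import mean
from typing import Dict, List


def drifted_metrics(
    baseline: Dict[str, List[float]],
    recent: Dict[str, List[float]],
    tolerance: float = 0.2,
) -> Dict[str, float]:
    """Return the relative change of each metric whose recent mean moved
    more than `tolerance` away from its baseline mean.

    Assumes every baseline mean is non-zero.
    """
    drifted: Dict[str, float] = {}
    for name, base_values in baseline.items():
        base = mean(base_values)
        change = (mean(recent[name]) - base) / base
        if abs(change) > tolerance:
            drifted[name] = change
    return drifted


# Hypothetical per-run measurements.
baseline = {
    "quality": [0.90, 0.92, 0.88],
    "cost_usd": [0.40, 0.42, 0.38],
    "latency_ms": [1200.0, 1100.0, 1300.0],
}
recent = {
    "quality": [0.89, 0.91, 0.90],
    "cost_usd": [0.80, 0.78, 0.82],
    "latency_ms": [1250.0, 1150.0, 1200.0],
}

alerts = drifted_metrics(baseline, recent)
```

Here only cost has moved past the threshold: quality and latency average the same as before, while the mean cost per run has roughly doubled.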
RunAgain ships as an MCP server — run experiments, tune evals and approve suggestions straight from Claude, and export your best runs as fine-tuning datasets.
We're onboarding teams in small batches. Drop your email and we'll hand you a seat — and a starter eval suite.