private beta

Infrastructure for agents that get better every run.

RunAgain traces every run of your Vercel AI SDK and Claude Agent SDK agents, replays them in mocked environments, tests and evaluates them — and alerts you when behaviour drifts, before your customers notice.

Join 2,300+ builders already on the waitlist.
~/agents/support-bot — runagain
$ runagain eval --watch --loops 5
→ agent support-bot · suite faithfulness · 15 cases
WORKS WITH: Vercel AI SDK · Claude Agent SDK · MCP · REST API
// WHY THIS EXISTS

LLMs are easy to eval. Agents aren't.

An LLM call is one prompt in, one answer out — you can score that with a benchmark. An agent plans, calls tools, mutates state and loops for twenty steps. One bad decision at step three silently poisons everything after it.

01
One prompt vs. twenty steps

Static benchmarks can't see trajectories. The answer can look fine while the agent took a $4, nine-tool detour — or got lucky on a path that fails tomorrow.

02
Existing tools don't fit

Today's AI evaluation platforms are too complex and too expensive — built for ML research teams, weeks to wire up, and mostly blind to multi-step tool use.

03
So agents ship on vibes

Without tests, your first regression report is a churned customer. RunAgain flips the order: you find out first, fix it, and run again.

// THE FEEDBACK LOOP

Trace. Experiment. Test. Evaluate. Improve. Run again.

runagain — feedback loop
trace → experiment → test → evaluate → improve
01
Trace

Log every agent run: every step, tool call, token and millisecond, captured automatically from production once the agent is wrapped. No per-step instrumentation by hand.

run f17b captured · 23 steps · 6 tool calls · 1.4s 
02
Experiment

Replay any run in a mocked environment — from the dashboard, the CLI, or straight from Claude over MCP. Swap the model, rewrite the prompt, change a tool; tool responses are simulated from previous runs, so iterations are fast and free.

model: fable-5 · prompt: v3 · tools: mocked
03
Test

Turn real runs into deterministic tests. Recorded mocks freeze flaky and paid APIs, so the same input always exercises the same path.

✓ 15/15 replays identical — deterministic
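
The record-and-replay idea behind deterministic tests can be sketched roughly as follows. This is an illustration under stated assumptions, not RunAgain's implementation: `RecordedCall`, `callKey` and `makeReplayTool` are hypothetical names invented for this example.

```typescript
// Hypothetical shapes for illustration; not part of any published RunAgain API.
type ToolArgs = Record<string, unknown>;

interface RecordedCall {
  tool: string;
  args: ToolArgs;
  response: unknown;
}

// Build a stable lookup key: top-level argument keys are sorted so that
// { q, limit } and { limit, q } resolve to the same recording.
function callKey(tool: string, args: ToolArgs): string {
  const sorted = Object.keys(args)
    .sort()
    .map((k) => [k, args[k]]);
  return `${tool}:${JSON.stringify(sorted)}`;
}

// Return a stand-in for the live tool that only serves recorded responses.
function makeReplayTool(recordings: RecordedCall[]) {
  const byKey = new Map<string, unknown>();
  for (const r of recordings) {
    byKey.set(callKey(r.tool, r.args), r.response);
  }
  return (tool: string, args: ToolArgs): unknown => {
    const key = callKey(tool, args);
    if (!byKey.has(key)) {
      // Fail loudly instead of silently falling through to a paid or flaky API.
      throw new Error(`no recording for ${key}`);
    }
    return byKey.get(key);
  };
}
```

Throwing on an unrecorded call is a deliberate choice in this sketch: a test that quietly reaches the live API is no longer deterministic.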
04
Evaluate

Score with whatever technique fits: LLM-as-judge, rubrics, assertions, trajectory scoring, pairwise comparison, dataset regression, human review.

faithfulness 0.94
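
Of the techniques listed, trajectory scoring and assertions are the easiest to show in a few lines. The sketch below scores the path a run took rather than its final answer. The `Step` shape, the budget fields and `scoreTrajectory` are assumptions made for this example, not a documented RunAgain interface.

```typescript
// Hypothetical step shape; field names are assumptions for this sketch.
interface Step {
  kind: "plan" | "tool" | "generate";
  tool?: string;
  args?: string; // serialized tool arguments
  costUsd: number;
}

interface TrajectoryReport {
  toolCalls: number;
  totalCostUsd: number;
  repeatedCalls: number;
  passed: boolean;
}

// Score the path an agent took, independent of whether the final answer looks fine.
function scoreTrajectory(
  steps: Step[],
  budget: { maxToolCalls: number; maxCostUsd: number },
): TrajectoryReport {
  const seen = new Set<string>();
  let toolCalls = 0;
  let repeatedCalls = 0;
  let totalCostUsd = 0;

  for (const step of steps) {
    totalCostUsd += step.costUsd;
    if (step.kind !== "tool") continue;
    toolCalls += 1;
    // An identical tool + args pair seen earlier suggests the agent is looping.
    const key = `${step.tool ?? ""}|${step.args ?? ""}`;
    if (seen.has(key)) repeatedCalls += 1;
    seen.add(key);
  }

  // Three plain assertions: stay within the call budget, the cost budget, and never repeat a call.
  const passed =
    toolCalls <= budget.maxToolCalls &&
    totalCostUsd <= budget.maxCostUsd &&
    repeatedCalls === 0;

  return { toolCalls, totalCostUsd, repeatedCalls, passed };
}
```

A run can produce a correct answer and still fail this check, which is the point: the detour is visible even when the output is not wrong.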
05
Improve

Ship the fix, watch the score climb — export your best runs as fine-tuning datasets, and get alerted the moment quality, cost or latency drifts, before your customers notice and churn.

drift alert armed · you find out before customers do
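
One simple way to think about a drift check on quality, cost or latency is to compare a recent window of values against a baseline window. The sketch below uses a relative change in the mean. The function names and the tolerance parameter are assumptions for illustration, and a production detector would likely use something more robust than a plain mean.

```typescript
interface DriftResult {
  baselineMean: number;
  recentMean: number;
  relativeChange: number; // (recent - baseline) / baseline
  drifted: boolean;
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("need at least one value");
  return values.reduce((total, v) => total + v, 0) / values.length;
}

// Flag drift when the recent mean moves more than `tolerance`
// (expressed as a fraction, e.g. 0.1 for 10%) away from the baseline mean.
function detectDrift(
  baseline: number[],
  recent: number[],
  tolerance: number,
): DriftResult {
  const baselineMean = mean(baseline);
  const recentMean = mean(recent);
  if (baselineMean === 0) throw new Error("baseline mean must be non-zero");
  const relativeChange = (recentMean - baselineMean) / baselineMean;
  return {
    baselineMean,
    recentMean,
    relativeChange,
    drifted: Math.abs(relativeChange) > tolerance,
  };
}
```

The same function works for any of the three signals: pass eval scores to watch quality, dollars per run to watch cost, or milliseconds to watch latency.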
// OBSERVABILITY

Log every run.
Chaos in, order out.

Every agent run — production or local — streams into RunAgain as a structured trace: every step, prompt, tool call, token and millisecond. Search across runs, diff any two, and replay the weird ones. Yesterday's incident becomes tomorrow's test case.

  • zero-config capture from Vercel AI SDK & Claude Agent SDK
  • full trajectories — not just the final answer
  • one click: trace → mock → test → eval
runagain — trace ingest · live
ingesting 84 runs/min · 12 agents · 0 dropped
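
To make "diff any two" concrete, here is a minimal sketch of comparing two structured traces. The `TraceStep` fields are assumptions loosely based on the step, token and millisecond data described above, and `diffTraces` is a hypothetical helper, not a RunAgain function.

```typescript
// Hypothetical trace shape for this sketch.
interface TraceStep {
  name: string; // e.g. "plan", "retrieve", "tool:db", "generate"
  tokens: number;
  durationMs: number;
}

interface TraceDiff {
  firstDivergence: number | null; // index of the first step where the runs differ
  tokenDelta: number; // b minus a
  durationDeltaMs: number; // b minus a
}

function diffTraces(a: TraceStep[], b: TraceStep[]): TraceDiff {
  const shared = Math.min(a.length, b.length);
  let firstDivergence: number | null = null;
  for (let i = 0; i < shared; i++) {
    if (a[i]?.name !== b[i]?.name) {
      firstDivergence = i;
      break;
    }
  }
  // If one run is a strict prefix of the other, they diverge where the shorter one ends.
  if (firstDivergence === null && a.length !== b.length) {
    firstDivergence = shared;
  }
  const sum = (steps: TraceStep[], pick: (s: TraceStep) => number): number =>
    steps.reduce((total, s) => total + pick(s), 0);
  return {
    firstDivergence,
    tokenDelta: sum(b, (s) => s.tokens) - sum(a, (s) => s.tokens),
    durationDeltaMs: sum(b, (s) => s.durationMs) - sum(a, (s) => s.durationMs),
  };
}
```

This version compares step names only. Comparing prompts or tool arguments at the divergence point would be a natural next step.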
// EVALS ON AUTOPILOT

Setting up tests and evals is hard. So we do it for you.

RunAgain watches your traces and drafts the boring parts automatically — tests from real runs, mocks from recorded tool calls, eval suites from failure patterns. You just tune the taste: review each suggestion and approve. Try it below.

NEW TEST
refund-flow: user asks twice

12 production runs looped on duplicate refund requests. Drafted a regression test from run c9e0 with mocked stripe.refund.

NEW EVAL
faithfulness · LLM-as-judge

Answers started citing docs that weren't retrieved. Drafted a judge prompt scoring answer-to-context faithfulness on every run.

NEW MOCK
tool:search → recorded responses

search API is flaky (7% timeouts) and paid. Recorded 214 real responses so tests and experiments replay them for free.
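
The faithfulness card above describes answers citing documents that were never retrieved. Before reaching for an LLM judge, part of that failure can be caught with a plain assertion. The sketch below is a cheaper complement to a judge prompt, not a replacement for one. `AnsweredRun` and both functions are hypothetical names for this example.

```typescript
// Hypothetical record of one answered run; field names are assumptions for this sketch.
interface AnsweredRun {
  retrievedDocIds: string[];
  citedDocIds: string[];
}

// Return the citations that point at documents the retrieval step never returned.
function unsupportedCitations(run: AnsweredRun): string[] {
  const retrieved = new Set(run.retrievedDocIds);
  return run.citedDocIds.filter((id) => !retrieved.has(id));
}

// Fraction of citations backed by retrieved context (1 when nothing is cited).
function citationSupport(run: AnsweredRun): number {
  if (run.citedDocIds.length === 0) return 1;
  const bad = unsupportedCitations(run).length;
  return (run.citedDocIds.length - bad) / run.citedDocIds.length;
}
```

This only checks that cited documents were retrieved. Whether the answer actually agrees with those documents still needs a judge or a human reviewer.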

// EVERYTHING BETWEEN "IT RAN" AND "IT WORKS"

Fix your agents. Ensure they work. Keep them working.

01
Log every run

Full-fidelity traces of every run, step, tool call, token and latency — streamed live from production.

02
Mocked environments

Run agents against tool responses simulated from previous runs — deterministic, fast, and no burned API credits.

03
Every eval technique

LLM-as-judge, rubrics, assertions, trajectory scoring, pairwise diffs, dataset regression, human review — or bring your own.

04
Auto-generated suites

Tests, mocks and eval suites drafted from your real traces. You tune the taste — review and approve.

05
Drift alerts

Get paged when quality, cost or latency drifts from baseline — before customers churn over bad behaviour.

06
MCP-first

RunAgain ships as an MCP server — run experiments, tune evals and approve suggestions straight from Claude, and export your best runs as fine-tuning datasets.

// THREE LINES TO YOUR FIRST EVAL

Install. Trace. Run again.

1. Install the SDK
$ npm i runagain
2. Wrap your agent
// Vercel AI SDK & Claude Agent SDK
import { trace } from 'runagain'
export default trace(agent)
3. Score every run
$ runagain eval
✓ 14/15 passed
run a91f · support-bot · main · passed
plan → retrieve → tool:db → generate
96
accuracy 0.96
faithfulness 0.94
toxicity 0.08
vs main: −1 case

Be first to run again.

We're onboarding teams in small batches. Drop your email and we'll hand you a seat — and a starter eval suite.