Run the test suite, diagnose every failure to its real cause, fix the root cause, and re-run until green. $ARGUMENTS may scope this to a file, directory, or test pattern.
Setup
- Detect the runner from the project: `package.json` scripts (jest/vitest/mocha/playwright), `pytest`/`tox.ini`/`pyproject.toml`, `go test`, `cargo test`, `rspec`, `phpunit`, etc. Use the project's own command (e.g. the `test` script), not a global guess.
- Run the full suite once (or the `$ARGUMENTS` subset) and capture the complete output. List every failing test with its error and file:line before changing anything.
- If the suite won't even start (missing deps, config error, uncompiled TS), fix that first; that is not a test failure.
Diagnose each failure
For each failing test, read the test AND the code under test, then classify the cause before touching anything:
- Genuine bug: the test asserts correct behavior and the code is wrong. Fix the code.
- Wrong/stale assertion: the code is correct and the expectation is outdated. Before changing the assertion, run `git log -p` / `git blame` on both the test and the code under test to establish whether the behavior change was intentional (a deliberate commit/PR that changed the API, matched by a spec or changelog) or an accidental regression. Rewrite the assertion only once the new behavior is confirmed intended by that history, never by reading the desired value off the failure output. If the history shows a regression, fix the code instead.
- Flake: reproduce it deterministically, then locate the source. Run the single test in isolation, then the whole file, then the suite in randomized order and repeated, using your runner's real flags: jest `--randomize` (shuffles tests within each file; `--seed=<n>` replays an order) and `--runInBand` (serial), and since jest has no repeat flag, loop the command in a shell; vitest `--sequence.shuffle` and `--no-file-parallelism`; pytest with the `pytest-randomly` plugin (`--randomly-seed=<n>` pins/replays an order) and `pytest-repeat` (`--count=<n>`), serial by dropping `-n` from `pytest-xdist`; go `-shuffle=on -count=<n>` (`-count` also defeats the test cache); cargo `-- --test-threads=1` (the flag is passed to the test binary, after `--`); rspec `--order random` / `--seed <n>`. A test that passes alone but fails in the suite is shared-state or order coupling; one that fails intermittently alone is a timing/async race or nondeterminism.
- CI-only (fails in CI but passes locally, or the reverse): treat the environment gap as the cause, not the test. Diff the two environments: missing/differing env vars, locale/timezone (`LC_ALL`, `TZ`), filesystem case-sensitivity, higher CI parallelism/worker count, tighter resource or time limits, absent services. Reproduce locally by matching CI: same env vars, same worker count and test order, same container/image.
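For runners without a repeat flag, the loop and the flake classification described above could be sketched as follows. `repeat_run` and `classify` are hypothetical helper names, and the "deterministic failure" branch is an added assumption for the case where every isolated run fails.

```python
import subprocess
import sys


def repeat_run(command: list[str], runs: int) -> list[int]:
    """Run `command` `runs` times; return the 1-based indices of the failing runs."""
    failures = []
    for attempt in range(1, runs + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(attempt)
    return failures


def classify(alone_failures: list[int], suite_failures: list[int], runs: int) -> str:
    """Map repeat-run results to a likely cause, following the rule of thumb above."""
    if 0 < len(alone_failures) < runs:
        # Fails only sometimes even in isolation.
        return "timing/async race or nondeterminism"
    if not alone_failures and suite_failures:
        # Green alone, red when other tests run around it.
        return "shared state or order coupling"
    if runs > 0 and len(alone_failures) == runs:
        # Assumption: a test that always fails alone is not a flake at all.
        return "deterministic failure, not a flake"
    return "not reproduced"
```

A call might look like `repeat_run(["npx", "jest", "path/to/file.test.ts", "--runInBand"], 20)`, where the test path is a placeholder.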
Fix the root cause
- Timing/async: await the actual condition or use the framework's fake timers / `waitFor`. Never add bare `sleep`/fixed delays or `retry` wrappers to paper over a race.
- Order/shared state: find the leak: a mutated global, un-reset module, shared DB row, leftover file, unclosed handle. Reset it in `beforeEach`/`afterEach` or make the test self-contained. Do not fix it by pinning test order.
- Nondeterminism: pin the clock, seed the RNG, fix locale/timezone, stub the network. Remove reliance on real wall-clock time or live services.
- Environment gap: make the test provide what it needs (set the env var/locale in setup, skip explicitly with a documented reason when a required service is truly absent) rather than depending on ambient CI state.
- Genuine bug: fix the source, not the test.
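To illustrate the nondeterminism fix, here is a minimal sketch in which the clock and the RNG are injected so the test can pin both. `make_token` is a made-up function standing in for code under test, and `fixed_clock` is an assumed test helper; real code may need a different seam (a fake-timer API, a fixture).

```python
import random
from datetime import datetime, timezone
from typing import Callable


def make_token(now: Callable[[], datetime], rng: random.Random) -> str:
    """Hypothetical code under test: a timestamp plus a random 4-digit suffix."""
    stamp = now().strftime("%Y%m%dT%H%M%SZ")
    suffix = rng.randrange(10_000)
    return f"{stamp}-{suffix:04d}"


def fixed_clock() -> datetime:
    """Pinned clock: always returns the same UTC instant."""
    return datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)


def test_make_token_is_deterministic() -> None:
    # Same clock and same seed must give the same token on every run.
    first = make_token(fixed_clock, random.Random(1234))
    second = make_token(fixed_clock, random.Random(1234))
    assert first == second
    assert first.startswith("20240102T030405Z-")
```

Production callers would pass a real clock and an unseeded `random.Random()`; only the test pins them.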
Hard rules
- Never make a test pass by deleting it, marking it skip/xfail/`.only`, loosening an assertion to match wrong output, widening tolerances, or catching-and-ignoring. If a test is genuinely obsolete, say so and ask before removing.
- Never add blanket retry/rerun config to hide flakes.
- Change one cause at a time and re-run to confirm that fix before moving on; keep changes minimal and scoped to the failure.
- After all fixes, run the entire suite (not just the previously-failing tests) to confirm green and no regressions. If flakes were involved, run it 2-3 times or shuffled to prove stability.
- Do not touch snapshots wholesale — regenerate a snapshot only after verifying the new output is correct.
Report
State, per originally-failing test: the classification (bug / wrong assertion / flake-