Run the test suite, diagnose every failure to its real cause, fix the root cause, and re-run until green. $ARGUMENTS may scope this to a file, directory, or test pattern.

Setup

Detect the runner from the project: package.json scripts (jest/vitest/mocha/playwright), pytest/tox.ini/pyproject.toml, go test, cargo test, rspec, phpunit, etc. Use the project's own command (e.g. the test script), not a global guess.
Run the full suite once (or the $ARGUMENTS subset) and capture the complete output. List every failing test with its error and file:line before changing anything.
If the suite won't even start (missing deps, config error, uncompiled TS), fix that first — that is not a test failure.
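
As a rough illustration of the detection step above, here is a minimal sketch that maps marker files to a test command. The function name detect_test_command, the choice of marker files (including go.mod and Cargo.toml), and the returned commands are assumptions made for this example, not a complete or authoritative detector.

```python
import json
from pathlib import Path
from typing import List, Optional


def detect_test_command(root: str) -> Optional[List[str]]:
    """Guess the project's own test command from marker files (heuristic only)."""
    base = Path(root)

    pkg = base / "package.json"
    if pkg.is_file():
        scripts = json.loads(pkg.read_text(encoding="utf-8")).get("scripts", {})
        if "test" in scripts:
            # Prefer the project's script over calling jest/vitest directly.
            return ["npm", "test"]

    pyproject = base / "pyproject.toml"
    has_pytest_table = (
        pyproject.is_file()
        and "[tool.pytest.ini_options]" in pyproject.read_text(encoding="utf-8")
    )
    if (base / "pytest.ini").is_file() or has_pytest_table:
        return ["python", "-m", "pytest"]

    if (base / "go.mod").is_file():
        return ["go", "test", "./..."]
    if (base / "Cargo.toml").is_file():
        return ["cargo", "test"]

    # Nothing recognized: inspect the project by hand instead of guessing.
    return None
```

Returning None rather than a default keeps the "not a global guess" rule: an unrecognized project should be inspected, not run with an arbitrary command.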

Diagnose each failure

For each failing test, read the test AND the code under test, then classify the cause before touching anything:

Genuine bug: the test asserts correct behavior and the code is wrong. Fix the code.
Wrong/stale assertion: the code is correct and the expectation is outdated. Before changing the assertion, run git log -p / git blame on both the test and the code under test to establish whether the behavior change was intentional (a deliberate commit/PR that changed the API, matched by a spec or changelog) or an accidental regression. Rewrite the assertion only once the new behavior is confirmed intended by that history — never by reading the desired value off the failure output. If the history shows a regression, fix the code instead.
Flake: reproduce it deterministically, then locate the source. Run the single test in isolation, then the whole file, then the suite in randomized order and repeated, using your runner's real flags: jest --randomize (shuffles test order within each file; --seed=<n> replays an order) and --runInBand (serial); jest has no repeat flag, so loop the command in a shell. vitest --sequence.shuffle and --no-file-parallelism; pytest with the pytest-randomly plugin (--randomly-seed=<n> pins/replays an order) and pytest-repeat (--count=<n>), serial by dropping -n from pytest-xdist; go test -shuffle=on -count=<n> (-count also defeats the test cache); cargo test -- --test-threads=1 (the flag goes to the test binary, so it must follow the bare --); rspec --order random / --seed <n>. A test that passes alone but fails in the suite is shared-state or order coupling; one that fails intermittently alone is a timing/async race or nondeterminism.
CI-only (fails in CI but passes locally, or the reverse): treat the environment gap as the cause, not the test. Diff the two environments — missing/differing env vars, locale/timezone (LC_ALL, TZ), filesystem case-sensitivity, higher CI parallelism/worker count, tighter resource or time limits, absent services. Reproduce locally by matching CI: same env vars, same worker count and test order, same container/image.
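
The isolate-then-repeat procedure above can be sketched as a small helper that loops a command and classifies the result. The names run_repeatedly and classify and the label strings are hypothetical; the commands you pass in are whatever your runner actually uses.

```python
import subprocess
from typing import List, Sequence


def run_repeatedly(cmd: Sequence[str], runs: int) -> List[int]:
    """Run cmd `runs` times and return each exit code (0 means that run passed)."""
    return [subprocess.run(cmd, capture_output=True).returncode for _ in range(runs)]


def classify(isolated: List[int], in_suite: List[int]) -> str:
    """Apply the rule of thumb: compare isolated runs against in-suite runs."""
    alone_failures = sum(1 for code in isolated if code != 0)
    if isolated and alone_failures == len(isolated):
        # Fails every time on its own: this is not flakiness.
        return "deterministic failure, not a flake"
    if alone_failures > 0:
        # Fails only sometimes even when run alone.
        return "timing/async race or nondeterminism"
    if any(code != 0 for code in in_suite):
        # Always passes alone, fails with neighbours.
        return "shared state or order coupling"
    return "not reproduced"
```

For example, classify(run_repeatedly(single_test_cmd, 20), run_repeatedly(full_suite_cmd, 5)) gives a first hypothesis to check; it does not replace reading the test and the code under test.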

Fix the root cause

Timing/async: await the actual condition or use the framework's fake timers / waitFor. Never add bare sleep/fixed delays or retry wrappers to paper over a race.
Order/shared state: find the leak — a mutated global, un-reset module, shared DB row, leftover file, unclosed handle. Reset it in beforeEach/afterEach or make the test self-contained. Do not fix it by pinning test order.
Nondeterminism: pin the clock, seed the RNG, fix locale/timezone, stub the network. Remove reliance on real wall-clock time or live services.
Environment gap: make the test provide what it needs (set the env var/locale in setup, skip explicitly with a documented reason when a required service is truly absent) rather than depending on ambient CI state.
Genuine bug: fix the source, not the test.
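
Two of the fixes above, pinning the clock and RNG and waiting on the actual condition, might look like the following sketch. make_token is a hypothetical function under test invented for this example; the point is only the shape of the tests around it.

```python
import random
import threading
from datetime import datetime, timezone
from typing import Callable


def make_token(now: Callable[[], datetime], rng: random.Random) -> str:
    """Hypothetical code under test: a date prefix plus a random 4-digit suffix."""
    return f"{now():%Y%m%d}-{rng.randrange(10_000):04d}"


def test_token_is_deterministic() -> None:
    # Pinned clock: the test never reads real wall-clock time.
    def fixed_now() -> datetime:
        return datetime(2024, 1, 2, tzinfo=timezone.utc)

    # Seeded RNG: the same seed yields the same sequence on every run.
    first = make_token(fixed_now, random.Random(42))
    second = make_token(fixed_now, random.Random(42))
    assert first == second
    assert first.startswith("20240102-")


def test_waits_for_condition_not_sleep() -> None:
    done = threading.Event()
    worker = threading.Thread(target=done.set)
    worker.start()
    # Block on the real condition; the timeout is only an upper bound,
    # and wait() returns as soon as the event is set.
    assert done.wait(timeout=5)
    worker.join()
```

Passing the clock and RNG in as parameters is one design choice among several; a framework's fake timers or a monkeypatched time source serve the same purpose when the code cannot take them as arguments.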

Hard rules

Never make a test pass by deleting it, marking it skip/xfail/.only, loosening an assertion to match wrong output, widening tolerances, or catching-and-ignoring. If a test is genuinely obsolete, say so and ask before removing.
Never add blanket retry/rerun config to hide flakes.
Change one cause at a time and re-run to confirm that fix before moving on; keep changes minimal and scoped to the failure.
After all fixes, run the entire suite (not just the previously-failing tests) to confirm green and no regressions. If flakes were involved, run it 2-3 times or shuffled to prove stability.
Do not touch snapshots wholesale — regenerate a snapshot only after verifying the new output is correct.

Report

State, per originally-failing test: the classification (bug / wrong assertion / flake / env), the actual root cause in one line (cite the commit for a confirmed intended-change or regression), and the fix. End with the final suite result (counts) and how you verified stability.
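
One possible shape for that report, as a sketch only: the FailureReport fields and the render layout are assumptions chosen for illustration, not a required format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureReport:
    test: str            # identifier of the originally-failing test
    classification: str  # "bug", "wrong assertion", "flake", or "env"
    root_cause: str      # one line; cite the commit where history decided it
    fix: str             # what was changed


def render(reports: List[FailureReport], passed: int, failed: int, stability: str) -> str:
    """Format one line per failure, then the final suite result."""
    lines = [
        f"{r.test}: [{r.classification}] {r.root_cause} -> {r.fix}"
        for r in reports
    ]
    lines.append(f"Final: {passed} passed, {failed} failed. Stability: {stability}")
    return "\n".join(lines)
```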
