Build the feature described in $ARGUMENTS strictly test-first, driving every line of production code from a failing test.
If $ARGUMENTS is empty, ask the user what feature to build and stop until they answer. Do not guess a feature.
Before the first test
- Detect the stack and test runner from the repo: read package.json / pyproject.toml / go.mod / Gemfile / Cargo.toml, existing test files, and CI config. Note the exact command to run one test file and the whole suite.
- Match existing conventions exactly: test directory layout, file naming (
*.test.ts,test_*.py,*_test.go), assertion library, mocking approach, and factory/fixture helpers. Do not introduce a new framework. - Restate $ARGUMENTS as a short bullet list of observable behaviors, each written as a concrete input -> expected-output pair. This is your checklist; each behavior becomes one or more tests. If a behavior cannot be pinned to a concrete example — you cannot state its expected output — ask ONE batched question listing exactly those gaps, then stop until answered. Otherwise proceed on your stated interpretation without pausing; do not re-confirm mid-run.
The loop — repeat per behavior, smallest slice first
- RED — Write the single smallest test that asserts one unimplemented behavior. Use concrete inputs and expected outputs, not placeholders. Test observable behavior through the public API, never private internals.
- Run only that test (or file) and paste the failure output. Confirm it fails for the RIGHT reason — a real assertion mismatch, not an import error, typo, or missing file. If it errors instead of failing, fix the test until it fails cleanly. A test you have not watched fail proves nothing.
- GREEN — Write the minimal code to pass: the least logic that satisfies this test and the ones before it. Hardcoding a return value is acceptable if no prior test forbids it — the next test will force generalization. Do not add code for behaviors you have not yet tested.
- Run the full suite. All tests must pass before you continue. If an older test breaks, stop and fix it now.
- REFACTOR — Only while green: remove duplication, clarify names, extract functions. Change no behavior. Re-run the full suite after refactoring; revert if anything turns red.
- Announce progress in one line:
RED <test name>/GREEN/REFACTOR, then move to the next behavior.
Coverage of edge cases
Before declaring done, add failing tests (then satisfy them) for empty/null/zero inputs, boundary values, and invalid input with its expected error — each through the full RED->GREEN->REFACTOR loop. For a concurrency or ordering guarantee the feature promises, drive it test-first ONLY if you can make the test deterministic: inject the clock/scheduler, control the interleaving, or assert on collected-then-sorted output. Never commit a test that depends on sleep, wall-clock timing, or thread-race luck — a flaky test is worse than none. If you cannot force determinism, state the guarantee, how you verified it manually, and why it resists an automated test.
Finish
- Run the entire suite one final time and show it green. Run the project's linter/type-checker/formatter if one exists and fix what they flag.
- Report: the behavior checklist with every item ticked, the list of test names added, and the final suite result.
Hard rules
- Never write or edit production code to add or change behavior while the suite is green. Red first, always. The ONLY edit allowed while green is behavior-preserving refactoring (step 5) that keeps every existing test passing unchanged; anything that alters observable behavior needs a failing test first.
- Never write more than one failing test at a time.
- Never weaken or delete a test to get to green — fix the code. If a test itself is wrong, say so explicitly before changing it.
- Never mock the unit under test; mock only true external boundaries (network, clock, filesystem, third-party services).
- Never skip showing a test fail. "It obviously fails" is not evidence.