Test Coverage Specification

Historical record — v0.1.0-mvp test plan (2026-05-28). Covers the v0.1 Functional-lane MVP. The shipped product moved to the v0.2 modality spine (unit / browser / api / integration / mutation); see Architecture.

Spec: TFactory MVP — Walking Skeleton (Functional Lane, Python) Parent: ../spec.md Approach: TDD — every code-bearing task writes tests first, then implementation, then verifies green.

Test pyramid for TFactory itself

       ┌──────────────────┐
       │   e2e smoke (4)  │   end-to-end pipeline against known good input
       ├──────────────────┤
       │ integration (12) │   agent + executor + git wiring
       ├──────────────────┤
       │   unit (~40)     │   per-module, fast, mocked LLM calls
       └──────────────────┘

Unit tests (per module)

mcp_server/tfactory_server.py

test_plan/ model

context/source.json snapshot

Planner agent

Gen-Functional agent — pre-flight static check

Gen-Functional agent — flake-risk lint

Docker runner

Evaluator

Triager

Git writer

Integration tests (multi-component, no real LLM)

  1. Handover end-to-end (mocked LLM)task_create_and_run → workspace created → snapshot AIFactory dir → planner runs with mocked LLM returning a fixed plan → Gen-Functional runs with mocked LLM returning known test source → Executor runs in real Docker against a fixture project → Evaluator scores → Triager writes report. No network. No real AIFactory required (use a checked-in fixture spec dir).

  2. Hallucination replan loop — Planner returns a subtask for a method that doesn’t exist; Gen-Functional rejects in pre-flight; Planner is re-invoked; Gen-Functional accepts the replanned subtask. Verify counts in report.json.

  3. Flake-lint pipeline — Gen-Functional emits a test with a dict-order assertion; flake-lint rejects; report counts flake_warnings. Then a clean test passes through.

  4. Stability re-run — Generated test that fails 1/3 times is flagged not rejected; reported as flaky.

  5. Docker timeout — generated test sleeps forever; executor kills container after TFACTORY_TASK_TIMEOUT_SEC; task marked executor_timeout; report describes the failure clearly.

  6. Docker daemon down — kill the docker daemon before task; verify graceful failure with a clear error in report.md and status failed, no hang. (Maps to verification scenario #9 in the design plan.)

  7. Git commit + PR comment (dry-run) — Triager produces git commands and a PR comment body that match goldens. Real git commit verified separately in e2e.

  8. Portal task list — backend returns task list JSON shape; frontend renders task list (component test).

  9. Portal live logs WebSocket — connection established; messages streamed; client renders.

  10. Portal lane tabs — functional tab fully populated; sast/dast/fuzz/mutation tabs show “coming in Phase N” placeholder.

  11. MCP task_rerun — completed task rerun rebuilds tests but reuses the snapshotted context; preserves logs of the previous run under logs/run_1/, logs/run_2/.

  12. AIFactory spec snapshot is read-only — verify the original AIFactory spec dir is unchanged byte-for-byte after a TFactory run.

End-to-end smoke (real LLM, real Docker, real git)

These run the 9 verification scenarios from the design plan against a known small Python feature spec from AIFactory’s actual history. Documented as a CI workflow + a scripts/e2e-smoke.sh script.

  1. Happy path — verification scenario 1-5 from design plan: handover → workspace populated → portal advances → tests committed → pytest tests/ passes.
  2. Mutation-of-feature test — verification scenario 6: mutate one line of feature code, rerun the generated tests, at least one must fail.
  3. PR comment — verification scenario 7: gh pr view --comments shows TFactory’s report.
  4. Hallucination guard — verification scenario 8: planner fed a spec referring to a non-existent method; Gen-Functional rejects via pre-flight; Planner replans; no broken test committed.
  5. Docker-down failure path — verification scenario 9: docker daemon killed mid-task; task marked failed with clear error in report.md; no hang.

Mocking strategy

Coverage targets

What is intentionally not tested at MVP