TFactory v0.2 — Implementation Task Plan

Status: Tasks ready for execution
Date: 2026-05-28
Parent design: 2026-05-28-enterprise-test-frameworks-design.md
Authored via: /super-brainstorm → spec → writing-plans
Predecessor release: v0.1.0-mvp (12 tasks, 531 backend + 112 frontend tests)

Summary

17 tasks (Task 0 through Task 16), ~91 commits, multi-month effort. Ships v0.2: Playwright + Jest + pytest across three lanes (Browser, Unit-TS, Unit-Python). Establishes the framework registry, target schema, and platform deliverables that v0.3+ extensions slot into without rework.

Same execution cadence as v0.1: one GitHub issue per task, with a nominal six-commit shape per task (scaffold → primitives × 2 → real wire → integration → close) that smaller tasks trim to four or five commits. Issue numbers below are tentative and will be assigned at issue-creation time.


Dependency graph

              Task 0 (lane rename — gates everything)
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
          Task 1          Task 2          Task 3
          framework       .tfactory.yml   tests-catalog
          registry        schema          schema
              │               │               │
              └───────────────┼───────────────┘
                              ▼
                          Task 4
                          snapshotter extended
                              │
                              ▼
                          Task 5
                          Planner per-subtask
                              │
                              ▼
                          Task 6
                          Gen-Functional generic + context
                              │
              ┌───────────────┼────────────────────────┐
              ▼               ▼                        ▼
          Task 7          Task 9                   Task 12
          Docker images   Evaluator                Templates
          (playwright,    per-lang primitives      (Playwright +
           jest, pytest)  (tsc, ESLint, Stryker)   Jest + pytest set)
              │               │                        │
              ▼               ▼                        ▼
          Task 8          Task 10                  Task 13
          Browser app     Evaluator coverage       Skills
          runtime         adapter (null vs zero)   (init, add-test,
                              │                    from-template)
                              ▼                        │
                          Task 11                      ▼
                          Triager update-vs-create  Task 14
                          + catalog mutation        Portal endpoints
                              │                        │
                              └────────────┬───────────┘
                                           ▼
                                       Task 15
                                       LaneStatusGrid +
                                       migration CLI
                                           │
                                           ▼
                                       Task 16
                                       Evidence capture +
                                       portal viewer
                                       (closes v0.2)

Critical path: 0 → 1/2/3 → 4 → 5 → 6 → 9 → 10 → 11 → 15 → 16
Parallelizable after Task 6: 7, 8, 9, 12, 13, 14

Task 16 lands LAST: it integrates evidence flow across Tasks 8 (browser runtime emits artifacts), 11 (Triager links them), and 14 (portal serves them). Doing it last avoids re-touching three upstream tasks.


Task index

#   Title                                                     Blocked by    Commits  Issue
0   Lane rename + breaking-change migration                   -             4        tbd
1   Framework registry data model + loader                    0             6        tbd
2   .tfactory.yml schema + parser + validator                 0             6        tbd
3   .tfactory/tests-catalog.json schema + helpers             0             6        tbd
4   Snapshotter extended                                      2, 3          4        tbd
5   Planner per-subtask (language, framework, lane)           1, 4          6        tbd
6   Gen-Functional generic + context injection                1, 5          6        tbd
7   Per-framework Docker images (Playwright, Jest, pytest)    1             5        tbd
8   Browser-lane app runtime + health-poll                    2, 7          6        tbd
9   Evaluator per-language primitives (tsc, ESLint, Stryker)  1, 6          6        tbd
10  Evaluator coverage adapter (null vs zero)                 6, 9          4        tbd
11  Triager update-vs-create + catalog mutation               3, 10         5        tbd
12  Templates: Playwright + Jest + pytest starter set         1             5        tbd
13  Skills: tfactory-init / add-test / from-template          12            5        tbd
14  Portal endpoints for templates / skills / catalogs        1, 3, 12, 13  6        tbd
15  LaneStatusGrid reskin + migration CLI                     0, 14         5        tbd
16  Test evidence capture + portal viewer (closes v0.2)       8, 11, 14     6        tbd

Total: 91 commits across 17 tasks (Task 0 through Task 16).
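The "Blocked by" column of the task index can be turned into execution waves mechanically. The sketch below is one way to do that with the standard-library `graphlib` module; the `BLOCKED_BY` dict is transcribed from the index, while the function name `execution_waves` is an illustrative assumption and not part of the plan.

```python
from graphlib import TopologicalSorter

# "Blocked by" column of the task index, transcribed as task -> prerequisites.
BLOCKED_BY = {
    0: [], 1: [0], 2: [0], 3: [0], 4: [2, 3], 5: [1, 4], 6: [1, 5],
    7: [1], 8: [2, 7], 9: [1, 6], 10: [6, 9], 11: [3, 10], 12: [1],
    13: [12], 14: [1, 3, 12, 13], 15: [0, 14], 16: [8, 11, 14],
}


def execution_waves(blocked_by):
    """Group tasks into waves; every task in a wave has all prerequisites done."""
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()  # raises graphlib.CycleError if the index contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = execution_waves(BLOCKED_BY)
for number, wave in enumerate(WAVES, start=1):
    print(f"wave {number}: {wave}")
```

Going only by the index, Tasks 7 and 12 become ready as soon as Task 1 lands, which is earlier than the dependency graph draws them. A check like this could flag that kind of drift between the graph and the index.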


Task 0 — Lane rename + breaking-change migration

Must land first. Every other task depends on the new Lane enum.

Goal

Atomically replace v0.1’s Lane.{FUNCTIONAL, SAST, DAST, FUZZ, MUTATION} with v0.2’s Lane.{UNIT, BROWSER, API, INTEGRATION, MUTATION} across the backend, frontend, tests, and prompts.

Sub-tasks

Acceptance criteria

Commit shape (4 commits)

  1. Backend Lane enum + deprecation aliases + lane_dispatch + lang_registry
  2. Backend tests updated
  3. Frontend LaneStatusGrid + test updates
  4. CHANGELOG + close issue

Task 1 — Framework registry data model + loader

Goal

Build the registry that maps framework name → FrameworkDescriptor object, loaded from frameworks/{name}/descriptor.yaml files. Used by all downstream agents to look up runner image, templates, context block, evaluator hooks per (language, framework).
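A minimal sketch of the registry loader, assuming a flat descriptor with four fields. The field names (`language`, `runtime_image`, `context_block`), the image tag, and the injected `parse` callable are illustrative assumptions. In practice `yaml.safe_load` from PyYAML would be passed as `parse`; the demo passes `json.loads` so it runs on the standard library alone.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class FrameworkDescriptor:
    # Field names are assumptions; the real descriptor schema is defined in Task 1.
    name: str
    language: str
    runtime_image: str
    context_block: str = ""


def load_registry(root: Path, parse: Callable[[str], dict],
                  filename: str = "descriptor.yaml") -> dict:
    """Scan root/{name}/descriptor.yaml and index descriptors by framework name."""
    registry = {}
    for path in sorted(root.glob(f"*/{filename}")):
        descriptor = FrameworkDescriptor(**parse(path.read_text(encoding="utf-8")))
        if descriptor.name != path.parent.name:
            raise ValueError(
                f"{path}: name {descriptor.name!r} does not match "
                f"directory {path.parent.name!r}"
            )
        registry[descriptor.name] = descriptor
    return registry


# Demo: one descriptor written as JSON and parsed with json.loads.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pytest").mkdir()
    (root / "pytest" / "descriptor.yaml").write_text(json.dumps({
        "name": "pytest",
        "language": "python",
        "runtime_image": "example/runner-pytest:dev",  # hypothetical image tag
    }), encoding="utf-8")
    REGISTRY = load_registry(root, parse=json.loads)

print(REGISTRY["pytest"].runtime_image)
```

Checking the descriptor name against its directory catches copy-paste mistakes when a new framework is added by cloning an existing folder.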

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Dataclass + validator scaffolding
  2. Loader + first descriptor (pytest, mirrors v0.1 behavior as sanity)
  3. Playwright descriptor + browser-specific fields
  4. Jest descriptor
  5. Tests + 25+ cases
  6. Docs + close issue

Task 2 — .tfactory.yml schema + parser + validator

Goal

Define the schema that AIFactory projects use to declare targets, test paths, and seed/reset commands. Implement parser + validator with helpful error messages.
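To make "helpful error messages" concrete, here is a small validation sketch. The plan calls for Pydantic models; this version uses plain dataclasses so it has no dependencies. Every key name (`targets`, `type`, `test_path`, `seed_command`, `reset_command`) and the allowed target types are assumptions for illustration, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_TARGET_TYPES = {"browser", "api", "library"}  # hypothetical set
REQUIRED_KEYS = ("name", "type", "test_path")         # hypothetical keys


@dataclass(frozen=True)
class Target:
    name: str
    type: str
    test_path: str
    seed_command: Optional[str] = None
    reset_command: Optional[str] = None


def validate_config(raw: dict):
    """Return (valid targets, error messages); collects every error, not just the first."""
    entries = raw.get("targets")
    if not isinstance(entries, list) or not entries:
        return [], ["targets: expected a non-empty list of target mappings"]
    targets, errors, seen = [], [], set()
    for index, entry in enumerate(entries):
        where = f"targets[{index}]"
        if not isinstance(entry, dict):
            errors.append(f"{where}: expected a mapping, got {type(entry).__name__}")
            continue
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            errors.append(f"{where}: missing required key(s): {', '.join(missing)}")
            continue
        if entry["type"] not in ALLOWED_TARGET_TYPES:
            errors.append(
                f"{where}.type: {entry['type']!r} is not one of "
                f"{sorted(ALLOWED_TARGET_TYPES)}"
            )
            continue
        if entry["name"] in seen:
            errors.append(f"{where}.name: duplicate target name {entry['name']!r}")
            continue
        seen.add(entry["name"])
        targets.append(Target(
            name=entry["name"], type=entry["type"], test_path=entry["test_path"],
            seed_command=entry.get("seed_command"),
            reset_command=entry.get("reset_command"),
        ))
    return targets, errors


TARGETS, ERRORS = validate_config({
    "targets": [
        {"name": "web", "type": "browser", "test_path": "tests/e2e",
         "seed_command": "make seed"},
        {"name": "cli", "type": "desktop", "test_path": "tests/cli"},
        {"name": "core", "type": "library"},
    ]
})
for message in ERRORS:
    print(message)
```

Each message carries the path to the offending entry, so an engineer can fix several problems in one pass instead of re-running the parser after every fix.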

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Pydantic models
  2. Target type validators
  3. Auth validators
  4. Parser + env-var detection
  5. Tests
  6. Example yaml + close issue

Task 3 — .tfactory/tests-catalog.json schema + helpers

Goal

Define the catalog schema, read/write helpers, and the 3-step AC-match lookup algorithm.

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Schema dataclass
  2. Read/write IO with atomic-write
  3. 3-step lookup algorithm
  4. v0.1 migration
  5. Tests
  6. Docs + close issue
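The atomic-write step in the commit list above is commonly done by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that pattern. The helper names and the `{"tests": []}` shape of an empty catalog are assumptions; the real schema is what this task defines.

```python
import json
import os
import tempfile
from pathlib import Path


def write_catalog_atomic(path: Path, catalog: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(catalog, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_catalog(path: Path) -> dict:
    """Return the catalog, or an empty one (assumed shape) if the file does not exist."""
    if not path.exists():
        return {"tests": []}
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    catalog_path = Path(tmp) / ".tfactory" / "tests-catalog.json"
    EMPTY = read_catalog(catalog_path)
    write_catalog_atomic(catalog_path, {"tests": [{"id": "t-1"}]})
    ROUND_TRIP = read_catalog(catalog_path)
    LEFTOVERS = [p.name for p in catalog_path.parent.iterdir() if p.name.endswith(".tmp")]

print(ROUND_TRIP)
```

This guards against a crash mid-write leaving a truncated catalog. It does not by itself serialize two concurrent writers; that would need a lock or a compare-before-replace step.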

Task 4 — Snapshotter extended

Goal

The snapshotter (Task 3 of v0.1) currently captures the AIFactory spec + diff into context/. Extend it to also read .tfactory.yml and .tfactory/tests-catalog.json from the AIFactory repo and surface both to the Planner.

Sub-tasks

Acceptance criteria

Commit shape (4 commits)

  1. Read tfactory.yml in snapshotter
  2. Read tests-catalog in snapshotter
  3. source.json fields + tests
  4. Close issue

Task 5 — Planner per-subtask (language, framework, lane, target)

Goal

Extend the Planner to emit subtasks each carrying (language, framework, lane, target_name) instead of just lane. Polyglot repos produce mixed subtasks naturally.
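A sketch of what a subtask record and a post-session check might look like. The four fields come from the goal above and the lane names are the lower-cased v0.2 `Lane` members from Task 0; the `validate_subtasks` helper, its arguments, and the sample data are assumptions for illustration.

```python
from dataclasses import dataclass

# Lower-cased v0.2 Lane members from Task 0.
LANES = {"unit", "browser", "api", "integration", "mutation"}


@dataclass(frozen=True)
class Subtask:
    language: str
    framework: str
    lane: str
    target_name: str


def validate_subtasks(subtasks, framework_languages, target_names):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    for index, subtask in enumerate(subtasks):
        where = f"subtasks[{index}]"
        expected = framework_languages.get(subtask.framework)
        if expected is None:
            problems.append(f"{where}: unknown framework {subtask.framework!r}")
        elif expected != subtask.language:
            problems.append(
                f"{where}: {subtask.framework} expects {expected}, got {subtask.language}"
            )
        if subtask.lane not in LANES:
            problems.append(f"{where}: unknown lane {subtask.lane!r}")
        if subtask.target_name not in target_names:
            problems.append(f"{where}: unknown target {subtask.target_name!r}")
    return problems


# Hypothetical polyglot plan: three consistent subtasks and one broken one.
FRAMEWORK_LANGUAGES = {"playwright": "typescript", "jest": "typescript", "pytest": "python"}
PLANNED = [
    Subtask("typescript", "playwright", "browser", "web"),
    Subtask("typescript", "jest", "unit", "web"),
    Subtask("python", "pytest", "unit", "api"),
    Subtask("python", "jest", "fuzz", "worker"),
]
PROBLEMS = validate_subtasks(PLANNED, FRAMEWORK_LANGUAGES, target_names={"web", "api"})
for problem in PROBLEMS:
    print(problem)
```

In the real pipeline the framework-to-language map would presumably come from the Task 1 registry and the target names from the parsed .tfactory.yml.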

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Subtask schema extensions
  2. Planner prompt updates (framework picking)
  3. Post-session validator
  4. Helper prompt injection
  5. Tests (mocked SDK)
  6. Close issue

Task 6 — Gen-Functional refactored: generic prompt + context injection

Goal

Replace prompts/gen_functional.md (currently Python+pytest-specific) with a generic prompt that’s parameterized per subtask via the framework descriptor’s context_block + templates.
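One possible shape for the parameterization: a generic body filled from the subtask, followed by sections taken from the framework descriptor. The prompt wording, the section headings, and the dict keys below are all placeholders, not the contents of the real prompt file.

```python
# Placeholder wording; the real generic prompt body is authored in this task.
GENERIC_BODY = (
    "Generate {lane}-lane tests in {language} with {framework} "
    "for target {target_name}."
)


def build_prompt(subtask: dict, descriptor: dict) -> str:
    """Assemble the generic body plus whatever the descriptor contributes."""
    sections = [GENERIC_BODY.format(**subtask)]
    context_block = descriptor.get("context_block", "")
    if context_block:
        sections.append("Framework context:\n" + context_block.strip())
    templates = descriptor.get("templates", [])
    if templates:
        listing = "\n".join(f"- {name}" for name in templates)
        sections.append("Available templates:\n" + listing)
    return "\n\n".join(sections)


PROMPT = build_prompt(
    {"lane": "browser", "language": "typescript",
     "framework": "playwright", "target_name": "web"},
    {"context_block": "Prefer role-based locators.",  # hypothetical content
     "templates": ["login-flow", "form-submit"]},     # hypothetical names
)
print(PROMPT)
```

Keeping framework knowledge in the descriptor means adding a framework in v0.3+ should not require touching the prompt body.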

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Generic prompt body
  2. Prompt helper with descriptor injection
  3. Agent dispatcher per framework
  4. Runner-fn parameterization
  5. Tests
  6. Close issue

Task 7 — Per-framework Docker images

Goal

Build the runner images for Playwright + Jest. pytest’s image already exists from v0.1; rebuild to match the registry’s runtime.image convention.

Sub-tasks

Acceptance criteria

Commit shape (5 commits)

  1. Rename pytest image to registry convention
  2. Jest image
  3. Playwright image (largest — chromium baseline)
  4. CI workflow
  5. Smoke tests + close issue

Task 8 — Browser-lane app runtime + health-poll

Goal

Make the Executor able to spin up the AIFactory app via docker-compose, wait for it to be healthy, hand the URL to the Playwright test, then tear it down.

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. AppRuntime class scaffolding
  2. start + stop
  3. wait_for_healthy with poll loop
  4. DockerRunner integration
  5. Status transitions
  6. Tests + close issue
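The `wait_for_healthy` poll loop from the commit list above might look like the following. This is a sketch under assumptions: the signature, the default deadline and interval, and the health URL are illustrative, and the probe, sleep, and clock are injectable so the loop can be tested without a running app.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with a 2xx status; any connection failure counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_for_healthy(url, probe=http_probe, deadline_s=60.0, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll until the probe succeeds or the deadline passes. Always probes at least once."""
    give_up_at = clock() + deadline_s
    while True:
        if probe(url):
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)


# Demo with a fake probe that becomes healthy on the third attempt.
ATTEMPTS = []


def fake_probe(url):
    ATTEMPTS.append(url)
    return len(ATTEMPTS) >= 3


HEALTHY = wait_for_healthy("http://localhost:3000/health",  # hypothetical URL
                           probe=fake_probe, sleep=lambda seconds: None)
print(HEALTHY, len(ATTEMPTS))
```

Using a monotonic clock keeps the deadline correct even if the wall clock is adjusted while the app is starting.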

Task 9 — Evaluator per-language primitives

Goal

Author the TS/JS analogs of Python’s preflight_static, flake_risk_lint, mutate_probe. These are the per-language primitives the Evaluator dispatches based on the test’s framework.
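A dispatch table is one way to route a primitive to the right tool per language. The sketch below only builds the command lines and does not run them. The three CLI invocations (`tsc --noEmit`, `eslint --format json`, `stryker run`) are standard for those tools, but the function names, the table layout, and the framework-to-language map are assumptions. The existing Python primitives are left out of the sketch.

```python
def ts_preflight(test_file: str) -> list:
    # Type-check only; --noEmit keeps tsc from writing build output.
    return ["npx", "tsc", "--noEmit"]


def ts_flake_lint(test_file: str) -> list:
    # JSON output is easier to parse into signals than the default formatter.
    return ["npx", "eslint", "--format", "json", test_file]


def ts_mutate_probe(test_file: str) -> list:
    # Mutation scope would come from the Stryker config added in this task.
    return ["npx", "stryker", "run"]


# (language, primitive) -> command builder. Names are illustrative.
PRIMITIVES = {
    ("typescript", "preflight"): ts_preflight,
    ("typescript", "flake_lint"): ts_flake_lint,
    ("typescript", "mutate_probe"): ts_mutate_probe,
}

# Assumed mapping; in the plan this lookup goes through the framework registry.
FRAMEWORK_LANGUAGE = {"playwright": "typescript", "jest": "typescript", "pytest": "python"}


def command_for(framework: str, primitive: str, test_file: str) -> list:
    """Resolve the command for a primitive based on the test's framework."""
    language = FRAMEWORK_LANGUAGE[framework]
    try:
        builder = PRIMITIVES[(language, primitive)]
    except KeyError:
        raise LookupError(f"no {primitive!r} primitive registered for {language}") from None
    return builder(test_file)


print(command_for("jest", "flake_lint", "src/cart.test.ts"))
```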

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. TS preflight (tsc-based)
  2. TS flake-lint (ESLint-based)
  3. TS mutate-probe (Stryker-based)
  4. Configs (eslint + stryker)
  5. Tests
  6. Close issue

Task 10 — Evaluator coverage adapter (null vs zero)

Goal

Per Decision 11: coverage_delta = null for Browser lane, NOT zero. The Evaluator prompt must treat null as “not applicable” not “low value”.
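The distinction matters in code because `None` and `0.0` are both falsy in Python, so a truthiness check would merge the two cases. A minimal sketch of the rendering step, with an assumed function name and placeholder wording:

```python
from typing import Optional


def render_coverage_signal(coverage_delta: Optional[float]) -> str:
    """Render the coverage line for the Evaluator prompt (wording is a placeholder)."""
    # Compare against None explicitly: `if not coverage_delta` would also
    # catch 0.0 and report a real zero as "not applicable".
    if coverage_delta is None:
        return "coverage_delta: not applicable for this lane"
    return f"coverage_delta: {coverage_delta:+.2f}"


print(render_coverage_signal(None))
print(render_coverage_signal(0.0))
print(render_coverage_signal(1.5))
```

The same care applies when the signals bundle is serialized: JSON `null` and `0` round-trip to `None` and `0` in Python, so the distinction survives as long as no layer substitutes a default.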

Sub-tasks

Acceptance criteria

Commit shape (4 commits)

  1. Signals bundle null handling
  2. Prompt block rendering
  3. Validator + evaluator.md verdict-priority update
  4. Tests + close issue

Task 11 — Triager update-vs-create + catalog mutation

Goal

Triager reads the tests catalog, applies the 3-step AC-match lookup, decides UPDATE-in-place vs CREATE-new, and writes the catalog back.

Sub-tasks

Acceptance criteria

Commit shape (5 commits)

  1. Catalog read at Triager start
  2. lookup_by_ac integration
  3. UPDATE vs CREATE branching
  4. SKIP for operator_locked
  5. Tests + close issue
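The branching in the commit list above reduces to a three-way decision once the AC-match lookup has run. The sketch takes the lookup result as input rather than reimplementing `lookup_by_ac`, whose three steps are not spelled out here. The `Action` enum and the dict shape of a catalog entry are assumptions; `operator_locked` is the flag named in the commit list.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CREATE = "create"
    UPDATE = "update"
    SKIP = "skip"


def decide_action(matched_entry: Optional[dict]) -> Action:
    """Decide what the Triager does with a generated test.

    matched_entry is the catalog entry returned by the AC-match lookup,
    or None when no existing test covers the acceptance criterion.
    """
    if matched_entry is None:
        return Action.CREATE
    if matched_entry.get("operator_locked", False):
        # A human has pinned this test; leave it untouched.
        return Action.SKIP
    return Action.UPDATE


print(decide_action(None))
print(decide_action({"id": "t-1", "operator_locked": True}))
print(decide_action({"id": "t-2"}))
```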

Task 12 — Templates: Playwright + Jest + pytest starter set

Goal

Ship the per-framework template library that Gen-Functional uses as starting points. ~5 templates per framework.

Sub-tasks

Acceptance criteria

Commit shape (5 commits)

  1. Template engine (simple substitution)
  2. Playwright 5 templates
  3. Jest 5 templates
  4. pytest 5 templates (refining v0.1’s implicit templates)
  5. Tests + close issue

Task 13 — Skills: tfactory-init / add-test / from-template

Goal

Author Claude Code skill bundles for engineers to use from their own sessions (not just the portal).

Sub-tasks

Acceptance criteria

Commit shape (5 commits)

  1. tfactory-init skill + command
  2. tfactory-add-test skill + command
  3. tfactory-from-template skill + command
  4. handover-to-tfactory skill update for new schema
  5. Tests + close issue

Task 14 — Portal endpoints for templates / skills / catalogs

Goal

The portal grows new REST endpoints exposing the framework registry, templates, skills, and per-project catalog. The frontend (Task 15) uses these to render a “Templates” tab and a “Coverage gaps” view.

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Frameworks endpoints
  2. Templates endpoints
  3. Skills endpoints
  4. Catalog endpoint on existing task routes
  5. Tests
  6. Allowlist + close issue

Task 15 — LaneStatusGrid reskin + migration CLI

Goal

Ship the frontend reskin (5 new lane cards) plus the tfactory init and tfactory migrate CLIs. Task 16 lands after this task and closes v0.2.

Sub-tasks

Acceptance criteria

Commit shape (5 commits)

  1. LaneStatusGrid reskin + frontend tests
  2. tfactory init CLI
  3. tfactory migrate CLI
  4. CHANGELOG + README + docs updates
  5. Tag + release + close v0.2 epic


Task 16 — Test evidence capture + portal viewer (closes v0.2)

Cross-cutting. Per Decision 12 in the spec: screenshots + video + trace + network HAR as test evidence for human review. Touches Tasks 8 (Browser runtime), 11 (Triager), 14 (Portal). Lands LAST so it sees all the upstream contracts settled.

Goal

Capture, store, serve, and link evidence artifacts (screenshots / video / trace / HAR / request-response logs) so human reviewers can see what TFactory generated running, before they trust it.

Sub-tasks

Acceptance criteria

Commit shape (6 commits)

  1. Evidence layout + Playwright config integration
  2. HTTP HAR recorder for API/Integration lanes
  3. Catalog schema extension + retention enforcer
  4. Triager PR-comment evidence-links rendering
  5. Portal endpoint + TFactoryTaskDetail Evidence tab
  6. Tests + close issue + close v0.2 epic

Risks + execution notes

Critical-path risk

Tasks 0 → 1/2/3 → 4 → 5 → 6 run in sequence; only 1, 2, and 3 can proceed side by side. That's 7 tasks and 38 commits (4 + 6 + 6 + 6 + 4 + 6 + 6) at the head of the critical path. At v0.1’s cadence (12 tasks shipped in ~1 day of intensive work), v0.2 is plausibly 1-2 weeks of focused work for the critical path + parallel branches.

  1. Land Task 0 first. Don’t start anything else until the lane rename is in main + green.
  2. Parallelize Tasks 1, 2, 3 as soon as Task 0 lands. Three independent schema-and-loader tasks.
  3. Task 4 (snapshotter) is small and is blocked only by 2 and 3: slot it in as soon as those land, while Task 1 finishes.
  4. Tasks 7 (Docker images) + 12 (templates) can run in parallel with the critical path from Task 5 onwards. Docker image work is mostly container-build time, not coding time.
  5. Task 15 is the close-out commit. It absorbs the CLI helpers + tag + release. Don’t start it until 11/13/14 are all in.

Test-volume estimate

Task   New backend tests      New frontend tests
0      0 (regression check)   0 (regression check)
1      25                     0
2      30                     0
3      25                     0
4      8                      0
5      15                     0
6      20                     0
7      6                      0
8      12                     0
9      35                     0
10     10                     0
11     18                     0
12     20                     0
13     12                     0
14     30                     0
15     5                      13
16     40                     8
TOTAL  +311                   +21

End-of-v0.2 totals: ~840 backend (531 + 311) + ~133 frontend (112 + 21) = ~975 tests.

Migration safety

The Lane.SAST/DAST/FUZZ deprecation aliases (Task 0) mean v0.1 workspaces still work through v0.2 with a deprecation warning. v0.3 removes the aliases entirely.
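A sketch of how such aliases could resolve with a warning. The v0.2 member names come from Task 0, but the string values and, more importantly, the alias targets are placeholders: this plan does not state which v0.2 lane each legacy name maps to, so `LEGACY_ALIASES` below is invented purely to make the example run.

```python
import warnings
from enum import Enum


class Lane(Enum):
    UNIT = "unit"
    BROWSER = "browser"
    API = "api"
    INTEGRATION = "integration"
    MUTATION = "mutation"


# PLACEHOLDER mapping, not from the plan: the real alias targets are decided in Task 0.
LEGACY_ALIASES = {"SAST": Lane.UNIT, "DAST": Lane.API, "FUZZ": Lane.INTEGRATION}


def resolve_lane(name: str, aliases=LEGACY_ALIASES) -> Lane:
    """Resolve a lane name from a workspace, accepting deprecated v0.1 names."""
    key = name.upper()
    if key in Lane.__members__:
        return Lane[key]
    if key in aliases:
        warnings.warn(
            f"Lane.{key} is deprecated; using Lane.{aliases[key].name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return aliases[key]
    raise ValueError(f"unknown lane: {name!r}")


with warnings.catch_warnings(record=True) as CAUGHT:
    warnings.simplefilter("always")
    CURRENT_LANE = resolve_lane("browser")
    LEGACY_LANE = resolve_lane("sast")

print(CURRENT_LANE, LEGACY_LANE, len(CAUGHT))
```

When v0.3 removes the aliases, emptying the alias table turns every legacy name into the `ValueError` path without touching callers.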

Customer-acceptance demo (suggested)

Once v0.2 is tagged, the demo flow for an enterprise prospect:

  1. Engineer in their AIFactory repo runs /handover-to-tfactory
  2. TFactory snapshots the spec, picks Playwright for the UI change + Jest for the JS unit logic + pytest for the Python backend
  3. docker-compose spins up the app
  4. Three lanes’ worth of tests generated, all signal-validated
  5. Evidence captured automatically: screenshots per step, video of each failing test, trace.zip for debugging, network HAR for API tests
  6. Triage report posted to the PR with cover-by-AC breakdown + inline evidence links (engineer clicks → portal opens video at 0:00, sees the test exercise the login flow exactly as expected)
  7. Engineer accepts → tests committed to feature branch + catalog updated
  8. Three months later: new dev opens an old TFactory-generated test, clicks the catalog entry’s “view evidence” link, watches the video from the test’s first run, understands what it does without reading the Playwright TS code

The evidence-driven trust loop is the v0.2 wedge. Without it, “AI-generated tests” are scary. With it, they’re verifiable in 30 seconds per test.


Predecessors + provenance