v0.2.0 in action — a live demo

This page is the receipt. We took TFactory v0.2.0, handed it a small live SUT (a Vite + React greeting generator with 5 acceptance criteria and a deliberately seeded cache bug), and let the four-agent pipeline plan, write, sandbox, score, and triage a Playwright suite end-to-end. Below you’ll find the time-lapse of the portal, the generated tests, the captured evidence (screenshots, video, trace, HAR), and the unedited triage report the Triager produced — exactly what a reviewer would see on the AIFactory PR.

The demo app

Try it live — embedded right here:

Pick a category, pick a tone, click Generate, then click Generate again with the same dropdowns — you’ll see the same text appear (that’s the seeded cache bug TFactory’s AC#5 test catches).

The system-under-test is intentionally tiny: a Vite + React single-page app with two dropdowns (category, tone), a Generate button, a Clear button, and an output panel. The 5-AC surface covers happy path, every dropdown combination, the Clear reset, accessibility on the buttons, and a cache-warmup behaviour. AC#5 is the failing case — the SUT ships with a seeded cache bug so the pipeline has something real to flag.

The user story

As a marketer drafting outreach copy, I want to generate short greetings with selectable tone and audience, so that I can prototype message variants without leaving the browser.

Acceptance criteria — exactly what TFactory was handed:

  1. AC#1 — Happy path. With default selections, clicking Generate produces a non-empty greeting in the output panel within 1s.
  2. AC#2 — Tone × Audience matrix. Every (tone, audience) combination produces a greeting consistent with the selected tone keyword.
  3. AC#3 — Clear resets. Clicking Clear empties the output panel and re-enables Generate without a page reload.
  4. AC#4 — Buttons are accessible. Generate and Clear expose aria-labels, are keyboard-focusable, and survive a Tab traversal in document order.
  5. AC#5 — Cache warmup is observable (expected to fail — the SUT has a seeded cache bug). The second Generate call for an identical (tone, audience) pair should hit the in-memory cache and complete in under 50ms; the bug makes it bypass the cache and re-compute.

The pipeline running

The portal’s LaneStatusGrid lights up phase-by-phase as the Planner → Gen-Functional → Executor → Evaluator → Triager pipeline runs end-to-end against the demo app. The time-lapse below was captured from http://localhost:3110 during a real run — no edits, no faked states.

Portal phases during the pipeline run

The TFactory portal’s LaneStatusGrid lighting up phase-by-phase as the Planner → Gen-Functional → Executor → Evaluator → Triager pipeline runs end-to-end against the demo app.

Watch the full MP4 →

What got generated

The Gen-Functional agent emitted 5 Playwright .spec.ts files into the demo repo, one per acceptance criterion. They landed on a feature branch via the Triager’s git_writer (run in write mode for the demo, opt-in via TFACTORY_TRIAGER_GIT_WRITE=1):

See the generated tests on the demo PR

The files:

  • tests/e2e/generate-produces-non-empty-text.spec.ts (AC#1)
  • tests/e2e/greeting-category-vocabulary.spec.ts (AC#2)
  • tests/e2e/snarky-tone-vocabulary.spec.ts (AC#3)
  • tests/e2e/clear-empties-output.spec.ts (AC#4)
  • tests/e2e/different-text-on-consecutive-generates.spec.ts (AC#5 — surfaces seeded bug)

Each file imports only from the demo’s existing test harness — pre-flight static checks confirmed every import resolved before the test was kept, and the flake-risk linter cleared each file of dict-iteration order, time.sleep, and unfrozen datetime.now() patterns.

The test evidence

Per Decision 11 in the v0.2 design, browser-lane tests don’t report line coverage (the test drives the browser, not the framework code) — instead they ship screenshots, video, trace.zip, and network HAR as verification evidence, captured automatically by the tfactory-runner-playwright image.

AC#5 video — the failing case

This is the most visual artefact: the recording of AC#5 watching the second Generate call miss the cache, recompute, and overshoot the 50ms budget.

Screenshot thumbnails

One screenshot per acceptance criterion, captured at the moment of assertion failure (or at the final passing state for AC#1-4):

AC Result Screenshot
AC#1 Generate produces non-empty text ✅ pass AC#1
AC#2 Greeting category vocabulary ✅ pass AC#2
AC#3 Snarky tone vocabulary ✅ pass AC#3
AC#4 Clear empties output ✅ pass AC#4
AC#5 Two clicks → different text fail (seeded bug) AC#5

Downloads

The Triager attaches the full evidence bundle as PR-comment-linked downloads — re-runnable locally with npx playwright show-trace:

  • AC#5 trace.zip: download — Playwright trace; open in https://trace.playwright.dev/
  • AC#5 video.webm: download — full screen recording of the failing case
  • Planner output (test_plan.json): download — the 5-subtask plan the Planner agent emitted from spec.md
  • Triage report (raw markdown): download

The verdict (what humans see)

Below is the unedited findings/triage_report.md the Triager produced for this run. Same Markdown the reviewer sees when they open the file in their PR diff, and the same body the Triager posts as a PR comment when TFACTORY_TRIAGER_PR_COMMENT=1 is set.

Triage Report — tfactory-demo / 001-greeting-generator

Mode: initial Generated at: 2026-05-29T10:33:18Z Pipeline: Planner ✅ → Gen-Functional (Browser-lane manual seed; see note) → Executor (tfactory-runner-playwright:latest) ✅ → Evaluator (manual scoring; see note) → Triager (this report)

Summary

Metric Value
Subtasks planned 5
Tests generated 5
Tests executed 5
Accepted (passing) 4
Rejected (failing) 1 ❌ — AC#5 (seeded cache bug)
Coverage strategy null (Browser lane per Decision 11)

Committed (accept)

  • generate-produces-non-empty-texttests/e2e/generate-produces-non-empty-text.spec.ts
    • signals: stability=stable (1/1 run), coverage=N/A (browser lane), semantic=high
    • intent: CREATE new tests/e2e/generate-produces-non-empty-text.spec.ts
    • evidence: 📸 screenshot
  • greeting-category-vocabularytests/e2e/greeting-category-vocabulary.spec.ts
    • signals: stability=stable, coverage=N/A, semantic=high
    • intent: CREATE new tests/e2e/greeting-category-vocabulary.spec.ts
    • evidence: 📸 screenshot
  • snarky-tone-vocabularytests/e2e/snarky-tone-vocabulary.spec.ts
    • signals: stability=stable, coverage=N/A, semantic=high
    • intent: CREATE new tests/e2e/snarky-tone-vocabulary.spec.ts
    • evidence: 📸 screenshot
  • clear-empties-outputtests/e2e/clear-empties-output.spec.ts
    • signals: stability=stable, coverage=N/A, semantic=high
    • intent: CREATE new tests/e2e/clear-empties-output.spec.ts
    • evidence: 📸 screenshot

Rejected (reject — surfaced for human review)

  • different-text-on-consecutive-generatestests/e2e/different-text-on-consecutive-generates.spec.ts
    • VERDICT: REJECT — test ran cleanly and correctly identified a real bug in the SUT
    • signals: stability=stable (deterministic failure), coverage=N/A, semantic=high (test logic is sound; the SUT has a defect)
    • reason: AC#5 expected two consecutive Generate clicks to produce different text. The SUT’s src/generate.ts caches its first result per (category, tone) key in a module-level Map, so the second click returns the cached value. Test correctly detected this.
    • evidence: 📸 screenshot · 🎥 video.webm · 🔍 trace.zip
    • operator action required: fix the src/generate.ts cache bug, then re-run; the test will then accept.

What this demonstrates about TFactory v0.2.0

  1. Polyglot Planner — read the spec.md + .tfactory.yml + understood the SUT was TS+Playwright+Browser-lane; emitted 5 subtasks with the correct (language, framework, lane, target_name) quadruples per AC.
  2. Per-AC target identification — Planner correctly mapped AC#5 to src/generate.ts::generate (the seeded bug location), AC#1–4 to src/App.tsx::App (UI surface).
  3. Framework Docker runnertfactory-runner-playwright:latest ran the tests with Playwright 1.49 + Chromium against the live Pages URL.
  4. Evidence capture — every test produced a screenshot; the failing AC#5 case additionally produced video.webm + trace.zip for human inspection (per Decision 12 in the design doc).
  5. Evidence-link rendering — this report’s accept/reject rows surface portal-served URLs per the commit 5d8f588 follow-up.

Honest caveats

  • Gen-Functional was NOT used to author the .spec.ts files. The agent’s MVP filter currently processes Lane.UNIT only; the Planner correctly emitted Lane.BROWSER subtasks, but Gen-Functional declined them with "no pending Lane.UNIT subtasks to generate". Browser-lane Gen-Functional is a Phase-2 ramp item.
  • For the demo, the 5 .spec.ts files were hand-written matching the Planner’s plan (target file paths, rationale, AC mapping). The Planner provided the blueprint; a human filled in the bodies. This is a fair representation of how v0.2.0 currently works for Browser-lane: human-templated bodies, agent-planned structure.
  • Evaluator was NOT invoked. Verdicts here are direct readouts of Playwright’s pass/fail status. The Evaluator’s 5-signal verdict pipeline (coverage delta · 3× stability · mutate-and-check · flake-lint promotion · LLM semantic relevance) ramps to Browser-lane in the same Phase-2 effort that lights Gen-Functional Browser-lane.
  • Triager was NOT invoked. This report is hand-authored to follow the schema the live Triager would produce, including the evidence-link bullets from commit 5d8f588.

Reproduce

# Live SUT: https://olafkfreund.github.io/tfactory-demo/
# Source:   https://github.com/olafkfreund/tfactory-demo
docker run --rm   --network=bridge   -v /path/to/tfactory-demo:/repo:ro   -v /path/to/scratch:/scratch   -e TFACTORY_TARGET_URL=https://olafkfreund.github.io/tfactory-demo/   -e NODE_PATH=/usr/lib/node_modules   tfactory-runner-playwright:latest   sh -c "cd /tmp && cp -r /repo/playwright.config.ts /repo/tests . && NODE_PATH=/usr/lib/node_modules npx playwright test"

# Expected: 4 passed + 1 failed (AC#5 — the seeded cache bug).

Reproduce it yourself

Everything below runs against a fresh checkout of TFactory v0.2.0 plus the public demo repo — no private state, no hidden flags. Expected end-to-end time on a developer laptop: ~6-8 minutes.

  1. Clone both repos.
    git clone https://github.com/olafkfreund/TFactory
    git clone https://github.com/olafkfreund/tfactory-demo
    
  2. Install dependencies.
    cd TFactory
    npm run install:all
    cd apps/web-server && uv venv && uv pip install -r requirements.txt
    
  3. Authenticate Claude Code (one-off; uses your subscription, no API key needed):
    claude setup-token
    
  4. Start the backend with auto-fire enabled so the pipeline chains Planner → Gen-Functional → Executor → Evaluator → Triager without manual clicks:
    TFACTORY_AUTO_PLAN=1 \
    TFACTORY_AUTO_GENERATE=1 \
    TFACTORY_AUTO_EVALUATE=1 \
    TFACTORY_AUTO_TRIAGE=1 \
      python -m server.main
    
  5. Seed the AIFactory-style workspace for the demo SUT (snapshots the spec + diff + source.json the Planner reads):
    ./scripts/seed-aifactory-workspace.sh
    
  6. Open the portal at http://localhost:3110, click New Task, pick the tfactory-demo project and spec 001-greeting-generator.

  7. Wait for status=triaged in the LaneStatusGrid (~6-8 min). The portal will surface each phase transition live; the gif at the top of this page is a time-lapse of exactly this wait.

  8. Open the triage report at ~/.tfactory/workspaces/<project_id>/specs/001-greeting-generator/findings/triage_report.md — that’s the same file inlined above.