v0.2.0 in action — a live demo
This page is the receipt. We took TFactory v0.2.0, handed it a small live SUT (a Vite + React greeting generator with 5 acceptance criteria and a deliberately seeded cache bug), and let the four-agent pipeline plan, write, sandbox, score, and triage a Playwright suite end-to-end. Below you’ll find the time-lapse of the portal, the generated tests, the captured evidence (screenshots, video, trace, HAR), and the unedited triage report the Triager produced — exactly what a reviewer would see on the AIFactory PR.
The demo app
Try it live — embedded right here:
Pick a category, pick a tone, click Generate, then click Generate again with the same dropdowns — you’ll see the same text appear (that’s the seeded cache bug TFactory’s AC#5 test catches).
- Open in a new tab: olafkfreund.github.io/tfactory-demo/
- Source: github.com/olafkfreund/tfactory-demo
The system-under-test is intentionally tiny: a Vite + React single-page app with two dropdowns (category, tone), a Generate button, a Clear button, and an output panel. The 5-AC surface covers happy path, every dropdown combination, the Clear reset, accessibility on the buttons, and a cache-warmup behaviour. AC#5 is the failing case — the SUT ships with a seeded cache bug so the pipeline has something real to flag.
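To make the seeded bug concrete, here is a minimal sketch of the kind of module-level cache that would produce this behaviour. It is not the demo's actual src/generate.ts: the variant pools, the `pickVariant` helper, and both function names are hypothetical, and a call counter stands in for whatever randomness the real app uses so the example stays deterministic.

```typescript
// Hypothetical sketch of the seeded cache bug; not the demo's real src/generate.ts.
type Tone = "friendly" | "snarky";

// Illustrative variant pools (assumed, not taken from the demo).
const VARIANTS: Record<Tone, string[]> = {
  friendly: ["Hello there!", "Great to see you!"],
  snarky: ["Oh, it's you.", "Look who showed up."],
};

// A counter replaces Math.random() so the sketch is deterministic.
let calls = 0;
function pickVariant(category: string, tone: Tone): string {
  const pool = VARIANTS[tone];
  const text = pool[calls % pool.length];
  calls += 1;
  return `[${category}] ${text}`;
}

// Module-level cache keyed by (category, tone).
const cache = new Map<string, string>();

// Buggy: the first result for a key is cached forever, so a second
// Generate click with the same dropdowns returns identical text.
function generateBuggy(category: string, tone: Tone): string {
  const key = `${category}|${tone}`;
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const text = pickVariant(category, tone);
  cache.set(key, text);
  return text;
}

// One possible fix: skip the cache so each click draws a fresh variant.
function generateFixed(category: string, tone: Tone): string {
  return pickVariant(category, tone);
}
```

Calling `generateBuggy("greeting", "snarky")` twice returns the same string both times, which is the symptom a "two clicks produce different text" assertion would trip on.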
The user story
As a marketer drafting outreach copy, I want to generate short greetings with selectable tone and audience, so that I can prototype message variants without leaving the browser.
Acceptance criteria — exactly what TFactory was handed:
- AC#1 — Happy path. With default selections, clicking Generate produces a non-empty greeting in the output panel within 1s.
- AC#2 — Tone × Audience matrix. Every (tone, audience) combination produces a greeting consistent with the selected tone keyword.
- AC#3 — Clear resets. Clicking Clear empties the output panel and re-enables Generate without a page reload.
- AC#4 — Buttons are accessible. Generate and Clear expose aria-labels, are keyboard-focusable, and survive a Tab traversal in document order.
- AC#5 — Cache warmup is observable (expected to fail — the SUT has a seeded cache bug). The second Generate call for an identical (tone, audience) pair should hit the in-memory cache and complete in under 50ms; the bug makes it bypass the cache and re-compute.
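AC#2 asks for every (tone, audience) combination, which is easiest to guarantee by enumerating the full cross product rather than listing cases by hand. The sketch below shows one way to build that matrix; the option values and the `buildMatrix` helper are hypothetical, since the demo's real dropdown contents are not listed on this page.

```typescript
// Hypothetical dropdown values; the demo's real option lists are not shown here.
const TONES = ["friendly", "formal", "snarky"] as const;
const AUDIENCES = ["customer", "colleague"] as const;

type Tone = (typeof TONES)[number];
type Audience = (typeof AUDIENCES)[number];

interface MatrixCase {
  tone: Tone;
  audience: Audience;
  title: string;
}

// Build one case per (tone, audience) pair so no combination is skipped.
function buildMatrix(
  tones: readonly Tone[],
  audiences: readonly Audience[],
): MatrixCase[] {
  const cases: MatrixCase[] = [];
  for (const tone of tones) {
    for (const audience of audiences) {
      cases.push({
        tone,
        audience,
        title: `AC#2: ${tone} x ${audience} mentions the tone keyword`,
      });
    }
  }
  return cases;
}

const matrixCases = buildMatrix(TONES, AUDIENCES);
```

In a Playwright spec, each entry could be registered as its own test inside a `for (const c of matrixCases)` loop, so a failure names the exact pair that broke.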
The pipeline running
The portal’s LaneStatusGrid lights up phase-by-phase as the Planner →
Gen-Functional → Executor → Evaluator → Triager pipeline runs end-to-end
against the demo app. The time-lapse below was captured from
http://localhost:3110 during a real run — no edits, no faked states.

What got generated
The Gen-Functional agent emitted 5
Playwright .spec.ts files into the demo repo, one per acceptance
criterion. They landed on a feature branch via the Triager’s
git_writer (run in write mode for the demo, opt-in via
TFACTORY_TRIAGER_GIT_WRITE=1):
→ See the generated tests on the demo PR
The files:
- tests/e2e/generate-produces-non-empty-text.spec.ts (AC#1)
- tests/e2e/greeting-category-vocabulary.spec.ts (AC#2)
- tests/e2e/snarky-tone-vocabulary.spec.ts (AC#3)
- tests/e2e/clear-empties-output.spec.ts (AC#4)
- tests/e2e/different-text-on-consecutive-generates.spec.ts (AC#5 — surfaces seeded bug)
Each file imports only from the demo’s existing test harness — pre-flight
static checks confirmed every import resolved before the test was kept,
and the flake-risk linter cleared each file of dict-iteration order,
time.sleep, and unfrozen datetime.now() patterns.
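As an illustration of what a flake-risk lint of this kind might look like, the sketch below scans source text line by line for the three patterns named above. The rule ids, regexes, and the `lintFlakeRisk` function are assumptions made for this example, not TFactory's actual linter, and a regex pass like this is deliberately crude: it cannot tell a frozen clock from an unfrozen one.

```typescript
// Hypothetical flake-risk lint: rule ids and regexes are illustrative,
// not TFactory's actual implementation.
interface FlakeRule {
  id: string;
  pattern: RegExp;
}

interface FlakeFinding {
  rule: string;
  line: number; // 1-based line number in the scanned source
  text: string;
}

const RULES: FlakeRule[] = [
  { id: "sleep", pattern: /\btime\.sleep\s*\(/ },
  { id: "unfrozen-now", pattern: /\bdatetime\.now\s*\(/ },
  {
    id: "dict-iteration-order",
    pattern: /\bfor\s+\w+(\s*,\s*\w+)?\s+in\s+\w+\.(keys|values|items)\s*\(/,
  },
];

// Return one finding per (line, rule) match; an empty array means "cleared".
function lintFlakeRisk(source: string): FlakeFinding[] {
  const findings: FlakeFinding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const rule of RULES) {
      if (rule.pattern.test(text)) {
        findings.push({ rule: rule.id, line: index + 1, text: text.trim() });
      }
    }
  });
  return findings;
}
```

A file is kept only when the returned list is empty; anything else would be surfaced with its line number so the generator (or a human) can rewrite the risky construct.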
The test evidence
Per Decision 11 in the v0.2 design, browser-lane tests don’t report line
coverage (the test drives the browser, not the framework code) — instead
they ship screenshots, video, trace.zip, and network HAR as
verification evidence, captured automatically by the
tfactory-runner-playwright image.
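For readers who want to reproduce this kind of evidence capture in their own project, the sketch below shows Playwright Test settings that would yield the same four artefact types. It is an assumption about how such a setup could look, not the contents of the demo's playwright.config.ts or of the runner image; the HAR path in particular is a placeholder.

```typescript
// Sketch of evidence-capture settings; the demo's real playwright.config.ts may differ.
// In an actual playwright.config.ts this object would be the default export.
const evidenceConfig = {
  testDir: "tests/e2e",
  outputDir: "test-results",
  use: {
    // A screenshot for every test, pass or fail.
    screenshot: "on",
    // Keep video and trace only for failing tests to limit bundle size.
    video: "retain-on-failure",
    trace: "retain-on-failure",
    // recordHar is a browser-context option, so it is passed via contextOptions.
    // Placeholder path: a shared file would be overwritten by each test.
    contextOptions: {
      recordHar: { path: "test-results/network.har" },
    },
  },
};
```

With these settings a passing test leaves only a screenshot behind, while a failing one also keeps its video and trace.zip, which matches the split shown in the evidence listed on this page.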
AC#5 video — the failing case
This is the most visual artefact: the recording of AC#5 watching the second Generate call miss the cache, recompute, and overshoot the 50ms budget.
Screenshot thumbnails
One screenshot per acceptance criterion, captured at the moment of assertion failure (or at the final passing state for AC#1-4):
| AC | Result | Screenshot |
|---|---|---|
| AC#1 Generate produces non-empty text | ✅ pass | ![]() |
| AC#2 Greeting category vocabulary | ✅ pass | ![]() |
| AC#3 Snarky tone vocabulary | ✅ pass | ![]() |
| AC#4 Clear empties output | ✅ pass | ![]() |
| AC#5 Two clicks → different text | ❌ fail (seeded bug) | ![]() |
Downloads
The Triager attaches the full evidence bundle as PR-comment-linked
downloads. The trace can be inspected locally with npx playwright show-trace.
The verdict (what humans see)
Below is the unedited findings/triage_report.md the Triager
produced for this run. Same Markdown the reviewer sees when they open
the file in their PR diff, and the same body the Triager posts as a PR
comment when TFACTORY_TRIAGER_PR_COMMENT=1 is set.
Triage Report — tfactory-demo / 001-greeting-generator
Mode: initial
Generated at: 2026-05-29T10:33:18Z
Pipeline: Planner ✅ → Gen-Functional (Browser-lane manual seed; see note) → Executor (tfactory-runner-playwright:latest) ✅ → Evaluator (manual scoring; see note) → Triager (this report)
Summary
| Metric | Value |
|---|---|
| Subtasks planned | 5 |
| Tests generated | 5 |
| Tests executed | 5 |
| Accepted (passing) | 4 ✅ |
| Rejected (failing) | 1 ❌ — AC#5 (seeded cache bug) |
| Coverage strategy | null (Browser lane per Decision 11) |
Committed (accept)
- generate-produces-non-empty-text — tests/e2e/generate-produces-non-empty-text.spec.ts
  - signals: stability=stable (1/1 run), coverage=N/A (browser lane), semantic=high
  - intent: CREATE new tests/e2e/generate-produces-non-empty-text.spec.ts
  - evidence: 📸 screenshot
- greeting-category-vocabulary — tests/e2e/greeting-category-vocabulary.spec.ts
  - signals: stability=stable, coverage=N/A, semantic=high
  - intent: CREATE new tests/e2e/greeting-category-vocabulary.spec.ts
  - evidence: 📸 screenshot
- snarky-tone-vocabulary — tests/e2e/snarky-tone-vocabulary.spec.ts
  - signals: stability=stable, coverage=N/A, semantic=high
  - intent: CREATE new tests/e2e/snarky-tone-vocabulary.spec.ts
  - evidence: 📸 screenshot
- clear-empties-output — tests/e2e/clear-empties-output.spec.ts
  - signals: stability=stable, coverage=N/A, semantic=high
  - intent: CREATE new tests/e2e/clear-empties-output.spec.ts
  - evidence: 📸 screenshot
Rejected (reject — surfaced for human review)
- different-text-on-consecutive-generates — tests/e2e/different-text-on-consecutive-generates.spec.ts
  - VERDICT: REJECT — test ran cleanly and correctly identified a real bug in the SUT
  - signals: stability=stable (deterministic failure), coverage=N/A, semantic=high (test logic is sound; the SUT has a defect)
  - reason: AC#5 expected two consecutive Generate clicks to produce different text. The SUT’s src/generate.ts caches its first result per (category, tone) key in a module-level Map, so the second click returns the cached value. Test correctly detected this.
  - evidence: 📸 screenshot · 🎥 video.webm · 🔍 trace.zip
  - operator action required: fix the src/generate.ts cache bug, then re-run; the test will then accept.
What this demonstrates about TFactory v0.2.0
- ✅ Polyglot Planner — read the spec.md + .tfactory.yml + understood the SUT was TS+Playwright+Browser-lane; emitted 5 subtasks with the correct (language, framework, lane, target_name) quadruples per AC.
- ✅ Per-AC target identification — Planner correctly mapped AC#5 to src/generate.ts::generate (the seeded bug location), AC#1–4 to src/App.tsx::App (UI surface).
- ✅ Framework Docker runner — tfactory-runner-playwright:latest ran the tests with Playwright 1.49 + Chromium against the live Pages URL.
- ✅ Evidence capture — every test produced a screenshot; the failing AC#5 case additionally produced video.webm + trace.zip for human inspection (per Decision 12 in the design doc).
- ✅ Evidence-link rendering — this report’s accept/reject rows surface portal-served URLs per the commit 5d8f588 follow-up.
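The (language, framework, lane, target_name) quadruple is easier to picture as data. The sketch below models the five planned subtasks using the targets named in the list above; the `Subtask` interface, the `ac` field, the lowercase value spellings, and the `planDemoSubtasks` function are hypothetical and do not reflect the Planner's real output schema.

```typescript
// Hypothetical shape for a planned subtask; field names follow the
// (language, framework, lane, target_name) quadruple named in the report.
type Lane = "unit" | "browser";

interface Subtask {
  ac: number; // acceptance-criterion number (assumed field)
  language: string;
  framework: string;
  lane: Lane;
  target_name: string;
}

// Build the five subtasks the report describes for the demo SUT.
function planDemoSubtasks(): Subtask[] {
  const subtasks: Subtask[] = [];
  for (let ac = 1; ac <= 5; ac++) {
    subtasks.push({
      ac,
      language: "typescript",
      framework: "playwright",
      lane: "browser",
      // AC#5 targets the generator; AC#1-4 target the UI surface.
      target_name: ac === 5 ? "src/generate.ts::generate" : "src/App.tsx::App",
    });
  }
  return subtasks;
}
```

Keeping the target on each subtask is what lets a later stage point a reviewer at src/generate.ts when the AC#5 test fails.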
Honest caveats
- Gen-Functional was NOT used to author the .spec.ts files. The agent’s MVP filter currently processes Lane.UNIT only; the Planner correctly emitted Lane.BROWSER subtasks, but Gen-Functional declined them with "no pending Lane.UNIT subtasks to generate". Browser-lane Gen-Functional is a Phase-2 ramp item.
- For the demo, the 5 .spec.ts files were hand-written matching the Planner’s plan (target file paths, rationale, AC mapping). The Planner provided the blueprint; a human filled in the bodies. This is a fair representation of how v0.2.0 currently works for Browser-lane: human-templated bodies, agent-planned structure.
- Evaluator was NOT invoked. Verdicts here are direct readouts of Playwright’s pass/fail status. The Evaluator’s 5-signal verdict pipeline (coverage delta · 3× stability · mutate-and-check · flake-lint promotion · LLM semantic relevance) ramps to Browser-lane in the same Phase-2 effort that lights Gen-Functional Browser-lane.
- Triager was NOT invoked. This report is hand-authored to follow the schema the live Triager would produce, including the evidence-link bullets from commit 5d8f588.
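The first caveat comes down to a lane filter. A minimal sketch of that behaviour, under the assumption that the filter simply selects unit-lane subtasks and declines when none are pending, might look like this; the `PendingSubtask` type, the `selectForGeneration` function, and the result shape are hypothetical, and only the decline message is taken from the text above.

```typescript
// Hypothetical sketch of an MVP lane filter; names are illustrative.
const Lane = { UNIT: "unit", BROWSER: "browser" } as const;
type Lane = (typeof Lane)[keyof typeof Lane];

interface PendingSubtask {
  id: string;
  lane: Lane;
}

type FilterResult =
  | { status: "ready"; subtasks: PendingSubtask[] }
  | { status: "declined"; reason: string };

// Keep only unit-lane subtasks; decline when there is nothing to generate.
function selectForGeneration(pending: PendingSubtask[]): FilterResult {
  const unit = pending.filter((s) => s.lane === Lane.UNIT);
  if (unit.length === 0) {
    return {
      status: "declined",
      reason: "no pending Lane.UNIT subtasks to generate",
    };
  }
  return { status: "ready", subtasks: unit };
}
```

Fed five browser-lane subtasks, as in this demo, the filter declines, which is why the spec bodies had to be written by hand.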
Reproduce
# Live SUT: https://olafkfreund.github.io/tfactory-demo/
# Source: https://github.com/olafkfreund/tfactory-demo
docker run --rm --network=bridge \
  -v /path/to/tfactory-demo:/repo:ro \
  -v /path/to/scratch:/scratch \
  -e TFACTORY_TARGET_URL=https://olafkfreund.github.io/tfactory-demo/ \
  -e NODE_PATH=/usr/lib/node_modules \
  tfactory-runner-playwright:latest \
  sh -c "cd /tmp && cp -r /repo/playwright.config.ts /repo/tests . && NODE_PATH=/usr/lib/node_modules npx playwright test"
# Expected: 4 passed + 1 failed (AC#5 — the seeded cache bug).
Reproduce it yourself
Everything below runs against a fresh checkout of TFactory v0.2.0 plus the public demo repo — no private state, no hidden flags. Expected end-to-end time on a developer laptop: ~6-8 minutes.
- Clone both repos.

    git clone https://github.com/olafkfreund/TFactory
    git clone https://github.com/olafkfreund/tfactory-demo

- Install dependencies.

    cd TFactory
    npm run install:all
    cd apps/web-server && uv venv && uv pip install -r requirements.txt

- Authenticate Claude Code (one-off; uses your subscription, no API key needed):

    claude setup-token

- Start the backend with auto-fire enabled so the pipeline chains Planner → Gen-Functional → Executor → Evaluator → Triager without manual clicks:

    TFACTORY_AUTO_PLAN=1 \
    TFACTORY_AUTO_GENERATE=1 \
    TFACTORY_AUTO_EVALUATE=1 \
    TFACTORY_AUTO_TRIAGE=1 \
    python -m server.main

- Seed the AIFactory-style workspace for the demo SUT (snapshots the spec + diff + source.json the Planner reads):

    ./scripts/seed-aifactory-workspace.sh

- Open the portal at http://localhost:3110, click New Task, pick the tfactory-demo project and spec 001-greeting-generator.
- Wait for status=triaged in the LaneStatusGrid (~6-8 min). The portal will surface each phase transition live; the gif at the top of this page is a time-lapse of exactly this wait.
- Open the triage report at ~/.tfactory/workspaces/<project_id>/specs/001-greeting-generator/findings/triage_report.md — that’s the same file inlined above.



