TFactory v0.2+: Enterprise functional-test framework support
Status: Design — pending user review Date: 2026-05-28 Authored via:
/super-brainstorm(10 locked architectural decisions) Successor to: v0.1.0-mvp walking skeleton
Summary
TFactory v0.1.0-mvp ships a single-framework, single-language walking skeleton: Python + pytest, unit lane only. This design extends TFactory to support the enterprise functional-test landscape across many languages and frameworks. The architecture is browser-first, API/integration-second, unit-last — the inverse of the industry’s “unit-test generation” default. Generated tests persist into the AIFactory repo with a versioned catalog, and the platform ships reusable templates / skills / commands / agent definitions for both the portal and engineers.
The v0.2 release ships Playwright + Jest + pytest as the wedge. Java (JUnit 5), .NET (xUnit), Go, Ruby, and the broader framework universe land in v0.3+ using the same registry-based extension shape.
Scope
In scope (functional + feature testing)
| Category | Examples | Why |
|---|---|---|
| Browser E2E ⭐ headline | Playwright, Cypress, Selenium, Appium | “Use browser automation to test everything” — operator-stated priority |
| API testing | Postman/Newman, REST Assured, Karate, supertest, Tavern | “Test all API, connection” |
| Integration | TestContainers, WireMock, MockServer | “Features, gates” — real deps without browser |
| Contract | Pact, Karate | Microservice integration |
| Feature flag / gate | LaunchDarkly, GrowthBook, OpenFeature SDKs | “Gates” — testing rollouts/dark launches/A-B variants |
| Unit | pytest, JUnit 5, xUnit, Jest, Vitest, Go testing | Last-resort modality |
| Mutation | mutmut, Stryker, PIT | Validates the strength of the functional tests we generate |
| BDD style overlay | Cucumber, SpecFlow, Behave | How enterprise QA teams write tests — a style on top of the above |
Out of scope (operator-stated)
- ❌ SAST / static security analysis — “we have pipelines and other controllers for that”
- ❌ DAST / dynamic security testing
- ❌ Dependency / secrets / container / IaC scanning
- ❌ Performance / load testing (JMeter, k6, Gatling, Locust)
- ❌ Pure security fuzzing (property-based testing IS kept for functional strength)
Deferred (could be revisited in v0.4+)
- ⏸ Visual regression (Percy, Chromatic, Playwright screenshots)
- ⏸ Accessibility testing (axe-core, pa11y, Lighthouse CI)
- ⏸ Mobile E2E (Appium) beyond Playwright’s mobile emulation
Operating model — the test-modality priority
TFactory’s Planner walks down this priority for every feature it tests:
1. Can this be exercised through a BROWSER? → Playwright/Cypress E2E
2. Is it an API/RPC surface? → API test (Postman / REST Assured / etc.)
3. Is it a cross-service / feature-flag gate? → Integration test (TestContainers + flag injection)
4. Otherwise → Unit test (pytest / JUnit / Jest)
Mutation lane is orthogonal: validates that whatever tests were generated actually catch regressions.
This is opposite to the industry default (Diffblue, Meta TestGen-LLM, Qodo all start with unit tests because they’re easier to generate). TFactory bets on the outside-in tests humans hate writing.
Architecture overview
The 4-agent pipeline (Planner → Gen-Functional → Executor → Evaluator → Triager) stays. Five things change:
- Each agent is parameterized over (language, framework) via a registry
- Lane spine is replaced: Browser · API · Integration · Unit · Mutation
- AIFactory projects gain
.tfactory.ymldeclaring target services + runtime - AIFactory repos grow
.tfactory/tests-catalog.jsonfor cross-run continuity - TFactory ships reusable platform artifacts (templates, skills, commands, agent definitions)
┌─────────────────────────────────────────────┐
│ Framework Registry (per-framework YAML) │
│ - detection rules │
│ - runner image │
│ - test path conventions │
│ - templates │
│ - evaluator hooks (preflight/lint/mutate) │
└────────────┬────────────────────────────────┘
│ context injection
AIFactory repo ▼
(.tfactory.yml, ─────► Planner (generic prompt)
.tfactory/ ↓ emits subtasks each carrying (language, framework, lane, target)
tests-catalog.json) ↓
(.claude/skills/ ── Gen-Functional (generic prompt + framework descriptor) ──┐
+commands/ │
+agents/) ↓ writes test file in framework conventions │
↓ │
Executor (per-framework Docker image) │
↓ runs tests against target (URL / k8s / docker-compose) │
↓ │
Evaluator (5 signals; per-language primitives) │
↓ verdicts.json │
↓ │
Triager (dedup + rank + render + git_writer + pr_comment) │
└────► AIFactory repo (tests + tests-catalog.json updates) ←┘
Decision tree (the 10 locked choices)
| # | Decision | Value | Rationale |
|---|---|---|---|
| 1 | Agent parameterization | Generic prompts + framework descriptor registry | Scales to 100+ frameworks; bet on LLM’s broad knowledge |
| 2 | Lane structure | Browser · API · Integration · Unit · Mutation | Aligns to operating-model priority; 5 cards in LaneStatusGrid |
| 3 | Detection timing | Planner picks per-subtask, polyglot native | Enterprise repos are polyglot; subtask-level granularity |
| 4 | Detection method | Manifest+config sniff first, LLM fallback | Fast/deterministic for 90% case, flexible for the rest |
| 5 | Browser app runtime | .tfactory.yml + docker-compose (default), preview URL (override) |
Maintainable; matches modern dev infra |
| 6 | Test artifact lifecycle | .tfactory/tests-catalog.json (in AIFactory repo) + filesystem hybrid |
Cross-run continuity; explicit update-vs-create |
| 7 | Artifact packaging | Monorepo typed dirs + framework-bundled extras + project-local .tfactory/ |
3-tier override; ships in TFactory, project-local for team specifics |
| 8 | v0.2 cut | Playwright + Jest + pytest | Browser-first wedge; two languages; three frameworks; tractable |
| 9 | Browser selector strategy | Role-based → data-testid → CSS (in order) | Robust to redesigns; encourages best practice |
| 10 | Target addressing | Single .tfactory.yml with targets: array |
Subtasks reference by name; auth via env-var indirection |
| 11 | Coverage signal | Per-framework coverage_strategy field; Browser lane = skip (null, not zero) |
Browser tests can’t emit per-test coverage XML; Evaluator must treat null as N/A so browser tests aren’t penalised |
| 12 | Test evidence capture | Screenshots + video + trace + network HAR (per-framework config); workspace-stored + portal-served; failures kept indefinitely, passes purged after retention window | Human reviewers need to SEE what TFactory generated. “I don’t trust an AI test until I’ve watched it run” is the #1 adoption blocker. |
Framework descriptor schema
The central artifact of the new architecture. Every framework TFactory
supports gets a YAML descriptor at frameworks/{name}/descriptor.yaml.
# frameworks/playwright/descriptor.yaml
name: playwright
version_range: ">=1.40.0"
languages: [typescript, javascript, python, csharp, java]
lanes: [browser] # which TFactory lanes this serves
default_lane: browser
# How the Planner detects this framework in a project
detection:
manifests:
- file: package.json
json_path: "$.devDependencies['@playwright/test']"
- file: package.json
json_path: "$.dependencies['@playwright/test']"
configs:
- file: playwright.config.ts
- file: playwright.config.js
test_files:
- pattern: "**/*.spec.ts"
- pattern: "**/*.spec.js"
- pattern: "tests/**/*.ts"
# Where Gen-Functional writes new tests by default
test_path_conventions:
default: tests/e2e/
fallback: e2e/
filename_pattern: "{feature_slug}.spec.ts"
# Docker image carrying browsers + framework
runtime:
image: tfactory-runner-playwright:latest
invocation: "npx playwright test {test_file_path}"
args:
- "--reporter=junit"
- "--reporter-output=/scratch/junit.xml"
# How this framework reports coverage (drives Evaluator's coverage_delta hook)
coverage_strategy: skip # one of: cobertura | v8 | nyc | jacoco | coverlet | skip
# 'skip' = framework doesn't emit per-test
# coverage; Evaluator treats coverage_delta as
# null (NOT zero) for these tests.
# Hooks into Evaluator's per-language primitives
evaluator_hooks:
preflight: typescript_tsc # tsc --noEmit imports resolve
flake_lint: eslint_tfactory # ESLint config tuned for E2E flake patterns
mutate: stryker # Stryker config for one-mutation probe
# Selectors the Gen-Functional prompt prefers
selector_strategy:
prefer: role_based
fallback_chain: [data_testid, css]
# Templates (Jinja-style interpolation)
templates:
- name: login-flow
file: templates/login-flow.spec.ts.tmpl
description: Standard login + protected-page-access scenario
requires_target: web
requires_auth: true
- name: form-submit-validation
file: templates/form-submit-validation.spec.ts.tmpl
description: Submit form, assert validation + success state
- name: api-mocked-flow
file: templates/api-mocked.spec.ts.tmpl
description: E2E with route mocking via page.route()
# Catalog metadata fields this framework needs to track
catalog_fields:
- browsers_tested # chromium / firefox / webkit
- target_ref # which target from .tfactory.yml
- viewport # mobile vs desktop
# LLM context block — appended to Gen-Functional prompt
context_block: |
You are generating Playwright tests in TypeScript.
Idioms to use:
- page.getByRole('button', { name: 'Submit' }) — role-based, preferred
- page.getByTestId('foo') — when the app has data-testid attrs
- expect(page.locator(...)).toBeVisible() — auto-wait, never page.waitForTimeout
- Use storageState for auth (see auth context); never login via UI in tests
Anti-patterns to avoid:
- page.waitForTimeout(N) — replace with auto-wait expectations
- Brittle CSS selectors when role-based works
- UI-driven login (use programmatic auth or storageState instead)
- Hard-coded test data (use fixtures or .tfactory.yml seed scripts)
Per-language Evaluator primitives
The Evaluator’s 5 signals are mostly language-agnostic, but 3 primitives are Python-specific in v0.1 and need per-language analogs:
| Primitive | Python (v0.1) | TypeScript/JS (v0.2) | Java (v0.3) | Go (v0.4+) |
|---|---|---|---|---|
preflight |
subprocess python -c "import X" |
tsc --noEmit (imports + types resolve) |
javac -nowarn |
go build ./... |
flake_lint |
AST scan: 5 patterns | ESLint w/ tfactory ruleset | Checkstyle + PMD ruleset | go vet + custom rules |
mutate_probe |
AST mutates Eq/Bool/Int | Stryker (one mutation, dry-run) | PIT (one mutation) | gomutator or skip |
Stability + LLM semantic relevance are language-agnostic and unchanged.
Coverage delta is language-agnostic in mechanism (Cobertura XML parser) but not framework-agnostic: it requires the runner to emit a coverage XML file. Per lane:
| Lane | Coverage signal | Why |
|---|---|---|
| Unit (pytest, Jest, JUnit, xUnit, Go test) | ✅ Cobertura via --cov / nyc / JaCoCo / Coverlet / go test -cover |
Native support across all major frameworks |
| API (supertest, REST Assured, httpx) | ✅ Cobertura via the unit framework underneath | Same coverage tooling as unit |
| Integration (TestContainers) | ✅ Cobertura via unit framework | Same |
| Browser (Playwright, Cypress) | ❌ N/A | Browser tests instrument the served app, not the test code; coverage instrumentation requires runtime probes (V8 coverage API + nyc) that don’t fit the per-test model |
| Mutation | N/A (different lane semantics) | Mutation result IS the verdict |
For Browser lane tests, the Evaluator receives coverage_delta = null in
the per-test signal bundle. The evaluator.md prompt is updated to
interpret null as “not applicable for this lane” (NOT “zero”). The
verdict-priority rules in evaluator.md skip the coverage_delta rule when
it’s null and rely on the remaining 4 signals.
Implementer note — Browser lane tests scored as low-value just because coverage was zero is the #1 risk this fix prevents. See Task 10 in the v0.2 task list.
Target / service addressing — .tfactory.yml schema
A versioned config file at the AIFactory repo root that declares where the system-under-test lives + how to start it locally. Subtasks reference targets by name.
# AIFactory-repo/.tfactory.yml
version: 1
# Default target for all subtasks unless they override
default_target: web-local
# Targets the Planner can reference for Browser / API / Integration subtasks
targets:
# === HTTP target (browser or REST API) ===
- name: web-staging
type: http
url: https://staging.example.com
healthcheck:
path: /healthz
expect_status: 200
timeout_seconds: 30
auth:
type: bearer
token_env: STAGING_TOKEN
selectors_hint: data_testid # this app has data-testid throughout
# === API target (REST) ===
- name: api-staging
type: http
base_url: https://api.staging.example.com
openapi_spec: openapi.yaml # for API test generation
auth:
type: oauth2_client_credentials
token_url: https://auth.example.com/oauth/token
client_id_env: API_CLIENT_ID
client_secret_env: API_CLIENT_SECRET
scope: read:resources
# === Kubernetes target (port-forward) ===
- name: in-cluster
type: kubernetes
context: my-prod-cluster
namespace: staging
service: frontend
port: 80
port_forward: true
auth:
type: serviceaccount
kubeconfig_env: KUBECONFIG
# === Local docker-compose (Decision 5 default) ===
- name: web-local
type: docker-compose
compose_file: docker-compose.yml
services_to_start: [frontend, backend, db]
wait_for:
- url: http://localhost:3000
path: /
auth:
type: test_user
seed_command: "scripts/seed-test-user.sh"
# === Feature flag overlay (v0.3+) ===
- name: feature-flag-overlay
type: feature_flag
provider: launchdarkly
sdk_key_env: LD_SDK_KEY
test_user_targeting:
user_id: tfactory-test-user
attributes:
plan: enterprise
# Test data seeding + reset
test_data:
seed_command: "scripts/seed.sh"
reset_command: "scripts/reset.sh"
isolation: per_task # per_task | per_subtask | none
# Where TFactory's generated tests land (overrides framework descriptor defaults)
test_paths:
browser: tests/e2e/
api: tests/api/
integration: tests/integration/
unit: tests/unit/
Test artifact lifecycle — catalog + filesystem hybrid
.tfactory/tests-catalog.json (in AIFactory repo)
{
"version": 1,
"updated_at": "2026-05-28T12:00:00Z",
"tests": [
{
"test_id": "ac1-login-flow",
"test_file": "tests/e2e/login-flow.spec.ts",
"framework": "playwright",
"lane": "browser",
"language": "typescript",
"covers_acs": ["AC#1: User can log in with valid credentials"],
"generated_at": "2026-05-28T10:30:00Z",
"generated_by_task": "042-session-expiry",
"last_verdict": "accept",
"browsers_tested": ["chromium"],
"target_ref": "web-staging",
"operator_locked": false
}
]
}
Update-vs-create policy (Triager)
For each candidate test from Gen-Functional:
matches = catalog.lookup_by_ac(candidate.ac_id)
if matches and matches[0].operator_locked:
→ SKIP (operator marked "don't regenerate")
elif len(matches) == 1:
→ UPDATE in place (same file path, increment generation_version)
elif len(matches) > 1:
→ flag as catalog ambiguity; pick the most-recent + warn the operator
in the triage report
else:
→ CREATE new file in framework-conventional path
→ ADD entry to catalog
AC-match algorithm (catalog.lookup_by_ac)
The catalog’s covers_acs is an array of strings, not a primary key. The
lookup uses a deterministic 3-step match in priority order:
def lookup_by_ac(catalog, candidate_ac: str) -> list[CatalogEntry]:
# 1. Exact match — best signal
exact = [e for e in catalog.tests if candidate_ac in e.covers_acs]
if exact:
return exact
# 2. AC-id prefix match — covers "AC#1: …" style where prefix is stable
# candidate_ac="AC#1: login expiry" matches stored "AC#1: login flow"
ac_id = candidate_ac.split(':', 1)[0].strip() # e.g. "AC#1"
if ac_id:
prefix = [e for e in catalog.tests
if any(s.startswith(ac_id) for s in e.covers_acs)]
if prefix:
return prefix
# 3. Empty — no match
return []
Embedding-similarity matching is explicitly out of scope for v0.2 — it
would couple the Triager to an embedding model and add a network call per
catalog lookup. Exact + prefix match handles >95% of real-world cases (the
AC-id is the stable handle); the operator handles the rest via
operator_locked flags or manual edits.
Existing-test discovery (Planner)
Before emitting subtasks, the Planner:
- Reads
.tfactory/tests-catalog.json(authoritative) - Walks
test_pathsfrom.tfactory.ymllooking for untracked test files - For untracked files, LLM-judges: which AC does this cover?
- Adds discovered tests to the working catalog
- Subtasks for an AC reference the existing test (UPDATE path) if any
Platform deliverables
The TFactory repo grows top-level directories for reusable artifacts that the portal AND engineers consume:
TFactory/
├── frameworks/ # Framework registry (Decision 1)
│ ├── playwright/
│ │ ├── descriptor.yaml
│ │ └── templates/
│ │ ├── login-flow.spec.ts.tmpl
│ │ ├── form-submit-validation.spec.ts.tmpl
│ │ └── api-mocked.spec.ts.tmpl
│ ├── jest/
│ ├── pytest/
│ ├── junit5/ # v0.3
│ └── ...
│
├── templates/ # Cross-framework templates
│ ├── tfactory-yml/
│ │ └── starter.yaml.tmpl # `.tfactory.yml` starter for new projects
│ └── catalog/
│ └── starter.json.tmpl # `.tfactory/tests-catalog.json` starter
│
├── skills/ # Claude Code skill bundles
│ ├── handover-to-tfactory/ # already shipped
│ ├── tfactory-add-test/ # NEW: engineer adds ONE test
│ ├── tfactory-from-template/ # NEW: pick template, fill vars, drop
│ ├── tfactory-strengthen-tests/ # NEW: mutation-probe an existing test
│ ├── tfactory-explain-coverage/ # NEW: query catalog, show gaps
│ └── tfactory-init/ # NEW: scaffold .tfactory.yml + catalog
│
├── commands/ # Slash commands (often wrap skills)
│ ├── tfactory-generate-e2e.md
│ ├── tfactory-generate-api.md
│ ├── tfactory-from-template.md
│ └── tfactory-explain-coverage.md
│
├── agents/ # Distributable agent definitions
│ ├── planner.md # Repackaged for engineer use in their own Claude
│ ├── gen-functional.md
│ ├── evaluator.md
│ └── triager.md
│
└── apps/backend/ # Internal pipeline (unchanged structure)
Authoring layers (Decision 7)
| Layer | Where | Who edits | When merged |
|---|---|---|---|
| Shipped | TFactory repo frameworks/, templates/, etc. |
TFactory maintainers via PR | At TFactory release |
| Project-local | AIFactory repo .tfactory/templates/, etc. |
Team via PR to their own repo | Per AIFactory commit |
| User-local | ~/.tfactory/templates/, etc. |
Individual engineer | Per engineer’s preferences |
Portal + Gen-Functional merge all 3 at runtime; project-local overrides shipped; user-local overrides project-local.
Browser lane — concrete implementation details
Selectors (Decision 9)
The Gen-Functional prompt for the Browser lane MUST instruct:
Selector priority (try each in order; stop at first match):
1. page.getByRole('<role>', { name: '<accessible-name>' })
← always prefer this
2. page.getByTestId('<id>')
← when the app has data-testid attributes (check existing markup)
3. page.getByLabel('<label>'), page.getByPlaceholder('<text>')
← for form inputs
4. page.locator('css-selector')
← LAST RESORT; emit a TODO comment for the human reviewer
NEVER generate XPath selectors.
NEVER generate `page.locator('.brittle-css-class')` without a TODO.
Auth + storage state
Browser tests share an auth fixture across the suite via Playwright’s
storageState:
// tests/e2e/auth-setup.ts (generated once per project by `tfactory-init` skill)
import { test as setup } from '@playwright/test';
setup('authenticate', async ({ page }) => {
// .tfactory.yml's auth block tells us HOW
await page.goto(process.env.TFACTORY_WEB_URL);
await page.getByLabel('Email').fill(process.env.TFACTORY_TEST_EMAIL);
await page.getByLabel('Password').fill(process.env.TFACTORY_TEST_PASSWORD);
await page.getByRole('button', { name: 'Sign in' }).click();
await page.waitForURL('**/dashboard');
await page.context().storageState({ path: '.auth/state.json' });
});
// All other tests reuse `.auth/state.json` via playwright.config.ts
Page Object pattern
When the catalog shows >1 test exercising the same page, Gen-Functional emits a Page Object alongside:
// tests/e2e/pages/LoginPage.ts (template-driven)
import { Page, Locator } from '@playwright/test';
export class LoginPage {
readonly page: Page;
readonly emailInput: Locator;
readonly passwordInput: Locator;
readonly submitButton: Locator;
constructor(page: Page) {
this.page = page;
this.emailInput = page.getByLabel('Email');
this.passwordInput = page.getByLabel('Password');
this.submitButton = page.getByRole('button', { name: 'Sign in' });
}
async loginAs(email: string, password: string) {
await this.emailInput.fill(email);
await this.passwordInput.fill(password);
await this.submitButton.click();
}
}
Wait strategy
The flake-lint analog (ESLint config) for the Browser lane bans:
page.waitForTimeout(N)— replace withexpect(...).toBeVisible({ timeout: N })setTimeout()in test bodies — replace withexpect.poll()or auto-wait- Hardcoded sleeps of any kind
Cross-browser
v0.2: Chromium only in CI; Firefox + WebKit opt-in via .tfactory.yml:
browser_matrix:
- chromium # always
- firefox # opt-in
- webkit # opt-in
Evidence capture (Decision 12)
The single highest-value differentiator for AI-generated tests is letting humans see what was generated, running, before they trust it. TFactory captures evidence by default on Browser/API/Integration lanes.
What’s captured per lane:
| Lane | Screenshots | Video | Trace / HAR | Native? |
|---|---|---|---|---|
| Browser (Playwright) | ✅ auto-on-failure + per-step opt-in | ✅ retain-on-failure (default), always (opt-in) | ✅ trace.zip with screenshots-per-action + network log | Yes, all native |
| Browser (Cypress) | ✅ auto-on-failure | ✅ always (Cypress default) | ⚠️ via plugin | Mostly native |
| API (supertest/REST Assured/httpx) | N/A | N/A | ✅ HTTP request/response log (JSON-lines) | Wrapped via TFactory’s record_http helper |
| Integration (TestContainers) | N/A | N/A | ✅ container stdout/stderr + HTTP HAR | Wrapped |
| Unit (Jest/pytest/JUnit) | N/A (unless playwright-component style) |
N/A | ✅ stdout/stderr only | None |
| Mutation | Inherits the test’s evidence capture; mutation-failure screenshot retained |
Per-framework descriptor field:
# frameworks/playwright/descriptor.yaml
evidence:
screenshot: native # native | wrapped | none
screenshot_policy: on-failure # on-failure | always | never (operator override in .tfactory.yml)
video: native
video_policy: retain-on-failure # always | retain-on-failure | never
trace: native
trace_policy: on-first-retry # always | on-first-retry | never
network_har: native
Storage location (workspace-first):
Evidence artifacts land at:
~/.tfactory/workspaces/<proj>/specs/<spec>/findings/evidence/
<test_id>/
screenshots/
step-01-form-loaded.png
step-02-submitted.png
failure-final-state.png # only when test failed
video.webm # if video_policy fires
trace.zip # Playwright trace
network.har # network log
metadata.json # links + sizes + timings
Portal serves these via a new endpoint (Task 14):
GET /api/tfactory/tasks/<spec_id>/evidence/<test_id>/<artifact>.
The catalog references evidence:
{
"test_id": "ac1-login-flow",
"test_file": "tests/e2e/login-flow.spec.ts",
"last_verdict": "accept",
"last_evidence_run_id": "run-2026-05-28-1430",
"evidence_urls": {
"screenshots": ["step-01-...png", "step-02-...png"],
"video": "video.webm",
"trace": "trace.zip"
}
}
Retention policy (configurable in .tfactory.yml):
evidence_retention:
failures: forever # never delete failure evidence
flagged: 90_days # keep flagged-test evidence 3 months
passing: 7_days # passing tests purged after 7 days
size_cap_per_task: 500MB # if exceeded, prune oldest-first
Triager integration (PR comment):
The triage report links to evidence inline:
## Test results
| Test | Verdict | Evidence |
|---|---|---|
| `ac1-login-flow` | ✅ accept | [screenshots](https://portal/.../ac1/screenshots) · [video (32s)](https://portal/.../ac1/video.webm) |
| `ac2-checkout` | ❌ reject | [failure screenshot](https://portal/.../ac2/screenshots/failure.png) · [trace.zip](https://portal/.../ac2/trace.zip) |
| `ac3-search-filter` | ⚠️ flag | [video (18s)](https://portal/.../ac3/video.webm) — see frame at 0:12 for unexpected scroll |
Why this matters for adoption:
- Reviewer can watch a 30-second video and decide “yes this is the right test” in seconds vs reading 200 lines of Playwright TS
- Failures are debuggable WITHOUT reproducing locally — the trace.zip contains every DOM snapshot + every network request
- API test “evidence” (request/response JSONs) lets a reviewer verify the test actually exercised the contract they care about
API lane — concrete implementation details
The Planner picks a framework based on language:
| Language | Default API framework | Why |
|---|---|---|
| TypeScript/JS | supertest + Jest | Fits naturally with Jest, no extra runner needed |
| Python | httpx + pytest | Modern, async, plays well with pytest |
| Java | REST Assured | De facto enterprise standard |
| .NET | xUnit + HttpClient | Idiomatic .NET |
| Go | net/http/httptest | Built-in |
OpenAPI spec (declared in .tfactory.yml target’s openapi_spec) drives:
- Endpoint discovery
- Schema validation
- Example payloads (via
$.examples)
Contract testing (Pact) gets its own subtask type when the Planner detects
microservice topology in docker-compose.yml.
Integration lane — concrete implementation details
Uses TestContainers (multi-language) to spin up real dependencies for tests:
# Generated by Gen-Functional in the Integration lane (pytest example)
from testcontainers.postgres import PostgresContainer
def test_user_creation_persists_to_db():
with PostgresContainer("postgres:16") as pg:
# Subtask context block tells the LLM the connection string env var
# for the SUT to point at pg.get_connection_url()
...
Feature flag testing gets a sub-modality: TFactory injects flag state via the provider’s SDK then exercises the protected code path:
// Generated for a LaunchDarkly-gated feature
import { LDClient } from 'launchdarkly-node-server-sdk';
test('new checkout flow activates when feature flag is on', async () => {
const ld = LDClient.initialize(process.env.LD_SDK_KEY);
ld.variation('new-checkout-flow', testUser, false /* default */);
// ... exercise the flag-gated path
});
Roadmap
v0.2 (next release after MVP)
Theme: First multi-lane, multi-framework release. Browser-first wedge.
| Lane | Framework | Status |
|---|---|---|
| Browser | Playwright (TS) | NEW |
| Unit | Jest (TS) | NEW |
| Unit | pytest (Python) | Carried from v0.1 |
| API | (deferred to v0.3) | — |
| Integration | (deferred to v0.3) | — |
| Mutation | (deferred to v0.3) | — |
Platform deliverables in v0.2:
- Framework registry skeleton + 3 descriptors
.tfactory.ymlschema v1 (targets:,test_paths:,test_data:).tfactory/tests-catalog.jsonschema v1- New skills:
tfactory-init,tfactory-add-test,tfactory-from-template - Browser app runtime via docker-compose
LaneStatusGridupdated to new lane vocabulary
v0.3
Theme: Add API + Integration lanes. Add Java + .NET.
| Lane | Framework |
|---|---|
| API | Postman/Newman, REST Assured (Java), supertest (TS), httpx (Python) |
| Integration | TestContainers (multi-language), WireMock (Java) |
| Browser | Cypress added (alongside Playwright) |
| Unit | JUnit 5 (Java), xUnit (.NET), Vitest (TS) |
| Mutation | Stryker (TS), PIT (Java) |
Platform deliverables:
- New skills:
tfactory-explain-coverage,tfactory-strengthen-tests - Per-language Evaluator primitives (tsc-preflight, ESLint-lint, javac-preflight, Checkstyle-lint)
- Multi-target
.tfactory.yml(kubernetes + http alongside docker-compose)
v0.4+
Theme: Long tail of enterprise frameworks. Go, Ruby, Rust, Kotlin, PHP. Mobile via Appium. Visual regression (Percy/Chromatic). Accessibility (axe-core).
Each new framework = ~1 descriptor + 1 Docker image + 3-5 templates. Adds in isolation without touching the rest of the pipeline.
Migration from v0.1.0-mvp
Breaking changes:
- Lane enum changes:
Lane.FUNCTIONAL→Lane.UNIT.Lane.SAST/DAST/FUZZ→ removed (deprecated; emit warning if seen).Lane.BROWSER/API/INTEGRATIONadded. LaneStatusGridrenames: Functional → Unit, SAST/DAST → Browser/API, Fuzz → Integration.prompts/gen_functional.md→prompts/gen-functional.md+ add `` placeholder.agents/gen_functional.py→agents/gen_functional.pyextended to dispatch on subtask’s (language, framework, lane) tuple.workspaces/snapshotter.pyextended to read.tfactory.ymlif present.tests-catalog.jsonintroduced at AIFactory repo root.
Non-breaking:
- Existing pytest tests generated by v0.1 still work — the registry handles the old single-framework world as a default.
Migration path for v0.1 users:
# Run once per AIFactory project after upgrading TFactory
$ tfactory init # creates .tfactory.yml + .tfactory/tests-catalog.json
$ tfactory migrate v0_1_catalog # walks existing tests/, populates catalog
Risks and open questions
Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Playwright Docker image is huge (~1.5GB) | Medium | Cache aggressively in CI; offer a “bring your own image” option |
| Browser tests need a running app — many enterprises don’t have docker-compose | Medium | .tfactory.yml targets: supports external URLs as alternative |
Auth/secret management in .tfactory.yml env-var indirection — risk of leaks |
High | Document strict policy: never log target tokens, redact in PR comments |
| Cross-language Evaluator primitives drift from each other | Medium | Per-language test suite for each primitive; common shape required |
| Framework registry becomes a coordination bottleneck | Low | Project-local .tfactory/frameworks/ for team extensions |
| LLM hallucinates Playwright APIs (especially newer ones) | Medium | Tight context block + reference templates; nightly LLM eval on a known repo |
| BDD style (Cucumber) doesn’t map cleanly to subtask shape | Medium | Defer to v0.4; for now Cucumber tests aren’t generated by TFactory |
Open questions
Should TFactory generate Page Objects from day 1, or only when >1 test exists on the same page?→ DECIDED in this spec: Page Objects emitted by Gen-Functional only when the catalog has >1 test on the same page (template ships in v0.2 either way).- How does TFactory handle authentication for
.tfactory.ymltargets that need OAuth? → Document the env-var indirection pattern; defer interactive OAuth flows to v0.3. - What’s the test isolation default — per-task, per-subtask, none? → Default per-task; declarable in
.tfactory.yml. - Should mutation-probe ALWAYS run for Browser tests? → No — expensive (re-runs the whole E2E). Sample-based: 1 in N browser tests get mutation-probed.
- What’s the upgrade story when a framework descriptor changes? → Version field in descriptor; catalog records
generated_with_descriptor_version; warn on mismatch.
What v0.2 implementation needs (next steps)
This design is the architectural foundation. The next step is writing an implementation plan that breaks v0.2 into tasks similar to the v0.1 12-task shape. Tentative task breakdown (~15 tasks):
Task 0 — Lane rename + breaking-change migration (must land first; gates everything else):
- Rename
Lane.FUNCTIONAL→Lane.UNITinapps/backend/test_plan/lane.py - Add
Lane.BROWSER,Lane.API,Lane.INTEGRATION(keepLane.MUTATION) - Remove deprecated
Lane.SAST,Lane.DAST,Lane.FUZZ(with a one-release deprecation warning if seen in old plans) - Update
lane_dispatch.py:_MVP_LIT_LANES,_LANE_PHASESdict - Update
lang_registry.py:_LANE_KEYStuple - Update
task_control.py:_MVP_LANEStuple - Update tests with literal lane lists:
tests/test_test_plan_lane.py(asserts["functional","sast","dast","fuzz","mutation"])tests/test_lane_dispatch.py(asserts"sast"raises with"phase 3")tests/test_lang_registry.py(iterates("functional","sast","deps","secrets","mutation"))
- Update frontend:
LaneStatusGrid.tsxTypeScript union typeLaneStatusGrid.test.tsx:51array assertion- All UI labels Functional → Unit, etc.
- Framework registry data model + loader — descriptor schema validator
.tfactory.ymlschema + parser + validator — includingtargets:,test_paths:,test_data:,wait_for:blocks.tfactory/tests-catalog.jsonschema + read/write helpers — including the AC-match algorithm (see Open Question 2 — must be resolved here)- Snapshotter extended to read
.tfactory.yml+ catalog - Planner extended for per-subtask
(language, framework, lane, target) - Gen-Functional refactored — generic prompt + framework context injection
- Per-framework Docker images: playwright, jest, pytest (last one exists)
- Browser-lane app runtime — implement the health-poll loop:
- Snapshotter parses
target.wait_for[].url+target.wait_for[].path - Executor’s docker-compose orchestrator runs
compose_fileservices - Health-poll: HTTP HEAD against each
wait_forURL every 2s for up to 120s - Failure mode: emit
status=executor_failedwithphase=app_not_healthy- the unreached URLs + the last response codes
- Teardown:
docker compose downafter EACH subtask (or per-task pertest_data.isolation)
- Snapshotter parses
- Evaluator per-language primitives: tsc-preflight (TS), ESLint-lint (TS/JS), Stryker-mutate (TS/JS). Python primitives are already shipped in v0.1.
- Evaluator coverage adapter for non-Cobertura signals — see Decision 11 below (coverage_delta is N/A for Browser lane; the Evaluator prompt is updated to treat null coverage as “not applicable”, not “zero”)
- Triager update-vs-create policy + catalog mutation — uses the resolved AC-match algorithm from task 3
- Templates — Playwright + Jest + pytest starter set (login-flow, form-submit, api-mocked, fixtures, etc.)
- Skills —
tfactory-init,tfactory-add-test,tfactory-from-template- the existing
handover-to-tfactoryskill updated for the new schema
- the existing
- Portal — new endpoint surface for templates / skills / catalogs; catalog browser + “which ACs aren’t covered” view
LaneStatusGridreskin + new lane vocabulary + the migration CLI helpertfactory init+tfactory migrate v0_1_catalog
Appendix: full framework catalogue
(Reference — what’s in scope for v0.4+ rollout)
Unit / functional (xUnit family)
| Language | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Python | pytest ✅, unittest | doctest | nose2 |
| JS/TS | Jest, Vitest, Mocha+Chai | Jasmine, Tape, AVA | node:test |
| Java | JUnit 5, JUnit 4, TestNG | Spock | JBehave |
| .NET | xUnit, NUnit | MSTest | Fixie |
| Go | testing + Testify | Ginkgo + Gomega | gocheck |
| Ruby | RSpec, Minitest | Test::Unit | |
| Rust | cargo test | proptest, quickcheck | |
| Kotlin | JUnit 5, Kotest | Spek | |
| Scala | ScalaTest | MUnit, Specs2 | uTest |
| PHP | PHPUnit | Pest, Codeception | |
| Swift | XCTest, Swift Testing | Quick + Nimble | |
| C/C++ | GoogleTest, Catch2 | Boost.Test, Doctest | CppUnit |
| Elixir | ExUnit |
Browser E2E
Playwright, Cypress, Selenium WebDriver, WebdriverIO, Puppeteer, Appium, TestCafe
API
Postman/Newman, REST Assured, Karate, supertest, Tavern, Pact, TestContainers, Bruno
BDD
Cucumber, SpecFlow, Behave, Godog, Karate
Mutation
mutmut, mutpy, Stryker (multi-target), PIT, mull
Property-based / generative (functional strength)
Hypothesis, fast-check, jqwik, FsCheck, proptest, quickcheck, PropEr
Mocking / assertion libraries (paired with unit frameworks)
Mockito, AssertJ, Hamcrest (Java) · Moq, NSubstitute, FluentAssertions (.NET) · Sinon, jest mocks, MSW (JS) · RSpec mocks (Ruby) · gomock, testify/mock (Go) · unittest.mock, pytest-mock, freezegun (Python)
Spec metadata
- Architectural decisions locked: 12 (added Decision 12 — test evidence capture)
- Frameworks catalogued: ~80
- Frameworks in v0.2: 3 (Playwright, Jest, pytest)
- Estimated v0.2 task count: 16 (was 15 — added Task 16: evidence capture + portal viewer)
- Predecessor spec: TFactory v0.1.0-mvp walking-skeleton design
- Reviewer round 1: 3 blocking + 3 advisory issues found, all addressed
- Late requirements folded in: test evidence capture (screenshots + video + trace + HAR)