GAN-inspired multi-agent system that autonomously builds full-stack web apps from a single prompt using Claude AI agents
Idle Harness is an autonomous multi-agent coding system inspired by GAN (Generative Adversarial Network) architecture. It takes a short natural-language prompt and automatically generates a complete full-stack web application (frontend, backend, database, and styling) without human intervention.
The system orchestrates three specialized AI agents (Planner, Generator, and Evaluator) that collaborate through a structured build-evaluate-iterate loop. Like a GAN's generator-discriminator dynamic, the Generator builds the application while the Evaluator tests it as a real user would, without ever reading the source code. This adversarial relationship drives quality: the Generator can't cut corners, because the Evaluator will catch it.
Built on Anthropic's harness design for long-running apps and powered by the Claude Agent SDK.
git clone https://github.com/jhlee0409/idle-harness.git
cd idle-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Interactive setup: configures auth automatically
python orchestrator.py --setup
# Build an app
python orchestrator.py "A tarot reading web app with card-draw animations and AI interpretations"
# After the build: start the app
python orchestrator.py serve

That's it. If anything is missing, the harness detects it and offers to fix it automatically.
Idle Harness includes a built-in interactive setup that detects and configures all dependencies.
Just run the harness. If dependencies are missing, it offers to fix them:
$ python orchestrator.py "my app idea"
Preflight checks
✓ claude_agent_sdk
✓ node (v20.11.0)
✓ npm (10.2.4)
✓ git (2.43.0)
✗ auth: No auth configured
✓ MCP: playwright (SDK-managed via npx)
Fix 1 issue(s) automatically? [Y/n]: y
Choose auth method:
[o] OAuth login (uses subscription quota)
[a] API key (pay per use, no quota limit)
Choose: o
→ Running: claude login
✓ OAuth authenticated
All issues fixed
python orchestrator.py --setup

Runs all checks and auto-fixes everything without asking.
CI=1 python orchestrator.py "my app idea"

Fails hard with exit code 1 if any dependency is missing. No interactive prompts.
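The CI behavior can be sketched as follows. This is a minimal sketch, not the harness's actual internals: `run_checks` and the check names here are illustrative, and the real list and fixers live in orchestrator.py.

```python
import os
import shutil
import sys

def run_checks():
    """Illustrative subset of the preflight checks (hypothetical names;
    the real checks live in orchestrator.py)."""
    return {
        "node": shutil.which("node") is not None,
        "npm": shutil.which("npm") is not None,
        "git": shutil.which("git") is not None,
    }

def preflight():
    """Collect failing checks, then fail hard under CI or offer a fix."""
    failures = [name for name, ok in run_checks().items() if not ok]
    if not failures:
        return []
    if os.environ.get("CI"):
        # CI=1: no interactive prompts, exit with code 1
        print("Missing: " + ", ".join(failures), file=sys.stderr)
        sys.exit(1)
    answer = input(f"Fix {len(failures)} issue(s) automatically? [Y/n]: ")
    if answer.strip().lower() in ("", "y"):
        pass  # the real harness would run pip install / claude login here
    return failures
```

The key design point is the `CI` branch: in automation there is no TTY to prompt on, so a missing dependency must become a nonzero exit code rather than a hung process.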
# Start the last-built app (opens browser automatically)
python orchestrator.py serve
# Clean runtime artifacts for a fresh run
python orchestrator.py clean
# Clean everything including generated apps
python orchestrator.py clean --all

| Check | Auto-fixable | How |
|---|---|---|
| `claude_agent_sdk` | Yes | `pip install claude-agent-sdk` |
| `node`, `npm`, `git` | No | Prints install link |
| Claude CLI | No | Prints install link |
| Auth (OAuth or API key) | Yes | `claude login` or API key input |
| Playwright MCP | Automatic | SDK launches via `npx`; no user config needed |
| Method | How to configure | Cost model |
|---|---|---|
| OAuth | `claude login` (interactive setup handles this) | Uses subscription quota (Pro/Max plan) |
| API key | Set `ANTHROPIC_API_KEY` env var | Pay per token, no quota limit |
User Prompt (1-4 sentences)
     ↓
┌─────────┐     ┌───────────┐     ┌───────────┐
│ Planner │ ──▶ │ Generator │ ◀─▶ │ Evaluator │
│         │     │           │     │           │
│ Spec    │     │ React+TS  │     │ Browser   │
│ Design  │     │ Vite      │     │ Testing   │
│ Language│     │ FastAPI   │     │ Screenshot│
│         │     │ SQLite    │     │ Grading   │
└─────────┘     └───────────┘     └───────────┘
     ↓
Build → Evaluate → Feedback Loop (max 3 rounds)
- Plan → Planner reads the frontend design skill, then expands the prompt into a full product spec with visual design language
- Negotiate → Generator and Evaluator negotiate sprint contracts with testable criteria
- Build → Generator implements the full-stack app in TypeScript + FastAPI, writes and runs tests (a continuous session preserves context across retries)
- Evaluate → Evaluator tests the running app via Playwright, grading on product depth, functionality, visual design, and code quality
- Iterate → On FAIL, feedback is returned to the Generator for another attempt (up to 3 rounds)
- Integration → After all sprints, a final cross-sprint evaluation verifies the complete application works together
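The build-evaluate-iterate loop above reduces to a simple control flow. In this sketch, `generate` and `evaluate` are stand-ins for the Generator and Evaluator agent calls, passed in as parameters to keep the example self-contained; the real orchestration lives in orchestrator.py.

```python
def build_with_feedback(prompt, generate, evaluate, max_attempts=3):
    """Build→evaluate→iterate loop; max_attempts mirrors the
    max_build_attempts setting in config.py."""
    feedback = None
    app = None
    for _ in range(max_attempts):
        app = generate(prompt, feedback)   # Generator builds (or repairs)
        result = evaluate(app)             # Evaluator tests the running app
        if result["verdict"] == "PASS":
            return app, result
        feedback = result["feedback"]      # on FAIL, feed issues back
    return app, result                     # best effort after max_attempts
```

Note that the Generator's second attempt receives the Evaluator's feedback, not a fresh prompt; this is what makes each round an improvement rather than a retry from scratch.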
The Evaluator never reads source code. It can only interact with the running application through a browser: clicking buttons, filling forms, taking screenshots. This mirrors how a GAN's discriminator only sees the output, never the generator's internals. The result: the Generator must produce genuinely working software, not just code that looks correct.
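The black-box boundary can be made concrete: the evaluator function receives only a URL, never a source path, so it can judge nothing but observable behavior. A minimal stdlib sketch of that boundary (the real Evaluator drives a full browser through the Playwright MCP server, not raw HTTP, and its checks are far richer):

```python
import urllib.request

def evaluate_running_app(base_url):
    """Judge the app purely by what it serves at base_url.
    No filesystem access to the generated code is possible here."""
    checks = []
    try:
        with urllib.request.urlopen(base_url, timeout=5) as resp:
            reachable = resp.status == 200
    except OSError:
        reachable = False
    checks.append({"name": "app responds", "passed": reachable})
    verdict = "PASS" if all(c["passed"] for c in checks) else "FAIL"
    return {"url": base_url, "checks": checks, "verdict": verdict}
```

Because the function's only input is a URL, a Generator that ships stubs or fake data cannot pass by making the code look plausible; the behavior itself is what gets graded.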
| Agent | Role | Key Behavior |
|---|---|---|
| Planner | Prompt β Product Spec | Reads frontend design skill, defines visual design language, explores AI integration, high-level technical design (no implementation details) |
| Generator | Spec β Full-Stack Implementation | React+Vite+TypeScript+FastAPI+SQLite, writes tests (pytest+vitest), self-evaluates before handoff |
| Evaluator | Browser-Tests the Running App | Never reads source code (GAN principle), screenshot evidence, detects stubs/fakes, grades on 4 full-stack criteria |
| Criterion | Weight | Description |
|---|---|---|
| Product Depth | High | Are features complete and real, or surface-level stubs? |
| Functionality | High | Do core interactions work end-to-end with database persistence? |
| Visual Design | Normal | Does the app match the spec's visual design language? |
| Code Quality | Normal | Stability, error handling, edge case behavior |
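One way to read the weights is that High criteria count double Normal ones. A hypothetical aggregation under that assumption (the actual rubric lives in agents/evaluator.md, and the 0.7 pass threshold here is illustrative, not the harness's):

```python
# Illustrative weights: "High" criteria count double "Normal" ones.
WEIGHTS = {
    "product_depth": 2,
    "functionality": 2,
    "visual_design": 1,
    "code_quality": 1,
}

def grade(scores, pass_threshold=0.7):
    """Combine per-criterion scores in [0.0, 1.0] into a verdict.
    pass_threshold is a hypothetical cutoff, not the harness's."""
    total = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    ratio = total / sum(WEIGHTS.values())
    return {"score": round(ratio, 2),
            "verdict": "PASS" if ratio >= pass_threshold else "FAIL"}
```

Under this weighting, an app with strong visuals but shallow features still fails, since product depth and functionality dominate the total.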
Editable in config.py:
| Setting | Default | Description |
|---|---|---|
| `mode` | `full` | `full` (sprints + contracts + iteration) / `simple` (single build + eval) |
| `max_build_attempts` | `3` | Max build→evaluate retry rounds |
| `max_negotiation_rounds` | `3` | Max contract negotiation rounds |
| `generator_max_turns` | `200` | Max turns for Generator agent |
| `dev_server_url` | `http://localhost:5173` | Frontend server URL |
| `mcp_tool` | `playwright` | Evaluator browser testing tool |
idle-harness/
├── orchestrator.py              # Main orchestration loop + preflight + setup
├── cli.py                       # Claude Agent SDK wrapper
├── config.py                    # Settings (mode, servers, limits)
├── state.py                     # State management (status.json)
├── server.py                    # Dev server start/stop
├── sprint.py                    # Sprint parsing
├── agents/
│   ├── planner.md               # Planner system prompt
│   ├── generator.md             # Generator system prompt
│   ├── evaluator.md             # Evaluator system prompt
│   └── frontend-design-skill.md # Design skill (Planner reads at runtime)
├── tests/                       # pytest tests
├── comms/                       # Runtime artifacts (spec, contracts, evaluations)
└── output/                      # Generated applications
Any full-stack web application that can be described in a few sentences. Examples: a tarot reading app with AI interpretations, an AI-powered bookmark manager, a recipe finder with dietary filters, a personal finance tracker, a kanban board with drag-and-drop.
Most AI code generators produce code in a single pass. Idle Harness uses a multi-agent adversarial loop: one agent builds, another independently evaluates the running application (not the code), and feedback drives iterative improvement. This is closer to how a development team works β with separate roles for implementation and quality assurance.
Typical: 30 minutes to 2 hours depending on complexity. A 4-sprint app with retries can take 3+ hours but produces significantly better results than a single-pass generation.
Depends on your auth method:
- OAuth (subscription): Uses your Claude Pro/Max quota. A typical 2-sprint app uses roughly 30-60 minutes of agent time.
- API key: Pay per token. A typical run costs $50-200 depending on complexity and retries.
Yes. simple mode builds the entire app in one pass and runs a single evaluation. It still retries up to 3 times on failure, but skips sprint decomposition and contract negotiation. Good for simpler apps or faster iteration.
MIT
