
Idle Harness

GAN-inspired multi-agent system that autonomously builds full-stack web apps from a single prompt using Claude AI agents

What is Idle Harness?

Idle Harness is an autonomous multi-agent coding system inspired by GAN (Generative Adversarial Network) architecture. It takes a short natural-language prompt and automatically generates a complete full-stack web application — frontend, backend, database, and styling — without human intervention.

The system orchestrates three specialized AI agents (Planner, Generator, and Evaluator) that collaborate through a structured build-evaluate-iterate loop. Like a GAN's generator-discriminator dynamic, the Generator builds the application while the Evaluator tests it as a real user would — without ever reading the source code. This adversarial relationship drives quality: the Generator can't cut corners because the Evaluator will catch it.

Built on Anthropic's harness design for long-running apps and powered by the Claude Agent SDK.

Quick Start

git clone https://github.com/jhlee0409/idle-harness.git
cd idle-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Interactive setup — configures auth automatically
python orchestrator.py --setup

# Build an app
python orchestrator.py "A tarot reading web app with card-draw animations and AI interpretations"

# After build β€” start the app
python orchestrator.py serve

That's it. If anything is missing, the harness detects it and offers to fix it automatically.

Setup

Idle Harness includes a built-in interactive setup that detects and configures all dependencies.

Option A: Auto-detect on first run

Just run the harness. If dependencies are missing, it offers to fix them:

$ python orchestrator.py "my app idea"

Preflight checks
  ✓ claude_agent_sdk
  ✓ node (v20.11.0)
  ✓ npm (10.2.4)
  ✓ git (2.43.0)
  ✗ auth — No auth configured
  ✓ MCP: playwright (SDK-managed via npx)

  Fix 1 issue(s) automatically? [Y/n]: y

  Choose auth method:
    [o] OAuth login (uses subscription quota)
    [a] API key (pay per use, no quota limit)
  Choose: o
  → Running: claude login
  ✓ OAuth authenticated

All issues fixed

Option B: Explicit setup

python orchestrator.py --setup

Runs all checks and auto-fixes everything without asking.

Option C: CI / Non-interactive

CI=1 python orchestrator.py "my app idea"

Fails hard with exit code 1 if any dependency is missing. No interactive prompts.
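
A minimal sketch of this CI gate, assuming a list of check objects with `name` and `ok` attributes (the function name `enforce_ci` is illustrative, not the harness's actual API):

```python
import os
import sys


def enforce_ci(checks) -> None:
    """With CI=1 in the environment, exit with code 1 on any failed
    check instead of prompting interactively."""
    failed = [c.name for c in checks if not c.ok]
    if failed and os.environ.get("CI"):
        print("Missing dependencies:", ", ".join(failed))
        sys.exit(1)
```

Outside CI, the same failures would instead trigger the interactive fix flow shown above.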

Other commands

# Start the last-built app (opens browser automatically)
python orchestrator.py serve

# Clean runtime artifacts for a fresh run
python orchestrator.py clean

# Clean everything including generated apps
python orchestrator.py clean --all

What gets checked

| Check | Auto-fixable | How |
| --- | --- | --- |
| claude_agent_sdk | Yes | pip install claude-agent-sdk |
| node, npm, git | No | Prints install link |
| Claude CLI | No | Prints install link |
| Auth (OAuth or API key) | Yes | claude login or API key input |
| Playwright MCP | Automatic | SDK launches via npx — no user config needed |
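
The checks in this table can be sketched roughly as follows. This is an assumption-laden illustration, not the harness's actual code: `Check` and `run_preflight` are invented names, and only the pip-installable SDK gets an auto-fix command, matching the table.

```python
import shutil
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    name: str
    ok: bool
    fix: Optional[str]  # shell command if auto-fixable, else None


def run_preflight() -> list:
    """Probe each external tool with shutil.which; probe the SDK by import."""
    checks = []
    for tool in ("node", "npm", "git"):
        checks.append(Check(tool, shutil.which(tool) is not None, None))
    try:
        import claude_agent_sdk  # noqa: F401
        sdk_ok = True
    except ImportError:
        sdk_ok = False
    checks.append(Check("claude_agent_sdk", sdk_ok, "pip install claude-agent-sdk"))
    return checks


def auto_fix(check: Check) -> bool:
    """Run the fix command for a failed, auto-fixable check."""
    if check.ok or check.fix is None:
        return check.ok
    return subprocess.run(check.fix.split(), check=False).returncode == 0
```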

Authentication options

| Method | How to configure | Cost model |
| --- | --- | --- |
| OAuth | claude login (interactive setup handles this) | Uses subscription quota (Pro/Max plan) |
| API key | Set ANTHROPIC_API_KEY env var | Pay per token, no quota limit |
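
The precedence between the two methods can be sketched like this. The function name `resolve_auth` is illustrative; only `ANTHROPIC_API_KEY` and `claude login` come from the table above.

```python
import os
import shutil
import subprocess


def resolve_auth(env=os.environ) -> str:
    """An explicit API key wins; otherwise fall back to the Claude CLI's
    interactive OAuth login."""
    if env.get("ANTHROPIC_API_KEY"):
        return "api_key"  # pay per token
    if shutil.which("claude"):
        subprocess.run(["claude", "login"], check=True)  # interactive OAuth
        return "oauth"  # subscription quota
    raise RuntimeError(
        "No auth configured: set ANTHROPIC_API_KEY or install the Claude CLI"
    )
```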

How It Works

User Prompt (1-4 sentences)
    ↓
┌─────────┐     ┌───────────┐     ┌───────────┐
│ Planner │ ──→ │ Generator │ ←─→ │ Evaluator │
│         │     │           │     │           │
│ Spec    │     │ React+TS  │     │ Browser   │
│ Design  │     │ Vite      │     │ Testing   │
│ Language│     │ FastAPI   │     │ Screenshot│
│         │     │ SQLite    │     │ Grading   │
└─────────┘     └───────────┘     └───────────┘
                      ↕
              Build → Evaluate → Feedback Loop (max 3 rounds)
  1. Plan — Planner reads the frontend design skill, then expands the prompt into a full product spec with visual design language
  2. Negotiate — Generator and Evaluator negotiate sprint contracts with testable criteria
  3. Build — Generator implements the full-stack app in TypeScript + FastAPI, writes and runs tests (continuous session preserves context across retries)
  4. Evaluate — Evaluator tests the running app via Playwright, grading on product depth, functionality, visual design, and code quality
  5. Iterate — On FAIL, feedback is returned to the Generator for another attempt (up to 3 rounds)
  6. Integration — After all sprints, a final cross-sprint evaluation verifies the complete application works together
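
The build-evaluate-iterate core of these steps reduces to a small control loop. This is a sketch, not the orchestrator's actual code: `generate(spec, feedback)` and `evaluate()` stand in for the Generator and Evaluator agent calls.

```python
def build_loop(spec: str, generate, evaluate, max_attempts: int = 3) -> dict:
    """Run up to max_attempts build -> evaluate rounds, feeding the
    Evaluator's failure feedback back into the next Generator attempt."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        generate(spec, feedback)       # Generator builds / patches the app
        verdict = evaluate()           # Evaluator browser-tests the running app
        if verdict["result"] == "PASS":
            return {"status": "PASS", "attempts": attempt}
        feedback = verdict["feedback"]  # failures drive the next round
    return {"status": "FAIL", "attempts": max_attempts}
```

In `full` mode this loop runs once per sprint contract; in `simple` mode it runs once for the whole app.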

The GAN Principle

The Evaluator never reads source code. It can only interact with the running application through a browser — clicking buttons, filling forms, taking screenshots. This mirrors how a GAN's discriminator only sees the output, never the generator's internals. The result: the Generator must produce genuinely working software, not just code that looks correct.

Agents

| Agent | Role | Key Behavior |
| --- | --- | --- |
| Planner | Prompt → Product Spec | Reads frontend design skill, defines visual design language, explores AI integration, high-level technical design (no implementation details) |
| Generator | Spec → Full-Stack Implementation | React+Vite+TypeScript+FastAPI+SQLite, writes tests (pytest+vitest), self-evaluates before handoff |
| Evaluator | Browser-Tests the Running App | Never reads source code (GAN principle), screenshot evidence, detects stubs/fakes, grades on 4 full-stack criteria |

Evaluation Criteria

| Criterion | Weight | Description |
| --- | --- | --- |
| Product Depth | High | Are features complete and real, or surface-level stubs? |
| Functionality | High | Do core interactions work end-to-end with database persistence? |
| Visual Design | Normal | Does the app match the spec's visual design language? |
| Code Quality | Normal | Stability, error handling, edge case behavior |
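
One way to picture the High/Normal weighting is a weighted average over per-criterion scores. The numeric weights (2 vs 1) and the pass threshold here are assumptions for illustration, not the harness's actual rubric.

```python
# Assumed weights: High criteria count double. Not the real rubric.
WEIGHTS = {
    "product_depth": 2,
    "functionality": 2,
    "visual_design": 1,
    "code_quality": 1,
}


def grade(scores: dict, threshold: float = 0.7) -> str:
    """Weighted average of per-criterion scores in [0, 1]; PASS if it
    clears the (assumed) threshold."""
    total = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    avg = total / sum(WEIGHTS.values())
    return "PASS" if avg >= threshold else "FAIL"
```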

Configuration

Editable in config.py:

| Setting | Default | Description |
| --- | --- | --- |
| mode | full | full (sprints + contracts + iteration) / simple (single build + eval) |
| max_build_attempts | 3 | Max build→evaluate retry rounds |
| max_negotiation_rounds | 3 | Max contract negotiation rounds |
| generator_max_turns | 200 | Max turns for Generator agent |
| dev_server_url | http://localhost:5173 | Frontend server URL |
| mcp_tool | playwright | Evaluator browser testing tool |
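
As a sketch, these settings map naturally onto a dataclass; the field names and defaults follow the table above, though the actual shape of config.py may differ.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Illustrative mirror of the documented config.py settings."""
    mode: str = "full"                 # "full" or "simple"
    max_build_attempts: int = 3        # build -> evaluate retry rounds
    max_negotiation_rounds: int = 3    # contract negotiation rounds
    generator_max_turns: int = 200     # Generator agent turn budget
    dev_server_url: str = "http://localhost:5173"
    mcp_tool: str = "playwright"       # Evaluator browser testing tool
```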

Project Structure

idle-harness/
├── orchestrator.py      # Main orchestration loop + preflight + setup
├── cli.py               # Claude Agent SDK wrapper
├── config.py            # Settings (mode, servers, limits)
├── state.py             # State management (status.json)
├── server.py            # Dev server start/stop
├── sprint.py            # Sprint parsing
├── agents/
│   ├── planner.md                # Planner system prompt
│   ├── generator.md              # Generator system prompt
│   ├── evaluator.md              # Evaluator system prompt
│   └── frontend-design-skill.md  # Design skill (Planner reads at runtime)
├── tests/               # pytest tests
├── comms/               # Runtime artifacts (spec, contracts, evaluations)
└── output/              # Generated applications

FAQ

What can I build with Idle Harness?

Any full-stack web application that can be described in a few sentences. Examples: a tarot reading app with AI interpretations, an AI-powered bookmark manager, a recipe finder with dietary filters, a personal finance tracker, a kanban board with drag-and-drop.

How is this different from other AI code generators?

Most AI code generators produce code in a single pass. Idle Harness uses a multi-agent adversarial loop: one agent builds, another independently evaluates the running application (not the code), and feedback drives iterative improvement. This is closer to how a development team works — with separate roles for implementation and quality assurance.

How long does it take?

Typical: 30 minutes to 2 hours depending on complexity. A 4-sprint app with retries can take 3+ hours but produces significantly better results than a single-pass generation.

What does it cost?

Depends on your auth method:

  • OAuth (subscription): Uses your Claude Pro/Max quota. A typical 2-sprint app uses roughly 30-60 minutes of agent time.
  • API key: Pay per token. A typical run costs $50-200 depending on complexity and retries.

Does the simple mode skip sprints?

Yes. simple mode builds the entire app in one pass and runs a single evaluation. It still retries up to 3 times on failure, but skips sprint decomposition and contract negotiation. Good for simpler apps or faster iteration.

License

MIT

Release History

| Version | Changes | Urgency | Date |
| --- | --- | --- | --- |
| main@2026-04-18 | Latest activity on main branch | High | 4/18/2026 |
| 0.0.0 | No release found — using repo HEAD | High | 4/9/2026 |

