
autoresearch

Claude Autoresearch Skill — Autonomous goal-directed iteration for Claude Code. Inspired by Karpathy's autoresearch. Modify → Verify → Keep/Discard → Repeat forever.



README


      PLAN              LOOP             DEBUG              FIX            SECURE            SHIP
 ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
 │   Goal   │     │  Modify  │     │   Find   │     │   Fix    │     │  STRIDE  │     │  Stage   │
 │  Metric  │────▶│  Verify  │────▶│   Bugs   │────▶│  Errors  │────▶│  OWASP   │────▶│  Deploy  │
 │  Scope   │     │  Keep/   │     │  Trace   │     │  Repair  │     │  Red     │     │ Release  │
 └──────────┘     │  Discard │     └──────────┘     └──────────┘     │  Team    │     └──────────┘
/autoresearch:    └──────────┘    /autoresearch:    /autoresearch:   └──────────┘    /autoresearch:
  plan            /autoresearch     debug              fix          /autoresearch:      ship
                                                                     security

                  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
                  │ Scenario │     │ Predict  │     │  Learn   │     │  Reason  │
                  │   Edge   │     │ 5-Expert │     │   Docs   │     │  Debate  │
                  │   Cases  │     │  Swarm   │     │   Gen    │     │ Converge │
                  └──────────┘     └──────────┘     └──────────┘     └──────────┘
                 /autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:
                   scenario         predict           learn           reason

Why This Exists

Karpathy's autoresearch demonstrated that a 630-line Python script could autonomously improve ML models overnight — 100 experiments per night — by following simple principles: one metric, constrained scope, fast verification, automatic rollback, git as memory.

Claude Autoresearch generalizes these principles beyond ML to ANY domain with a number you can measure: code, content, marketing, sales, HR, DevOps, and more.


How It Works

LOOP (FOREVER or N times):
  1. Review current state + git history + results log
  2. Pick the next change (based on what worked, what failed, what's untried)
  3. Make ONE focused change
  4. Git commit (before verification)
  5. Run mechanical verification (tests, benchmarks, scores)
  6. If improved → keep. If worse → git revert. If crashed → fix or skip.
  7. Log the result
  8. Repeat. Never stop until you interrupt (or N iterations complete).

Every improvement stacks. Every failure auto-reverts. Progress is logged in TSV format.
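The loop above can be sketched in a few lines. This is a minimal in-memory sketch, not the skill's implementation: it assumes a single numeric metric, and a plain variable stands in for git commit/revert. The names `autoresearch_loop`, `propose`, and `verify` are hypothetical.

```python
from typing import Callable, List, Tuple


def autoresearch_loop(
    state: float,
    propose: Callable[[float, int], float],
    verify: Callable[[float], float],
    iterations: int,
    higher_is_better: bool = True,
) -> Tuple[float, float, List[Tuple[int, float, str]]]:
    """Simulate modify -> verify -> keep/discard for a bounded number of iterations.

    `best_state` plays the role of the last kept commit; "reverting" simply
    means not replacing it with the candidate.
    """
    best_state = state
    best_metric = verify(state)  # iteration 0: establish the baseline
    log = [(0, best_metric, "baseline")]
    for i in range(1, iterations + 1):
        candidate = propose(best_state, i)  # ONE focused change
        metric = verify(candidate)          # mechanical verification only
        if higher_is_better:
            improved = metric > best_metric
        else:
            improved = metric < best_metric
        if improved:
            best_state, best_metric = candidate, metric
            log.append((i, metric, "keep"))
        else:
            log.append((i, metric, "discard"))  # revert: best_state unchanged
    return best_state, best_metric, log
```

Crash handling and git history are deliberately left out so the keep/discard logic stays visible; an unbounded run would replace `range(...)` with `while True`.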

The Setup Phase

Before looping, Claude performs a one-time setup:

  1. Read context — reads all in-scope files
  2. Define goal — extracts or asks for a mechanical metric
  3. Define scope — which files can be modified vs read-only
  4. Establish baseline — runs verification on current state (iteration #0)
  5. Confirm and go — shows setup, then begins the loop

8 Critical Rules

# Rule
1 Loop until done — unbounded: forever. Bounded: N times then summarize
2 Read before write — understand full context before modifying
3 One change per iteration — atomic changes. If it breaks, you know why
4 Mechanical verification only — no subjective "looks good." Use metrics
5 Automatic rollback — failed changes revert instantly
6 Simplicity wins — equal results + less code = KEEP
7 Git is memory — experiments committed with experiment: prefix, git revert preserves failed experiments in history, agent MUST read git log + git diff before each iteration
8 When stuck, think harder — re-read, combine near-misses, try radical changes
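Rules 4 to 6 reduce to a small decision function. The sketch below is illustrative only: it assumes lines of code (`old_loc`/`new_loc`) as the simplicity measure and an optional `tolerance` for "equal" results. Both are hypothetical parameters, not documented settings.

```python
def decide(old_metric: float, new_metric: float, old_loc: int, new_loc: int,
           higher_is_better: bool = True, tolerance: float = 0.0) -> str:
    """Return "keep" or "discard" for one iteration."""
    delta = new_metric - old_metric
    if not higher_is_better:
        delta = -delta  # normalize so a positive delta always means "better"
    if delta > tolerance:
        return "keep"  # rule 4: mechanically better
    if abs(delta) <= tolerance and new_loc < old_loc:
        return "keep"  # rule 6: equal results + less code
    return "discard"  # rule 5: roll back
```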

Commands

Command What it does
/autoresearch Run the autonomous iteration loop (unlimited)
Iterations: N Add to inline config to run exactly N iterations then stop
/autoresearch:plan Interactive wizard: Goal → Scope, Metric, Verify config
/autoresearch:security Autonomous STRIDE + OWASP + red-team security audit
/autoresearch:ship Universal shipping workflow (code, content, marketing, sales, research, design)
/autoresearch:debug Autonomous bug-hunting loop — scientific method + iterative investigation
/autoresearch:fix Autonomous fix loop — iteratively repair errors until zero remain
/autoresearch:scenario Scenario-driven use case generator — explore situations, edge cases, derivative scenarios
/autoresearch:predict Multi-persona prediction: 5 expert personas analyze independently, debate, and reach consensus
/autoresearch:learn Autonomous documentation engine — scout codebase, generate/update docs, validate, fix loop
/autoresearch:reason Adversarial refinement — blind judge panel converges subjective content through isolated multi-agent debate
Guard: <command> Optional safety net — must pass for changes to be kept

All commands use interactive setup when invoked without arguments. Just type the command — the agent will ask you what you need step by step with smart defaults based on your codebase. Power users can skip the wizard by providing flags inline.

OpenCode users: Commands use underscore naming (/autoresearch_debug, /autoresearch_fix, etc.) instead of colons. See OpenCode Quick Start below.

Codex users: Invoke the skill via $autoresearch mention syntax. Subcommands are keywords: $autoresearch plan, $autoresearch debug, etc. See Codex Quick Start below.

Quick Decision Guide

I want to... Use
Improve test coverage / reduce bundle size / any metric /autoresearch (add Iterations: N for bounded runs)
Don't know what metric to use /autoresearch:plan
Run a security audit /autoresearch:security
Ship a PR / deployment / release /autoresearch:ship
Optimize without breaking existing tests Add Guard: npm test
Hunt all bugs in a codebase /autoresearch:debug (add Iterations: 20 for bounded runs)
Fix all errors (tests, types, lint) /autoresearch:fix
Debug then auto-fix /autoresearch:debug --fix
Check if something is ready to ship /autoresearch:ship --checklist-only
Explore edge cases for a feature /autoresearch:scenario
Generate test scenarios /autoresearch:scenario --domain software --format test-scenarios
Stress test a user journey /autoresearch:scenario --depth deep
I want expert opinions before I start /autoresearch:predict
Analyze this from multiple angles /autoresearch:predict --chain debug
Generate docs for a new codebase /autoresearch:learn --mode init
Update existing docs after changes /autoresearch:learn --mode update
Check if docs are stale /autoresearch:learn --mode check
Debate an architecture decision /autoresearch:reason --domain software
Refine a pitch or proposal adversarially /autoresearch:reason --domain business
Converge on best design then validate /autoresearch:reason --chain predict

Quick Start

Claude Code

Option A — Plugin install (recommended):

In Claude Code, run:

/plugin marketplace add uditgoenka/autoresearch
/plugin install autoresearch@autoresearch

That's it. All 10 commands are available after restarting Claude Code.

Note: Start a new Claude Code session after installing. Reference files aren't resolvable in the same session where installation happened — this is a Claude Code platform limitation.

Updating (no reinstall needed):

/plugin update autoresearch

That pulls the latest version. Run /reload-plugins to activate. No need to uninstall or re-clone.

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

# Copy skill + subcommands to your project
cp -r autoresearch/claude-plugin/skills/autoresearch .claude/skills/autoresearch
cp -r autoresearch/claude-plugin/commands/autoresearch .claude/commands/autoresearch
cp autoresearch/claude-plugin/commands/autoresearch.md .claude/commands/autoresearch.md

Or install globally:

cp -r autoresearch/claude-plugin/skills/autoresearch ~/.claude/skills/autoresearch
cp -r autoresearch/claude-plugin/commands/autoresearch ~/.claude/commands/autoresearch
cp autoresearch/claude-plugin/commands/autoresearch.md ~/.claude/commands/autoresearch.md

Note: The commands/ directory is required for subcommands such as /autoresearch:ship, /autoresearch:plan, and /autoresearch:security to work.

Option C — Guided installer:

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --claude --global

OpenCode Quick Start

Option A — Guided installer (recommended):

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --opencode --global

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

# Copy to your project
cp -r autoresearch/.opencode/skills/autoresearch .opencode/skills/autoresearch
cp autoresearch/.opencode/commands/autoresearch*.md .opencode/commands/
cp autoresearch/.opencode/agents/docs-manager.md .opencode/agents/docs-manager.md

Or install globally:

cp -r autoresearch/.opencode/skills/autoresearch ~/.config/opencode/skills/autoresearch
cp autoresearch/.opencode/commands/autoresearch*.md ~/.config/opencode/commands/
cp autoresearch/.opencode/agents/docs-manager.md ~/.config/opencode/agents/docs-manager.md

OpenCode command names: Use underscores instead of colons — /autoresearch_debug, /autoresearch_fix, /autoresearch_plan, etc. All 10 commands are available.

Codex Quick Start

Option A — Guided installer (recommended):

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --codex --global

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

# Copy to your project
cp -r autoresearch/.agents/skills/autoresearch .agents/skills/autoresearch

Or install globally:

cp -r autoresearch/.agents/skills/autoresearch ~/.agents/skills/autoresearch

Codex invocation: Use $autoresearch mention syntax in your prompt. Subcommands are keywords — $autoresearch plan, $autoresearch debug, $autoresearch security, etc. Codex discovers skills automatically from .agents/skills/ directories.

2. Run It

/autoresearch
Goal: Increase test coverage from 72% to 90%
Scope: src/**/*.test.ts, src/**/*.ts
Metric: coverage % (higher is better)
Verify: npm test -- --coverage | grep "All files"
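The inline configuration is a set of `Key: value` lines following the command. As a rough illustration of that format (a hypothetical parser, not the skill's own), each line splits on its first colon, so verify commands that contain colons or pipes survive intact:

```python
def parse_inline_config(text: str) -> dict:
    """Parse `Key: value` lines that follow an /autoresearch command."""
    config = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("/"):
            continue  # skip blank lines and the slash command itself
        key, sep, value = line.partition(":")  # split on the FIRST colon only
        if sep:
            config[key.strip().lower()] = value.strip()
    if "iterations" in config:
        config["iterations"] = int(config["iterations"])  # bounded run
    return config
```

A missing `iterations` key corresponds to an unbounded run.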

3. Walk Away

Claude reads all in-scope files, establishes a baseline, and starts iterating one change at a time. It keeps improvements, auto-reverts failures, and logs everything, and it never stops until you interrupt it (or N iterations complete).


/autoresearch:plan — Goal → Config Wizard

The hardest part isn't the loop — it's defining Scope, Metric, and Verify correctly. /autoresearch:plan converts your plain-language goal into a validated, ready-to-execute configuration.

/autoresearch:plan
Goal: Make the API respond faster

The wizard walks you through 5 steps: capture goal → define scope → define metric → define direction → validate verify command (dry-run). Every gate is mechanical — scope must resolve to files, metric must output a number, verify must pass a dry-run.
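Two of these gates can be sketched directly. The numeric pattern below is the one quoted in the v1.9.11 release notes in the Release History; the function names and the comma-separated glob handling are assumptions for illustration.

```python
import glob
import re

# Pattern from the v1.9.11 notes: extracted values must match this before any decision logic.
NUMERIC = re.compile(r"^-?[0-9]+\.?[0-9]*$")


def validate_metric(raw: str) -> float:
    """Gate: the metric must be a plain number after trimming whitespace."""
    value = raw.strip()
    if not NUMERIC.match(value):
        raise ValueError(f"metric-error: non-numeric output {raw!r}")
    return float(value)


def scope_resolves(patterns: str) -> bool:
    """Gate: every comma-separated glob must match at least one file."""
    return all(glob.glob(p.strip(), recursive=True) for p in patterns.split(","))
```

Note that a value such as `87%` fails the pattern, so a verify command has to strip units before the loop can use its output.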


/autoresearch:security — Autonomous Security Audit

Read-only security audit using STRIDE threat modeling, OWASP Top 10 sweeps, and red-team adversarial analysis with 4 hostile personas.

/autoresearch:security
Iterations: 10

What it does: Codebase recon → asset inventory → trust boundaries → STRIDE threat model → attack surface map → autonomous testing loop → structured report.

Every finding requires code evidence (file:line + attack scenario). No theoretical fluff.

Flag Purpose
--diff Only audit files changed since last audit
--fix Auto-fix confirmed Critical/High findings
--fail-on <severity> Exit non-zero for CI/CD gating

Output: Creates security/{date}-{slug}/ with 7 structured report files.
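`--fail-on <severity>` amounts to comparing each finding against a threshold. A minimal sketch, assuming a four-level Low/Medium/High/Critical scale (the README names only Critical and High) and findings represented as dicts with a `severity` key:

```python
from typing import Dict, List

# Assumed ordering; only Critical and High are named in the README.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def fail_on_exit_code(findings: List[Dict[str, str]], threshold: str) -> int:
    """Return 1 if any finding is at or above `threshold`, else 0."""
    limit = SEVERITY_RANK[threshold.lower()]
    for finding in findings:
        if SEVERITY_RANK[finding["severity"].lower()] >= limit:
            return 1  # non-zero exit fails the CI job
    return 0
```

A CI wrapper could pass this value to `sys.exit(...)` so the pipeline step fails on High or Critical findings.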


/autoresearch:ship — Universal Shipping Workflow

Ship anything through 8 phases: Identify → Inventory → Checklist → Prepare → Dry-run → Ship → Verify → Log.

/autoresearch:ship --auto

Auto-detects what you're shipping (code PR, deployment, blog post, email campaign, sales deck, research paper, design assets) and generates domain-specific checklists — every item mechanically verifiable.

Flag Purpose
--dry-run Validate everything but don't ship
--auto Auto-approve if checklist passes
--force Skip non-critical items (blockers still enforced)
--rollback Undo last ship action
--monitor N Post-ship monitoring for N minutes
--type <type> Override auto-detection
--checklist-only Just check readiness

9 supported types: code-pr, code-release, deployment, content, marketing-email, marketing-campaign, sales, research, design.
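The `--force` semantics (skip non-critical items, still enforce blockers) fit in a short readiness check. `ChecklistItem` and `ready_to_ship` are hypothetical names for this sketch, not the skill's data model:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    name: str
    passed: bool
    blocker: bool  # blockers are enforced even with --force


def ready_to_ship(items: List[ChecklistItem], force: bool = False) -> bool:
    """Return True when every failing item may be skipped."""
    for item in items:
        if item.passed:
            continue
        if item.blocker or not force:
            return False  # a failing blocker always stops the ship
    return True
```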


/autoresearch:debug — Autonomous Bug Hunter (v1.3.0)

Scientific method meets autoresearch loop. Doesn't stop at one bug — iteratively hunts ALL bugs using falsifiable hypotheses, evidence-based investigation, and 7 investigation techniques.

/autoresearch:debug
Scope: src/api/**/*.ts
Symptom: API returns 500 on POST /users
Iterations: 20

How it works: Gather symptoms → Recon (map error surface) → Hypothesize (specific, testable) → Test (one experiment per iteration) → Classify (confirmed/disproven/inconclusive) → Log → Repeat.

Every finding requires code evidence (file:line + reproduction steps). Every disproven hypothesis is logged — equally valuable. Uses 7 techniques: binary search, differential debugging, minimal reproduction, trace execution, pattern search, working backwards, rubber duck.

Flag Purpose
--fix After hunting, auto-switch to /autoresearch:fix
--scope <glob> Limit investigation scope
--symptom "<text>" Pre-fill symptom
--severity <level> Minimum severity to report
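Binary search, the first of the seven techniques, is the same idea `git bisect` automates. A generic sketch, assuming an ordered history whose first entry is good and last entry is bad, with every entry after the first bad one also bad:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(commits: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first bad commit in an oldest-to-newest sequence."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Each probe halves the search range, so 1,000 commits need about 10 tests instead of up to 1,000.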

/autoresearch:fix — Autonomous Error Crusher (v1.3.0)

Takes a broken state and iteratively repairs it until everything passes. ONE fix per iteration. Atomic, committed, verified, auto-reverted on failure.

/autoresearch:fix

How it works: Auto-detects what's broken (tests, types, lint, build) → Prioritizes (blockers first) → Fixes ONE thing → Commits → Verifies error count decreased → Guard check (no regressions) → Keep/Revert → Repeat until zero errors.

Stops automatically when error count hits zero — even in unbounded mode.

Flag Purpose
--target <command> Explicit verify command
--guard <command> Safety command that must always pass
--category <type> Only fix specific type (test, type, lint, build)
--from-debug Read findings from latest debug session

Chain them: run /autoresearch:debug with Iterations: 15, then /autoresearch:fix --from-debug with Iterations: 30.
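The fix loop's keep rule (error count must drop and the guard must pass) can be sketched with a list of error IDs. This is an illustration only: `try_fix` and `guard` are stand-ins for real edits and commands, and `max_iterations` is a safety cap added here, not a documented flag.

```python
from typing import Callable, List


def fix_loop(errors: List[str],
             try_fix: Callable[[List[str], str], List[str]],
             guard: Callable[[List[str]], bool],
             max_iterations: int = 50) -> List[str]:
    """Repair ONE error per iteration; return the errors that remain."""
    for _ in range(max_iterations):
        if not errors:
            break  # zero errors: stop even in unbounded mode
        target = errors[0]  # assumes the list is already sorted blockers-first
        candidate = try_fix(errors, target)
        if len(candidate) < len(errors) and guard(candidate):
            errors = candidate  # keep: error count decreased, guard passed
        else:
            errors = errors[1:] + errors[:1]  # revert and defer this error
    return errors
```

Rotating a failed target to the back keeps the loop from retrying the same unfixable error forever.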


/autoresearch:learn — Autonomous Documentation Engine

Scout codebase → generate docs → validate → fix → repeat. 4 modes: init (create from scratch), update (refresh existing), check (read-only health report), summarize (quick overview).

/autoresearch:learn --mode init --depth deep

Dynamic doc discovery (scans docs/*.md), project-type detection, validation-fix loop (max 3 retries), scale-aware scouting, git-diff scoping for updates, selective single-doc update with --file. Auto-generates Mermaid architecture diagrams, conditional docs (API reference, testing guide, config guide, changelog), cross-reference links between docs, and dependency documentation. Supports --format for alternative output formats.
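The validation-fix loop with its 3-retry cap looks roughly like this. `validate` and `fix` are hypothetical callables (for example, a broken-link checker and a link rewriter); the sketch only shows the bounded retry structure:

```python
from typing import Callable, List, Tuple


def validate_and_fix(doc: str,
                     validate: Callable[[str], List[str]],
                     fix: Callable[[str, List[str]], str],
                     max_retries: int = 3) -> Tuple[str, List[str]]:
    """Return the doc and any problems still present after at most `max_retries` fixes."""
    problems = validate(doc)
    retries = 0
    while problems and retries < max_retries:
        doc = fix(doc, problems)
        problems = validate(doc)  # re-validate after every fix pass
        retries += 1
    return doc, problems
```

Returning the leftover problems, rather than looping forever, lets the caller report them in the final summary.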


/autoresearch:predict — Multi-Persona Prediction (v1.7.0)

Before you debug, fix, or ship — get 5 expert perspectives in 2 minutes.

/autoresearch:predict simulates a team of experts (Architect, Security Analyst, Performance Engineer, Reliability Engineer, Devil's Advocate) who independently analyze your code, debate findings, and reach consensus. Chain the output directly to any other command:

  • /autoresearch:predict --chain debug — pre-ranked hypotheses before debugging
  • /autoresearch:predict --chain security — multi-persona red team analysis
  • /autoresearch:predict --chain scenario,debug,fix — full quality pipeline
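How the personas "reach consensus" is not specified here. One plausible aggregation, shown purely as a hypothetical sketch, is to rank findings by how many personas raised them independently, which would also yield the "pre-ranked hypotheses" passed to `--chain debug`:

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_by_consensus(findings: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Rank findings by the number of personas that raised them."""
    votes: Counter = Counter()
    for items in findings.values():
        votes.update(set(items))  # at most one vote per persona per finding
    # Most-agreed-upon first; ties broken alphabetically for stable output.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```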

/autoresearch:reason — Adversarial Refinement (v1.9.0)

Extends autoresearch to subjective domains where no objective metric exists. The blind judge panel IS the fitness function: it plays the role that val_bpb (validation bits per byte) plays in Karpathy's original loop, applied to architecture decisions, product strategy, content quality, and design debates.

/autoresearch:reason
Task: Should we use event sourcing for our order management system?
Domain: software
Iterations: 8

How it works: Generate-A → Critic attacks (strawman) → Author-B responds → Synthesizer merges → Blind judge panel (randomized labels) picks winner → Winner becomes new A → Repeat until convergence.

Key invariant: Every agent is a cold-start fresh invocation — no shared session, no history bleed. Judges never see A/B/AB labels, only X/Y/Z.

Flag Purpose
--iterations N Bounded mode — run exactly N rounds
--judges N Judge count (3-7, odd preferred)
--convergence N Consecutive wins to converge (default: 3)
--mode <mode> convergent (default), creative, debate
--domain <type> software, product, business, security, research, content
--chain <targets> Chain converged output to any autoresearch command
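One judging round and the convergence test can be sketched as below. The sketch assumes exactly three candidates (A, B, AB) behind the X/Y/Z labels, judges modeled as functions that return a label, and "converged" meaning the incumbent A won the last `--convergence` rounds in a row. All three points are interpretations, not the skill's code.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

Judge = Callable[[Dict[str, str]], str]


def blind_round(candidates: Dict[str, str], judges: List[Judge],
                rng: random.Random) -> str:
    """Shuffle A/B/AB behind X/Y/Z labels, take a majority vote, then un-blind."""
    roles = list(candidates)          # e.g. ["A", "B", "AB"]
    rng.shuffle(roles)                # randomized labels every round
    labels = dict(zip("XYZ", roles))  # judges only ever see X/Y/Z
    blinded = {label: candidates[role] for label, role in labels.items()}
    votes = Counter(judge(blinded) for judge in judges)
    # With three candidates a 1-1-1 split is still possible; most_common
    # then returns the first label counted.
    winning_label, _ = votes.most_common(1)[0]
    return labels[winning_label]


def converged(winners: List[str], streak: int = 3) -> bool:
    """True when the incumbent A has won `streak` consecutive rounds."""
    return len(winners) >= streak and all(w == "A" for w in winners[-streak:])
```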

Chain patterns: reason → predict (converge then stress-test), reason → plan,fix (converge then implement), reason → scenario (converge then explore edge cases).

Output: Creates reason/{date}-{slug}/ with lineage.md, candidates.md, judge-transcripts.md, reason-results.tsv, handoff.json.


/autoresearch:scenario — Scenario Explorer (v1.6.0)

Autonomous scenario exploration engine. Takes a seed scenario and iteratively generates situations across 12 dimensions — happy paths, errors, edge cases, abuse, scale, concurrency, temporal, data variation, permissions, integrations, recovery, and state transitions.

/autoresearch:scenario
Scenario: User attempts to checkout with multiple payment methods
Iterations: 25

How it works: Seed analysis → Decompose into 12 dimensions → Generate ONE situation per iteration → Classify (new/variant/duplicate) → Expand edge cases → Log → Repeat until all dimensions explored.
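A simple way to picture the new/variant/duplicate classification and the "all dimensions explored" stop condition is shown below. It assumes each situation is a `(dimension, trigger, outcome)` tuple; that representation and the dimension slugs are hypothetical, derived from the 12 dimensions listed above.

```python
from typing import List, Set, Tuple

Situation = Tuple[str, str, str]  # (dimension, trigger, outcome)

DIMENSIONS = ["happy-path", "error", "edge-case", "abuse", "scale", "concurrency",
              "temporal", "data-variation", "permissions", "integration",
              "recovery", "state-transition"]


def classify(situation: Situation, seen: Set[Situation]) -> str:
    """Compare a generated situation against the ones already logged."""
    if situation in seen:
        return "duplicate"
    dimension, trigger, _outcome = situation
    if any(s[0] == dimension and s[1] == trigger for s in seen):
        return "variant"  # same trigger, different outcome
    return "new"


def unexplored(seen: Set[Situation]) -> List[str]:
    """Dimensions with no logged situation yet; the loop ends when this is empty."""
    covered = {s[0] for s in seen}
    return [d for d in DIMENSIONS if d not in covered]
```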

Adaptive setup: asks 4-8 questions depending on how much context you give. Type /autoresearch:scenario with nothing else and it walks you through everything.

Flag Purpose
--domain <type> Domain: software, product, business, security, marketing
--depth <level> Depth: shallow (10), standard (25), deep (50+)
--format <type> Output: use-cases, user-stories, test-scenarios, threat-scenarios
--focus <area> Prioritize: edge-cases, failures, security, scale
--scope <glob> Limit to specific files/features

5 domains supported with tailored dimension priorities and output formats. Chain with /autoresearch:debug to hunt bugs in discovered edge cases, or /autoresearch:security to audit discovered threat scenarios.


Guard — Prevent Regressions (v1.0.4)

When optimizing a metric, the loop might break existing behavior. Guard is an optional safety net.

/autoresearch
Goal: Reduce API response time to under 100ms
Verify: npm run bench:api | grep "p95"
Guard: npm test

  • Verify = "Did the metric improve?" (the goal)
  • Guard = "Did anything else break?" (the safety net)

If the metric improves but the guard fails, Claude reworks the optimization (up to 2 attempts). Guard/test files are never modified.
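The Verify/Guard interplay with up to 2 rework attempts can be sketched as follows. `verify`, `guard`, and `rework` are hypothetical callables standing in for the verify command, the guard command, and Claude's rework step. Each reworked version is re-verified, so a rework that loses the metric gain is also reverted.

```python
from typing import Callable, Optional


def apply_with_guard(change: str,
                     verify: Callable[[str], bool],
                     guard: Callable[[str], bool],
                     rework: Callable[[str, int], str],
                     max_reworks: int = 2) -> Optional[str]:
    """Return the change to keep, or None to revert."""
    for attempt in range(max_reworks + 1):
        if attempt > 0:
            change = rework(change, attempt)  # guard/test files are never touched
        if not verify(change):
            return None  # metric did not improve: revert
        if guard(change):
            return change  # metric improved and nothing else broke
    return None  # guard still failing after all reworks: revert
```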

Credit: Guard was contributed by @pronskiy (JetBrains) in PR #7.


Results Tracking

Every iteration is logged in TSV format:

iteration  commit   metric  delta   status    description
0          a1b2c3d  85.2    0.0     baseline  initial state
1          b2c3d4e  87.1    +1.9    keep      add tests for auth edge cases
2          -        86.5    -0.6    discard   refactor test helpers (broke 2 tests)
3          c3d4e5f  88.3    +1.2    keep      add error handling tests

Every 10 iterations, Claude prints a progress summary. Bounded loops print a final summary with baseline → current best.
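The sample above is space-aligned for readability; an actual TSV row separates columns with tabs. A small sketch using Python's `csv` module, where the column order follows the sample and the formatting choices (one decimal place, signed deltas, `-` for discarded commits) are inferred from it:

```python
import csv
import io
from typing import Optional, Sequence

HEADER = ["iteration", "commit", "metric", "delta", "status", "description"]


def tsv_line(fields: Sequence[object]) -> str:
    """Serialize one row with tab separators."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(fields)
    return buf.getvalue()


def result_row(iteration: int, commit: Optional[str], metric: float, delta: float,
               status: str, description: str) -> str:
    return tsv_line([
        iteration,
        commit or "-",                        # discarded runs have no surviving commit
        f"{metric:.1f}",
        f"{delta:+.1f}" if delta else "0.0",  # signed deltas; plain 0.0 for the baseline
        status,
        description,
    ])
```

Appending is then `fh.write(result_row(...))` on a file opened with mode `"a"`, after writing `tsv_line(HEADER)` once.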


Crash Recovery

Failure Response
Syntax error Fix immediately, don't count as iteration
Runtime error Attempt fix (max 3 tries), then move on
Resource exhaustion Revert, try smaller variant
Infinite loop / hang Kill after timeout, revert
External dependency Skip, log, try different approach
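The "kill after timeout" and error rows map naturally onto `subprocess.run`, which kills the child process when `timeout` expires before raising `TimeoutExpired`. The status names in this sketch are hypothetical. It also takes an argument list rather than a shell string, because killing a shell on timeout may not stop a pipeline's grandchildren.

```python
import subprocess
from typing import List, Tuple


def run_verify(cmd: List[str], timeout_s: float) -> Tuple[str, str]:
    """Run a verify command and classify the outcome for crash recovery."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang", ""  # child already killed; the caller reverts
    except FileNotFoundError:
        return "missing-tool", ""  # treat like the external-dependency row: skip, log
    if proc.returncode != 0:
        return "runtime-error", proc.stderr  # caller may attempt up to 3 fixes
    return "ok", proc.stdout
```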

Repository Structure

autoresearch/
├── README.md
├── COMPARISON.md                                  ← Karpathy's Autoresearch vs Claude Autoresearch
├── guide/                                         ← Comprehensive guides — one per command + advanced patterns
├── scripts/
│   ├── install.sh                                 ← Guided installer (Claude Code + OpenCode + Codex)
│   ├── sync-opencode.sh                           ← Sync .claude/ → .opencode/ with adaptations
│   ├── sync-codex.sh                              ← Sync .claude/ → .agents/ with Codex adaptations
│   ├── release.sh                                 ← Release automation
│   └── release.md                                 ← Release checklist
├── .claude/skills/autoresearch/                   ← Claude Code source (canonical)
│   ├── SKILL.md                                   ← Main skill
│   └── references/                                ← 12 workflow protocol files
├── .opencode/                                     ← OpenCode port (generated via sync-opencode.sh)
│   ├── skills/autoresearch/                       ← Adapted SKILL.md + references
│   ├── commands/                                  ← 10 command files (autoresearch_*.md)
│   └── agents/docs-manager.md                     ← Subagent for learn workflow
├── .agents/skills/autoresearch/                   ← Codex port (generated via sync-codex.sh)
│   ├── SKILL.md                                   ← Adapted SKILL.md + references
│   ├── references/                                ← 12 workflow protocol files
│   └── agents/openai.yaml                         ← UI metadata for Codex
├── claude-plugin/                                 ← Distribution package (Claude Code plugin install)
│   ├── .claude-plugin/plugin.json                 ← Plugin metadata + version
│   ├── commands/                                  ← Command registrations
│   └── skills/autoresearch/                       ← Skill + references
└── LICENSE

FAQ

Q: I don't know what metric to use. A: Run /autoresearch:plan — it analyzes your codebase, suggests metrics, and dry-runs the verify command before you launch.

Q: Does this work with any project? A: Yes. Any language, framework, or domain. Install via /plugin marketplace add uditgoenka/autoresearch (Claude Code), ./scripts/install.sh --opencode --global (OpenCode), ./scripts/install.sh --codex --global (Codex), or manually copy files.

Q: Does this work with OpenCode? A: Yes, as of v2.0.0-beta. Run ./scripts/install.sh --opencode --global or manually copy .opencode/ files. Commands use underscore naming (/autoresearch_debug instead of /autoresearch:debug).

Q: Does this work with OpenAI Codex? A: Yes, as of v2.0.0-beta.0.2. Run ./scripts/install.sh --codex --global or copy .agents/skills/autoresearch/ to ~/.agents/skills/. Invoke via $autoresearch mention syntax in Codex.

Q: How do I stop the loop? A: Ctrl+C or add Iterations: N to your inline config to run exactly N iterations. Claude commits before verifying, so your last successful state is always in git.

Q: Can I use this for non-code tasks? A: Absolutely. Sales emails, marketing copy, HR policies, runbooks — anything with a measurable metric. See Examples by Domain.

Q: Does /autoresearch:security modify my code? A: No. It's read-only — analyzes code and produces a structured report. Use --fix to opt into auto-remediation of confirmed Critical/High findings.

Q: Can I use MCP servers? A: Yes. Any MCP server configured in Claude Code is available during the loop for database queries, API calls, analytics, etc. See Advanced Patterns.

Q: What's the difference between /autoresearch:predict and /autoresearch:reason? A: Predict is a one-shot analysis — 5 experts debate your existing code. Reason is an iterative refinement loop — competing candidates are generated, critiqued, synthesized, and blind-judged over multiple rounds until convergence. Use predict for analysis before acting; use reason for decisions where no objective metric exists.


Contributing

Contributions welcome! See CONTRIBUTING.md.

Areas of interest: new domain examples, verification script templates, CI/CD integrations, real-world benchmarks. All guides are in the guide/ folder.



License

MIT — see LICENSE.


Credits


About the Author


Udit Goenka — AI Product Expert, Founder & Angel Investor

Self-taught builder who went from a slow internet connection in India to founding multiple companies and helping 700+ startups generate over $25M in revenue.

Building: TinyCheque (India's first agentic AI venture studio) · Firstsales.io (sales automation)

Investing: 38 startups backed, 6 exits. Focused on early-stage AI and SaaS.

Connect: udit.co · @iuditg · @uditgoenka · Newsletter

"Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy."

Release History

VersionChangesUrgencyDate
v2.1.2## What's New ### `/autoresearch:improve` — Research What to Build Next New subcommand for product companies. Fills the "what should we build next?" gap in a family that previously only covered "how to build it right." **How it works:** 1. **Product Context** — resolves from learn summary, README, package.json, or auto-discovers 2. **Research Loop** — 5 categories (ICP challenges, competitor gaps, market trends, UX & experience, revenue & growth) with saturation-based termination 3. **FeatureHigh5/23/2026
v2.1.1## What's New **9 auto-firing hooks** that ship as part of the Claude Code plugin. Zero configuration — they activate on `npx skills add uditgoenka/autoresearch`. ### Safety Gates (PreToolUse) | Hook | What it does | |------|-------------| | `scout-block` | Blocks vendor dirs, `.git/`, `__pycache__/`, `dist/`, `build/`, `coverage/` — prevents context bloat. Loads `.ckignore` for per-project customization. Smart Bash argument parsing prevents false positives on string literals. | | `privacy-blHigh5/22/2026
v2.0.04## What's New ### Reliable Skill Triggering The autoresearch skill description across all 5 distributions (Claude Code, Claude Plugin, OpenCode, Agents, Codex) has been changed from passive "Use when user types..." to imperative "ALWAYS activate...MUST...BLOCKING". This matches the pattern used by reliably-triggering skills like `cook` and `fix`, ensuring autoresearch subcommands activate consistently without requiring users to add emphatic language like "strictly trigger" or "must must triggerHigh5/6/2026
v2.0.03## What's Changed * Release v2.0.03 by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/85 **Full Changelog**: https://github.com/uditgoenka/autoresearch/compare/v2.0.2...v2.0.03High5/2/2026
v2.0.0## v2.0.0 — Multi-Platform GA Release Promotes v2.0.0-beta to stable. **Claude Code + OpenCode + Codex** all fully supported with strict YAML compliance, security-hardened scripts, and complete command metadata. ### What's New in v2.0.0 - **11 subcommands:** plan, debug, fix, security, ship, scenario, predict, learn, reason, probe (new) - **3 platforms:** Claude Code, OpenCode, OpenAI Codex - **`/autoresearch:probe`** — adversarial multi-persona requirement interrogation engine (new in this rHigh4/28/2026
v1.9.12## What's fixed `/autoresearch` and its 9 subcommands (`plan`, `debug`, `fix`, `security`, `ship`, `scenario`, `predict`, `learn`, `reason`) now dispatch reliably on the **first** invocation. Previously the skill failed to trigger ~80% of the time, requiring repeated attempts (often 10+) before Claude Code would activate it. ## Root cause Claude Code routes slash commands through a **description-based fuzzy matcher**. None of the skill or command descriptions contained the literal `/autoreseaHigh4/15/2026
v2.0.0-beta.0.2## What's New ### OpenAI Codex Support Autoresearch now runs on **three platforms**: Claude Code, OpenCode, and OpenAI Codex. **Codex-specific:** - Skills installed to `.agents/skills/autoresearch/` (Codex standard path) - Invoked via `$autoresearch` mention syntax (not slash commands) - Subcommands are keywords: `$autoresearch plan`, `$autoresearch debug`, `$autoresearch security`, etc. - Includes `agents/openai.yaml` for Codex UI metadata (display name, brand color, implicit invocation) - GlMedium4/6/2026
v2.0.0-beta.0.1## 🧪 Beta Release — OpenCode Support > **This is a beta release.** OpenCode support is new and needs community testing before stable promotion. ### What's New **Full OpenCode port of autoresearch** — all 10 commands, all 12 reference workflows, adapted for OpenCode's tool and command conventions. #### OpenCode Support - **10 commands:** `/autoresearch`, `/autoresearch_plan`, `/autoresearch_debug`, `/autoresearch_fix`, `/autoresearch_security`, `/autoresearch_ship`, `/autoresearch_scenario`,Medium4/6/2026
v1.9.11## What's New ### Metric Extraction Validation (PR #63) - **Mandatory numeric validation** — extracted values must match `^-?[0-9]+\.?[0-9]*$` before any decision logic runs - **`metric-error` status** — new iteration status for non-numeric extraction failures - **Two-consecutive-error halt** — stops the loop (even unbounded) when the verify pipeline is confirmed broken - **Diagnostic output** — shows raw verify output on failure so the problem is visible - **Whitespace trim** — strips leading/High4/6/2026
v1.9.1## What's New ### AGENTS.md — Universal AI Agent Onboarding Added `AGENTS.md` to the project root so that any AI coding agent (Claude Code, Codex, OpenCode, Gemini CLI, etc.) can immediately discover and use all 10 autoresearch commands without reading the full README or skill files. **Covers:** - Installation (Claude Code plugin, Codex plugin, manual copy) - All 10 commands with usage examples - Configuration fields (Goal, Scope, Metric, Verify, Guard, Iterations) - Per-command flag referencMedium4/6/2026
v1.9.0## What's New in v1.9.0 ### `/autoresearch:reason` — The 10th Subcommand Extends autoresearch to **subjective domains** where no objective metric exists. Constructs a subjective fitness function through isolated multi-agent adversarial refinement with blind evaluation — the same way science uses peer review where math uses proofs. **The blind judge panel IS the val_bpb equivalent for subjective work.** ### How It Works ``` Generate-A → Critic attacks A (strawman) → Author-B sees task+A+critMedium3/31/2026
v1.8.2## Stability & Documentation Patch 10 bugs fixed from a 50-iteration `/autoresearch:debug` audit. No new features — purely stability and cross-reference consistency. ### Bug Fixes - **learn.md** — Added missing Argument Parsing section (9 flags were silently ignored) - **SKILL.md** — Fixed `--budget` flag semantic mismatch (was "LLM cost", now correctly "max findings" matching predict-workflow.md) - **results-logging.md** — Completed status enum (added `keep (reworked)`, `no-op`, `hook-blockeLow3/21/2026
v1.8.1## /autoresearch:learn — 10 Enhancements This patch adds 10 improvements to the autonomous documentation engine, making it smarter about what docs to generate and how to keep them in sync. ### New capabilities - **Mermaid architecture diagrams** — `system-architecture.md` now auto-generates component, data flow, and dependency diagrams - **Conditional documentation** — 4 new doc types auto-created when signals detected: - `api-reference.md` (routes, controllers, OpenAPI specs) - `testing-Low3/21/2026
v1.8.0## What's Changed * Update COMPARISON.md by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/49 * feat: /autoresearch:learn — autonomous codebase documentation engine by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/50 * release: v1.8.0 — /autoresearch:learn by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/52 **Full Changelog**: https://github.com/uditgoenka/autoresearch/compare/v1.7.6...v1.8.0Low3/20/2026
v1.7.6## What's New ### 📖 10 Scenario-Based Guide Examples (`guide/scenario/`) Real-world, end-to-end walkthroughs of `/autoresearch:scenario` applied to specific domains. Each guide includes command config, 5-6 example situations across the 12 exploration dimensions, chain patterns, and domain-specific tips. | Guide | Domain | Key Dimensions | |-------|--------|----------------| | [Real-Time Chat Messaging](guide/scenario/real-time-chat-messaging.md) | software | concurrent, recovery, temporal | Low3/20/2026
v1.7.5 (Low, 3/19/2026)

## Documentation Improvements

Adds 1,500+ lines of actionable implementation guidance to improve benchmark score from **65.4/100** toward **95+/100**. Every addition includes executable code snippets, configuration parameters, and real-world examples.

### What Changed

#### TSV Logging (Q7: 30→95)

- Setup & initialization script, `log_iteration()` function, read/query patterns, loop integration lifecycle

#### Noisy Metric Handling (Q9: 31→81) — NEW Phase 5.1

- Multi-run verification, minimum
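A sketch of what TSV iteration logging and multi-run verification might look like, assuming a results file with one row per iteration. The column names, the `log_iteration()` signature, and the use of the median to summarize noisy runs are illustrative choices, not the skill's documented schema (the notes name `log_iteration()` but not its parameters).

```python
import csv
import statistics
from pathlib import Path

# Hypothetical column layout; the real TSV schema is not shown in these notes.
COLUMNS = ["iteration", "commit", "metric", "status", "description"]


def init_log(path: Path) -> None:
    """Create the TSV file with a header row if it does not exist yet."""
    if not path.exists():
        with path.open("w", newline="") as f:
            csv.writer(f, delimiter="\t").writerow(COLUMNS)


def log_iteration(path: Path, iteration: int, commit: str, metric: float,
                  status: str, description: str) -> None:
    """Append one row for a finished loop iteration."""
    init_log(path)
    with path.open("a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [iteration, commit, metric, status, description])


def read_log(path: Path) -> list[dict[str, str]]:
    """Read every row back as a dict keyed by column name."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def noisy_metric(samples: list[float]) -> float:
    """Collapse several verification runs into one value; the median resists outliers."""
    return statistics.median(samples)
```

Writing through the `csv` module with a tab delimiter keeps descriptions containing spaces or quotes in a single column, which a naive `"\t".join(...)` would not guarantee.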
v1.7.4 (Low, 3/19/2026)

## Bug Fix

Fixes #43 — Plugin installation failed on macOS with `ENAMETOOLONG` error due to recursive self-nesting in the plugin cache.

### What happened

When `marketplace.json` had `"source": "./"`, Claude Code cached the **entire repo** — including `marketplace.json` itself. The cached copy triggered another cache cycle, creating an infinite loop:

```
~/.claude/plugins/cache/autoresearch/1.7.3/autoresearch/1.7.3/autoresearch/1.7.3/... (45+ levels deep)
```

This pushed file paths to 1021+
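To spot the self-nesting pattern in an existing cache, one option is to count how often the same `<name>/<version>` pair repeats along a path. `repeated_pair_count()` and `looks_self_nested()` are hypothetical helpers, and treating more than one occurrence as a symptom is an assumption based on the example path above.

```python
from pathlib import PurePosixPath


def repeated_pair_count(path: str, name: str, version: str) -> int:
    """Count adjacent <name>/<version> segment pairs in a cache path."""
    parts = PurePosixPath(path).parts
    return sum(1 for a, b in zip(parts, parts[1:]) if (a, b) == (name, version))


def looks_self_nested(path: str, name: str, version: str) -> bool:
    # Assumption: a healthy cache path contains the pair at most once.
    return repeated_pair_count(path, name, version) > 1
```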
v1.7.3 (Low, 3/18/2026)

## What's Changed

* Release v1.7.3 — Further Stability Fixes & Improvements by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/42

**Full Changelog**: https://github.com/uditgoenka/autoresearch/compare/v1.7.2...v1.7.3
v1.7.2 (Low, 3/18/2026)

## v1.7.2 — Stability Fixes

Comprehensive stability patch addressing **25+ bugs** found via two rounds of deep `/autoresearch:debug --iterations 25` audits. Pure bugfix release — no new features.

### Highlights

- **Ship command**: Added `--target` flag for explicit ship target
- **Debug→Fix chain**: `--fix` now correctly passes `--from-debug` to fix command
- **Security CI**: Template now copies both `commands/` and `skills/` (was missing commands)
- **Predict budget**: Resolved unit collisio
v1.7.1 (Low, 3/18/2026)

## What's Changed

### Documentation Restructure

- Replaced monolithic `GUIDE.md` (1,791 lines) + `EXAMPLES.md` (2,228 lines) with **13 focused guide files** in `guide/` folder (6,183 lines total)
- Each subcommand has its own detailed guide: `guide/autoresearch-predict.md` (778 lines), `guide/autoresearch-security.md` (512 lines), etc.
- New files: `guide/chains-and-combinations.md`, `guide/examples-by-domain.md`, `guide/advanced-patterns.md`

### Command Reliability Fixes

- **Commands now trig
v1.7.0 (Low, 3/18/2026)

## What's Changed

* fix: remove self-referencing source URL causing recursive directory nesting by @daviseford in https://github.com/uditgoenka/autoresearch/pull/36
* feat: /autoresearch:predict — Multi-Persona Swarm Prediction (v1.7.0) by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/39

## New Contributors

* @daviseford made their first contribution in https://github.com/uditgoenka/autoresearch/pull/36

**Full Changelog**: https://github.com/uditgoenka/autoresearch/compare/v1.
v1.6.2 (Low, 3/17/2026)

## What's Changed

* Simplify autoresearch plugin installation instructions by @tomashm in https://github.com/uditgoenka/autoresearch/pull/31
* Release v1.6.2 — Comprehensive GUIDE.md by @uditgoenka in https://github.com/uditgoenka/autoresearch/pull/34

## New Contributors

* @tomashm made their first contribution in https://github.com/uditgoenka/autoresearch/pull/31

**Full Changelog**: https://github.com/uditgoenka/autoresearch/compare/v1.6.1...v1.6.2
v1.6.1 (Low, 3/17/2026)

## What's Changed

Fixes #29 — The "git is memory" mechanism that powers inter-iteration learning in the autonomous loop wasn't working reliably. The agent would skip reading git history, use destructive `git reset --hard` (destroying experiment memory), and enter the loop without verifying git state.

This release comprehensively hardens the protocol with **10 targeted fixes** discovered through a 20-iteration edge case audit.

### Highlights

**New: Phase 0 — Precondition Checks**

Before enter
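A minimal sketch of Phase 0-style precondition checks plus a non-destructive discard, assuming the checks are "inside a work tree, clean working tree, HEAD on a branch". The actual Phase 0 list is truncated here, so the function names and the choice of `git revert` as the alternative to `git reset --hard` are assumptions. The git subcommands themselves (`rev-parse --is-inside-work-tree`, `status --porcelain`, `symbolic-ref -q HEAD`, `revert --no-edit`) are standard.

```python
import subprocess
from typing import List, Tuple


def precondition_errors(inside_work_tree: bool, porcelain_status: str,
                        detached_head: bool) -> List[str]:
    """Return the reasons the loop should not start (an empty list means go)."""
    if not inside_work_tree:
        return ["not inside a git work tree"]
    errors = []
    if porcelain_status.strip():
        errors.append("working tree has uncommitted changes")
    if detached_head:
        errors.append("HEAD is detached")
    return errors


def gather_state(cwd: str = ".") -> Tuple[bool, str, bool]:
    """Collect the inputs for precondition_errors() from a real repository."""
    def git(*args: str) -> subprocess.CompletedProcess:
        return subprocess.run(["git", *args], cwd=cwd, capture_output=True, text=True)

    inside = git("rev-parse", "--is-inside-work-tree").stdout.strip() == "true"
    if not inside:
        return False, "", False
    status = git("status", "--porcelain").stdout
    # symbolic-ref -q exits non-zero when HEAD does not point at a branch.
    detached = git("symbolic-ref", "-q", "HEAD").returncode != 0
    return inside, status, detached


def discard_command(commit: str = "HEAD") -> List[str]:
    """Undo an experiment with a new revert commit instead of `git reset --hard`,
    so the failed attempt stays in history for later iterations to read."""
    return ["git", "revert", "--no-edit", commit]
```

In a real repository the check would be `precondition_errors(*gather_state())`; splitting state gathering from the decision keeps the rules testable without a git checkout.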
v1.6.0 (Low, 3/17/2026)

## What's New

### `/autoresearch:scenario` — Scenario Explorer

New subcommand that autonomously explores a seed scenario across **12 dimensions** to generate situations, edge cases, failure modes, and derivative scenarios. Think brainstorming meets the autoresearch loop.

**Just type:**

```
/autoresearch:scenario
```

Claude asks 4-8 adaptive questions, then iterates — generating one concrete situation per iteration, classifying it (new/variant/duplicate), expanding edge cases, and logging ever
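The new/variant/duplicate classification could work by comparing each generated situation with the ones already logged. The sketch below swaps in word-level Jaccard similarity with hypothetical thresholds (`dup_at`, `variant_at`); the release notes do not say how the skill actually classifies situations.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words of a situation description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared words divided by total distinct words (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(situation: str, seen: List[str],
             dup_at: float = 0.8, variant_at: float = 0.4) -> str:
    """Label a situation by its closest match among previously logged ones."""
    best = max((jaccard(tokens(situation), tokens(s)) for s in seen), default=0.0)
    if best >= dup_at:
        return "duplicate"
    if best >= variant_at:
        return "variant"
    return "new"
```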
v1.5.0 (Low, 3/17/2026)

## What's New

### Mandatory Interactive Setup Gate

All autoresearch commands now **enforce** `AskUserQuestion` when invoked without required context. Previously, invoking `/autoresearch` or any subcommand without inline configuration would silently skip the interactive setup wizard and proceed directly to execution — leaving Claude without the Goal, Scope, Metric, or other required fields.

### Changes

**New: Routing table in SKILL.md**

A `MANDATORY: Interactive Setup Gate` section at the to
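The setup gate boils down to "if required fields are missing from the inline config, ask before running". In this sketch the field list (`Goal`, `Scope`, `Metric`, `Verify`) and the function names are assumptions; the notes name Goal, Scope, and Metric and refer to "other required fields" without listing them.

```python
from typing import Dict, List

# Assumption: Verify is treated as required alongside Goal, Scope, and Metric.
REQUIRED_FIELDS = ("Goal", "Scope", "Metric", "Verify")


def parse_inline_config(text: str) -> Dict[str, str]:
    """Parse `Key: value` lines, skipping the slash-command line itself."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("/"):
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            config[key.strip()] = value.strip()
    return config


def missing_fields(config: Dict[str, str]) -> List[str]:
    return [field for field in REQUIRED_FIELDS if field not in config]


def route(text: str) -> str:
    """Hypothetical gate: ask the user first whenever a required field is absent."""
    return "ask_user" if missing_fields(parse_inline_config(text)) else "run_loop"
```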
v1.4.0 (Low, 3/17/2026)

## Breaking Change

**`/loop N /autoresearch` no longer recommended.** Use `Iterations: N` inline config instead.

### Why

Claude Code's `/loop` command is a **scheduler** (runs on time intervals like `/loop 5m /foo`), NOT an iteration counter. When users typed `/loop 5 /autoresearch`, Claude interpreted "5" as "5 minutes" and scheduled a recurring task — not 5 iterations. (#24)

### New Syntax

**Before (broken):**

```
/loop 25 /autoresearch
Goal: Increase test coverage to 90%
```

**After (co
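Because `Iterations: N` is just another inline config line, a bounded run only needs to read that number and stop after N passes, falling back to the unbounded default when it is absent. `iteration_budget()` and `run_loop()` are hypothetical names for illustration.

```python
import itertools
import re
from typing import Callable, Optional


def iteration_budget(config_text: str) -> Optional[int]:
    """Return N from an `Iterations: N` line, or None for the unbounded default."""
    match = re.search(r"^[ \t]*Iterations:[ \t]*(\d+)[ \t]*$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def run_loop(step: Callable[[int], None], budget: Optional[int]) -> int:
    """Call step(i) `budget` times, or forever when budget is None; return the count."""
    counter = itertools.count(1) if budget is None else range(1, budget + 1)
    done = 0
    for i in counter:
        step(i)
        done += 1
    return done
```

Unlike `/loop 25`, the count here is never interpreted as a time interval, which is the ambiguity the breaking change removes.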
v1.3.3 (Low, 3/16/2026)

## Bug Fix

Fixes #22 — `/autoresearch` (without suffix like `:debug` or `:fix`) was throwing "Unknown skill: autoresearch" error.

### Root Cause

Claude Code resolves slash commands via `.claude/commands/<name>.md` files. All subcommands (`:debug`, `:fix`, `:plan`, `:security`, `:ship`) had their registration files, but the **base `/autoresearch` command** was missing its registration file — only the directory existed.

### What Changed

- Added `commands/autoresearch.md` — base command regist
v1.3.2 (Low, 3/16/2026)

## What's New

All 6 subcommands now batch their `AskUserQuestion` calls — asking 3-4 questions per call instead of one at a time. Users see all configuration choices together for full context upfront.

### Batched Setup by Command

| Subcommand | Questions per Call | Topics |
|------------|-------------------|--------|
| `/autoresearch` | **Batch 1:** 4 (Goal, Scope, Metric, Direction) **Batch 2:** 3 (Verify, Guard, Launch) | |
| `/autoresearch:plan` | **Batch 1:** 4 (Goal, Scope, Metric, Directi
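Grouping setup questions into calls of at most four can be sketched as simple chunking. `batch_questions()` is a hypothetical helper; it only builds the batches and does not call `AskUserQuestion`. With the seven `/autoresearch` topics from the table it reproduces the 4 + 3 split.

```python
from typing import List


def batch_questions(topics: List[str], max_per_call: int = 4) -> List[List[str]]:
    """Split setup topics into consecutive groups of at most max_per_call."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    return [topics[i:i + max_per_call] for i in range(0, len(topics), max_per_call)]
```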
v1.3.1 (Low, 3/16/2026)

## What's New

### Interactive Setup with AskUserQuestion

All 6 commands now use Claude's `AskUserQuestion` tool for guided setup when invoked without arguments. Just type the command — Claude walks you through it with smart defaults.

**Before:** You had to know the exact syntax

```
/autoresearch
Goal: Increase test coverage from 72% to 90%
Scope: src/**/*.test.ts, src/**/*.ts
Metric: coverage % (higher is better)
Verify: npm test -- --coverage | grep "All files"
```

**After:** Just type `/au
v1.3.0 (Low, 3/16/2026)

## What's New

### `/autoresearch:debug` — Autonomous Bug Hunter

Scientific method meets autoresearch loop. Doesn't stop at one bug — iteratively hunts ALL bugs using falsifiable hypotheses and evidence-based investigation.

```bash
# Hunt all bugs
/loop 20 /autoresearch:debug

# Debug specific error
/autoresearch:debug
Symptom: API returns 500 on POST /users

# Debug then auto-fix everything found
/autoresearch:debug --fix
```

**What makes it different from regular debugging:**

| Feature | T
v1.2.0 (Low, 3/16/2026)

## What's New

### Plugin Install Support

You can now install autoresearch via the Claude Code plugin system:

```bash
/plugin install autoresearch@autoresearch
```

Or add to your `settings.json`:

```json
{
  "extraKnownMarketplaces": {
    "autoresearch": {
      "source": {
        "source": "git",
        "url": "https://github.com/uditgoenka/autoresearch.git"
      }
    }
  }
}
```

No more manual `cp -r`. The plugin system handles `skills/` and `commands/` directories automatically. Manual install still wor
v1.1.1 (Low, 3/16/2026)

## What's Fixed

**Subcommands now work.** `/autoresearch:ship`, `/autoresearch:plan`, and `/autoresearch:security` were returning "Unknown skill" because they lacked command registration files.

### Root Cause

Claude Code resolves `parent:subcommand` syntax by looking for `.md` files in `.claude/commands/<parent>/<subcommand>.md`. PR #10 (v1.1.0) added the ship workflow as documentation in SKILL.md but never created these registration files.

### Changes

- **Added `commands/autoresearch/`** —
v1.1.0 (Low, 3/16/2026)

# /autoresearch:ship — Ship Anything, Anywhere

**The 4th subcommand for Claude Autoresearch.** A universal shipping workflow that applies autoresearch loop principles to the last mile — taking any artifact from "done" to "deployed/published/delivered."

## What's New

### `/autoresearch:ship` — Universal Shipping Workflow

Ship code, content, marketing emails, sales decks, research papers, or design assets through a structured 8-phase workflow:

```
Identify → Inventory → Checklist → Prepare →
```
v1.0.4 (Low, 3/15/2026)

## What's New in v1.0.4

### Guard — Optional Regression Prevention

When optimizing a metric (e.g., benchmark time), the loop can break existing behavior. **Guard** is an optional safety net — a command that must ALWAYS pass for a change to be kept.

> **Contributed by [@pronskiy](https://github.com/pronskiy) (JetBrains) in [PR #7](https://github.com/uditgoenka/autoresearch/pull/7)**

### Usage

```
/autoresearch
Goal: Reduce API response time to under 100ms
Metric: p95 response time in ms (
```
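The keep/discard rule with a Guard can be read as "keep only if the metric improved and the guard passed". This sketch assumes a single numeric metric with a known direction and treats a missing Guard as passing; `Outcome`, `decide()`, and `run_guard()` are illustrative names, not part of the skill.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    metric: float        # value reported by the Verify command
    guard_passed: bool   # True when the Guard command succeeded (or no Guard is set)


def run_guard(command: Optional[str]) -> bool:
    """Run the Guard shell command; with no Guard configured, count it as a pass."""
    if command is None:
        return True
    return subprocess.run(command, shell=True).returncode == 0


def decide(baseline: float, outcome: Outcome, lower_is_better: bool = True) -> str:
    """Keep a change only if the metric improved AND the guard still passes."""
    if not outcome.guard_passed:
        return "discard"
    improved = outcome.metric < baseline if lower_is_better else outcome.metric > baseline
    return "keep" if improved else "discard"
```

For the p95 latency goal above, a drop from 120 ms to 95 ms is kept only if the Guard (for example, the existing test suite) still passes.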
v1.0.3 (Low, 3/15/2026)

## What's New in v1.0.3

### /autoresearch:security — Autonomous Security Audit

Turn Claude into an autonomous security auditor combining **STRIDE threat modeling**, **OWASP Top 10 sweeps** (70+ checks), and **red-team adversarial analysis** (4 hostile personas) into a single iterative loop.

### Commands Added

| Command | Description |
|---------|-------------|
| `/autoresearch:security` | Autonomous STRIDE + OWASP + red-team security audit |
| `/loop N /autoresearch:security` | Bounded s
v1.0.2 (Low, 3/14/2026)

## What's New

### /autoresearch:plan — Goal → Configuration Wizard

The hardest part of autoresearch isn't the loop — it's **defining Scope, Metric, and Verify** correctly. Get these wrong and the loop wastes iterations. Get them right and it's unstoppable.

`/autoresearch:plan` is an interactive wizard that converts your plain-language goal into a validated, ready-to-execute configuration.

```
/autoresearch:plan
Goal: Make the API respond faster
```

### How It Works

The wizard walks you thr
v1.0.1 (Low, 3/14/2026)

## What's New

### Optional Bounded Loop Count

You can now control how many iterations autoresearch runs using Claude Code's built-in `/loop` command:

```
# Run exactly 25 iterations, then stop and summarize
/loop 25 /autoresearch
Goal: Increase test coverage to 90%
```

> **Requires:** Claude Code **v1.0.32+** (the `/loop` command was introduced in this version)

### Two Loop Modes

| Mode | Usage | Behavior |
|------|-------|----------|
| **Unbounded** (default) | `/autoresearch` | Loops for
v1.0.0 (Low, 3/13/2026)

# Claude Autoresearch v1.0.0

Autonomous goal-directed iteration for Claude Code, inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch).

## What's Included

- **SKILL.md** — Main skill definition with setup phase, loop protocol, 8 critical rules, and domain adaptation table
- **autonomous-loop-protocol.md** — Detailed 8-phase loop: Review → Ideate → Modify → Commit → Verify → Decide → Log → Repeat
- **core-principles.md** — 7 universal principles generalized from autor
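The 8-phase protocol maps naturally onto a loop with injected callbacks. This is a sketch, not the contents of `autonomous-loop-protocol.md`: the callback names, the `higher_is_better` flag, and rolling back by commit id are assumptions made for illustration.

```python
from typing import Callable, Optional


def autoresearch_loop(
    review: Callable[[], str],
    ideate: Callable[[str], str],
    modify: Callable[[str], None],
    commit: Callable[[str], str],
    verify: Callable[[], float],
    rollback: Callable[[str], None],
    log: Callable[[int, str, float, str], None],
    baseline: float,
    iterations: Optional[int] = None,
    higher_is_better: bool = True,
) -> float:
    """Run Review -> Ideate -> Modify -> Commit -> Verify -> Decide -> Log -> Repeat."""
    best = baseline
    i = 0
    while iterations is None or i < iterations:
        i += 1
        context = review()          # Review: read git history and past results
        idea = ideate(context)      # Ideate: choose one change to try
        modify(idea)                # Modify: apply it within the agreed Scope
        sha = commit(idea)          # Commit: git is the loop's memory
        value = verify()            # Verify: run the metric command
        better = value > best if higher_is_better else value < best
        if better:                  # Decide: keep the improvement...
            best, status = value, "keep"
        else:                       # ...or roll the experiment back
            rollback(sha)
            status = "discard"
        log(i, sha, value, status)  # Log, then Repeat
    return best
```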


Similar Packages

hermes-agent: The agent that grows with you (v2026.6.5)
agent-search-cli: Enable AI agents to search, crawl, and extract web data with IP rotation, CAPTCHA handling, and rate limit management via CLI and Python. (main@2026-06-04)
claude-doctor-skill: Audit projects for security, broken hooks, tests, and CI issues across 20+ languages with adaptive scoring and actionable fixes. (main@2026-06-04)
vibe-replay: Turn AI coding sessions into animated, interactive web replays (v0.2.2)
pickle-rick-claude: 🥒 Pickle Rick for Claude Code — autonomous PRD-driven coding loops + relentless code review. Ralph Loop toolkit. (v1.88.0)

More in AI Agents

hermes-agent: The agent that grows with you
awesome-copilot: Community-contributed instructions, agents, skills, and configurations to help you make the most of GitHub Copilot.
CopilotKit: The Frontend Stack for Agents & Generative UI. React + Angular. Makers of the AG-UI Protocol
e2b: E2B SDK that gives agents cloud environments