OpenClawProBench

OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

Transparent live-first benchmark harness for evaluating model capability inside the OpenClaw runtime.
102 active scenarios, 162 catalog scenarios, deterministic grading, and OpenClaw-native coverage.

OpenClawProBench focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is the core profile; broader active coverage remains available through the intelligence, coverage, native, and full profiles.

The current worktree inventory reports 102 active scenarios and 162 total catalog scenarios (102 active plus 60 incubating), as shown by python3 run.py inventory --json and python3 run.py inventory --benchmark-status all --json.
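
The inventory commands above emit JSON, so the counts can be read from a script. The following is a minimal sketch, assuming only that each command prints a single JSON document to stdout. The JSON schema is not documented here, so the hypothetical helper run_json_command returns the parsed value without assuming any keys.

```python
import json
import subprocess

# Commands quoted in the paragraph above; run them from the repository root.
ACTIVE_INVENTORY_CMD = ["python3", "run.py", "inventory", "--json"]
FULL_CATALOG_CMD = [
    "python3", "run.py", "inventory", "--benchmark-status", "all", "--json",
]


def run_json_command(argv):
    """Run a command and parse its stdout as a single JSON document.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Calling run_json_command(ACTIVE_INVENTORY_CMD) from the repository root should return the parsed inventory; inspect its structure before relying on any particular field.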

Leaderboard

Browse the public leaderboard and benchmark cases at suyoumo.github.io/bench.

📢 Updates

  • v1.0.2 - Added kimi-for-coding, gemma4-31b, and kimi-k2-thinking; improved image download flows for easier mobile-device browsing.
  • v1.0.1 - Added qwen3-coder-next, doubao-seed-code, qwen3-max-2026-01-23, and a qwen3.6plus rerun with bailiancodingplan; added model image download and benchmark sharing to Twitter. Fixed completed-report resume overwrite, added a graceful fallback in tool_use_14 when the skills inventory fails to load, made tool_use_17 tolerate invalid JSON and missing files, and restored audit_scenario_quality.py compatibility.
  • v1.0.0 - OpenClawProBench released with 102 tasks across 6 domains, three-trial runs, checkpoint resume, and cross-environment resume support.

Evaluation Logic

  • Default ranking path: core
  • Extended active capability suite: intelligence
  • Native-only slice: native
  • Multi-trial runs are supported via --trials N
  • Reports expose avg_score, max_score, coverage-aware summaries, cost, latency, and resume metadata
  • Interrupted runs can continue with --continue or --resume-from, and execution failures can be re-queued with --rerun-execution-failures
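
To illustrate how avg_score and max_score relate across repeated trials, here is a minimal sketch. It assumes each scenario produces one numeric score per trial; the harness's actual aggregation is not described here, and the function name, scenario ids, and scores below are hypothetical.

```python
from statistics import mean


def summarize_trials(trial_scores):
    """Aggregate per-scenario trial scores into avg_score and max_score.

    trial_scores maps a scenario id to the list of scores from its trials.
    """
    summary = {}
    for scenario_id, scores in trial_scores.items():
        if not scores:
            raise ValueError(f"no trial scores for {scenario_id}")
        summary[scenario_id] = {
            "avg_score": mean(scores),
            "max_score": max(scores),
            "trials": len(scores),
        }
    return summary


# Hypothetical scores for two scenarios run with --trials 3.
example = {
    "scenario_a": [1.0, 0.5, 1.0],
    "scenario_b": [0.0, 0.0, 1.0],
}
report = summarize_trials(example)
for scenario_id, stats in report.items():
    print(scenario_id, round(stats["avg_score"], 3), stats["max_score"])
```

In this example scenario_b reaches a max_score of 1.0 but averages about 0.333, which is the kind of gap between best-case and typical behavior that repeated trials are meant to surface.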

Quick Start

We recommend using uv for fast, reliable Python environment setup:

pip install uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

Before running the benchmark, make sure your local OpenClaw runtime is available:

openclaw --help
openclaw agents list --json

Inspect the benchmark catalog and validate the scenario set:

python3 run.py inventory
python3 run.py inventory --json
python3 run.py dry

Run a one-trial smoke test on the default ranking benchmark:

python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 1 \
  --cleanup-agents

Run the full default benchmark:

python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 3 \
  --cleanup-agents
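
When scripting runs, the two invocations above differ only in the trial count. The sketch below builds the argument list using only the flags shown in this section; build_run_command is a hypothetical helper, not part of the harness.

```python
import shlex


def build_run_command(model, trials, profile="core", cleanup_agents=True):
    """Build the argv for `python3 run.py run` using flags shown above."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    argv = [
        "python3", "run.py", "run",
        "--model", model,
        "--execution-mode", "live",
        "--benchmark-profile", profile,
        "--trials", str(trials),
    ]
    if cleanup_agents:
        argv.append("--cleanup-agents")
    return argv


smoke = build_run_command("<MODEL>", trials=1)  # one-trial smoke test
full = build_run_command("<MODEL>", trials=3)   # full default benchmark
print(shlex.join(smoke))
```

The resulting list can be passed to subprocess.run(argv, check=True) from the repository root; passing a list rather than a shell string avoids quoting problems in model names.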

Compare generated reports:

python3 run.py compare --results-dir results

For isolated same-host runs, the harness also supports:

  • --openclaw-profile
  • --openclaw-state-dir
  • --openclaw-config-path
  • --openclaw-gateway-port
  • --openclaw-binary
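
As a rough sketch of how these flags might keep two same-host runs apart, the helper below derives a separate state directory, config path, and gateway port per run. It assumes each flag accepts a single value (a name, a path, or a port number), which this section does not confirm; the helper name, directory layout, and port numbers are all hypothetical.

```python
from pathlib import Path


def isolation_flags(run_name, base_dir, gateway_port):
    """Return per-run isolation flags so two runs do not share state.

    Assumption: every flag takes exactly one value. Verify this against
    the harness before relying on it.
    """
    run_dir = Path(base_dir) / run_name
    return [
        "--openclaw-profile", run_name,
        "--openclaw-state-dir", str(run_dir / "state"),
        "--openclaw-config-path", str(run_dir / "openclaw.json"),
        "--openclaw-gateway-port", str(gateway_port),
    ]


# Two hypothetical runs with distinct names, directories, and ports.
flags_a = isolation_flags("run-a", "isolated-runs", 18001)
flags_b = isolation_flags("run-b", "isolated-runs", 18002)
print(flags_a)
print(flags_b)
```

The returned list is meant to be appended to a run command's argument list, one set per concurrent run.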

Benchmark Profiles

Profile       Active scenarios  Purpose
core          26                Default ranking suite
intelligence  95                Extended active capability benchmark
coverage      7                 Lower-stakes breadth and regression slice
native        36                Active OpenClaw-native slice only
full          102               Union of all active scenarios

The benchmark catalog also includes 60 incubating scenarios that can be inspected with --benchmark-status all.
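
The counts quoted in this README can be cross-checked with simple arithmetic. The sketch below copies the numbers from the profile table and the catalog notes and verifies two properties: no profile exceeds full, and active plus incubating scenarios equal the catalog total. The function name is hypothetical.

```python
# Counts copied from the profile table and catalog notes in this README.
PROFILE_COUNTS = {
    "core": 26,
    "intelligence": 95,
    "coverage": 7,
    "native": 36,
    "full": 102,
}
INCUBATING_COUNT = 60
CATALOG_TOTAL = 162


def check_counts(profile_counts, incubating, catalog_total):
    """Return a list of inconsistencies; an empty list means the counts agree."""
    problems = []
    full = profile_counts["full"]
    # `full` is the union of all active scenarios, so no profile can exceed it.
    for name, count in profile_counts.items():
        if count > full:
            problems.append(f"{name} has {count} scenarios, more than full ({full})")
    # Active plus incubating scenarios should account for the whole catalog.
    if full + incubating != catalog_total:
        problems.append(
            f"{full} active + {incubating} incubating != {catalog_total} catalog"
        )
    return problems


print(check_counts(PROFILE_COUNTS, INCUBATING_COUNT, CATALOG_TOTAL))  # prints []
```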

OpenClaw Runtime

Live runs expect a working local openclaw CLI plus the auth and config required by the surfaces exercised by the selected scenarios. If your binary is not on PATH, set OPENCLAW_BINARY or pass --openclaw-binary.
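
The binary lookup described above can be sketched as a three-step fallback. The precedence used here (an explicit --openclaw-binary value first, then OPENCLAW_BINARY, then PATH) is an assumption rather than documented harness behavior, and resolve_openclaw_binary is a hypothetical helper.

```python
import os
import shutil


def resolve_openclaw_binary(cli_value=None, env=None):
    """Pick the openclaw binary: CLI value, then OPENCLAW_BINARY, then PATH.

    Assumption: an explicit CLI value takes priority over the environment
    variable; the harness may resolve these differently.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    env_value = env.get("OPENCLAW_BINARY")
    if env_value:
        return env_value
    # Fall back to searching the PATH from the supplied environment.
    found = shutil.which("openclaw", path=env.get("PATH"))
    if found is None:
        raise FileNotFoundError(
            "openclaw not found; set OPENCLAW_BINARY or pass --openclaw-binary"
        )
    return found
```

Failing early with a clear message when the binary cannot be found is usually preferable to discovering the problem midway through a live run.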

config/openclaw.json.template is provided as a reference template for local OpenClaw configuration and isolated-run setups.

Repo Map

  • run.py: CLI entrypoint for inventory, dry, run, and compare
  • harness/: loader, runner, scoring, reporting, and live OpenClaw bridge
  • scenarios/: benchmark tasks in YAML
  • datasets/: seeded live-task data and optional setup / teardown scripts
  • custom_checks/: scenario-specific grading logic
  • tests/: regression coverage for loader, runner, scoring, and reporting
  • docs/: public assets plus evaluation validation and benchmark-profile policy

Generated Output

Benchmark reports are written to results/. They are generated runtime artifacts and are intentionally ignored by version control in this repo layout.

Citation

If you use OpenClawProBench in your research, please cite:

@misc{openclawprobench2026,
  title={OpenClawProBench --- a transparent benchmark for true intelligence in real-world AI agents.},
  author={suyoumo},
  year={2026},
  url={https://github.com/suyoumo/OpenClawProBench}
}

Contribution

We welcome issues, documentation fixes, scenario improvements, grader hardening, and benchmark-engine contributions. See CONTRIBUTING.md for setup and validation guidance.

Acknowledgements

This project was informed by prior open-source work on agent evaluation, benchmark design, and real-world task assessment.

We drew ideas from projects such as PinchBench, Claw-Eval, AgencyBench, and related agent-benchmark efforts, especially in areas like task design, evaluation methodology, harness structure, and public benchmark presentation.

Some tasks in this repository are adapted and reworked from earlier public benchmark-style task sets into the OpenClaw runtime and grading framework.

Contributors

Public contributor list: pending.

Discussion Group

Join our WeChat discussion group to discuss OpenClaw with other users and builders.

Release History

Version          Changes                            Urgency  Date
main@2026-05-19  Latest activity on main branch     High     5/19/2026
0.0.0            No release found; using repo HEAD  High     4/11/2026
