freshcrate
Home > Testing > OpenClawProBench

OpenClawProBench

OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

Description

OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

README

OpenClawProBench Logo

OpenClawProBench

Active Scenarios Catalog Core Profile Execution License

Transparent live-first benchmark harness for evaluating model capability inside the OpenClaw runtime.
102 active scenarios, 162 catalog scenarios, deterministic grading, and OpenClaw-native coverage.

OpenClawProBench focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is the core profile; broader active coverage remains available through intelligence, coverage, native, and full.

The current worktree inventory reports 102 active scenarios and 162 total catalog scenarios (60 incubating) via python3 run.py inventory --json and python3 run.py inventory --benchmark-status all --json.

Leaderboard

Browse the public leaderboard and benchmark cases at suyoumo.github.io/bench.

OpenClawProBench leaderboard preview

📢 Updates

  • v1.0.2 - Added kimi-for-coding, gemma4-31b, and kimi-k2-thinking; improved image download flows for easier mobile-device browsing.
  • v1.0.1 - Added qwen3-coder-next, doubao-seed-code, qwen3-max-2026-01-23, and qwen3.6plus rerun with bailiancodingplan; added model image download and benchmark sharing to Twitter; fixed completed-report resume overwrite, tool_use_14 graceful fallback on skills inventory load failure, tool_use_17 invalid JSON and missing-file tolerance, and audit_scenario_quality.py compatibility.
  • v1.0.0 - OpenClawProBench released with 102 tasks across 6 domains, with 3-try runs, checkpoint resume, and cross-environment resume support.

Evaluation Logic

  • Default ranking path: core
  • Extended active capability suite: intelligence
  • Native-only slice: native
  • Multi-trial runs are supported via --trials N
  • Reports expose avg_score, max_score, coverage-aware summaries, cost, latency, and resume metadata
  • Interrupted runs can continue with --continue or --resume-from, and execution failures can be re-queued with --rerun-execution-failures

Quick Start

We recommend using uv for fast, reliable Python environment setup:

pip install uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

Before running the benchmark, make sure your local OpenClaw runtime is available:

openclaw --help
openclaw agents list --json

Inspect the benchmark catalog and validate the scenario set:

python3 run.py inventory
python3 run.py inventory --json
python3 run.py dry

Run a one-trial smoke on the default ranking benchmark:

python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 1 \
  --cleanup-agents

Run the full default benchmark:

python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 3 \
  --cleanup-agents

Compare generated reports:

python3 run.py compare --results-dir results

For isolated same-host runs, the harness also supports:

  • --openclaw-profile
  • --openclaw-state-dir
  • --openclaw-config-path
  • --openclaw-gateway-port
  • --openclaw-binary

Benchmark Profiles

Profile Active scenarios Purpose
core 26 Default ranking suite
intelligence 95 Extended active capability benchmark
coverage 7 Lower-stakes breadth and regression slice
native 36 Active OpenClaw-native slice only
full 102 Union of all active scenarios

The benchmark catalog also includes 60 incubating scenarios that can be inspected with --benchmark-status all.

OpenClaw Runtime

Live runs expect a working local openclaw CLI plus the auth and config required by the surfaces exercised by the selected scenarios. If your binary is not on PATH, set OPENCLAW_BINARY or pass --openclaw-binary.

config/openclaw.json.template is provided as a reference template for local OpenClaw configuration and isolated-run setups.

Repo Map

  • run.py: CLI entrypoint for inventory, dry, run, and compare
  • harness/: loader, runner, scoring, reporting, and live OpenClaw bridge
  • scenarios/: benchmark tasks in YAML
  • datasets/: seeded live-task data and optional setup / teardown scripts
  • custom_checks/: scenario-specific grading logic
  • tests/: regression coverage for loader, runner, scoring, and reporting
  • docs/: public assets plus evaluation validation and benchmark-profile policy

Generated Output

Benchmark reports are written to results/. They are generated runtime artifacts and are intentionally ignored by version control in this repo layout.

Citation

If you use OpenClawProBench in your research, please cite:

@misc{openclawprobench2026,
  title={OpenClawProBench — a transparent benchmark for true intelligence in real-world AI agents.},
  author={suyoumo},
  year={2026},
  url={https://github.com/suyoumo/OpenClawProBench}
}

Contribution

We welcome issues, documentation fixes, scenario improvements, grader hardening, and benchmark-engine contributions. See CONTRIBUTING.md for setup and validation guidance.

Acknowledgements

This project was informed by prior open-source work on agent evaluation, benchmark design, and real-world task assessment.

We drew ideas from projects such as PinchBench, Claw-Eval, AgencyBench, and related agent-benchmark efforts, especially in areas like task design, evaluation methodology, harness structure, and public benchmark presentation.

Some tasks in this repository are adapted and reworked from earlier public benchmark-style task sets into the OpenClaw runtime and grading framework.

Contributors

Public contributor list: waiting.

Discussion Group

OpenClaw WeChat community QR code

Join our WeChat discussion group to discuss OpenClaw with other users and builders.

Release History

VersionChangesUrgencyDate
main@2026-04-15Latest activity on main branchHigh4/15/2026
0.0.0No release found — using repo HEADHigh4/11/2026

Dependencies & License Audit

Loading dependencies...

Similar Packages

claw-evalClaw-Eval is an evaluation harness for evaluating LLM as agents. All tasks verified by humans.main@2026-04-15
aragA-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.v0.1.0
fastRAGEfficient Retrieval Augmentation and Generation Frameworkv3.1.2
openclaw-model-bridgeConnect any LLM to OpenClaw — production-tested middleware for Qwen3-235B and beyondmain@2026-04-21
Ollama-Terminal-AgentAutomate shell tasks using a local Ollama model that plans, executes, and fixes commands without cloud or API dependencies.main@2026-04-21