
claw-eval

Claw-Eval is an evaluation harness for assessing LLMs as agents. All tasks are verified by humans.


README


Claw-Eval


Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents.
300 human-verified tasks | 2,159 rubrics | 9 categories | Completion · Safety · Robustness.


Leaderboard

Browse the full leaderboard and individual task cases at claw-eval.github.io.

Evaluation Logic (Updated March 2026):

  • Primary Metric: Pass^3. To eliminate "lucky runs," a model must pass a task across three independent trials ($N=3$) to earn a success credit.
  • Strict Pass Criterion: under Pass^3, a task is marked as passed only if the model meets the success criteria in all three runs; a single failed run marks the whole task as failed.
  • Reproducibility: We are committed to end-to-end reproducibility. Our codebase is currently being audited to ensure all benchmark results on the leaderboard can be verified by the community.
  • Handling API Instability: In the event of execution errors caused by network or API fluctuations, we manually re-trigger the evaluation to ensure exactly 3 trajectories are successfully generated.
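The Pass^3 rule above can be sketched as a small scoring function. This is an illustrative sketch only; function and field names are invented and do not reflect the harness's actual API.

```python
# Hypothetical sketch of the Pass^3 scoring rule; names are illustrative.

def pass_pow_n(trial_results, n=3):
    """trial_results: one boolean per independent trial.
    A task earns credit only if there are exactly n trials and all passed."""
    if len(trial_results) != n:
        # Mirrors the re-trigger policy: scoring assumes exactly n trajectories.
        raise ValueError(f"expected {n} trials, got {len(trial_results)}")
    return all(trial_results)

def leaderboard_score(results_by_task, n=3):
    """results_by_task: {task_id: [bool, ...]} -> fraction of tasks passed under Pass^n."""
    passed = sum(pass_pow_n(runs, n) for runs in results_by_task.values())
    return passed / len(results_by_task)
```

A task with runs [True, False, True] scores zero under this rule, whereas a plain pass@1 metric would have credited it.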

📢 Updates

  • v1.1.0 – 300 human-verified tasks in 9 categories: agents perceive, reason, create, and deliver.

  • v1.0.0 – Built on reproducible real-world complexity.

  • v0.0.0 – From chatbot to real world. (2026.3)

Tasks

300 tasks across 3 splits and 9 categories, each with human-verified rubrics.

| Split | Count | Description |
|---|---|---|
| general | 161 | Core agent tasks across communication, finance, ops, productivity, etc. |
| multimodal | 101 | Perception and creation: webpage generation, video QA, document extraction, etc. |
| multi_turn | 38 | Conversational tasks with simulated user personas for clarification and advice |

Agents are graded on three dimensions through full-trajectory auditing:

  • Completion – did the agent finish the task?
  • Safety – did it avoid harmful or unauthorized actions?
  • Robustness – does it pass consistently across multiple trials?
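The three dimensions above can be pictured as per-trial flags aggregated over a trajectory set. The flag names and the all-trials aggregation rule here are assumptions for illustration, not the official grader logic.

```python
# Illustrative sketch of the three audit dimensions; field names and the
# aggregation rule are assumptions, not the actual Claw-Eval grader.

def audit_task(trials):
    """trials: list of dicts with boolean 'completion' and 'safety' flags,
    one dict per independent trial."""
    return {
        "completion": all(t["completion"] for t in trials),
        "safety": all(t["safety"] for t in trials),
        # Robustness: every trial must both complete the task and stay safe.
        "robustness": all(t["completion"] and t["safety"] for t in trials),
    }
```

Under this sketch, a single unsafe trial sinks robustness even if every trial completed the task.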

Dataset

Available on Hugging Face: claw-eval/Claw-Eval

| Field | Type | Description |
|---|---|---|
| task_id | string | Unique task identifier |
| query | string | Task instruction / description |
| fixture | list[string] | Fixture files required (available in data/fixtures.tar.gz) |
| language | string | en or zh |
| category | string | Task domain |
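A record matching this schema can be modeled as a small dataclass. Real records come from the Hugging Face dataset claw-eval/Claw-Eval; every field value below is invented for illustration.

```python
# Hypothetical record mirroring the schema above; all values are invented.
from dataclasses import dataclass
from typing import List

@dataclass
class ClawEvalTask:
    task_id: str
    query: str
    fixture: List[str]   # fixture file paths, shipped in data/fixtures.tar.gz
    language: str        # "en" or "zh"
    category: str

    def __post_init__(self):
        if self.language not in ("en", "zh"):
            raise ValueError("language must be 'en' or 'zh'")

example = ClawEvalTask(
    task_id="demo_0001",                                    # invented id
    query="Extract the totals from the attached invoice.",  # invented query
    fixture=["fixtures/invoice.pdf"],                       # invented path
    language="en",
    category="productivity",
)
```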

Quick Start

We recommend using uv for fast, reliable dependency management:

pip install uv
uv venv --python 3.11
source .venv/bin/activate

Prepare your keys and set up the environments with one command:

export OPENROUTER_API_KEY=sk-or-...
export SERP_DEV_KEY=... # needed for tasks that require real web search
bash scripts/test_sandbox.sh

Note on video fixtures: Due to file size limits, this GitHub repository does not include video files for video-related tasks. The complete fixtures (including all videos) are available on Hugging Face: claw-eval/Claw-Eval.

Note on grading: gemini-3-flash is used as the grader for general and multimodal tasks, while claude-opus-4.6 serves as both the grader and the user-agent in multi_turn tasks.

Go rock 🚀

claw-eval batch --config model_configs/claude_opus_46.yaml --sandbox --trials 3 --parallel 16
# For different task splits, pass the matching config: config_general.yaml, config_multimodal.yaml, or config_user_agent.yaml.
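To sweep all three splits, the batch invocation above can be built once per config. This is a sketch that only constructs the command lines shown in the README; it assumes the claw-eval CLI is on PATH, and runs nothing unless dry_run=False.

```python
# Sketch: one `claw-eval batch` invocation per split config from the README.
import subprocess

CONFIGS = ["config_general.yaml", "config_multimodal.yaml", "config_user_agent.yaml"]

def batch_commands(trials=3, parallel=16):
    """Return one claw-eval batch command (as an argv list) per split config."""
    return [
        ["claw-eval", "batch", "--config", cfg,
         "--sandbox", "--trials", str(trials), "--parallel", str(parallel)]
        for cfg in CONFIGS
    ]

def run_all(dry_run=True):
    """Execute each command in sequence; dry_run=True only returns them."""
    cmds = batch_commands()
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```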

Roadmap

  • More real-world, multimodal tasks in complex productivity environments
  • Comprehensive, fine-grained scoring logic with deep state verification
  • Enhanced sandbox isolation and full-trace tracking for transparent, scalable evaluation

Contribution

We welcome any kind of contribution. Let us know if you have any suggestions!

Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.

Core Contributors

Bowen Ye (PKU), Rang Li (PKU), Qibin Yang (PKU), Zhihui Xie (HKU), Yuanxin Liu (PKU), Linli Yao (PKU), Hanglong Lyu (PKU), Lei Li (HKU, project lead)

Advisors

Tong Yang (PKU), Zhifang Sui (PKU), Lingpeng Kong (HKU), Qi Liu (HKU)

Citation

If you use Claw-Eval in your research, please cite:

@misc{claw-eval2026,
  title={Claw-Eval: End-to-End Transparent Benchmark for AI Agents in the Real World},
  author={Ye, Bowen and Li, Rang and Yang, Qibin and Xie, Zhihui and Liu, Yuanxin and Yao, Linli and Lyu, Hanglong and Li, Lei},
  year={2026},
  url={https://github.com/claw-eval/claw-eval}
}

License

This project is released under the MIT License.

Release History

| Version | Changes | Urgency | Date |
|---|---|---|---|
| main@2026-04-15 | Latest activity on main branch | High | 4/15/2026 |
| 0.0.0 | No release found – using repo HEAD | High | 4/11/2026 |

