skill

PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai

README

🦀 PinchBench

Real-world benchmarks for AI coding agents

Note: This repository contains the benchmark skill/tasks. It is NOT the source of official leaderboard results. To add models to the official results, modify pinchbench/scripts/default-models.yml.

PinchBench measures how well LLM models perform as the brain of an OpenClaw agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.

Results are collected on a public leaderboard at pinchbench.com.

Why PinchBench?

Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:

  • Tool usage: Can the model call the right tools with the right parameters?
  • Multi-step reasoning: Can it chain together actions to complete complex tasks?
  • Real-world messiness: Can it handle ambiguous instructions and incomplete information?
  • Practical outcomes: Did it actually create the file, send the email, or schedule the meeting?

Quick Start

# Clone the skill
git clone https://github.com/pinchbench/skill.git
cd skill

# Run benchmarks with your model of choice
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

# Or run specific tasks
./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_calendar,task_stock

Note: Model IDs must include their provider prefix (e.g. openrouter/, anthropic/). OpenRouter is the default provider used for routing.
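
The prefix rule above can be checked before a run starts. The following is a minimal sketch, not part of PinchBench: `split_model_id` and `KNOWN_PROVIDERS` are hypothetical names, and the provider set only lists the prefixes that appear in this README, so the real runner may accept others.

```python
# Provider prefixes that appear in this README (an assumption, not an
# exhaustive list of what the runner accepts).
KNOWN_PROVIDERS = {"openrouter", "anthropic", "openai"}


def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/rest' into (provider, rest).

    Raises ValueError when the ID has no recognised provider prefix.
    """
    provider, sep, rest = model_id.partition("/")
    if not sep or not rest or provider not in KNOWN_PROVIDERS:
        raise ValueError(f"model ID needs a provider prefix: {model_id!r}")
    return provider, rest


if __name__ == "__main__":
    # prints ('openrouter', 'anthropic/claude-sonnet-4')
    print(split_model_id("openrouter/anthropic/claude-sonnet-4"))
```

Only the first path segment is treated as the provider, so an OpenRouter ID such as openrouter/anthropic/claude-sonnet-4 keeps its vendor segment in the remainder.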

Requirements:

  • Python 3.10+
  • uv package manager
  • A running OpenClaw instance

What Gets Tested

PinchBench includes 53 tasks across real-world categories:
| Category | Tasks | What's tested |
|---|---|---|
| Productivity | Calendar, daily summaries | Event creation, time parsing, scheduling |
| Research | Stock prices, conferences, markets | Web search, data extraction, synthesis |
| Writing | Blog posts, emails, humanization | Content generation, tone, formatting |
| Coding | Weather scripts, file structures | Code generation, file operations |
| Analysis | Spreadsheets, PDFs, documents | Data processing, summarization |
| Email | Triage, search | Inbox management, filtering |
| Memory | Context retrieval, knowledge management | Long-term memory, recall |
| Skills | ClawHub, skill discovery | OpenClaw ecosystem integration |

Each task is graded automatically, by an LLM judge, or both, ensuring both objective and nuanced evaluation.

Submitting Results

To get your results on the leaderboard:

# Register for an API token (one-time)
./scripts/run.sh --register

# Run benchmark; results auto-upload with your token
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

Skip uploading with --no-upload if you just want local results.

Official Results

To submit an official run (marked on the leaderboard):

# Using environment variable
export PINCHBENCH_OFFICIAL_KEY=your_official_key
./scripts/run.sh --model anthropic/claude-sonnet-4

# Using command line flag
./scripts/run.sh --model anthropic/claude-sonnet-4 --official-key your_official_key

Command Reference

| Flag | Description |
|---|---|
| --model MODEL | Model to test (e.g., openrouter/anthropic/claude-sonnet-4) |
| --judge MODEL | Judge model for LLM grading; uses direct API when set (see below) |
| --suite SUITE | all, automated-only, or comma-separated task IDs |
| --runs N | Number of runs per task for averaging |
| --timeout-multiplier N | Scale timeouts for slower models |
| --output-dir DIR | Where to save results (default: results/) |
| --no-upload | Skip uploading to leaderboard |
| --register | Request an API token for submissions |
| --upload FILE | Upload a previous results JSON |
| --official-key KEY | Mark submission as official (or use PINCHBENCH_OFFICIAL_KEY env var) |
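
When runs are launched from a script, it can help to assemble the flags programmatically. The sketch below is an illustration only: `build_run_command` is a hypothetical helper, it covers just a subset of the flags listed above, and it builds the argument list without executing anything.

```python
import shlex


def build_run_command(model, suite=None, runs=None, judge=None, upload=True):
    """Assemble an argv list for scripts/run.sh from the documented flags."""
    argv = ["./scripts/run.sh", "--model", model]
    if judge is not None:
        argv += ["--judge", judge]
    if suite:
        # --suite takes all, automated-only, or comma-separated task IDs,
        # so a list of task IDs is joined with commas.
        argv += ["--suite", suite if isinstance(suite, str) else ",".join(suite)]
    if runs is not None:
        argv += ["--runs", str(runs)]
    if not upload:
        argv.append("--no-upload")
    return argv


if __name__ == "__main__":
    cmd = build_run_command(
        "openrouter/openai/gpt-4o",
        suite=["task_calendar", "task_stock"],
        runs=3,
        upload=False,
    )
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root; keeping it as a list avoids shell quoting issues.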

Judge

By default (no --judge flag), the LLM judge runs as an OpenClaw agent session. When --judge is specified, it calls the model API directly instead, bypassing OpenClaw personality injection.

# Default: OpenClaw agent session (no --judge needed)
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

# Direct API via OpenRouter
./scripts/run.sh --model openai/gpt-4o --judge openrouter/anthropic/claude-sonnet-4-5

# Direct API via Anthropic
./scripts/run.sh --model openai/gpt-4o --judge anthropic/claude-sonnet-4-5-20250514

# Direct API via OpenAI
./scripts/run.sh --model openai/gpt-4o --judge openai/gpt-4o

# Headless Claude CLI
./scripts/run.sh --model openai/gpt-4o --judge claude

Required env vars: OPENROUTER_API_KEY, ANTHROPIC_API_KEY, or OPENAI_API_KEY depending on the judge model prefix.
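
A pre-flight check for the judge key might look like the sketch below. The prefix-to-variable mapping comes from the note above; the helper names are hypothetical, and the bare `claude` judge is treated as needing none of these variables only because this README does not state a requirement for it.

```python
import os

# Mapping taken from the README note on required env vars.
ENV_VAR_BY_PREFIX = {
    "openrouter": "OPENROUTER_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def required_env_var(judge):
    """Return the env var name for a judge model ID, or None if unknown."""
    prefix = judge.split("/", 1)[0]
    return ENV_VAR_BY_PREFIX.get(prefix)


def missing_judge_key(judge, env=None):
    """Return the name of the unset variable, or None when nothing is missing."""
    env = os.environ if env is None else env
    name = required_env_var(judge)
    if name is not None and not env.get(name):
        return name
    return None


if __name__ == "__main__":
    missing = missing_judge_key("openrouter/anthropic/claude-sonnet-4-5")
    if missing:
        print(f"Set {missing} before running with this judge")
```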

Contributing Tasks

We welcome new tasks! Check out tasks/TASK_TEMPLATE.md for the format. Good tasks are:

  • Real-world: Something an actual user would ask an agent to do
  • Measurable: Clear success criteria that can be graded
  • Reproducible: Same task should produce consistent grading
  • Challenging: Tests agent capabilities, not just LLM knowledge

Transcript Archive

Session transcripts are automatically saved to results/{run_id}_transcripts/ alongside the results JSON. Each task's full agent conversation is preserved as a JSONL file (e.g. task_calendar.jsonl) for post-run analysis.
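
For post-run analysis, the JSONL files can be read one record per line. The sketch below is a starting point under stated assumptions: `summarize_transcripts` is a hypothetical helper, and the record schema is not documented here, so the top-level `type` key it counts is an assumption and records without it are tallied as "unknown".

```python
import json
from collections import Counter
from pathlib import Path


def summarize_transcripts(transcript_dir):
    """Return {task_id: Counter of top-level 'type' values} per JSONL file."""
    summary = {}
    for path in sorted(Path(transcript_dir).glob("*.jsonl")):
        counts = Counter()
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                # Assumed schema: a JSON object with a "type" field.
                if isinstance(record, dict):
                    counts[record.get("type", "unknown")] += 1
                else:
                    counts["unknown"] += 1
        # File stem doubles as the task ID, e.g. task_calendar.
        summary[path.stem] = counts
    return summary


if __name__ == "__main__":
    # Replace RUN_ID with the ID of an actual run.
    for task, counts in summarize_transcripts("results/RUN_ID_transcripts").items():
        print(task, dict(counts))
```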

License

MIT. See LICENSE for details.


Claw-some AI agent testing 🦞

Release History

| Version | Changes | Urgency | Date |
|---|---|---|---|
| v2.0.0 | PinchBench 2.0.0 🦀. A major release with significant expansion of the benchmark suite and infrastructure improvements. Highlights: **148 tasks** (up from ~25 in v1.x): nearly 6x more comprehensive benchmark coverage; **Parallel judge execution**: overlaps grading with task execution for faster benchmarks; **Haiku judge by default**: faster grading without sacrificing accuracy; **Thinking-level support**: benchmark models at different reasoning intensities; **Axiom observability**… | High | 5/6/2026 |
| v2.0.0-rc11 | Fix: **The model was responding correctly all along!** gpt-oss-120b said 'Hello, I'm ready!'; we just weren't looking in the right place. The grader was checking for `type='message'` entries but OpenClaw trajectory format uses `type='model.completed'` with `data.assistantTexts`. Changes in this release: Grader now checks `type='model.completed'` and `data.assistantTexts`; rc4-rc10 fixed the bootstrap preamble issue (BOOTSTRAP.md now correctly missing); rc11 fixes the grader to act… | High | 4/27/2026 |
| v1.2.1 | What's Changed. Infrastructure: **fix: use RELEASE_PAT for release workflow**: Fixes the release workflow to bypass branch protection when updating BENCHMARK_VERSION (#118). This is a patch release to test the automated version bump workflow. Full Changelog: https://github.com/pinchbench/skill/compare/v1.2.0...v1.2.1 | High | 4/6/2026 |
| v1.2.0 | What's Changed. New Features: **Custom OpenAI-compatible endpoints**: Use `--base-url` and `--api-key` flags to benchmark against local inference servers, Together, Fireworks, or any OpenAI-compatible API (#84); **Direct API judge backend**: New `--judge` flag bypasses OpenClaw personality files to get clean JSON responses from judges (#87); **RunTrendAnalyzer**: New analysis tool for detecting score regression across runs with statistical trend analysis (#104); **Transcript archi… | Medium | 4/6/2026 |
| v1.1.0 | What's New. Major Features: **Multi-session support**: Benchmark tasks now run across multiple sessions for better isolation and reliability; **Fail-fast sanity checks**: Invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run; **Score summary logging**: Final results include submission ID and summary stats for easier tracking. Bug Fixes: **Task 10 grader compatibility**: Now supports both `read` and `read_file` tool names (OpenClaw/C… | Low | 3/19/2026 |
| 1.0.0 | PinchBench 1.0.0 Release Notes. Overview: PinchBench 1.0.0 marks our first stable release, a fully automated, open-source LLM benchmarking platform that measures how well AI coding agents handle real-world development tasks. This release brings together four months of development across our skill framework, API backend, leaderboard frontend, and orchestration infrastructure. What's New: 🏆 Official Benchmark Submissions: Benchmark runs can now be tagged as "official" u… | Low | 3/17/2026 |
