# skill

> PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai

- **URL**: https://www.freshcrate.ai/projects/skill
- **Author**: pinchbench
- **Category**: Uncategorized
- **Latest version**: `v2.0.0` (2026-05-06)
- **License**: MIT
- **Source**: https://github.com/pinchbench/skill
- **Homepage**: https://pinchbench.com
- **Language**: Python
- **GitHub**: 1,039 stars, 116 forks
- **Registry**: github
- **Tags**: `python`

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v2.0.0` | 2026-05-06 | High | **PinchBench 2.0.0** 🦀 A major release with significant expansion of the benchmark suite and infrastructure improvements. **Highlights:** **148 tasks** (up from ~25 in v1.x), nearly 6x more comprehensive benchmark coverage; **Parallel judge execution**, which overlaps grading with task execution for faster benchmarks; **Haiku judge by default**, for faster grading without sacrificing accuracy; **Thinking-level support**, to benchmark models at different reasoning intensities; **Axiom observability** … |
| `v2.0.0-rc11` | 2026-04-27 | High | **Fix:** **The model was responding correctly all along!** gpt-oss-120b said 'Hello, I'm ready!'; we just weren't looking in the right place. The grader was checking for `type='message'` entries, but the OpenClaw trajectory format uses `type='model.completed'` with `data.assistantTexts`. **Changes in this release:** the grader now checks `type='model.completed'` and `data.assistantTexts`; rc4-rc10 fixed the bootstrap preamble issue (BOOTSTRAP.md now correctly missing); rc11 fixes the grader to act … |
| `v1.2.1` | 2026-04-06 | High | **Infrastructure:** **fix: use RELEASE_PAT for release workflow**, which fixes the release workflow to bypass branch protection when updating BENCHMARK_VERSION (#118). This is a patch release to test the automated version bump workflow. **Full Changelog**: https://github.com/pinchbench/skill/compare/v1.2.0...v1.2.1 |
| `v1.2.0` | 2026-04-06 | Medium | **New Features:** **Custom OpenAI-compatible endpoints**: use `--base-url` and `--api-key` flags to benchmark against local inference servers, Together, Fireworks, or any OpenAI-compatible API (#84); **Direct API judge backend**: new `--judge` flag bypasses OpenClaw personality files to get clean JSON responses from judges (#87); **RunTrendAnalyzer**: new analysis tool for detecting score regression across runs with statistical trend analysis (#104); **Transcript archi…** |
| `v1.1.0` | 2026-03-19 | Low | **Major Features:** **Multi-session support**: benchmark tasks now run across multiple sessions for better isolation and reliability; **Fail-fast sanity checks**: invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run; **Score summary logging**: final results include submission ID and summary stats for easier tracking. **Bug Fixes:** **Task 10 grader compatibility**: now supports both `read` and `read_file` tool names (OpenClaw/C… |
| `1.0.0` | 2026-03-17 | Low | **PinchBench 1.0.0 Release Notes.** **Overview:** PinchBench 1.0.0 marks our first stable release: a fully automated, open-source LLM benchmarking platform that measures how well AI coding agents handle real-world development tasks. This release brings together four months of development across our skill framework, API backend, leaderboard frontend, and orchestration infrastructure. **What's New:** 🏆 **Official Benchmark Submissions**: benchmark runs can now be tagged as "official" u… |
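
The `v2.0.0-rc11` note above describes the grader switching from `type='message'` entries to `type='model.completed'` entries carrying `data.assistantTexts`. The following is a minimal sketch of that kind of lookup, not PinchBench's actual grader: the function name `extract_assistant_texts`, the assumption that a trajectory is an iterable of dictionaries, and the sample data are all hypothetical; only the `type` value and the `data.assistantTexts` field come from the release note.

```python
from typing import Any, Iterable


def extract_assistant_texts(trajectory: Iterable[dict[str, Any]]) -> list[str]:
    """Collect non-empty assistant texts from `model.completed` entries.

    Hypothetical helper: assumes each trajectory entry is a dict with a
    `type` key and, for completed model turns, a `data.assistantTexts` list.
    """
    texts: list[str] = []
    for entry in trajectory:
        if entry.get("type") != "model.completed":
            continue  # other entry types (e.g. `message`) are skipped
        data = entry.get("data") or {}
        for text in data.get("assistantTexts") or []:
            if isinstance(text, str) and text.strip():
                texts.append(text)
    return texts


# Made-up trajectory shaped the way the release note describes.
sample = [
    {"type": "message", "content": "ignored by this sketch"},
    {"type": "model.completed", "data": {"assistantTexts": ["Hello, I'm ready!"]}},
]
print(extract_assistant_texts(sample))
```

A grader that only scanned for `message` entries would find no assistant output in `sample`, which matches the symptom the release note reports.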

## Citation

- HTML: https://www.freshcrate.ai/projects/skill
- Markdown: https://www.freshcrate.ai/projects/skill.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/skill/deps

_Generated by freshcrate.ai. Indexes GitHub releases for AI-agent ecosystem packages._