# evals

> A comprehensive evaluation framework for AI agents and LLM applications.

- **URL**: https://www.freshcrate.ai/projects/evals
- **Author**: strands-agents
- **Category**: Frameworks
- **Latest version**: `v0.2.1` (2026-05-29)
- **License**: Apache-2.0
- **Source**: https://github.com/strands-agents/evals
- **Homepage**: https://strandsagents.com
- **Language**: Python
- **GitHub**: 106 stars, 31 forks
- **Registry**: github
- **Tags**: `agentic`, `agentic-ai`, `ai`, `evaluation`, `machine-learning`, `python`, `strands-agents`

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v0.2.1` | 2026-05-29 | High | chore: added evals-skills by @poshinchen in https://github.com/strands-agents/evals/pull/231<br>feat: add chaos testing module for fault injection by @ybdarrenwang in https://github.com/strands-agents/evals/pull/224<br>**Full Changelog**: https://github.com/strands-agents/evals/compare/v0.2.0...v0.2.1 |
| `v0.2.0` | 2026-05-14 | High | chore(detectors): update import to include DiagnosisTrigger by @poshinchen in https://github.com/strands-agents/evals/pull/219<br>feat(simulator): structured_output for ActorSimulator by @poshinchen in https://github.com/strands-agents/evals/pull/207<br>feat: added strands-reviewer workflow into evals by @poshinchen in https://github.com/strands-agents/evals/pull/223<br>feat: add official Discord link by @Albertozhao in https://github.com/strands-agents/evals/pull/227 … |
| `v0.1.17` | 2026-05-08 | High | feat: add multimodal evaluators and prompt templates for image-to-text evaluation by @sangminwoo in https://github.com/strands-agents/evals/pull/187<br>feat(detectors): added analyze_root_cause by @poshinchen in https://github.com/strands-agents/evals/pull/179<br>feat(detectors): integrated rca into evaluation workflow by @poshinchen in https://github.com/strands-agents/evals/pull/210<br>chore(detectors): included more fields to the RCAItem by @poshinchen in https://github.c … |
| `v0.1.16` | 2026-04-30 | High | feat: simplify devx by adding @eval_task decorator and handlers for wrapping task functions by @afarntrog in https://github.com/strands-agents/evals/pull/199<br>feat(detectors): detectors interface and failure_detector implementation by @poshinchen in https://github.com/strands-agents/evals/pull/189<br>refactor(evaluators): use PEP 604 union syntax and add Model type to HarmfulnessEvaluator by @afarntrog in https://github.com/strands-agents/evals/pull/206<br>Full Change … |
| `v0.1.15` | 2026-04-17 | High | docs(simulators): updated simulators README by @poshinchen in https://github.com/strands-agents/evals/pull/195<br>feat: add correctness evaluator, trace-based and reference-based by @ybdarrenwang in https://github.com/strands-agents/evals/pull/185<br>feat: add OpenSearchProvider and OpenSearchSessionMapper by @kylehounslow in https://github.com/strands-agents/evals/pull/192<br>**New Contributors**<br>@kylehounslow made their first contribution in https://github.com/strands-ag … |
| `v0.1.14` | 2026-04-08 | High | **Major Features**<br>**Ground Truth Assertion Support for Goal Success Rate Evaluator** ([PR#180](https://github.com/strands-agents/evals/pull/180))<br>The `GoalSuccessRateEvaluator` now supports a second evaluation mode: assertion-based evaluation. When `expected_assertion` is provided on the evaluation case, the judge LLM evaluates whether the agent’s behavior satisfies explicit success assertions rather than inferring goals from the conversation. This enables precise, rep … |
| `v0.1.13` | 2026-03-31 | Medium | feat: add LocalFileTaskResultStore for caching task results locally by @afarntrog in https://github.com/strands-agents/evals/pull/178<br>feat(mappers): langfuse provider changes to support newer version of langfuse by @poshinchen in https://github.com/strands-agents/evals/pull/165<br>**Full Changelog**: https://github.com/strands-agents/evals/compare/v0.1.12...v0.1.13 |
| `v0.1.12` | 2026-03-26 | Medium | feat(mapper): added framework detection for traces from CloudWatch by @poshinchen in https://github.com/strands-agents/evals/pull/164<br>refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread by @afarntrog in https://github.com/strands-agents/evals/pull/173<br>feat: add TaskResultStore for caching and replaying task execution results by @afarntrog in https://github.com/strands-agents/evals/pull/176<br>feat(mappers): cloudwatch change for openinfer … |
| `v0.1.11` | 2026-03-19 | Low | feat(report): allow flattened report by @poshinchen in https://github.com/strands-agents/evals/pull/157<br>feat: add environment state evaluation support by @afarntrog in https://github.com/strands-agents/evals/pull/156<br>feat: added Langchain mappers by @poshinchen in https://github.com/strands-agents/evals/pull/153<br>feat: add environment state support to OutputEvaluator by @afarntrog in https://github.com/strands-agents/evals/pull/160<br>fix: hatch run test-lint by @afa … |
| `v0.1.10` | 2026-03-11 | Low | feat: add deterministic evaluators for output and trajectory checks by @afarntrog in https://github.com/strands-agents/evals/pull/154<br>**Full Changelog**: https://github.com/strands-agents/evals/compare/v0.1.9...v0.1.10 |
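
The `v0.1.12` entry mentions unifying sync and async evaluation by having `aevaluate` default to `asyncio.to_thread`. The sketch below illustrates that general pattern using only the Python standard library. The class names (`BaseEvaluator`, `NonEmptyEvaluator`) and the `run_all` helper are hypothetical stand-ins, not the package's actual API; only `asyncio.to_thread` and the `evaluate` / `aevaluate` method names come from the release note.

```python
import asyncio


class BaseEvaluator:
    """Hypothetical base class; NOT the strands-agents/evals implementation."""

    def evaluate(self, output: str) -> bool:
        # Subclasses supply the blocking, synchronous evaluation logic.
        raise NotImplementedError

    async def aevaluate(self, output: str) -> bool:
        # Default async path: run the synchronous evaluate() in a worker
        # thread so it does not block the event loop (Python 3.9+).
        return await asyncio.to_thread(self.evaluate, output)


class NonEmptyEvaluator(BaseEvaluator):
    """Hypothetical evaluator: passes when the output has non-whitespace text."""

    def evaluate(self, output: str) -> bool:
        return bool(output.strip())


async def run_all(evaluator: BaseEvaluator, outputs: list[str]) -> list[bool]:
    # Evaluate every output concurrently through the async entry point.
    return await asyncio.gather(*(evaluator.aevaluate(o) for o in outputs))


results = asyncio.run(run_all(NonEmptyEvaluator(), ["hello", "   ", "ok"]))
print(results)
```

With a default like this, a subclass only has to implement the synchronous method to be usable from both sync and async callers; an evaluator with a natively asynchronous implementation could still override `aevaluate` directly.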

## Citation

- HTML: https://www.freshcrate.ai/projects/evals
- Markdown: https://www.freshcrate.ai/projects/evals.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/evals/deps

_Generated by freshcrate.ai. Indexes GitHub releases for AI-agent ecosystem packages._