evals

A comprehensive evaluation framework for AI agents and LLM applications.

agentic agentic-ai ai evaluation machine-learning python strands-agents

Strands Evals SDK

Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.

Feature Overview

Multiple Evaluation Types: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation
Dynamic Simulators: Multi-turn conversation simulation with realistic user behavior, goal-oriented interactions, and LLM-powered tool simulation with shared state
LLM-as-a-Judge: Built-in evaluators using language models for sophisticated assessment with structured scoring
Trace-based Evaluation: Analyze agent behavior through OpenTelemetry execution traces
Automated Experiment Generation: Generate comprehensive test suites from context descriptions
Custom Evaluators: Extensible framework for domain-specific evaluation logic
Experiment Management: Save, load, and version your evaluation experiments with JSON serialization
Built-in Scoring Tools: Helper functions for exact, in-order, and any-order trajectory matching

Quick Start

# Install Strands Evals SDK
pip install strands-agents-evals

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Create test cases
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        expected_output="The capital of France is Paris.",
        metadata={"category": "knowledge"}
    )
]

# Create evaluators with custom rubric
evaluators = [
    OutputEvaluator(
        rubric="""
        Evaluate based on:
        1. Accuracy - Is the information correct?
        2. Completeness - Does it fully answer the question?
        3. Clarity - Is it easy to understand?
        
        Score 1.0 if all criteria are met excellently.
        Score 0.5 if some criteria are partially met.
        Score 0.0 if the response is inadequate.
        """
    )
]

# Create experiment and run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)

def get_response(case: Case) -> str:
    agent = Agent(callback_handler=None)
    return str(agent(case.input))

# Run evaluations
reports = experiment.run_evaluations(get_response)
reports[0].run_display()

Installation

Ensure you have Python 3.10+ installed, then:

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

# Install in development mode (run from the root of a cloned repository)
pip install -e .

# Install with test dependencies
pip install -e ".[test]"

# Install with both test and dev dependencies
pip install -e ".[test,dev]"

Features at a Glance

Output Evaluation with Custom Rubrics

Evaluate agent responses using LLM-as-a-judge with flexible scoring criteria:

from strands_evals.evaluators import OutputEvaluator

evaluator = OutputEvaluator(
    rubric="Score 1.0 for accurate, complete responses. Score 0.5 for partial answers. Score 0.0 for incorrect or unhelpful responses.",
    include_inputs=True,  # Include context in evaluation
    model="us.anthropic.claude-sonnet-4-20250514-v1:0"  # Custom judge model
)

Trajectory Evaluation with Built-in Scoring

Analyze agent tool usage and action sequences with helper scoring functions:

from strands import Agent
from strands_evals import Case
from strands_evals.evaluators import TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor
from strands_tools import calculator

# Evaluator includes built-in scoring tools: exact_match_scorer, in_order_match_scorer, any_order_match_scorer
evaluator = TrajectoryEvaluator(
    rubric="Score 1.0 if correct tools used in proper sequence. Use scoring tools to verify trajectory matches."
)

def get_response_with_tools(case: Case) -> dict:
    agent = Agent(tools=[calculator])
    response = agent(case.input)

    # Extract trajectory efficiently to prevent context overflow
    trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(agent.messages)

    # Update the evaluator defined above with tool descriptions
    evaluator.update_trajectory_description(
        tools_use_extractor.extract_tools_description(agent, is_short=True)
    )

    return {"output": str(response), "trajectory": trajectory}
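
The three matching strategies named above can be illustrated in plain Python. The following is a minimal sketch of what exact, in-order, and any-order matching could mean for a trajectory of tool names. The functions `exact_match`, `in_order_match`, and `any_order_match` are hypothetical stand-ins written for this illustration; they are not the SDK's `exact_match_scorer`, `in_order_match_scorer`, or `any_order_match_scorer`, whose signatures and return values may differ.

```python
from collections import Counter


def exact_match(actual: list[str], expected: list[str]) -> bool:
    """True only if both trajectories contain the same tools in the same order."""
    return actual == expected


def in_order_match(actual: list[str], expected: list[str]) -> bool:
    """True if `expected` appears in `actual` as a subsequence (extra calls allowed)."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator, so relative order is enforced.
    return all(tool in remaining for tool in expected)


def any_order_match(actual: list[str], expected: list[str]) -> bool:
    """True if every expected tool call occurs in `actual`, regardless of order."""
    missing = Counter(expected) - Counter(actual)
    return not missing


# Hypothetical trajectories for illustration
expected = ["web_search", "calculator"]
actual = ["calculator", "file_read", "web_search"]

print(exact_match(actual, expected))      # False
print(in_order_match(actual, expected))   # False
print(any_order_match(actual, expected))  # True
```

Exact matching suits deterministic workflows, while any-order matching is the more forgiving choice when independent tool calls may be issued in any sequence.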

Trace-based Helpfulness Evaluation

Evaluate agent helpfulness using OpenTelemetry traces with seven-level scoring:

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper

# Setup telemetry for trace capture
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def user_task_function(case: Case) -> dict:
    telemetry.in_memory_exporter.clear()

    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None
    )
    response = agent(case.input)

    # Map spans to session for evaluation
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    
    return {"output": str(response), "trajectory": session}

# Seven-level scoring: Not helpful (0.0) to Above and beyond (1.0)
evaluators = [HelpfulnessEvaluator()]
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)

# Run evaluations
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()

Multi-turn Conversation Simulation

Simulate realistic user interactions with dynamic, goal-oriented conversations using ActorSimulator:

from strands import Agent
from strands_evals import Case, Experiment, ActorSimulator
from strands_evals.evaluators import HelpfulnessEvaluator, GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

def task_function(case: Case) -> dict:
    # Create simulator to drive conversation
    simulator = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=10
    )

    # Create agent to evaluate
    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )

    # Run multi-turn conversation
    all_spans = []
    user_message = case.input

    while simulator.has_next():
        memory_exporter.clear()
        agent_response = agent(user_message)
        turn_spans = list(memory_exporter.get_finished_spans())
        all_spans.extend(turn_spans)

        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)

    # Map to session for evaluation
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(all_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Use evaluators to assess simulated conversations
evaluators = [
    HelpfulnessEvaluator(),
    GoalSuccessRateEvaluator()
]

experiment = Experiment(cases=test_cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)

Key Benefits:

Dynamic Interactions: Simulator adapts responses based on agent behavior
Goal-Oriented Testing: Verify agents can complete user objectives through dialogue
Realistic Conversations: Generate authentic multi-turn interaction patterns
No Predefined Scripts: Test agents without hardcoded conversation paths
Comprehensive Evaluation: Combine with trace-based evaluators for full assessment
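
The control flow of the conversation loop above can be tried without any model calls. The sketch below swaps in two hypothetical stand-ins: `ScriptedUser`, which replies from a fixed list and exposes the same `has_next()` / `act()` shape used in the example, and `echo_agent`, which only acknowledges messages. Neither is part of the SDK, and a real `ActorSimulator` generates its replies dynamically rather than from a script; the sketch only shows how `max_turns` bounds the loop.

```python
class ScriptedUser:
    """Stand-in user simulator: replies from a fixed script, for at most max_turns turns."""

    def __init__(self, replies: list[str], max_turns: int) -> None:
        self._replies = replies
        self._max_turns = max_turns
        self._turn = 0

    def has_next(self) -> bool:
        # Stop when either the turn budget or the script is exhausted.
        return self._turn < min(self._max_turns, len(self._replies))

    def act(self, agent_message: str) -> str:
        reply = self._replies[self._turn]
        self._turn += 1
        return reply


def echo_agent(message: str) -> str:
    """Stand-in agent that simply acknowledges the user's message."""
    return f"Acknowledged: {message}"


def run_conversation(first_message: str, user: ScriptedUser) -> list[tuple[str, str]]:
    """Alternate agent and user turns, returning (user_message, agent_response) pairs."""
    transcript = []
    message = first_message
    while user.has_next():
        response = echo_agent(message)
        transcript.append((message, response))
        message = user.act(response)
    return transcript


user = ScriptedUser(replies=["Make it 72 degrees.", "Thanks, that is all."], max_turns=10)
for user_message, agent_response in run_conversation("Turn on the heat.", user):
    print(f"user: {user_message}\nagent: {agent_response}")
```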

Tool Simulation

Simulate tool behavior with LLM-powered responses for controlled agent evaluation using ToolSimulator. Register tools with a decorator, define output schemas, and optionally share state across related tools — the simulator replaces real execution with realistic, schema-validated responses:

from typing import Any
from enum import Enum
from pydantic import BaseModel, Field
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation.tool_simulator import ToolSimulator

tool_simulator = ToolSimulator()

# Define output schema
class HVACMode(str, Enum):
    HEAT = "heat"
    COOL = "cool"
    AUTO = "auto"
    OFF = "off"

class HVACResponse(BaseModel):
    temperature: float = Field(..., description="Target temperature in Fahrenheit")
    mode: HVACMode = Field(..., description="HVAC mode")
    status: str = Field(default="success", description="Operation status")

# Register tool — the function body is never called; the LLM generates responses
@tool_simulator.tool(
    share_state_id="room_environment",
    initial_state_description="Room: 68°F, humidity 45%, HVAC off",
    output_schema=HVACResponse,
)
def hvac_controller(temperature: float, mode: str) -> dict[str, Any]:
    """Control heating/cooling system that affects room temperature and humidity."""
    pass

def task_function(case: Case) -> dict:
    hvac_tool = tool_simulator.get_tool("hvac_controller")
    agent = Agent(tools=[hvac_tool], callback_handler=None)
    response = agent(case.input)
    return {"output": str(response)}

cases = [Case(name="heat_control", input="Turn on the heat to 72 degrees")]
experiment = Experiment(cases=cases, evaluators=[GoalSuccessRateEvaluator()])
reports = experiment.run_evaluations(task_function)

Key Benefits:

No Real Infrastructure: Test tool-using agents without live APIs, databases, or services
Schema-Validated Responses: Pydantic output schemas ensure structured, consistent tool responses
Shared State: Related tools (e.g., sensor + controller) share state via share_state_id for coherent behavior
Stateful Context: Call history and initial state are passed to the LLM for consistent multi-call sequences
Drop-in Replacement: Simulated tools plug directly into Strands Agent via get_tool()

Automated Experiment Generation

Generate comprehensive test suites automatically from context descriptions:

import asyncio

from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import TrajectoryEvaluator

# Define available tools and context
tool_context = """
Available tools:
- calculator(expression: str) -> float: Evaluate mathematical expressions
- web_search(query: str) -> str: Search the web for information
- file_read(path: str) -> str: Read file contents
"""

async def main() -> None:
    # Generate experiment with multiple test cases
    generator = ExperimentGenerator[str, str](str, str)
    experiment = await generator.from_context_async(
        context=tool_context,
        num_cases=10,
        evaluator=TrajectoryEvaluator,
        task_description="Math and research assistant with tool usage",
        num_topics=3  # Distribute cases across multiple topics
    )

    # Save generated experiment
    experiment.to_file("generated_experiment", "json")

# `await` needs a running event loop; in a notebook, call `await main()` instead
asyncio.run(main())

Custom Evaluators with Structured Output

Create domain-specific evaluation logic with standardized output format:

from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput

class PolicyComplianceEvaluator(Evaluator[str, str]):
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> EvaluationOutput:
        # Custom evaluation logic
        response = evaluation_case.actual_output
        
        # Check for policy violations
        violations = self._check_policy_violations(response)
        
        if not violations:
            return EvaluationOutput(
                score=1.0,
                test_pass=True,
                reason="Response complies with all policies",
                label="compliant"
            )
        else:
            return EvaluationOutput(
                score=0.0,
                test_pass=False,
                reason=f"Policy violations: {', '.join(violations)}",
                label="non_compliant"
            )
    
    def _check_policy_violations(self, response: str) -> list[str]:
        # Implementation details...
        return []
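
The `_check_policy_violations` helper above is left as a stub. One way it might be filled in is with a table of regular expressions, sketched below as a standalone function. The rule names and patterns in `POLICY_PATTERNS` are hypothetical examples chosen for illustration, not rules shipped with the SDK; a real deployment would define its own policies.

```python
import re

# Hypothetical policy rules: rule name -> pattern whose presence signals a violation.
POLICY_PATTERNS: dict[str, re.Pattern[str]] = {
    "shares_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "guarantees_returns": re.compile(r"\bguaranteed returns?\b", re.IGNORECASE),
}


def check_policy_violations(response: str) -> list[str]:
    """Return the names of all rules whose pattern occurs in the response."""
    return [name for name, pattern in POLICY_PATTERNS.items() if pattern.search(response)]


print(check_policy_violations("We offer guaranteed returns on every plan."))
# ['guarantees_returns']
print(check_policy_violations("Paris is the capital of France."))
# []
```

Returning rule names rather than a boolean keeps the evaluator's `reason` field informative, since the names can be joined directly into the message.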

Tool Usage and Parameter Evaluation

Evaluate specific aspects of tool usage with specialized evaluators:

from strands_evals.evaluators import ToolSelectionAccuracyEvaluator, ToolParameterAccuracyEvaluator

# Evaluate if correct tools were selected
tool_selection_evaluator = ToolSelectionAccuracyEvaluator(
    rubric="Score 1.0 if optimal tools selected, 0.5 if suboptimal but functional, 0.0 if wrong tools"
)

# Evaluate if tool parameters were correct
tool_parameter_evaluator = ToolParameterAccuracyEvaluator(
    rubric="Score based on parameter accuracy and appropriateness for the task"
)

Available Evaluators

Output-Based Evaluators

These evaluators work directly with inputs and outputs without requiring OpenTelemetry traces:

OutputEvaluator: Flexible LLM-based evaluation with custom rubrics
TrajectoryEvaluator: Action sequence evaluation with built-in scoring tools (supports both list-based trajectories and Session traces via extractors)
InteractionsEvaluator: Multi-agent interaction and handoff evaluation
Custom Evaluators: Extensible base class for domain-specific logic

Trace-Based Evaluators

These evaluators require OpenTelemetry traces (Session objects) to analyze agent behavior:

Tool-Level Evaluators

Evaluate individual tool calls within a conversation:

ToolSelectionAccuracyEvaluator: Evaluates appropriateness of tool choices at specific points
ToolParameterAccuracyEvaluator: Evaluates correctness of tool parameters based on context

Trace-Level Evaluators

Evaluate the most recent turn in a conversation:

HelpfulnessEvaluator: Seven-level helpfulness assessment from user perspective
FaithfulnessEvaluator: Evaluates if responses are grounded in conversation history
CoherenceEvaluator: Assesses logical cohesion and reasoning quality with five-level scoring
ConcisenessEvaluator: Evaluates response brevity with three-level scoring
ResponseRelevanceEvaluator: Evaluates relevance of responses to user questions
HarmfulnessEvaluator: Binary evaluation for harmful content detection

Session-Level Evaluators

Evaluate entire conversation sessions:

GoalSuccessRateEvaluator: Measures if user goals were achieved across the full conversation

Experiment Management and Serialization

Save, load, and version experiments for reproducibility:

# Save experiment with metadata
experiment.to_file("customer_service_eval", "json")

# Load experiment from file
loaded_experiment = Experiment.from_file("./experiment_files/customer_service_eval.json", "json")

# Experiment files include:
# - Test cases with metadata
# - Evaluator configuration
# - Expected outputs and trajectories
# - Versioning information

Evaluation Metrics and Analysis

Track comprehensive metrics across multiple dimensions:

# Built-in metrics to consider:
metrics = {
    "accuracy": "Factual correctness of responses",
    "task_completion": "Whether agent completed the task",
    "tool_selection": "Appropriateness of tool choices", 
    "response_time": "Agent response latency",
    "hallucination_rate": "Frequency of fabricated information",
    "token_usage": "Efficiency of token consumption",
    "user_satisfaction": "Subjective helpfulness ratings"
}

# Generate analysis reports
reports = experiment.run_evaluations(task_function)
reports[0].run_display()  # Interactive display with metrics breakdown
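
A per-category breakdown of this kind can also be computed by hand once scores have been collected. The sketch below assumes results have already been flattened into `(category, score, passed)` tuples, where the category comes from case metadata such as `{"category": "knowledge"}`. That tuple shape and the `breakdown_by_category` helper are hypothetical and are not the SDK's report schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flattened results: (category, score, passed). Not the SDK's report schema.
results = [
    ("knowledge", 1.0, True),
    ("knowledge", 0.5, False),
    ("tool_usage", 1.0, True),
]


def breakdown_by_category(rows: list[tuple[str, float, bool]]) -> dict[str, dict[str, float]]:
    """Group rows by category and compute the mean score and pass rate for each group."""
    grouped: dict[str, list[tuple[float, bool]]] = defaultdict(list)
    for category, score, passed in rows:
        grouped[category].append((score, passed))
    return {
        category: {
            "mean_score": mean(score for score, _ in items),
            "pass_rate": sum(passed for _, passed in items) / len(items),
        }
        for category, items in grouped.items()
    }


print(breakdown_by_category(results))
# {'knowledge': {'mean_score': 0.75, 'pass_rate': 0.5}, 'tool_usage': {'mean_score': 1.0, 'pass_rate': 1.0}}
```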

Best Practices

Evaluation Strategy

Diversify Test Cases: Cover knowledge, reasoning, tool usage, conversation, edge cases, and safety scenarios
Use Statistical Baselines: Run multiple evaluations to account for LLM non-determinism
Combine Multiple Evaluators: Use output, trajectory, and helpfulness evaluators together
Regular Evaluation Cadence: Implement consistent evaluation schedules for continuous improvement
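
The statistical-baseline advice above can be sketched with the standard library alone. The helper below repeats an evaluation several times and reports the mean and spread of the resulting scores. `summarize_runs` and `fake_evaluation` are hypothetical names used for illustration; in practice `run_once` would wrap a real evaluation run and return its score.

```python
import random
import statistics
from typing import Callable


def summarize_runs(run_once: Callable[[], float], num_runs: int = 5) -> dict[str, float]:
    """Call `run_once` repeatedly and report the mean and spread of its scores."""
    scores = [run_once() for _ in range(num_runs)]
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Stand-in for a real evaluation run: a noisy score source, seeded for repeatability.
rng = random.Random(0)


def fake_evaluation() -> float:
    return rng.choice([0.5, 1.0])


summary = summarize_runs(fake_evaluation, num_runs=10)
print(summary)
```

A large standard deviation relative to the mean suggests that a single run is not a reliable baseline and that more repetitions or a tighter rubric may be needed.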

Performance Optimization

Use Extractors: Always use tools_use_extractor functions to prevent context overflow
Update Descriptions Dynamically: Call update_trajectory_description() with tool descriptions
Choose Appropriate Judge Models: Use stronger models for complex evaluations
Batch Evaluations: Process multiple test cases efficiently

Experiment Design

Write Clear Rubrics: Include explicit scoring criteria and examples
Include Expected Trajectories: Define exact sequences for trajectory evaluation
Use Appropriate Matching: Choose between exact, in-order, or any-order matching
Version Control: Track agent configurations alongside evaluation results

Documentation

For detailed guidance and examples, explore our documentation.

Contributing ❤️

We welcome contributions! See our Contributing Guide for details on:

Development setup
Contributing via Pull Requests
Code of Conduct
Reporting of security issues

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Security

See CONTRIBUTING for more information.

Release History

Version	Changes	Urgency	Date
v1.0.3	## v1.0.3 _Auto-drafted from commits in `v1.0.2..v1.0.3`, grouped by conventional-commit type. Edit on the release page after publish if you want a polished writeup; the canonical release notes live on the website._ ### 🚀 Features - feat: map type labels to native issue type (#287) (9ab3913) ### 🐛 Fixes - fix: detect_otel_mapper checks all spans for body in CloudWatch split format (#320) (771a3fe) - fix: route smolagents OpenInference spans to OpenInferenceSessionMapper (#308) (fe3b4b0) -	High	7/23/2026
v1.0.2	## v1.0.2 _Auto-drafted from commits in `v1.0.1..v1.0.2`, grouped by conventional-commit type. Edit on the release page after publish if you want a polished writeup; the canonical release notes live on the website._ ### 🐛 Fixes - fix(redteam): rename structured-output models off leading underscore (#294) (bc8436a) ### 👷 CI - ci: add aggregate CI Gate status check (#303) (aa4c1dc) - ci: added evals full release workflow (#302) (023f87f)	High	7/9/2026
v1.0.1	## What's Changed * ci: bump actions/github-script from 8 to 9 by @dependabot[bot] in https://github.com/strands-agents/evals/pull/193 * chore(cli): added fetch command to pull traces from different sources by @poshinchen in https://github.com/strands-agents/evals/pull/276 * docs(skill): updated skill content for chaos testing by @poshinchen in https://github.com/strands-agents/evals/pull/275 * ci: bump actions/checkout from 6 to 7 by @dependabot[bot] in https://github.com/strands-agents/eva	High	6/25/2026
v1.0.0	## What's Changed * feat(redteam): add GOAT multi-turn attack strategy by @yeomjiwonyeom in https://github.com/strands-agents/evals/pull/250 * feat(redteam): add PAIR single-stream multi-turn attack strategy by @yeomjiwonyeom in https://github.com/strands-agents/evals/pull/253 * docs: updated markdown files and skills to know the CLI exists by @poshinchen in https://github.com/strands-agents/evals/pull/264 * chore: added RedTeamExperiment round-trips by @poshinchen in https://github.com/stra	High	6/16/2026
v0.3.0	## What's Changed * feat(redteam): add built-in red teaming support by @kevmyung in https://github.com/strands-agents/evals/pull/184 * chore: allow importing EvaluationReport from root by @poshinchen in https://github.com/strands-agents/evals/pull/238 * chore: added trace-based evaluators into defaults by @poshinchen in https://github.com/strands-agents/evals/pull/244 * chore(report): always return flattened report by @poshinchen in https://github.com/strands-agents/evals/pull/241 * feat: a	High	6/12/2026
v0.2.1	## What's Changed * chore: added evals-skills by @poshinchen in https://github.com/strands-agents/evals/pull/231 * feat: add chaos testing module for fault injection by @ybdarrenwang in https://github.com/strands-agents/evals/pull/224 Full Changelog: https://github.com/strands-agents/evals/compare/v0.2.0...v0.2.1	High	5/29/2026
v0.2.0	## What's Changed * chore(detectors): update import to include DiagnosisTrigger by @poshinchen in https://github.com/strands-agents/evals/pull/219 * feat(simulator): structured_output for ActorSimulator by @poshinchen in https://github.com/strands-agents/evals/pull/207 * feat: added strands-reviewer workflow into evals by @poshinchen in https://github.com/strands-agents/evals/pull/223 * feat: add official Discord link by @Albertozhao in https://github.com/strands-agents/evals/pull/227 ##	High	5/14/2026
v0.1.17	## What's Changed * feat: add multimodal evaluators and prompt templates for image-to-text evaluation by @sangminwoo in https://github.com/strands-agents/evals/pull/187 * feat(detectors): added analyze_root_cause by @poshinchen in https://github.com/strands-agents/evals/pull/179 * feat(detectors): integrated rca into evaluation workflow by @poshinchen in https://github.com/strands-agents/evals/pull/210 * chore(detectors): included more fields to the RCAItem by @poshinchen in https://github.c	High	5/8/2026
v0.1.16	## What's Changed * feat: simplify devx by adding @eval_task decorator and handlers for wrapping task functions by @afarntrog in https://github.com/strands-agents/evals/pull/199 * feat(detectors): detectors interface and failure_detector implementation by @poshinchen in https://github.com/strands-agents/evals/pull/189 * refactor(evaluators): use PEP 604 union syntax and add Model type to HarmfulnessEvaluator by @afarntrog in https://github.com/strands-agents/evals/pull/206 **Full Change	High	4/30/2026
v0.1.15	## What's Changed * docs(simulators): updated simulators README by @poshinchen in https://github.com/strands-agents/evals/pull/195 * feat: add correctness evaluator, trace-based and reference-based by @ybdarrenwang in https://github.com/strands-agents/evals/pull/185 * feat: add OpenSearchProvider and OpenSearchSessionMapper by @kylehounslow in https://github.com/strands-agents/evals/pull/192 ## New Contributors * @kylehounslow made their first contribution in https://github.com/strands-ag	High	4/17/2026
v0.1.14	## What's Changed ### Major Features #### Ground Truth Assertion Support for Goal Success Rate Evaluator — [PR#180](https://github.com/strands-agents/evals/pull/180) The `GoalSuccessRateEvaluator` now supports a second evaluation mode: assertion-based evaluation. When `expected_assertion` is provided on the evaluation case, the judge LLM evaluates whether the agent’s behavior satisfies explicit success assertions rather than inferring goals from the conversation. This enables precise, rep	High	4/8/2026
v0.1.13	## What's Changed * feat: add LocalFileTaskResultStore for caching task results locally by @afarntrog in https://github.com/strands-agents/evals/pull/178 * feat(mappers): langfuse provider changes to support newer version of langfuse by @poshinchen in https://github.com/strands-agents/evals/pull/165 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.12...v0.1.13	Medium	3/31/2026
v0.1.12	## What's Changed * feat(mapper): added framework detection for traces from CloudWatch by @poshinchen in https://github.com/strands-agents/evals/pull/164 * refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread by @afarntrog in https://github.com/strands-agents/evals/pull/173 * feat: add TaskResultStore for caching and replaying task execution results by @afarntrog in https://github.com/strands-agents/evals/pull/176 * feat(mappers): cloudwatch change for openinfer	Medium	3/26/2026
v0.1.11	## What's Changed * feat(report): allow flattened report by @poshinchen in https://github.com/strands-agents/evals/pull/157 * feat: add environment state evaluation support by @afarntrog in https://github.com/strands-agents/evals/pull/156 * feat: added Langchain mappers by @poshinchen in https://github.com/strands-agents/evals/pull/153 * feat: add environment state support to OutputEvaluator by @afarntrog in https://github.com/strands-agents/evals/pull/160 * fix: hatch run test-lint by @afa	Low	3/19/2026
v0.1.10	## What's Changed * feat: add deterministic evaluators for output and trajectory checks by @afarntrog in https://github.com/strands-agents/evals/pull/154 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.9...v0.1.10	Low	3/11/2026
v0.1.9	## What's Changed * feat: add CloudWatchProvider to pull remote cloudwatch traces and run evals against them. by @afarntrog in https://github.com/strands-agents/evals/pull/147 * feat: add ToolSimulator for tool response simulation by @ybdarrenwang in https://github.com/strands-agents/evals/pull/111 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.8...v0.1.9	Low	3/4/2026
v0.1.8	## What's Changed * fix: handle parallel tool calls during tool extraction by @clareliguori in https://github.com/strands-agents/evals/pull/137 * feat: trace provider interface by @afarntrog in https://github.com/strands-agents/evals/pull/140 * feat: add LangfuseProvider for remote trace evaluation by @afarntrog in https://github.com/strands-agents/evals/pull/144 * ci: bump amannn/action-semantic-pull-request from 5 to 6 by @dependabot[bot] in https://github.com/strands-agents/evals/pull/138	Low	2/25/2026
v0.1.7	## What's Changed * fix: retrieve multiple text contentBlock in messageConent by @poshinchen in https://github.com/strands-agents/evals/pull/133 * feat(workflows): add conventional commit workflow in PR by @mkmeral in https://github.com/strands-agents/evals/pull/134 * fix: add tool info to concisenss, harmfulness, helpfulness and response relevance evaluators by @ybdarrenwang in https://github.com/strands-agents/evals/pull/132 * fix: update output variable name in workflow by @Unshure in htt	Low	2/19/2026
v0.1.6	## What's Changed * Refactor: centralized InputT and OutputT by @poshinchen in https://github.com/strands-agents/evals/pull/124 * Added CoherenceEvaluator by @poshinchen in https://github.com/strands-agents/evals/pull/125 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.5...v0.1.6	Low	2/11/2026
v0.1.5	## Major Features ### Response Relevance Evaluator - [PR#112](https://github.com/strands-agents/evals/pull/112) The new `ResponseRelevanceEvaluator` measures how well an agent's response addresses the user's question. It uses a 5-level LLM-as-judge scoring system — Not At All (0.0), Not Generally (0.25), Neutral/Mixed (0.5), Generally Yes (0.75), and Completely Yes (1.0) — with a pass threshold at ≥0.5. Like other trace-level evaluators, it requires an `actual_trajectory` session	Low	2/5/2026
v0.1.4	## What's Changed * fix: include tool executions in _extract_trace_level by @razkenari in https://github.com/strands-agents/evals/pull/77 ## New Contributors * @razkenari made their first contribution in https://github.com/strands-agents/evals/pull/77 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.3...v0.1.4	Low	1/29/2026
v0.1.3	## What's Changed * fix: Multiple Tool Usage Not Detected in tools_use_extractor.py by @bipro1992 in https://github.com/strands-agents/evals/pull/80 ## New Contributors * @bipro1992 made their first contribution in https://github.com/strands-agents/evals/pull/80 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.2...v0.1.3	Low	1/21/2026
v0.1.2	## What's Changed * fix: Isolate evaluator errors in run_evaluations by @afarntrog in https://github.com/strands-agents/evals/pull/84 * fix(extractors): Add null check for toolResult in message extraction by @afarntrog in https://github.com/strands-agents/evals/pull/85 Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.1...v0.1.2	Low	1/13/2026
v0.1.1	## What's Changed * fix broken links by @theofpa in https://github.com/strands-agents/evals/pull/63 * feat: Extract whether tool result was an error by @clareliguori in https://github.com/strands-agents/evals/pull/66 * docs: updated README to include simulator feature by @poshinchen in https://github.com/strands-agents/evals/pull/70 * fix: preserve non-ASCII characters in JSON file output by @daisuke-awaji in https://github.com/strands-agents/evals/pull/69 * ci: bump actions/download-artifa	Low	12/15/2025
v0.1.0	Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems. Feature Overview - Multiple Evaluation Types: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation - LLM-as-a-Judge: Built-in evalu	Low	12/3/2025
