evals

A comprehensive evaluation framework for AI agents and LLM applications.

Why this rank: Strong adoption, recent release, healthy release cadence

README

Strands Evals SDK

Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.

Feature Overview

  • Multiple Evaluation Types: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation
  • Dynamic Simulators: Multi-turn conversation simulation with realistic user behavior, goal-oriented interactions, and LLM-powered tool simulation with shared state
  • LLM-as-a-Judge: Built-in evaluators using language models for sophisticated assessment with structured scoring
  • Trace-based Evaluation: Analyze agent behavior through OpenTelemetry execution traces
  • Automated Experiment Generation: Generate comprehensive test suites from context descriptions
  • Custom Evaluators: Extensible framework for domain-specific evaluation logic
  • Experiment Management: Save, load, and version your evaluation experiments with JSON serialization
  • Built-in Scoring Tools: Helper functions for exact, in-order, and any-order trajectory matching

Quick Start

# Install Strands Evals SDK
pip install strands-agents-evals

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Create test cases
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        expected_output="The capital of France is Paris.",
        metadata={"category": "knowledge"}
    )
]

# Create evaluators with custom rubric
evaluators = [
    OutputEvaluator(
        rubric="""
        Evaluate based on:
        1. Accuracy - Is the information correct?
        2. Completeness - Does it fully answer the question?
        3. Clarity - Is it easy to understand?
        
        Score 1.0 if all criteria are met excellently.
        Score 0.5 if some criteria are partially met.
        Score 0.0 if the response is inadequate.
        """
    )
]

# Create experiment and run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)

def get_response(case: Case) -> str:
    agent = Agent(callback_handler=None)
    return str(agent(case.input))

# Run evaluations
reports = experiment.run_evaluations(get_response)
reports[0].run_display()

Installation

Ensure you have Python 3.10+ installed, then:

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

# Install in development mode
pip install -e .

# Install with test dependencies
pip install -e ".[test]"

# Install with both test and dev dependencies
pip install -e ".[test,dev]"

Features at a Glance

Output Evaluation with Custom Rubrics

Evaluate agent responses using LLM-as-a-judge with flexible scoring criteria:

from strands_evals.evaluators import OutputEvaluator

evaluator = OutputEvaluator(
    rubric="Score 1.0 for accurate, complete responses. Score 0.5 for partial answers. Score 0.0 for incorrect or unhelpful responses.",
    include_inputs=True,  # Include context in evaluation
    model="us.anthropic.claude-sonnet-4-20250514-v1:0"  # Custom judge model
)

Trajectory Evaluation with Built-in Scoring

Analyze agent tool usage and action sequences with helper scoring functions:

from strands import Agent
from strands_evals import Case
from strands_evals.evaluators import TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor
from strands_tools import calculator

def get_response_with_tools(case: Case) -> dict:
    agent = Agent(tools=[calculator])
    response = agent(case.input)
    
    # Extract trajectory efficiently to prevent context overflow
    trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(agent.messages)
    
    # Update evaluator with tool descriptions
    evaluator.update_trajectory_description(
        tools_use_extractor.extract_tools_description(agent, is_short=True)
    )
    
    return {"output": str(response), "trajectory": trajectory}

# Evaluator includes built-in scoring tools: exact_match_scorer, in_order_match_scorer, any_order_match_scorer
evaluator = TrajectoryEvaluator(
    rubric="Score 1.0 if correct tools used in proper sequence. Use scoring tools to verify trajectory matches."
)
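
The three matching modes named in the comment above can be pictured with a minimal pure-Python sketch. The function names and the all-or-nothing 0.0/1.0 scoring here are illustrative assumptions; they are not the SDK's exact_match_scorer, in_order_match_scorer, or any_order_match_scorer, whose signatures and scoring may differ.

```python
from collections import Counter


def exact_match(actual: list[str], expected: list[str]) -> float:
    """1.0 only if both trajectories list the same tools in the same order."""
    return 1.0 if actual == expected else 0.0


def in_order_match(actual: list[str], expected: list[str]) -> float:
    """1.0 if every expected tool appears in order; extra calls in between are allowed."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return 1.0 if all(tool in remaining for tool in expected) else 0.0


def any_order_match(actual: list[str], expected: list[str]) -> float:
    """1.0 if every expected tool was called at least as often as expected, in any order."""
    missing = Counter(expected) - Counter(actual)
    return 1.0 if not missing else 0.0


expected = ["web_search", "calculator"]
print(exact_match(["web_search", "calculator"], expected))                  # same tools, same order
print(in_order_match(["web_search", "file_read", "calculator"], expected))  # extra call tolerated
print(any_order_match(["calculator", "web_search"], expected))              # order ignored
print(in_order_match(["calculator", "web_search"], expected))               # wrong order fails
```

Exact matching is the strictest check and suits deterministic workflows; in-order and any-order matching leave room for agents that take harmless detours.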

Trace-based Helpfulness Evaluation

Evaluate agent helpfulness using OpenTelemetry traces with seven-level scoring:

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper

# Setup telemetry for trace capture
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def user_task_function(case: Case) -> dict:
    telemetry.in_memory_exporter.clear()

    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None
    )
    response = agent(case.input)

    # Map spans to session for evaluation
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    
    return {"output": str(response), "trajectory": session}

# Seven-level scoring: Not helpful (0.0) to Above and beyond (1.0)
evaluators = [HelpfulnessEvaluator()]
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)

# Run evaluations
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()

Multi-turn Conversation Simulation

Simulate realistic user interactions with dynamic, goal-oriented conversations using ActorSimulator:

from strands import Agent
from strands_evals import Case, Experiment, ActorSimulator
from strands_evals.evaluators import HelpfulnessEvaluator, GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

def task_function(case: Case) -> dict:
    # Create simulator to drive conversation
    simulator = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=10
    )

    # Create agent to evaluate
    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )

    # Run multi-turn conversation
    all_spans = []
    user_message = case.input

    while simulator.has_next():
        memory_exporter.clear()
        agent_response = agent(user_message)
        turn_spans = list(memory_exporter.get_finished_spans())
        all_spans.extend(turn_spans)

        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)

    # Map to session for evaluation
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(all_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Use evaluators to assess simulated conversations
evaluators = [
    HelpfulnessEvaluator(),
    GoalSuccessRateEvaluator()
]

experiment = Experiment(cases=test_cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)

Key Benefits:

  • Dynamic Interactions: Simulator adapts responses based on agent behavior
  • Goal-Oriented Testing: Verify agents can complete user objectives through dialogue
  • Realistic Conversations: Generate authentic multi-turn interaction patterns
  • No Predefined Scripts: Test agents without hardcoded conversation paths
  • Comprehensive Evaluation: Combine with trace-based evaluators for full assessment
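
To exercise the turn loop itself without calling a model (for example, in a unit test of a task function's plumbing), a scripted stand-in can play the user role. This is a minimal sketch: ScriptedUser, echo_agent, and run_conversation are hypothetical names invented here, and the stand-in only mimics the has_next()/act() shape used above. It returns a plain string rather than a result with structured_output, and it is not part of the SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptedUser:
    """Hypothetical offline stand-in for a user simulator: replays fixed messages."""

    messages: list[str]
    max_turns: int = 10
    turns_taken: int = field(default=0, init=False)

    def has_next(self) -> bool:
        # Stop when either the turn budget or the script is exhausted.
        return self.turns_taken < min(self.max_turns, len(self.messages))

    def act(self, agent_reply: str) -> str:
        message = self.messages[self.turns_taken]
        self.turns_taken += 1
        return message


def echo_agent(message: str) -> str:
    """Stand-in for an agent call so the loop runs without a model."""
    return f"agent heard: {message}"


def run_conversation(first_message: str, user: ScriptedUser) -> list[tuple[str, str]]:
    """Drive the same while-loop shape as the task function above."""
    transcript: list[tuple[str, str]] = []
    user_message = first_message
    while user.has_next():
        agent_reply = echo_agent(user_message)
        transcript.append((user_message, agent_reply))
        user_message = user.act(agent_reply)
    return transcript


for turn in run_conversation("Book a table", ScriptedUser(["For two people", "At 7pm"])):
    print(turn)
```

Collecting the transcript inside the loop also avoids relying on a loop variable that would be undefined if the simulator reported no turns at all.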

Tool Simulation

Simulate tool behavior with LLM-powered responses for controlled agent evaluation using ToolSimulator. Register tools with a decorator, define output schemas, and optionally share state across related tools — the simulator replaces real execution with realistic, schema-validated responses:

from typing import Any
from enum import Enum
from pydantic import BaseModel, Field
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation.tool_simulator import ToolSimulator

tool_simulator = ToolSimulator()

# Define output schema
class HVACMode(str, Enum):
    HEAT = "heat"
    COOL = "cool"
    AUTO = "auto"
    OFF = "off"

class HVACResponse(BaseModel):
    temperature: float = Field(..., description="Target temperature in Fahrenheit")
    mode: HVACMode = Field(..., description="HVAC mode")
    status: str = Field(default="success", description="Operation status")

# Register tool — the function body is never called; the LLM generates responses
@tool_simulator.tool(
    share_state_id="room_environment",
    initial_state_description="Room: 68°F, humidity 45%, HVAC off",
    output_schema=HVACResponse,
)
def hvac_controller(temperature: float, mode: str) -> dict[str, Any]:
    """Control heating/cooling system that affects room temperature and humidity."""
    pass

def task_function(case: Case) -> dict:
    hvac_tool = tool_simulator.get_tool("hvac_controller")
    agent = Agent(tools=[hvac_tool], callback_handler=None)
    response = agent(case.input)
    return {"output": str(response)}

cases = [Case(name="heat_control", input="Turn on the heat to 72 degrees")]
experiment = Experiment(cases=cases, evaluators=[GoalSuccessRateEvaluator()])
reports = experiment.run_evaluations(task_function)

Key Benefits:

  • No Real Infrastructure: Test tool-using agents without live APIs, databases, or services
  • Schema-Validated Responses: Pydantic output schemas ensure structured, consistent tool responses
  • Shared State: Related tools (e.g., sensor + controller) share state via share_state_id for coherent behavior
  • Stateful Context: Call history and initial state are passed to the LLM for consistent multi-call sequences
  • Drop-in Replacement: Simulated tools plug directly into Strands Agent via get_tool()

Automated Experiment Generation

Generate comprehensive test suites automatically from context descriptions:

import asyncio

from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import TrajectoryEvaluator

# Define available tools and context
tool_context = """
Available tools:
- calculator(expression: str) -> float: Evaluate mathematical expressions
- web_search(query: str) -> str: Search the web for information
- file_read(path: str) -> str: Read file contents
"""

async def main():
    # Generate experiment with multiple test cases
    generator = ExperimentGenerator[str, str](str, str)
    experiment = await generator.from_context_async(
        context=tool_context,
        num_cases=10,
        evaluator=TrajectoryEvaluator,
        task_description="Math and research assistant with tool usage",
        num_topics=3  # Distribute cases across multiple topics
    )

    # Save generated experiment
    experiment.to_file("generated_experiment", "json")

# from_context_async is awaited, so it must run inside an event loop
asyncio.run(main())

Custom Evaluators with Structured Output

Create domain-specific evaluation logic with standardized output format:

from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput

class PolicyComplianceEvaluator(Evaluator[str, str]):
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> EvaluationOutput:
        # Custom evaluation logic
        response = evaluation_case.actual_output
        
        # Check for policy violations
        violations = self._check_policy_violations(response)
        
        if not violations:
            return EvaluationOutput(
                score=1.0,
                test_pass=True,
                reason="Response complies with all policies",
                label="compliant"
            )
        else:
            return EvaluationOutput(
                score=0.0,
                test_pass=False,
                reason=f"Policy violations: {', '.join(violations)}",
                label="non_compliant"
            )
    
    def _check_policy_violations(self, response: str) -> list[str]:
        # Implementation details...
        return []
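
The _check_policy_violations helper above is left as a stub. One way to fill it in is a pattern lookup, sketched below as a standalone function. The policy labels and regular expressions are hypothetical examples chosen for illustration, not rules shipped with the SDK.

```python
import re

# Hypothetical policy table: label -> regular expression that signals a violation.
POLICY_PATTERNS: dict[str, str] = {
    "shares_email_address": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "promises_refund": r"\bguaranteed? refund\b",
}


def check_policy_violations(response: str) -> list[str]:
    """Return the labels of every policy whose pattern occurs in the response."""
    return [
        label
        for label, pattern in POLICY_PATTERNS.items()
        if re.search(pattern, response, flags=re.IGNORECASE)
    ]


print(check_policy_violations("Write to help@example.com for a guaranteed refund."))
print(check_policy_violations("Our team will review your request."))
```

Keeping the rules in a table makes the evaluator's reason string easy to audit, since each violation maps back to a named policy.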

Tool Usage and Parameter Evaluation

Evaluate specific aspects of tool usage with specialized evaluators:

from strands_evals.evaluators import ToolSelectionAccuracyEvaluator, ToolParameterAccuracyEvaluator

# Evaluate if correct tools were selected
tool_selection_evaluator = ToolSelectionAccuracyEvaluator(
    rubric="Score 1.0 if optimal tools selected, 0.5 if suboptimal but functional, 0.0 if wrong tools"
)

# Evaluate if tool parameters were correct
tool_parameter_evaluator = ToolParameterAccuracyEvaluator(
    rubric="Score based on parameter accuracy and appropriateness for the task"
)

Available Evaluators

Output-Based Evaluators

These evaluators work directly with inputs and outputs without requiring OpenTelemetry traces:

  • OutputEvaluator: Flexible LLM-based evaluation with custom rubrics
  • TrajectoryEvaluator: Action sequence evaluation with built-in scoring tools (supports both list-based trajectories and Session traces via extractors)
  • InteractionsEvaluator: Multi-agent interaction and handoff evaluation
  • Custom Evaluators: Extensible base class for domain-specific logic

Trace-Based Evaluators

These evaluators require OpenTelemetry traces (Session objects) to analyze agent behavior:

Tool-Level Evaluators

Evaluate individual tool calls within a conversation:

  • ToolSelectionAccuracyEvaluator: Evaluates appropriateness of tool choices at specific points
  • ToolParameterAccuracyEvaluator: Evaluates correctness of tool parameters based on context

Trace-Level Evaluators

Evaluate the most recent turn in a conversation:

  • HelpfulnessEvaluator: Seven-level helpfulness assessment from user perspective
  • FaithfulnessEvaluator: Evaluates if responses are grounded in conversation history
  • CoherenceEvaluator: Assesses logical cohesion and reasoning quality with five-level scoring
  • ConcisenessEvaluator: Evaluates response brevity with three-level scoring
  • ResponseRelevanceEvaluator: Evaluates relevance of responses to user questions
  • HarmfulnessEvaluator: Binary evaluation for harmful content detection
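
Level-based scoring can be pictured as a lookup from a judge label to a number plus a pass threshold. The labels, scores, and the 0.5 threshold below are the ones quoted for ResponseRelevanceEvaluator in the v0.1.5 entry of the Release History; the grade and evenly_spaced_scores helpers are hypothetical, and even spacing for the other scales (such as the seven helpfulness levels) is an assumption rather than something this page confirms.

```python
# Labels, scores, and threshold as quoted in the v0.1.5 release notes.
RELEVANCE_SCORES: dict[str, float] = {
    "Not At All": 0.0,
    "Not Generally": 0.25,
    "Neutral/Mixed": 0.5,
    "Generally Yes": 0.75,
    "Completely Yes": 1.0,
}
PASS_THRESHOLD = 0.5


def grade(label: str) -> tuple[float, bool]:
    """Translate a judge label into (score, test_pass). Hypothetical helper."""
    score = RELEVANCE_SCORES[label]
    return score, score >= PASS_THRESHOLD


def evenly_spaced_scores(levels: int) -> list[float]:
    """Scores for an ordinal scale from 0.0 to 1.0, ASSUMING even spacing."""
    return [i / (levels - 1) for i in range(levels)]


print(grade("Neutral/Mixed"))
print(grade("Not Generally"))
print(evenly_spaced_scores(5))
```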

Session-Level Evaluators

Evaluate entire conversation sessions:

  • GoalSuccessRateEvaluator: Measures if user goals were achieved across the full conversation

Experiment Management and Serialization

Save, load, and version experiments for reproducibility:

# Save experiment with metadata
experiment.to_file("customer_service_eval", "json")

# Load experiment from file
loaded_experiment = Experiment.from_file("./experiment_files/customer_service_eval.json", "json")

# Experiment files include:
# - Test cases with metadata
# - Evaluator configuration
# - Expected outputs and trajectories
# - Versioning information
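
A cheap regression check for versioned experiment files is a save-and-reload round trip. The sketch below uses only the standard library and a simplified, hypothetical file layout; the SDK's actual JSON schema may differ. The second case and the ensure_ascii=False setting are illustrative choices, motivated by the v0.1.1 release note about preserving non-ASCII characters in JSON output.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, simplified stand-in for an experiment file.
experiment_data = {
    "cases": [
        {
            "name": "knowledge-1",
            "input": "What is the capital of France?",
            "expected_output": "The capital of France is Paris.",
            "metadata": {"category": "knowledge"},
        },
        {
            "name": "knowledge-2",
            "input": "Quelle est la capitale du Québec ?",
            "expected_output": "Québec",
            "metadata": {"category": "knowledge"},
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "customer_service_eval.json"
    # ensure_ascii=False writes "é" as-is instead of a \u escape sequence.
    text = json.dumps(experiment_data, ensure_ascii=False, indent=2)
    path.write_text(text, encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == experiment_data)
print("Québec" in text)
```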

Evaluation Metrics and Analysis

Track comprehensive metrics across multiple dimensions:

# Built-in metrics to consider:
metrics = {
    "accuracy": "Factual correctness of responses",
    "task_completion": "Whether agent completed the task",
    "tool_selection": "Appropriateness of tool choices", 
    "response_time": "Agent response latency",
    "hallucination_rate": "Frequency of fabricated information",
    "token_usage": "Efficiency of token consumption",
    "user_satisfaction": "Subjective helpfulness ratings"
}

# Generate analysis reports
reports = experiment.run_evaluations(task_function)
reports[0].run_display()  # Interactive display with metrics breakdown
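
Of the metrics listed above, response_time is straightforward to capture around any task function. The decorator below is a minimal sketch: timed, fake_task, and the response_time_s key are hypothetical names, and whether run_evaluations preserves extra keys in the returned dict is not confirmed by this page, so the latency may need to be logged separately.

```python
import time
from typing import Any, Callable


def timed(task: Callable[[Any], dict]) -> Callable[[Any], dict]:
    """Wrap a task function so its result also reports wall-clock latency."""

    def wrapper(case: Any) -> dict:
        start = time.perf_counter()
        result = task(case)
        # "response_time_s" is a hypothetical key chosen for this sketch.
        return {**result, "response_time_s": time.perf_counter() - start}

    return wrapper


def fake_task(case: Any) -> dict:
    """Stand-in for an agent call so the sketch runs without a model."""
    return {"output": f"echo: {case}"}


result = timed(fake_task)("What is 2 + 2?")
print(result["output"], f"{result['response_time_s']:.6f}s")
```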

Best Practices

Evaluation Strategy

  1. Diversify Test Cases: Cover knowledge, reasoning, tool usage, conversation, edge cases, and safety scenarios
  2. Use Statistical Baselines: Run multiple evaluations to account for LLM non-determinism
  3. Combine Multiple Evaluators: Use output, trajectory, and helpfulness evaluators together
  4. Regular Evaluation Cadence: Implement consistent evaluation schedules for continuous improvement
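
For the statistical-baseline point, scores from repeated runs of the same case can be reduced to a mean, a spread, and a pass rate. This is a minimal sketch: summarize_runs is a hypothetical helper, the 0.5 threshold is an assumed default, and the five scores are made-up inputs rather than real results.

```python
import statistics


def summarize_runs(scores: list[float], pass_threshold: float = 0.5) -> dict[str, float]:
    """Aggregate scores from repeated runs of the same case."""
    return {
        "mean": statistics.fmean(scores),
        # stdev needs at least two samples; report 0.0 for a single run.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
    }


# Hypothetical scores from five runs of one case.
summary = summarize_runs([1.0, 0.5, 1.0, 0.0, 1.0])
print(summary)
```

A large standard deviation on a single case suggests the rubric or the agent is unstable for that input, which a single run would hide.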

Performance Optimization

  1. Use Extractors: Always use tools_use_extractor functions to prevent context overflow
  2. Update Descriptions Dynamically: Call update_trajectory_description() with tool descriptions
  3. Choose Appropriate Judge Models: Use stronger models for complex evaluations
  4. Batch Evaluations: Process multiple test cases efficiently

Experiment Design

  1. Write Clear Rubrics: Include explicit scoring criteria and examples
  2. Include Expected Trajectories: Define exact sequences for trajectory evaluation
  3. Use Appropriate Matching: Choose between exact, in-order, or any-order matching
  4. Version Control: Track agent configurations alongside evaluation results

Documentation

For detailed guidance and examples, see the project documentation.

Contributing ❤️

We welcome contributions! See our Contributing Guide for details on:

  • Development setup
  • Contributing via Pull Requests
  • Code of Conduct
  • Reporting of security issues

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Security

See CONTRIBUTING for more information.

Release History

v0.2.1 (Urgency: High, Date: 5/29/2026)

  • chore: added evals-skills by @poshinchen in https://github.com/strands-agents/evals/pull/231
  • feat: add chaos testing module for fault injection by @ybdarrenwang in https://github.com/strands-agents/evals/pull/224
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.2.0...v0.2.1

v0.2.0 (Urgency: High, Date: 5/14/2026)

  • chore(detectors): update import to include DiagnosisTrigger by @poshinchen in https://github.com/strands-agents/evals/pull/219
  • feat(simulator): structured_output for ActorSimulator by @poshinchen in https://github.com/strands-agents/evals/pull/207
  • feat: added strands-reviewer workflow into evals by @poshinchen in https://github.com/strands-agents/evals/pull/223
  • feat: add official Discord link by @Albertozhao in https://github.com/strands-agents/evals/pull/227
  …

v0.1.17 (Urgency: High, Date: 5/8/2026)

  • feat: add multimodal evaluators and prompt templates for image-to-text evaluation by @sangminwoo in https://github.com/strands-agents/evals/pull/187
  • feat(detectors): added analyze_root_cause by @poshinchen in https://github.com/strands-agents/evals/pull/179
  • feat(detectors): integrated rca into evaluation workflow by @poshinchen in https://github.com/strands-agents/evals/pull/210
  • chore(detectors): included more fields to the RCAItem by @poshinchen in https://github.c…

v0.1.16 (Urgency: High, Date: 4/30/2026)

  • feat: simplify devx by adding @eval_task decorator and handlers for wrapping task functions by @afarntrog in https://github.com/strands-agents/evals/pull/199
  • feat(detectors): detectors interface and failure_detector implementation by @poshinchen in https://github.com/strands-agents/evals/pull/189
  • refactor(evaluators): use PEP 604 union syntax and add Model type to HarmfulnessEvaluator by @afarntrog in https://github.com/strands-agents/evals/pull/206
  • Full Change…

v0.1.15 (Urgency: High, Date: 4/17/2026)

  • docs(simulators): updated simulators README by @poshinchen in https://github.com/strands-agents/evals/pull/195
  • feat: add correctness evaluator, trace-based and reference-based by @ybdarrenwang in https://github.com/strands-agents/evals/pull/185
  • feat: add OpenSearchProvider and OpenSearchSessionMapper by @kylehounslow in https://github.com/strands-agents/evals/pull/192
  • New Contributors: @kylehounslow made their first contribution in https://github.com/strands-ag…

v0.1.14 (Urgency: High, Date: 4/8/2026)

Major Features: Ground Truth Assertion Support for Goal Success Rate Evaluator (PR#180, https://github.com/strands-agents/evals/pull/180). The `GoalSuccessRateEvaluator` now supports a second evaluation mode: assertion-based evaluation. When `expected_assertion` is provided on the evaluation case, the judge LLM evaluates whether the agent's behavior satisfies explicit success assertions rather than inferring goals from the conversation. This enables precise, rep…

v0.1.13 (Urgency: Medium, Date: 3/31/2026)

  • feat: add LocalFileTaskResultStore for caching task results locally by @afarntrog in https://github.com/strands-agents/evals/pull/178
  • feat(mappers): langfuse provider changes to support newer version of langfuse by @poshinchen in https://github.com/strands-agents/evals/pull/165
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.12...v0.1.13

v0.1.12 (Urgency: Medium, Date: 3/26/2026)

  • feat(mapper): added framework detection for traces from CloudWatch by @poshinchen in https://github.com/strands-agents/evals/pull/164
  • refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread by @afarntrog in https://github.com/strands-agents/evals/pull/173
  • feat: add TaskResultStore for caching and replaying task execution results by @afarntrog in https://github.com/strands-agents/evals/pull/176
  • feat(mappers): cloudwatch change for openinfer…

v0.1.11 (Urgency: Low, Date: 3/19/2026)

  • feat(report): allow flattened report by @poshinchen in https://github.com/strands-agents/evals/pull/157
  • feat: add environment state evaluation support by @afarntrog in https://github.com/strands-agents/evals/pull/156
  • feat: added Langchain mappers by @poshinchen in https://github.com/strands-agents/evals/pull/153
  • feat: add environment state support to OutputEvaluator by @afarntrog in https://github.com/strands-agents/evals/pull/160
  • fix: hatch run test-lint by @afa…

v0.1.10 (Urgency: Low, Date: 3/11/2026)

  • feat: add deterministic evaluators for output and trajectory checks by @afarntrog in https://github.com/strands-agents/evals/pull/154
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.9...v0.1.10

v0.1.9 (Urgency: Low, Date: 3/4/2026)

  • feat: add CloudWatchProvider to pull remote cloudwatch traces and run evals against them. by @afarntrog in https://github.com/strands-agents/evals/pull/147
  • feat: add ToolSimulator for tool response simulation by @ybdarrenwang in https://github.com/strands-agents/evals/pull/111
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.8...v0.1.9

v0.1.8 (Urgency: Low, Date: 2/25/2026)

  • fix: handle parallel tool calls during tool extraction by @clareliguori in https://github.com/strands-agents/evals/pull/137
  • feat: trace provider interface by @afarntrog in https://github.com/strands-agents/evals/pull/140
  • feat: add LangfuseProvider for remote trace evaluation by @afarntrog in https://github.com/strands-agents/evals/pull/144
  • ci: bump amannn/action-semantic-pull-request from 5 to 6 by @dependabot[bot] in https://github.com/strands-agents/evals/pull/138

v0.1.7 (Urgency: Low, Date: 2/19/2026)

  • fix: retrieve multiple text contentBlock in messageConent by @poshinchen in https://github.com/strands-agents/evals/pull/133
  • feat(workflows): add conventional commit workflow in PR by @mkmeral in https://github.com/strands-agents/evals/pull/134
  • fix: add tool info to concisenss, harmfulness, helpfulness and response relevance evaluators by @ybdarrenwang in https://github.com/strands-agents/evals/pull/132
  • fix: update output variable name in workflow by @Unshure in htt…

v0.1.6 (Urgency: Low, Date: 2/11/2026)

  • Refactor: centralized InputT and OutputT by @poshinchen in https://github.com/strands-agents/evals/pull/124
  • Added CoherenceEvaluator by @poshinchen in https://github.com/strands-agents/evals/pull/125
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.5...v0.1.6

v0.1.5 (Urgency: Low, Date: 2/5/2026)

Major Features: Response Relevance Evaluator (PR#112, https://github.com/strands-agents/evals/pull/112). The new `ResponseRelevanceEvaluator` measures how well an agent's response addresses the user's question. It uses a 5-level LLM-as-judge scoring system, Not At All (0.0), Not Generally (0.25), Neutral/Mixed (0.5), Generally Yes (0.75), and Completely Yes (1.0), with a pass threshold at ≥0.5. Like other trace-level evaluators, it requires an `actual_trajectory` session…

v0.1.4 (Urgency: Low, Date: 1/29/2026)

  • fix: include tool executions in _extract_trace_level by @razkenari in https://github.com/strands-agents/evals/pull/77
  • New Contributors: @razkenari made their first contribution in https://github.com/strands-agents/evals/pull/77
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.3...v0.1.4

v0.1.3 (Urgency: Low, Date: 1/21/2026)

  • fix: Multiple Tool Usage Not Detected in tools_use_extractor.py by @bipro1992 in https://github.com/strands-agents/evals/pull/80
  • New Contributors: @bipro1992 made their first contribution in https://github.com/strands-agents/evals/pull/80
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.2...v0.1.3

v0.1.2 (Urgency: Low, Date: 1/13/2026)

  • fix: Isolate evaluator errors in run_evaluations by @afarntrog in https://github.com/strands-agents/evals/pull/84
  • fix(extractors): Add null check for toolResult in message extraction by @afarntrog in https://github.com/strands-agents/evals/pull/85
  • Full Changelog: https://github.com/strands-agents/evals/compare/v0.1.1...v0.1.2

v0.1.1 (Urgency: Low, Date: 12/15/2025)

  • fix broken links by @theofpa in https://github.com/strands-agents/evals/pull/63
  • feat: Extract whether tool result was an error by @clareliguori in https://github.com/strands-agents/evals/pull/66
  • docs: updated README to include simulator feature by @poshinchen in https://github.com/strands-agents/evals/pull/70
  • fix: preserve non-ASCII characters in JSON file output by @daisuke-awaji in https://github.com/strands-agents/evals/pull/69
  • ci: bump actions/download-artifa…

v0.1.0 (Urgency: Low, Date: 12/3/2025)

Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.

Feature Overview

  • Multiple Evaluation Types: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation
  • LLM-as-a-Judge: Built-in evalu…
