deepeval

The LLM Evaluation Framework

Why this rank: Strong adoption, recent release, healthy release cadence

README

DeepEval is a simple-to-use, open-source framework for evaluating large language model (LLM) systems. It is similar to Pytest but specialized for unit testing LLM apps. DeepEval incorporates the latest research to run evals via metrics such as G-Eval, task completion, answer relevancy, and hallucination, which use LLM-as-a-judge and other NLP models that run locally on your machine.

Whether you're building AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your AI quality, prevent prompt drifting, or even transition from OpenAI to Claude with confidence.

Important

Need a place for your DeepEval testing data to live 🏡❤️? Sign up to the DeepEval platform to compare iterations of your LLM app, generate & share testing reports, and more.

Demo GIF

Want to talk LLM evaluation, need help picking metrics, or just to say hi? Come join our discord.


🔥 Metrics and Features

  • 📐 Large variety of ready-to-use LLM eval metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine, covering all use cases:

    • Custom, All-Purpose Metrics:

      • G-Eval: a research-backed LLM-as-a-judge metric for evaluating on any custom criteria with human-like accuracy
      • DAG: DeepEval's graph-based deterministic LLM-as-a-judge metric builder
    • Agentic Metrics
    • RAG Metrics
      • Answer Relevancy: measure how relevant the RAG pipeline's output is to the input
      • Faithfulness: evaluate whether the RAG pipeline's output factually aligns with the retrieval context
      • Contextual Recall: measure how well the RAG pipeline's retrieval context aligns with the expected output
      • Contextual Precision: evaluate whether relevant nodes in the RAG pipeline's retrieval context are ranked higher
      • Contextual Relevancy: measure the overall relevance of the RAG pipeline's retrieval context to the input
      • RAGAS: average of answer relevancy, faithfulness, contextual precision, and contextual recall
    • Multi-Turn Metrics
      • Knowledge Retention: evaluate whether the chatbot retains factual information throughout a conversation
      • Conversation Completeness: measure whether the chatbot satisfies user needs throughout a conversation
      • Turn Relevancy: evaluate whether the chatbot generates consistently relevant responses throughout a conversation
      • Turn Faithfulness: check if the chatbot's responses are factually grounded in retrieval context across turns
      • Role Adherence: evaluate whether the chatbot adheres to its assigned role throughout a conversation
    • MCP Metrics
      • MCP Task Completion: evaluate how effectively an MCP-based agent accomplishes a task
      • MCP Use: measure how effectively an agent uses its available MCP servers
      • Multi-Turn MCP Use: evaluate MCP server usage across conversation turns
    • Multimodal Metrics
      • Text to Image: evaluate image generation quality based on semantic consistency and perceptual quality
      • Image Editing: evaluate image editing quality based on semantic consistency and perceptual quality
      • Image Coherence: measure how well images align with their accompanying text
      • Image Helpfulness: evaluate how effectively images contribute to user comprehension of the text
      • Image Reference: evaluate how accurately images are referred to or explained by accompanying text
    • Other Metrics
      • Hallucination: check whether the LLM generates factually correct information against provided context
      • Summarization: evaluate whether summaries are factually correct and include necessary details
      • Bias: detect gender, racial, or political bias in LLM outputs
      • Toxicity: evaluate toxicity in LLM outputs
      • JSON Correctness: check whether the output matches an expected JSON schema
      • Prompt Alignment: measure whether the output aligns with instructions in the prompt template
  • 🎯 Supports both end-to-end and component-level LLM evaluation.

  • 🧩 Build your own custom metrics that are automatically integrated with DeepEval's ecosystem.

  • 🔮 Generate both single and multi-turn synthetic datasets for evaluation.

  • 🔗 Integrates seamlessly with ANY CI/CD environment.

  • 🧬 Optimize prompts automatically based on evaluation results.

  • 🏆 Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, and GSM8K.
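
The RAGAS entry in the list above is described as an average of four RAG metric scores. As a rough illustration of that arithmetic only, here is a plain-Python sketch. The function name `ragas_style_average` and the sample scores are hypothetical, the mean is assumed to be unweighted, and this is not DeepEval's implementation.

```python
from statistics import mean


def ragas_style_average(answer_relevancy, faithfulness,
                        contextual_precision, contextual_recall):
    """Unweighted mean of four component scores (hypothetical helper).

    Each score is assumed to lie in the range 0 to 1.
    """
    scores = [answer_relevancy, faithfulness,
              contextual_precision, contextual_recall]
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return mean(scores)


# Made-up component scores, for illustration only.
combined = ragas_style_average(0.9, 0.8, 0.7, 0.6)
print(round(combined, 2))
```

Because the combination is a simple mean, one weak component (for example, poor contextual recall) lowers the combined score without being visible on its own, so it is usually worth inspecting the individual metrics as well.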


🔌 Integrations

DeepEval plugs into any LLM framework: OpenAI Agents, LangChain, CrewAI, and more. To scale evals across your team, or let anyone run them without writing code, Confident AI gives you a native platform integration.

Frameworks

  • OpenAI: evaluate and trace OpenAI applications via a client wrapper
  • OpenAI Agents: evaluate OpenAI Agents end-to-end in under a minute
  • LangChain: evaluate LangChain applications with a callback handler
  • LangGraph: evaluate LangGraph agents with a callback handler
  • Pydantic AI: evaluate Pydantic AI agents with type-safe validation
  • CrewAI: evaluate CrewAI multi-agent systems
  • Anthropic: evaluate and trace Claude applications via a client wrapper
  • AWS AgentCore: evaluate agents deployed on Amazon AgentCore
  • LlamaIndex: evaluate RAG applications built with LlamaIndex

☁️ Platform + Ecosystem

Confident AI is an all-in-one platform that integrates natively with DeepEval.

  • Manage datasets, trace LLM applications, run evaluations, and monitor responses in production, all from one platform.
  • Don't need a UI? Confident AI can also be your persistent data layer: run evals, pull datasets, and inspect traces straight from Claude Code or Cursor via Confident AI's MCP server.

Confident AI MCP Architecture


🚀 QuickStart

Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.

Installation

DeepEval works with Python 3.9+.

pip install -U deepeval

Create an account (highly recommended)

Using the DeepEval platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.

To log in, run:

deepeval login

Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).

Write your first test case

Create a test file:

touch test_chatbot.py

Open test_chatbot.py and write your first test case to run an end-to-end evaluation using DeepEval, which treats your LLM app as a black box:

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])

Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model, for more details visit this part of our docs):

export OPENAI_API_KEY="..."

And finally, run test_chatbot.py in the CLI:

deepeval test run test_chatbot.py

Congratulations! Your test case should have passed ✅ Let's break down what happened.

  • The variable input mimics a user input, and actual_output is a placeholder for what your application is supposed to output based on this input.
  • The variable expected_output represents the ideal answer for a given input, and GEval is a research-backed metric provided by deepeval for evaluating your LLM's outputs on any custom criteria with human-like accuracy.
  • In this example, the metric criterion is the correctness of the actual_output based on the provided expected_output.
  • All metric scores range from 0 to 1, and the threshold=0.5 setting ultimately determines whether your test has passed.
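
The pass/fail rule in the last bullet can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not DeepEval's code: the helper name `passes` is hypothetical, higher scores are assumed to be better, and a score exactly equal to the threshold is assumed to count as a pass.

```python
def passes(score: float, threshold: float = 0.5) -> bool:
    """Return True when a 0-to-1 metric score meets the threshold.

    Hypothetical helper: assumes higher is better and that a score
    equal to the threshold passes.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return score >= threshold


# Made-up scores, checked against the threshold used in the test above.
for score in (0.2, 0.5, 0.9):
    print(score, "passed" if passes(score, threshold=0.5) else "failed")
```

Raising the threshold makes the test stricter without changing how the metric itself is scored.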

Read our documentation for more information!


Evaluating Nested Components

Use the @observe decorator to trace components (LLM calls, retrievers, tool calls, agents) and apply metrics at the component level, with no need to rewrite your codebase:

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import GEval

correctness = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

@observe(metrics=[correctness])
def inner_component():
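    # The Correctness metric compares actual_output with expected_output, so supply both
    update_current_span(test_case=LLMTestCase(input="...", actual_output="...", expected_output="..."))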
    return "result"

@observe()
def llm_app(input: str):
    return inner_component()

dataset = EvaluationDataset(goldens=[Golden(input="Hi!")])
for golden in dataset.evals_iterator():
    llm_app(golden.input)

Learn more about component-level evaluations here.


Evaluate Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])

Using Standalone Metrics

DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# All metrics also offer an explanation
print(answer_relevancy_metric.reason)

Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.

Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(goldens=[Golden(input="What's the weather like today?")])

# your_llm_app is a placeholder for your own application's entry point
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

@pytest.mark.parametrize(
    "test_case",
    dataset.test_cases,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])

# Run this in the CLI; you can also add an optional -n flag to run tests in parallel
deepeval test run test_<filename>.py -n 4

Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:

from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])

A Note on Env Variables (.env / .env.local)

DeepEval auto-loads .env.local then .env from the current working directory at import time. Precedence: process env -> .env.local -> .env. Opt out with DEEPEVAL_DISABLE_DOTENV=1.

cp .env.example .env.local
# then edit .env.local (ignored by git)
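
The precedence stated above (process env, then .env.local, then .env) can be illustrated with plain dictionaries. This is a sketch of the lookup order only, under the assumption that each source is already parsed into key/value pairs; the function `resolve_env` and the variable names `LOG_LEVEL` and `REGION` are hypothetical, and the code does not reflect how DeepEval loads files.

```python
def resolve_env(process_env, env_local, env_file):
    """Merge three sources so that higher-precedence values win.

    Precedence, highest first: process env, .env.local, .env.
    """
    merged = dict(env_file)     # lowest precedence: .env
    merged.update(env_local)    # .env.local overrides .env
    merged.update(process_env)  # the process environment overrides both
    return merged


# Made-up values showing which source wins for each key.
resolved = resolve_env(
    process_env={"OPENAI_API_KEY": "from-shell"},
    env_local={"OPENAI_API_KEY": "from-env-local", "LOG_LEVEL": "debug"},
    env_file={"OPENAI_API_KEY": "from-env", "LOG_LEVEL": "info", "REGION": "eu"},
)
print(resolved)
```

In practice this means a key exported in your shell is never overwritten by either file, and .env.local is a convenient place for machine-specific overrides.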

DeepEval With Confident AI

Confident AI is an all-in-one platform to manage datasets, trace LLM applications, and run evaluations in production. Log in from the CLI to get started:

deepeval login

Then run your tests as usual; results are automatically synced to the platform:

deepeval test run test_chatbot.py

Prefer to stay in your IDE? Use DeepEval via Confident AI's MCP server as the persistent layer to run evals, pull datasets, and inspect traces without leaving your editor.

Everything on Confident AI is available here.


Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.


Roadmap

Features:

  • Integration with Confident AI
  • Implement G-Eval
  • Implement RAG metrics
  • Implement Conversational metrics
  • Evaluation Dataset Creation
  • Red-Teaming
  • DAG custom metrics
  • Guardrails

Authors

Built by the founders of Confident AI. Contact jeffreyip@confident-ai.com for all enquiries.


License

DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.

Release History

Version | Changes | Urgency | Date
v4.0.5 | ### New Feature - Add support for the `claude-opus-4-8` model preset, including multimodal and structured output capabilities with updated pricing metadata. ([#2698](https://github.com/confident-ai/deepeval/pull/2698)) ([Vamshi Adimalla](https://github.com/A-Vamshi)) | High | 5/28/2026
v4.0.3 | ### New Features - Add a simulation graph API to control how user turns are generated during conversation simulation. `ConversationSimulator` now accepts `simulation_graph`, and `controller` is deprecated in favor of `stopping_controller` with a warning for legacy usage. ([#2678](https://github.com/confident-ai/deepeval/pull/2678)) ([Jeffrey Ip](https://github.com/penguine-ip)) - Add support for `retrieval_context` entries as `RetrievedContextData` with `context` and `source`, enabling conte | High | 5/21/2026
v4.0.2 | DeepEval 4.0 introduces an agent-native evaluation workflow designed for coding agents, rapid debugging, and production AI systems. If you're vibe coding agents, on something like claude code, this release is for you. ## Eval Harness for Coding Agents Coding agents can now run eval-driven iterations directly in context. - Agents see metric failures, scores, and reasoning inline - Supports iterative patch → eval → retry workflows - Built for Cursor, Claude Code, Codex, and agentic d | High | 5/13/2026
v3.9.5 | # Full support for agentic evals :) If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app. ## 🎯 1. Task Completion Evaluate whether an agent *actually completes the intended task*, not just whether its final output “looks correct.” Captures: - Goal completion - Intermediate step correctness - Error recovery - Procedural accuracy Docs: https://deepeval.com/docs/metrics-task-completion --- ## 🔧 2. Tool Cor | Low | 12/1/2025
v3.9.7 | # Full support for agentic evals :) If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app. ## 🎯 1. Task Completion Evaluate whether an agent *actually completes the intended task*, not just whether its final output “looks correct.” Captures: - Goal completion - Intermediate step correctness - Error recovery - Procedural accuracy Docs: https://deepeval.com/docs/metrics-task-completion --- ## 🔧 2. Tool Cor | Low | 12/1/2025
v3.9.9 | # Full support for agentic evals :) If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app. ## 🎯 1. Task Completion Evaluate whether an agent *actually completes the intended task*, not just whether its final output “looks correct.” Captures: - Goal completion - Intermediate step correctness - Error recovery - Procedural accuracy Docs: https://deepeval.com/docs/metrics-task-completion --- ## 🔧 2. Tool Cor | Low | 12/1/2025
v3.7.2# Less Code to Load Data In and Out of DeepEval's Ecosystem :) If you're using any of the features below, you'll likely see a 50% reduction in code required, especially around ETL for formatting things in and out of DeepEval's ecosystem. This includes: ## ๐Ÿ†š Arena-GEval The first LLM-arena-as-a-Judge metric, now runs a blinded experiment and swaps positions randomly for a fair verdict on which LLM output is better. Docs: https://deepeval.com/docs/metrics-arena-g-eval ## โš›๏ธ You canLow8/4/2025
v3.2.6### โš™๏ธ New Features DeepEval's 3.2.6 release focuses on single-vs multi-turn use cases in datasets! #### ๐Ÿงฉ Support for Single-Turn and Multi-Turn Datasets - **Single-turn datasets**: Simple `input โ†’ output` pairs for one-off prompt testing. - **Multi-turn datasets**: Full conversation flows with alternating user/assistant turns. Perfect for simulating real chat interactions. DeepEval now **automatically detects** whether a dataset is single-turn or multi-turn based on structure andLow7/15/2025
v3.1.9 | A Metric Like LLM Arena Is Here. In DeepEval's latest release, we are introducing `ArenaGEval`, the first ever metric to compare test cases and choose the best performing one based on your custom criteria. It looks something like this: ```python from deepeval import evaluate from deepeval.test_case import ArenaTestCase, LLMTestCaseParams from deepeval.metrics import ArenaGEval a_test_case = ArenaTestCase( contestants={ "GPT-4": LLMTestCase( input= … | Low | 6/25/2025
v3.1.5 | In DeepEval's latest release, we are introducing multimodal G-Eval, plus 7+ multimodal metrics! Previously we had great support for single-turn, text evaluation in the form of `LLMTestCase`s, but now we're adding `MLLMTestCase`, which accepts images: ```python from deepeval.metrics import MultimodalGEval from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage from deepeval import evaluate m_test_case = MLLMTestCase( input=["Show me how to fold an airplane"], … | Low | 6/19/2025
v3.0.8 | In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated. Previously we treated a conversation as a list of `LLMTestCase`s, which might not necessarily be the case. Now a conversational test case is made up of a list of `Turn`s instead, which follows OpenAI's standard `messages` format: ```python from deepeval.test_case import Turn turns = [Turn(role="user", content="...")] ``` Docs here: https://deepeval.com/docs/evaluation-test-cases#co … | Low | 6/10/2025
v3.0.6 | Added new loading bars for component-level evals, and `deepeval view` to see results on Confident AI. | Low | 6/7/2025
v3.0 | 🚀 DeepEval v3.0: Evaluate Any LLM Workflow, Anywhere. We're excited to introduce **DeepEval v3.0**, a major milestone that transforms how you evaluate LLM applications, from complex multi-step agents to simple prompt chains. This release brings **component-level granularity**, **production-ready observability**, and **simulation tools** to empower devs building modern AI systems. 🔍 Component-Level Evaluation for Agentic Workflows: You can now apply DeepEval metrics **to an… | Low | 5/27/2025
v2.9.0 | Rubric Available for G-Eval: https://www.deepeval.com/docs/metrics-llm-evals#rubric | Low | 5/15/2025
v2.8.5 | In this release we've cleaned up some dependencies to separate out dev packages, and added more verbose tracing logs for debugging. | Low | 5/6/2025
v2.7.9 | 🚨 Breaking Changes. ⚠️ This release introduces breaking changes in preparation for DeepEval v3.0. Please review carefully and adjust your code as needed. The `evaluate()` function now has "configs": Previously the `evaluate()` function had 13+ arguments to control display, async behaviors, caching, etc. and it was growing out of control. We've now abstracted it into "configs" instead: ```python from deepeval.evaluate.configs import AsyncConfig from deepeval import evaluat… | Low | 4/28/2025
v2.7.6 | Cleaned up dependencies for the upcoming 3.0 release: - Removed automatic updates; they are now opt-in: https://www.deepeval.com/docs/miscellaneous - Removed instructor (double-checked, and it wasn't used anywhere) - Moved LlamaIndex to an optional dependency, since only one module needs it | Low | 4/23/2025
v2.6.8 | The latest conversation simulator simulates fake user interactions to generate conversations on your behalf. These conversations can be used for evaluation right afterwards; the simulator is similar to the goldens synthesizer. Docs here: https://docs.confident-ai.com/docs/evaluation-conversation-simulator | Low | 4/7/2025
v2.6.5 | What's New 🔥 - Migrated default provider models to support Synthesizer - Default model providers are now in a different directory; those using `deepeval` < 2.5.6 might need to update imports | Low | 3/26/2025
v2.5.9 | What's New 🔥 - Custom prompt template overriding for all RAG metrics. This was introduced for folks using weaker models for evaluation, or just models in general that don't fit too well with OpenAI's prompt formatting, which is what most of `deepeval`'s metrics are built around. You can still use your favorite metrics and algorithms, but now with a custom template if required. Example here: https://docs.confident-ai.com/docs/metrics-answer-relevancy#customize-your-template - Fixes to our … | Low | 3/18/2025
v2.3.9 | 🥳 New feature that lets users inject a custom template into the Faithfulness metric. Most suited for custom LLMs where text data is highly formatted by data engineers and stored in databases according to different categories. | Low | 2/20/2025
v2.2.7 | Here are the new features we're bringing to you in the latest release: 💥 Releasing beta for *Deep, Acyclic, Graph*, a new way in deepeval to build decision trees that give deterministic outputs for LLM evaluation: https://docs.confident-ai.com/docs/metrics-dag ⚙️ Open-sourcing all LLM red teaming vulnerabilities: https://docs.confident-ai.com/docs/red-teaming-introduction 🪄 Fixes to synthetic dataset generation pipeline | Low | 1/31/2025
v2.0 | Here are the new features we're bringing to you in the latest release: ⚙️ Automated LLM red teaming, aka vulnerability and security safety scanning. You can now scan for 40+ vulnerabilities using 10+ SOTA attack enhancement techniques in <10 lines of Python code. 🪄 Synthetic dataset generation with a highly customizable synthetic data generation pipeline to cover literally any use case. 🖼️ Multi-modal LLM evaluation, perfect for image editing or text-image use cases. 💬 Conversat… | Low | 12/2/2024
v1.4.7 | In DeepEval 1.4.7, we're releasing: - LLM red teaming. Safety test your LLM application for 40+ vulnerabilities with 10+ attack enhancements, docs here: https://docs.confident-ai.com/docs/red-teaming-introduction - Improved synthetic data synthesizer, with much more functionality and customizability: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data - Conversational metrics: Dedicated metrics to evaluate LLM turns - Multi-modal metrics: Image editing and text to image evaluatio… | Low | 10/31/2024
v0.21.74 | In DeepEval v0.21.74, we have: - Agentic evaluation metric to evaluate tool calling correctness for LLM agents: https://docs.confident-ai.com/docs/metrics-tool-correctness - Pydantic Schemas to enforce JSON outputs for custom, smaller LLMs: https://docs.confident-ai.com/docs/guides-using-custom-llms - Asynchronous support for synthetic data generation: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data - Tracing integration for LlamaIndex and LangChain: https://docs.confid … | Low | 7/30/2024
v0.21.62 | In DeepEval v0.21.62: - Added an option to print out intermediate steps during metric execution, which can be configured via the `verbose_mode` parameter: https://docs.confident-ai.com/docs/metrics-answer-relevancy#example - Hyperparameters can be logged to Confident AI via the `evaluate()` function: https://docs.confident-ai.com/docs/getting-started#optimizing-hyperparameters - Synthetic data generation now gives more realistic results and is more customizable: https://docs.confident-ai.co … | Low | 6/25/2024
v0.21.15 | For deepeval's latest release v0.21.15, we release: - Synthetic data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data - Caching. If you're running 10k test cases and it fails at the 9999th test case, you no longer have to rerun the first 9999 test cases, as you can just read from cache using the `-c` flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache - Repeats. If you want to repeat each test cas… | Low | 3/31/2024
v0.20.85 | In deepeval v0.20.85: - Asynchronous support throughout deepeval, which no longer uses threads. Users can also call individual metrics asynchronously: https://docs.confident-ai.com/docs/metrics-introduction#measuring-metrics-in-async - Improved the way in which you create a custom LLM for evaluation. You'll now have to implement an asynchronous generate() method to use deepeval's async features: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm - Strict mode for all … | Low | 3/9/2024
v0.20.80 | In DeepEval's latest release, there is now: - Conversational metrics: https://docs.confident-ai.com/docs/metrics-knowledge-retention. This metric evaluates whether your LLM is able to retain factual information presented to it throughout a conversation - Synthetic data generation. Generate evaluation datasets from scratch: https://docs.confident-ai.com/docs/evaluation-datasets#generate-an-evaluation-dataset | Low | 3/4/2024
v0.20.73 | For the newest release, deepeval is now stable for production use: - reduced package size - separated functionality of pytest vs deepeval test run command - included coverage score for summarization - fixed contextual precision node error - released docs for better transparency into metrics calculation - allows users to configure RAGAS metrics for custom embedding models: https://docs.confident-ai.com/docs/metrics-ragas#example - fixed bugs with checking for package updates | Low | 2/25/2024
v0.20.68 | For the latest release, DeepEval: - Supports Hugging Face users by providing real-time evaluations during fine-tuning: https://docs.confident-ai.com/docs/integrations-huggingface - Supports LlamaIndex users by allowing unit testing of LlamaIndex apps in CI/CD, and offers metrics in LlamaIndex's evaluators: https://docs.confident-ai.com/docs/integrations-llamaindex - Improvements to accuracy and reliability in Faithfulness and Answer Relevancy - Summarization Metric now offers explanation - … | Low | 2/14/2024
v0.20.57 | - LLM-Evals (LLM evaluated metrics) now support all of langchain's chat models. - `LLMTestCase` now has `execution_time` and `cost`, useful for those looking to evaluate on these parameters - `minimum_score` is now `threshold` instead, meaning you can now create custom metrics that have either a "minimum" or a "maximum" threshold - `LLMEvalMetric` is now `GEval` - LlamaIndex Tracing integration: (https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#deepeval) | Low | 1/16/2024
v0.20.43 | In this release: - Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, and Contextual Recall all offer a reason for their given score. - Azure OpenAI now supported via a single command in the CLI: https://docs.confident-ai.com/docs/metrics-introduction#using-azure-openai - New Summarization Metric that uses the QAG framework for its implementation: https://docs.confident-ai.com/docs/metrics-summarization - Pulling datasets from Confident AI now offers an intermediat… | Low | 12/28/2023
v0.20.35 | Lots of new features this release: 1. `JudgementalGPT` now allows for different languages, useful for our APAC and European friends 2. `RAGAS` metrics now support all OpenAI models, useful for those running into context length issues 3. `LLMEvalMetric` now returns a reasoning for its score 4. `deepeval test run` now has hooks that call on test run completion 5. `evaluate` now displays `retrieval_context` for RAG evaluation 6. `RAGAS` metric now displays metric breakdown for all its d… | Low | 12/14/2023
v0.20.23 | [Automatically integrated with Confident AI](https://app.confident-ai.com/) for continuous evaluation throughout the lifetime of your LLM (app): - log evaluation results and analyze metrics pass / fails - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results - debug evaluation results via LLM traces - manage evaluation test cases / datasets in one place - track events to identify live LLM responses in production - add prod… | Low | 12/4/2023
v0.20.27 | [Automatically integrated with Confident AI](https://app.confident-ai.com/) for continuous evaluation throughout the lifetime of your LLM (app): - log evaluation results and analyze metrics pass / fails - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results - debug evaluation results via LLM traces - manage evaluation test cases / datasets in one place - track events to identify live LLM responses in production - add prod… | Low | 11/22/2023
v0.20.19 | Mid-week bug fixes release with an extra feature: - `run_test` now works - new function `evaluate`, which evaluates a list of test cases (dataset) on metrics you define, all without having to go through the CLI. More info here: https://docs.confident-ai.com/docs/evaluation-datasets#evaluate-your-dataset-without-pytest | Low | 11/16/2023
v0.20.18 | In this release, deepeval has added support for: - JudgementalGPT, a dedicated LLM app developed by Confident AI to perform evaluations more robustly and accurately. JudgementalGPT provides a score and a reason for the score. - Parallel testing: execute test cases in parallel and speed up evaluation by up to 100x. | Low | 11/14/2023
v0.20.17 | Release v0.20.17 | Low | 11/13/2023
v0.20.16 | Release v0.20.16 | Low | 11/7/2023
v0.20.15 | Release v0.20.15 | Low | 11/6/2023
v0.20.14 | Release v0.20.14 | Low | 11/5/2023
v0.20.13 | Release v0.20.13 | Low | 11/5/2023
v0.20.12 | Release v0.20.12 | Low | 10/23/2023
v0.20.11 | Release v0.20.11 | Low | 10/20/2023
v0.20.10 | Release v0.20.10 | Low | 10/18/2023
v0.20.6 | What's Changed: * ensure telemetry hits your server by @ColabDog in https://github.com/confident-ai/deepeval/pull/202 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.20.5...v0.20.6 | Low | 10/12/2023
v0.20.5 | What's Changed: * firewall check for telemetry by @ColabDog in https://github.com/confident-ai/deepeval/pull/200 * hotfix telemetry setup by @ColabDog in https://github.com/confident-ai/deepeval/pull/201 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.20.3...v0.20.5 | Low | 10/12/2023
v0.20.3 | What's Changed: * clean quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/166 * Hotfix.readme by @penguine-ip in https://github.com/confident-ai/deepeval/pull/168 * Freeze typer v by @ColabDog in https://github.com/confident-ai/deepeval/pull/169 * update quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/170 * update sidebar by @ColabDog in https://github.com/confident-ai/deepeval/pull/171 * Fix sidebar by @ColabDog in https://github.com/confi … | Low | 10/11/2023
v0.20.0 | What's Changed: * Rename HOW_TO_CONTRIBUTE.md to CONTRIBUTING.md by @penguine-ip in https://github.com/confident-ai/deepeval/pull/164 * add image similarity metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/162 * Feature/add image similarity by @ColabDog in https://github.com/confident-ai/deepeval/pull/165 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.19.0...v0.20.0 | Low | 10/2/2023
v0.19.0 | What's Changed: * add guardrails integration by @ColabDog in https://github.com/confident-ai/deepeval/pull/158 * add github workflow results by @ColabDog in https://github.com/confident-ai/deepeval/pull/159 * Feature/add llm eval by @ColabDog in https://github.com/confident-ai/deepeval/pull/161 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.18.0...v0.19.0 | Low | 10/1/2023
v0.18.0 | What's Changed: * Add new customer support example by @ColabDog in https://github.com/confident-ai/deepeval/pull/154 * Add example test case by @ColabDog in https://github.com/confident-ai/deepeval/pull/156 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.9...v0.18.0 | Low | 9/29/2023
v0.17.9 | What's Changed: * fix the API and version by @ColabDog in https://github.com/confident-ai/deepeval/pull/153 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.8...v0.17.9 | Low | 9/28/2023
v0.17.8 | What's Changed: * fix by @ColabDog in https://github.com/confident-ai/deepeval/pull/147 * add context to the API by @ColabDog in https://github.com/confident-ai/deepeval/pull/150 * Resolves #151 by @ColabDog in https://github.com/confident-ai/deepeval/pull/152 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.6...v0.17.8 | Low | 9/28/2023
v0.17.6 | What's Changed: * adding length metric including test and documentation by @j-space-b in https://github.com/confident-ai/deepeval/pull/139 * add koala by @ColabDog in https://github.com/confident-ai/deepeval/pull/141 * Feature/update file name by @ColabDog in https://github.com/confident-ai/deepeval/pull/145 * add switch CLI by @ColabDog in https://github.com/confident-ai/deepeval/pull/146 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.5...v0.17.6 | Low | 9/27/2023
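
The v3.7.2 entry above says Arena-GEval "runs a blinded experiment and swaps positions randomly for a fair verdict." The idea can be sketched in plain Python. This is an illustrative sketch only, not DeepEval's implementation: `pick_winner`, `longest_output_judge`, and the `Judge` signature are hypothetical names, and the toy judge stands in for a real LLM call.

```python
import random
from typing import Callable, Dict

# Hypothetical judge signature: receives {neutral_label: output}
# and returns the label of the output it prefers.
Judge = Callable[[Dict[str, str]], str]


def pick_winner(contestants: Dict[str, str], judge: Judge, rng: random.Random) -> str:
    """Run one blinded comparison and return the real name of the winner."""
    names = list(contestants)
    rng.shuffle(names)  # randomize positions so order cannot favor one contestant

    # Hide real names behind neutral labels so the judge cannot recognize a model.
    label_to_name = {f"Contestant {i + 1}": name for i, name in enumerate(names)}
    masked = {label: contestants[name] for label, name in label_to_name.items()}

    winning_label = judge(masked)
    return label_to_name[winning_label]  # unmask only after the verdict


def longest_output_judge(masked: Dict[str, str]) -> str:
    """Toy stand-in for an LLM judge (assumption): prefers the longest output."""
    return max(masked, key=lambda label: len(masked[label]))


outputs = {
    "model_a": "Paris.",
    "model_b": "The capital of France is Paris.",
}
winner = pick_winner(outputs, longest_output_judge, random.Random(0))
print(winner)  # prints "model_b"
```

Because the judge only ever sees neutral labels in a shuffled order, neither a model's name nor its position can bias the verdict; the mapping back to real names is applied after the judge has answered.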

Similar Packages

AutoRAG | AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation | v0.3.22
SploitGPT | 🛠️ Automate penetration testing with SploitGPT, an AI agent using Kali Linux tools for efficient security assessments and minimal user input. | main@2026-06-07
agent-resources | 🛠️ Install, manage, and share Claude Code skills effortlessly with one command, streamlining your workflow and enhancing team collaboration. | main@2026-06-07
planning-with-files | 📄 Transform your workflow with persistent markdown files for planning, tracking progress, and storing knowledge like a pro. | master@2026-06-07
cadwyn | Production-ready community-driven modern Stripe-like API versioning in FastAPI | 7.0.0

More in Frameworks

spec_driven_develop | Spec-Driven Develop is a platform-agnostic AI agent skill that automates the pre-development workflow for large-scale complex tasks. It is not a framework, not a runtime, not a package manager; it is…
Drasil | Generate all the things (focusing on research software)
langchain | The agent engineering platform
deer-flow | An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of ta…