The LLM Evaluation Framework
DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating large language model systems. It is similar to Pytest but specialized for unit testing LLM apps. DeepEval incorporates the latest research to run evals via metrics such as G-Eval, task completion, answer relevancy, and hallucination, which use LLM-as-a-judge and other NLP models that run locally on your machine.
Whether you're building AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your AI quality, prevent prompt drifting, or even transition from OpenAI to Claude with confidence.
Important
Need a place for your DeepEval testing data to live? Sign up to the DeepEval platform to compare iterations of your LLM app, generate & share testing reports, and more.
Want to talk LLM evaluation, need help picking metrics, or just want to say hi? Come join our Discord.
Large variety of ready-to-use LLM eval metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine, covering all use cases:
Custom, All-Purpose Metrics:
Supports both end-to-end and component-level LLM evaluation.
Build your own custom metrics that are automatically integrated with DeepEval's ecosystem.
Generate both single and multi-turn synthetic datasets for evaluation.
Integrates seamlessly with ANY CI/CD environment.
Optimize prompts automatically based on evaluation results.
Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, and GSM8K.
DeepEval plugs into any LLM framework, including OpenAI Agents, LangChain, CrewAI, and more. To scale evals across your team, or to let anyone run them without writing code, Confident AI gives you a native platform integration.
Confident AI is an all-in-one platform that integrates natively with DeepEval.
Let's pretend your LLM application is a RAG based customer support chatbot; here's how DeepEval can help test what you've built.
DeepEval works with Python 3.9+.
pip install -U deepeval
Using the DeepEval platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.
To log in, run:
deepeval login
Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).
Create a test file:
touch test_chatbot.py
Open test_chatbot.py and write your first test case to run an end-to-end evaluation using DeepEval, which treats your LLM app as a black-box:
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])
Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model; for more details, visit this part of our docs):
export OPENAI_API_KEY="..."
And finally, run test_chatbot.py in the CLI:
deepeval test run test_chatbot.py
Congratulations! Your test case should have passed. Let's break down what happened.
The variable input mimics a user input, and actual_output is a placeholder for what your application is supposed to output based on this input.
The variable expected_output represents the ideal answer for a given input, and GEval is a research-backed metric provided by deepeval for you to evaluate your LLM outputs on any custom criteria with human-like accuracy.
In this example, the metric criteria is the correctness of the actual_output based on the provided expected_output.
The threshold=0.5 setting ultimately determines whether your test has passed or not.
Read our documentation for more information!
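The role of the threshold described above can be sketched in plain Python. This is only an illustration, not DeepEval's implementation: `MetricResult`, `passes`, and `all_metrics_pass` are hypothetical names, and the sketch assumes that a metric produces a score between 0 and 1 and that a score equal to the threshold counts as a pass.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for what a metric reports after scoring."""
    name: str
    score: float      # assumed to lie in [0, 1]
    threshold: float  # minimum score required to pass


def passes(result: MetricResult) -> bool:
    # Assumption: a score exactly at the threshold is treated as a pass.
    return result.score >= result.threshold


def all_metrics_pass(results: list[MetricResult]) -> bool:
    # Assumption: a test case passes only if every metric applied to it passes.
    return all(passes(r) for r in results)
```

Under these assumptions, raising the threshold makes a test stricter without changing how the score itself is computed.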
Use the @observe decorator to trace components (LLM calls, retrievers, tool calls, agents) and apply metrics at the component level, with no need to rewrite your codebase:
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import GEval

correctness = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

@observe(metrics=[correctness])
def inner_component():
    update_current_span(test_case=LLMTestCase(input="...", actual_output="..."))
    return "result"

@observe()
def llm_app(input: str):
    return inner_component()

dataset = EvaluationDataset(goldens=[Golden(input="Hi!")])
for golden in dataset.evals_iterator():
    llm_app(golden.input)
Learn more about component-level evaluations here.
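To see why a decorator can trace nested components without changes to their bodies, here is a minimal plain-Python sketch. It is not how DeepEval's @observe is implemented: `trace_component` and `SPANS` are hypothetical names, and the sketch simply records each call in an in-memory list.

```python
import functools
from typing import Any, Callable

# Hypothetical in-memory trace: one dict per observed call, in completion order.
SPANS: list[dict[str, Any]] = []


def trace_component(name: str) -> Callable:
    """Illustrative decorator that records each call to the wrapped function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            output = func(*args, **kwargs)
            # Record the span after the call returns, so inner components
            # are appended before the outer component that called them.
            SPANS.append({"component": name, "args": args, "output": output})
            return output
        return wrapper
    return decorator


@trace_component("retriever")
def retrieve(query: str) -> list[str]:
    return [f"document about {query}"]


@trace_component("app")
def llm_app(query: str) -> str:
    docs = retrieve(query)
    return f"answer based on {len(docs)} document(s)"
```

Calling `llm_app("refunds")` leaves two entries in `SPANS`, one per component, which is the kind of per-component record a metric could then be applied to.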
Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])
DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# All metrics also offer an explanation
print(answer_relevancy_metric.reason)
Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather like today?")])

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        # your_llm_app is a placeholder for your own application's entry point
        actual_output=your_llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

@pytest.mark.parametrize(
    "test_case",
    dataset.test_cases,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])
# Run this in the CLI; you can also add an optional -n flag to run tests in parallel
deepeval test run test_<filename>.py -n 4
Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:
from deepeval import evaluate
...
evaluate(dataset, [answer_relevancy_metric])
DeepEval auto-loads .env.local then .env from the current working directory at import time.
Precedence: process env -> .env.local -> .env.
Opt out with DEEPEVAL_DISABLE_DOTENV=1.
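The precedence rule above can be sketched as a plain dictionary merge. This is an illustration under stated assumptions rather than DeepEval's loader: `resolve_env` is a hypothetical helper, the three sources are passed in as already-parsed mappings, and the opt-out is assumed to apply only when the variable is set to exactly "1" in the process environment.

```python
from typing import Mapping


def resolve_env(
    process_env: Mapping[str, str],
    env_local: Mapping[str, str],
    env: Mapping[str, str],
) -> dict[str, str]:
    """Merge three sources with precedence: process env -> .env.local -> .env."""
    # Assumption: the opt-out flag skips both dotenv files entirely.
    if process_env.get("DEEPEVAL_DISABLE_DOTENV") == "1":
        return dict(process_env)

    merged: dict[str, str] = {}
    # Apply the lowest-precedence source first so later updates override it.
    merged.update(env)
    merged.update(env_local)
    merged.update(process_env)
    return merged
```

For example, a key defined in all three sources resolves to the process environment's value, while a key defined only in .env still comes through.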
cp .env.example .env.local
# then edit .env.local (ignored by git)
Confident AI is an all-in-one platform to manage datasets, trace LLM applications, and run evaluations in production. Log in from the CLI to get started:
deepeval login
Then run your tests as usual; results are automatically synced to the platform:
deepeval test run test_chatbot.py
Prefer to stay in your IDE? Use DeepEval via Confident AI's MCP server as the persistent layer to run evals, pull datasets, and inspect traces without leaving your editor.
Everything on Confident AI is available here.
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
Features:
Built by the founders of Confident AI. Contact jeffreyip@confident-ai.com for all enquiries.
DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.
| Version | Changes | Urgency | Date |
|---|---|---|---|
| v4.0.5 | ### New Feature - Add support for the `claude-opus-4-8` model preset, including multimodal and structured output capabilities with updated pricing metadata. ([#2698](https://github.com/confident-ai/deepeval/pull/2698)) ([Vamshi Adimalla](https://github.com/A-Vamshi)) | High | 5/28/2026 |
| v4.0.3 | ### New Features - Add a simulation graph API to control how user turns are generated during conversation simulation. `ConversationSimulator` now accepts `simulation_graph`, and `controller` is deprecated in favor of `stopping_controller` with a warning for legacy usage. ([#2678](https://github.com/confident-ai/deepeval/pull/2678)) ([Jeffrey Ip](https://github.com/penguine-ip)) - Add support for `retrieval_context` entries as `RetrievedContextData` with `context` and `source`, enabling conte | High | 5/21/2026 |
| v4.0.2 | DeepEval 4.0 introduces an agent-native evaluation workflow designed for coding agents, rapid debugging, and production AI systems. If you're vibe coding agents, on something like Claude Code, this release is for you. ## Eval Harness for Coding Agents Coding agents can now run eval-driven iterations directly in context. - Agents see metric failures, scores, and reasoning inline - Supports iterative patch → eval → retry workflows - Built for Cursor, Claude Code, Codex, and agentic d | High | 5/13/2026 |
| v3.9.5 | # Full support for agentic evals :) If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app. ## 1. Task Completion Evaluate whether an agent *actually completes the intended task*, not just whether its final output "looks correct." Captures: - Goal completion - Intermediate step correctness - Error recovery - Procedural accuracy Docs: https://deepeval.com/docs/metrics-task-completion --- ## 2. Tool Cor | Low | 12/1/2025 |
| v3.9.7 | # Full support for agentic evals :) If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app. ## 1. Task Completion Evaluate whether an agent *actually completes the intended task*, not just whether its final output "looks correct." Captures: - Goal completion - Intermediate step correctness - Error recovery - Procedural accuracy Docs: https://deepeval.com/docs/metrics-task-completion --- ## 2. Tool Cor | Low | 12/1/2025 |
| v3.9.9 | # Full support for agentic evals :) If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app. ## 1. Task Completion Evaluate whether an agent *actually completes the intended task*, not just whether its final output "looks correct." Captures: - Goal completion - Intermediate step correctness - Error recovery - Procedural accuracy Docs: https://deepeval.com/docs/metrics-task-completion --- ## 2. Tool Cor | Low | 12/1/2025 |
| v3.7.2 | # Less Code to Load Data In and Out of DeepEval's Ecosystem :) If you're using any of the features below, you'll likely see a 50% reduction in code required, especially around ETL for formatting things in and out of DeepEval's ecosystem. This includes: ## Arena-GEval The first LLM-arena-as-a-Judge metric, now runs a blinded experiment and swaps positions randomly for a fair verdict on which LLM output is better. Docs: https://deepeval.com/docs/metrics-arena-g-eval ## You can | Low | 8/4/2025 |
| v3.2.6 | ### New Features DeepEval's 3.2.6 release focuses on single- vs multi-turn use cases in datasets! #### Support for Single-Turn and Multi-Turn Datasets - **Single-turn datasets**: Simple `input → output` pairs for one-off prompt testing. - **Multi-turn datasets**: Full conversation flows with alternating user/assistant turns. Perfect for simulating real chat interactions. DeepEval now **automatically detects** whether a dataset is single-turn or multi-turn based on structure and | Low | 7/15/2025 |
| v3.1.9 | # A Metric Like LLM Arena Is Here In DeepEval's latest release, we are introducing `ArenaGEval`, the first ever metric to compare test cases to choose the best performing one based on your custom criteria. It looks something like this: ```python from deepeval import evaluate from deepeval.test_case import ArenaTestCase, LLMTestCaseParams from deepeval.metrics import ArenaGEval a_test_case = ArenaTestCase( contestants={ "GPT-4": LLMTestCase( input= | Low | 6/25/2025 |
| v3.1.5 | # In DeepEval's latest release, we are introducing multimodal G-Eval, plus 7+ multimodal metrics! Previously we had great support for single-turn, text evaluation in the form of `LLMTestCase`s, but now we're adding `MLLMTestCase`, which accepts images: ```python from deepeval.metrics import MultimodalGEval from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage from deepeval import evaluate m_test_case = MLLMTestCase( input=["Show me how to fold an airplane"], | Low | 6/19/2025 |
| v3.0.8 | # In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated. Previously we treated a conversation as a list of `LLMTestCase`s, which might not necessarily be the case. Now a conversational test case is made up of a list of `Turn`s instead, which follows OpenAI's standard `messages` format: ```python from deepeval.test_case import Turn turns = [Turn(role="user", content="...")] ``` Docs here: https://deepeval.com/docs/evaluation-test-cases#co | Low | 6/10/2025 |
| v3.0.6 | Added new loading bars for component-level evals, and `deepeval view` to see results on Confident AI. | Low | 6/7/2025 |
| v3.0 | # DeepEval v3.0: Evaluate Any LLM Workflow, Anywhere We're excited to introduce **DeepEval v3.0**, a major milestone that transforms how you evaluate LLM applications, from complex multi-step agents to simple prompt chains. This release brings **component-level granularity**, **production-ready observability**, and **simulation tools** to empower devs building modern AI systems. --- ## Component-Level Evaluation for Agentic Workflows You can now apply DeepEval metrics **to an | Low | 5/27/2025 |
| v2.9.0 | # Rubric Available for G-Eval https://www.deepeval.com/docs/metrics-llm-evals#rubric | Low | 5/15/2025 |
| v2.8.5 | In this release we've cleaned up some dependencies to separate out dev packages, as well as more tracing verbose logs for debugging. | Low | 5/6/2025 |
| v2.7.9 | # Breaking Changes > This release introduces breaking changes in preparation for DeepEval v3.0. > Please review carefully and adjust your code as needed. ## The `evaluate()` function now has "configs" - Previously the `evaluate()` function had 13+ arguments to control display, async behaviors, caching, etc. and it was growing out of control. We've now abstracted it into "configs" instead: ```python from deepeval.evaluate.configs import AsyncConfig from deepeval import evaluat | Low | 4/28/2025 |
| v2.7.6 | Cleaned up dependencies for upcoming 3.0 release: - Removed the automatic updates, it is now opt-in: https://www.deepeval.com/docs/miscellaneous - Removed instructor, double checked and it wasn't used anywhere - Removed LlamaIndex and moved it to optional, only needed for one module | Low | 4/23/2025 |
| v2.6.8 | The latest conversation simulator simulates fake user interactions to generate conversations on your behalf. These conversations can be used for evaluation right afterwards, and is similar to the goldens synthesizer. Docs here: https://docs.confident-ai.com/docs/evaluation-conversation-simulator | Low | 4/7/2025 |
| v2.6.5 | What's New ๐ฅ - Migrated default provider models to support Synthesizer - Default model providers are now in a different directory, those that are using `deepeval` < 2.5.6 might need to update imports | Low | 3/26/2025 |
| v2.5.9 | # What's New ๐ฅ - Custom prompt template overriding for all RAG metrics. This was introduced for folks using weaker models for evaluation, or just models in general that don't fit too well with OpenAI's prompt formatting, which is what most of `deepeval`'s metrics are built around. You can still use your favorite metrics and algorithms, but now with a custom template if required. Example here: https://docs.confident-ai.com/docs/metrics-answer-relevancy#customize-your-template - Fixes to our | Low | 3/18/2025 |
| v2.3.9 | ๐ฅณ Latest feature to allow users to inject the Faithfulness metric with their custom template. Most suited for custom LLMs where text data is highly formatted by data engineers and stored in databases according to different categories. | Low | 2/20/2025 |
| v2.2.7 | Here are the new features we're bringing to you in the latest release: ๐ฅ Releasing beta for *Deep, Acyclic, Graph*. A new deterministic way in deepeval to build decision trees for deterministic outputs for LLM evaluation: https://docs.confident-ai.com/docs/metrics-dag โ๏ธ Open-sourcing all LLM red teaming vulnerabilities: https://docs.confident-ai.com/docs/red-teaming-introduction ๐ช Fixes to synthetic dataset generation pipeline | Low | 1/31/2025 |
| v2.0 | Here are the new features we're bringing to you in the latest release: - Automated LLM red teaming, aka vulnerability and security safety scanning. You can now scan for 40+ vulnerabilities using 10+ SOTA attack enhancement techniques in <10 lines of Python code. - Synthetic dataset generation with a highly customizable synthetic data generation pipeline to cover literally any use case. - Multi-modal LLM evaluation, perfect for image editing or text-to-image use cases. - Conversat | Low | 12/2/2024 |
| v1.4.7 | In DeepEval 1.4.7, we're releasing: - LLM red teaming. Safety test your LLM application for 40+ vulnerabilities with 10+ attack enhancements, docs here: https://docs.confident-ai.com/docs/red-teaming-introduction - Improved synthetic data synthesizer, with much more functionality and customizability: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data - Conversational metrics: Dedicated metrics to evaluate LLM turns - Multi-modal metrics: Image editing and text to image evaluatio | Low | 10/31/2024 |
| v0.21.74 | In DeepEval v0.21.74, we have: - Agentic evaluation metric to evaluate tool calling correctness for LLM agents: https://docs.confident-ai.com/docs/metrics-tool-correctness - Pydantic Schemas to enforce JSON outputs for custom, smaller LLMs: https://docs.confident-ai.com/docs/guides-using-custom-llms - Asynchronous support for synthetic data generation: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data - Tracing integration for LlamaIndex and LangChain: https://docs.confid | Low | 7/30/2024 |
| v0.21.62 | In DeepEval v0.21.62, we: - added an option to print out intermediate steps during metric execution, which can be configured via the `verbose_mode` parameter: https://docs.confident-ai.com/docs/metrics-answer-relevancy#example - hyperparameters can be logged to Confident AI via the evaluate() function: https://docs.confident-ai.com/docs/getting-started#optimizing-hyperparameters - Synthetic data generation now gives more realistic results and is more customizable: https://docs.confident-ai.co | Low | 6/25/2024 |
| v0.21.15 | For deepeval's latest release v0.21.15, we release: - Synthetic Data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data - caching. If you're running 10k test cases and it fails at the 9999th test case, you no longer have to rerun the first 9999 test cases as you can just read from cache using the `-c` flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache - repeats. If you want to repeat each test cas | Low | 3/31/2024 |
| v0.20.85 | In deepeval v0.20.85: - asynchronous support throughout deepeval, and no longer using threads. Users can also call individual metrics asynchronously: https://docs.confident-ai.com/docs/metrics-introduction#measuring-metrics-in-async - improved the way in which you create a custom LLM for evaluation. You'll now have to implement an asynchronous generate() method to use deepeval's async features: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm - strict mode for all | Low | 3/9/2024 |
| v0.20.80 | In DeepEval's latest release, there is now: - conversational metrics: https://docs.confident-ai.com/docs/metrics-knowledge-retention. This metric evaluates whether your LLM is able to retain factual information presented to it throughout a conversation - synthetic data generation. Generate evaluation datasets from scratch: https://docs.confident-ai.com/docs/evaluation-datasets#generate-an-evaluation-dataset | Low | 3/4/2024 |
| v0.20.73 | For the newest release, deepeval is now stable for production use: - reduced package size - separated functionality of pytest vs deepeval test run command - included coverage score for summarization - fix contextual precision node error - released docs for better transparency into metrics calculation - allows users to configure RAGAS metrics for custom embedding models: https://docs.confident-ai.com/docs/metrics-ragas#example - fixed bugs with checking for package updates | Low | 2/25/2024 |
| v0.20.68 | For the latest release, DeepEval: - Supports Hugging Face users by providing real-time evaluations during fine-tuning: https://docs.confident-ai.com/docs/integrations-huggingface - Supports LlamaIndex users by allowing unit testing of LlamaIndex apps in CI/CD, and offering metrics in LlamaIndex's evaluators: https://docs.confident-ai.com/docs/integrations-llamaindex - Improvements to accuracy and reliability in Faithfulness and Answer Relevancy - Summarization Metric now offers an explanation - | Low | 2/14/2024 |
| v0.20.57 | - LLM-Evals (LLM evaluated metrics) now support all of langchain's chat models. - `LLMTestCase` now has `execution_time` and `cost`, useful for those looking to evaluate on these parameters - `minimum_score` is now `threshold` instead, meaning you can now create custom metrics that either have a "minimum" or "maximum" threshold - `LLMEvalMetric` is now `GEval` - Llamaindex Tracing integration: (https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#deepeval) | Low | 1/16/2024 |
| v0.20.43 | In this release: - Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, and Contextual Recall all offer reasoning for their given scores. - Azure OpenAI now supported via a single command in the CLI: https://docs.confident-ai.com/docs/metrics-introduction#using-azure-openai - New Summarization Metric that uses the QAG framework for its implementation: https://docs.confident-ai.com/docs/metrics-summarization - Pulling datasets from Confident AI now offers an intermediat | Low | 12/28/2023 |
| v0.20.35 | Lots of new features this release: 1. `JudgementalGPT` now allows for different languages - useful for our APAC and European friends 2. `RAGAS` metrics now support all OpenAI models - useful for those running into context length issues 3. `LLMEvalMetric` now returns a reasoning for its score 4. `deepeval test run` now has hooks that call on test run completion 5. `evaluate` now displays `retrieval_context` for RAG evaluation 6. `RAGAS` metric now displays metric breakdown for all its d | Low | 12/14/2023 |
| v0.20.23 | [Automatically integrated with Confident AI](https://app.confident-ai.com/) for continuous evaluation throughout the lifetime of your LLM (app): - log evaluation results and analyze metrics pass / fails - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results - debug evaluation results via LLM traces - manage evaluation test cases / datasets in one place - track events to identify live LLM responses in production - add prod | Low | 12/4/2023 |
| v0.20.27 | [Automatically integrated with Confident AI](https://app.confident-ai.com/) for continuous evaluation throughout the lifetime of your LLM (app): - log evaluation results and analyze metrics pass / fails - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results - debug evaluation results via LLM traces - manage evaluation test cases / datasets in one place - track events to identify live LLM responses in production - add prod | Low | 11/22/2023 |
| v0.20.19 | Mid-week bug fixes release with an extra feature: - run_test now works - new function `evaluate`, evaluates a list of test cases (dataset) on metrics you define, all without having to go through the CLI. More info here: https://docs.confident-ai.com/docs/evaluation-datasets#evaluate-your-dataset-without-pytest | Low | 11/16/2023 |
| v0.20.18 | In this release, deepeval has added support for: - JudgementalGPT, a dedicated LLM app developed by Confident AI to perform evaluations more robustly and accurately. JudgementalGPT provides a score and a reason for the score. - Parallel testing: execute test cases in parallel and speed up evaluation up to 100x. | Low | 11/14/2023 |
| v0.20.17 | Release v0.20.17 | Low | 11/13/2023 |
| v0.20.16 | Release v0.20.16 | Low | 11/7/2023 |
| v0.20.15 | Release v0.20.15 | Low | 11/6/2023 |
| v0.20.14 | Release v0.20.14 | Low | 11/5/2023 |
| v0.20.13 | Release v0.20.13 | Low | 11/5/2023 |
| v0.20.12 | Release v0.20.12 | Low | 10/23/2023 |
| v0.20.11 | Release v0.20.11 | Low | 10/20/2023 |
| v0.20.10 | Release v0.20.10 | Low | 10/18/2023 |
| v0.20.6 | ## What's Changed * ensure telemetry hits your server by @ColabDog in https://github.com/confident-ai/deepeval/pull/202 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.20.5...v0.20.6 | Low | 10/12/2023 |
| v0.20.5 | ## What's Changed * firewall check for telemetry by @ColabDog in https://github.com/confident-ai/deepeval/pull/200 * hotfix telemetry setup by @ColabDog in https://github.com/confident-ai/deepeval/pull/201 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.20.3...v0.20.5 | Low | 10/12/2023 |
| v0.20.3 | ## What's Changed * clean quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/166 * Hotfix.readme by @penguine-ip in https://github.com/confident-ai/deepeval/pull/168 * Freeze typer v by @ColabDog in https://github.com/confident-ai/deepeval/pull/169 * update quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/170 * update sidebar by @ColabDog in https://github.com/confident-ai/deepeval/pull/171 * Fix sidebar by @ColabDog in https://github.com/confi | Low | 10/11/2023 |
| v0.20.0 | ## What's Changed * Rename HOW_TO_CONTRIBUTE.md to CONTRIBUTING.md by @penguine-ip in https://github.com/confident-ai/deepeval/pull/164 * add image similarity metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/162 * Feature/add image similarity by @ColabDog in https://github.com/confident-ai/deepeval/pull/165 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.19.0...v0.20.0 | Low | 10/2/2023 |
| v0.19.0 | ## What's Changed * add guardrails integration by @ColabDog in https://github.com/confident-ai/deepeval/pull/158 * add github workflow results by @ColabDog in https://github.com/confident-ai/deepeval/pull/159 * Feature/add llm eval by @ColabDog in https://github.com/confident-ai/deepeval/pull/161 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.18.0...v0.19.0 | Low | 10/1/2023 |
| v0.18.0 | ## What's Changed * Add new customer support example by @ColabDog in https://github.com/confident-ai/deepeval/pull/154 * Add example test case by @ColabDog in https://github.com/confident-ai/deepeval/pull/156 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.9...v0.18.0 | Low | 9/29/2023 |
| v0.17.9 | ## What's Changed * fix the API and version by @ColabDog in https://github.com/confident-ai/deepeval/pull/153 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.8...v0.17.9 | Low | 9/28/2023 |
| v0.17.8 | ## What's Changed * fix by @ColabDog in https://github.com/confident-ai/deepeval/pull/147 * add context to the API by @ColabDog in https://github.com/confident-ai/deepeval/pull/150 * Resolves #151 by @ColabDog in https://github.com/confident-ai/deepeval/pull/152 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.6...v0.17.8 | Low | 9/28/2023 |
| v0.17.6 | ## What's Changed * adding length metric including test and documentation by @j-space-b in https://github.com/confident-ai/deepeval/pull/139 * add koala by @ColabDog in https://github.com/confident-ai/deepeval/pull/141 * Feature/update file name by @ColabDog in https://github.com/confident-ai/deepeval/pull/145 * add switch CLI by @ColabDog in https://github.com/confident-ai/deepeval/pull/146 **Full Changelog**: https://github.com/confident-ai/deepeval/compare/v0.17.5...v0.17.6 | Low | 9/27/2023 |