
giskard-oss

๐Ÿข Open-Source Evaluation & Testing library for LLM Agents


README


Evals, Red Teaming and Test Generation for Agentic Systems

Modular, Lightweight, Dynamic and Async-first


Docs • Website • Community


Important

Giskard v3 is a fresh rewrite designed for dynamic, multi-turn testing of AI agents. This release drops heavy dependencies for better efficiency while introducing a more powerful AI vulnerability scanner and enhanced RAG evaluation capabilities. For now, the vulnerability scanner and RAG evaluation still rely on Giskard v2. Giskard v2 remains available but is no longer actively maintained. Follow progress → Read the v3 Announcement · Roadmap

Install

pip install giskard

Requires Python 3.12+.


Giskard is an open-source Python library for testing and evaluating agentic systems. The v3 architecture is a modular set of focused packages — each carrying only the dependencies it needs — built from scratch to wrap anything: an LLM, a black-box agent, or a multi-step pipeline.

Status | Package | Description
✅ Alpha | giskard-checks | Testing & evaluation — scenario API, built-in checks, LLM-as-judge
🚧 In progress | giskard-scan | Agent vulnerability scanner — red teaming, prompt injection, data leakage (successor of v2 Scan)
📋 Planned | giskard-rag | RAG evaluation & synthetic data generation (successor of v2 RAGET)

Giskard Checks — create and apply evals for testing agents

pip install giskard-checks

Giskard Checks is a lightweight library for creating evaluations (evals) that test LLM-based systems — from simple assertions to LLM-as-judge assessments. Unlike traditional unit tests, evals are designed for non-deterministic outputs, where the same input can produce different valid responses.
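To see why that distinction matters, here is a tiny plain-Python illustration (no Giskard API involved, just hypothetical answer strings) of how an exact-match assertion breaks on non-deterministic outputs while a fact-based check does not:

```python
# Two different but equally valid answers to "What is the capital of France?"
answers = ["Paris.", "The capital of France is Paris."]

# A traditional exact-match unit test accepts only one phrasing:
exact_match = [a == "Paris." for a in answers]
print(exact_match)  # [True, False] -- the second valid answer "fails"

# An eval-style check accepts any response containing the key fact:
contains_fact = ["Paris" in a for a in answers]
print(contains_fact)  # [True, True]
```

Giskard's built-in checks generalize this idea from substring matching up to semantic similarity and LLM-as-judge assessments.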

Use Giskard Checks to:

  • Catch regressions — verify your system still behaves correctly after changes
  • Validate RAG quality — check if answers are grounded in retrieved context
  • Enforce safety rules — ensure outputs conform to your content policies
  • Evaluate multi-turn agents — test full conversations, not just single exchanges

Built-in evals include string matching, comparisons, regex, semantic similarity, and LLM-as-judge checks (Groundedness, Conformity, LLMJudge).

Quickstart

from openai import OpenAI
from giskard.checks import Scenario, Groundedness

client = OpenAI()

def get_answer(inputs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": inputs}],
    )
    return response.choices[0].message.content

scenario = (
    Scenario("test_dynamic_output")
    .interact(
        inputs="What is the capital of France?",
        outputs=get_answer,
    )
    .check(
        Groundedness(
            name="answer is grounded",
            answer_key="trace.last.outputs",
            context="France is a country in Western Europe. Its capital is Paris.",
        )
    )
)

result = await scenario.run()
result.print_report()

The run() method is async: await it directly in a notebook, or wrap it with asyncio.run() in a script. See the full docs for Suites, LLMJudge, multi-turn scenarios, and more.
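The script pattern follows standard asyncio usage. A minimal sketch, with a stand-in coroutine in place of `scenario.run()` (the `run_scenario` name is illustrative, not part of the Giskard API):

```python
import asyncio

async def run_scenario() -> str:
    """Stand-in for `await scenario.run()`; replace with your own Scenario."""
    await asyncio.sleep(0)  # simulate async work (LLM calls, etc.)
    return "report"

# In a plain script there is no running event loop, so start one explicitly:
result = asyncio.run(run_scenario())
print(result)  # report
```

Calling asyncio.run() inside a notebook raises an error because a loop is already running there, which is why the two environments need different entry points.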

Looking for Giskard v2?

Giskard v2 included Scan (automatic vulnerability detection) and RAGET (RAG evaluation test set generation) for both ML models and LLM applications. These features are not yet available in v3.

pip install "giskard[llm]>2,<3"

Scan — automatically detect performance, bias & security issues

Wrap your model and run the scan:

import giskard
import pandas as pd

# Replace my_llm_chain with your actual LLM chain or model inference logic
def model_predict(df: pd.DataFrame):
    """The function takes a DataFrame and must return a list of outputs (one per row)."""
    return [my_llm_chain.run({"query": question}) for question in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="My LLM Application",
    description="A question answering assistant",
    feature_names=["question"],
)

scan_results = giskard.scan(giskard_model)
display(scan_results)  # `display` renders the report in Jupyter; in a script, save it with scan_results.to_html("scan_report.html")
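The contract for the wrapped function is one output per input row, in order. A minimal sketch of that shape using a deterministic stand-in (the `echo_predict` helper below is illustrative, not part of the Giskard API):

```python
import pandas as pd

def echo_predict(df: pd.DataFrame) -> list[str]:
    # Deterministic stand-in for an LLM chain: one output string per row, in order
    return [f"Answer to: {q}" for q in df["question"]]

df = pd.DataFrame({"question": ["What is the capital of France?", "Who wrote Hamlet?"]})
outputs = echo_predict(df)
print(len(outputs) == len(df))  # True: giskard.Model expects one output per row
```

Keeping the prediction function pure and row-aligned like this makes it easy to swap the stand-in for a real chain without changing the giskard.Model wrapper.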

Scan Example

RAGET — generate evaluation datasets for RAG applications

Automatically generate questions, reference answers, and context from your knowledge base:

import pandas as pd
from giskard.rag import generate_testset, KnowledgeBase

# Load your knowledge base documents
df = pd.read_csv("path/to/your/knowledge_base.csv")
knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    language='en',
    agent_description="A customer support chatbot for company X",
)

RAGET Example

Full v2 docs

👋 Community

We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.

Follow the progress and share feedback: v3 Announcement ยท Roadmap

🌟 Leave us a star: it helps the project get discovered by others and keeps us motivated to build awesome open-source tools! 🌟

โค๏ธ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsoring, you can get a sponsor badge, display your company in this readme, and get your bug reports prioritized. We also offer one-time sponsoring if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.

💚 Current sponsors

We thank the following companies, which sponsor our project with monthly donations:

Lunary

Lunary logo

Biolevate

Biolevate logo

Release History

giskard-checks/v1.0.2b1 · Urgency: High · 4/10/2026

What's Changed:

  • security(deps): Upgrade pygments to fix CVE-2026-4539 by @kevinmessiaen in https://github.com/Giskard-AI/giskard-oss/pull/2351
  • fix(checks): preserve generator and embedding model when running scenarios by @kevinmessiaen in https://github.com/Giskard-AI/giskard-oss/pull/2293
  • security: upgrade aiohttp to fix multiple CVEs by @kevinmessiaen in https://github.com/Giskard-AI/giskard-oss/pull/2356
  • Add JUnit XML export for SuiteResult by @Mapalo90 in https://github.com/Gisk…


Similar Packages

  • AI-Infra-Guard (v4.1.4): A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.
  • promptfoo (code-scan-action-0.1.5): Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and…
  • trulens (trulens-2.7.2): Evaluation and Tracking for LLM Experiments and AI Agents
  • local-rag-system (main@2026-04-21): 🤖 Build your own local Retrieval-Augmented Generation system for private, offline AI memory without ongoing costs or data privacy concerns.
  • opik (2.0.6): Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.