freshcrate

Tag: #evaluation

19 packages • ⭐ 134,521 total stars

mlflow • v3.13.0 • 🏛️ Flagship • ⭐ 25,479

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin

langfuse • v3.178.0 • 🏛️ Flagship • ⭐ 25,291

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

promptfoo • 0.121.14 • 🏛️ Flagship • ⭐ 20,382

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and

opik • 2.0.56 • 🏛️ Flagship • ⭐ 18,965

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

WeKnora • v0.6.1 • 🏛️ Flagship • ⭐ 13,971

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.

ragas • v0.4.3 • 🌳 Mature • ⭐ 13,570

Supercharge Your LLM Application Evaluations 🚀

AutoRAG • v0.3.22 • 🌳 Mature • ⭐ 4,713

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

agenta • v0.100.9 • 🌳 Mature • ⭐ 4,045

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

langwatch • langwatch-3.4.1 • 🌳 Mature • ⭐ 3,206

The platform for LLM evaluations and AI agent testing

langsmith • v0.8.9 • 🌳 Mature • ⭐ 858

Client library to connect to the LangSmith Observability and Evaluation Platform.

Observal • v1.4.0 • 🌳 Mature • ⭐ 572

Observal is an AI agent registry with first-in-class observability and an eval framework.

OpenClawProBench • main@2026-05-19 • 🌿 Growing • ⭐ 453

OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

arag • v0.1.0 • 🌱 Seedling • ⭐ 252

A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.

evals • v0.2.1 • 🌿 Growing • ⭐ 106

A comprehensive evaluation framework for AI agents and LLM applications.

arthur-engine • 2.1.601 • 🌿 Growing • ⭐ 77

Make AI work for everyone: monitoring and governance for your AI/ML.

@poofnew/vibe-check • 0.1.1 • 🌱 Seedling • ⭐ 5

AI agent evaluation framework for Claude and beyond

agent-regression-testing • v0.1.15 • 🌱 Seedling

A standalone library for AI agent regression testing using LLM-as-judge evaluation

@wix/eval-assertions • 0.51.0 • 🌱 Seedling

Assertion framework for AI agent evaluations - supports skill invocation checks, build validation, and LLM-based judging