freshcrate

Tag: #llm-evaluation

11 packages • ⭐ 128,628 total stars

mlflow • v3.13.0 • 🏛️ Flagship • ⭐ 25,479

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin

langfuse • v3.178.0 • 🏛️ Flagship • ⭐ 25,291

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

promptfoo • 0.121.14 • 🏛️ Flagship • ⭐ 20,382

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and

opik • 2.0.56 • 🏛️ Flagship • ⭐ 18,965

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

chinese-llm-benchmark • v5.10 • 🏛️ Flagship • ⭐ 5,889

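ReLE Evaluation: a capability benchmark for Chinese AI large language models (continuously updated). It currently covers 359 large models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as step3.5-flash, kimi-k2.5, ernie4.5, Min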

giskard-oss • giskard-checks/v1.0.2b3 • 🏛️ Flagship • ⭐ 5,289

🐢 Open-Source Evaluation & Testing library for LLM Agents

AutoRAG • v0.3.22 • 🌳 Mature • ⭐ 4,713

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

agenta • v0.100.9 • 🌳 Mature • ⭐ 4,045

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

AI-Infra-Guard • v4.1.11 • 🌳 Mature • ⭐ 3,521

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.

GTA • v0.2.0 • 🌿 Growing • ⭐ 143

[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2