freshcrate — #llm-evaluation

Home > #llm-evaluation

Tag: #llm-evaluation

11 packages • ⭐ 128,628 total stars

mlflowv3.14.0🏛️ Flagship⭐25,479

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin

agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation pythonby mlflow

langfusev3.224.0🏛️ Flagship⭐25,291

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation prompt-engineering typescriptby langfuse

promptfoo0.121.19🏛️ Flagship⭐20,382

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation typescriptby promptfoo

opik2.1.32🏛️ Flagship⭐18,965

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

evaluation hacktoberfest hacktoberfest2025 langchain llama-index llm llm-evaluation llm-observability pythonby comet-ml

deepevalv4.1.0🏛️ Flagship⭐14,911

The LLM Evaluation Framework

evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics pythonby confident-ai

chinese-llm-benchmarkv5.10🌳 Mature⭐5,889

ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括359个大模型，覆盖chatgpt、gpt-5.2、o4-mini、谷歌gemini-3-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3-max、qwen3.5-plus、百川、讯飞星火、商汤senseChat等商用模型，以及step3.5-flash、kimi-k2.5、ernie4.5、Min

agentic-ai artificial-intelligence llm-agent llm-evaluationby jeinlee1991

giskard-ossgiskard-scan/v1.0.0b3🏛️ Flagship⭐5,289

🐢 Open-Source Evaluation & Testing library for LLM Agents

agent-evaluation ai-red-team ai-security ai-testing fairness-ai llm llm-eval llm-evaluation pythonby Giskard-AI

AutoRAGv2.0.0🌳 Mature⭐4,713

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation pythonby Marker-Inc-Korea

agentav0.105.0🌳 Mature⭐4,045

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

agents evaluation llm-as-a-judge llm-evaluation llm-framework llm-monitoring llm-observability llm-platform prompt-engineering typescriptby Agenta-AI

AI-Infra-Guardv4.1.15🌳 Mature⭐3,521

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.

agent agent-security ai-infra ai-red-teaming ai-security llm llm-evaluation llm-jailbreak pythonby Tencent

GTAv0.2.0🌱 Seedling⭐143

[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2

llm-agent llm-evaluation pythonby open-compass

Tag: #llm-evaluation

Trending in #llm-evaluation