Tag: #evaluation
14 packages • ⭐ 130,490 total stars
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications.
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration (a minimal config sketch follows this list).
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
Supercharge Your LLM Application Evaluations 🚀
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
The platform for LLM evaluations and AI agent testing
Stop prompting. Start specifying.
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.
A comprehensive evaluation framework for AI agents and LLM applications.
Make AI work for everyone - monitoring and governance for your AI/ML systems.
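
The "simple declarative configs" in the promptfoo entry above are YAML files. Below is a minimal sketch following promptfoo's documented config shape; the prompt text, model IDs, and test values are illustrative placeholders, not taken from this listing.

```yaml
# promptfooconfig.yaml -- hypothetical minimal eval; names and IDs are illustrative
description: Compare two models on one test case
prompts:
  - "Answer briefly: {{question}}"
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
tests:
  - vars:
      question: What is the capital of France?
    assert:
      # Pass if the completion contains the expected string
      - type: contains
        value: Paris
```

Running `npx promptfoo@latest eval` against a config like this produces a side-by-side comparison of the listed providers on each test case.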
