Search results for "eval"
Interface between LLMs and your data
MLflow is an open source platform for the complete machine learning lifecycle
Client library to connect to the LangSmith Observability and Evaluation Platform.
Stop prompting. Start specifying.
Open Source AI Platform - AI Chat with advanced features that works with every LLM
AI Observability & Evaluation
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
An open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
A comprehensive evaluation framework for AI agents and LLM applications.
Make AI work for Everyone - Monitoring and governance for your AI/ML
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.
Code, Build and Evaluate agents - excellent Model and Skills/MCP/ACP Support
AI observability platform for production LLM and agent systems.
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin
🐢 Open-Source Evaluation & Testing library for LLM Agents
Evaluation and Tracking for LLM Experiments and AI Agents
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Harness LLMs with Multi-Agent Programming
"RAG-Anything: All-in-One RAG Framework"
High-Performance Engine for Multi-Vector Search
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
LLM-powered knowledge base from your Claude Code, Codex CLI, Copilot, Cursor & Gemini sessions. Karpathy's LLM Wiki pattern — implemented and shipped.
⚡ Lightweight offline AI agent for local models. No cloud, no API keys — just your GPU.
Security and best-practices scanner for AI Plugins, covering Codex, Claude, Opencode, Gemini & more. Scores trust for plugins 0-100.
Cognithor - Agent OS: Local-first autonomous agent operating system. 16 LLM providers, 17 channels, 112+ MCP tools, 5-tier memory, A2A protocol, knowledge vault, voice, browser automation, Computer-us
PraisonAI 🦞 — Hire a 24/7 AI Workforce. Stop writing boilerplate and start shipping autonomous agents that research, plan, code, and execute tasks. Deployed in 5 lines of code with built-in memory, R
423 plugins, 2,849 skills, 177 agents for Claude Code. Open-source marketplace at tonsofskills.com with the ccpi CLI package manager.
A text-based user interface (TUI) client for interacting with MCP servers using Ollama. Features include agent mode, multi-server, model switching, streaming responses, tool management, human-in-the-l
The Unofficial and Awesome Home Assistant MCP Server
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, tuning, analytics etc. Built-in image/audio generat
ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works wi
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
Brain-inspired knowledge graph: spreading activation, Hebbian learning, memory consolidation.
The Best AI Agent Framework for Agent Collaboration.
🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman
The implementation for SIGIR 2026: Learning to Retrieve from Agent Trajectories.
A single interface to use and evaluate different agent frameworks
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
Unified framework for building enterprise RAG pipelines with small, specialized models
My personal Claude Code and OpenAI Codex setup with battle-tested skills, commands, hooks, agents and MCP servers that I use daily.
Open-source multi-agent AI assistant powered by LangGraph, FastAPI & Next.js — 16+ agents, Human-in-the-Loop, MCP integration, voice TTS, RAG, 500+ metrics, 6 languages.
A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.
A thin cython wrapper around llama.cpp, whisper.cpp and stable-diffusion.cpp
RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker
JRVS AI Agent with JARCORE autonomous coding engine - RAG knowledge base, web scraping, calendar, code generation. Powered by whatever local AI you choose.
Supercharge Your LLM Application Evaluations 🚀
Observal is an AI agent registry with a first-in-class observability and eval framework
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
YAO = Yielding AI Outcomes. A lightweight but rigorous system for creating, evaluating, packaging, and governing reusable agent skills.
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Claw-Eval is an evaluation harness for evaluating LLMs as agents. All tasks verified by humans.
TrustRAG: The RAG Framework within Reliable input, Trusted output
AgenticX is a unified, production-ready multi-agent platform — Python SDK + CLI (agx) + Studio server + Machi desktop app. Features Meta-Agent orchestration, 15+ LLM providers, MCP Hub, hierarchical m
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
Markdown-first work-memory protocol for existing agents, with maintained knowledge, candidate notes, evals, and an example KB.
autonomous agent with access to a tool library
A SEC EDGAR MCP (Model Context Protocol) Server
Prompt Driven Development Command Line Interface
An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of ta
The LLM Evaluation Framework
3-tier agentic ChatOps (n8n + GPT-4o + Claude Code) implementing all 21 patterns from "Agentic Design Patterns" — solo operator managing 137 devices
🔥 An autonomous AI agent that runs your deep learning experiments 24/7 while you sleep. Zero-cost monitoring, Leader-Worker architecture, constant-size memory.
A multi-agent LLM system for detecting and resolving cognitive dissonance.
Zero-dependency browser automation CLI. 70+ commands, 10 test assertions, smart commands (click/fill by text — no LLM needed). MCP server for AI agents with 500x fewer tokens. Extract, observe, script
Claude Code skills, architectural principles, and alternative approaches for AI-assisted development
A 27-chapter hands-on tutorial for building an autonomous AI agent from zero in Python. Agent loop, tool system, memory, skills, MCP, multi-platform gateway, and self-evolution — inspired by Herme
NanoCoder Pro — Autonomous Coding Agent with Master-SubAgent Architecture
MCP Server for Simplenote integration with Claude Desktop
📊 LLM Context Benchmarks - A comprehensive benchmarking tool for testing LLMs with varying context sizes using Ollama. Features dual benchmark modes (API/CLI), automatic hardware detection (optimiz
Ambient intelligence that sees what you see, hears what you hear, and acts on your behalf
One memory layer for every AI agent. Local-first, markdown source of truth, and CLI/HTTP/MCP native. Your agent forgot who you are. Again. Dory fixes that.
The production runtime for AI agents. Schema in, API out. Built on PydanticAI + FastAPI.
Lightweight hallucination detection framework for RAG applications
Claude Code plugin for Ruby, Rails, Grape, PostgreSQL, Redis, and Sidekiq development
220+ Claude Code skills & agent plugins for Claude Code, Codex, Gemini CLI, Cursor, and 8 more coding agents — engineering, marketing, product, compliance, C-level advisory.
Route, manage, and analyze your LLM requests across multiple providers with a unified API interface
Self-evolving AI agent framework with 5-layer safety gatekeeper. Agents observe failures, propose fixes, and safely apply them. Built on HKUDS/nanobot.
Autonomous overnight codebase improvement agent for Claude Code. Run it before bed, wake up to production-ready fixes.
Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features like agent, LLM and tool tracing, debugging multi-agent systems, self-hosted dashboard and advanced anal
Broken RAG For The Broken Souls
The open framework for extensible & grounded AI agent orchestration.
Syllabus-aware RAG study assistant for university students. Answers strictly from your own notes & PDFs, unit-scoped retrieval, cross-encoder reranking, and a hallucination gate — built to help studen
GEON: Structure-first decoding via equivalence classes and field closure
Claude Code skills collection — CCA study guides, Twitter research, MCP review, auto-iteration tools
AI-agent-friendly PyTorch research pipeline — one YAML config drives preflight, training, Optuna HPO, and real-time TUI monitoring
Self-hosted autonomous AI agent — 9-layer cascade, Docker sandbox, encrypted vault, review/build/control plane, 1407+ tests
Complete Workspace Template for OpenClaw - Full agent lifecycle with unified memory system (Markdown + SQLite), self-evolution, RAG. Not for SubAgent/Skill use.
Autonomous, multilingual AI voice agent using ElevenLabs, LangGraph, and RAG for government services
GAN-inspired multi-agent system that autonomously builds full-stack web apps from a single prompt using Claude AI agents
Efficient Retrieval Augmentation and Generation Framework
Command line tool and async library to perform basic file operations on local paths, Google Cloud Storage paths and Azure Blob Storage paths.
Looker REST API
