Search results for "evaluation"
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Stop prompting. Start specifying.
One-stop handbook for building, deploying, and understanding LLM agents with 60+ skeletons, tutorials, ecosystem guides, and evaluation tools.
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground. ๐๐ป Integrates with 50+ LLM Providers,
Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pr
A framework for building, orchestrating and deploying AI agents and multi-agent workflows with support for Python and .NET.
PraisonAI ๐ฆ โ Hire a 24/7 AI Workforce. Stop writing boilerplate and start shipping autonomous agents that research, plan, code, and execute tasks. Deployed in 5 lines of code with built-in memory, R
Secure, Fast, and Extensible Sandbox runtime for AI agents.
The agent engineering platform
RAPTOR (Robust AI-Powered Toolkit for Operational Robots) is an AI-native Content Insight Engine that transforms passive media storage into an intelligent knowledge platform through automated analysis
autonomous agent with access to a tool library
The implementation for SIGIR 2026: Learning to Retrieve from Agent Trajectories.
Group Evolving Agents: Open-Ended Self-Improvement via Experience Sharing
Make AI work for Everyone - Monitoring and governing for your AI/ML
Cognithor - Agent OS: Local-first autonomous agent operating system. 16 LLM providers, 17 channels, 112+ MCP tools, 5-tier memory, A2A protocol, knowledge vault, voice, browser automation, Computer-us
A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.
Python Deep Agent framework built on top of Pydantic-AI, designed to help you quickly build production-grade autonomous AI agents with planning, filesystem operations, subagent delegation, skills, and
A modular RAG (Retrieval-Augmented Generation) system with MCP Server architecture. Using Skill to make AI follow each step of the spec and complete the code 100% by AI.
A comprehensive evaluation framework for AI agents and LLM applications.
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks verified by humans.
Zero-friction LLM fine-tuning skill for Claude Code, Gemini CLI & any ACP agent. Unsloth on NVIDIA ยท TRL+MPS/MLX on Apple Silicon. Automates env setup, LoRA training (SFT, DPO, GRPO, vision), post-hoc
Agentic RAG R1 Framework via Reinforcement Learning
AgenticX is a unified, production-ready multi-agent platform โ Python SDK + CLI (agx) + Studio server + Machi desktop app. Features Meta-Agent orchestration, 15+ LLM providers, MCP Hub, hierarchical m
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
Memory library for building stateful agents
A sovereign cognitive architecture with IIT 4.0 integrated information, residual-stream affective steering (CAA), Global Workspace Theory, active inference, and 72 consciousness modules โ running loca
An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of ta
Automatically Update LLM-Agent Papers Daily using Github Actions (Update Every 12th hours)
Agent samples built using the Strands Agents SDK.
Build and run agents you can see, understand and trust.
The LLM Evaluation Framework
A curated list of products, benchmarks, and research papers on autonomous code agents. Beyond coding โ they're redefining how software changes the world.
๐ฅ An autonomous AI agent that runs your deep learning experiments 24/7 while you sleep. Zero-cost monitoring, Leader-Worker architecture, constant-size memory.
The Next-Gen Agent-Native Skill Recommendation Engine
AI-first security scanner with 76 analyzers, 9,600+ detection rules, and repo poisoning detection for AI/ML, LLM agents, and MCP servers. Scan any GitHub repo with: medusa scan --git user/repo
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
๐ข Open-Source Evaluation & Testing library for LLM Agents
MaverickMCP - Personal Stock Analysis MCP Server
Evaluation and Tracking for LLM Experiments and AI Agents
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin
A multi-agent LLM system for detecting and resolving cognitive dissonance.
Autonomous knowledge base plugin for Claude Code - captures reserch, ideas, and decisions into an interlinked wiki with reserch-on-miss, semantic search, and a Wikipedia-style web UI. Knowledge compou
Advanced AI Real Estate Assistant using RAG, LLMs, and Python. Features market analysis, property valuation, and intelligent search.
The Multi-Agent Custom Automation Engine Solution Accelerator is an AI-driven system that manages a group of AI agents to accomplish tasks based on user input. Powered by Microsoft Agent Framework, Az
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Cyber Pilot is a traceable delivery system for requirements, design, plans, and code.
A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines
An Excel AI agent that uses MCP tools to let LLMs read, edit, and automate Excel spreadsheets.
PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with ๐ฆ by the humans at https://kilo.ai
Exposes internet search tools for use by LLM-backed Assist in Home Assistant
JSON Agents - A universal JSON-native standard for describing AI agents, their capabilities, tools, runtimes, and governance in a portable, framework-agnostic format. Based on RFC 8259, JSON Schema 2
A single interface to use and evaluate different agent frameworks
๐ซ CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
KawaiiGPT โ Open-source LLM gateway accessing DeepSeek, Gemini, and Kimi-K2 through reverse-engineered Pollinations API with no API keys required, built-in prompt injection capabilities for security r
RAGElo is a set of tools that helps you selecting the best RAG-based LLM agents by using an Elo ranker
Route, manage, and analyze your LLM requests across multiple providers with a unified API interface
Supercharge Your LLM Application Evaluations ๐
LightAgent: Lightweight AI agent framework with memory, tools & tree-of-thought. Supports multi-agent collaboration, self-learning, and major LLMs (OpenAI/DeepSeek/Qwen). Open-source with MCP/SSE prot
AI-powered PRD generation for Claude Code with taskmaster integration
An easy-to-use framework for modular RAG
Desktop AI Assistant powered by GPT-5, GPT-4, o1, o3, Gemini, Claude, Ollama, DeepSeek, Perplexity, Grok, Bielik, chat, vision, voice, RAG, image and video generation, agents, tools, MCP, plugins, spe
Complete Workspace Template for OpenClaw - Full agent lifecycle with unified memory system (Markdown + SQLite), self-evolution, RAG. Not for SubAgent/Skill use.
Syllabus-aware RAG study assistant for university students. Answers strictly from your own notes & PDFs, unit-scoped retrieval, cross-encoder reranking, and a hallucination gate โ built to help studen
Your agent in your terminal, equipped with local tools: writes code, uses the terminal, browses the web. Make your own persistent autonomous agent on top!
Modular multi-agent orchestration framework powered by LangGraph and FastAPI.
Autonomous, multilingual AI voice agent using ElevenLabs, LangGraph, and RAG for government services
Lightweight hallucination detection framework for RAG applications
HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research
Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features like agent, llm and tools tracing, debugging multi-agentic system, self-hosted dashboard and advanced anal
FlexRAG: A RAG Framework for Information Retrieval and Generation.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
Medical-AI is a AI framework specifically for Medical Applications https://aibharata.github.io/medicalAI/
