Search results for "evaluation"
AI Observability & Evaluation
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Stop prompting. Start specifying.
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.
One-stop handbook for building, deploying, and understanding LLM agents with 60+ skeletons, tutorials, ecosystem guides, and evaluation tools.
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground. ๐๐ป Integrates with 50+ LLM Providers,
Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pr
Local-first memory plugin for OpenClaw AI agents. LLM-powered extraction, plain markdown storage, hybrid search via QMD. Gives agents persistent long-term memory across conversations.
A framework for building, orchestrating and deploying AI agents and multi-agent workflows with support for Python and .NET.
PraisonAI ๐ฆ โ Hire a 24/7 AI Workforce. Stop writing boilerplate and start shipping autonomous agents that research, plan, code, and execute tasks. Deployed in 5 lines of code with built-in memory, R
Secure, Fast, and Extensible Sandbox runtime for AI agents.
Persistent memory for AI coding agents
AgentWard โ Built for all, hardened for OpenClaw.
Universal AI Development Platform with MCP server integration, multi-provider support, and professional CLI. Build, test, and deploy AI applications with multiple ai providers.
I'm going to build my own OpenClaw, with blackjack... and bun!
Memory that lasts and compounds. MentisDB gives agents durable memory so they do not just remember, they improve over time. It stores append-only thought chains plus a Git-like skills registry, lett
The agent engineering platform
A community-driven collection of RAG (Retrieval-Augmented Generation) frameworks, projects, and resources. Contribute and explore the evolving RAG ecosystem.
RAPTOR (Robust AI-Powered Toolkit for Operational Robots) is an AI-native Content Insight Engine that transforms passive media storage into an intelligent knowledge platform through automated analysis
autonomous agent with access to a tool library
The implementation for SIGIR 2026: Learning to Retrieve from Agent Trajectories.
Group Evolving Agents: Open-Ended Self-Improvement via Experience Sharing
ReLE่ฏๆต๏ผไธญๆAIๅคงๆจกๅ่ฝๅ่ฏๆต๏ผๆ็ปญๆดๆฐ๏ผ๏ผ็ฎๅๅทฒๅๆฌ359ไธชๅคงๆจกๅ๏ผ่ฆ็chatgptใgpt-5.2ใo4-miniใ่ฐทๆญgemini-3-proใClaude-4.6ใๆๅฟERNIE-X1.1ใERNIE-5.0ใqwen3-maxใqwen3.5-plusใ็พๅทใ่ฎฏ้ฃๆ็ซใๅๆฑคsenseChat็ญๅ็จๆจกๅ๏ผ ไปฅๅstep3.5-flashใkimi-k2.5ใernie4.5ใMin
Make AI work for Everyone - Monitoring and governing for your AI/ML
๐ชข Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. ๐YC W23
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and
Cognithor - Agent OS: Local-first autonomous agent operating system. 16 LLM providers, 17 channels, 112+ MCP tools, 5-tier memory, A2A protocol, knowledge vault, voice, browser automation, Computer-us
Plano is an AI-native proxy and data plane for agentic apps โ with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agents core logic.
A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.
Python Deep Agent framework built on top of Pydantic-AI, designed to help you quickly build production-grade autonomous AI agents with planning, filesystem operations, subagent delegation, skills, and
โพ๏ธ Private Agent Fleet with Spec Coding. Each agent gets their own GPU-accelerated desktop. Run Claude, Codex, Gemini and open models on a full private AI Stack โพ๏ธ
Autonomous Agents (LLMs) research papers. Updated Daily.
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
AI-powered job search system built on Claude Code. 14 skill modes, Go dashboard, PDF generation, batch processing.
๐ฅ Comprehensive survey on Context Engineering: from prompt engineering to production-grade AI systems. hundreds of papers, frameworks, and implementation guides for LLMs and AI agents.
A modular RAG (Retrieval-Augmented Generation) system with MCP Server architecture. Using Skill to make AI follow each step of the spec and complete the code 100% by AI.
A comprehensive evaluation framework for AI agents and LLM applications.
The platform for LLM evaluations and AI agent testing
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks verified by humans.
Zero-friction LLM fine-tuning skill for Claude Code, Gemini CLI & any ACP agent. Unsloth on NVIDIA ยท TRL+MPS/MLX on Apple Silicon. Automates env setup, LoRA training (SFT, DPO, GRPO, vision), post-hoc
Latitude is the open-source agent engineering platform
Agentic RAG R1 Framework via Reinforcement Learning
AgenticX is a unified, production-ready multi-agent platform โ Python SDK + CLI (agx) + Studio server + Machi desktop app. Features Meta-Agent orchestration, 15+ LLM providers, MCP Hub, hierarchical m
The Mind Palace for AI Agents โ Autonomous Cognitive OS with affect-tagged memory (valence engine), token-economic RL (surprisal gate + UBI), Hebbian learning, ACT-R spreading activation, Synapse Engi
Autonomous CLI agent integrations for the Spring AI ecosystem with Claude Code, Gemini CLI, and secure sandbox execution
Learn to build AI agents with Strands framework. Covers LLM integration via Amazon Bedrock/Anthropic, AWS service connections, tool implementation with MCP/A2A protocols, and agent evaluation using La
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a c
The open agent control plane. Govern autonomous AI agents with pre-execution policy enforcement, approval gates, and audit trails. Works with LangChain, CrewAI, MCP, and any framework.
Open-source security platform for AI agents -- audits skills before install, monitors 24/7, shares threat intelligence across all users. | AI Agent ้ๆบๅฎๅ จๅนณๅฐ -- ๅฎ่ฃๅๅฏฉ่จ skillใ24/7 ๅณๆ็ฃๆงใ็คพ็พคๅ ฑไบซๅจ่ ๆ ๅ ฑใ
The world's first Autonomous Product Engine (APE): AI agents research your market, generate features, and ship code as PRs. Convoy mode, crash recovery, cost tracking, 80+ API endpoints. Self-hosted v
Self-evolving cognitive memory and context engine for AI agents in Java. Empowering 24/7 proactive agents like OpenClaw with understanding and SOTA performance.
A comprehensive list of papers for the definition of World Models and using World Models for General Video Generation, Embodied AI, and Autonomous Driving, including papers, codes, and related website
The Self-Growing Karpathy LLM Wiki โ grown by an AI agent yoyo from Karpathy's founding prompt
Curated list of chatgpt prompts from the top-rated GPTs in the GPTs Store. Prompt Engineering, prompt attack & prompt protect. Advanced Prompt Engineering papers.
Memory library for building stateful agents
A sovereign cognitive architecture with IIT 4.0 integrated information, residual-stream affective steering (CAA), Global Workspace Theory, active inference, and 72 consciousness modules โ running loca
An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of ta
Automatically Update LLM-Agent Papers Daily using Github Actions (Update Every 12th hours)
๐ค Kubernetes for AI Agents. Self-hosted, production-grade runtime for orchestrating LLM swarms and autonomous agents. TypeScript-native.
Agent samples built using the Strands Agents SDK.
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, m
Build and run agents you can see, understand and trust.
The LLM Evaluation Framework
A curated list of products, benchmarks, and research papers on autonomous code agents. Beyond coding โ they're redefining how software changes the world.
Sign integrity generic notation
Open-source meeting transcription API for Google Meet, Microsoft Teams & Zoom. Auto-join bots, real-time WebSocket transcripts, MCP server for AI agents. Self-host or use hosted SaaS.
๐ฅ An autonomous AI agent that runs your deep learning experiments 24/7 while you sleep. Zero-cost monitoring, Leader-Worker architecture, constant-size memory.
The Next-Gen Agent-Native Skill Recommendation Engine
AI-first security scanner with 76 analyzers, 9,600+ detection rules, and repo poisoning detection for AI/ML, LLM agents, and MCP servers. Scan any GitHub repo with: medusa scan --git user/repo
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
๐ข Open-Source Evaluation & Testing library for LLM Agents
MaverickMCP - Personal Stock Analysis MCP Server
Evaluation and Tracking for LLM Experiments and AI Agents
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
Curated systems, benchmarks, and papers etc. on memory for LLMs/MLLMs --- long-term context, retrieval, and reasoning.
Beads - A memory upgrade for your coding agent
Pragmatic AI Labs MCP Agent Toolkit - An MCP Server designed to make code with agents more deterministic
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin
Autonomous knowledge base plugin for Claude Code - captures reserch, ideas, and decisions into an interlinked wiki with reserch-on-miss, semantic search, and a Wikipedia-style web UI. Knowledge compou
A multi-agent LLM system for detecting and resolving cognitive dissonance.
Advanced AI Real Estate Assistant using RAG, LLMs, and Python. Features market analysis, property valuation, and intelligent search.
A secure, stable Rust alternative to openclaw/moltbot/clawdbot
A curated list of awesome works related to high dimensional structure/vector search & database
The Multi-Agent Custom Automation Engine Solution Accelerator is an AI-driven system that manages a group of AI agents to accomplish tasks based on user input. Powered by Microsoft Agent Framework, Az
AI Agent Engineering Platform built on an Open Source TypeScript AI Agent Framework
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Cyber Pilot is a traceable delivery system for requirements, design, plans, and code.
Must-read papers on Repository-level Code Generation & Issue Resolution ๐ฅ
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
DSPEx - Declarative Self-improving Elixir | A BEAM-Native AI Program Optimization Framework
A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines
Make your OpenClaw agents better, cheaper, and faster.
An Excel AI agent that uses MCP tools to let LLMs read, edit, and automate Excel spreadsheets.
From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
Open-source AI SDR template for B2B export. 10-stage sales pipeline, 10 cron jobs, 4-engine memory, multi-channel (WhatsApp+Telegram+Email). Built on OpenClaw.
trpc-agent-go is a powerful Go framework for building intelligent agent systems using large language models (LLMs) and tools.
PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with ๐ฆ by the humans at https://kilo.ai
kbot โ the AI agent that dreams, learns, and evolves. 764+ tools, 35 agents, 20 providers. Music production, iPhone control, financial analysis, cyber threat intel. Always-on daemon. Runs offline. npm
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
AI-powered meme coin trading bot for Solana and Base that automatically scans new tokens, detects honeypots, calculates win probability, executes trades. Built in Go with a multi-agent architecture, r
Exposes internet search tools for use by LLM-backed Assist in Home Assistant
JSON Agents - A universal JSON-native standard for describing AI agents, their capabilities, tools, runtimes, and governance in a portable, framework-agnostic format. Based on RFC 8259, JSON Schema 2
A single interface to use and evaluate different agent frameworks
๐ซ CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
Master Codex with this Framework file system + Prompt Generator consisting of 32 markdown files that will set such strict constraints and rules for Codex that its output is nearly flawless. Files for:
A selective learning and memory substrate for agentic systems โ typed, revisable, decayable memory with competence learning and trust-aware retrieval.
Mattermost Agents plugin supporting multiple LLMs
KawaiiGPT โ Open-source LLM gateway accessing DeepSeek, Gemini, and Kimi-K2 through reverse-engineered Pollinations API with no API keys required, built-in prompt injection capabilities for security r
BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation, SF
RAGElo is a set of tools that helps you selecting the best RAG-based LLM agents by using an Elo ranker
Route, manage, and analyze your LLM requests across multiple providers with a unified API interface
Supercharge Your LLM Application Evaluations ๐
LightAgent: Lightweight AI agent framework with memory, tools & tree-of-thought. Supports multi-agent collaboration, self-learning, and major LLMs (OpenAI/DeepSeek/Qwen). Open-source with MCP/SSE prot
AI-powered PRD generation for Claude Code with taskmaster integration
An easy-to-use framework for modular RAG
Multi-agent system for software development
Desktop AI Assistant powered by GPT-5, GPT-4, o1, o3, Gemini, Claude, Ollama, DeepSeek, Perplexity, Grok, Bielik, chat, vision, voice, RAG, image and video generation, agents, tools, MCP, plugins, spe
Autonomous local AI assistant in Go โ 40+ tools, 20+ LLM providers, multi-agent orchestration, self-improving
Complete Workspace Template for OpenClaw - Full agent lifecycle with unified memory system (Markdown + SQLite), self-evolution, RAG. Not for SubAgent/Skill use.
Syllabus-aware RAG study assistant for university students. Answers strictly from your own notes & PDFs, unit-scoped retrieval, cross-encoder reranking, and a hallucination gate โ built to help studen
The open-source framework that makes AI agents proactive, self-learning, and autonomous. Multi-project tracking, full logging pipeline, message discipline, and memory review system.
Your agent in your terminal, equipped with local tools: writes code, uses the terminal, browses the web. Make your own persistent autonomous agent on top!
Open-source autonomous AI assistant with 5-tier security, 62 tools, 14 LLM providers. Written in Rust. Single binary.
Build semantic vector databases from code and docs to enable AI agents to understand and navigate your entire codebase effectively.
Define and control AI agents in markdown with full prompt transparency, persistent memory, and integrated tools via the Claude Agent SDK.
Agent-ready telemetry SDK โ enriches OpenTelemetry across Java, Go, Python, Node.js, and browser with structured context for AI-driven observability.
Modular multi-agent orchestration framework powered by LangGraph and FastAPI.
Nix packages for AI coding agents and development tools. Automatically updated daily.
Autonomous, multilingual AI voice agent using ElevenLabs, LangGraph, and RAG for government services
Lightweight hallucination detection framework for RAG applications
Deterministic governance engine for AI agents. Enforce rules defined in .md governance files across AI systems.
TSUKUYOMI is an advanced modular intelligence framework designed for the democratization of Intelligence Analysis via systematic analysis, processing, and reporting across multiple domains. Built on a
HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research
Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features like agent, llm and tools tracing, debugging multi-agentic system, self-hosted dashboard and advanced anal
An open-source SSPM tool written in Go
FlexRAG: A RAG Framework for Information Retrieval and Generation.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
Robust, fast, scalable, and sandboxed open-source online code execution system for humans and AI.
PromptGPT is an opensource framework that enables users to automatically generate high-quality prompts with zero installations, coding necessary or technical knowledge. Promptgpt follows industry best
Medical-AI is a AI framework specifically for Medical Applications https://aibharata.github.io/medicalAI/
