Search results for "eval"
Interface between LLMs and your data
MLflow is an open source platform for the complete machine learning lifecycle
Client library to connect to the LangSmith Observability and Evaluation Platform.
Vertex AI API client library
No description
Stop prompting. Start specifying.
Open Source AI Platform - AI Chat with advanced features that works with every LLM
ReLE evaluation: Chinese AI large-model capability evaluation (continuously updated). Currently covers 359 large models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFLYTEK Spark, and SenseTime senseChat, as well as step3.5-flash, kimi-k2.5, ernie4.5, Min
AI Observability & Evaluation
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
Agent ensembles to design, generate, and select the best code for every task.
An open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, m
SeekStorm: vector & lexical search - in-process library & multi-tenancy server, in Rust.
Memory that lasts and compounds. MentisDB gives agents durable memory so they do not just remember, they improve over time. It stores append-only thought chains plus a Git-like skills registry, lett
The platform for LLM evaluations and AI agent testing
A comprehensive evaluation framework for AI agents and LLM applications.
Make AI work for Everyone - Monitoring and governing for your AI/ML
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a c
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.
Code, Build and Evaluate agents - excellent Model and Skills/MCP/ACP Support
AI observability platform for production LLM and agent systems.
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin
The Mind Palace for AI Agents - Autonomous Cognitive OS with affect-tagged memory (valence engine), token-economic RL (surprisal gate + UBI), Hebbian learning, ACT-R spreading activation, Synapse Engi
Open-Source Evaluation & Testing library for LLM Agents
Evaluation and Tracking for LLM Experiments and AI Agents
From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Harness LLMs with Multi-Agent Programming
"RAG-Anything: All-in-One RAG Framework"
High-Performance Engine for Multi-Vector Search
A selective learning and memory substrate for agentic systems - typed, revisable, decayable memory with competence learning and trust-aware retrieval.
All-in-one AI framework for semantic search, LLM orchestration and language model workflows
Codingbuddy orchestrates 29 specialized AI agents to deliver code quality comparable to a team of human experts through a PLAN → ACT → EVAL workflow.
Sage Mode for F# development โ REPL with solution or project loading, Live Testing for FREE, Hot Reload, and session management.
LLM-powered knowledge base from your Claude Code, Codex CLI, Copilot, Cursor & Gemini sessions. Karpathy's LLM Wiki pattern - implemented and shipped.
Lightweight offline AI agent for local models. No cloud, no API keys - just your GPU.
Security and best-practices scanner for AI Plugins, covering Codex, Claude, Opencode, Gemini & more. Scores trust for plugins 0-100.
Cognithor - Agent OS: Local-first autonomous agent operating system. 16 LLM providers, 17 channels, 112+ MCP tools, 5-tier memory, A2A protocol, knowledge vault, voice, browser automation, Computer-us
eBPF-based GPU causal observability agent
PraisonAI - Hire a 24/7 AI Workforce. Stop writing boilerplate and start shipping autonomous agents that research, plan, code, and execute tasks. Deployed in 5 lines of code with built-in memory, R
Latitude is the open-source agent engineering platform
Universal AI Development Platform with MCP server integration, multi-provider support, and professional CLI. Build, test, and deploy AI applications with multiple AI providers.
423 plugins, 2,849 skills, 177 agents for Claude Code. Open-source marketplace at tonsofskills.com with the ccpi CLI package manager.
A text-based user interface (TUI) client for interacting with MCP servers using Ollama. Features include agent mode, multi-server, model switching, streaming responses, tool management, human-in-the-l
The Unofficial and Awesome Home Assistant MCP Server
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, tuning, analytics etc. Built-in image/audio generat
A Model Context Protocol (MCP) server that gives Claude direct control over Strudel.cc for AI-assisted music generation and live coding.
ARIS (Auto-Research-In-Sleep) - Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in - works wi
AI + Data, online. https://vespa.ai
OmniRoute is an AI gateway for multi-provider LLMs: an OpenAI-compatible endpoint with smart routing, load balancing, retries, and fallbacks. Add policies, rate limits, caching, and observability for
A modular MCP server that provides commonly used developer tools for AI coding agents
The Execution Security Layer for the Agentic Era. Providing deterministic "Sudo" governance and audit logs for autonomous AI agents.
Generate a map of your codebase to help AI Agents understand your architecture, coding conventions and patterns. Discoverable with Semantic Search
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
Brain-inspired knowledge graph: spreading activation, Hebbian learning, memory consolidation.
A secure, durable runtime to sandbox AI agent tasks. Run untrusted code in isolated WebAssembly environments.
The app framework built for AI coding agents. Own every line. Your AI already knows how to build on it.
The Best AI Agent Framework for Agent Collaboration.
🪨 why use many token when few token do trick - Claude Code skill that cuts 65% of tokens by talking like caveman
Open-source security platform for AI agents -- audits skills before install, monitors 24/7, shares threat intelligence across all users.
An MCP server for interacting with Sentry via LLMs.
Java AI application development framework (supports LLM-tool,skill; RAG; MCP; Agent-ReAct,Team-Agent). Compatible with java8 ~ java25. It can also be embedded in SpringBoot, jFinal, Vert.x, Quarkus, a
AI Agent Engineering Platform built on an Open Source TypeScript AI Agent Framework
A collection of Agent Skills standards and best practices for programming languages and frameworks that help our AI Agent follow best practices on frameworks and programming languages
The implementation for SIGIR 2026: Learning to Retrieve from Agent Trajectories.
trpc-agent-go is a powerful Go framework for building intelligent agent systems using large language models (LLMs) and tools.
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
Autonomous agent framework with structured memory, safety hooks, and loop management. Built by the agent that runs on it.
A single interface to use and evaluate different agent frameworks
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
Unified framework for building enterprise RAG pipelines with small, specialized models
My personal Claude Code and OpenAI Codex setup with battle-tested skills, commands, hooks, agents and MCP servers that I use daily.
Open-source multi-agent AI assistant powered by LangGraph, FastAPI & Next.js - 16+ agents, Human-in-the-Loop, MCP integration, voice TTS, RAG, 500+ metrics, 6 languages.
Security scanner for AI-generated ("vibe-coded") code. Runs SAST, DAST, and sandboxed exploit simulation across 15+ languages using 30+ tools. Catches what LLMs introduce before it ships - wit
Turing ES - Enterprise Search, Semantic Navigation, Chatbot using Search Engine and Generative AI.
Comprehensive guide to AI agent engineering: how 30+ frameworks actually work under the hood. Context rot, compaction, system prompt assembly, SOUL.md, agent loops, memory systems, tool sprawl, MCP,
A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.
AI-powered job search system built on Claude Code. 14 skill modes, Go dashboard, PDF generation, batch processing.
Anti-detection browser server for AI agents - REST API wrapping Camoufox engine with OpenClaw plugin support
A thin cython wrapper around llama.cpp, whisper.cpp and stable-diffusion.cpp
Official repository of the Seismic library.
RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker
JRVS AI Agent with JARCORE autonomous coding engine - RAG knowledge base, web scraping, calendar, code generation. Powered by whatever local AI you choose.
Supercharge Your LLM Application Evaluations
Observal is an AI agent registry with first-in-class observability and eval framework
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space
YAO = Yielding AI Outcomes. A lightweight but rigorous system for creating, evaluating, packaging, and governing reusable agent skills.
Make your OpenClaw agents better, cheaper, and faster.
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Claw-Eval is an evaluation harness for evaluating LLMs as agents. All tasks verified by humans.
TrustRAG: The RAG Framework within Reliable input, Trusted output
AgenticX is a unified, production-ready multi-agent platform - Python SDK + CLI (agx) + Studio server + Machi desktop app. Features Meta-Agent orchestration, 15+ LLM providers, MCP Hub, hierarchical m
MAGI: Markdown for Agent Guidance & Instruction - A next-generation markdown extension designed specifically for AI systems. MAGI enhances standard markdown with structured metadata, embedded AI instr
PageIndex: Document Index for Vectorless, Reasoning-based RAG
Official Repo of Moss
Markdown-first work-memory protocol for existing agents, with maintained knowledge, candidate notes, evals, and an example KB.
autonomous agent with access to a tool library
A SEC EDGAR MCP (Model Context Protocol) Server
Handle LLM output variance for ruby_llm - retry on malformed JSON or rule violations, escalate to a smarter model, measure variance on datasets, gate CI on regressions.
An opinionated list of awesome Pydantic-AI frameworks, libraries, software and resources.
Self-evolving cognitive memory and context engine for AI agents in Java. Empowering 24/7 proactive agents like OpenClaw with understanding and SOTA performance.
Prompt Driven Development Command Line Interface
The Self-Growing Karpathy LLM Wiki - grown by an AI agent yoyo from Karpathy's founding prompt
Curated list of chatgpt prompts from the top-rated GPTs in the GPTs Store. Prompt Engineering, prompt attack & prompt protect. Advanced Prompt Engineering papers.
An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of ta
Autonomous orchestration framework for Claude Code with MemPalace-inspired memory (4-layer stack, 818-token wake-up), parallel-first Agent Teams (6 teammates), Aristotle First Principles methodology,
A Model Context Protocol (MCP) server that provides advanced code analysis and reasoning capabilities powered by Google's Gemini AI
The LLM Evaluation Framework
3-tier agentic ChatOps (n8n + GPT-4o + Claude Code) implementing all 21 patterns from "Agentic Design Patterns" - solo operator managing 137 devices
An autonomous AI agent that runs your deep learning experiments 24/7 while you sleep. Zero-cost monitoring, Leader-Worker architecture, constant-size memory.
A multi-agent LLM system for detecting and resolving cognitive dissonance.
Hermes Agent rewritten in Rust: production-grade multi-platform AI agent runtime with gateway adapters, tool orchestration, MCP, memory plugins, and cost-safe autonomous loops.
Must-read papers on Repository-level Code Generation & Issue Resolution
Zero-dependency browser automation CLI. 70+ commands, 10 test assertions, smart commands (click/fill by text - no LLM needed). MCP server for AI agents with 500x fewer tokens. Extract, observe, script
AI agent security scanner. Detect vulnerabilities in agent configurations, MCP servers, and tool permissions. Available as CLI, GitHub Action, ECC plugin, and GitHub App integration.
Kubernetes for AI Agents. Self-hosted, production-grade runtime for orchestrating LLM swarms and autonomous agents. TypeScript-native.
2026, the year of the swarm Agent: a collection of AI Agent topics including swarm Agent, Agent team, AI coding, skill, memory, evolve, and agentic RL
The most comprehensive MCP server for Polymarket - 48 tools spanning direct trading, market discovery, smart money tracking, copy trading, backtesting, risk management, and portfolio optimization. Wor
Your AI coding toolkit, declared in Nix - Claude, Gemini, Copilot, 15+ MCP servers, one flake
Production-ready AI agent framework - semantic memory, multi-agent mesh, MCP server, intelligent routing, governance, and 67+ platform integrations.
Claude Code skills, architectural principles, and alternative approaches for AI-assisted development
Awesome list of AI-Driven Development.
A 27-chapter hands-on tutorial for building an autonomous AI agent from zero in Python. Agent loop, tool system, memory, skills, MCP, multi-platform gateway, and self-evolution - inspired by Herme
A goal-specification file for autonomous coding agents. Generalizes Karpathy's autoresearch to domains with constructed metrics.
NanoCoder Pro - Autonomous Coding Agent with Master-SubAgent Architecture
MCP Server for Simplenote integration with Claude Desktop
Your AI forgets everything between sessions. SAME fixes that. Local-first, no API keys, single binary.
LLM Context Benchmarks - A comprehensive benchmarking tool for testing LLMs with varying context sizes using Ollama. Features dual benchmark modes (API/CLI), automatic hardware detection (optimiz
A universal CLI for Weaviate, Milvus, Chroma, Qdrant, and other vector DBs to help view, list, create, delete, and search collections and documents in collections for development, test, and debugging
Production-grade TypeScript AI runtime focused on reliability, governance, and reproducible LLM systems. Multi-provider gateway, agents, RAG, workflows, policy engine, audit trails, and deterministic
kbot - the AI agent that dreams, learns, and evolves. 764+ tools, 35 agents, 20 providers. Music production, iPhone control, financial analysis, cyber threat intel. Always-on daemon. Runs offline. npm
One memory layer for every AI agent. Local-first, markdown source of truth, and CLI/HTTP/MCP native. Your agent forgot who you are. Again. Dory fixes that.
Ambient intelligence that sees what you see, hears what you hear, and acts on your behalf
No description
Claude Code plugin for AI-driven Smalltalk (Pharo) development
The production runtime for AI agents. Schema in, API out. Built on PydanticAI + FastAPI.
Lightweight hallucination detection framework for RAG applications
Claude Code plugin for Ruby, Rails, Grape, PostgreSQL, Redis, and Sidekiq development
Local AI anywhere, for everyone - LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
220+ Claude Code skills & agent plugins for Claude Code, Codex, Gemini CLI, Cursor, and 8 more coding agents - engineering, marketing, product, compliance, C-level advisory.
Route, manage, and analyze your LLM requests across multiple providers with a unified API interface
Local-first AI agent bootstrap: Playwright Browser MCP + ContextDB for Codex CLI, Claude Code, Gemini CLI, and OpenCode.
Memory-centric self-improving harness for AI agents. Six-phase cycle + Security by Absence. ADRs, JSON schemas, and a dependency-free Python reference.
Supercharge Claude Code with 11 AI agents, 36 commands & 15 skills - the claude-code plugin framework inspired by oh-my-zsh. 6-layer security hooks included. 5-min install.
Self-evolving AI agent framework with 5-layer safety gatekeeper. Agents observe failures, propose fixes, and safely apply them. Built on HKUDS/nanobot.
Autonomous overnight codebase improvement agent for Claude Code. Run it before bed, wake up to production-ready fixes.
Implement a PyTorch-like DL library in C++ from scratch, step by step
A deterministic development harness for Claude Code - MCP workflow engine, enforcement hooks, YAML workflows, and multi-agent consensus (Claude + Codex + Gemini)
Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features like agent, llm and tools tracing, debugging multi-agentic system, self-hosted dashboard and advanced anal
ZimaOS Blue - A Local-First Agent Runtime for Bold Builders. Out-of-the-Box, Open-Source, Universal, Vendor-Neutral
Broken RAG For The Broken Souls
The open framework for extensible & grounded AI agent orchestration.
Open-source autonomous AI assistant with 5-tier security, 62 tools, 14 LLM providers. Written in Rust. Single binary.
Self-evolving Claude Code wrapper - handles any computer work a human can do. 94+ skills, 14 agents, computer use, self-improvement.
Syllabus-aware RAG study assistant for university students. Answers strictly from your own notes & PDFs, unit-scoped retrieval, cross-encoder reranking, and a hallucination gate - built to help studen
GEON: Structure-first decoding via equivalence classes and field closure
Multi-LLM agent orchestration TUI - parallel Claude/Gemini/Codex sessions, 126 MCP tools
Claude Code skills collection - CCA study guides, Twitter research, MCP review, auto-iteration tools
Agent-ready telemetry SDK - enriches OpenTelemetry across Java, Go, Python, Node.js, and browser with structured context for AI-driven observability.
Open-source Cloudflare Browser Rendering proxy - 10 MCP tools for Claude Code (content, screenshot, PDF, markdown, scrape, JSON AI extraction, links, a11y, crawl)
AI-agent-friendly PyTorch research pipeline - one YAML config drives preflight, training, Optuna HPO, and real-time TUI monitoring
Self-hosted autonomous AI agent - 9-layer cascade, Docker sandbox, encrypted vault, review/build/control plane, 1407+ tests
AI agent evaluation framework for Claude and beyond
Complete Workspace Template for OpenClaw - Full agent lifecycle with unified memory system (Markdown + SQLite), self-evolution, RAG. Not for SubAgent/Skill use.
Lean Rust AI agent: 6MB binary, 7.9MB RAM. OpenClaw replacement. Telegram + Discord + GitHub auto-PR. Ollama/Anthropic support.
A standalone library for AI agent regression testing using LLM-as-judge evaluation
Enhance chatbot accuracy with a self-correcting RAG system that ingests documents, retrieves data, and evaluates responses in real-time.
An open-source SSPM tool written in Go
Build memory and retrieval infrastructure for ReasonKit, enhancing data management and access for your applications with ease and efficiency.
Autonomous, multilingual AI voice agent using ElevenLabs, LangGraph, and RAG for government services
Build semantic vector databases from code and docs to enable AI agents to understand and navigate your entire codebase effectively.
Reference Implementations for the RAG bootcamp
GAN-inspired multi-agent system that autonomously builds full-stack web apps from a single prompt using Claude AI agents
Fetch and summarize news articles locally using a Retrieval-Augmented Generation system powered by AI models for efficient information access.
Define and control AI agents in markdown with full prompt transparency, persistent memory, and integrated tools via the Claude Agent SDK.
A self-evolving AI Agent Team - agents that rewrite their own operating manual.
Efficient Retrieval Augmentation and Generation Framework
Command line tool and async library to perform basic file operations on local paths, Google Cloud Storage paths and Azure Blob Storage paths.
Looker REST API
Assertion framework for AI agent evaluations - supports skill invocation checks, build validation, and LLM-based judging
ChatFlow - AI-based chat flow framework, personalize your ChatGPT workflows and build the road to automation.
