Search results for "benchmark"
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
Faster Whisper transcription with CTranslate2
Embeddings, Retrieval, and Reranking
GraphQL Framework for Python
Client library to connect to the LangSmith Observability and Evaluation Platform.
Make AI work for Everyone - Monitoring and governance for your AI/ML
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
The Context Optimization Layer for LLM Applications
Open Source AI Platform - AI Chat with advanced features that works with every LLM
Cognithor - Agent OS: Local-first autonomous agent operating system. 16 LLM providers, 17 channels, 112+ MCP tools, 5-tier memory, A2A protocol, knowledge vault, voice, browser automation, Computer-us
PraisonAI 🦞 — Hire a 24/7 AI Workforce. Stop writing boilerplate and start shipping autonomous agents that research, plan, code, and execute tasks. Deployed in 5 lines of code with built-in memory, R
AI Observability & Evaluation
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
🔬 Harness Vibe Research with Self-evolving AI Scientists
The leading, most token-efficient MCP server for GitHub source code exploration via tree-sitter AST parsing
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
Self-improving agents through iteration
Open-source persistent memory for AI agent pipelines (LangGraph, CrewAI, AutoGen) and Claude. REST API + knowledge graph + autonomous consolidation.
Memori is agent-native memory infrastructure. A SQL-native, LLM-agnostic layer that turns agent execution and conversation into structured, persistent state for production systems.
Universal memory layer for AI Agents
ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works wi
Open-source sandboxes for code execution, browser use, and AI agents.
Agentic memory for CTI in Python — STIX knowledge graphs, threat-actor alias resolution, offline-first RAG, MCP server for Claude Code and LangChain agents
SmarterRouter: An intelligent LLM gateway and VRAM-aware router for Ollama, llama.cpp, and OpenAI. Features semantic caching, model profiling, and automatic failover for local AI labs.
AI-first security scanner with 76 analyzers, 9,600+ detection rules, and repo poisoning detection for AI/ML, LLM agents, and MCP servers. Scan any GitHub repo with: medusa scan --git user/repo
Vibe-Skills is an all-in-one AI skills package. It seamlessly integrates expert-level capabilities and context management into a general-purpose skills package, enabling any AI agent to instantly upgr
Brain-inspired knowledge graph: spreading activation, Hebbian learning, memory consolidation.
The memory system your AI agent deserves. 4-stage hybrid retrieval — Vector + BM25 + Knowledge Graph + Neural Reranker — in <150ms. Self-hosted, $0/query, built for agents that need to actually rememb
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
An AI Gateway, registry, and proxy that sits in front of any MCP, A2A, or REST/gRPC APIs, exposing a unified endpoint with centralized discovery, guardrails and management. Optimizes Agent & Tool call
Accelerating Long Context LLM Inference with Accuracy-Preserving Context Optimization in SGLang, vLLM, llama.cpp, OpenClaw, RAG, and Agentic AI.
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX bac
An open-source AI assistant framework with skills and agent architecture
AINL helps turn AI from "a smart conversation" into "a structured worker." It is designed for teams building AI workflows that need multiple steps, state and memory, tool use, repeatable execution, v
A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines
The implementation for SIGIR 2026: Learning to Retrieve from Agent Trajectories.
PowerMem: Your AI-Powered Long-Term Memory — Accurate, Agile, Affordable. Also friendly support for the OpenClaw Memory Plugin.
AI conversations that actually remember. Never re-explain your project to your AI again. Join our Discord: https://discord.gg/tyvKNccgqN
Benchmark for vector databases.
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
My personal Claude Code and OpenAI Codex setup with battle-tested skills, commands, hooks, agents and MCP servers that I use daily.
Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pr
Group Evolving Agents: Open-Ended Self-Improvement via Experience Sharing
A coding agent optimized for smaller LLMs
A curated list of products, benchmarks, and research papers on autonomous code agents. Beyond coding — they're redefining how software changes the world.
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
CASSIA: A Multi-Agent LLM-Based Single-Cell Cell Type Annotation Framework
Unleash Next-Level AI! 🚀 💻 Code Generation: DeepSeek r1 + Claude 3.7 Sonnet - Unparalleled Performance! 📝 Content Creation: DeepSeek r1 + Gemini 2.5 Pro - Superior Quality! 🔌 OpenAI-Compatible.
Framework for benchmarking vector search engines
🛡⚔️AI-Powered Penetration Testing Framework with automated vulnerability scanning, multi-agent system, and compliance reporting🛡⚔️
Describe it or draw it. Kiln makes it real. — 461 MCP tools for AI-agent-controlled 3D printing. OctoPrint, Moonraker, Bambu Lab, Prusa Link, and Elegoo.
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Open Framework for AI Agents to play Red Alert through Reinforcement Learning
AgenticX is a unified, production-ready multi-agent platform — Python SDK + CLI (agx) + Studio server + Machi desktop app. Features Meta-Agent orchestration, 15+ LLM providers, MCP Hub, hierarchical m
"DeepCode: Open Agentic Coding (Paper2Code & Text2Web & Text2Backend)"
Unified framework for building enterprise RAG pipelines with small, specialized models
A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.
Open-source, contract-driven data quality validation. Shift-left enforcement at the point of write — before data enters your pipeline.
Lightweight, embedded graph-based memory system for AI applications. Fast (<3ms recall), offline-first, with MCP server support for Claude and other AI tools.
Supercharge Your LLM Application Evaluations 🚀
Curated list of the best truly open-source AI projects, models, tools, and infrastructure.
Dragon Brain — persistent long-term memory for AI agents via MCP (Model Context Protocol). Knowledge graph (FalkorDB) + vector search (Qdrant) + CUDA GPU embeddings. Works with Claude, Gemini CLI, Cur
Memory that remembers the story not just the facts. Three layer sentence graph for AI agents -> Facts, Episodes, raw Sentences. One DB. Zero config.
📊 LLM Context Benchmarks - A comprehensive benchmarking tool for testing LLMs with varying context sizes using Ollama. Features dual benchmark modes (API/CLI), automatic hardware detection (optimiz
The Next-Gen Agent-Native Skill Recommendation Engine
YAO = Yielding AI Outcomes. A lightweight but rigorous system for creating, evaluating, packaging, and governing reusable agent skills.
Life sciences computational skills for scientific AI agents
MaverickMCP - Personal Stock Analysis MCP Server
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
A multi-agent LLM system for detecting and resolving cognitive dissonance.
RAG (Retrieval-augmented generation) ChatBot that provides answers based on contextual information extracted from a collection of Markdown files.
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
Self-hosted graph-based associative memory for personal AI agents. Spreading activation, emotional weighting, zero LLM cost.
Local-first Agentic Memory Layer for MCP Agents • 25 tools • Hybrid search (FTS5 + vector + MMR) • GDPR • 100% local
🤖 The most comprehensive directory of AI agent frameworks, platforms, tools, and resources - hundreds of curated entries covering open-source, no-code, enterprise, and autonomous solutions. NEW Boil
Synthadoc: An open-source LLM knowledge compilation engine that turns raw documents into structured, local-first wikis. A transparent, human-readable alternative to traditional RAG, which can be self-
The LLM Evaluation Framework
MoralStack is a governance and safety layer for LLM applications. It analyzes user requests before generation, evaluates risk and intent, and decides whether the AI should answer normally, answer safe
Open-Sable is a local-first autonomous agent framework with AGI-inspired cognitive subsystems (goals, memory, metacognition, tool use). It can run continuously on your machine, integrate with chat int
A tool that compiles messy natural language prompts into a structured intermediate representation (IR) and optionally sends them to LLMs like ChatGPT for cleaner, more reliable responses.
Autonomous VAPT platform. Give it a target (FQDN, IP, CIDR) — it hunts, it reports. Inspired by the Obsidian Order.
Autonomous AI agent that researches viral content, generates posts, publishes them, measures engagement — and rewrites its own strategy based on what worked. Self-learning loop powered by LangGraph +
Connect any LLM to OpenClaw — production-tested middleware for Qwen3-235B and beyond
Self-evolving AI agent framework with 5-layer safety gatekeeper. Agents observe failures, propose fixes, and safely apply them. Built on HKUDS/nanobot.
Computer Environments Elicit General Agentic Intelligence in LLMs
Automatically update LLM-agent papers daily using GitHub Actions (updates every 12 hours)
Local-first AI agent framework with GUI, memory, web search, personality constructs, speech i/o, tools, skills, CLI & Telegram features — fully self-hosted via Ollama.
KAG is a logical form-guided reasoning and retrieval framework based on OpenSPG engine and LLMs. It is used to build logical reasoning and factual Q&A solutions for professional domain knowledge base
FlexRAG: A RAG Framework for Information Retrieval and Generation.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
PromptManager is a desktop application for cataloguing, searching, and executing AI prompts, and much more.
An AI guardian that remembers, watches, and acts.
Open infrastructure/control plane for Unchained
Efficient Retrieval Augmentation and Generation Framework
HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research
