Browse: Testing
Framework for benchmarking vector search engines
An AI-powered GitHub code review tool that uses LLMs to detect high-confidence, high-impact issues—such as security vulnerabilities, bugs, and maintainability concerns.
Mendix CLI tool, a headless way to work with Mendix projects. Enables Mendix projects for use with third-party agentic coding tools like Claude Code and Copilot. Includes a Starlark linter for quality v
📊 LLM Context Benchmarks - A comprehensive benchmarking tool for testing LLMs with varying context sizes using Ollama. Features dual benchmark modes (API/CLI), automatic hardware detection (optimiz
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and
AI Observability & Evaluation
#1 Terminal Benchmark 2.0 — AI that ships your tickets.
89 skills and 38 specialized agents that enforce proven engineering practices for AI-assisted development. TDD, systematic debugging, parallel code review, and 10-gate development cycles — as a Claude
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin
Observal is an AI agent registry with a first-in-class observability and evaluation framework
FSPEC: The Spec-Driven, Multi-Agent Coding Factory. It is infrastructure for the "Dark Factory"—the emerging model of fully autonomous software development where AI agents handle all implementation wh
A coding agent optimized for smaller LLMs
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
🐢 Open-Source Evaluation & Testing library for LLM Agents
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
Benchmark for vector databases.
Evaluation and Tracking for LLM Experiments and AI Agents
Autospec is an open-source AI agent that takes a web app URL and autonomously QAs it, and saves its passing specs as E2E test code
Riverbed Community Toolkit is a public toolkit for Riverbed Solutions engineering and integration
Handle LLM output variance for ruby_llm — retry on malformed JSON or rule violations, escalate to a smarter model, measure variance on datasets, gate CI on regressions.
Autonomous AI agent that contributes to open source — discovers repos, analyzes code, generates fixes, and submits PRs
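ReLE Evaluation: capability evaluation of Chinese AI large models (continuously updated). Currently covers 359 large models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as step3.5-flash, kimi-k2.5, ernie4.5, Min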
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
Autonomous orchestration framework for Claude Code with MemPalace-inspired memory (4-layer stack, 818-token wake-up), parallel-first Agent Teams (6 teammates), Aristotle First Principles methodology,
Fast Compiler for C# Expression Trees and the lightweight LightExpression alternative. Diagnostic and code generation tools for the expressions.
Agent-powered, professional-grade graphic design workbench that uses HTML/CSS/SVG as the design medium, supporting vector-quality output, editable elements, multi-layer PSD export, lossless text ren
Watchtower is a simple AI-powered penetration testing automation CLI tool that leverages LLMs and LangGraph to orchestrate agentic workflows that you can use to test your websites locally. Generate us
Unleash Next-Level AI! 🚀 💻 Code Generation: DeepSeek r1 + Claude 3.7 Sonnet - Unparalleled Performance! 📝 Content Creation: DeepSeek r1 + Gemini 2.5 Pro - Superior Quality! 🔌 OpenAI-Compatible.
Declarative framework for orchestrating multi-model LLM pipelines with context engineering and quality gates.
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
Lint your repo for AI agent compatibility.
A universal CLI for Weaviate, Milvus, Chroma, Qdrant, and other vector DBs to help view, list, create, delete, and search collections and documents in collections for development, test, and debugging
🖼️ Master advanced techniques for Google's Nano Banana Pro to create stunning, professional-quality images up to 4K resolution.
🎶 Enhance audio quality with ComfyUI-AudioSR, a versatile tool for upscaling sounds to 48kHz for better clarity and listening experience.
🧠 Discover and evaluate advanced benchmark datasets for Large Language Model agents to enhance performance assessment in real-world tasks.
🧠 Qualify leads with an AI-driven system that understands intent, asks key questions, and structures quality leads without hardcoding processes.
🤖 Generate automated test cases for your GitHub repositories using AI, ensuring comprehensive coverage with seamless integration and multi-language support.
Provide token-efficient, distilled QA docs for AI coding agents to generate accurate test code quickly and reduce token usage significantly
🔍 Discover and utilize agentic iOS/watchOS audit skills and playbooks for consistent quality assurance in your applications.
Enhance prompts by injecting real project context to create clear, professional, and actionable instructions with quality and risk insights.
Generate production-ready Maestro YAML test flows for mobile and web apps with accurate selectors, project setup, CI/CD configurations, and test reports.
An automated, agentic exploratory testing tool that performs comprehensive QA testing on web applications, simulating human user interactions through various input methods (mouse, keyboard, TAB naviga
🌐 Optimize web projects with essential skills for performance, accessibility, and SEO, based on Google Lighthouse and Core Web Vitals guidelines.
✍️ Write effective AI prompts with this structured prompt engineering library and Claude Code skill, featuring 300+ curated examples for high-quality results.
🎨 Enhance cinematic image quality with ComfyUI-None-upup. This AI engine offers nodes for clarity, brightness, and video processing to elevate your visuals.
Benchmark and compare LLM tool, configuration, and prompt setups using a shared case framework with automated scoring and telemetry.
🛠 Remove watermarks from OpenAI Sora 2 videos using precise spectral analysis to keep video quality intact and watermark-free.
Analyze git code changes to generate structured review reports using flexible AI models and integrated workflows.
🍌 Generate JSON prompts for ultra-photorealistic images of nano bananas and related subjects, ensuring reproducible and high-quality visual outputs.
Provide a structured code refactoring process for OpenAI Codex with guardrails, decision gates, and parallelism awareness to simplify and improve code quality.
AI engineering framework with quality gates, persistent memory, and multi-platform support. Works inside Claude Code, Cursor, Copilot, Codex, and Gemini.
File-based autonomous agentic research swarm template (Planner/Worker/Judge) with contracts, workstreams, and deterministic quality gates.
Autonomous overnight codebase improvement agent for Claude Code. Run it before bed, wake up to production-ready fixes.
Qodo-Cover: An AI-Powered Tool for Automated Test Generation and Code Coverage Enhancement! 💻🤖🧪🐞
Trust-Grade AI Development Framework for software development — Zero dependencies.
A self-evolving AI Agent Team — agents that rewrite their own operating manual.
Efficient Retrieval Augmentation and Generation Framework
PromptGPT is an open-source framework that enables users to automatically generate high-quality prompts with no installation, coding, or technical knowledge required. PromptGPT follows industry best
