Browse: Testing
AI Observability & Evaluation
Autonomous orchestration framework for Claude Code with MemPalace-inspired memory (4-layer stack, 818-token wake-up), parallel-first Agent Teams (6 teammates), Aristotle First Principles methodology,
FSPEC: The Spec-Driven, Multi-Agent Coding Factory. It is infrastructure for the "Dark Factory"—the emerging model of fully autonomous software development where AI agents handle all implementation wh
#1 Terminal Benchmark 2.0 — AI that ships your tickets.
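ReLE Evaluation: a capability evaluation of Chinese AI large models (continuously updated). It currently covers 359 large models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as step3.5-flash, kimi-k2.5, ernie4.5, Min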
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
Framework for benchmarking vector search engines
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and
Autonomous AI agent that contributes to open source — discovers repos, analyzes code, generates fixes, and submits PRs
Autospec is an open-source AI agent that takes a web app URL and autonomously QAs it, and saves its passing specs as E2E test code
Fast Compiler for C# Expression Trees and the lightweight LightExpression alternative. Diagnostic and code generation tools for the expressions.
🐢 Open-Source Evaluation & Testing library for LLM Agents
Evaluation and Tracking for LLM Experiments and AI Agents
Mendix CLI tool, a headless way to work with Mendix projects. Enables Mendix projects for use with third-party agentic coding tools like Claude Code and Copilot. Includes a Starlark linter for quality v
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica
Lint your repo for AI agent compatibility.
Watchtower is a simple AI-powered penetration testing automation CLI tool that leverages LLMs and LangGraph to orchestrate agentic workflows that you can use to test your websites locally. Generate us
Unleash Next-Level AI! 🚀 💻 Code Generation: DeepSeek r1 + Claude 3.7 Sonnet - Unparalleled Performance! 📝 Content Creation: DeepSeek r1 + Gemini 2.5 Pro - Superior Quality! 🔌 OpenAI-Compatible.
Declarative framework for orchestrating multi-model LLM pipelines with context engineering and quality gates.
An AI-powered GitHub code review tool that uses LLMs to detect high-confidence, high-impact issues—such as security vulnerabilities, bugs, and maintainability concerns.
Benchmark for vector databases.
File-based autonomous agentic research swarm template (Planner/Worker/Judge) with contracts, workstreams, and deterministic quality gates.
🧠 Discover and evaluate advanced benchmark datasets for Large Language Model agents to enhance performance assessment in real-world tasks.
🧠 Qualify leads with an AI-driven system that understands intent, asks key questions, and structures quality leads without hardcoding processes.
Provide token-efficient, distilled QA docs for AI coding agents to generate accurate test code quickly and reduce token usage significantly
Autonomous overnight codebase improvement agent for Claude Code. Run it before bed, wake up to production-ready fixes.
🔍 Discover and utilize agentic iOS/watchOS audit skills and playbooks for consistent quality assurance in your applications.
🎶 Enhance audio quality with ComfyUI-AudioSR, a versatile tool for upscaling sounds to 48kHz for better clarity and listening experience.
🤖 Generate automated test cases for your GitHub repositories using AI, ensuring comprehensive coverage with seamless integration and multi-language support.
🛠 Remove watermarks from OpenAI Sora 2 videos using precise spectral analysis to keep video quality intact and watermark-free.
🌐 Optimize web projects with essential skills for performance, accessibility, and SEO, based on Google Lighthouse and Core Web Vitals guidelines.
🍌 Generate JSON prompts for ultra-photorealistic images of nano bananas and related subjects, ensuring reproducible and high-quality visual outputs.
✍️ Write effective AI prompts with this structured prompt engineering library and Claude Code skill, featuring 300+ curated examples for high-quality results.
Enhance prompts by injecting real project context to create clear, professional, and actionable instructions with quality and risk insights.
Analyze git code changes to generate structured review reports using flexible AI models and integrated workflows.
Provide a structured code refactoring process for OpenAI Codex with guardrails, decision gates, and parallelism awareness to simplify and improve code quality.
Generate production-ready Maestro YAML test flows for mobile and web apps with accurate selectors, project setup, CI/CD configurations, and test reports.
An automated, agentic exploratory testing tool that performs comprehensive QA testing on web applications, simulating human user interactions through various input methods (mouse, keyboard, TAB naviga
🎨 Enhance cinematic image quality with ComfyUI-None-upup. This AI engine offers nodes for clarity, brightness, and video processing to elevate your visuals.
Benchmark and compare LLM tool, configuration, and prompt setups using a shared case framework with automated scoring and telemetry.
Trust-Grade AI Development Framework for software development — Zero dependencies.
A self-evolving AI Agent Team — agents that rewrite their own operating manual.
Qodo-Cover: An AI-Powered Tool for Automated Test Generation and Code Coverage Enhancement! 💻🤖🧪🐞
Efficient Retrieval Augmentation and Generation Framework
PromptGPT is an open-source framework that enables users to automatically generate high-quality prompts with no installation, coding, or technical knowledge required. PromptGPT follows industry best
