freshcrate — Search

Search results for "evaluation"

76 results found (Python)

opik 📁2.0.6🌳 Mature⭐18,767

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

evaluation hacktoberfest hacktoberfest2025 langchain llama-index llm llm-evaluation llm-observability python by comet-ml Python

ouroboros 📁v0.28.8🌳 Mature⭐2,107

Stop prompting. Start specifying.

ai-agent claude-code codex-cli devtools evaluation llm mcp multi-agent prompt-engineering python by Q00 Python

LLM-Agents-Ecosystem-Handbook 📁0.0.0🌳 Mature⭐508

One-stop handbook for building, deploying, and understanding LLM agents with 60+ skeletons, tutorials, ecosystem guides, and evaluation tools.

ai ai-agent ai-agents fine-tuning finetuning-llms freamework llm llmops python by oxbshw Python

ai-agents-reality-check 📁0.0.0🌿 Growing⭐57

Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica

agent-architecture agent-benchmark agent-evaluation agent-performance agentic-ai agentic-workflow ai-benchmarking architectural-evaluation llm-agent python by Cre4T3Tiv3 Python

openlit 📁openlit-1.18.1🌿 Growing⭐2,358

Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground. 🚀💻 Integrates with 50+ LLM Providers,

ai-observability amd-gpu clickhouse distributed-tracing genai gpu-monitoring grafana langchain python by openlit Python

CodeGen 📁0.0.0🌳 Mature⭐773

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pr

python by facebookresearch Python

agent-framework 📁python-1.1.0🌳 Mature⭐9,325

A framework for building, orchestrating and deploying AI agents and multi-agent workflows with support for Python and .NET.

agent-framework agentic-ai agents ai dotnet multi-agent orchestration python by microsoft Python

PraisonAI 📁v4.6.25🌳 Mature⭐6,900

PraisonAI 🦞 — Hire a 24/7 AI Workforce. Stop writing boilerplate and start shipping autonomous agents that research, plan, code, and execute tasks. Deployed in 5 lines of code with built-in memory, R

agents ai ai-agent-framework ai-agent-sdk ai-agents ai-agents-framework ai-agents-sdk ai-framwork python by MervinPraison Python

OpenSandbox 📁docker/execd/v1.0.13🌳 Mature⭐9,925

Secure, Fast, and Extensible Sandbox runtime for AI agents.

ai ai-agent ai-infra kubernetes python sandbox by alibaba Python

langchain 📁langchain-core==1.3.0🌳 Mature⭐133,178

The agent engineering platform

agents ai ai-agents anthropic chatgpt deepagents enterprise framework python by langchain-ai Python

RAPTOR 📁0.0.0🌱 Seedling⭐13

RAPTOR (Robust AI-Powered Toolkit for Operational Robots) is an AI-native Content Insight Engine that transforms passive media storage into an intelligent knowledge platform through automated analysis

ai ai-automation ai-framework ai-orchestration artificial-intelligence audio-processing computer-vision content-analysis python by DHT-AI-Studio Python

tulip_agent 📁0.0.0🌱 Seedling⭐44

autonomous agent with access to a tool library

autonomous-agent large-language-model python tool-library by HRI-EU Python

LRAT 📁0.0.0🌱 Seedling⭐34

The implementation for SIGIR 2026: Learning to Retrieve from Agent Trajectories.

agent agentic llm python search by Yuqi-Zhou Python

GEA 📁0.0.0🌱 Seedling⭐23

Group Evolving Agents: Open-Ended Self-Improvement via Experience Sharing

code-generation group-evolving-agents open-ended-evolution open-endedness python research-agents self-evolving-agents by eric-ai-lab Python

arthur-engine 📁2.1.529🌿 Growing⭐75

Make AI work for Everyone - Monitoring and governing for your AI/ML

agentic benchmarking evaluation genai guardrails llm ml monitoring python by arthur-ai Python

cognithor 📁v0.92.2🌿 Growing⭐94

Cognithor - Agent OS: Local-first autonomous agent operating system. 16 LLM providers, 17 channels, 112+ MCP tools, 5-tier memory, A2A protocol, knowledge vault, voice, browser automation, Computer-us

agent-os ai-agent anthropic autonomous-agent discord-bot document-analysis gdpr-compliant gemini python by Alex8791-cyber Python

arag 📁v0.1.0🌿 Growing⭐247

A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.

agent agentic-ai agenticrag deepresearch evaluation graphrag llm llmagents python by Ayanami0730 Python

pydantic-deepagents 📁0.3.15🌿 Growing⭐648

Python Deep Agent framework built on top of Pydantic-AI, designed to help you quickly build production-grade autonomous AI agents with planning, filesystem operations, subagent delegation, skills, and

agent-framework anthropic artificial-intelligence business-intelligence chatgpt clawdbot enterprise framework python by vstorm-co Python

MODULAR-RAG-MCP-SERVER 📁0.0.0🌳 Mature⭐783

A modular RAG (Retrieval-Augmented Generation) system with MCP Server architecture. Using Skill to make AI follow each step of the spec and complete the code 100% by AI.

python by jerry-ai-dev Python

evals 📁v0.1.15🌿 Growing⭐103

A comprehensive evaluation framework for AI agents and LLM applications.

agentic agentic-ai ai evaluation machine-learning python strands-agents by strands-agents Python

AI-Infra-Guard 📁v4.1.4🌿 Growing⭐3,428

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.

agent agent-security ai-infra ai-red-teaming ai-security llm llm-evaluation llm-jailbreak python by Tencent Python

OpenClawProBench 📁main@2026-04-15🌿 Growing⭐340

OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

agent benchmark evaluation harness leaderboard llm openclaw python by suyoumo Python

claw-eval 📁main@2026-04-15🌿 Growing⭐394

Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.

agent harness llm openclaw python by claw-eval Python

unsloth-buddy 📁main@2026-04-15🌿 Growing⭐212

Zero-friction LLM fine-tuning skill for Claude Code, Gemini CLI & any ACP agent. Unsloth on NVIDIA · TRL+MPS/MLX on Apple Silicon. Automates env setup, LoRA training (SFT, DPO, GRPO, vision), post-hoc

apple-silicon claude-code dpo fine-tuning gaslamp grpo huggingface lora python by TYH-labs Python

Agentic-RAG-R1 📁0.0.0🌿 Growing⭐412

Agentic RAG R1 Framework via Reinforcement Learning

agentic grpo python rag rl by jiangxinke Python

AgenticX 📁v0.3.7🌿 Growing⭐105

AgenticX is a unified, production-ready multi-agent platform — Python SDK + CLI (agx) + Studio server + Machi desktop app. Features Meta-Agent orchestration, 15+ LLM providers, MCP Hub, hierarchical m

agent-framework agentic-workflows ai-agent ai-orchestration chatbot desktop-app electron fastapi python by DemonDamon Python

ISC-Bench 📁v0.0.5🌿 Growing⭐786

Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.

adversarial-attacks agent-safety ai-safety benchmark frontier-models jailbreak large-language-models llm-safety python by wuyoscar Python

honcho 📁main@2026-04-21🌿 Growing⭐2,030

Memory library for building stateful agents

agent-memory ai ai-agents ai-memory anthropic context-engineering continual-learning embeddings python by plastic-labs Python

aura 📁main@2026-04-21🌱 Seedling⭐47

A sovereign cognitive architecture with IIT 4.0 integrated information, residual-stream affective steering (CAA), Global Workspace Theory, active inference, and 72 consciousness modules — running loca

active-inference affective-computing apple-silicon artificial-consciousness autonomous-agent cognitive-architecture cognitive-science consciousness python by youngbryan97 Python

deer-flow 📁main@2026-04-21🌿 Growing⭐60,446

An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of ta

agent agentic agentic-framework agentic-workflow ai ai-agents deep-research harness python by bytedance Python

LLM-Agent-Paper-daily 📁main@2026-04-21🌱 Seedling⭐20

Automatically updates LLM-agent papers daily using GitHub Actions (updates every 12 hours)

llm llm-agent python by Lyz103 Python

samples 📁main@2026-04-20🌿 Growing⭐717

Agent samples built using the Strands Agents SDK.

agentic agentic-ai agents ai anthropic autonomous-agents bedrock genai python by strands-agents Python

agentscope 📁v1.0.19🌿 Growing⭐23,421

Build and run agents you can see, understand and trust.

agent chatbot large-language-models llm llm-agent mcp multi-agent multi-modal python by agentscope-ai Python

deepeval 📁v3.9.5🌳 Mature⭐14,701

The LLM Evaluation Framework

evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics python by confident-ai Python

awesome-code-agents 📁main@2026-04-20🌿 Growing⭐94

A curated list of products, benchmarks, and research papers on autonomous code agents. Beyond coding — they're redefining how software changes the world.

python by EuniAI Python

auto-deep-researcher-24x7 📁main@2026-04-19🌿 Growing⭐261

🔥 An autonomous AI agent that runs your deep learning experiments 24/7 while you sleep. Zero-cost monitoring, Leader-Worker architecture, constant-size memory.

ai-agent autonomous-agent claude-code deep-learning experiment-automation gpu hyperparameter-tuning llm-agent python by Xiangyue-Zhang Python

skills-vote 📁main@2026-04-19🌱 Seedling⭐31

The Next-Gen Agent-Native Skill Recommendation Engine

agent-skill agent-skills llm llm-agent python by MemTensor Python

medusa 📁v2026.5.5🌿 Growing⭐252

AI-first security scanner with 76 analyzers, 9,600+ detection rules, and repo poisoning detection for AI/ML, LLM agents, and MCP servers. Scan any GitHub repo with: medusa scan --git user/repo

agent-security ai-security code-analysis cve-detection devsecops llm-security mcp nextjs python by Pantheon-Security Python

crewAI 📁1.14.2🌿 Growing⭐48,611

Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.

agents ai ai-agents aiagentframework llms python by crewAIInc Python

giskard-oss 📁giskard-checks/v1.0.2b1🌱 Seedling⭐5,225

🐢 Open-Source Evaluation & Testing library for LLM Agents

agent-evaluation ai-red-team ai-security ai-testing fairness-ai llm llm-eval llm-evaluation python by Giskard-AI Python

maverick-mcp 📁main@2026-04-17🌿 Growing⭐479

MaverickMCP - Personal Stock Analysis MCP Server

anthropic artificial-intelligence claude equities fastmcp finance financial-analysis fintech python by wshobson Python

trulens 📁trulens-2.7.2🌱 Seedling⭐3,237

Evaluation and Tracking for LLM Experiments and AI Agents

agent-evaluation agentops ai-agents ai-monitoring ai-observability evals explainable-ml llm-eval python by truera Python

AReaL 📁v1.0.3🌿 Growing⭐5,017

Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.

agent llm llm-agent llm-reasoning machine-learning-systems mlsys python reinforcement-learning rl by inclusionAI Python

mlflow 📁v3.11.1🌱 Seedling⭐25,285

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllin

agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation python by mlflow Python

cognitive-dissonance-dspy 📁main@2026-04-14🌿 Growing⭐276

A multi-agent LLM system for detecting and resolving cognitive dissonance.

python by evalops Python

LLM-Wiki 📁main@2026-04-18🌱 Seedling⭐7

Autonomous knowledge base plugin for Claude Code - captures research, ideas, and decisions into an interlinked wiki with research-on-miss, semantic search, and a Wikipedia-style web UI. Knowledge compou

ai-tools autonomous-agent claude-code claude-code-plugin fastapi knowledge-base knowledge-management llm python by Oshayr Python

ai-real-estate-assistant 📁dev@2026-04-13🌿 Growing⭐159

Advanced AI Real Estate Assistant using RAG, LLMs, and Python. Features market analysis, property valuation, and intelligent search.

ai assistant chatbot docker fastapi llm nextjs proptech python vector-database by AleksNeStu Python

Multi-Agent-Custom-Automation-Engine-Solution-Accelerator 📁v4.1.1🌿 Growing⭐770

The Multi-Agent Custom Automation Engine Solution Accelerator is an AI-driven system that manages a group of AI agents to accomplish tasks based on user input. Powered by Microsoft Agent Framework, Az

ai-azd-templates azd-templates python by microsoft Python

AutoRAG 📁v0.3.22🌱 Seedling⭐4,693

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation python by Marker-Inc-Korea Python

cyber-pilot 📁v3.7.0-beta🌿 Growing⭐53

Cyber Pilot is a traceable delivery system for requirements, design, plans, and code.

agents ai architecture code-generation code-review code-validation codegen codegeneration python by cyberfabric Python

UltraRAG 📁v0.3.0.2🌿 Growing⭐5,480

A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines

deepseek demo easy embedding flask gpt huggingface-transformers llm python by OpenBMB Python

sv-excel-agent 📁0.0.0🌱 Seedling⭐179

An Excel AI agent that uses MCP tools to let LLMs read, edit, and automate Excel spreadsheets.

python by SylvianAI Python

skill 📁v1.2.1🌱 Seedling⭐978

PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai

python by pinchbench Python

llm_intents 📁1.7.1🌱 Seedling⭐111

Exposes internet search tools for use by LLM-backed Assist in Home Assistant

assist hacs hacs-integration hassio hassio-integration home-assistant home-assistant-integration home-assistant-voice python by skye-harris Python

Standard 📁0.0.0🌱 Seedling⭐18

JSON Agents - A universal JSON-native standard for describing AI agents, their capabilities, tools, runtimes, and governance in a portable, framework-agnostic format. Based on RFC 8259, JSON Schema 2

agent-governance agent-manifest agent-orchestration agent-specification ai-agents ai-framework interoperability json python by JSON-Agents Python

any-agent 📁1.18.0🌱 Seedling⭐1,141

A single interface to use and evaluate different agent frameworks

a2a agent-evaluation agents ai mcp python by mozilla-ai Python

camel 📁v0.2.90🌱 Seedling⭐16,654

🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org

agent ai-societies artificial-intelligence communicative-ai cooperative-ai deep-learning large-language-models multi-agent-systems python by camel-ai Python

KawaiiGPT 📁KawaiiGPT🌱 Seedling⭐831

KawaiiGPT — Open-source LLM gateway accessing DeepSeek, Gemini, and Kimi-K2 through reverse-engineered Pollinations API with no API keys required, built-in prompt injection capabilities for security r

ai-chatbot deepseek free-llm-access gemini kawaiigpt kimi-k2 linux-cli llm-jailbreak python by MarCmcbri1982 Python

RAGElo 📁0.4.0🌱 Seedling⭐128

RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker

python by zetaalphavector Python

deltallm 📁v0.1.20-rc2🌱 Seedling⭐3

Route, manage, and analyze your LLM requests across multiple providers with a unified API interface

ai-gateway ai-infrastructure api-gateway kubernetes llm-gateway llm-proxy llm-routing mcp model-context-protocol python by deltawi Python

ragas 📁v0.4.3🌱 Seedling⭐13,329

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops python by explodinggradients Python

LightAgent 📁v0.5.0🌱 Seedling⭐831

LightAgent: Lightweight AI agent framework with memory, tools & tree-of-thought. Supports multi-agent collaboration, self-learning, and major LLMs (OpenAI/DeepSeek/Qwen). Open-source with MCP/SSE prot

python by wanxingai Python

prd-taskmaster 📁v3.0.0🌱 Seedling⭐184

AI-powered PRD generation for Claude Code with taskmaster integration

ai-development claude-code claude-skills prd product-management product-requirements python requirements-engineering taskmaster by anombyte93 Python

PAI-RAG 📁v0.4.3🌱 Seedling⭐450

An easy-to-use framework for modular RAG

python by aigc-apps Python

py-gpt 📁v2.7.12🌱 Seedling⭐1,724

Desktop AI Assistant powered by GPT-5, GPT-4, o1, o3, Gemini, Claude, Ollama, DeepSeek, Perplexity, Grok, Bielik, chat, vision, voice, RAG, image and video generation, agents, tools, MCP, plugins, spe

ai ai-assistant artificial-intelligence autonomous-agent chatbot claude deepseek desktop-app python by szczyglis-dev Python

evo-agents 📁master@2026-04-19🌱 Seedling⭐3

Complete Workspace Template for OpenClaw - Full agent lifecycle with unified memory system (Markdown + SQLite), self-evolution, RAG. Not for SubAgent/Skill use.

agent ai-memory bge-m3 chinese-nlp fts5 local-ai markdown memory-system python rag by luoboask Python

uniAI 📁0.0.0🌱 Seedling⭐1

Syllabus-aware RAG study assistant for university students. Answers strictly from your own notes & PDFs, unit-scoped retrieval, cross-encoder reranking, and a hallucination gate — built to help studen

ai chromadb django genai information-retrieval llm local-llm ollama python vector-database by git-pratap-shrey Python

gptme 📁v0.31.0🌱 Seedling⭐4,266

Your agent in your terminal, equipped with local tools: writes code, uses the terminal, browses the web. Make your own persistent autonomous agent on top!

agent agents ai-agents ai-assistant anthropic chatbot chatgpt cli python by gptme Python

multi-agent-orchestration-framework 📁v0.1.0🌱 Seedling⭐26

Modular multi-agent orchestration framework powered by LangGraph and FastAPI.

agent ai-framework fastapi langchain langgraph llm memory multi-agent python by yx-fan Python

Government-Citizen-Services-Voice-Agent 📁main@2026-04-15🌱 Seedling⭐1

Autonomous, multilingual AI voice agent using ElevenLabs, LangGraph, and RAG for government services

conversational-ai elevenlabs fastapi govtech langgraph python rag voice-agent by Automaticare Python

LettuceDetect 📁0.1.8💤 Dormant⭐545

Lightweight hallucination detection framework for RAG applications

bert hallucination-detection hallucination-evaluation information-extraction nlp python pytorch token-classification by KRLabsOrg Python

HealthFlow 📁datasets💤 Dormant⭐40

HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research

ai-for-healthcare ai-for-science ehr llm llm-agent multi-agent python by yhzhu99 Python

RagaAI-Catalyst 📁v2.2.4💤 Dormant⭐16,130

Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features like agent, llm and tools tracing, debugging multi-agentic system, self-hosted dashboard and advanced anal

agentic-ai agentic-ai-development agentneo agents ai-agent-monitoring ai-application-debugging ai-evaluation-tools ai-performance-optimization python by raga-ai-hub Python

FlexRAG 📁0.3.0💤 Dormant⭐235

FlexRAG: A RAG Framework for Information Retrieval and Generation.

llms nlp python rag by ictnlp Python

Qwen-Agent 📁v0.0.26💤 Dormant⭐15,963

Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.

python by QwenLM Python

medicalAI 📁v1.2.9-rc⚰️ Archived⭐21

Medical-AI is an AI framework specifically for medical applications. https://aibharata.github.io/medicalAI/

ai-framework keras medical-applications medical-imaging pdf-report prediction python tensorflow tensorflow2 by aibharata Python