Search results for "leaderboard"
Embeddings, Retrieval, and Reranking
The Context Optimization Layer for LLM Applications
Open Source AI Platform - AI Chat with advanced features that works with every LLM
🔬 Harness Vibe Research with Self-evolving AI Scientists
Letta is the platform for building stateful agents: AI with advanced memory that can learn and self-improve over time.
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Open Framework for AI Agents to play Red Alert through Reinforcement Learning
PolyCouncil is an open-source multi-model deliberation engine for LM Studio. It runs multiple LLMs in parallel, gathers their answers, scores each response using a shared rubric, and produces a final answer.
Benchmark for vector databases.
A coding agent optimized for smaller LLMs
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
Claw-Eval is an evaluation harness for evaluating LLMs as agents. All tasks verified by humans.
RAG (Retrieval-Augmented Generation) chatbot that provides answers based on contextual information extracted from a collection of Markdown files.
Autonomous coding agent with web research (Recon), adversarial plan debate, 5-tier cognitive memory, multi-model routing (Gemini + DeepSeek + Ollama), 24/7 loops, and $0 local mode. Apache 2.0.
