freshcrate
Skin:/
Home > Infrastructure > SmarterRouter

SmarterRouter

SmarterRouter: An intelligent LLM gateway and VRAM-aware router for Ollama, llama.cpp, and OpenAI. Features semantic caching, model profiling, and automatic failover for local AI labs.

Why this rank:Strong adoptionRelease freshnessHealthy release cadence

Description

SmarterRouter: An intelligent LLM gateway and VRAM-aware router for Ollama, llama.cpp, and OpenAI. Features semantic caching, model profiling, and automatic failover for local AI labs.

README

SmarterRouter

Intelligent, multi-backend AI router that sits between your application and various LLM providers. It profiles your models, aggregates benchmark data, and intelligently routes each query to the best available model for the task—all locally, all free. Key Benefits:

  • Zero manual model selection - AI automatically picks the right model for each prompt
  • All local, zero cost - No cloud API fees, works with your existing models
  • Production-ready - Monitoring, metrics, and error handling built-in
  • Drop-in replacement - Works with any OpenAI-compatible client

Why SmarterRouter? (vs other LLM proxies)

Feature SmarterRouter OptiLLM ClewdR LLM-API-Proxy Reader
Intelligent Routing ✅ Auto-selects best model ❌ Manual config ❌ Claude-only ❌ Manual routing ❌ URL-only
Multi-Backend Support ✅ Ollama + llama.cpp + OpenAI ❌ OpenAI-only ❌ Claude-only ✅ 100+ providers ❌ URL proxy
Local-first ✅ All local models ⚠️ Cloud proxy ⚠️ Cloud proxy ⚠️ Cloud proxy ⚠️ Cloud proxy
Zero Code Changes ✅ OpenAI-compatible ✅ OpenAI-compatible ✅ OpenAI-compatible ✅ OpenAI-compatible ✅ URL proxy
Production Features ✅ Monitoring + Metrics ✅ Metrics ✅ Dashboard ✅ Resilience ✅ Simple
Learning Capability ✅ Profiles models over time ❌ Static config ❌ Static config ❌ Static config ❌ Static config

Quick Start (5 minutes)

Get up and running with Docker in three commands:

# 1. Clone the repository
git clone https://github.com/peva3/SmarterRouter.git
cd SmarterRouter

# 2. Start with Docker Compose
docker-compose up -d

# 3. Verify it's running
curl http://localhost:11436/health

That's it! SmarterRouter will:

  • ✅ Discover all your Ollama models automatically
  • ✅ Profile each model for performance on your hardware (first run takes 30-60 min)
  • ✅ Start routing queries to the best model
  • New in v2.1.9: Optimized performance with async GPU I/O, batched queries, and prompt caching

Access the router at: http://localhost:11436

Interactive Setup Wizard (New in v2.1.5)

SmarterRouter now includes a built-in CLI for easy setup and management:

# Run interactive setup wizard
python -m smarterrouter setup

# Validate configuration and connections
python -m smarterrouter check

# Generate optimal .env file based on hardware detection
python -m smarterrouter generate-env

The setup wizard automatically:

  • 🔍 Detects your Ollama installation (local, Docker, or remote)
  • ⚙️ Identifies GPU hardware (NVIDIA, AMD, Intel, Apple Silicon)
  • 📊 Analyzes available models and suggests optimal settings
  • 📝 Generates a tailored .env configuration file

One-Line Docker Deployment (New in v2.1.5)

For the simplest deployment experience, use the included script:

# Make script executable (if needed)
chmod +x docker-run.sh

# Run with auto-detected GPU configuration
./docker-run.sh

# Customize deployment
./docker-run.sh --port 11436 --data-dir ./smarterrouter-data --env-file .env

The script automatically:

  • 🐳 Detects GPU vendor and configures appropriate Docker device mounts
  • 📁 Creates persistent data directory
  • 🔧 Generates optimal configuration for your hardware
  • 🚀 Starts the container with proper restart policy

For production deployments, continue using docker-compose.yml with GPU-specific configurations.

Connect to OpenWebUI

  1. Open OpenWebUI → SettingsConnectionsAdd Connection
  2. Configure:
    • Name: SmarterRouter
    • Base URL: http://localhost:11436/v1
    • API Key: (leave empty)
    • Model: smarterrouter/main
  3. Save and start chatting

SmarterRouter will automatically select the best model for each prompt!

Using External Providers (OpenAI, Anthropic, etc.)

SmarterRouter can also route to external cloud providers. Use the provider prefix in model names:

Available providers: openai/, anthropic/, google/, cohere/, mistral/

Example usage with external providers:

# 1. Set your API keys in .env
ROUTER_OPENAI_API_KEY=sk-...
ROUTER_ANTHROPIC_API_KEY=sk-ant-...

# 2. Enable external providers
ROUTER_EXTERNAL_PROVIDERS_ENABLED=true
ROUTER_EXTERNAL_PROVIDERS=openai,anthropic

# 3. Use models with provider prefix
# In OpenWebUI, select: openai/gpt-4o or anthropic/claude-3-opus

Benefits:

  • Same intelligent routing as local models
  • Benchmark data from 400+ models via provider.db
  • Can mix local Ollama and external providers

See External Provider Setup for complete instructions.


Latest Features (v2.1.6)

  • Live Model Hot‑Swap: Add or remove models without restarting the router. Automatic discovery, optional auto‑profiling, and cleanup of missing models.
  • Enhanced Cache Analytics: Detailed time‑series statistics, per‑model cache counts, and advanced monitoring via new admin endpoints.
  • Improved Performance: Optimized cache statistics collection and parallel model polling.

What Gets Automated?

  • Model discovery - Automatically finds all available models from your backend
  • Performance profiling - Tests each model with standardized prompts on your hardware
  • Smart routing - Analyzes prompts and picks the optimal model based on category and complexity
  • VRAM management - Auto-detects all GPUs (NVIDIA, AMD, Intel, Apple Silicon), monitors usage, and unloads models when needed
  • Fallback handling - Automatically retries with backup models if primary fails
  • Response caching - Caches identical prompts for instant responses
  • Continuous learning - Collects user feedback to improve routing decisions

Configuration Basics

All configuration is via the .env file. Copy the template and customize:

cp ENV_DEFAULT .env
nano .env  # edit as needed

Essential settings:

Variable Purpose Default
ROUTER_OLLAMA_URL Your backend URL http://localhost:11434
ROUTER_PROVIDER Backend type: ollama, llama.cpp, openai ollama
ROUTER_QUALITY_PREFERENCE 0.0 (speed) to 1.0 (quality) 0.5
ROUTER_PINNED_MODEL Keep a small model always loaded (optional) (none)
ROUTER_ADMIN_API_KEY Required for production to secure admin endpoints (none)

VRAM monitoring: Enabled by default with auto-detection across NVIDIA, AMD, Intel, and Apple Silicon GPUs. Multi-GPU systems are fully supported. See Configuration Reference for details.

⚠️ Production security: Always set ROUTER_ADMIN_API_KEY in production to protect admin endpoints.

For complete configuration reference, see docs/configuration.md.


Documentation

Getting Started:

In-Depth Guides:

Examples:

Want to see how the sausage is made?

  • DEEPDIVE.md - Architecture, design decisions, and implementation details for the technically curious

Other Files:


Need Help?


License

MIT License - see LICENSE for details.

Release History

VersionChangesUrgencyDate
2.2.5## [2.2.5] - 2026-04-17 ### New Features - **Dynamic Model Metadata Registry** (`router/model_metadata.py`): Created comprehensive model metadata system with automatic capability detection from Ollama API, TTL caching, and pattern-based fallbacks. Supports vision, tool_calling, embedding, MoE, and quantization detection. - **Gemma 4 Support**: Added Gemma 4 series (e2b, e4b, 26b, 31b) to modality detection heuristics for both vision and tool calling capabilities. - **MoE-Aware VRAM EstimatHigh4/18/2026
2.2.4## [2.2.4] - 2026-04-06 ### Security Fixes - **Weak MD5 hash in prompt analysis cache** (`router/router.py:1302`): Replaced `hashlib.md5()` with `hashlib.sha256()` for cryptographic security in cache key generation. - **Pickle deserialization vulnerability in Redis cache** (`router/cache_redis.py:97`): Replaced `pickle.loads()/pickle.dumps()` with `json.loads()/json.dumps()` to prevent potential remote code execution from untrusted cache data. - **Redis cache connection error handling** (`High4/6/2026
2.2.3## [2.2.3] - 2026-03-27 ### Security Fixes - **SQL injection anti-pattern in index creation** (`database.py:278-281`): Changed f-string interpolation in DDL helper to parameterized query using `text(...).bindparams(...)`. The index name was hardcoded so not directly exploitable, but the pattern could be copied to user-facing code. - **Timing attack on admin API key comparison** (`state.py:467`): Changed string `!=` comparison to `hmac.compare_digest()` to prevent timing side-channel attacksMedium3/28/2026
2.2.2## [2.2.2] - 2026-03-16 ### Bug Fixes - **Ollama backend multimodal transformation**: Fixed OpenAI-style multimodal message handling in Ollama backend to properly convert image_url content parts to Ollama's expected images field, stripping data:image/...;base64, prefixes so Ollama vision models can actually receive image data. This resolves the issue where image uploads appeared to route correctly but the image payload was not translated into the format Ollama expects. Medium3/23/2026
2.2.1## [2.2.1] - 2026-03-16 ### Highlights Added modality-aware routing to intelligently route requests based on input type (vision, tool-calling, text, embeddings). Enhanced changelog organization and documentation. ### New Features #### Modality-Aware Routing - **Modality detection module** (`router/modality.py`) - Automatic detection of request modalities from request shape: - Vision: Image URL content parts in messages - Tool Calling: Presence of tools in request - Text: DefaLow3/16/2026
2.2.0## [2.2.0] - 2026-03-16 ### Highlights - Major platform update with performance improvements, reliability hardening, expanded security controls, and large documentation/testing expansion. - Main application architecture refactored into focused modules (`router/state.py`, `router/middleware.py`, `router/lifecycle.py`, `router/api/*`) with `main.py` reduced to an app shell. ### Performance & Scalability - Added configurable response compression (`ROUTER_ENABLE_RESPONSE_COMPRESSION`, `ROUTLow3/16/2026
2.1.9## [2.1.9] - 2026-03-03 ### Performance Optimizations (Phase 2 - Quick Wins) #### Critical Performance Fixes 1. **Fixed blocking GPU I/O with async wrapper**: - Added `get_memory_info_async()` method to GPU backend protocol (router/gpu_backends/base.py:63-74) - Updated VRAM monitor to use async GPU queries (router/vram_monitor.py:219-225) - Eliminates event loop blocking during GPU memory queries (5s timeout per GPU) 2. **Implemented batched VRAM estimates**: - Added `gLow3/4/2026
2.1.8# [2.1.8] - 2026-03-03 ### Performance Optimizations #### Reduced Backend API Calls - **Model list caching**: Added 10-second TTL cache for `list_models()` calls, eliminating ~100-500ms latency per request (router/router.py:33-155, main.py:125-184) - **Router engine accepts pre-fetched models**: `select_model()` now accepts optional `available_models` parameter to avoid redundant backend calls (router/router.py:1064-1079) #### Lower Resource Consumption - **Reduced model polling freqLow3/3/2026
2.1.7## [2.1.7] - 2026-02-27 ### Critical Bug Fixes & Stability Improvements #### Concurrency & Race Condition Fixes - **Fixed race condition in `SemanticCache._get_embedding()`**: Rewrote embedding cache to eliminate double lock acquisition that could cause deadlocks (router/router.py:396-467) - **Fixed global cache race condition in `_get_all_profiles()`**: Added `asyncio.Lock()` and double-checked locking pattern to prevent concurrent cache corruption (router/router.py:1363-1384) - **FixeLow2/27/2026
2.1.6## [2.1.6] - 2026-02-27 ### Enhanced Cache Statistics & API #### Detailed Cache Analytics - **Time-series tracking**: Cache hits, misses, similarity hits, evictions, and embedding cache events tracked with timestamps - **Multi-dimensional metrics**: Per-model cache counts, access patterns, and eviction reasons - **Real-time analytics**: Cache hit rates, similarity hit rates, and adaptive threshold adjustments #### New Admin Endpoints - `GET /admin/cache/stats` - Detailed cache statiLow2/27/2026
2.1.5## [2.1.5] - 2026-02-26 ### Semantic Cache V2: Complete Four-Phase Implementation #### Persistent Disk Caching - **SQLite-based persistence**: Routing decisions, LLM responses, and embeddings now survive restarts via SQLite database - **Automatic load/save**: Cache data automatically loads on startup and saves new entries to disk - **Configurable TTL**: Persistent cache respects same TTL settings as in-memory cache (default 1 hour for routing/response, 24h for embeddings) - **AutomaticLow2/27/2026
2.1.4## [2.1.4] - 2026-02-25 ### Critical Bug Fixes and Reliability Improvements Fixed critical issues identified in comprehensive analysis: #### Database Safety & Performance - **Fixed Database Session Bug**: `get_session()` context manager no longer commits transactions automatically for read-only queries, preventing performance overhead and potential data corruption - **Fixed SQLite IN Clause DoS**: Added parameter chunking to avoid exceeding SQLite's 999 parameter limit in provider_db.Low2/25/2026
2.1.3## [2.1.3] - 2026-02-23 ### External Provider Support (provider.db + External APIs) Added support for external/cloud LLM providers (OpenAI, Anthropic, Google, etc.) via: - **provider.db**: Benchmark database with 400+ models for intelligent routing - **External API Integration**: Actually route requests to external providers #### External API Features **Supported Providers:** - OpenAI (openai/gpt-4, openai/gpt-4o, etc.) - Anthropic (anthropic/claude-3-opus, anthropic/claude-3-sonLow2/24/2026
2.1.2### Model Filtering Added optional model filtering via environment variables to control which models are discovered and available for routing. **New Settings:** - `ROUTER_MODEL_FILTER_INCLUDE` - Glob patterns to include (e.g., `gemma*,mistral*`) - `ROUTER_MODEL_FILTER_EXCLUDE` - Glob patterns to exclude (e.g., `*qwen*,*test*`) **Features:** - Case-insensitive matching for convenience - Glob patterns: `*` (any), `?` (single), `[seq]` (character class) - Exclude takes precedence overLow2/23/2026
2.1.1## [2.1.1] - 2026-02-21 ### Performance Optimizations This release focuses on significant performance improvements across the routing pipeline, database operations, and backend communication layers. #### Database Optimizations - **N+1 Query Fix - Feedback Aggregation**: Changed `_get_model_feedback_scores()` to use SQL `GROUP BY` aggregation instead of loading all feedback records into memory. Reduces memory from O(N) to O(1) and improves speed 10-100x for large datasets. - **BulkLow2/21/2026
2.1.0Biggest change here is that VRAM monitoring/management was added for Apple (unified memory on the M series), AMDGPU, and Intel. Along with that there was a big reworking of the documentation. If you want the details of the release check the CHANGELOG.md file. **Full Changelog**: https://github.com/peva3/SmarterRouter/compare/2.0.0...2.1.0Low2/20/2026
2.0.0I think this is in a good enough spot right now to be released, check the CHANGELOG.md to see the full scope of what's been done. Looking forward to hearing from user feedback for any bugs or feature requests, but for the feature set I was looking for; this has met the entirety of my scope. Low2/20/2026

Dependencies & License Audit

Loading dependencies...

Similar Packages

ai-real-estate-assistantAdvanced AI Real Estate Assistant using RAG, LLMs, and Python. Features market analysis, property valuation, and intelligent search.v5.0.7
doryOne memory layer for every AI agent. Local-first, markdown source of truth, and CLI/HTTP/MCP native. Your agent forgot who you are. Again. Dory fixes that.main@2026-05-24
agent2The production runtime for AI agents. Schema in, API out. Built on PydanticAI + FastAPI.v0.4.0
BuildableAI-powered web app builder — describe it, build it, ship it. 2-agent LangGraph system (Sonnet 4.5 + o4-mini) generates React apps from natural language with live preview and one-click deploy.0.0.0
lm-proxyOpenAI-compatible HTTP LLM proxy / gateway for multi-provider inference (Google, Anthropic, OpenAI, PyTorch). Lightweight, extensible Python/FastAPI—use as library or standalone service.v3.2.2

More in Infrastructure

tensorzeroTensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
planoPlano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agents core logic.
modelsThis repository contains comprehensive pricing and configuration data for LLMs. It powers cost attribution for 200+ enterprises running 400B+ tokens through Portkey AI Gateway every day.
edgeeOpen-source AI gateway written in Rust, with token compression for Claude Code, Codex... and any other LLM client.