# houtini-lm

> MCP server that saves Claude Code tokens by delegating bounded tasks to local or cloud LLMs. Works with LM Studio, Ollama, vLLM, DeepSeek, Groq, Cerebras.

- **URL**: https://www.freshcrate.ai/projects/houtini-lm
- **Author**: houtini-ai
- **Category**: MCP Servers
- **Latest version**: `v2.8.0` (2026-03-18)
- **License**: MIT
- **Source**: https://github.com/houtini-ai/houtini-lm
- **Homepage**: https://houtini.com/how-to-cut-your-claude-code-bill-with-houtini-lm/
- **Language**: JavaScript
- **GitHub**: 71 stars, 14 forks
- **Registry**: github (`houtini-ai/houtini-lm`)
- **Tags**: `ai-agents`, `claude`, `claude-mcp`, `code-generation`, `developer-tool`, `developer-tools`, `javascript`, `lm-studio`, `lm-studio-mcp`

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v2.8.0` | 2026-03-18 | Low | **Added:** (1) **Quality metadata**: every response includes structured quality signals (truncation, think-block detection, token estimation, finish reason) so Claude can make informed trust decisions about local LLM output. (2) **Session metrics resource**: the `houtini://metrics/session` MCP resource exposes cumulative offload stats and per-model performance as JSON, enabling proactive routing feedback. (3) **Request semaphore**: inference calls are serialised to prevent stacked timeouts. |
| `v1.0.13` | 2025-09-23 | Low | Local LLM MCP v1.0.13. Release notes contain only the version number. |
| `v1.0.12` | 2025-09-23 | Low | Local LLM MCP v1.0.12. Release notes contain only the version number. |
| `v1.0.11` | 2025-09-10 | Low | Local LLM MCP v1.0.11. Release notes contain only the version number. |
| `v1.0.10` | 2025-09-07 | Low | Local LLM MCP v1.0.10. Merge branch 'main' of https://github.com/houtini-ai/lm |
| `v1.0.9` | 2025-09-05 | Low | Local LLM MCP v1.0.9. `fix(context)`: harden TokenCalculator math for context window stability. Reduce token estimation from 4 chars/token to 3 chars/token (more conservative); lower context usage from 95% to 80% for safety margin; add explicit 500 token safety buffer to all calculations; ensure consistent buffer application across `needsChunking` and execution; fix math error causing 152+ token overages on `qwen.qwen3-coder-30b`. CRITICAL: Fixes context overflow errors preventing analysis of … |
| `v1.0.8` | 2025-09-05 | Low | Local LLM MCP v1.0.8. `feat(context)`: enhance Claude contextual understanding with dynamic workflow guidance. Enhanced server registration with rich description and capabilities metadata; added dynamic context generation in `BasePlugin.getToolDefinition()`; implemented category-aware workflow context and usage tips; all 28+ functions now automatically provide workflow guidance to Claude; improves Claude's understanding of Houtini LM purpose and optimal usage patterns; maintains JSON-RPC co… |
| `v1.0.7` | 2025-09-05 | Low | Local LLM MCP v1.0.7. Release notes contain only the version number. |
| `v1.0.6` | 2025-09-05 | Low | Local LLM MCP v1.0.6. `chore`: bump version to 1.0.6 and clean up README. Remove version badge from README header for cleaner presentation; update `package.json` version to 1.0.6; include build system improvements and plugin registry updates; add build utilities for shebang handling. Prepares for npm package publication with improved documentation and build process. |

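The v1.0.9 notes above give three concrete numbers for the context budget: 3 characters per token, 80% of the context window, and a 500-token safety buffer. A minimal sketch of how those could combine follows. The function names and the order in which the margin and the buffer are applied are assumptions for illustration, not the project's actual `TokenCalculator` code.

```javascript
// Constants taken from the v1.0.9 release notes.
const CHARS_PER_TOKEN = 3;   // conservative estimate (previously 4)
const CONTEXT_USAGE = 0.8;   // use 80% of the window (previously 95%)
const SAFETY_BUFFER = 500;   // explicit token safety buffer

// Hypothetical helper: rough token count from character length.
function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: tokens left for input after margin and buffer.
// Assumption: the buffer is subtracted after the 80% margin is applied.
function usableTokens(contextWindow) {
  const afterMargin = Math.floor(contextWindow * CONTEXT_USAGE);
  return Math.max(0, afterMargin - SAFETY_BUFFER);
}

// Hypothetical check: does this text exceed the usable budget?
function needsChunking(text, contextWindow) {
  return estimateTokens(text) > usableTokens(contextWindow);
}

// Example with an assumed 32,768-token context window:
// usableTokens(32768) is floor(26214.4) - 500 = 25714.
```
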
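The v2.8.0 notes describe a request semaphore that serialises inference calls so that timeouts do not stack. One common way to get that behaviour in JavaScript is a promise chain that admits one task at a time. The sketch below illustrates that general technique. `createSerialQueue` and the names in the usage comment are hypothetical and are not taken from the houtini-lm source.

```javascript
// Hypothetical helper: a one-slot semaphore built from a promise chain.
// Each task starts only after the previous one has settled.
function createSerialQueue() {
  let tail = Promise.resolve();

  return function enqueue(task) {
    // Start this task once everything queued before it has finished.
    const run = tail.then(() => task());
    // Swallow rejections on the chain itself so that one failed task
    // does not block the tasks queued after it. The caller still sees
    // the rejection through the returned promise.
    tail = run.catch(() => {});
    return run;
  };
}

// Usage sketch (callLocalModel is an assumed function, not a real API):
//   const enqueue = createSerialQueue();
//   const a = enqueue(() => callLocalModel("summarise file A"));
//   const b = enqueue(() => callLocalModel("summarise file B"));
//   // b's request is not sent until a's has resolved or rejected.
```
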
## Citation

- HTML: https://www.freshcrate.ai/projects/houtini-lm
- Markdown: https://www.freshcrate.ai/projects/houtini-lm.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/houtini-lm/deps

_Generated by freshcrate.ai. Indexes GitHub releases for AI-agent ecosystem packages._