# vllm-mlx

> OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX bac

- **URL**: https://www.freshcrate.ai/projects/vllm-mlx
- **Author**: waybarrios
- **Category**: MCP Servers
- **Latest version**: `v0.3.0` (2026-05-09)
- **License**: Apache-2.0
- **Source**: https://github.com/waybarrios/vllm-mlx
- **Language**: Python
- **GitHub**: 917 stars, 140 forks
- **Registry**: github
- **Tags**: `anthropic`, `apple-silicon`, `audio-processing`, `claude-code`, `computer-vision`, `image-understanding`, `inference`, `llm`, `python`

## Description

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v0.3.0` | 2026-05-09 | High | ## Highlights  - **Registry-backed multi-model serving** — register and qualify multiple models, swap them per request, with `--no-mllm` opt-out and lifecycle-managed residency for the default model. - **Gemma 4 audio** — chat completions now accept Gemma 4 audio inputs through the unified pipeline. - **MLLM chunked prefill** — non-blocking interleaved prefill/decode for multimodal requests, plus async preprocessing that no longer blocks the event loop. - **`--max-kv-size` CLI flag** — per-seque |
| `v0.2.9` | 2026-04-22 | High | ## v0.2.9  ### Security  Large wave of server hardening landed in this release. If you expose the server, upgrade.  - MCP sandbox enforced on execute endpoint (#329) - MCP high-risk tools blocked by default (#343) - MCP interpreter inline execution flags blocked (#331) - MCP newline and path traversal validation (#333) - MCP config security bypass removed (#326, #345) - Reject arbitrary endpoint model loads (#330) - `trust_remote_code` now requires explicit opt-in (#328) - Block local path trave |
| `v0.2.8` | 2026-04-12 | High | ## Compatibility & Bug Fixes - mlx-lm 0.31.x BatchGenerator API compatibility (#294, closes #293) - Gemma 4 BatchKVCache, attention and RotatingKVCache patches (#268, #256) - Qwen3.5 ArraysCache hybrid model batching (#160) - mlx-lm 0.31.x prompt_checkpoints tuple compatibility - RotatingKVCache support in MLLM batching - Streaming UTF-8 safe detokenizer (#109) - Platform module rename to avoid stdlib shadowing (#185) - Base64 image hash no longer truncated (#206) - Specprefill: avoid dense tail |
| `v0.2.7` | 2026-03-31 | Medium | ## What's New  ### Features - **SpecPrefill**: attention-based sparse prefill for TTFT reduction (#180) - **Native video support**: Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (#150) - **MTP speculative decoding** for Qwen3-Next (#82) - **Qwen3.5 model support** with text-only loading and dynamic memory threshold (#127) - **GPT-OSS reasoning parser** for channel-based token format (#53) - **KV cache quantization** for prefix cache memory reduction (#62) - **`--served-model-name` CLI p |
| `v0.2.6` | 2026-02-13 | Low | This is a major feature release with tool calling, embeddings, reasoning support, Anthropic Messages API, and numerous fixes.  ## New Features  ### Tool Calling Support Full tool calling / function calling with 12 parsers covering major model families: Mistral, DeepSeek, Granite, Nemotron, GLM-4.7, Harmony (GPT-OSS), and more. Includes native format support and streaming tool call parsing for Qwen3-Coder. (#28, #31, #42, #50, #55, #64)  ### Embeddings API OpenAI-compatible `/v1/embeddings` endpo |
| `v0.2.5` | 2026-01-26 | Low | This release brings significant performance improvements for multimodal models through prefix caching and continuous batching support.  ## What's New  ### Prefix Cache for Multimodal Models  When you send the same image multiple times (like in a multi-turn conversation), the vision encoder normally has to process it again each time. Now, vllm-mlx caches the vision embeddings and KV states, so subsequent requests with the same image skip the encoder entirely.  **Real-world impact:** On a Qwen3-VL |
| `v0.2.4` | 2026-01-23 | Low | ## What's New  ### Dependency conflict resolved  Fixed pip installation error when mlx-audio conflicted with mlx-lm version requirements.  **Changes:** - mlx-audio moved to optional dependencies (install with `pip install vllm-mlx[audio]`) - Removed transformers<5.0.0 constraint for mlx-lm 0.30.2+ compatibility  ### Install ```bash pip install vllm-mlx==0.2.4  # With audio support pip install vllm-mlx[audio]==0.2.4 ```  Fixes #19 |
| `v0.2.3` | 2026-01-22 | Low | ## What's New  ### Stability improvements for continuous batching  This release fixes two issues reported in #16 when running with `--continuous-batching`:  **Metal crash fix** - Added proper synchronization for MLX operations. Previously, concurrent requests could cause Metal command buffer conflicts resulting in crashes. Now requests are serialized internally to prevent this.  **Memory management** - The prefix cache now tracks actual memory usage instead of just counting entries. For large mo |
| `v0.2.1` | 2026-01-16 | Low | ## What's New  ### Security - Timing attack prevention with `secrets.compare_digest()` for API key verification - Rate limiting support with sliding window algorithm (`--rate-limit` flag) - Request timeout to prevent resource exhaustion (`--timeout` flag)  ### Reliability - TempFileManager for automatic cleanup of temporary files (images/videos) - Thread-safe `_waiting_consumers` counter in RequestOutputCollector - Fixed asyncio timeout by running model calls in thread pool  ### API Changes - Ne |
| `v0.2.0` | 2026-01-06 | Low | ## 🚀 What's New  vLLM-MLX now supports **Text, Image, Video & Audio** - all GPU-accelerated on Apple Silicon.  ### 🎙️ Audio Support (NEW) - **STT (Speech-to-Text)**: Whisper, Parakeet - **TTS (Text-to-Speech)**: Kokoro with native multilingual voices - **Native voices**: English, Spanish, French, Chinese, Japanese, Italian, Portuguese, Hindi - **Bug fix included** for mlx-audio 0.2.9 multilingual support  ### 📦 Modular Architecture  \| Modality \| Library \| Install \| \|----------\|-- |

## Citation

- HTML: https://www.freshcrate.ai/projects/vllm-mlx
- Markdown: https://www.freshcrate.ai/projects/vllm-mlx.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/vllm-mlx/deps

_Generated by freshcrate.ai. Indexes github releases for AI-agent ecosystem packages._