# vllm

> A high-throughput and memory-efficient inference and serving engine for LLMs

- **URL**: https://www.freshcrate.ai/projects/vllm
- **Author**: vllm-project
- **Category**: RAG & Memory
- **Latest version**: `v0.22.1` (2026-06-05)
- **License**: Apache-2.0
- **Source**: https://github.com/vllm-project/vllm
- **Homepage**: https://vllm.ai
- **Language**: Python
- **GitHub**: 77,587 stars, 15,906 forks
- **Registry**: github
- **Tags**: `amd`, `blackwell`, `cuda`, `deepseek`, `deepseek-v3`, `gpt`, `gpt-oss`, `inference`, `python`

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v0.22.1` | 2026-06-05 | High | **Highlights:** This release features 8 commits from 6 contributors (1 new)! v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions. **Model Support:** * New model: JetBrains' **Mellum v2**, an open-weights Mixture-of-Experts |
| `v0.22.0` | 2026-05-29 | High | **Highlights:** This release features 459 commits from 230 contributors (63 new)! * **DeepSeek V4 maturity**: DeepSeek V4 received a major hardening pass this cycle: the model was reorganized into a dedicated `vllm/models/deepseek_v4/` package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, `mhc`, Q-norm, indexer, sparse MLA) and ROCm parity |
| `v0.21.0` | 2026-05-15 | High | **Highlights:** This release features 367 commits from 202 contributors (49 new)! * **Transformers v4 deprecated**: This release formally deprecates `transformers` v4 support (#40389). Users should migrate to `transformers` v5. * **C++20 build requirement**: vLLM now requires a C++20-compatible compiler for compatibility with PyTorch (#40380). This is a **breaking build change**. * **KV Offload + Hybrid Memory Allocator (HMA)**: The KV offloading subsystem now integrates with the Hybrid Me |
| `v0.20.2` | 2026-05-10 | High | **Highlights:** This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL. **Bug Fixes:** * **DeepSeek V4 sparse attention**: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of `max_seq_len`, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605). * **DeepSeek V4 KV cache**: Fixed a "failure to allocate KV bloc |
| `v0.20.1` | 2026-05-03 | High | This is a patch release on top of `v0.20.0` primarily focused on **DeepSeek V4 stabilization and performance improvements**, along with several important bug fixes. **DeepSeek V4:** * Base model support (#41006). * Multi-stream pre-attention GEMM (#41061), configurable pre-attn GEMM knob (#41443), and tuned default `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` (#41526). * BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960). * PTX `cvt` instruction |
| `v0.20.0` | 2026-04-27 | High | **Highlights:** This release features 752 commits from 320 contributors (123 new)! * **DeepSeek V4**: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950). * **CUDA 13.0 default**: Default CUDA wheel on PyPI and `vllm/vllm-openai:v0.20.0` image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to matc |
| `v0.19.1` | 2026-04-18 | High | This is a patch release on top of `v0.19.0` with Transformers v5.5.4 upgrade and bug fixes for Gemma4: - Update to transformers v5 (#30566) - [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992) - [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909) - [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114) - [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027) - [ |
| `v0.19.0` | 2026-04-03 | High | **Highlights:** This release features 448 commits from 197 contributors (54 new)! * **Gemma 4 support**: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires `transformers>=5.5.0`. We recommend using the pre-built Docker image `vllm/vllm-openai:gemma4` for out-of-the-box usage. * **Zero-bubble async scheduling + speculative decoding**: Async scheduling now supports speculative decoding with zero-bubble ov |
| `v0.18.1` | 2026-03-31 | Medium | This is a patch release on top of v0.18.0 to address a few issues: - Change default SM100 MLA prefill backend back to TRT-LLM (#38562) - Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158) - Disable monolithic TRTLLM MoE for Renormalize routing (#37605) - Pre-download missing FlashInfer headers in Docker build (#38391) - Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083) |
| `v0.18.0` | 2026-03-20 | Low | **Known issues:** - Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618) - If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug. **Highlights:** This release features 445 commits from 213 contributors (61 new)! * **gRPC Serving Support**: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling |

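Several entries above state version floors, for example the `v0.19.0` notes require `transformers>=5.5.0` for Gemma 4. The following is a minimal sketch of how such a floor could be checked against an installed version string. The helper names `parse_version` and `meets_minimum` are hypothetical and not part of vLLM, and the installed versions in the example calls are illustrative values, not taken from the release notes. It handles plain dotted release numbers only.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn a plain dotted version such as 'v0.22.1' or '5.5.0' into a tuple of ints."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        # Pre-release or local suffixes (e.g. '5.5.0rc1') are deliberately not handled here.
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum`."""
    have = parse_version(installed)
    need = parse_version(minimum)
    # Pad the shorter tuple with zeros so that '5.5' compares equal to '5.5.0'.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Floor taken from the v0.19.0 entry; the installed versions are made-up examples.
print(meets_minimum("5.5.4", "5.5.0"))
print(meets_minimum("4.57.1", "5.5.0"))
```

For real dependency checks, a standards-aware parser such as `packaging.version.Version` is likely the better choice, since it also understands pre-release and post-release suffixes that this sketch rejects.
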
## Citation

- HTML: https://www.freshcrate.ai/projects/vllm
- Markdown: https://www.freshcrate.ai/projects/vllm.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/vllm/deps

_Generated by freshcrate.ai. Indexes GitHub releases for AI-agent ecosystem packages._
