| v0.22.1 | ## Highlights This release features 8 commits from 6 contributors (1 new)! v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions. ### Model Support * New model: JetBrains' **Mellum v2**, an open-weights Mixture-of-Experts | High | 6/5/2026 |
| v0.22.0 | ## Highlights This release features 459 commits from 230 contributors (63 new)! * **DeepSeek V4 maturity**: DeepSeek V4 received a major hardening pass this cycle — the model was reorganized into a dedicated `vllm/models/deepseek_v4/` package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, `mhc`, Q-norm, indexer, sparse MLA) and ROCm parity | High | 5/29/2026 |
| v0.21.0 | ## Highlights This release features 367 commits from 202 contributors (49 new)! * **Transformers v4 deprecated**: This release formally deprecates `transformers` v4 support (#40389). Users should migrate to `transformers` v5. * **C++20 build requirement**: vLLM now requires a C++20-compatible compiler for compatibility with PyTorch (#40380). This is a **breaking build change**. * **KV Offload + Hybrid Memory Allocator (HMA)**: The KV offloading subsystem now integrates with the Hybrid Me | High | 5/15/2026 |
| v0.20.2 | # vLLM v0.20.2 ## Highlights This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL. ### Bug Fixes * **DeepSeek V4 sparse attention**: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of `max_seq_len`, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605). * **DeepSeek V4 KV cache**: Fixed a "failure to allocate KV bloc | High | 5/10/2026 |
| v0.20.1 | # vLLM v0.20.1 This is a patch release on top of `v0.20.0` primarily focused on **DeepSeek V4 stabilization and performance improvements**, along with several important bug fixes. ### DeepSeek V4 * Base model support (#41006). * Multi-stream pre-attention GEMM (#41061), configurable pre-attn GEMM knob (#41443), and tuned default `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` (#41526). * BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960). * PTX `cvt` instruction | High | 5/3/2026 |
| v0.20.0 | # vLLM v0.20.0 ## Highlights This release features 752 commits from 320 contributors (123 new)! * **DeepSeek V4**: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950). * **CUDA 13.0 default**: Default CUDA wheel on PyPI and `vllm/vllm-openai:v0.20.0` image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to matc | High | 4/27/2026 |
| v0.19.1 | This is a patch release on top of `v0.19.0` with Transformers v5.5.4 upgrade and bug fixes for Gemma4: - Update to transformers v5 (#30566) - [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992) - [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909) - [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114) - [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027) - [ | High | 4/18/2026 |
| v0.19.0 | # vLLM v0.19.0 ## Highlights This release features 448 commits from 197 contributors (54 new)! * **Gemma 4 support**: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires `transformers>=5.5.0`. We recommend using pre-built docker image `vllm/vllm-openai:gemma4` for out of box usage. * **Zero-bubble async scheduling + speculative decoding**: Async scheduling now supports speculative decoding with zero-bubble ov | High | 4/3/2026 |
| v0.18.1 | This is a patch release on top of v0.18.0 to address a few issues: - Change default SM100 MLA prefill backend back to TRT-LLM (#38562) - Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158) - Disable monolithic TRTLLM MoE for Renormalize routing (#37605) - Pre-download missing FlashInfer headers in Docker build (#38391) - Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083) | Medium | 3/31/2026 |
| v0.18.0 | # vLLM v0.18.0 ## Known issues - Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618) - If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug. ## Highlights This release features 445 commits from 213 contributors (61 new)! * **gRPC Serving Support**: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling | Low | 3/20/2026 |
| v0.17.1 | This is a patch release on top of `v0.17.0` to address a few issues: - New Model: Nemotron 3 Super - Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 (#36017) - Fix/resupport nongated fused moe triton (#36412) - Re-enable EP for trtllm MoE FP8 backend (#36494) - [Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219) - Fix TRTLLM Block FP8 MoE Monolithic (#36296) - [DSV3.2][MTP] Optimize Indexer MTP handling (#36723) | Low | 3/11/2026 |
| v0.17.0 | # vLLM v0.17.0 **Known Issue**: If you are on CUDA 12.9+ and encounter a `CUBLAS_STATUS_INVALID_VALUE` error, this is caused by a CUDA library mismatch. To resolve, try one of the following: 1. Remove the path to system CUDA shared library files (e.g. `/usr/local/cuda`) from `LD_LIBRARY_PATH`, or simply `unset LD_LIBRARY_PATH`. 2. Install vLLM with `uv pip install vllm --torch-backend=auto`. 3. Install vLLM with `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` (c | Low | 3/7/2026 |
| v0.16.0 | # vLLM v0.16.0 Please note that this release was branch cut on Feb 8, so any features added to vLLM after that date are not included. ## Highlights This release features 440 commits from 203 contributors (7 new)! * **Async scheduling + Pipeline Parallelism** is now fully supported, delivering **30.8% E2E throughput improvement** and **31.8% TPOT improvement** (#32618). * **Realtime API**: A new WebSocket-based Realtime API enables streaming audio interactions (#33187), building on the | Low | 2/25/2026 |
| v0.15.1 | v0.15.1 is a patch release with security fixes, RTX Blackwell GPU support fixes, and bug fixes. ## Security - **CVE-2025-69223**: Updated aiohttp dependency (#33621) - **CVE-2026-0994**: Updated Protobuf dependency (#33619) ## Highlights ### Bugfix Hardware Support - **RTX Blackwell (SM120)**: Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#33417) - **FP8 kernel selection**: Fixed FP8 CUTLASS group | Low | 2/4/2026 |
| v0.15.0 | ## Highlights This release features 335 commits from 158 contributors (39 new)! ### Model Support * **New architectures**: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456). * **LoRA expansion**: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763). * **Speculative decoding**: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322). * **Embeddings**: BGE-M | Low | 1/29/2026 |
| v0.14.1 | This is a patch release on top of `v0.14.0` to address a few security and memory leak issues. | Low | 1/24/2026 |
| v0.14.0 | ## Highlights This release features approximately 660 commits from 251 contributors (86 new contributors). **Breaking Changes:** - **Async scheduling is now enabled by default** - Users who experience issues can disable with `--no-async-scheduling`. - Excludes some not-yet-supported configurations: pipeline parallel, CPU backend, non-MTP/Eagle spec decoding. - **PyTorch 2.9.1** is now required and the default wheel is compiled against cu129. - Deprecated quantization schemes have be | Low | 1/20/2026 |
| v0.13.0 | # vLLM v0.13.0 Release Notes Highlights ## Highlights This release features **442 commits from 207 contributors (61 new contributors)!** **Breaking Changes**: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading. ### Model Support * **New models**: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), lat | Low | 12/19/2025 |
| v0.12.0 | # vLLM v0.12.0 Release Notes Highlights ## Highlights This release features 474 commits from 213 contributors (57 new)! **Breaking Changes**: This release includes PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including `xformers` backend, and scheduled removals - please review the changelog carefully. **Major Features**: * **EAGLE Speculative Decoding Improvements**: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594). * | Low | 12/3/2025 |
| v0.11.2 | This release includes 4 bug fixes on top of `v0.11.1`: - [BugFix] Ray with multiple nodes (https://github.com/vllm-project/vllm/pull/28873) - [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (https://github.com/vllm-project/vllm/pull/29036) - [BugFix] Fix async-scheduling + FlashAttn MLA (https://github.com/vllm-project/vllm/pull/28990) - [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (https://github.com/vllm-project/vllm/pull/28938) | Low | 11/20/2025 |
| v0.11.1 | ## Highlights This release includes 1456 commits from 449 contributors (184 new contributors)! Key changes include: * **PyTorch 2.9.0 + CUDA 12.9.1**: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration. * **Batch-invariant `torch.compile`**: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and | Low | 11/18/2025 |
| v0.11.0 | ## Highlights This release features 538 commits, 207 contributors (65 new contributors)! * This release completes the removal of V0 engine. V0 engine code including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components have been removed. **V1 is the only engine in the codebase now.** * This release turns on **FULL_AND_PIECEWISE as the CUDA graph mode default**. This should provide better out of the box performance for most models, particularly fine-grained | Low | 10/2/2025 |
| v0.10.2 | ## Highlights This release contains 740 commits from 266 contributors (97 new)! **Breaking Changes**: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully. **aarch64 support**: This release features native support for aarch64 allowing usage of vLLM on GB200 platform. The docker image `vllm/vllm-openai` should already be multiplatform. To install the wheels, you can download the wheels from this release artifact or install vi | Low | 9/13/2025 |
| v0.10.1.1 | This is a critical bugfix and security release: * Fix CUTLASS MLA Full CUDAGraph (#23200) * Limit HTTP header count and size (#23267): https://github.com/vllm-project/vllm/security/advisories/GHSA-rxc4-3w6r-4v47 * Do not use eval() to convert unknown types (#23266): https://github.com/vllm-project/vllm/security/advisories/GHSA-79j6-g2m3-jgfw **Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.10.1...v0.10.1.1 | Low | 8/20/2025 |
| v0.10.1 | ## Highlights v0.10.1 release includes 727 commits, 245 committers (105 new contributors). **NOTE: This release deprecates V0 FA3 support and as a result FP8 kv-cache in V0 may have issues** ### Model Support * **New model families**: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665). * **Vision-language models**: Official Eag | Low | 8/18/2025 |
| v0.10.1rc1 | ## What's Changed * Deduplicate Transformers backend code using inheritance by @hmellor in https://github.com/vllm-project/vllm/pull/21461 * [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in https://github.com/vllm-project/vllm/pull/21205 * [TPU][Bugfix] fix moe layer by @yaochengji in https://github.com/vllm-project/vllm/pull/21340 * [v1][Core] Clean up usages of `SpecializedManager` by @zhouwfang in https://github.com/vllm-project/vllm/pull/21407 * [Misc] Fix duplicate FusedMoEConfi | Low | 8/17/2025 |
| v0.10.0 | ## Highlights v0.10.0 release includes 308 commits, 168 contributors (62 new!). **NOTE: This release begins the cleanup of V0 engine codebase.** We have removed V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far and plan to continue deleting code that is no longer used. ### Model Support * New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), | Low | 7/24/2025 |
| v0.10.0rc2 | ## What's Changed * [Model] use AutoWeightsLoader for bart by @calvin0327 in https://github.com/vllm-project/vllm/pull/18299 * [Model] Support VLMs with transformers backend by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/20543 * [bugfix] fix syntax warning caused by backslash by @1195343015 in https://github.com/vllm-project/vllm/pull/21251 * [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in https://github.com/vllm-project/vllm/pull/21243 * [Docs] Add RFC | Low | 7/24/2025 |
| v0.10.0rc1 | ## What's Changed * [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864 * [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400 * [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510 * Change warn_for_unimplemented_methods to debug by @mg | Low | 7/20/2025 |
| v0.9.2 | ## Highlights This release contains 452 commits from 167 contributors (31 new!) **NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to V1 engine.** ### Engine Core * Priority Scheduling is now implemented in V1 engine (#19057), embedding models in V1 (#16188), Mamba2 in V1 (#19327). * Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph now has a l | Low | 7/7/2025 |
| v0.9.2rc2 | ## What's Changed * [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864 * [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400 * [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510 * Change warn_for_unimplemented_methods to debug by @mg | Low | 7/6/2025 |
| v0.9.2rc1 | ## What's Changed * [Docs] Note that alternative structured output backends are supported by @russellb in https://github.com/vllm-project/vllm/pull/19426 * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in https://github.com/vllm-project/vllm/pull/19440 * [Model] use AutoWeightsLoader for commandr by @py-andy-c in https://github.com/vllm-project/vllm/pull/19399 * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in https://github.co | Low | 7/3/2025 |
| v0.9.1 | ## Highlights This release features **274 commits, from 123 contributors (27 new contributors!)** * Progress in large scale serving * DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762) * Heterogeneous TP (#18833), NixlConnector Enable FlashInfer backend (#19090) * DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ra | Low | 6/10/2025 |
| v0.9.1rc1 | ## What's Changed * [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in https://github.com/vllm-project/vllm/pull/18282 * [Neuron] Support quantization on neuron by @aws-satyajith in https://github.com/vllm-project/vllm/pull/18283 * Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in https://github.com/vllm-project/vllm/pull/18566 * [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in https://github.com/vllm-project/vllm/pu | Low | 6/9/2025 |
| v0.9.0.1 | This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and below (#18807) **Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.9.0...v0.9.0.1 | Low | 5/30/2025 |
| v0.9.0 | ## Highlights This release features 649 commits, from 215 contributors (82 new contributors!) * vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependency. * The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheel on GitHub artifact. * As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy. * Enhanced NVIDIA Blackwell support. vLLM now ships with initial set of optimized | Low | 5/15/2025 |
| v0.8.5.post1 | This post release contains two bug fixes, one for a memory leak and one for model accuracy: * Fix Memory Leak in `_cached_reqs_data` (#17567) * Fix sliding window attention in V1 giving incorrect results (#17574) **Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.8.5...v0.8.5.post1 | Low | 5/2/2025 |
| v0.8.5 | This release contains 310 commits from 143 contributors (55 new contributors!). ## Highlights This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling. ### Model Support * Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328). * Add ModernBERT (#16648) * Add Granite Speech Support (#16246) * Add PLaMo2 (#14323) * Add Kimi-VL model support (# | Low | 4/28/2025 |
| v0.8.4 | This release contains 180 commits from 84 contributors (25 new contributors!). ## Highlights This release includes important accuracy fixes for Llama4 models; if you are using them, we highly recommend updating. ### Model * Llama4 (#16113,#16509) bug fix and enhancements: * qknorm should not be shared across heads (#16311) * Enable attention temperature tuning by default for long context (>32k) (#16439) * Index Error When Single Request Near Max Context (#16209) * Add tuned | Low | 4/14/2025 |
| v0.8.3 | ## Highlights This release features 260 commits, 109 contributors, 38 new contributors. * We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please [see our blog for detailed user guide](https://blog.vllm.ai/2025/04/05/llama4). * Please note that Llama4 is supported only in the V1 engine for now. * V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator. ### Cluster Scale Serving * Single node data parallel | Low | 4/6/2025 |
| v0.8.3rc1 | ## What's Changed * Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 by @houseroad in https://github.com/vllm-project/vllm/pull/15160 * [Hardware][TPU][Bugfix] Fix v1 mp profiler by @lsy323 in https://github.com/vllm-project/vllm/pull/15409 * [Kernel][CPU] CPU MLA by @gau-nernst in https://github.com/vllm-project/vllm/pull/14744 * Dockerfile.ppc64le changes to move to UBI by @Shafi-Hussain in https://github.com/vllm-project/vllm/pull/15402 * [Misc] C | Low | 4/5/2025 |
| v0.8.2 | This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading! ## Highlights * Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377) * Remove openvino support in favor of external plugin (#15339) ### V1 Engine * Fix V1 Engine crash while handling requests with duplicate request id (#15043) * Support FP8 KV Cache (#14570, #15191) * Add flag to disable cascade attention (#15243) * Scheduler Refactoring: Add Sc | Low | 3/23/2025 |
| v0.8.1 | This release contains important bug fixes for v0.8.0. We highly recommend upgrading! * V1 Fixes * Ensure using int64 for sampled token ids (#15065) * Fix long dtype in topk sampling (#15049) * Refactor Structured Output for multiple backends (#14694) * Fix size calculation of processing cache (#15114) * Optimize Rejection Sampler with Triton Kernels (#14930) * Fix oracle for device checking (#15104) * TPU * Fix chunked prefill with padding (#15037) * Enhanced CI/CD (#15054 | Low | 3/19/2025 |
| v0.8.0 | v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)! ## Highlights ### V1 We have now enabled V1 engine by default (#13726) for supported use cases. Please refer to [V1 user guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) for more detail. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please specify the environment variable `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason! * Su | Low | 3/18/2025 |
| v0.8.0rc2 | ## What's Changed * [V1] Remove input cache client by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/14864 * [Misc][XPU] Use None as device capacity for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/14932 * [Doc] Add vLLM Beijing meetup slide by @heheda12345 in https://github.com/vllm-project/vllm/pull/14938 * setup.py: drop assumption about local `main` branch by @russellb in https://github.com/vllm-project/vllm/pull/14692 * [MISC] More AMD unused var clean up by @hous | Low | 3/17/2025 |
| v0.8.0rc1 | Note: vLLM no longer sets the global seed (#14274). Please set the `seed` parameter if you need to reproduce your results. ## What's Changed * Update `pre-commit`'s `isort` version to remove warnings by @hmellor in https://github.com/vllm-project/vllm/pull/13614 * [V1][Minor] Print KV cache size in token counts by @WoosukKwon in https://github.com/vllm-project/vllm/pull/13596 * fix neuron performance issue by @ajayvohra2005 in https://github.com/vllm-project/vllm/pull/13589 * [Frontend] A | Low | 3/17/2025 |
| v0.7.3 | ## Highlights 🎉 253 commits from 93 contributors, including 29 new contributors! * Deepseek enhancements: * Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755) * AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199) * Using FlashAttention3 for MLA (#12807) * Align the expert selection code path with official implementation (#13474) * Optimize moe_align_block_size for deepseek_v3 (#12850) * Expand MLA to support most types of | Low | 2/20/2025 |
| v0.7.2 | ## Highlights * Qwen2.5-VL is now supported in vLLM. Please note that it requires a source installation of the Hugging Face `transformers` library at the moment (#12604) * Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be run with arbitrary Hugging Face text models (#11330, #12785, #12727). * Performance enhancement to DeepSeek models. * Align KV cache entries to start on 256-byte boundaries, yielding 43% throughput enhancement (#12676) * Apply `to | Low | 2/6/2025 |
| v0.7.1 | ## Highlights This release features MLA optimization for Deepseek family of models. Compared to v0.7.0 released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism * MLA Kernel (#12601, #12642,#12528). * FP8 Kernels (#11589, #11868, #12587) ### V1 For the V1 architecture, we * Added a new design document for zero overhead prefix caching [here](https://docs.vllm.ai/en/latest/design/v1/pre | Low | 2/1/2025 |
| v0.7.0 | ## Highlights * vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting environment variable `VLLM_USE_V1=1`. See [our blog](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) for more details. (44 commits). * New methods (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for the post training frameworks! (#12361, #12084, #12284). * `torch.compile` is now fully | Low | 1/27/2025 |
| v0.6.6.post1 | This release restores functionality for other quantized MoEs, fixing a regression introduced as part of initial DeepSeek V3 support 🙇. ## What's Changed * [Docs] Document Deepseek V3 support by @simon-mo in https://github.com/vllm-project/vllm/pull/11535 * Update openai_compatible_server.md by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/11536 * [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/113 | Low | 12/27/2024 |
| v0.6.6 | ## Highlights * Support Deepseek V3 (#11523, #11502) model. * On 8xH200s or MI300x: `vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192`. The context length can be increased to about 32K before running into memory issues. * For other devices, follow our [distributed inference](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) guide to enable tensor parallel and/or pipeline parallel inference * We are just getting started for | Low | 12/27/2024 |
| v0.6.5 | ## Highlights * Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #10699, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242). * Major improvements in `torch.compile` integration: Support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance | Low | 12/17/2024 |
| v0.6.4.post1 | This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363) and keeps compatibility for `vLLMConfig` usage in out-of-tree models (#10356). ## What's Changed * Add default value to avoid Falcon crash (#5363) by @wchen61 in https://github.com/vllm-project/vllm/pull/10347 * [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/10349 * [Doc] Remove float32 choice from --lora-dtype by @xyang16 in https://gith | Low | 11/15/2024 |
| v0.6.4 | ## Highlights * Significant progress in V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can check out more details regarding the design and plan ahead in our recent [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_2_130) * Significant progres | Low | 11/15/2024 |
| v0.6.3.post1 | ## Highlights ### New Models * Support Ministral 3B and Ministral 8B via interleaved attention (#9414) * Support multiple and interleaved images for Llama3.2 (#9095) * Support VLM2Vec, the first multimodal embedding model in vLLM (#9303) ### Important bug fix * Fix chat API continuous usage stats (#9357) * Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034) * Fix Molmo text-only input bug (#9397) * Fix CUDA 11.8 Build (#9386) * Fix `_version.py` not fou | Low | 10/17/2024 |