vllm

Home > RAG & Memory > vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

amd blackwell cuda deepseek deepseek-v3 gpt gpt-oss inference python

Why this rank:Strong adoptionRecent releaseHealthy release cadence

Description

A high-throughput and memory-efficient inference and serving engine for LLMs

README

Easy, fast, and cheap LLM serving for everyone

🔥 We have built a vllm website to help you get started with vllm. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has grown into one of the most active open-source AI projects built and maintained by a diverse community of many dozens of academic institutions and companies from over 2000 contributors.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests, chunked prefill, prefix caching
Fast and flexible model execution with piecewise and full CUDA/HIP graphs
Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and more
Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
Speculative decoding including n-gram, suffix, EAGLE, DFlash
Automatic kernel generation and graph-level transformations using torch.compile
Disaggregated prefill, decode, and encode

vLLM is flexible and easy to use with:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor, pipeline, data, expert, and context parallelism for distributed inference
Streaming outputs
Generation of structured outputs using xgrammar or guidance
Tool calling and reasoning parsers
OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
Efficient multi-LoRA support for dense and MoE layers
Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.

vLLM seamlessly supports 200+ model architectures on HuggingFace, including:

Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
Reward and classification models (e.g., Qwen-Math)

Find the full list of supported models here.

Getting Started

Install vLLM with uv (recommended) or pip:

uv pip install vllm

Or build from source for development.

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Contact Us

For technical questions and feature requests, please use GitHub Issues
For discussing with fellow users, please use the vLLM Forum
For coordinating contributions and development, please use Slack
For security disclosures, please use GitHub's Security Advisories feature
For collaborations and partnerships, please contact us at collaboration@vllm.ai

Media Kit

If you wish to use vLLM's logo, please refer to our media kit repo

Release History

Version	Changes	Urgency	Date
v0.26.0	# vLLM v0.26.0 Release Notes ## Highlights This release features 411 commits from 212 contributors (61 new)! * New Inkling model family with a full support stack: base modeling (#48799), piecewise CUDA graph support (#48822), Hopper FA4 relative attention (#48858), MTP=1 speculative decoding (#48869), LoRA (#48884), and standard ModelOpt NVFP4 quantization (#48990). * DeepSeek-V4 performance push across vendors: a specialized routing kernel (2.94% E2E TPOT, #48660), `fused_topk	High	7/25/2026
v0.25.1	# vLLM v0.25.1 ## Highlights This release features 2 commits from 2 contributors (1 new)! v0.25.1 is a patch release containing two targeted bug fixes on top of v0.25.0. ### Bug Fixes * Avoid blocking model launching when no system FFmpeg is available for TorchCodec (#47888). Previously `import torchcodec` raised a `RuntimeError` at import time when system FFmpeg was missing, which blocked startup (e.g. `vllm serve Qwen/Qwen3-VL-2B-Instruct`) even when TorchCodec was not in use.	Medium	7/14/2026
v0.25.0	# vLLM v0.25.0 Release Notes ## Highlights This release features 558 commits from 232 contributors (64 new)! * Model Runner V2 is now the default for all dense models (#44443). Building on quantized-model support from the previous release, MRv2 is now the standard execution path, with new support for EVS (#46535), realtime embeddings (#46762), prefix caching for Mamba hybrid models (#42406), multimodal-prefix bidirectional attention (#46942), and dynamic speculative decoding compati	High	7/11/2026
v0.24.0	# vLLM v0.24.0 Release Notes ## Highlights This release features 571 commits from 256 contributors (77 new)! * MiniMax-M3: Added support for the new MiniMax-M3 model (#45381), with a fast follow-on of BF16/FP8 indexer via MSA (#45892), MXFP4 support (#45896), FP8 sparse GQA (#45744), and extensive AMD/ROCm tuning — mxfp8 MoE/linear on gfx950 (#45725), fp8_per_channel for bf16 weights on MI300X (#45854), FP8 KV-cache fix (#45720), and packed-modules mapping (#45794). A MiniMax-M2	High	6/29/2026
v0.23.0	# vLLM v0.23.0 Release Notes Please note that Minimax M3 is not yet supported in this version. Please follow [vLLM recipe](https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3) for usage guides for M3. ## Highlights This release features 408 commits from 200 contributors (63 new)! * DeepSeek-V4 matures across backends: Following its introduction in v0.22.0, DeepSeek-V4 received another large hardening and optimization pass. Its sparse MLA metadata is now decoupled from DeepSeek-V3.2 (#44	High	6/12/2026
v0.22.1	## Highlights This release features 8 commits from 6 contributors (1 new)! v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions. ### Model Support * New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts	High	6/5/2026
v0.22.0	## Highlights This release features 459 commits from 230 contributors (63 new)! * DeepSeek V4 maturity: DeepSeek V4 received a major hardening pass this cycle — the model was reorganized into a dedicated `vllm/models/deepseek_v4/` package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, `mhc`, Q-norm, indexer, sparse MLA) and ROCm parity	High	5/29/2026
v0.21.0	## Highlights This release features 367 commits from 202 contributors (49 new)! * Transformers v4 deprecated: This release formally deprecates `transformers` v4 support (#40389). Users should migrate to `transformers` v5. * C++20 build requirement: vLLM now requires a C++20-compatible compiler for compatibility with PyTorch (#40380). This is a breaking build change. * KV Offload + Hybrid Memory Allocator (HMA): The KV offloading subsystem now integrates with the Hybrid Me	High	5/15/2026
v0.20.2	# vLLM v0.20.2 ## Highlights This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL ### Bug Fixes * DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of `max_seq_len`, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605). * DeepSeek V4 KV cache: Fixed a "failure to allocate KV bloc	High	5/10/2026
v0.20.1	# vLLM v0.20.1 This is a patch release on top of `v0.20.0` primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes. ### DeepSeek V4 * Base model support (#41006). * Multi-stream pre-attention GEMM (#41061), configurable pre-attn GEMM knob (#41443), and tuned default `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` (#41526). * BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960). * PTX `cvt` instruction	High	5/3/2026
v0.20.0	# vLLM v0.20.0 ## Highlights This release features 752 commits from 320 contributors (123 new)! * DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950). * CUDA 13.0 default: Default CUDA wheel on PyPI and `vllm/vllm-openai:v0.20.0` image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to matc	High	4/27/2026
v0.19.1	This is a patch release on top of `v0.19.0` with Transformers v5.5.4 upgrade and bug fixes for Gemma4: - Update to transformers v5 (#30566) - [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992) - [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909) - [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114) - [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027) - [	High	4/18/2026
v0.19.0	# vLLM v0.19.0 ## Highlights This release features 448 commits from 197 contributors (54 new)! * Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires `transformers>=5.5.0`. We recommend using pre-built docker image `vllm/vllm-openai:gemma4` for out of box usage. * Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble ov	High	4/3/2026
v0.18.1	This is a patch release on top of v0.18.0 to address a few issues: - Change default SM100 MLA prefill backend back to TRT-LLM (#38562) - Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158) - Disable monolithic TRTLLM MoE for Renormalize routing #37605 - Pre-download missing FlashInfer headers in Docker build #38391 - Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083)	Medium	3/31/2026
v0.18.0	# vLLM v0.18.0 ## Known issues - Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618) - If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug. ## Highlights This release features 445 commits from 213 contributors (61 new)! * gRPC Serving Support: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling	Low	3/20/2026
v0.17.1	This is a patch release on top of `v0.17.0` to address a few issues: - New Model: Nemotron 3 Super - Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 (#36017) - Fix/resupport nongated fused moe triton (#36412) - Re-enable EP for trtllm MoE FP8 backend (#36494) - [Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219) - Fix TRTLLM Block FP8 MoE Monolithic (#36296) - [DSV3.2][MTP] Optimize Indexer MTP handling (#36723)	Low	3/11/2026
v0.17.0	# vLLM v0.17.0 Known Issue: If you are on CUDA 12.9+ and encounter a `CUBLAS_STATUS_INVALID_VALUE` error, this is caused by a CUDA library mismatch. To resolve, try one of the following: 1. Remove the path to system CUDA shared library files (e.g. `/usr/local/cuda`) from `LD_LIBRARY_PATH`, or simply `unset LD_LIBRARY_PATH`. 2. Install vLLM with `uv pip install vllm --torch-backend=auto`. 3. Install vLLM with `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` (c	Low	3/7/2026
v0.16.0	# vLLM v0.16.0 Please note that this release was branch cut on Feb 8, so any features added to vLLM after that date is not included. ## Highlights This release features 440 commits from 203 contributors (7 new)! * Async scheduling + Pipeline Parallelism is now fully supported, delivering 30.8% E2E throughput improvement and 31.8% TPOT improvement (#32618). * Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions (#33187), building on the	Low	2/25/2026
v0.15.1	v0.15.1 is a patch release with security fixes, RTX Blackwell GPU fixes support, and bug fixes. ## Security - CVE-2025-69223: Updated aiohttp dependency (#33621) - CVE-2026-0994: Updated Protobuf dependency (#33619) ## Highlights ### Bugfix Hardware Support - RTX Blackwell (SM120): Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#33417) - FP8 kernel selection: Fixed FP8 CUTLASS group	Low	2/4/2026
v0.15.0	## Highlights This release features 335 commits from 158 contributors (39 new)! ### Model Support * New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456). * LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763). * Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322). * Embeddings: BGE-M	Low	1/29/2026
v0.14.1	This is a patch release on top of `v0.14.0` to address a few security and memory leak fixes.	Low	1/24/2026
v0.14.0	## Highlights This release features approximately 660 commits from 251 contributors (86 new contributors). Breaking Changes: - Async scheduling is now enabled by default - Users who experience issues can disable with `--no-async-scheduling`. - Excludes some not-yet-supported configurations: pipeline parallel, CPU backend, non-MTP/Eagle spec decoding. - PyTorch 2.9.1 is now required and the default wheel is compiled against cu129. - Deprecated quantization schemes have be	Low	1/20/2026
v0.13.0	# vLLM v0.13.0 Release Notes Highlights ## Highlights This release features 442 commits from 207 contributors (61 new contributors)! Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading. ### Model Support * New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), lat	Low	12/19/2025
v0.12.0	# vLLM v0.12.0 Release Notes Highlights ## Highlights This release features 474 commits from 213 contributors (57 new)！ Breaking Changes: This release includes PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including `xformers` backend, and scheduled removals - please review the changelog carefully. Major Features: * EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594). *	Low	12/3/2025
v0.11.2	This release includes 4 bug fixes on top of `v0.11.1`: - [BugFix] Ray with multiple nodes (https://github.com/vllm-project/vllm/pull/28873) - [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (https://github.com/vllm-project/vllm/pull/29036) - [BugFix] Fix async-scheduling + FlashAttn MLA (https://github.com/vllm-project/vllm/pull/28990) - [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (https://github.com/vllm-project/vllm/pull/28938)	Low	11/20/2025
v0.11.1	## Highlights This release includes 1456 commits from 449 contributors (184 new contributors)! Key changes include: * PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration. * Batch-invariant `torch.compile`: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and	Low	11/18/2025
v0.11.0	## Highlights This release features 538 commits, 207 contributors (65 new contributors)! * This release completes the removal of V0 engine. V0 engine code including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components have been removed. V1 is the only engine in the codebase now. * This releases turns on FULL_AND_PIECEWISE as the CUDA graph mode default. This should provide better out of the box performance for most models, particularly fine-grained	Low	10/2/2025
v0.10.2	## Highlights This release contains 740 commits from 266 contributors (97 new)! Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully. aarch64 support: This release features native support for aarch64 allowing usage of vLLM on GB200 platform. The docker image `vllm/vllm-openai` should already be multiplatform. To install the wheels, you can download the wheels from this release artifact or install vi	Low	9/13/2025
v0.10.1.1	This is a critical bugfix and security release: * Fix CUTLASS MLA Full CUDAGraph (#23200) * Limit HTTP header count and size (#23267): https://github.com/vllm-project/vllm/security/advisories/GHSA-rxc4-3w6r-4v47 * Do not use eval() to convert unknown types (#23266): https://github.com/vllm-project/vllm/security/advisories/GHSA-79j6-g2m3-jgfw Full Changelog: https://github.com/vllm-project/vllm/compare/v0.10.1...v0.10.1.1	Low	8/20/2025
v0.10.1	## Highlights v0.10.1 release includes 727 commits, 245 committers (105 new contributors). NOTE: This release deprecates V0 FA3 support and as a result FP8 kv-cache in V0 may have issues ### Model Support * New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665). * Vision-language models: Official Eag	Low	8/18/2025
v0.10.1rc1	## What's Changed * Deduplicate Transformers backend code using inheritance by @hmellor in https://github.com/vllm-project/vllm/pull/21461 * [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in https://github.com/vllm-project/vllm/pull/21205 * [TPU][Bugfix] fix moe layer by @yaochengji in https://github.com/vllm-project/vllm/pull/21340 * [v1][Core] Clean up usages of `SpecializedManager` by @zhouwfang in https://github.com/vllm-project/vllm/pull/21407 * [Misc] Fix duplicate FusedMoEConfi	Low	8/17/2025
v0.10.0	## Highlights v0.10.0 release includes 308 commits, 168 contributors (62 new!). NOTE: This release begins the cleanup of V0 engine codebase. We have removed V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far and plan to continued to delete code that is no longer used. ### Model Support * New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060),	Low	7/24/2025
v0.10.0rc2	## What's Changed * [Model] use AutoWeightsLoader for bart by @calvin0327 in https://github.com/vllm-project/vllm/pull/18299 * [Model] Support VLMs with transformers backend by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/20543 * [bugfix] fix syntax warning caused by backslash by @1195343015 in https://github.com/vllm-project/vllm/pull/21251 * [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in https://github.com/vllm-project/vllm/pull/21243 * [Docs] Add RFC	Low	7/24/2025
v0.10.0rc1	## What's Changed * [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864 * [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400 * [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510 * Change warn_for_unimplemented_methods to debug by @mg	Low	7/20/2025
v0.9.2	## Highlights This release contains 452 commits from 167 contributors (31 new!) NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to V1 engine. ### Engine Core * Priority Scheduling is now implemented in V1 engine (#19057), embedding models in V1 (#16188), Mamba2 in V1 (#19327). * Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph now has a l	Low	7/7/2025
v0.9.2rc2	## What's Changed * [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864 * [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400 * [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510 * Change warn_for_unimplemented_methods to debug by @mg	Low	7/6/2025
v0.9.2rc1	## What's Changed * [Docs] Note that alternative structured output backends are supported by @russellb in https://github.com/vllm-project/vllm/pull/19426 * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in https://github.com/vllm-project/vllm/pull/19440 * [Model] use AutoWeightsLoader for commandr by @py-andy-c in https://github.com/vllm-project/vllm/pull/19399 * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in https://github.co	Low	7/3/2025
v0.9.1	## Highlights This release features 274 commits, from 123 contributors (27 new contributors!) * Progress in large scale serving * DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762) * Heterogeneous TP (#18833), NixlConnector Enable FlashInfer backend (#19090) * DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ra	Low	6/10/2025
v0.9.1rc1	## What's Changed * [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in https://github.com/vllm-project/vllm/pull/18282 * [Neuron] Support quantization on neuron by @aws-satyajith in https://github.com/vllm-project/vllm/pull/18283 * Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in https://github.com/vllm-project/vllm/pull/18566 * [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in https://github.com/vllm-project/vllm/pu	Low	6/9/2025
v0.9.0.1	This patch release contains important bugfix for DeepSeek family of models on NVIDIA Ampere and below (#18807) Full Changelog: https://github.com/vllm-project/vllm/compare/v0.9.0...v0.9.0.1	Low	5/30/2025
v0.9.0	## Highlights This release features 649 commits, from 215 contributors (82 new contributors!) * vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependency. * The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheel on GitHub artifact. * As a general rule of thumb, our CUDA version policy follow PyTorch's CUDA version policy. * Enhanced NVIDIA Blackwell support. vLLM now ships with initial set of optimized	Low	5/15/2025
v0.8.5.post1	This post release contains two bug fix for memory leak and model accuracy * Fix Memory Leak in `_cached_reqs_data` (#17567) * Fix sliding window attention in V1 giving incorrect results (#17574) Full Changelog: https://github.com/vllm-project/vllm/compare/v0.8.5...v0.8.5.post1	Low	5/2/2025
v0.8.5	This release contains 310 commits from 143 contributors (55 new contributors!). ## Highlights This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling. ### Model Support * Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328). * Add ModernBERT (#16648) * Add Granite Speech Support (#16246) * Add PLaMo2 (#14323) * Add Kimi-VL model support (#	Low	4/28/2025
v0.8.4	This release contains 180 commits from 84 contributors (25 new contributors!). ## Highlights This release includes important accuracy fixes for Llama4 models, if you are using it, we highly recommend you to update. ### Model * Llama4 (#16113,#16509) bug fix and enhancements: * qknorm should be not shared across head (#16311) * Enable attention temperature tuning by default for long context (>32k) (#16439) * Index Error When Single Request Near Max Context (#16209) * Add tuned	Low	4/14/2025
v0.8.3	## Highlights This release features 260 commits, 109 contributors, 38 new contributors. * We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please [see our blog for detailed user guide](https://blog.vllm.ai/2025/04/05/llama4). * Please note that Llama4 is only supported in V1 engine only for now. * V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator. ### Cluster Scale Serving * Single node data parallel	Low	4/6/2025
v0.8.3rc1	## What's Changed * Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 by @houseroad in https://github.com/vllm-project/vllm/pull/15160 * [Hardware][TPU][Bugfix] Fix v1 mp profiler by @lsy323 in https://github.com/vllm-project/vllm/pull/15409 * [Kernel][CPU] CPU MLA by @gau-nernst in https://github.com/vllm-project/vllm/pull/14744 * Dockerfile.ppc64le changes to move to UBI by @Shafi-Hussain in https://github.com/vllm-project/vllm/pull/15402 * [Misc] C	Low	4/5/2025
v0.8.2	This release contains important bug fix for the V1 engine's memory usage. We highly recommend you upgrading! ## Highlights * Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377) * Remove openvino support in favor of external plugin (#15339) ### V1 Engine * Fix V1 Engine crash while handling requests with duplicate request id (#15043) * Support FP8 KV Cache (#14570, #15191) * Add flag to disable cascade attention (#15243) * Scheduler Refactoring: Add Sc	Low	3/23/2025
v0.8.1	This release contains important bug fixes for v0.8.0. We highly recommend upgrading! * V1 Fixes * Ensure using int64 for sampled token ids (#15065) * Fix long dtype in topk sampling (#15049) * Refactor Structured Output for multiple backends (#14694) * Fix size calculation of processing cache (#15114) * Optimize Rejection Sampler with Triton Kernels (#14930) * Fix oracle for device checking (#15104) * TPU * Fix chunked prefill with padding (#15037) * Enhanced CI/CD (#15054	Low	3/19/2025
v0.8.0	v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)! ## Highlights ### V1 We have now enabled V1 engine by default (#13726) for supported use cases. Please refer to [V1 user guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) for more detail. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please specify the environment variable `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason! * Su	Low	3/18/2025
v0.8.0rc2	## What's Changed * [V1] Remove input cache client by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/14864 * [Misc][XPU] Use None as device capacity for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/14932 * [Doc] Add vLLM Beijing meetup slide by @heheda12345 in https://github.com/vllm-project/vllm/pull/14938 * setup.py: drop assumption about local `main` branch by @russellb in https://github.com/vllm-project/vllm/pull/14692 * [MISC] More AMD unused var clean up by @hous	Low	3/17/2025
v0.8.0rc1	Note: vLLM no longer sets the global seed (#14274). Please set the `seed` parameter if you need to reproduce your results. ## What's Changed * Update `pre-commit`'s `isort` version to remove warnings by @hmellor in https://github.com/vllm-project/vllm/pull/13614 * [V1][Minor] Print KV cache size in token counts by @WoosukKwon in https://github.com/vllm-project/vllm/pull/13596 * fix neuron performance issue by @ajayvohra2005 in https://github.com/vllm-project/vllm/pull/13589 * [Frontend] A	Low	3/17/2025
v0.7.3	## Highlights 🎉 253 commits from 93 contributors, including 29 new contributors! * Deepseek enhancements: * Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755) * AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199) * Using FlashAttention3 for MLA (#12807) * Align the expert selection code path with official implementation (#13474) * Optimize moe_align_block_size for deepseek_v3 (#12850) * Expand MLA to support most types of	Low	2/20/2025
v0.7.2	## Highlights * Qwen2.5-VL is now supported in vLLM. Please note that it requires a source installation from Hugging Face `transformers` library at the moment (#12604) * Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be ran with arbitrary Hugging Face text models (#11330, #12785, #12727). * Performance enhancement to DeepSeek models. * Align KV caches entries to start 256 byte boundaries, yielding 43% throughput enhancement (#12676) * Apply `to	Low	2/6/2025
v0.7.1	## Highlights This release features MLA optimization for Deepseek family of models. Compared to v0.7.0 released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism * MLA Kernel (#12601, #12642,#12528). * FP8 Kernels (#11589, #11868, #12587) ### V1 For the V1 architecture, we * Added a new design document for zero overhead prefix caching [here](https://docs.vllm.ai/en/latest/design/v1/pre	Low	2/1/2025
v0.7.0	## Highlights * vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting environment variable `VLLM_USE_V1=1`. See [our blog](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) for more details. (44 commits). * New methods (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for the post training frameworks! (#12361, #12084, #12284). * `torch.compile` is now fully	Low	1/27/2025
v0.6.6.post1	This release restore functionalities for other quantized MoEs, which was introduced as part of initial DeepSeek V3 support 🙇 . ## What's Changed * [Docs] Document Deepseek V3 support by @simon-mo in https://github.com/vllm-project/vllm/pull/11535 * Update openai_compatible_server.md by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/11536 * [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/113	Low	12/27/2024
v0.6.6	## Highlights * Support Deepseek V3 (#11523, #11502) model. * On 8xH200s or MI300x: `vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192`. The context length can be increased to about 32K beyond running into memory issue. * For other devices, follow our [distributed inference](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) guide to enable tensor parallel and/or pipeline parallel inference * We are just getting started for	Low	12/27/2024
v0.6.5	## Highlights * Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #10699, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242). * Major improvements in `torch.compile` integration: Support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance	Low	12/17/2024
v0.6.4.post1	This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363), keep compatibility for `vLLMConfig` usage in out of tree models (#10356) ## What's Changed * Add default value to avoid Falcon crash (#5363) by @wchen61 in https://github.com/vllm-project/vllm/pull/10347 * [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/10349 * [Doc] Remove float32 choice from --lora-dtype by @xyang16 in https://gith	Low	11/15/2024
v0.6.4	## Highlights * Significant progress in V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can checkout more details regarding the design and plan ahead in our recent [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_2_130) * Signficant progres	Low	11/15/2024
v0.6.3.post1	## Highlights ### New Models * Support Ministral 3B and Ministral 8B via interleaved attention (#9414) * Support multiple and interleaved images for Llama3.2 (#9095) * Support VLM2Vec, the first multimodal embedding model in vLLM (#9303) ### Important bug fix * Fix chat API continuous usage stats (#9357) * Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034) * Fix Molmo text-only input bug (#9397) * Fix CUDA 11.8 Build (#9386) * Fix `_version.py` not fou	Low	10/17/2024

Dependencies & License Audit

Loading dependencies...

Similar Packages

ai-guideProvide free, open access to comprehensive AI tools, guides, reviews, and resources to reduce knowledge gaps and empower users.main@2026-07-19

UltraRAGA Low-Code MCP Framework for Building Complex and Innovative RAG Pipelinesv0.3.0.2

awesome-opensource-aiCurated list of the best truly open-source AI projects, models, tools, and infrastructure.main@2026-07-25

langroidHarness LLMs with Multi-Agent Programming0.65.10

ollamaGet up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.v0.32.3

More in RAG & Memory

edgequakeEdegQuake 🌋 High-performance GraphRAG inspired from LightRag written in Rust; Transform documents into intelligent knowledge graphs for superior retrieval and generation

awesome-opensource-aiCurated list of the best truly open-source AI projects, models, tools, and infrastructure.

rust-sdkThe official Rust SDK for the Model Context Protocol

pdf_oxideThe fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs.