freshcrate
Skin:/

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Why this rank:Strong adoptionRecent releaseHealthy release cadence

Description

A high-throughput and memory-efficient inference and serving engine for LLMs

README

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |

🔥 We have built a vllm website to help you get started with vllm. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.


About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has grown into one of the most active open-source AI projects built and maintained by a diverse community of many dozens of academic institutions and companies from over 2000 contributors.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests, chunked prefill, prefix caching
  • Fast and flexible model execution with piecewise and full CUDA/HIP graphs
  • Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and more
  • Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
  • Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
  • Speculative decoding including n-gram, suffix, EAGLE, DFlash
  • Automatic kernel generation and graph-level transformations using torch.compile
  • Disaggregated prefill, decode, and encode

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data, expert, and context parallelism for distributed inference
  • Streaming outputs
  • Generation of structured outputs using xgrammar or guidance
  • Tool calling and reasoning parsers
  • OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
  • Efficient multi-LoRA support for dense and MoE layers
  • Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.

vLLM seamlessly supports 200+ model architectures on HuggingFace, including:

  • Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
  • Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
  • Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
  • Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
  • Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
  • Reward and classification models (e.g., Qwen-Math)

Find the full list of supported models here.

Getting Started

Install vLLM with uv (recommended) or pip:

uv pip install vllm

Or build from source for development.

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Contact Us

  • For technical questions and feature requests, please use GitHub Issues
  • For discussing with fellow users, please use the vLLM Forum
  • For coordinating contributions and development, please use Slack
  • For security disclosures, please use GitHub's Security Advisories feature
  • For collaborations and partnerships, please contact us at collaboration@vllm.ai

Media Kit

Release History

VersionChangesUrgencyDate
v0.22.1## Highlights This release features 8 commits from 6 contributors (1 new)! v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions. ### Model Support * New model: JetBrains' **Mellum v2**, an open-weights Mixture-of-Experts High6/5/2026
v0.22.0## Highlights This release features 459 commits from 230 contributors (63 new)! * **DeepSeek V4 maturity**: DeepSeek V4 received a major hardening pass this cycle — the model was reorganized into a dedicated `vllm/models/deepseek_v4/` package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, `mhc`, Q-norm, indexer, sparse MLA) and ROCm parity High5/29/2026
v0.21.0## Highlights This release features 367 commits from 202 contributors (49 new)! * **Transformers v4 deprecated**: This release formally deprecates `transformers` v4 support (#40389). Users should migrate to `transformers` v5. * **C++20 build requirement**: vLLM now requires a C++20-compatible compiler for compatibility with PyTorch (#40380). This is a **breaking build change**. * **KV Offload + Hybrid Memory Allocator (HMA)**: The KV offloading subsystem now integrates with the Hybrid MeHigh5/15/2026
v0.20.2# vLLM v0.20.2 ## Highlights This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL ### Bug Fixes * **DeepSeek V4 sparse attention**: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of `max_seq_len`, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605). * **DeepSeek V4 KV cache**: Fixed a "failure to allocate KV blocHigh5/10/2026
v0.20.1# vLLM v0.20.1 This is a patch release on top of `v0.20.0` primarily focused on **DeepSeek V4 stabilization and performance improvements**, along with several important bug fixes. ### DeepSeek V4 * Base model support (#41006). * Multi-stream pre-attention GEMM (#41061), configurable pre-attn GEMM knob (#41443), and tuned default `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` (#41526). * BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960). * PTX `cvt` instructionHigh5/3/2026
v0.20.0# vLLM v0.20.0 ## Highlights This release features 752 commits from 320 contributors (123 new)! * **DeepSeek V4**: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950). * **CUDA 13.0 default**: Default CUDA wheel on PyPI and `vllm/vllm-openai:v0.20.0` image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to matcHigh4/27/2026
v0.19.1This is a patch release on top of `v0.19.0` with Transformers v5.5.4 upgrade and bug fixes for Gemma4: - Update to transformers v5 (#30566) - [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992) - [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909) - [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114) - [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027) - [High4/18/2026
v0.19.0# vLLM v0.19.0 ## Highlights This release features 448 commits from 197 contributors (54 new)! * **Gemma 4 support**: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires `transformers>=5.5.0`. We recommend using pre-built docker image `vllm/vllm-openai:gemma4` for out of box usage. * **Zero-bubble async scheduling + speculative decoding**: Async scheduling now supports speculative decoding with zero-bubble ovHigh4/3/2026
v0.18.1This is a patch release on top of v0.18.0 to address a few issues: - Change default SM100 MLA prefill backend back to TRT-LLM (#38562) - Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158) - Disable monolithic TRTLLM MoE for Renormalize routing #37605 - Pre-download missing FlashInfer headers in Docker build #38391 - Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083) Medium3/31/2026
v0.18.0# vLLM v0.18.0 ## Known issues - Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618) - If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug. ## Highlights This release features 445 commits from 213 contributors (61 new)! * **gRPC Serving Support**: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling Low3/20/2026
v0.17.1This is a patch release on top of `v0.17.0` to address a few issues: - New Model: Nemotron 3 Super - Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 (#36017) - Fix/resupport nongated fused moe triton (#36412) - Re-enable EP for trtllm MoE FP8 backend (#36494) - [Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219) - Fix TRTLLM Block FP8 MoE Monolithic (#36296) - [DSV3.2][MTP] Optimize Indexer MTP handling (#36723)Low3/11/2026
v0.17.0# vLLM v0.17.0 **Known Issue**: If you are on CUDA 12.9+ and encounter a `CUBLAS_STATUS_INVALID_VALUE` error, this is caused by a CUDA library mismatch. To resolve, try one of the following: 1. Remove the path to system CUDA shared library files (e.g. `/usr/local/cuda`) from `LD_LIBRARY_PATH`, or simply `unset LD_LIBRARY_PATH`. 2. Install vLLM with `uv pip install vllm --torch-backend=auto`. 3. Install vLLM with `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` (cLow3/7/2026
v0.16.0# vLLM v0.16.0 Please note that this release was branch cut on Feb 8, so any features added to vLLM after that date is not included. ## Highlights This release features 440 commits from 203 contributors (7 new)! * **Async scheduling + Pipeline Parallelism** is now fully supported, delivering **30.8% E2E throughput improvement** and **31.8% TPOT improvement** (#32618). * **Realtime API**: A new WebSocket-based Realtime API enables streaming audio interactions (#33187), building on the Low2/25/2026
v0.15.1v0.15.1 is a patch release with security fixes, RTX Blackwell GPU fixes support, and bug fixes. ## Security - **CVE-2025-69223**: Updated aiohttp dependency (#33621) - **CVE-2026-0994**: Updated Protobuf dependency (#33619) ## Highlights ### Bugfix Hardware Support - **RTX Blackwell (SM120)**: Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#33417) - **FP8 kernel selection**: Fixed FP8 CUTLASS group Low2/4/2026
v0.15.0## Highlights This release features 335 commits from 158 contributors (39 new)! ### Model Support * **New architectures**: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456). * **LoRA expansion**: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763). * **Speculative decoding**: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322). * **Embeddings**: BGE-MLow1/29/2026
v0.14.1This is a patch release on top of `v0.14.0` to address a few security and memory leak fixes.Low1/24/2026
v0.14.0## Highlights This release features approximately 660 commits from 251 contributors (86 new contributors). **Breaking Changes:** - **Async scheduling is now enabled by default** - Users who experience issues can disable with `--no-async-scheduling`. - Excludes some not-yet-supported configurations: pipeline parallel, CPU backend, non-MTP/Eagle spec decoding. - **PyTorch 2.9.1** is now required and the default wheel is compiled against cu129. - Deprecated quantization schemes have beLow1/20/2026
v0.13.0# vLLM v0.13.0 Release Notes Highlights ## Highlights This release features **442 commits from 207 contributors (61 new contributors)!** **Breaking Changes**: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading. ### Model Support * **New models**: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latLow12/19/2025
v0.12.0# vLLM v0.12.0 Release Notes Highlights ## Highlights This release features 474 commits from 213 contributors (57 new)! **Breaking Changes**: This release includes PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including `xformers` backend, and scheduled removals - please review the changelog carefully. **Major Features**: * **EAGLE Speculative Decoding Improvements**: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594). *Low12/3/2025
v0.11.2This release includes 4 bug fixes on top of `v0.11.1`: - [BugFix] Ray with multiple nodes (https://github.com/vllm-project/vllm/pull/28873) - [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (https://github.com/vllm-project/vllm/pull/29036) - [BugFix] Fix async-scheduling + FlashAttn MLA (https://github.com/vllm-project/vllm/pull/28990) - [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (https://github.com/vllm-project/vllm/pull/28938)Low11/20/2025
v0.11.1## Highlights This release includes 1456 commits from 449 contributors (184 new contributors)! Key changes include: * **PyTorch 2.9.0 + CUDA 12.9.1**: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration. * **Batch-invariant `torch.compile`**: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper andLow11/18/2025
v0.11.0## Highlights This release features 538 commits, 207 contributors (65 new contributors)! * This release completes the removal of V0 engine. V0 engine code including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components have been removed. **V1 is the only engine in the codebase now.** * This releases turns on **FULL_AND_PIECEWISE as the CUDA graph mode default**. This should provide better out of the box performance for most models, particularly fine-grained Low10/2/2025
v0.10.2## Highlights This release contains 740 commits from 266 contributors (97 new)! **Breaking Changes**: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully. **aarch64 support**: This release features native support for aarch64 allowing usage of vLLM on GB200 platform. The docker image `vllm/vllm-openai` should already be multiplatform. To install the wheels, you can download the wheels from this release artifact or install viLow9/13/2025
v0.10.1.1This is a critical bugfix and security release: * Fix CUTLASS MLA Full CUDAGraph (#23200) * Limit HTTP header count and size (#23267): https://github.com/vllm-project/vllm/security/advisories/GHSA-rxc4-3w6r-4v47 * Do not use eval() to convert unknown types (#23266): https://github.com/vllm-project/vllm/security/advisories/GHSA-79j6-g2m3-jgfw **Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.10.1...v0.10.1.1Low8/20/2025
v0.10.1## Highlights v0.10.1 release includes 727 commits, 245 committers (105 new contributors). **NOTE: This release deprecates V0 FA3 support and as a result FP8 kv-cache in V0 may have issues** ### Model Support * **New model families**: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665). * **Vision-language models**: Official EagLow8/18/2025
v0.10.1rc1## What's Changed * Deduplicate Transformers backend code using inheritance by @hmellor in https://github.com/vllm-project/vllm/pull/21461 * [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in https://github.com/vllm-project/vllm/pull/21205 * [TPU][Bugfix] fix moe layer by @yaochengji in https://github.com/vllm-project/vllm/pull/21340 * [v1][Core] Clean up usages of `SpecializedManager` by @zhouwfang in https://github.com/vllm-project/vllm/pull/21407 * [Misc] Fix duplicate FusedMoEConfiLow8/17/2025
v0.10.0## Highlights v0.10.0 release includes 308 commits, 168 contributors (62 new!). **NOTE: This release begins the cleanup of V0 engine codebase.** We have removed V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far and plan to continued to delete code that is no longer used. ### Model Support * New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Low7/24/2025
v0.10.0rc2## What's Changed * [Model] use AutoWeightsLoader for bart by @calvin0327 in https://github.com/vllm-project/vllm/pull/18299 * [Model] Support VLMs with transformers backend by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/20543 * [bugfix] fix syntax warning caused by backslash by @1195343015 in https://github.com/vllm-project/vllm/pull/21251 * [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in https://github.com/vllm-project/vllm/pull/21243 * [Docs] Add RFCLow7/24/2025
v0.10.0rc1## What's Changed * [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864 * [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400 * [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510 * Change warn_for_unimplemented_methods to debug by @mgLow7/20/2025
v0.9.2## Highlights This release contains 452 commits from 167 contributors (31 new!) **NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to V1 engine.** ### Engine Core * Priority Scheduling is now implemented in V1 engine (#19057), embedding models in V1 (#16188), Mamba2 in V1 (#19327). * Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph now has a lLow7/7/2025
v0.9.2rc2## What's Changed * [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864 * [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400 * [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510 * Change warn_for_unimplemented_methods to debug by @mgLow7/6/2025
v0.9.2rc1## What's Changed * [Docs] Note that alternative structured output backends are supported by @russellb in https://github.com/vllm-project/vllm/pull/19426 * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in https://github.com/vllm-project/vllm/pull/19440 * [Model] use AutoWeightsLoader for commandr by @py-andy-c in https://github.com/vllm-project/vllm/pull/19399 * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in https://github.coLow7/3/2025
v0.9.1## Highlights This release features **274 commits, from 123 contributors (27 new contributors!)** * Progress in large scale serving * DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762) * Heterogeneous TP (#18833), NixlConnector Enable FlashInfer backend (#19090) * DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with RaLow6/10/2025
v0.9.1rc1## What's Changed * [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in https://github.com/vllm-project/vllm/pull/18282 * [Neuron] Support quantization on neuron by @aws-satyajith in https://github.com/vllm-project/vllm/pull/18283 * Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in https://github.com/vllm-project/vllm/pull/18566 * [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in https://github.com/vllm-project/vllm/puLow6/9/2025
v0.9.0.1This patch release contains important bugfix for DeepSeek family of models on NVIDIA Ampere and below (#18807) **Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.9.0...v0.9.0.1Low5/30/2025
v0.9.0## Highlights This release features 649 commits, from 215 contributors (82 new contributors!) * vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependency. * The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheel on GitHub artifact. * As a general rule of thumb, our CUDA version policy follow PyTorch's CUDA version policy. * Enhanced NVIDIA Blackwell support. vLLM now ships with initial set of optimizedLow5/15/2025
v0.8.5.post1This post release contains two bug fix for memory leak and model accuracy * Fix Memory Leak in `_cached_reqs_data` (#17567) * Fix sliding window attention in V1 giving incorrect results (#17574) **Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.8.5...v0.8.5.post1Low5/2/2025
v0.8.5This release contains 310 commits from 143 contributors (55 new contributors!). ## Highlights This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling. ### Model Support * Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328). * Add ModernBERT (#16648) * Add Granite Speech Support (#16246) * Add PLaMo2 (#14323) * Add Kimi-VL model support (#Low4/28/2025
v0.8.4This release contains 180 commits from 84 contributors (25 new contributors!). ## Highlights This release includes important accuracy fixes for Llama4 models, if you are using it, we highly recommend you to update. ### Model * Llama4 (#16113,#16509) bug fix and enhancements: * qknorm should be not shared across head (#16311) * Enable attention temperature tuning by default for long context (>32k) (#16439) * Index Error When Single Request Near Max Context (#16209) * Add tunedLow4/14/2025
v0.8.3## Highlights This release features 260 commits, 109 contributors, 38 new contributors. * We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please [see our blog for detailed user guide](https://blog.vllm.ai/2025/04/05/llama4). * Please note that Llama4 is only supported in V1 engine only for now. * V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator. ### Cluster Scale Serving * Single node data parallel Low4/6/2025
v0.8.3rc1## What's Changed * Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 by @houseroad in https://github.com/vllm-project/vllm/pull/15160 * [Hardware][TPU][Bugfix] Fix v1 mp profiler by @lsy323 in https://github.com/vllm-project/vllm/pull/15409 * [Kernel][CPU] CPU MLA by @gau-nernst in https://github.com/vllm-project/vllm/pull/14744 * Dockerfile.ppc64le changes to move to UBI by @Shafi-Hussain in https://github.com/vllm-project/vllm/pull/15402 * [Misc] CLow4/5/2025
v0.8.2This release contains important bug fix for the V1 engine's memory usage. We highly recommend you upgrading! ## Highlights * Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377) * Remove openvino support in favor of external plugin (#15339) ### V1 Engine * Fix V1 Engine crash while handling requests with duplicate request id (#15043) * Support FP8 KV Cache (#14570, #15191) * Add flag to disable cascade attention (#15243) * Scheduler Refactoring: Add ScLow3/23/2025
v0.8.1This release contains important bug fixes for v0.8.0. We highly recommend upgrading! * V1 Fixes * Ensure using int64 for sampled token ids (#15065) * Fix long dtype in topk sampling (#15049) * Refactor Structured Output for multiple backends (#14694) * Fix size calculation of processing cache (#15114) * Optimize Rejection Sampler with Triton Kernels (#14930) * Fix oracle for device checking (#15104) * TPU * Fix chunked prefill with padding (#15037) * Enhanced CI/CD (#15054Low3/19/2025
v0.8.0v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)! ## Highlights ### V1 We have now enabled V1 engine by default (#13726) for supported use cases. Please refer to [V1 user guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) for more detail. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please specify the environment variable `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason! * SuLow3/18/2025
v0.8.0rc2## What's Changed * [V1] Remove input cache client by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/14864 * [Misc][XPU] Use None as device capacity for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/14932 * [Doc] Add vLLM Beijing meetup slide by @heheda12345 in https://github.com/vllm-project/vllm/pull/14938 * setup.py: drop assumption about local `main` branch by @russellb in https://github.com/vllm-project/vllm/pull/14692 * [MISC] More AMD unused var clean up by @housLow3/17/2025
v0.8.0rc1Note: vLLM no longer sets the global seed (#14274). Please set the `seed` parameter if you need to reproduce your results. ## What's Changed * Update `pre-commit`'s `isort` version to remove warnings by @hmellor in https://github.com/vllm-project/vllm/pull/13614 * [V1][Minor] Print KV cache size in token counts by @WoosukKwon in https://github.com/vllm-project/vllm/pull/13596 * fix neuron performance issue by @ajayvohra2005 in https://github.com/vllm-project/vllm/pull/13589 * [Frontend] ALow3/17/2025
v0.7.3## Highlights 🎉 253 commits from 93 contributors, including 29 new contributors! * Deepseek enhancements: * Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755) * AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199) * Using FlashAttention3 for MLA (#12807) * Align the expert selection code path with official implementation (#13474) * Optimize moe_align_block_size for deepseek_v3 (#12850) * Expand MLA to support most types ofLow2/20/2025
v0.7.2## Highlights * Qwen2.5-VL is now supported in vLLM. Please note that it requires a source installation from Hugging Face `transformers` library at the moment (#12604) * Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be ran with arbitrary Hugging Face text models (#11330, #12785, #12727). * Performance enhancement to DeepSeek models. * Align KV caches entries to start 256 byte boundaries, yielding 43% throughput enhancement (#12676) * Apply `toLow2/6/2025
v0.7.1## Highlights This release features MLA optimization for Deepseek family of models. Compared to v0.7.0 released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism * MLA Kernel (#12601, #12642,#12528). * FP8 Kernels (#11589, #11868, #12587) ### V1 For the V1 architecture, we * Added a new design document for zero overhead prefix caching [here](https://docs.vllm.ai/en/latest/design/v1/preLow2/1/2025
v0.7.0## Highlights * vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting environment variable `VLLM_USE_V1=1`. See [our blog](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) for more details. (44 commits). * New methods (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for the post training frameworks! (#12361, #12084, #12284). * `torch.compile` is now fully Low1/27/2025
v0.6.6.post1This release restore functionalities for other quantized MoEs, which was introduced as part of initial DeepSeek V3 support 🙇 . ## What's Changed * [Docs] Document Deepseek V3 support by @simon-mo in https://github.com/vllm-project/vllm/pull/11535 * Update openai_compatible_server.md by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/11536 * [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/113Low12/27/2024
v0.6.6## Highlights * Support Deepseek V3 (#11523, #11502) model. * On 8xH200s or MI300x: `vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192`. The context length can be increased to about 32K beyond running into memory issue. * For other devices, follow our [distributed inference](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) guide to enable tensor parallel and/or pipeline parallel inference * We are just getting started forLow12/27/2024
v0.6.5## Highlights * Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #10699, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242). * Major improvements in `torch.compile` integration: Support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance Low12/17/2024
v0.6.4.post1This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363), keep compatibility for `vLLMConfig` usage in out of tree models (#10356) ## What's Changed * Add default value to avoid Falcon crash (#5363) by @wchen61 in https://github.com/vllm-project/vllm/pull/10347 * [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/10349 * [Doc] Remove float32 choice from --lora-dtype by @xyang16 in https://githLow11/15/2024
v0.6.4## Highlights * Significant progress in V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can checkout more details regarding the design and plan ahead in our recent [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_2_130) * Signficant progresLow11/15/2024
v0.6.3.post1## Highlights ### New Models * Support Ministral 3B and Ministral 8B via interleaved attention (#9414) * Support multiple and interleaved images for Llama3.2 (#9095) * Support VLM2Vec, the first multimodal embedding model in vLLM (#9303) ### Important bug fix * Fix chat API continuous usage stats (#9357) * Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034) * Fix Molmo text-only input bug (#9397) * Fix CUDA 11.8 Build (#9386) * Fix `_version.py` not fouLow10/17/2024

Dependencies & License Audit

Loading dependencies...

Similar Packages

ai-guideProvide free, open access to comprehensive AI tools, guides, reviews, and resources to reduce knowledge gaps and empower users.main@2026-06-04
UltraRAGA Low-Code MCP Framework for Building Complex and Innovative RAG Pipelinesv0.3.0.2
RAG🧠 Build an offline RAG chatbot to answer questions from PDFs, adapting responses based on user experience levels with a smooth chat interface.main@2026-06-06
ComfyUI-LoaderUtils🔄 Optimize model loading in ComfyUI with flexible node connections and controlled sequences for better performance and memory management.main@2026-06-06
mcp-rag-agent🔍 Build a production-ready RAG system that combines LangGraph and MCP integration for precise, context-aware AI-driven question answering.main@2026-06-06

More in RAG & Memory

spiceaiA portable accelerated SQL query, search, and LLM-inference engine, written in Rust, for data-grounded AI apps and agents.
awesome-opensource-aiCurated list of the best truly open-source AI projects, models, tools, and infrastructure.
antflyNo description
generative-ai-for-beginners21 Lessons, Get Started Building with Generative AI