# Description
<div align="center" id="sglangtop">
<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>

[PyPI](https://pypi.org/project/sglang)
[License](https://github.com/sgl-project/sglang/tree/main/LICENSE)
[Issues](https://github.com/sgl-project/sglang/issues)
[DeepWiki](https://deepwiki.com/sgl-project/sglang)

</div>

--------------------------------------------------------------------------------

<p align="center">
<a href="https://lmsys.org/blog/"><b>Blog</b></a> |
<a href="https://docs.sglang.io/"><b>Documentation</b></a> |
<a href="https://roadmap.sglang.io/"><b>Roadmap</b></a> |
<a href="https://slack.sglang.io/"><b>Join Slack</b></a> |
<a href="https://meet.sglang.io/"><b>Weekly Dev Meeting</b></a> |
<a href="https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides"><b>Slides</b></a>
</p>

## News

- [2026/02] 🔥 Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72 ([blog](https://lmsys.org/blog/2026-02-20-gb300-inferencex/)).
- [2026/01] 🔥 SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2026-01-16-sglang-diffusion/)).
- [2025/12] SGLang provides day-0 support for the latest open models ([MiMo-V2-Flash](https://lmsys.org/blog/2025-12-16-mimo-v2-flash/), [Nemotron 3 Nano](https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/), [Mistral Large 3](https://github.com/sgl-project/sglang/pull/14213), [LLaDA 2.0 Diffusion LLM](https://lmsys.org/blog/2025-12-19-diffusion-llm/), [MiniMax M2](https://lmsys.org/blog/2025-11-04-miminmax-m2/)).
- [2025/10] 🔥 SGLang now runs natively on TPU with the SGLang-Jax backend ([blog](https://lmsys.org/blog/2025-10-29-sglang-jax/)).
- [2025/09] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput ([blog](https://lmsys.org/blog/2025-09-25-gb200-part-2/)).
- [2025/09] SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention ([blog](https://lmsys.org/blog/2025-09-29-deepseek-V32/)).
- [2025/08] SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking ([Roadmap](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_roadmap.pdf), [Large-scale EP](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_ep.pdf), [Highlights](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_highlights.pdf), [AITER/MoRI](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_aiter_mori.pdf), [Wave](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_wave.pdf)).

<details>
<summary>More</summary>

- [2025/11] SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2025-11-07-sglang-diffusion/)).
- [2025/10] PyTorch Conference 2025 SGLang Talk ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf)).
- [2025/10] SGLang x Nvidia SF Meetup on 10/2 ([recap](https://x.com/lmsysorg/status/1975339501934510231)).
- [2025/08] SGLang provides day-0 support for the OpenAI gpt-oss model ([instructions](https://github.com/sgl-project/sglang/issues/8833)).
- [2025/06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z ([a16z blog](https://a16z.com/advancing-open-source-ai-through-benchmarks-and-bold-experimentation/)).
- [2025/06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput ([blog](https://lmsys.org/blog/2025-06-16-gb200-part-1/)).
- [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
- [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)).
- [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/)).
- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html)).
- [2025/01] SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchma

</details>
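The project described above serves models behind an OpenAI-compatible HTTP API. As a rough illustration of what a client request looks like, here is a minimal sketch using only the Python standard library; the port (30000), model name, and launch command are illustrative assumptions about a typical local deployment, not guarantees from this page.

```python
import json
from urllib import request

# Build an OpenAI-compatible chat completion request for a locally
# served model. Port and model name are illustrative defaults.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0,
    "max_tokens": 8,
}
req = request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Actually sending the request requires a running server, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
# resp = json.load(request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, the same payload also works with any OpenAI SDK pointed at the local base URL.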
# Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 0.5.10.post1 | Imported from PyPI (0.5.10.post1) | Low | 4/21/2026 |
| v0.5.10.post1 | **Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.5.10...v0.5.10.post1 Bumps flashinfer from v0.6.7.post2 to v0.6.7.post3 to resolve an issue in its jit cubin downloader. | Medium | 4/9/2026 |
| v0.5.10 | # Highlights - **Piecewise CUDA Graph Enabled by Default**: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331 - **Elastic EP for Partial Failure Tolerance**: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments; when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17 | Medium | 4/6/2026 |
| v0.5.10rc0 | # Highlights - **Piecewise CUDA Graph Enabled by Default**: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331 - **Elastic EP for Partial Failure Tolerance**: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments; when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17 | Medium | 3/28/2026 |
| v0.5.9 | # Highlights - **LoRA Weight Loading Overlap with Computation**: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% on large adaptors: #15512 - **TRT-LLM NSA Kernel Integration for DeepSeek V3.2**: Integrate TRT-LLM DSA kernels for Native Sparse Attention, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms with trtllm for both --nsa-prefill-backend and --nsa-decode-backend (with minor accuracy drop): #16758, #17662, #18 | Low | 2/24/2026 |
| v0.5.8 | # Highlights - Up to 1.5x faster across the board for all major diffusion models https://lmsys.org/blog/2026-01-16-sglang-diffusion/ - Close to linear scaling with chunked pipeline parallelism for super long million-token context https://lmsys.org/blog/2026-01-15-chunked-pipeline/ - Optimizing GLM4-MoE for Production: 65% Faster TTFT https://lmsy | Low | 1/23/2026 |
| gateway-v0.3.1 | ## SMG v0.3.1 Released! We're excited to announce SMG v0.3.1, a game-changing release with 10-12x performance improvement and 99% memory reduction in cache-aware routing, plus enterprise-grade security! ## Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains: ### Performance Improvements - Our cache-aware routing can now handle over 216,000 cache insertions p | Low | 1/9/2026 |
| v0.5.7 | ## Highlights - New Model Support: - Day 0 Support for Mimo-V2-Flash: #15207, https://lmsys.org/blog/2025-12-16-mimo-v2-flash/ - Day 0 Support for Nemotron-Nano-v3: https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/ - Day 0 Support for LLaDA 2.0: https://lmsys.org/blog/2025-12-19-diffusion-llm/ - [SGLang-Diffusion] Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered - EAGLE 3 speculative decoding draft mo | Low | 1/1/2026 |
| gateway-v0.3.0 | ## SGLang Model Gateway v0.3.0 Released! We're thrilled to announce SGLang Model Gateway v0.3.0, a major release with powerful new features, architectural improvements, and important breaking changes! ## Breaking Changes ### Metrics Architecture Redesigned Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes. **Action Required**: Update your Prome | Low | 12/24/2025 |
| gateway-v0.2.4 | ## SGLang Model Gateway v0.2.4 Released! We're excited to announce SGLang Model Gateway v0.2.4, a massive release focused on performance, security, and production-ready observability! ## Headline Features ### Major Performance Optimizations We've invested heavily in performance across the entire stack: - Optimized radix tree for cache-aware load balancing: smarter routing decisions with lower overhead - Tokenizer optimization: dramatically reduced CPU and memory footprint durin | Low | 12/10/2025 |
| v0.5.6 | ## Highlights - Support for DeepSeek V3.2/V3.2 Speciale #14249 - Blockwise diffusion language model support #12588 - Support for new diffusion models (Flux2 #14000, Z-image #14067) - Introduce JIT Kernels #13453 - Upgrade to Torch 2.9 #12969 - Kimi-K2-Thinking model enhancement #12882 - Memory management/Overlap spec compatibility #12224 #12839 - More performance optimization: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2... - CI/CD Enhancement ## What's Changed * [router][grpc] | Low | 12/3/2025 |
| gateway-v0.2.3 | ## SGLang Model Gateway - New Release! We're excited to announce another powerful update to **SGLang Model Gateway** with performance improvements and expanded database support! ### **Headline Features** **Bucket Mode Routing - 20-30% Performance Boost** Introducing our new **bucket-based routing algorithm** that dramatically improves performance in PD mode. See up to **20-30% improvements in TTFT (Time To First Token) and overall throughput** **PostgreSQL Support for Cha | Low | 11/17/2025 |
| gateway-v0.2.2 | ## SGLang Model Gateway v0.2.2 Released! ### **Features** **Industry-First Responses API for All Models** We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for **Llama, DeepSeek, Qwen**, and more, with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models. **Production-Ready Kubernetes Operat | Low | 11/17/2025 |
| gateway-v0.2.1 | ## SGLang Model Gateway v0.2.1 Released! This release focuses on stability, cleanup, and two big new performance features. ### Docs & CI - Updated router documentation to reflect recent feature additions ### Code Cleanup - Refactored StopSequenceDecoder for cleaner incremental decoding - Added spec.rs test harness under spec/ for structured unit tests ### Bug Fixes - Fixed UTF-8 boundary in stop-sequence decoding - Fixed gRPC timeout configuration - Fixed worker | Low | 11/17/2025 |
| gateway-v0.2.0 | ## Release: SGLang Model Gateway v0.2.0 (formerly "SGLang Router") ## 🔥 What's new ### Multi-Model Inference Gateway (IGW) Mode IGW turns one router into many, letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface. You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway han | Low | 11/17/2025 |
| gateway-v0.1.9 | ## What's Changed in Gateway ### Gateway Changes (10 commits) - [router] upgrade router version to 0.1.9 (#8844) by @slin1237 in https://github.com/sgl-project/sglang/pull/8844 - refactor(sgl-router): Replace `once_cell` with `LazyLock` in worker.rs and remove once_cell dependency from Cargo.toml (#8698) by @htiennv in https://github.com/sgl-project/sglang/pull/8698 - [router] fix req handling order, improve serialization, remove retry (#8888) by @slin1237 in https://github.com/sgl-proje | Low | 11/17/2025 |
| gateway-v0.1.8 | ## What's Changed in Gateway ### Gateway Changes (4 commits) - [router] upgrade router version to 0.1.8 (#8645) by @slin1237 in https://github.com/sgl-project/sglang/pull/8645 - [router] add basic usage doc (#8640) by @slin1237 in https://github.com/sgl-project/sglang/pull/8640 - [bugfix] fix router python parser for pd urls (#8644) by @slin1237 in https://github.com/sgl-project/sglang/pull/8644 - Fix typos in py_test/test_launch_server.py (#6227) by @windsonsea in https://github.com/sg | Low | 11/17/2025 |
| gateway-v0.1.7 | ## What's Changed in Gateway ### Gateway/Router Changes (11 commits) - [router] update router pypi version (#8628) by @slin1237 in https://github.com/sgl-project/sglang/pull/8628 - [router] migrate router from actix to axum (#8479) by @slin1237 in https://github.com/sgl-project/sglang/pull/8479 - [feature] [sgl-router] Add a dp-aware routing strategy (#6869) by @oldsharp in https://github.com/sgl-project/sglang/pull/6869 - [router] improve router logs and request id header (#8415) by @s | Low | 11/17/2025 |
| gateway-v0.1.6 | ## What's Changed in Gateway ### Gateway Changes (12 commits) - [router] upgade router version to 0.1.6 (#8209) by @slin1237 in https://github.com/sgl-project/sglang/pull/8209 - [router] add ut for pd router (#8208) by @slin1237 in https://github.com/sgl-project/sglang/pull/8208 - [router] add ut for pd request, metrics and config (#8184) by @slin1237 in https://github.com/sgl-project/sglang/pull/8184 - [router] add ut for worker and errors (#8170) by @slin1237 in https://github.com/sgl | Low | 11/17/2025 |
| v0.5.5 | ## Highlights - Day 0 support for Kimi-K2-Thinking https://huggingface.co/moonshotai/Kimi-K2-Thinking - Day 0 support for Minimax-M2 https://huggingface.co/MiniMaxAI/MiniMax-M2 - Video and image generation support https://lmsys.org/blog/2025-11-07-sglang-diffusion/ - Q4 Roadmap: https://github.com/sgl-project/sglang/issues/12780 - Blackwell kernel optimizations and MoE runner backend refactor - Overlap spec and prefill cuda graph support more models ## What's Changed * [8/n] decouple | Low | 11/6/2025 |
| v0.5.4 | ## Highlights - AMD AI Dev Day 2025 SGLang ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_amd_ai_devday_2025.pdf)), PyTorch Conference 2025 SGLang ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf)) - Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html - [beta] Overlap scheduler for speculative decoding: https://github.com/sgl-project/sglang/issues/11762 - [beta] Piecewi | Low | 10/26/2025 |
| v0.5.3 | ## Highlights - Day 0 Support for DeepSeek-V3.2 with Sparse Attention: https://lmsys.org/blog/2025-09-29-deepseek-V32/ - Deterministic inference on multiple attention backends: https://lmsys.org/blog/2025-09-22-sglang-deterministic/ - Integration of FlashAttention 4 prefill kernels - Enhancing support of Qwen3-Next with MTP, DP, optimized kernels and multiple hardware platforms - Support models including Qwen3-VL series, dots.vlm1, Ling-V2, Apertus, SOLAR ## What's Changed * [Auto Syn | Low | 10/6/2025 |
| v0.5.2 | ## Highlights - SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends: https://lmsys.org/blog/2025-09-10-sglang-hicache/ ## What's Changed * feat: allow use local branch to build image by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9546 * [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/9547 * [doc] deepseekv31 support by @XiaotongJiang in https://github.com/ | Low | 9/12/2025 |
| v0.5.1 | ## What's Changed * [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in https://github.com/sgl-project/sglang/pull/8595 * [bugifx] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. by @sighingnow in https://github.com/sgl-project/sglang/pull/8611 * Fix hf3fs_fuse import error by @ispobock in https://github.com/sgl-project/sglang/pull/8623 * Update step3v default config by @ispobock in https://github.com/sgl-project | Low | 8/23/2025 |
| v0.4.10 | ## Highlights This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blogs - Please check the 2025 H2 roadmap https://github.com/sgl-project/sglang/issues/7736 - GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities https://lmsys.org/blog/2025-07-31-glm4-5/ - SpecForge: Accelerating Speculative Decoding Training for SGLang https://lmsys.org/blog/2025-07-25-spec-forge/ - Deploying Kimi K2 with PD Disaggregati | Low | 7/31/2025 |
| v0.4.8 | ## Highlights ### OpenAI-Compatible Server Refactor Re-structured the OpenAI-compatible server to support production and enterprise environments. Key improvements include: - Consistent metrics and logging for better observability and debugging. - Unified error handling, request validation, and processing logic for improved reliability and maintainability. - Improved request tracking across sessions and components. - Fixed bugs in embedding requests and reasoning parsers. Thi | Low | 6/24/2025 |
| v0.4.7 | ## Highlights - The PD disaggregation and large-scale EP functionalities from the [blog post](https://lmsys.org/blog/2025-05-05-large-scale-ep/) have now been **fully merged into the latest release**. - The blog has been successfully [reproduced](https://github.com/sgl-project/sglang/issues/6017) by over six industry teams, including the **TensorRT LLM team**. - SGLang's large-scale EP is now actively used by leading organizations such as **Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek* | Low | 6/11/2025 |
| v0.4.6 | ## Highlights - Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc). https://github.com/sgl-project/sglang/issues/4709#issuecomment-2817728855 - PD disaggregation with mooncake and NIXL transfer backends #4880 #5477 #4655 - DeepSeek performance improvements: turn on DeepGemm by default and some kernel fusions. #5580 #5628 - Update torch to 2.6.0. Fix torch.compile cache. #5417 #5213 - Preliminary support for blackwell #5303 Thanks ver | Low | 4/27/2025 |
| v0.4.5 | # Highlights The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding. ## New Features - **Llama 4 Support**: We supported [Llama 4 model](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) with accuracy matching official benchmark numbers, achieving a zero-shot score of | Low | 4/7/2025 |
| v0.4.4 | ## Highlights The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, it can achieve nearly **100 tokens/s**, which is currently the fastest open-source implementation. Look out for new optimizations coming soon! Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source com | Low | 3/13/2025 |
| v0.4.3 | ## Highlights The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. In the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines. We stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel https://slack.sglang.ai Cheers! ### Performance Improvements #### DeepSeek V3/R1 Op | Low | 2/14/2025 |
| v0.4.1 | ## Highlights - We're excited to announce SGLang v0.4.1, which now supports [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) - currently the strongest open-source LLM, even surpassing GPT-4o. The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPU **from day one**. We've also supported MLA optimization and DP attention before, making SGLang one of the best open-source LLM engines for running DeepSeek models. Special thanks to Me | Low | 12/25/2024 |
| v0.4.0 | ## Highlights blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/ We're excited to release SGLang v0.4, featuring significant performance improvements and new features: - Zero-overhead batch scheduler: 1.1x increase in throughput. - Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate. - Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement. - Fast structured outputs with xgrammar: up to 10x faster. ## W | Low | 12/4/2024 |
| v0.3.6 | ## Highlights * Reduce CPU overhead by enabling overlap scheduler by default. **1.1x higher throughput**. (#2105, #2067, #2095) * Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (#1970, #2061) * Cache-aware load balancer. 4x higher cache hit rate (#1934) * Support xgrammar backend for grammar-guided decoding (#2056) * Support Prometheus metrics (#1853, #1981) * Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876) * Support graceful term | Low | 11/22/2024 |
| v0.3.4.post1 | ## Highlights - Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving. - Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. [Slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-lmsys-online-meetup-efficient-llm-deployment-and-serving) - Added Engine API for offline inference with reduced overhead. [Usage](https://github.com/sgl-project/sglang/blob/main/README.md#engine-without-http-server). #1614 #1567 - | Low | 10/22/2024 |
| v0.3.2 | ## Highlight - Support torch.compile, cuda graph for triton attention backend and DeepSeek MLA #1442 #1422 - Initial support for multi-LoRA serving #1307 - Integrate torchao for quantization #1341 - Optimize the CPU scheduler overhead - Multiple critical bug fixes for llama and llava (tokenizer, modality) - Support AMD backend #1420 - New models: MiniCPM3, OLMoE ## What's Changed * Remove useless fields in global_config.py by @merrymercy in https://github.com/sgl-project/sglang/pu | Low | 10/2/2024 |
| v0.3.0 | ## Highlights Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ to find detailed instructions and descriptions for the items below. - Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA) - Up to 1.5x lower latency with torch.compile on small batch sizes - Support for interleaved text and multi-image/video in LLaVA-OneVision - Support for interleaved window attention and 2x longer context length in Gemma-2 - Chunked prefill is turned on by de | Low | 9/19/2024 |
| v0.2.13 | ## Highlights * **New Feature**: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973) * **New Models**: Support embedding model e5-mistral (#983 #987 #988 #997 #1014) and comprehensive OpenAI-compatible API. * **Performance**: Accelerate Multi-head Latent Attention (MLA). Bring 2x end-to-end improvement on Deepseek v2 (#905). * **More CI Tests**: Accuracy test (multiple benchmarks), unit test (APIs, mo | Low | 9/19/2024 |
| v0.2.9 | ## Highlights - **New feature**: Chunked prefill (#800, #811) - **New models**: Deepseek v2 - **Performance improvement**: vectorized logprob computation - **Accuracy fix**: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels - **Feature fix**: fixed many missing logprob-related features in the OpenAI API server - **CI/CD infra** is now fully ready. The tests cover frontend, backend, accuracy, and performance tests. ## What's Cha | Low | 8/2/2024 |
| v0.2.5 | ## Highlights - We recently released a [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/). Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, and on A100 and H100 GPUs, using FP8 and FP16. **SGLang consistently outperforms vLLM**, achieving up to **3.1x** higher throughput on Llama-70B. It also often matches or sometimes outperforms TensorRT-LLM. | Low | 7/26/2024 |
| v0.2.0 | ## Highlights - We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/). - New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo ## What's Changed * Optimize me | Low | 7/25/2024 |
| v0.1.20 | ## Highlights * Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612) * Model support: Gemma2, minicpm, Qwen2 MoE * Docker support (#217 ) * Various latency optimizations ## What's Changed * Add docker file by @Ying1123 in https://github.com/sgl-project/sglang/pull/588 * Add Gemma2 by @Ying1123 in https://github.com/sgl-project/sglang/pull/592 * Format by @Ying1123 in https://github.com/sgl-project/sglang/pull/593 * Fix Llava model by @wisclmy0 | Low | 7/14/2024 |
| v0.1.18 | ## Highlight - 2x large batch prefill improvement with the new flashinfer kernels #579 - Multi-node tensor parallelism #550 - New model support: ChatGLM #516 ## What's Changed * Fix missing numpy dependency in pyproject.toml by @fpreiss in https://github.com/sgl-project/sglang/pull/524 * Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in https://github.com/sgl-project/sglang/pull/525 * [Minor] Correct Optional type hints in api by @fpreiss in https://github.com/sgl-project/s | Low | 7/4/2024 |
| v0.1.17 | ## Highlights - Add data parallelism #480 - Add speculative execution for OpenAI API #250 - Update vllm to v0.4.3 for new quantization features #511 - Better error handling (#457, #449, #514) ## What's Changed * [Feat] Add llava qwen, llava mistral by @kcz358 in https://github.com/sgl-project/sglang/pull/419 * Format code by @hnyls2002 in https://github.com/sgl-project/sglang/pull/441 * Add finish_reason to OpenAI API by @mgerstgrasser in https://github.com/sgl-project/sglang/pull/446 | Low | 6/8/2024 |
| v0.1.16 | ## Highlight * Support more models: DBRX, Command-R, Gemma * Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) * Cache performance improvements (#418, #364) * Marlin quantization kernels * Many bug fixes * Update dependencies to be compatible with their latest versions ## What's Changed * Fix Runtime missing some ServerArgs options by @Qubitium in https://github.com/sgl-project/sglang/pull/281 * adding the triton docker build minimal example by @ | Low | 5/14/2024 |
| v0.1.13 | ## Highlights * Gemma Support by @hnyls2002 in https://github.com/sgl-project/sglang/pull/256 * Add Together and AzureOpenAI examples by @merrymercy in https://github.com/sgl-project/sglang/pull/184 ## What's Changed * correct a mistake on the README.md by @yaya-sy in https://github.com/sgl-project/sglang/pull/182 * correct reference dtype openai.py by @yaya-sy in https://github.com/sgl-project/sglang/pull/181 * Add Together and AzureOpenAI examples by @merrymercy in https://github.com/s | Low | 3/11/2024 |
| v0.1.12 | ## Highlights - Fast JSON Decoding ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)) - Output logprobs for decoding tokens - Multiple bug fixes ## What's Changed * Fix no-cache mode by @Ying1123 in https://github.com/sgl-project/sglang/pull/136 * Support Faster JSON decoding for llava by @hnyls2002 in https://github.com/sgl-project/sglang/pull/137 * fix undfined variable by @yaya-sy in https://github.com/sgl-project/sglang/pull/142 * jump-forward rename by @hnyls2002 in https | Low | 2/11/2024 |
| v0.1.11 | ## Highlights - Serve the official release demo of LLaVA v1.6 [blog](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/) - Support Yi-VL [example](https://github.com/sgl-project/sglang/blob/main/examples/quick_start/srt_example_yi_vl.py) - Faster JSON decoding [blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/) - Support QWen 2 ## What's Changed * Fix the error message and dependency of openai backend by @merrymercy in https://github.com/sgl-project/sglang/pull/71 * Add an asyn | Low | 2/3/2024 |
| v0.1.6 | ## Major features - Add OpenAI-compatible API server (Completion and ChatCompletion) - Fix `sgl.select` ## All PRs * Support v1/chat/completions by @comaniac in https://github.com/sgl-project/sglang/pull/50 * Fix select and normalized logprobs by @merrymercy in https://github.com/sgl-project/sglang/pull/67 * Bump version to 0.1.5 by @merrymercy in https://github.com/sgl-project/sglang/pull/33 * Use HTTP link in 3rdparty module by @comaniac in https://github.com/sgl-project/sglang/pull/4 | Low | 1/21/2024 |
| v0.1.5 | ## What's Changed * Fix for T4 GPUs by @Ying1123 in https://github.com/sgl-project/sglang/pull/16 * Gemini Backend by @caoshiyi in https://github.com/sgl-project/sglang/pull/9 * Teak mem fraction by @merrymercy in https://github.com/sgl-project/sglang/pull/20 * Add option to return metadata in async streaming by @BabyChouSr in https://github.com/sgl-project/sglang/pull/18 * Expose more arguments to control the scheduling policy by @merrymercy in https://github.com/sgl-project/sglang/pull/32 | Low | 1/18/2024 |
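Several entries above (gateway v0.3.1's radix-tree overhaul, gateway v0.2.4's optimized radix tree, v0.4.0's cache-aware load balancer) revolve around cache-aware routing: send each request to the worker that already holds the longest matching KV-cache prefix, so more of the prompt is served from cache. The toy sketch below illustrates only that routing policy under simplifying assumptions; the real gateway tracks prefixes in a Rust radix tree for efficiency, and all names here are hypothetical.

```python
from typing import Dict, List

def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt: List[str], caches: Dict[str, List[List[str]]]) -> str:
    """Route to the worker whose cached sequences overlap the prompt most.

    `caches` maps worker name -> list of token sequences it has cached.
    A linear scan is enough to show the policy; a production router
    would use a radix tree instead of comparing every sequence.
    """
    best_worker, best_overlap = None, -1
    for worker, seqs in caches.items():
        overlap = max((shared_prefix_len(prompt, s) for s in seqs), default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker

caches = {
    "worker-a": [["sys", "doc1", "q1"]],
    "worker-b": [["sys", "doc2", "q2"], ["sys", "doc2", "q3"]],
}
# A prompt that reuses doc2's prefix should land on worker-b,
# maximizing KV-cache reuse.
print(pick_worker(["sys", "doc2", "q9"], caches))  # worker-b
```

The tie-breaking and load-balancing behavior of the actual gateway (e.g. falling back to load when overlaps are equal) is not modeled here.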
