# Description
<div align="center" id="sglangtop">
<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>

[PyPI](https://pypi.org/project/sglang)
[License](https://github.com/sgl-project/sglang/tree/main/LICENSE)
[Issues](https://github.com/sgl-project/sglang/issues)
[DeepWiki](https://deepwiki.com/sgl-project/sglang)

</div>

--------------------------------------------------------------------------------

<p align="center">
<a href="https://lmsys.org/blog/"><b>Blog</b></a> |
<a href="https://docs.sglang.io/"><b>Documentation</b></a> |
<a href="https://roadmap.sglang.io/"><b>Roadmap</b></a> |
<a href="https://slack.sglang.io/"><b>Join Slack</b></a> |
<a href="https://meet.sglang.io/"><b>Weekly Dev Meeting</b></a> |
<a href="https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides"><b>Slides</b></a>
</p>

## News

- [2026/02] 🔥 Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72 ([blog](https://lmsys.org/blog/2026-02-20-gb300-inferencex/)).
- [2026/01] 🔥 SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2026-01-16-sglang-diffusion/)).
- [2025/12] SGLang provides day-0 support for the latest open models ([MiMo-V2-Flash](https://lmsys.org/blog/2025-12-16-mimo-v2-flash/), [Nemotron 3 Nano](https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/), [Mistral Large 3](https://github.com/sgl-project/sglang/pull/14213), [LLaDA 2.0 Diffusion LLM](https://lmsys.org/blog/2025-12-19-diffusion-llm/), [MiniMax M2](https://lmsys.org/blog/2025-11-04-miminmax-m2/)).
- [2025/10] 🔥 SGLang now runs natively on TPU with the SGLang-Jax backend ([blog](https://lmsys.org/blog/2025-10-29-sglang-jax/)).
- [2025/09] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput ([blog](https://lmsys.org/blog/2025-09-25-gb200-part-2/)).
- [2025/09] SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention ([blog](https://lmsys.org/blog/2025-09-29-deepseek-V32/)).
- [2025/08] SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking ([Roadmap](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_roadmap.pdf), [Large-scale EP](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_ep.pdf), [Highlights](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_highlights.pdf), [AITER/MoRI](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_aiter_mori.pdf), [Wave](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_wave.pdf)).

<details>
<summary>More</summary>

- [2025/11] SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2025-11-07-sglang-diffusion/)).
- [2025/10] PyTorch Conference 2025 SGLang Talk ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf)).
- [2025/10] SGLang x Nvidia SF Meetup on 10/2 ([recap](https://x.com/lmsysorg/status/1975339501934510231)).
- [2025/08] SGLang provides day-0 support for the OpenAI gpt-oss model ([instructions](https://github.com/sgl-project/sglang/issues/8833)).
- [2025/06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z ([a16z blog](https://a16z.com/advancing-open-source-ai-through-benchmarks-and-bold-experimentation/)).
- [2025/06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput ([blog](https://lmsys.org/blog/2025-06-16-gb200-part-1/)).
- [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
- [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)).
- [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/)).
- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html)).
- [2025/01] SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchma

</details>
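The project described above serves models behind an OpenAI-compatible HTTP API. As a rough illustration of what a client request looks like, here is a minimal sketch using only the Python standard library; the port (30000), model name, and launch command are illustrative assumptions about a typical local deployment, not guarantees from this page.

```python
import json
from urllib import request

# Build an OpenAI-compatible chat completion request for a locally
# served model. Port and model name are illustrative defaults.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0,
    "max_tokens": 8,
}
req = request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Actually sending the request requires a running server, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
# resp = json.load(request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, the same payload also works with any OpenAI SDK pointed at the local base URL.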
# Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 0.5.10.post1 | Imported from PyPI (0.5.10.post1) | Low | 4/21/2026 |
| v0.5.10.post1 | **Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.5.10...v0.5.10.post1 Bumps flashinfer from v0.6.7.post2 to v0.6.7.post3 to resolve an issue in its jit cubin downloader. | Medium | 4/9/2026 |
| v0.5.10 | # Highlights - **Piecewise CUDA Graph Enabled by Default**: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331 - **Elastic EP for Partial Failure Tolerance**: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments; when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17 | Medium | 4/6/2026 |
| v0.5.10rc0 | # Highlights - **Piecewise CUDA Graph Enabled by Default**: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331 - **Elastic EP for Partial Failure Tolerance**: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments; when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17 | Medium | 3/28/2026 |
| v0.5.9 | # Highlights - **LoRA Weight Loading Overlap with Computation**: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% on large adaptors: #15512 - **TRT-LLM NSA Kernel Integration for DeepSeek V3.2**: Integrate TRT-LLM DSA kernels for Native Sparse Attention, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms with trtllm for both --nsa-prefill-backend and --nsa-decode-backend (with minor accuracy drop): #16758, #17662, #18 | Low | 2/24/2026 |
| v0.5.8 | # Highlights - Up to 1.5x faster across the board for all major diffusion models https://lmsys.org/blog/2026-01-16-sglang-diffusion/ - Close to linear scaling with chunked pipeline parallelism for super long million-token context https://lmsys.org/blog/2026-01-15-chunked-pipeline/ - Optimizing GLM4-MoE for Production: 65% Faster TTFT https://lmsy | Low | 1/23/2026 |
| gateway-v0.3.1 | ## SMG v0.3.1 Released! We're excited to announce SMG v0.3.1, a game-changing release with 10-12x performance improvement and 99% memory reduction in cache-aware routing, plus enterprise-grade security! ## Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains: ### Performance Improvements - Our cache-aware routing can now handle over 216,000 cache insertions p | Low | 1/9/2026 |
| v0.5.7 | ## Highlights - New Model Support: - Day 0 Support for Mimo-V2-Flash: #15207, https://lmsys.org/blog/2025-12-16-mimo-v2-flash/ - Day 0 Support for Nemotron-Nano-v3: https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/ - Day 0 Support for LLaDA 2.0: https://lmsys.org/blog/2025-12-19-diffusion-llm/ - [SGLang-Diffusion] Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered - EAGLE 3 speculative decoding draft mo | Low | 1/1/2026 |
| gateway-v0.3.0 | ## SGLang Model Gateway v0.3.0 Released! We're thrilled to announce SGLang Model Gateway v0.3.0, a major release with powerful new features, architectural improvements, and important breaking changes! ## Breaking Changes ### Metrics Architecture Redesigned Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes. **Action Required**: Update your Prome | Low | 12/24/2025 |
| gateway-v0.2.4 | ## SGLang Model Gateway v0.2.4 Released! We're excited to announce SGLang Model Gateway v0.2.4, a massive release focused on performance, security, and production-ready observability! ## Headline Features ### Major Performance Optimizations We've invested heavily in performance across the entire stack: - Optimized radix tree for cache-aware load balancing: smarter routing decisions with lower overhead - Tokenizer optimization: dramatically reduced CPU and memory footprint durin | Low | 12/10/2025 |
| v0.5.6 | ## Highlights - Support for DeepSeek V3.2/V3.2 Speciale #14249 - Blockwise diffusion language model support #12588 - Support for new diffusion models (Flux2 #14000, Z-image #14067) - Introduce JIT Kernels #13453 - Upgrade to Torch 2.9 #12969 - Kimi-K2-Thinking model enhancement #12882 - Memory management/Overlap spec compatibility #12224 #12839 - More performance optimization: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2... - CI/CD Enhancement ## What's Changed * [router][grpc] | Low | 12/3/2025 |
| gateway-v0.2.3 | ## SGLang Model Gateway - New Release! We're excited to announce another powerful update to **SGLang Model Gateway** with performance improvements and expanded database support! ### **Headline Features** **Bucket Mode Routing - 20-30% Performance Boost** Introducing our new **bucket-based routing algorithm** that dramatically improves performance in PD mode. See up to **20-30% improvements in TTFT (Time To First Token) and overall throughput** **PostgreSQL Support for Cha | Low | 11/17/2025 |
| gateway-v0.2.2 | ## SGLang Model Gateway v0.2.2 Released! ### **Features** **Industry-First Responses API for All Models** We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for **Llama, DeepSeek, Qwen**, and more, with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models. **Production-Ready Kubernetes Operat | Low | 11/17/2025 |
| gateway-v0.2.1 | ## SGLang Model Gateway v0.2.1 Released! This release focuses on stability, cleanup, and two big new performance features. ### Docs & CI - Updated router documentation to reflect recent feature additions ### Code Cleanup - Refactored StopSequenceDecoder for cleaner incremental decoding - Added spec.rs test harness under spec/ for structured unit tests ### Bug Fixes - Fixed UTF-8 boundary in stop-sequence decoding - Fixed gRPC timeout configuration - Fixed worker | Low | 11/17/2025 |
| gateway-v0.2.0 | ## Release: SGLang Model Gateway v0.2.0 (formerly "SGLang Router") ## 🔥 What's new ### Multi-Model Inference Gateway (IGW) Mode IGW turns one router into many, letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface. You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway han | Low | 11/17/2025 |
| gateway-v0.1.9 | ## What's Changed in Gateway ### Gateway Changes (10 commits) - [router] upgrade router version to 0.1.9 (#8844) by @slin1237 in https://github.com/sgl-project/sglang/pull/8844 - refactor(sgl-router): Replace `once_cell` with `LazyLock` in worker.rs and remove once_cell dependency from Cargo.toml (#8698) by @htiennv in https://github.com/sgl-project/sglang/pull/8698 - [router] fix req handling order, improve serialization, remove retry (#8888) by @slin1237 in https://github.com/sgl-proje | Low | 11/17/2025 |
| gateway-v0.1.8 | ## What's Changed in Gateway ### Gateway Changes (4 commits) - [router] upgrade router version to 0.1.8 (#8645) by @slin1237 in https://github.com/sgl-project/sglang/pull/8645 - [router] add basic usage doc (#8640) by @slin1237 in https://github.com/sgl-project/sglang/pull/8640 - [bugfix] fix router python parser for pd urls (#8644) by @slin1237 in https://github.com/sgl-project/sglang/pull/8644 - Fix typos in py_test/test_launch_server.py (#6227) by @windsonsea in https://github.com/sg | Low | 11/17/2025 |
| gateway-v0.1.7 | ## What's Changed in Gateway ### Gateway/Router Changes (11 commits) - [router] update router pypi version (#8628) by @slin1237 in https://github.com/sgl-project/sglang/pull/8628 - [router] migrate router from actix to axum (#8479) by @slin1237 in https://github.com/sgl-project/sglang/pull/8479 - [feature] [sgl-router] Add a dp-aware routing strategy (#6869) by @oldsharp in https://github.com/sgl-project/sglang/pull/6869 - [router] improve router logs and request id header (#8415) by @s | Low | 11/17/2025 |
| gateway-v0.1.6 | ## What's Changed in Gateway ### Gateway Changes (12 commits) - [router] upgade router version to 0.1.6 (#8209) by @slin1237 in https://github.com/sgl-project/sglang/pull/8209 - [router] add ut for pd router (#8208) by @slin1237 in https://github.com/sgl-project/sglang/pull/8208 - [router] add ut for pd request, metrics and config (#8184) by @slin1237 in https://github.com/sgl-project/sglang/pull/8184 - [router] add ut for worker and errors (#8170) by @slin1237 in https://github.com/sgl | Low | 11/17/2025 |
| v0.5.5 | ## Highlights - Day 0 support for Kimi-K2-Thinking https://huggingface.co/moonshotai/Kimi-K2-Thinking - Day 0 support for Minimax-M2 https://huggingface.co/MiniMaxAI/MiniMax-M2 - Video and image generation support https://lmsys.org/blog/2025-11-07-sglang-diffusion/ - Q4 Roadmap: https://github.com/sgl-project/sglang/issues/12780 - Blackwell kernel optimizations and MoE runner backend refactor - Overlap spec and prefill cuda graph support more models ## What's Changed * [8/n] decouple | Low | 11/6/2025 |
| v0.5.4 | ## Highlights - AMD AI Dev Day 2025 SGLang ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_amd_ai_devday_2025.pdf)), PyTorch Conference 2025 SGLang ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf)) - Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html - [beta] Overlap scheduler for speculative decoding: https://github.com/sgl-project/sglang/issues/11762 - [beta] Piecewi | Low | 10/26/2025 |
| v0.5.3 | ## Highlights - Day 0 Support for DeepSeek-V3.2 with Sparse Attention: https://lmsys.org/blog/2025-09-29-deepseek-V32/ - Deterministic inference on multiple attention backends: https://lmsys.org/blog/2025-09-22-sglang-deterministic/ - Integration of FlashAttention 4 prefill kernels - Enhancing support of Qwen3-Next with MTP, DP, optimized kernels and multiple hardware platforms - Support models including Qwen3-VL series, dots.vlm1, Ling-V2, Apertus, SOLAR ## What's Changed * [Auto Syn | Low | 10/6/2025 |
| v0.5.2 | ## Highlights - SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends: https://lmsys.org/blog/2025-09-10-sglang-hicache/ ## What's Changed * feat: allow use local branch to build image by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9546 * [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/9547 * [doc] deepseekv31 support by @XiaotongJiang in https://github.com/ | Low | 9/12/2025 |
| v0.5.1 | ## What's Changed * [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in https://github.com/sgl-project/sglang/pull/8595 * [bugifx] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. by @sighingnow in https://github.com/sgl-project/sglang/pull/8611 * Fix hf3fs_fuse import error by @ispobock in https://github.com/sgl-project/sglang/pull/8623 * Update step3v default config by @ispobock in https://github.com/sgl-project | Low | 8/23/2025 |
| v0.4.10 | ## Highlights This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blogs - Please check the 2025 H2 roadmap https://github.com/sgl-project/sglang/issues/7736 - GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities https://lmsys.org/blog/2025-07-31-glm4-5/ - SpecForge: Accelerating Speculative Decoding Training for SGLang https://lmsys.org/blog/2025-07-25-spec-forge/ - Deploying Kimi K2 with PD Disaggregati | Low | 7/31/2025 |
| v0.4.8 | ## Highlights ### OpenAI-Compatible Server Refactor Re-structured the OpenAI-compatible server to support production and enterprise environments. Key improvements include: - Consistent metrics and logging for better observability and debugging. - Unified error handling, request validation, and processing logic for improved reliability and maintainability. - Improved request tracking across sessions and components. - Fixed bugs in embedding requests and reasoning parsers. Thi | Low | 6/24/2025 |
| v0.4.7 | ## Highlights - The PD disaggregation and large-scale EP functionalities from the [blog post](https://lmsys.org/blog/2025-05-05-large-scale-ep/) have now been **fully merged into the latest release**. - The blog has been successfully [reproduced](https://github.com/sgl-project/sglang/issues/6017) by over six industry teams, including the **TensorRT LLM team**. - SGLang's large-scale EP is now actively used by leading organizations such as **Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek* | Low | 6/11/2025 |
| v0.4.6 | ## Highlights - Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc). https://github.com/sgl-project/sglang/issues/4709#issuecomment-2817728855 - PD disaggregation with mooncake and NIXL transfer backends #4880 #5477 #4655 - DeepSeek performance improvements: turn on DeepGemm by default and some kernel fusions. #5580 #5628 - Update torch to 2.6.0. Fix torch.compile cache. #5417 #5213 - Preliminary support for blackwell #5303 Thanks ver | Low | 4/27/2025 |
| v0.4.5 | # Highlights The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding. ## New Features - **Llama 4 Support**: We supported [Llama 4 model](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) with accuracy matching official benchmark numbers, achieving a zero-shot score of | Low | 4/7/2025 |
| v0.4.4 | ## Highlights The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, it can achieve nearly **100 tokens/s**, which is currently the fastest open-source implementation. Look out for new optimizations coming soon! Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source com | Low | 3/13/2025 |
| v0.4.3 | ## Highlights The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. In the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines. We stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel https://slack.sglang.ai Cheers! ### Performance Improvements #### DeepSeek V3/R1 Op | Low | 2/14/2025 |
| v0.4.1 | ## Highlights - We're excited to announce SGLang v0.4.1, which now supports [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) - currently the strongest open-source LLM, even surpassing GPT-4o. The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPU **from day one**. We've also supported MLA optimization and DP attention before, making SGLang one of the best open-source LLM engines for running DeepSeek models. Special thanks to Me | Low | 12/25/2024 |
| v0.4.0 | ## Highlights blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/ We're excited to release SGLang v0.4, featuring significant performance improvements and new features: - Zero-overhead batch scheduler: 1.1x increase in throughput. - Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate. - Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement. - Fast structured outputs with xgrammar: up to 10x faster. ## W | Low | 12/4/2024 |
| v0.3.6 | ## Highlights * Reduce CPU overhead by enabling overlap scheduler by default. **1.1x higher throughput**. (#2105, #2067, #2095) * Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (#1970, #2061) * Cache-aware load balancer. 4x higher cache hit rate (#1934) * Support xgrammar backend for grammar-guided decoding (#2056) * Support Prometheus metrics (#1853, #1981) * Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876) * Support graceful term | Low | 11/22/2024 |
| v0.3.4.post1 | ## Highlights - Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving. - Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. [Slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-lmsys-online-meetup-efficient-llm-deployment-and-serving) - Added Engine API for offline inference with reduced overhead. [Usage](https://github.com/sgl-project/sglang/blob/main/README.md#engine-without-http-server). #1614 #1567 - | Low | 10/22/2024 |
| v0.3.2 | ## Highlight - Support torch.compile, cuda graph for triton attention backend and DeepSeek MLA #1442 #1422 - Initial support for multi-LoRA serving #1307 - Integrate torchao for quantization #1341 - Optimize the CPU scheduler overhead - Multiple critical bug fixes for llama and llava (tokenizer, modality) - Support AMD backend #1420 - New models: MiniCPM3, OLMoE ## What's Changed * Remove useless fields in global_config.py by @merrymercy in https://github.com/sgl-project/sglang/pu | Low | 10/2/2024 |
| v0.3.0 | ## Highlights Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ to find detailed instructions and descriptions for the items below. - Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA) - Up to 1.5x lower latency with torch.compile on small batch sizes - Support for interleaved text and multi-image/video in LLaVA-OneVision - Support for interleaved window attention and 2x longer context length in Gemma-2 - Chunked prefill is turned on by de | Low | 9/19/2024 |
| v0.2.13 | ## Highlights * **New Feature**: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973) * **New Models**: Support embedding model e5-mistral (#983 #987 #988 #997 #1014) and comprehensive OpenAI-compatible API. * **Performance**: Accelerate Multi-head Latent Attention (MLA). Bring 2x end-to-end improvement on Deepseek v2 (#905). * **More CI Tests**: Accuracy test (multiple benchmarks), unit test (APIs, mo | Low | 9/19/2024 |
| v0.2.9 | ## Highlights - **New feature**: Chunked prefill (#800, #811) - **New models**: Deepseek v2 - **Performance improvement**: vectorized logprob computation - **Accuracy fix**: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels - **Feature fix**: fixed many missing logprob-related features in the OpenAI API server - **CI/CD infra** is now fully ready. The tests cover frontend, backend, accuracy, and performance tests. ## What's Cha | Low | 8/2/2024 |
| v0.2.5 | ## Highlights - We recently released a [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/). Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, and on A100 and H100 GPUs, using FP8 and FP16. **SGLang consistently outperforms vLLM**, achieving up to **3.1x** higher throughput on Llama-70B. It also often matches or sometimes outperforms TensorRT-LLM. | Low | 7/26/2024 |
| v0.2.0 | ## Highlights - We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/). - New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo ## What's Changed * Optimize me | Low | 7/25/2024 |
| v0.1.20 | ## Highlights * Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612) * Model support: Gemma2, minicpm, Qwen2 MoE * Docker support (#217 ) * Various latency optimizations ## What's Changed * Add docker file by @Ying1123 in https://github.com/sgl-project/sglang/pull/588 * Add Gemma2 by @Ying1123 in https://github.com/sgl-project/sglang/pull/592 * Format by @Ying1123 in https://github.com/sgl-project/sglang/pull/593 * Fix Llava model by @wisclmy0 | Low | 7/14/2024 |
| v0.1.18 | ## Highlight - 2x large batch prefill improvement with the new flashinfer kernels #579 - Multi-node tensor parallelism #550 - New model support: ChatGLM #516 ## What's Changed * Fix missing numpy dependency in pyproject.toml by @fpreiss in https://github.com/sgl-project/sglang/pull/524 * Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in https://github.com/sgl-project/sglang/pull/525 * [Minor] Correct Optional type hints in api by @fpreiss in https://github.com/sgl-project/s | Low | 7/4/2024 |
| v0.1.17 | ## Highlights - Add data parallelism #480 - Add speculative execution for OpenAI API #250 - Update vllm to v0.4.3 for new quantization features #511 - Better error handling (#457, #449, #514) ## What's Changed * [Feat] Add llava qwen, llava mistral by @kcz358 in https://github.com/sgl-project/sglang/pull/419 * Format code by @hnyls2002 in https://github.com/sgl-project/sglang/pull/441 * Add finish_reason to OpenAI API by @mgerstgrasser in https://github.com/sgl-project/sglang/pull/446 | Low | 6/8/2024 |
| v0.1.16 | ## Highlight * Support more models: DBRX, Command-R, Gemma * Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) * Cache performance improvements (#418, #364) * Marlin quantization kernels * Many bug fixes * Update dependencies to be compatible with their latest versions ## What's Changed * Fix Runtime missing some ServerArgs options by @Qubitium in https://github.com/sgl-project/sglang/pull/281 * adding the triton docker build minimal example by @ | Low | 5/14/2024 |
| v0.1.13 | ## Highlights * Gemma Support by @hnyls2002 in https://github.com/sgl-project/sglang/pull/256 * Add Together and AzureOpenAI examples by @merrymercy in https://github.com/sgl-project/sglang/pull/184 ## What's Changed * correct a mistake on the README.md by @yaya-sy in https://github.com/sgl-project/sglang/pull/182 * correct reference dtype openai.py by @yaya-sy in https://github.com/sgl-project/sglang/pull/181 * Add Together and AzureOpenAI examples by @merrymercy in https://github.com/s | Low | 3/11/2024 |
| v0.1.12 | ## Highlights - Fast JSON Decoding ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)) - Output logprobs for decoding tokens - Multiple bug fixes ## What's Changed * Fix no-cache mode by @Ying1123 in https://github.com/sgl-project/sglang/pull/136 * Support Faster JSON decoding for llava by @hnyls2002 in https://github.com/sgl-project/sglang/pull/137 * fix undfined variable by @yaya-sy in https://github.com/sgl-project/sglang/pull/142 * jump-forward rename by @hnyls2002 in https | Low | 2/11/2024 |
| v0.1.11 | ## Highlights - Serve the official release demo of LLaVA v1.6 [blog](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/) - Support Yi-VL [example](https://github.com/sgl-project/sglang/blob/main/examples/quick_start/srt_example_yi_vl.py) - Faster JSON decoding [blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/) - Support QWen 2 ## What's Changed * Fix the error message and dependency of openai backend by @merrymercy in https://github.com/sgl-project/sglang/pull/71 * Add an asyn | Low | 2/3/2024 |
| v0.1.6 | ## Major features - Add OpenAI-compatible API server (Completion and ChatCompletion) - Fix `sgl.select` ## All PRs * Support v1/chat/completions by @comaniac in https://github.com/sgl-project/sglang/pull/50 * Fix select and normalized logprobs by @merrymercy in https://github.com/sgl-project/sglang/pull/67 * Bump version to 0.1.5 by @merrymercy in https://github.com/sgl-project/sglang/pull/33 * Use HTTP link in 3rdparty module by @comaniac in https://github.com/sgl-project/sglang/pull/4 | Low | 1/21/2024 |
| v0.1.5 | ## What's Changed * Fix for T4 GPUs by @Ying1123 in https://github.com/sgl-project/sglang/pull/16 * Gemini Backend by @caoshiyi in https://github.com/sgl-project/sglang/pull/9 * Teak mem fraction by @merrymercy in https://github.com/sgl-project/sglang/pull/20 * Add option to return metadata in async streaming by @BabyChouSr in https://github.com/sgl-project/sglang/pull/18 * Expose more arguments to control the scheduling policy by @merrymercy in https://github.com/sgl-project/sglang/pull/32 | Low | 1/18/2024 |
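Several entries above (gateway v0.3.1's radix-tree overhaul, gateway v0.2.4's optimized radix tree, v0.4.0's cache-aware load balancer) revolve around cache-aware routing: send each request to the worker that already holds the longest matching KV-cache prefix, so more of the prompt is served from cache. The toy sketch below illustrates only that routing policy under simplifying assumptions; the real gateway tracks prefixes in a Rust radix tree for efficiency, and all names here are hypothetical.

```python
from typing import Dict, List

def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt: List[str], caches: Dict[str, List[List[str]]]) -> str:
    """Route to the worker whose cached sequences overlap the prompt most.

    `caches` maps worker name -> list of token sequences it has cached.
    A linear scan is enough to show the policy; a production router
    would use a radix tree instead of comparing every sequence.
    """
    best_worker, best_overlap = None, -1
    for worker, seqs in caches.items():
        overlap = max((shared_prefix_len(prompt, s) for s in seqs), default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker

caches = {
    "worker-a": [["sys", "doc1", "q1"]],
    "worker-b": [["sys", "doc2", "q2"], ["sys", "doc2", "q3"]],
}
# A prompt that reuses doc2's prefix should land on worker-b,
# maximizing KV-cache reuse.
print(pick_worker(["sys", "doc2", "q9"], caches))  # worker-b
```

The tie-breaking and load-balancing behavior of the actual gateway (e.g. falling back to load when overlaps are equal) is not modeled here.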
