<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-black-background.png?raw=true"> <img alt="FlashInfer" src="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-white-background.png?raw=true" width=55%> </picture> </p>

<h1 align="center"> High-Performance GPU Kernels for Inference </h1>

<p align="center"> | <a href="https://docs.flashinfer.ai"><b>Documentation</b></a> | <a href="https://github.com/flashinfer-ai/flashinfer/releases/latest"><b>Latest Release</b></a> | <a href="https://flashinfer.ai"><b>Blog</b></a> | <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-379wct3hc-D5jR~1ZKQcU00WHsXhgvtA"><b>Slack</b></a> | <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> | </p>

**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations, including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

## Why FlashInfer?
- **State-of-the-art Performance**: Optimized kernels for prefill, decode, and mixed batching scenarios
- **Multiple Backends**: Automatically selects the best backend for your hardware and workload
- **Modern Architecture Support**: SM75 (Turing) and later, through Blackwell
- **Low-Precision Compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- **Production-Ready**: CUDAGraph and torch.compile compatible for low-latency serving

## Core Features

### Attention Kernels

- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving
- **Decode, Prefill, and Append**: Optimized kernels for all attention phases
- **MLA Attention**: Native support for DeepSeek's Multi-head Latent Attention
- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes
- **Sparse Attention**: Block-sparse and variable block-sparse patterns
- **POD-Attention**: Fused prefill+decode for mixed batching

### GEMM & Linear Operations

- **BF16 GEMM**: BF16 matrix multiplication for SM10.0+ GPUs
- **FP8 GEMM**: Per-tensor and groupwise scaling
- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing

### Mixture of Experts (MoE)

- **Fused MoE Kernels**
- **Multiple Routing Methods**: DeepSeek-V3, Llama-4, and standard top-k routing
- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling

### Sampling & Decoding

- **Sorting-Free Sampling**: Efficient Top-K, Top-P, and Min-P without sorting
- **Speculative Decoding**: Chain speculative sampling support

### Communication

- **AllReduce**: Custom implementations
- **Multi-Node NVLink**: MNNVL support for multi-node inference
- **NVSHMEM Integration**: For distributed memory operations

### Other Operators

- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- **Normalization**: RMSNorm, LayerNorm, and Gemma-style fused operations
- **Activations**: SiLU and GELU with fused gating

## GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|-------------------|------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 11.0 | Jetson Thor |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |

> **Note:** Not all features are supported across all compute capabilities.

## News

Latest: see the [latest release](https://github.com/flashinfer-ai/flashinfer/releases/latest).

Notable updates:

- [2025-10-08] Blackwell support added in [v0.4.0](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.4.0)
- [2025-03-10] [Blog post](https://flashinfer.ai/2025/03/10/sampling.html): Sorting-Free GPU Kernels for LLM Sampling, which explains the design of the sampling kernels in FlashInfer
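The sorting-free sampling kernels described in the blog post above are built on rejection sampling. The following is a conceptual pure-Python sketch of that idea for Top-P (nucleus) sampling, assuming distinct probabilities; it illustrates the algorithmic trick only and is not FlashInfer's CUDA implementation or API:

```python
import random

def top_p_sample(probs, p, rng):
    """Top-P (nucleus) sampling by rejection, without sorting the vocabulary.

    Draw a token from the distribution restricted to probabilities above a
    pivot; accept it only if it belongs to the nucleus (the smallest set of
    highest-probability tokens with total mass >= p), otherwise raise the
    pivot and retry. Rejected draws only tighten the candidate set, which
    always still contains the whole nucleus.
    """
    pivot = 0.0
    while True:
        # Sample proportionally to probs among tokens above the pivot.
        candidates = [(i, q) for i, q in enumerate(probs) if q > pivot]
        total = sum(q for _, q in candidates)
        r = rng.random() * total
        for i, q in candidates:
            r -= q
            if r <= 0:
                break
        # Token i is in the nucleus iff the mass of strictly larger
        # probabilities has not yet reached p.
        if sum(qj for qj in probs if qj > q) < p:
            return i
        pivot = q  # rejected: every nucleus token has prob > q

rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
# With p = 0.7 the nucleus is tokens {0, 1}; accepted draws are distributed
# proportionally to 0.5 : 0.3 within it.
draws = [top_p_sample(probs, 0.7, rng) for _ in range(4000)]
```

The key property is that each loop iteration touches the distribution only elementwise (mask, sum, draw), which maps naturally onto GPU reductions, whereas a sort over the full vocabulary does not.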
## Getting Started

### Installation

**Quickstart:**

```bash
pip install flashinfer-python
```

**Package Options:**

- **flashinfer-python**: Core package that compiles or downloads kernels on first use
- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages so that most kernels are available ahead of time.
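Before diving into the attention APIs, it helps to picture the paged KV-cache they operate on. This is a hypothetical pure-Python sketch (illustrative names, not FlashInfer's actual data structures): a per-request page table maps logical token positions onto fixed-size physical pages, so sequences can grow without contiguous preallocation:

```python
class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV-cache (illustration only)."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # request id -> list of physical page ids
        self.seq_lens = {}     # request id -> number of cached tokens

    def append_token(self, req_id):
        """Reserve storage for one new token; return its (page, slot)."""
        table = self.page_tables.setdefault(req_id, [])
        n = self.seq_lens.get(req_id, 0)
        if n % self.page_size == 0:   # first token, or current page is full
            table.append(self.free_pages.pop())
        self.seq_lens[req_id] = n + 1
        return table[n // self.page_size], n % self.page_size

cache = PagedKVCache(num_pages=8)
# 17 tokens with page_size=16 span two pages; token 16 lands in slot 0
# of the second page.
slots = [cache.append_token("req0") for _ in range(17)]
```

The real kernels consume the same kind of page-table indices (e.g. indptr/indices arrays) on the GPU, which is what lets a single batched attention kernel gather each request's scattered pages.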
## Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 0.6.8.post1 | Imported from PyPI (0.6.8.post1) | Low | 4/21/2026 |
| nightly-v0.6.8-20260421 | Automated nightly build for version 0.6.8 (dev20260421) | High | 4/21/2026 |
| v0.6.8.post1 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.8...v0.6.8.post1 | High | 4/18/2026 |
| nightly-v0.6.8-20260416 | Automated nightly build for version 0.6.8 (dev20260416) | High | 4/16/2026 |
| v0.6.8 | **What's Changed:** Add to CODEOWNER by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2875 * fix: int32 overflow in `trtllm_fp4_block_scale_moe` causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in https://github.com/flashinfer-ai/flashinfer/pull/2853 * feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in https://github.com/flashinfer-ai/flashinfer/pull/2833 * fix: add cute dsl moe utils to AOT by @nv-yunzheq in https://github.com/flashinfer-ai/flas | High | 4/16/2026 |
| v0.6.8rc1 | **What's Changed:** Add to CODEOWNER by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2875 * fix: int32 overflow in `trtllm_fp4_block_scale_moe` causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in https://github.com/flashinfer-ai/flashinfer/pull/2853 * feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in https://github.com/flashinfer-ai/flashinfer/pull/2833 * fix: add cute dsl moe utils to AOT by @nv-yunzheq in https://github.com/flashinfer-ai/flas | Medium | 4/14/2026 |
| nightly-v0.6.7-20260414 | Automated nightly build for version 0.6.7 (dev20260414) | Medium | 4/14/2026 |
| nightly-v0.6.7-20260413 | Automated nightly build for version 0.6.7 (dev20260413) | Medium | 4/13/2026 |
| nightly-v0.6.7-20260411 | Automated nightly build for version 0.6.7 (dev20260411) | Medium | 4/11/2026 |
| nightly-v0.6.7-20260410 | Automated nightly build for version 0.6.7 (dev20260410) | Medium | 4/10/2026 |
| nightly-v0.6.7-20260408 | Automated nightly build for version 0.6.7 (dev20260408) | Medium | 4/8/2026 |
| nightly-v0.6.7-20260406 | Automated nightly build for version 0.6.7 (dev20260406) | Medium | 4/6/2026 |
| v0.6.7.post3 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post2...v0.6.7.post3 | Medium | 4/6/2026 |
| nightly-v0.6.7-20260405 | Automated nightly build for version 0.6.7 (dev20260405) | Medium | 4/5/2026 |
| nightly-v0.6.7-20260404 | Automated nightly build for version 0.6.7 (dev20260404) | Medium | 4/4/2026 |
| v0.6.7.post2 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post1...v0.6.7.post2 | Medium | 4/4/2026 |
| v0.6.7.post1 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7...v0.6.7.post1 | Medium | 4/3/2026 |
| nightly-v0.6.7-20260402 | Automated nightly build for version 0.6.7 (dev20260402) | Medium | 4/2/2026 |
| nightly-v0.6.7-20260401 | Automated nightly build for version 0.6.7 (dev20260401) | Medium | 4/1/2026 |
| nightly-v0.6.7-20260331 | Automated nightly build for version 0.6.7 (dev20260331) | Medium | 3/31/2026 |
| nightly-v0.6.7-20260328 | Automated nightly build for version 0.6.7 (dev20260328) | Medium | 3/28/2026 |
| nightly-v0.6.7-20260326 | Automated nightly build for version 0.6.7 (dev20260326) | Medium | 3/26/2026 |
| v0.6.7 | **What's Changed:** perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in https://github.com/flashinfer-ai/flashinfer/pull/2618 * Feat/gdn decode pooled by @xutizhou in https://github.com/flashinfer-ai/flashinfer/pull/2521 * fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in https://github.com/flashinfer-ai/flashinfer/pull/2716 * Support NVFP4 KV cache decode on SM120 by | Medium | 3/25/2026 |
| nightly-v0.6.7-20260324 | Automated nightly build for version 0.6.7 (dev20260324) | Medium | 3/24/2026 |
| nightly-v0.6.6-20260323 | Automated nightly build for version 0.6.6 (dev20260323) | Medium | 3/23/2026 |
| nightly-v0.6.6-20260322 | Automated nightly build for version 0.6.6 (dev20260322) | Low | 3/22/2026 |
| nightly-v0.6.6-20260321 | Automated nightly build for version 0.6.6 (dev20260321) | Low | 3/21/2026 |
| nightly-v0.6.6-20260320 | Automated nightly build for version 0.6.6 (dev20260320) | Low | 3/20/2026 |
| nightly-v0.6.6-20260319 | Automated nightly build for version 0.6.6 (dev20260319) | Low | 3/19/2026 |
| nightly-v0.6.6-20260318 | Automated nightly build for version 0.6.6 (dev20260318) | Low | 3/18/2026 |
| nightly-v0.6.6-20260317 | Automated nightly build for version 0.6.6 (dev20260317) | Low | 3/17/2026 |
| nightly-v0.6.6-20260316 | Automated nightly build for version 0.6.6 (dev20260316) | Low | 3/16/2026 |
| nightly-v0.6.6-20260315 | Automated nightly build for version 0.6.6 (dev20260315) | Low | 3/15/2026 |
| nightly-v0.6.6-20260314 | Automated nightly build for version 0.6.6 (dev20260314) | Low | 3/14/2026 |
| nightly-v0.6.6-20260313 | Automated nightly build for version 0.6.6 (dev20260313) | Low | 3/13/2026 |
| nightly-v0.6.6-20260312 | Automated nightly build for version 0.6.6 (dev20260312) | Low | 3/12/2026 |
| v0.6.6 | **What's Changed:** fix: move ArtifactPath/CheckSumHash imports inside gen_moe_utils_modu… by @dierksen in https://github.com/flashinfer-ai/flashinfer/pull/2681 * Enable sm120f compilation by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/2650 * Ensure -gencode flags are in deterministic order (for ccache) by @benbarsdell in https://github.com/flashinfer-ai/flashinfer/pull/2674 * int16 Block-Scaled State and Stochastic Rounding for SSU (mamba) by @ishovkun in https://github.com/f | Low | 3/11/2026 |
| nightly-v0.6.5-20260309 | Automated nightly build for version 0.6.5 (dev20260309) | Low | 3/9/2026 |
| nightly-v0.6.5-20260308 | Automated nightly build for version 0.6.5 (dev20260308) | Low | 3/8/2026 |
| nightly-v0.6.5-20260307 | Automated nightly build for version 0.6.5 (dev20260307) | Low | 3/7/2026 |
| nightly-v0.6.5-20260306 | Automated nightly build for version 0.6.5 (dev20260306) | Low | 3/6/2026 |
| nightly-v0.6.5-20260305 | Automated nightly build for version 0.6.5 (dev20260305) | Low | 3/5/2026 |
| nightly-v0.6.5-20260304 | Automated nightly build for version 0.6.5 (dev20260304) | Low | 3/4/2026 |
| v0.6.5 | **What's Changed:** feat: BF16 GEMM benchmarking support by @raayandhar in https://github.com/flashinfer-ai/flashinfer/pull/2525 * [bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length by @huangzhilin-hzl in https://github.com/flashinfer-ai/flashinfer/pull/2489 * test: Skip test_decode_delta_rule.py by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/2600 * feat: add issue self-claim workflow for external contributors by @jwu1980 in https://github. | Low | 3/4/2026 |
| nightly-v0.6.4-20260303 | Automated nightly build for version 0.6.4 (dev20260303) | Low | 3/3/2026 |
| nightly-v0.6.4-20260302 | Automated nightly build for version 0.6.4 (dev20260302) | Low | 3/2/2026 |
| nightly-v0.6.4-20260301 | Automated nightly build for version 0.6.4 (dev20260301) | Low | 3/1/2026 |
| nightly-v0.6.4-20260228 | Automated nightly build for version 0.6.4 (dev20260228) | Low | 2/28/2026 |
| nightly-v0.6.4-20260227 | Automated nightly build for version 0.6.4 (dev20260227) | Low | 2/27/2026 |
| nightly-v0.6.4-20260226 | Automated nightly build for version 0.6.4 (dev20260226) | Low | 2/26/2026 |
| nightly-v0.6.4-20260225 | Automated nightly build for version 0.6.4 (dev20260225) | Low | 2/25/2026 |
