<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-black-background.png?raw=true"> <img alt="FlashInfer" src="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-white-background.png?raw=true" width=55%> </picture> </p>

<h1 align="center"> High-Performance GPU Kernels for Inference </h1>

<p align="center"> | <a href="https://docs.flashinfer.ai"><b>Documentation</b></a> | <a href="https://github.com/flashinfer-ai/flashinfer/releases/latest"><b>Latest Release</b></a> | <a href="https://flashinfer.ai"><b>Blog</b></a> | <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-379wct3hc-D5jR~1ZKQcU00WHsXhgvtA"><b>Slack</b></a> | <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> | </p>

**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations, including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

## Why FlashInfer?
- **State-of-the-art Performance**: Optimized kernels for prefill, decode, and mixed batching scenarios
- **Multiple Backends**: Automatically selects the best backend for your hardware and workload
- **Modern Architecture Support**: SM75 (Turing) and later, through Blackwell
- **Low-Precision Compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- **Production-Ready**: CUDAGraph and torch.compile compatible for low-latency serving

## Core Features

### Attention Kernels

- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving
- **Decode, Prefill, and Append**: Optimized kernels for all attention phases
- **MLA Attention**: Native support for DeepSeek's Multi-head Latent Attention
- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes
- **Sparse Attention**: Block-sparse and variable block-sparse patterns
- **POD-Attention**: Fused prefill+decode for mixed batching

### GEMM & Linear Operations

- **BF16 GEMM**: BF16 matrix multiplication for SM10.0+ GPUs
- **FP8 GEMM**: Per-tensor and groupwise scaling
- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing

### Mixture of Experts (MoE)

- **Fused MoE Kernels**
- **Multiple Routing Methods**: DeepSeek-V3, Llama-4, and standard top-k routing
- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling

### Sampling & Decoding

- **Sorting-Free Sampling**: Efficient Top-K, Top-P, and Min-P without sorting
- **Speculative Decoding**: Chain speculative sampling support

### Communication

- **AllReduce**: Custom implementations
- **Multi-Node NVLink**: MNNVL support for multi-node inference
- **NVSHMEM Integration**: For distributed memory operations

### Other Operators

- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- **Normalization**: RMSNorm, LayerNorm, and Gemma-style fused operations
- **Activations**: SiLU and GELU with fused gating

## GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|-------------------|------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 11.0 | Jetson Thor |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |

> **Note:** Not all features are supported across all compute capabilities.

## News

Latest: see the [latest release](https://github.com/flashinfer-ai/flashinfer/releases/latest).

Notable updates:

- [2025-10-08] Blackwell support added in [v0.4.0](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.4.0)
- [2025-03-10] [Blog post](https://flashinfer.ai/2025/03/10/sampling.html): Sorting-Free GPU Kernels for LLM Sampling, which explains the design of the sampling kernels in FlashInfer
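The sorting-free sampling kernels described in the blog post above are built on rejection sampling. The following is a conceptual pure-Python sketch of that idea for Top-P (nucleus) sampling, assuming distinct probabilities; it illustrates the algorithmic trick only and is not FlashInfer's CUDA implementation or API:

```python
import random

def top_p_sample(probs, p, rng):
    """Top-P (nucleus) sampling by rejection, without sorting the vocabulary.

    Draw a token from the distribution restricted to probabilities above a
    pivot; accept it only if it belongs to the nucleus (the smallest set of
    highest-probability tokens with total mass >= p), otherwise raise the
    pivot and retry. Rejected draws only tighten the candidate set, which
    always still contains the whole nucleus.
    """
    pivot = 0.0
    while True:
        # Sample proportionally to probs among tokens above the pivot.
        candidates = [(i, q) for i, q in enumerate(probs) if q > pivot]
        total = sum(q for _, q in candidates)
        r = rng.random() * total
        for i, q in candidates:
            r -= q
            if r <= 0:
                break
        # Token i is in the nucleus iff the mass of strictly larger
        # probabilities has not yet reached p.
        if sum(qj for qj in probs if qj > q) < p:
            return i
        pivot = q  # rejected: every nucleus token has prob > q

rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
# With p = 0.7 the nucleus is tokens {0, 1}; accepted draws are distributed
# proportionally to 0.5 : 0.3 within it.
draws = [top_p_sample(probs, 0.7, rng) for _ in range(4000)]
```

The key property is that each loop iteration touches the distribution only elementwise (mask, sum, draw), which maps naturally onto GPU reductions, whereas a sort over the full vocabulary does not.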
## Getting Started

### Installation

**Quickstart:**

```bash
pip install flashinfer-python
```

**Package Options:**

- **flashinfer-python**: Core package that compiles or downloads kernels on first use
- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages so that most kernels are available ahead of time.
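Before diving into the attention APIs, it helps to picture the paged KV-cache they operate on. This is a hypothetical pure-Python sketch (illustrative names, not FlashInfer's actual data structures): a per-request page table maps logical token positions onto fixed-size physical pages, so sequences can grow without contiguous preallocation:

```python
class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV-cache (illustration only)."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # request id -> list of physical page ids
        self.seq_lens = {}     # request id -> number of cached tokens

    def append_token(self, req_id):
        """Reserve storage for one new token; return its (page, slot)."""
        table = self.page_tables.setdefault(req_id, [])
        n = self.seq_lens.get(req_id, 0)
        if n % self.page_size == 0:   # first token, or current page is full
            table.append(self.free_pages.pop())
        self.seq_lens[req_id] = n + 1
        return table[n // self.page_size], n % self.page_size

cache = PagedKVCache(num_pages=8)
# 17 tokens with page_size=16 span two pages; token 16 lands in slot 0
# of the second page.
slots = [cache.append_token("req0") for _ in range(17)]
```

The real kernels consume the same kind of page-table indices (e.g. indptr/indices arrays) on the GPU, which is what lets a single batched attention kernel gather each request's scattered pages.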
## Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 0.6.8.post1 | Imported from PyPI (0.6.8.post1) | Low | 4/21/2026 |
| nightly-v0.6.8-20260421 | Automated nightly build for version 0.6.8 (dev20260421) | High | 4/21/2026 |
| v0.6.8.post1 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.8...v0.6.8.post1 | High | 4/18/2026 |
| nightly-v0.6.8-20260416 | Automated nightly build for version 0.6.8 (dev20260416) | High | 4/16/2026 |
| v0.6.8 | **What's Changed:** Add to CODEOWNER by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2875 * fix: int32 overflow in `trtllm_fp4_block_scale_moe` causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in https://github.com/flashinfer-ai/flashinfer/pull/2853 * feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in https://github.com/flashinfer-ai/flashinfer/pull/2833 * fix: add cute dsl moe utils to AOT by @nv-yunzheq in https://github.com/flashinfer-ai/flas | High | 4/16/2026 |
| v0.6.8rc1 | **What's Changed:** Add to CODEOWNER by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2875 * fix: int32 overflow in `trtllm_fp4_block_scale_moe` causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in https://github.com/flashinfer-ai/flashinfer/pull/2853 * feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in https://github.com/flashinfer-ai/flashinfer/pull/2833 * fix: add cute dsl moe utils to AOT by @nv-yunzheq in https://github.com/flashinfer-ai/flas | Medium | 4/14/2026 |
| nightly-v0.6.7-20260414 | Automated nightly build for version 0.6.7 (dev20260414) | Medium | 4/14/2026 |
| nightly-v0.6.7-20260413 | Automated nightly build for version 0.6.7 (dev20260413) | Medium | 4/13/2026 |
| nightly-v0.6.7-20260411 | Automated nightly build for version 0.6.7 (dev20260411) | Medium | 4/11/2026 |
| nightly-v0.6.7-20260410 | Automated nightly build for version 0.6.7 (dev20260410) | Medium | 4/10/2026 |
| nightly-v0.6.7-20260408 | Automated nightly build for version 0.6.7 (dev20260408) | Medium | 4/8/2026 |
| nightly-v0.6.7-20260406 | Automated nightly build for version 0.6.7 (dev20260406) | Medium | 4/6/2026 |
| v0.6.7.post3 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post2...v0.6.7.post3 | Medium | 4/6/2026 |
| nightly-v0.6.7-20260405 | Automated nightly build for version 0.6.7 (dev20260405) | Medium | 4/5/2026 |
| nightly-v0.6.7-20260404 | Automated nightly build for version 0.6.7 (dev20260404) | Medium | 4/4/2026 |
| v0.6.7.post2 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post1...v0.6.7.post2 | Medium | 4/4/2026 |
| v0.6.7.post1 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7...v0.6.7.post1 | Medium | 4/3/2026 |
| nightly-v0.6.7-20260402 | Automated nightly build for version 0.6.7 (dev20260402) | Medium | 4/2/2026 |
| nightly-v0.6.7-20260401 | Automated nightly build for version 0.6.7 (dev20260401) | Medium | 4/1/2026 |
| nightly-v0.6.7-20260331 | Automated nightly build for version 0.6.7 (dev20260331) | Medium | 3/31/2026 |
| nightly-v0.6.7-20260328 | Automated nightly build for version 0.6.7 (dev20260328) | Medium | 3/28/2026 |
| nightly-v0.6.7-20260326 | Automated nightly build for version 0.6.7 (dev20260326) | Medium | 3/26/2026 |
| v0.6.7 | **What's Changed:** perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in https://github.com/flashinfer-ai/flashinfer/pull/2618 * Feat/gdn decode pooled by @xutizhou in https://github.com/flashinfer-ai/flashinfer/pull/2521 * fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in https://github.com/flashinfer-ai/flashinfer/pull/2716 * Support NVFP4 KV cache decode on SM120 by | Medium | 3/25/2026 |
| nightly-v0.6.7-20260324 | Automated nightly build for version 0.6.7 (dev20260324) | Medium | 3/24/2026 |
| nightly-v0.6.6-20260323 | Automated nightly build for version 0.6.6 (dev20260323) | Medium | 3/23/2026 |
| nightly-v0.6.6-20260322 | Automated nightly build for version 0.6.6 (dev20260322) | Low | 3/22/2026 |
| nightly-v0.6.6-20260321 | Automated nightly build for version 0.6.6 (dev20260321) | Low | 3/21/2026 |
| nightly-v0.6.6-20260320 | Automated nightly build for version 0.6.6 (dev20260320) | Low | 3/20/2026 |
| nightly-v0.6.6-20260319 | Automated nightly build for version 0.6.6 (dev20260319) | Low | 3/19/2026 |
| nightly-v0.6.6-20260318 | Automated nightly build for version 0.6.6 (dev20260318) | Low | 3/18/2026 |
| nightly-v0.6.6-20260317 | Automated nightly build for version 0.6.6 (dev20260317) | Low | 3/17/2026 |
| nightly-v0.6.6-20260316 | Automated nightly build for version 0.6.6 (dev20260316) | Low | 3/16/2026 |
| nightly-v0.6.6-20260315 | Automated nightly build for version 0.6.6 (dev20260315) | Low | 3/15/2026 |
| nightly-v0.6.6-20260314 | Automated nightly build for version 0.6.6 (dev20260314) | Low | 3/14/2026 |
| nightly-v0.6.6-20260313 | Automated nightly build for version 0.6.6 (dev20260313) | Low | 3/13/2026 |
| nightly-v0.6.6-20260312 | Automated nightly build for version 0.6.6 (dev20260312) | Low | 3/12/2026 |
| v0.6.6 | **What's Changed:** fix: move ArtifactPath/CheckSumHash imports inside gen_moe_utils_modu… by @dierksen in https://github.com/flashinfer-ai/flashinfer/pull/2681 * Enable sm120f compilation by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/2650 * Ensure -gencode flags are in deterministic order (for ccache) by @benbarsdell in https://github.com/flashinfer-ai/flashinfer/pull/2674 * int16 Block-Scaled State and Stochastic Rounding for SSU (mamba) by @ishovkun in https://github.com/f | Low | 3/11/2026 |
| nightly-v0.6.5-20260309 | Automated nightly build for version 0.6.5 (dev20260309) | Low | 3/9/2026 |
| nightly-v0.6.5-20260308 | Automated nightly build for version 0.6.5 (dev20260308) | Low | 3/8/2026 |
| nightly-v0.6.5-20260307 | Automated nightly build for version 0.6.5 (dev20260307) | Low | 3/7/2026 |
| nightly-v0.6.5-20260306 | Automated nightly build for version 0.6.5 (dev20260306) | Low | 3/6/2026 |
| nightly-v0.6.5-20260305 | Automated nightly build for version 0.6.5 (dev20260305) | Low | 3/5/2026 |
| nightly-v0.6.5-20260304 | Automated nightly build for version 0.6.5 (dev20260304) | Low | 3/4/2026 |
| v0.6.5 | **What's Changed:** feat: BF16 GEMM benchmarking support by @raayandhar in https://github.com/flashinfer-ai/flashinfer/pull/2525 * [bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length by @huangzhilin-hzl in https://github.com/flashinfer-ai/flashinfer/pull/2489 * test: Skip test_decode_delta_rule.py by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/2600 * feat: add issue self-claim workflow for external contributors by @jwu1980 in https://github. | Low | 3/4/2026 |
| nightly-v0.6.4-20260303 | Automated nightly build for version 0.6.4 (dev20260303) | Low | 3/3/2026 |
| nightly-v0.6.4-20260302 | Automated nightly build for version 0.6.4 (dev20260302) | Low | 3/2/2026 |
| nightly-v0.6.4-20260301 | Automated nightly build for version 0.6.4 (dev20260301) | Low | 3/1/2026 |
| nightly-v0.6.4-20260228 | Automated nightly build for version 0.6.4 (dev20260228) | Low | 2/28/2026 |
| nightly-v0.6.4-20260227 | Automated nightly build for version 0.6.4 (dev20260227) | Low | 2/27/2026 |
| nightly-v0.6.4-20260226 | Automated nightly build for version 0.6.4 (dev20260226) | Low | 2/26/2026 |
| nightly-v0.6.4-20260225 | Automated nightly build for version 0.6.4 (dev20260225) | Low | 2/25/2026 |
