freshcrate

flashinfer-python

FlashInfer: Kernel Library for LLM Serving

Description

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-black-background.png?raw=true">
    <img alt="FlashInfer" src="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-white-background.png?raw=true" width=55%>
  </picture>
</p>
<h1 align="center">High-Performance GPU Kernels for Inference</h1>
<p align="center">
| <a href="https://docs.flashinfer.ai"><b>Documentation</b></a>
| <a href="https://github.com/flashinfer-ai/flashinfer/releases/latest"><b>Latest Release</b></a>
| <a href="https://flashinfer.ai"><b>Blog</b></a>
| <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-379wct3hc-D5jR~1ZKQcU00WHsXhgvtA"><b>Slack</b></a>
| <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> |
</p>

[![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/badge/icon)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/)
[![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)

**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations, with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

## Why FlashInfer?
- **State-of-the-art Performance**: Optimized kernels for prefill, decode, and mixed batching scenarios
- **Multiple Backends**: Automatically selects the best backend for your hardware and workload
- **Modern Architecture Support**: SM75 (Turing) and later, through Blackwell
- **Low-Precision Compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- **Production-Ready**: CUDAGraph and torch.compile compatible for low-latency serving

## Core Features

### Attention Kernels

- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving
- **Decode, Prefill, and Append**: Optimized kernels for all attention phases
- **MLA Attention**: Native support for DeepSeek's Multi-Latent Attention
- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes
- **Sparse Attention**: Block-sparse and variable block-sparse patterns
- **POD-Attention**: Fused prefill+decode for mixed batching

### GEMM & Linear Operations

- **BF16 GEMM**: BF16 matrix multiplication for SM10.0+ GPUs
- **FP8 GEMM**: Per-tensor and groupwise scaling
- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing

### Mixture of Experts (MoE)

- **Fused MoE Kernels**
- **Multiple Routing Methods**: DeepSeek-V3, Llama-4, and standard top-k routing
- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling

### Sampling & Decoding

- **Sorting-Free Sampling**: Efficient Top-K, Top-P, and Min-P without sorting
- **Speculative Decoding**: Chain speculative sampling support

### Communication

- **AllReduce**: Custom implementations
- **Multi-Node NVLink**: MNNVL support for multi-node inference
- **NVSHMEM Integration**: For distributed memory operations

### Other Operators

- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- **Normalization**: RMSNorm, LayerNorm, Gemma-style fused operations
- **Activations**: SiLU, GELU with fused gating

## GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|--------------------|--------------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 11.0 | Jetson Thor |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |

> **Note:** Not all features are supported across all compute capabilities.

## News

Latest: [![GitHub Release](https://img.shields.io/github/v/release/flashinfer-ai/flashinfer)](https://github.com/flashinfer-ai/flashinfer/releases/latest)

Notable updates:

- [2025-10-08] Blackwell support added in [v0.4.0](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.4.0)
- [2025-03-10] [Blog post](https://flashinfer.ai/2025/03/10/sampling.html): Sorting-Free GPU Kernels for LLM Sampling, which explains the design of FlashInfer's sampling kernels
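To make the paged KV-cache concept from the feature list concrete, here is a minimal, framework-free sketch of the data layout (this illustrates the idea only and is **not** FlashInfer's API; all names and sizes here are invented for illustration): the cache is a shared pool of fixed-size pages, and each sequence holds a page table mapping its logical token positions to physical pages.

```python
import numpy as np

PAGE_SIZE = 4   # tokens per page (FlashInfer supports configurable page sizes)
NUM_PAGES = 8   # size of the shared physical pool
HEAD_DIM = 2    # tiny head dimension for illustration

# Physical pool shared by all sequences: [num_pages, page_size, head_dim]
kv_pool = np.zeros((NUM_PAGES, PAGE_SIZE, HEAD_DIM), dtype=np.float32)

def append_token(page_table, seq_len, vec, free_pages):
    """Append one token's KV vector to a sequence, allocating a page when full."""
    if seq_len % PAGE_SIZE == 0:           # current page is full (or first token)
        page_table.append(free_pages.pop())
    page = page_table[seq_len // PAGE_SIZE]
    kv_pool[page, seq_len % PAGE_SIZE] = vec
    return seq_len + 1

def gather_kv(page_table, seq_len):
    """Reassemble a sequence's logical KV tensor from its scattered pages."""
    flat = kv_pool[page_table].reshape(-1, HEAD_DIM)
    return flat[:seq_len]

free = list(range(NUM_PAGES))
table, n = [], 0
for i in range(6):                          # 6 tokens -> spans 2 pages
    n = append_token(table, n, np.full(HEAD_DIM, float(i)), free)

assert len(table) == 2                      # two pages allocated
assert float(gather_kv(table, n)[5, 0]) == 5.0
```

Because sequences only ever hold whole pages, memory is allocated in fixed-size units and can be reused across requests without fragmentation, which is what makes dynamic batch serving efficient.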
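As a refresher on what the RoPE operator listed above computes, here is a NumPy reference sketch of rotary position embeddings in the common "rotate-half" convention (a plain reference implementation for clarity, not FlashInfer's fused GPU kernel; the function name is invented): each pair of dimensions is rotated by an angle proportional to the token's position.

```python
import numpy as np

def rope_reference(x, positions, base=10000.0):
    """Rotary embedding reference: rotate dimension pairs by position-dependent angles.

    x: [seq_len, head_dim] with even head_dim; positions: [seq_len] token positions.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One inverse frequency per dimension pair
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
    angles = positions[:, None] * inv_freq[None, :]        # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                      # split into pair halves
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(0).standard_normal((3, 8))
out = rope_reference(x, np.arange(3))
# Rotation preserves vector norms, and position 0 is the identity
assert np.allclose(np.linalg.norm(out, axis=-1), np.linalg.norm(x, axis=-1))
assert np.allclose(out[0], x[0])
```

Fused kernels like FlashInfer's apply this rotation in-place on query/key tensors without materializing the cos/sin tables per call, which is where the performance benefit comes from.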
## Getting Started

### Installation

**Quickstart:**

```bash
pip install flashinfer-python
```

**Package Options:**

- **flashinfer-python**: Core package that compiles/downloads kernels on first use
- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions

**For faster initialization and offline usage**, install the optional packages to have most kernels pre-compiled ahead of time.
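To check which of the packages above are present in a given environment, the standard library's `importlib.metadata` can be queried by the PyPI distribution names listed above (a small standalone sketch; it touches no FlashInfer APIs, so it works whether or not the packages are installed):

```python
from importlib.metadata import version, PackageNotFoundError

# Distribution names taken from the package options listed above
optional = ["flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache"]

status = {}
for pkg in optional:
    try:
        status[pkg] = version(pkg)
    except PackageNotFoundError:
        status[pkg] = None  # not installed: kernels will be JIT-compiled on first use

print(status)
```

A `None` entry for `flashinfer-cubin` or `flashinfer-jit-cache` simply means the core package will fall back to compiling or downloading kernels on first use.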

Release History

| Version | Changes | Urgency | Date |
|---------|---------|---------|------|
| 0.6.8.post1 | Imported from PyPI (0.6.8.post1) | Low | 4/21/2026 |
| nightly-v0.6.8-20260421 | Automated nightly build for version 0.6.8 (dev20260421) | High | 4/21/2026 |
| v0.6.8.post1 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.8...v0.6.8.post1 | High | 4/18/2026 |
| nightly-v0.6.8-20260416 | Automated nightly build for version 0.6.8 (dev20260416) | High | 4/16/2026 |
| v0.6.8 | What's Changed: Add to CODEOWNER by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2875; fix: int32 overflow in `trtllm_fp4_block_scale_moe` causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in https://github.com/flashinfer-ai/flashinfer/pull/2853; feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in https://github.com/flashinfer-ai/flashinfer/pull/2833; fix: add cute dsl moe utils to AOT by @nv-yunzheq in https://github.com/flashinfer-ai/flas… | High | 4/16/2026 |
| v0.6.8rc1 | What's Changed: Add to CODEOWNER by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2875; fix: int32 overflow in `trtllm_fp4_block_scale_moe` causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in https://github.com/flashinfer-ai/flashinfer/pull/2853; feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in https://github.com/flashinfer-ai/flashinfer/pull/2833; fix: add cute dsl moe utils to AOT by @nv-yunzheq in https://github.com/flashinfer-ai/flas… | Medium | 4/14/2026 |
| nightly-v0.6.7-20260414 | Automated nightly build for version 0.6.7 (dev20260414) | Medium | 4/14/2026 |
| nightly-v0.6.7-20260413 | Automated nightly build for version 0.6.7 (dev20260413) | Medium | 4/13/2026 |
| nightly-v0.6.7-20260411 | Automated nightly build for version 0.6.7 (dev20260411) | Medium | 4/11/2026 |
| nightly-v0.6.7-20260410 | Automated nightly build for version 0.6.7 (dev20260410) | Medium | 4/10/2026 |
| nightly-v0.6.7-20260408 | Automated nightly build for version 0.6.7 (dev20260408) | Medium | 4/8/2026 |
| nightly-v0.6.7-20260406 | Automated nightly build for version 0.6.7 (dev20260406) | Medium | 4/6/2026 |
| v0.6.7.post3 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post2...v0.6.7.post3 | Medium | 4/6/2026 |
| nightly-v0.6.7-20260405 | Automated nightly build for version 0.6.7 (dev20260405) | Medium | 4/5/2026 |
| nightly-v0.6.7-20260404 | Automated nightly build for version 0.6.7 (dev20260404) | Medium | 4/4/2026 |
| v0.6.7.post2 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post1...v0.6.7.post2 | Medium | 4/4/2026 |
| v0.6.7.post1 | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7...v0.6.7.post1 | Medium | 4/3/2026 |
| nightly-v0.6.7-20260402 | Automated nightly build for version 0.6.7 (dev20260402) | Medium | 4/2/2026 |
| nightly-v0.6.7-20260401 | Automated nightly build for version 0.6.7 (dev20260401) | Medium | 4/1/2026 |
| nightly-v0.6.7-20260331 | Automated nightly build for version 0.6.7 (dev20260331) | Medium | 3/31/2026 |
| nightly-v0.6.7-20260328 | Automated nightly build for version 0.6.7 (dev20260328) | Medium | 3/28/2026 |
| nightly-v0.6.7-20260326 | Automated nightly build for version 0.6.7 (dev20260326) | Medium | 3/26/2026 |
| v0.6.7 | What's Changed: perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in https://github.com/flashinfer-ai/flashinfer/pull/2618; Feat/gdn decode pooled by @xutizhou in https://github.com/flashinfer-ai/flashinfer/pull/2521; fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in https://github.com/flashinfer-ai/flashinfer/pull/2716; Support NVFP4 KV cache decode on SM120 by … | Medium | 3/25/2026 |
| nightly-v0.6.7-20260324 | Automated nightly build for version 0.6.7 (dev20260324) | Medium | 3/24/2026 |
| nightly-v0.6.6-20260323 | Automated nightly build for version 0.6.6 (dev20260323) | Medium | 3/23/2026 |
| nightly-v0.6.6-20260322 | Automated nightly build for version 0.6.6 (dev20260322) | Low | 3/22/2026 |
| nightly-v0.6.6-20260321 | Automated nightly build for version 0.6.6 (dev20260321) | Low | 3/21/2026 |
| nightly-v0.6.6-20260320 | Automated nightly build for version 0.6.6 (dev20260320) | Low | 3/20/2026 |
| nightly-v0.6.6-20260319 | Automated nightly build for version 0.6.6 (dev20260319) | Low | 3/19/2026 |
| nightly-v0.6.6-20260318 | Automated nightly build for version 0.6.6 (dev20260318) | Low | 3/18/2026 |
| nightly-v0.6.6-20260317 | Automated nightly build for version 0.6.6 (dev20260317) | Low | 3/17/2026 |
| nightly-v0.6.6-20260316 | Automated nightly build for version 0.6.6 (dev20260316) | Low | 3/16/2026 |
| nightly-v0.6.6-20260315 | Automated nightly build for version 0.6.6 (dev20260315) | Low | 3/15/2026 |
| nightly-v0.6.6-20260314 | Automated nightly build for version 0.6.6 (dev20260314) | Low | 3/14/2026 |
| nightly-v0.6.6-20260313 | Automated nightly build for version 0.6.6 (dev20260313) | Low | 3/13/2026 |
| nightly-v0.6.6-20260312 | Automated nightly build for version 0.6.6 (dev20260312) | Low | 3/12/2026 |
| v0.6.6 | What's Changed: fix: move ArtifactPath/CheckSumHash imports inside gen_moe_utils_modu… by @dierksen in https://github.com/flashinfer-ai/flashinfer/pull/2681; Enable sm120f compilation by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/2650; Ensure -gencode flags are in deterministic order (for ccache) by @benbarsdell in https://github.com/flashinfer-ai/flashinfer/pull/2674; int16 Block-Scaled State and Stochastic Rounding for SSU (mamba) by @ishovkun in https://github.com/f… | Low | 3/11/2026 |
| nightly-v0.6.5-20260309 | Automated nightly build for version 0.6.5 (dev20260309) | Low | 3/9/2026 |
| nightly-v0.6.5-20260308 | Automated nightly build for version 0.6.5 (dev20260308) | Low | 3/8/2026 |
| nightly-v0.6.5-20260307 | Automated nightly build for version 0.6.5 (dev20260307) | Low | 3/7/2026 |
| nightly-v0.6.5-20260306 | Automated nightly build for version 0.6.5 (dev20260306) | Low | 3/6/2026 |
| nightly-v0.6.5-20260305 | Automated nightly build for version 0.6.5 (dev20260305) | Low | 3/5/2026 |
| nightly-v0.6.5-20260304 | Automated nightly build for version 0.6.5 (dev20260304) | Low | 3/4/2026 |
| v0.6.5 | What's Changed: feat: BF16 GEMM benchmarking support by @raayandhar in https://github.com/flashinfer-ai/flashinfer/pull/2525; [bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length by @huangzhilin-hzl in https://github.com/flashinfer-ai/flashinfer/pull/2489; test: Skip test_decode_delta_rule.py by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/2600; feat: add issue self-claim workflow for external contributors by @jwu1980 in https://github.… | Low | 3/4/2026 |
| nightly-v0.6.4-20260303 | Automated nightly build for version 0.6.4 (dev20260303) | Low | 3/3/2026 |
| nightly-v0.6.4-20260302 | Automated nightly build for version 0.6.4 (dev20260302) | Low | 3/2/2026 |
| nightly-v0.6.4-20260301 | Automated nightly build for version 0.6.4 (dev20260301) | Low | 3/1/2026 |
| nightly-v0.6.4-20260228 | Automated nightly build for version 0.6.4 (dev20260228) | Low | 2/28/2026 |
| nightly-v0.6.4-20260227 | Automated nightly build for version 0.6.4 (dev20260227) | Low | 2/27/2026 |
| nightly-v0.6.4-20260226 | Automated nightly build for version 0.6.4 (dev20260226) | Low | 2/26/2026 |
| nightly-v0.6.4-20260225 | Automated nightly build for version 0.6.4 (dev20260225) | Low | 2/25/2026 |


Similar Packages

- **azure-core**: Microsoft Azure Core Library for Python
- **azure-mgmt-core**: Microsoft Azure Management Core Library for Python
- **azure-monitor-opentelemetry-exporter**: Microsoft Azure Monitor Opentelemetry Exporter Client Library for Python
- **azure-servicebus**: Microsoft Azure Service Bus Client Library for Python
- **azure-monitor-opentelemetry**: Microsoft Azure Monitor Opentelemetry Distro Client Library for Python