High-Performance GPU Kernels for Inference

# flashinfer-python

> FlashInfer: Kernel Library for LLM Serving

- **URL**: https://www.freshcrate.ai/projects/flashinfer-python
- **Author**: FlashInfer team
- **Category**: Developer Tools
- **Latest version**: `v0.6.12` (2026-05-29)
- **License**: Unknown
- **Source**: https://github.com/flashinfer-ai/flashinfer
- **Homepage**: https://pypi.org/project/flashinfer-python/
- **Language**: Python
- **GitHub**: 5,467 stars, 915 forks
- **Registry**: pypi (`flashinfer-python`)
- **Tags**: `pypi`

## Description

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-black-background.png?raw=true">
    <img alt="FlashInfer" src="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-white-background.png?raw=true" width=55%>
  </picture>
</p>
<h1 align="center">
High-Performance GPU Kernels for Inference
</h1>

<p align="center">
| <a href="https://docs.flashinfer.ai"><b>Documentation</b></a> | <a href="https://github.com/flashinfer-ai/flashinfer/releases/latest"><b>Latest Release</b></a> | <a href="https://flashinfer.ai"><b>Blog</b></a> | <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-379wct3hc-D5jR~1ZKQcU00WHsXhgvtA"><b>Slack</b></a> |  <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> |
</p>

[![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/badge/icon)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/)
[![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)

**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

## Why FlashInfer?

- **State-of-the-art Performance**: Optimized kernels for prefill, decode, and mixed batching scenarios
- **Multiple Backends**: Automatically selects the best backend for your hardware and workload
- **Modern Architecture Support**: Covers SM75 (Turing) and later architectures, through Blackwell
- **Low-Precision Compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- **Production-Ready**: CUDAGraph and torch.compile compatible for low-latency serving

## Core Features

### Attention Kernels
- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving
- **Decode, Prefill, and Append**: Optimized kernels for all attention phases
- **MLA Attention**: Native support for DeepSeek's Multi-head Latent Attention (MLA)
- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes
- **Sparse Attention**: Block-sparse and variable block-sparse patterns
- **POD-Attention**: Fused prefill+decode for mixed batching
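
To make the paged KV-cache idea concrete, here is a minimal pure-Python sketch of a CSR-style page table. It is an illustration, not FlashInfer's implementation: the class `PagedKVTable` and its methods are hypothetical, and the field names (`kv_indptr`, `kv_indices`, `kv_last_page_len`) are assumptions modeled on the usual indptr/indices/last-page-length convention for paged caches.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PagedKVTable:
    """Hypothetical CSR-style page table for a batch of requests."""
    page_size: int
    kv_indptr: List[int]         # request i owns kv_indices[kv_indptr[i]:kv_indptr[i + 1]]
    kv_indices: List[int]        # physical page ids inside a shared page pool
    kv_last_page_len: List[int]  # valid tokens in each request's last page

    def seq_len(self, req: int) -> int:
        """Number of cached tokens for request `req`."""
        num_pages = self.kv_indptr[req + 1] - self.kv_indptr[req]
        if num_pages == 0:
            return 0
        # Every page except the last is full.
        return (num_pages - 1) * self.page_size + self.kv_last_page_len[req]

    def token_slots(self, req: int) -> List[Tuple[int, int]]:
        """(physical_page, offset_in_page) for each cached token, in order."""
        pages = self.kv_indices[self.kv_indptr[req]:self.kv_indptr[req + 1]]
        return [
            (pages[t // self.page_size], t % self.page_size)
            for t in range(self.seq_len(req))
        ]


# Two requests sharing one page pool: request 0 uses pages 7 and 2,
# request 1 uses page 5.
table = PagedKVTable(
    page_size=4,
    kv_indptr=[0, 2, 3],
    kv_indices=[7, 2, 5],
    kv_last_page_len=[3, 1],
)
print(table.seq_len(0), table.token_slots(1))
```

Because each request only records which pages it owns, sequences of different lengths can grow without reserving contiguous memory up front.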

### GEMM & Linear Operations
- **BF16 GEMM**: BF16 matrix multiplication for SM 10.0+ GPUs
- **FP8 GEMM**: Per-tensor and groupwise scaling
- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing
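
The difference between per-tensor and groupwise FP8 scaling can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not FlashInfer's kernel code: the helper names are hypothetical, the scale convention (`scale = amax / FP8_MAX`, quantized value = `value / scale`) is one common choice, and 448 is the largest finite value of the FP8 E4M3 format.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(values: List[float]) -> float:
    """One scale for the whole tensor, chosen so the largest value maps to FP8_E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def groupwise_scales(values: List[float], group_size: int) -> List[float]:
    """One scale per contiguous group of `group_size` elements."""
    return [
        per_tensor_scale(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


weights = [0.5, -1.0, 0.25, 2.0, 100.0, -448.0, 3.0, 4.0]
print(per_tensor_scale(weights))     # a single scale dominated by the outlier -448.0
print(groupwise_scales(weights, 4))  # the first group gets a much finer scale
```

With a single per-tensor scale, one outlier forces every small value into a coarse range; groupwise scaling confines that effect to the outlier's own group.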

### Mixture of Experts (MoE)
- **Fused MoE Kernels**
- **Multiple Routing Methods**: DeepSeek-V3, Llama-4, and standard top-k routing
- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling
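
As a reference point for the routing step, the sketch below shows standard top-k routing in plain Python: softmax over router logits, keep the `k` largest experts, and renormalize their weights. It is an assumption-laden illustration (the function `topk_route` is hypothetical, and renormalizing after selection is only one variant); it does not model the DeepSeek-V3 or Llama-4 routing methods or the fused kernels.

```python
import math
from typing import List, Tuple


def topk_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Numerically stable softmax over the router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Select the k most probable experts (sorting is fine for a CPU reference).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


routes = topk_route([2.0, 0.0, 1.0, -1.0], k=2)
print(routes)
```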

### Sampling & Decoding
- **Sorting-Free Sampling**: Efficient Top-K, Top-P, and Min-P without sorting
- **Speculative Decoding**: Chain speculative sampling support
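
One way to sample from a Top-P nucleus without sorting is rejection sampling with a moving pivot. The following CPU sketch illustrates that idea under simplifying assumptions; it is not the FlashInfer GPU kernel, and `top_p_sample` is a hypothetical name. A token is accepted when the probability mass strictly above it is less than `top_p`; otherwise its probability becomes the new pivot and everything at or below the pivot is excluded from the next round.

```python
import random
from typing import List


def top_p_sample(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Sample a token index from the top-p nucleus without sorting `probs`."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    pivot = 0.0
    while True:
        # Only tokens strictly above the pivot remain eligible.
        candidates = [(i, p) for i, p in enumerate(probs) if p > pivot]
        mass = sum(p for _, p in candidates)

        # Inverse-transform sampling over the eligible tokens.
        u = rng.random() * mass
        acc = 0.0
        choice = candidates[-1][0]  # fallback guards against rounding at the tail
        for i, p in candidates:
            acc += p
            if u < acc:
                choice = i
                break

        # Accept if the mass of strictly more likely tokens is below top_p.
        mass_above = sum(p for p in probs if p > probs[choice])
        if mass_above < top_p:
            return choice
        pivot = probs[choice]  # reject: drop this token and everything below it


rng = random.Random(0)
samples = [top_p_sample([0.5, 0.3, 0.15, 0.05], 0.7, rng) for _ in range(500)]
print(sorted(set(samples)))
```

Each rejection strictly shrinks the candidate set while never removing a nucleus token, so the loop terminates and accepted samples stay proportional to the original probabilities within the nucleus.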

### Communication
- **AllReduce**: Custom implementations
- **Multi-Node NVLink**: MNNVL support for multi-node inference
- **NVSHMEM Integration**: For distributed memory operations

### Other Operators
- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- **Normalization**: RMSNorm, LayerNorm, Gemma-style fused operations
- **Activations**: SiLU, GELU with fused gating
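
For readers who want a numerical reference for two of these operators, here is a small pure-Python sketch of RMSNorm and a SiLU-gated activation. The function names are hypothetical and the code only states the math; it says nothing about how FlashInfer fuses or launches these operations.

```python
import math
from typing import List


def rmsnorm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i"""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def silu_gate(gate: List[float], up: List[float]) -> List[float]:
    """Gated activation: silu(gate) * up, computed elementwise."""
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rmsnorm([3.0, 4.0], [1.0, 1.0], eps=0.0)
gated = silu_gate([0.0, 1.0], [5.0, 2.0])
print(normed, gated)
```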

## GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|-------------------|------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 11.0 | Jetson Thor |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |

> **Note:** Not all features are supported across all compute capabilities.
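
A quick way to check a device against the table is to look up its compute capability. The sketch below hard-codes the table above in a dictionary; `SUPPORTED_ARCHS` and `architecture_for` are hypothetical names, and nothing here is queried from FlashInfer. If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.

```python
from typing import Optional, Tuple

# Transcribed from the GPU Support table; not obtained from FlashInfer itself.
SUPPORTED_ARCHS = {
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (10, 0): "Blackwell",
    (10, 3): "Blackwell",
    (11, 0): "Blackwell",
    (12, 0): "Blackwell",
    (12, 1): "Blackwell",
}


def architecture_for(capability: Tuple[int, int]) -> Optional[str]:
    """Return the architecture name for a (major, minor) capability, or None if unlisted."""
    return SUPPORTED_ARCHS.get(tuple(capability))


print(architecture_for((9, 0)))  # listed in the table
print(architecture_for((7, 0)))  # not listed in the table
```

A listed capability only means the architecture appears in the table; as the note says, individual features may still be unavailable on it.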

## News

Latest: [![GitHub Release](https://img.shields.io/github/v/release/flashinfer-ai/flashinfer)](https://github.com/flashinfer-ai/flashinfer/releases/latest)

Notable updates:
- [2025-10-08] Blackwell support added in [v0.4.0](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.4.0)
- [2025-03-10] [Blog Post](https://flashinfer.ai/2025/03/10/sampling.html) Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.

## Getting Started

### Installation

**Quickstart:**

```bash
pip install flashinfer-python
```
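
To confirm which version pip installed, the standard library's `importlib.metadata` can be queried by distribution name. This is a generic Python check, not a FlashInfer API; the helper `installed_version` is a hypothetical name.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "flashinfer-python") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "flashinfer-python is not installed")
```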

**Package Options:**

- **flashinfer-python**: Core package that compiles/downloads kernels on first use
- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions

**For faster initialization and offline usage**, install the optional packages so that most kernels are available pre-compiled.

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v0.6.12` | 2026-05-29 | High | **What's Changed:**<br>• Loosened trtllm_ragged_attention_deepseek shape assertion by @nvjullin in https://github.com/flashinfer-ai/flashinfer/pull/3064<br>• Update moe gemm by @IwakuraRein in https://github.com/flashinfer-ai/flashinfer/pull/3239<br>• perf: optimize per-token nvfp4 quantization kernel. by @IwakuraRein in https://github.com/flashinfer-ai/flashinfer/pull/3237<br>• build: add sccache-backed jit-cache builds and AOT diagnostics by @dierksen in https://github.com/flashinfer-ai/flashinfer/pull/3205 |
| `v0.6.12rc1` | 2026-05-22 | High | **What's Changed:**<br>• Loosened trtllm_ragged_attention_deepseek shape assertion by @nvjullin in https://github.com/flashinfer-ai/flashinfer/pull/3064<br>• Update moe gemm by @IwakuraRein in https://github.com/flashinfer-ai/flashinfer/pull/3239<br>• perf: optimize per-token nvfp4 quantization kernel. by @IwakuraRein in https://github.com/flashinfer-ai/flashinfer/pull/3237<br>• build: add sccache-backed jit-cache builds and AOT diagnostics by @dierksen in https://github.com/flashinfer-ai/flashinfer/pull/3205 |
| `v0.6.11.post3` | 2026-05-15 | High | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.11.post2...v0.6.11.post3 |
| `v0.6.11` | 2026-05-09 | High | **What's Changed:**<br>• trying this one character fix for main branch by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/3213<br>• Add git submodule update to build_backend.py by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/3190<br>• fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic enumeration by @leejnau in https://github.com/flashinfer-ai/flashinfer/pull/3171<br>• Fix trace-bmm-fp8 test: B should be K-major for subword types by @xrq-phys in https://github.com/flashinfer- |
| `v0.6.10` | 2026-05-04 | High | **What's Changed:**<br>• Vendor CCCL v3.3.2 from GitHub instead of relying on CTK-bundled copy by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/3091<br>• [Fmha] Add head_dim=512 support for trtllm attention kernels by @djmmoss in https://github.com/flashinfer-ai/flashinfer/pull/2959<br>• perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90 by @samuellees in https://github.com/flashinfer-ai/flashinfer/pull/3084<br>• Add support for the combinations of allreduce, allgather, and red |
| `v0.6.10rc1` | 2026-04-30 | High | **What's Changed:**<br>• Vendor CCCL v3.3.2 from GitHub instead of relying on CTK-bundled copy by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/3091<br>• [Fmha] Add head_dim=512 support for trtllm attention kernels by @djmmoss in https://github.com/flashinfer-ai/flashinfer/pull/2959<br>• perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90 by @samuellees in https://github.com/flashinfer-ai/flashinfer/pull/3084<br>• Add support for the combinations of allreduce, allgather, and red |
| `v0.6.9` | 2026-04-24 | High | **What's Changed:**<br>• feat: Add backend="b12x" for mm_fp4 on SM120 by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/3051<br>• docs: document MAX_JOBS env var and its interaction with FLASHINFER_N… by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/3060<br>• PR #2772 might have introduced a device side compilation regression by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/3056<br>• [feat] Add routing_replay_out support to MoE kernels and Python API by @TomerBN-Nvidi |
| `0.6.8.post1` | 2026-04-21 | Low | Imported from PyPI (0.6.8.post1) |
| `nightly-v0.6.8-20260421` | 2026-04-21 | High | Automated nightly build for version 0.6.8 (dev20260421) |
| `v0.6.8.post1` | 2026-04-18 | High | **Full Changelog**: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.8...v0.6.8.post1 |

## Citation

- HTML: https://www.freshcrate.ai/projects/flashinfer-python
- Markdown: https://www.freshcrate.ai/projects/flashinfer-python.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/flashinfer-python/deps

_Generated by freshcrate.ai. Indexes pypi releases for AI-agent ecosystem packages._