freshcrate

torchao

Package for applying architecture optimization (ao) techniques, such as quantization and sparsity, to GPU models

Description

<div align="center">

# TorchAO

</div>

### PyTorch-Native Training-to-Serving Model Optimization

- Pre-train Llama-3.1-70B **1.5x faster** with float8 training
- Recover **67% of quantized accuracy degradation** on Gemma3-4B with QAT
- Quantize Llama-3-8B to int4 for **1.89x faster** inference with **58% less memory**

<div align="center">

[![](https://img.shields.io/badge/CodeML_%40_ICML-2025-blue)](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf)
[![](https://dcbadge.vercel.app/api/server/gpumode?style=flat&label=TorchAO%20in%20GPU%20Mode)](https://discord.com/channels/1189498204333543425/1205223658021458100)
[![](https://img.shields.io/github/contributors-anon/pytorch/ao?color=yellow&style=flat-square)](https://github.com/pytorch/ao/graphs/contributors)
[![](https://img.shields.io/badge/torchao-documentation-blue?color=DE3412)](https://docs.pytorch.org/ao/stable/index.html)
[![license](https://img.shields.io/badge/license-BSD_3--Clause-lightgrey.svg)](./LICENSE)

[Latest News](#-latest-news) | [Overview](#-overview) | [Quick Start](#-quick-start) | [Installation](#-installation) | [Integrations](#-integrations) | [Inference](#-inference) | [Training](#-training) | [Videos](#-videos) | [Citation](#-citation)

</div>

## 📣 Latest News

- [Oct 25] QAT is now integrated into [Unsloth](https://docs.unsloth.ai/new/quantization-aware-training-qat) for both full and LoRA fine-tuning! Try it out using [this notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_%284B%29_Instruct-QAT.ipynb).
- [Oct 25] The MXFP8 MoE training prototype achieved a **~1.45x speedup** for the MoE layer in Llama4 Scout and a **~1.25x speedup** for the MoE layer in DeepSeekV3 671B, with comparable numerics to bfloat16! Check out the [docs](./torchao/prototype/moe_training/) to try it out.
- [Sept 25] MXFP8 training achieved a [1.28x speedup on a Crusoe B200 cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/) with a virtually identical loss curve to bfloat16!
- [Sept 19] [TorchAO Quantized Models and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
- [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!

<details>
<summary>Older news</summary>

- [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))!
- [Apr 25] Float8 rowwise training yielded a [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale
- [Apr 25] TorchAO was added as a [quantization backend to vLLM](https://docs.vllm.ai/en/latest/features/quantization/torchao.html) ([docs](https://docs.vllm.ai/en/latest/features/quantization/torchao.html))!
- [Mar 25] Our [2:4 Sparsity paper](https://openreview.net/pdf?id=O5feVk7p6Y) was accepted to SLLM @ ICLR 2025!
- [Jan 25] Our [integration with GemLite and SGLang](https://pytorch.org/blog/accelerating-llm-inference/) yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes
- [Jan 25] We added [1-8 bit ARM CPU kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for linear and embedding ops
- [Nov 24] We achieved [1.43-1.51x faster pre-training](https://pytorch.org/blog/training-using-float8-fsdp2/) on Llama-3.1-70B and 405B using float8 training
- [Oct 24] TorchAO was added as a quantization backend to HF Transformers!
- [Sep 24] We officially launched TorchAO.
Check out our blog [here](https://pytorch.org/blog/pytorch-native-architecture-optimization/)!
- [Jul 24] QAT [recovered up to 96% of the accuracy degradation](https://pytorch.org/blog/quantization-aware-training/) from quantization on Llama-3-8B
- [Jun 24] Semi-structured 2:4 sparsity [achieved a 1.1x inference speedup and a 1.3x training speedup](https://pytorch.org/blog/accelerating-neural-network-training/) on the SAM and ViT models respectively
- [Jun 24] Block sparsity [achieved a 1.46x training speedup](https://pytorch.org/blog/speeding-up-vits/) on the ViT model with a <2% drop in accuracy

</details>

## 🌅 Overview

TorchAO is an easy-to-use quantization library for native PyTorch. TorchAO works out of the box with `torch.compile()` and `FSDP2` across most HuggingFace PyTorch models. For a detailed overview of stable and prototype workflows for different hardware and dtypes, see the [Workflows documentation](https://docs.pytorch.org/ao/main/workflows.html). Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!

## 🚀 Quick Start

First, install TorchAO. We recommend installing the latest stable version:

```bash
pip install torchao
```

Quantize your model weights to int4!
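The quick-start code itself is cut off in this listing, so as a rough illustration of the arithmetic behind int4 weight-only quantization, here is a hedged sketch in plain NumPy. This is not torchao's implementation; the group size of 32 and the helper names are assumptions made for this example only.

```python
import numpy as np

def int4_groupwise_quant(w, group_size=32):
    """Symmetric group-wise int4 quantization (illustrative sketch only).

    Each group of `group_size` weights shares one float scale; the
    quantized values live in the signed int4 range [-8, 7].
    """
    groups = w.reshape(-1, group_size)
    # Map each group's max magnitude onto 7 (the largest positive int4 value).
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def int4_groupwise_dequant(q, scale):
    # Real int4 kernels pack two values per byte and fuse this
    # dequantization into the matmul on GPU; here we just undo the scale.
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, scale = int4_groupwise_quant(w)
w_hat = int4_groupwise_dequant(q, scale)
max_err = float(np.abs(w - w_hat).max())
```

The rounding error per weight is at most half a quantization step (scale / 2), which is why smaller groups improve accuracy at the cost of storing more scales.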

Release History

**0.17.0** (urgency: Low, 4/21/2026)
Imported from PyPI (0.17.0).

**v0.17.0** (urgency: Medium, 3/30/2026)
Adds support for CuteDSL MXFP8 MoE kernels, per-head FP8 quantized low-precision attention, ABI stability, and more! A new CuteDSL MXFP8 quantization kernel for 3d expert weights writes scale factors directly to blocked layout for tensor cores: [https://github.com/pytorch/ao/pull/4090](https://github.com/pytorch/ao/pull/4090). Used for scaling along dim1 in…

**v0.16.0** (urgency: Low, 2/10/2026)
Adds MXFP8 MoE building blocks for training with expert parallelism, and deprecates older versions of some configs and less-used quantization options to keep torchao leaner. Also revamps the [doc page](https://docs.pytorch.org/ao/main/) and [README](https://github.com/pytorch/ao/blob/main/README.md), with progress toward making torchao [ABI stable](https://github.com/pytorch/ao/issues/3516).

**v0.15.0** (urgency: Low, 12/22/2025)
Adds: MXFP8 MoE training demonstrating a 1.2x end-to-end training speedup with identical convergence versus bf16, training Llama4 Scout on a 64-node GB200 Crusoe cluster; MXFP8 MoE kernels shipped with torchao builds for CUDA 12.8+ (just pip install instead of building from source); safetensors enablement; and quantization with parameter-level targeting. …

**v0.14.1** (urgency: Low, 10/13/2025)
Adds support for MoE training on Blackwell GPUs and NVFP4 QAT. (Prototype) MoE training on Blackwell GPUs: a quantized building block, torchao's `_scaled_grouped_mm`, a differentiable drop-in replacement for `torch._grouped_mm` that dynamically quantizes inputs using the given recipe and performs a scaled grouped GEMM…

**v0.13.0-rc8** (urgency: Low, 9/2/2025)
Adds numerous QAT improvements, faster MXFP8 pretraining, and more. Simpler multi-step QAT API ([https://github.com/pytorch/ao/pull/2629](https://github.com/pytorch/ao/pull/2629)): a new, simpler multi-step QAT API that uses only a single config. Users can now specify the target post-training quantization (PTQ) config as the base config and we will automatically inf…

**v0.12.0** (urgency: Low, 7/17/2025)
Adds a QAT + Axolotl integration and prototype MXFP/NVFP support on Blackwell GPUs. TorchAO's QAT support has been integrated into Axolotl's fine-tuning recipes; check out the docs [here](https://docs.axolotl.ai/docs/qat.html) or run it yourself with `axolotl train examples/llama-3/3b-qat-fsdp2.yaml` and `axolotl quantize exampl…`

**v0.11.0** (urgency: Low, 5/9/2025)
Adds mixture-of-experts (MoE) quantization, PyTorch 2 Export Quantization (PT2E), and a microbenchmarking framework for inference APIs. MoE quantization: a prototype feature for quantizing MoE modules with a number of TorchAO quantization techniques, leveraging the existing TorchAO features for quantizing linear ops so they can be used to quantize MoE modules.

**v0.10.0** (urgency: Low, 4/7/2025)
Adds end-to-end training for MXFP8 on NVIDIA B200, PARQ (for quantization-aware training), a module-swap quantization API for research, and updates for low-bit kernels. Low-bit optimizers moved to official support ([https://github.com/pytorch/ao/pull/1864](https://github.com/pytorch/ao/pull/1864)): [low-bit optimizers](https://github.com/pytorch/ao/releases/tag/v0.4.0) (added in 0.4…

**v0.9.0** (urgency: Low, 2/28/2025)
Moves a number of sparsity techniques out of prototype, significantly overhauls the `quantize_` API, and adds a new CUTLASS kernel for 4-bit dynamic quantization. Block sparsity was promoted out of `torchao.prototype` with several performance improvements; you can accelerate your models with block sparsity via `from torchao.sparsity imp…`

**v0.8.0** (urgency: Low, 1/15/2025)
Ships the first CUTLASS kernel in torchao, adding support for a W4A8 linear operator, plus TTFT benchmarks comparing quantization + sparsity speedups for prefill/decoding. The new W4A8 linear operator corresponds to `int8_dynamic_activation_int4_weight` quantization, where two 4-bit weigh…

**v0.7.0-rc3** (urgency: Low, 12/6/2024)
Moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds experimental kernels such as Marlin QQQ (for CUDA), `int8_dynamic_activation_intx_weight` (for ARM CPU), and more. QAT has been moved out of prototype to `torchao/quantization/qat` (#1020, #1085, #1152, #1037) to provide better AP…

**v0.6.1** (urgency: Low, 10/21/2024)
Adds Auto-Round support, float8 axiswise-scaled training, a BitNet training recipe, an implementation of AWQ, and much more. Auto-Round (#581) is a new weight-only quantization algorithm that has achieved superior accuracy compared to [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), and [OmniQuant](https://arxiv.org/abs/2308.13137) acro…

**v0.5.0** (urgency: Low, 9/8/2024)
Adds memory-efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through Bayesian optimization, sparse Marlin, and integrations with HuggingFace, SGLang, and diffusers. Memory-efficient inference support (https://github.com/pytorch/ao/pull/738): added Llama 3.1 to the llama benchmarks in TorchAO and added new fe…

**v0.4.0** (urgency: Low, 8/7/2024)
Adds KV cache quantization, quantization-aware training (QAT), low-bit optimizer support, composing quantization and sparsity, and more. KV cache quantization (https://github.com/pytorch/ao/pull/532) shows a peak memory reduction from 19.7 GB to 19.2 GB on Llama3-8B at an 8192 context length; we plan to investigate Llama3.1 next.

**v0.3.0** (urgency: Low, 6/26/2024)
Adds a new `quantize` API, MX format, FP6 dtype and bitpacking, 2:4 sparse accelerated training, and benchmarking infra for llama2/llama3 models. The tensor-subclass-based `quantize` API (https://github.com/pytorch/ao/pull/256) is documented in the [quantization docs](https://github.com/pytorch/ao/tree/main/torchao/quantization) and README; this is planned to rep…

**v0.2.0** (urgency: Low, 5/20/2024)
Custom CPU/CUDA extension to ship CPU/CUDA binaries. PyTorch core recently shipped a new custom-op registration mechanism, [torch.library](https://pytorch.org/docs/stable/library.html), whose benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably NOT graph-breaking with `torch.compile()`. We've added documentation on registering your own custom ops: https://github.com/pytorch/ao/t…

**v0.1** (urgency: Low, 4/4/2024)
TorchAO is a repository hosting architecture optimization techniques, such as quantization and sparsity, and performance kernels on backends such as CUDA and CPU. This release adds support for quantization techniques like int4 weight-only GPTQ quantization, nf4 dtype support for QLoRA, and sparsity features like WandaSparsifier, plus an autotuner that can tune triton integer matrix multiplicat…
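Several of the release entries above revolve around float8 training, whose core step is dynamically scaling a tensor into the representable range of a low-precision format. Here is a minimal NumPy sketch of tensor-wise dynamic scaling. It illustrates the general technique only, not torchao's code; the function name is invented for this example, and 448.0 is the largest finite magnitude of the float8 e4m3 format.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3

def dynamic_scale_to_float8_range(x):
    """Tensor-wise dynamic scaling, the core step of float8 recipes.

    Compute a scale that maps the tensor's max magnitude onto the e4m3
    limit, scale into range, and keep the scale so a matmul epilogue can
    divide it back out. (The mantissa rounding of a real float8 cast is
    omitted here; this shows only the scaling arithmetic.)
    """
    amax = float(np.abs(x).max())
    scale = E4M3_MAX / max(amax, 1e-12)  # guard against all-zero tensors
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    return x_scaled, scale

x = np.array([-3.0, -0.5, 0.0, 1.25, 2.0], dtype=np.float32)
x_scaled, scale = dynamic_scale_to_float8_range(x)
x_restored = x_scaled / scale  # epilogue: divide the scale back out
```

Per-tensor scaling like this is the simplest recipe; the rowwise and MXFP8 variants mentioned in the releases apply the same idea at finer granularity (per row, or per small block with shared scales).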


Similar Packages

- azure-search-documents: Microsoft Azure Cognitive Search Client Library for Python
- apache-tvm-ffi 0.1.10: tvm ffi
- luqum 1.0.0: A Lucene query parser generating ElasticSearch queries and more!
- banks 2.4.1: A prompt programming language
- tensorflowjs 4.22.0: (no description)