# ctranslate2

> Fast inference engine for Transformer models

- **URL**: https://www.freshcrate.ai/projects/ctranslate2
- **Author**: OpenNMT
- **Category**: Frameworks
- **Latest version**: `v4.7.2` (2026-05-19)
- **License**: MIT
- **Source**: https://github.com/OpenNMT/CTranslate2
- **Homepage**: https://opennmt.net
- **Language**: C++
- **GitHub**: 4,444 stars, 473 forks
- **Registry**: pypi (`ctranslate2`)
- **Tags**: `cuda`, `inference`, `machine`, `mkl`, `neural`, `nmt`, `opennmt`, `pypi`, `translation`

## Description

[![CI](https://github.com/OpenNMT/CTranslate2/workflows/CI/badge.svg)](https://github.com/OpenNMT/CTranslate2/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/ctranslate2.svg)](https://badge.fury.io/py/ctranslate2) [![Documentation](https://img.shields.io/badge/docs-latest-blue.svg)](https://opennmt.net/CTranslate2/) [![Gitter](https://badges.gitter.im/OpenNMT/CTranslate2.svg)](https://gitter.im/OpenNMT/CTranslate2?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) [![Forum](https://img.shields.io/discourse/status?server=https%3A%2F%2Fforum.opennmt.net%2F)](https://forum.opennmt.net/)


CTranslate2 is a C++ and Python library for efficient inference with Transformer models.

The project implements a custom runtime that applies many performance optimization techniques, such as weight quantization, layer fusion, and batch reordering, to [accelerate and reduce the memory usage](#benchmarks) of Transformer models on CPU and GPU.
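
As a rough illustration of one optimization named above, batch reordering groups inputs of similar length so that less padding has to be computed. The sketch below is plain Python and is not CTranslate2's internal implementation; the helper names `padded_size` and `make_batches` are hypothetical.

```python
def padded_size(batch):
    """Number of token slots once every sequence is padded to the longest one."""
    return len(batch) * max(len(seq) for seq in batch)


def make_batches(sequences, batch_size, sort_by_length):
    """Split sequences into fixed-size batches, optionally sorting by length first."""
    ordered = sorted(sequences, key=len) if sort_by_length else list(sequences)
    return [
        ordered[start:start + batch_size]
        for start in range(0, len(ordered), batch_size)
    ]


# Toy inputs: token lists of very different lengths.
inputs = [["tok"] * n for n in (2, 30, 3, 28, 1, 31)]

naive = sum(padded_size(b) for b in make_batches(inputs, 3, sort_by_length=False))
reordered = sum(padded_size(b) for b in make_batches(inputs, 3, sort_by_length=True))
print(naive, reordered)  # prints: 183 102
```

A real runtime would also record the original positions so that results can be returned in input order; that bookkeeping is omitted here.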

The following model types are currently supported:

* Encoder-decoder models: Transformer base/big, M2M-100, NLLB, BART, mBART, Pegasus, T5, Whisper, T5Gemma
* Decoder-only models: GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT, Llama, Mistral, Gemma, CodeGen, GPTBigCode, Falcon, Qwen2
* Encoder-only models: BERT, DistilBERT, XLM-RoBERTa

Compatible models must first be converted into an optimized model format. The library includes converters for multiple frameworks:

* [OpenNMT-py](https://opennmt.net/CTranslate2/guides/opennmt_py.html)
* [OpenNMT-tf](https://opennmt.net/CTranslate2/guides/opennmt_tf.html)
* [Fairseq](https://opennmt.net/CTranslate2/guides/fairseq.html)
* [Marian](https://opennmt.net/CTranslate2/guides/marian.html)
* [OPUS-MT](https://opennmt.net/CTranslate2/guides/opus_mt.html)
* [Transformers](https://opennmt.net/CTranslate2/guides/transformers.html)
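
The conversion step can also be driven from Python. The following is a minimal sketch, assuming the `ctranslate2.converters.TransformersConverter` class and the `Translator.translate_batch` method from the Python API; the wrapper names `convert_transformers_model` and `translate_tokens` are hypothetical, and converting a Transformers model additionally requires the `transformers` and `torch` packages.

```python
def convert_transformers_model(model_name, output_dir, quantization=None):
    """Convert a Hugging Face Transformers model to the CTranslate2 format."""
    # Imported lazily: conversion needs extra packages that inference does not.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model_name)
    # force=True overwrites the output directory if it already exists.
    return converter.convert(output_dir, quantization=quantization, force=True)


def translate_tokens(model_dir, source_tokens, device="cpu"):
    """Translate one pre-tokenized sentence with a converted model."""
    import ctranslate2

    translator = ctranslate2.Translator(model_dir, device=device)
    results = translator.translate_batch([source_tokens])
    # Each result carries a list of hypotheses; take the first one.
    return results[0].hypotheses[0]
```

A call might look like `convert_transformers_model("Helsinki-NLP/opus-mt-en-de", "opus-mt-en-de", quantization="int8")`, where the model name and output directory are only examples. Tokenization and detokenization are left to the caller, since the translator works on token lists.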

The project is production-oriented and comes with [backward compatibility guarantees](https://opennmt.net/CTranslate2/versioning.html), but it also includes experimental features related to model compression and inference acceleration.

## Key features

* **Fast and efficient execution on CPU and GPU**<br/>The execution [is significantly faster and requires fewer resources](#benchmarks) than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanisms, etc.
* **Quantization and reduced precision**<br/>The model serialization and computation support weights with [reduced precision](https://opennmt.net/CTranslate2/quantization.html): 16-bit floating points (FP16), 16-bit brain floating points (BF16), 16-bit integers (INT16), 8-bit integers (INT8) and AWQ quantization (INT4).
* **Multiple CPU architectures support**<br/>The project supports x86-64 and AArch64/ARM64 processors and integrates multiple backends that are optimized for these platforms: [Intel MKL](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html), [oneDNN](https://github.com/oneapi-src/oneDNN), [OpenBLAS](https://www.openblas.net/), [Ruy](https://github.com/google/ruy), and [Apple Accelerate](https://developer.apple.com/documentation/accelerate).
* **Automatic CPU detection and code dispatch**<br/>One binary can include multiple backends (e.g. Intel MKL and oneDNN) and instruction set architectures (e.g. AVX, AVX2) that are automatically selected at runtime based on the CPU information.
* **Parallel and asynchronous execution**<br/>Multiple batches can be processed in parallel and asynchronously using multiple GPUs or CPU cores.
* **Dynamic memory usage**<br/>The memory usage changes dynamically depending on the request size while still meeting performance requirements thanks to caching allocators on both CPU and GPU.
* **Lightweight on disk**<br/>Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
* **Simple integration**<br/>The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs.
* **Configurable and interactive decoding**<br/>[Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
* **Tensor parallelism for distributed inference**<br/>Very large models can be split across multiple GPUs. Follow this [documentation](docs/parallel.md#model-and-tensor-parallelism) to set up the required environment.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.
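
To make the "4 times smaller" figure concrete, the sketch below quantizes a few 32-bit float weights to 8-bit integers and compares storage sizes. It is an illustrative symmetric per-tensor scheme written in plain Python, not necessarily the scheme CTranslate2 uses internally; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to signed 8-bit integers."""
    # One scale for the whole tensor, chosen so the largest magnitude maps to 127.
    max_abs = max(abs(w) for w in weights)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = array("b", (round(w * scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized weights."""
    return [q / scale for q in quantized]


weights = array("f", [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, 0.61, -0.88])
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# float32 takes 4 bytes per value, int8 takes 1 byte per value.
size_ratio = (len(weights) * weights.itemsize) / (len(quantized) * quantized.itemsize)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(size_ratio, max_error)
```

The ratio ignores the small overhead of storing the scale itself, and the rounding error per weight is bounded by half a quantization step (`0.5 / scale`).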

## Installation and usage

CTranslate2 can be installed with pip:

```bash
pip install ctranslate2
```
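
After installation, it can be useful to check which compute types the local build supports, since the available reduced-precision modes depend on the hardware. This is a small sketch assuming the module-level helpers `ctranslate2.get_supported_compute_types` and `ctranslate2.get_cuda_device_count`; the function name `describe_runtime` is hypothetical.

```python
def describe_runtime():
    """Report the installed version and the compute types each device supports.

    Returns None when the package cannot be imported.
    """
    try:
        import ctranslate2
    except ImportError:
        return None

    info = {
        "version": ctranslate2.__version__,
        "cpu": sorted(ctranslate2.get_supported_compute_types("cpu")),
    }
    # Only query CUDA when at least one GPU is visible to the library.
    if ctranslate2.get_cuda_device_count() > 0:
        info["cuda"] = sorted(ctranslate2.get_supported_compute_types("cuda"))
    return info


print(describe_runtime())
```
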

The Python module is used to convert models and can translate or generate text with a few lines of code:

```python
translator = ctranslate2.Translator(tra
```

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v4.7.2` | 2026-05-19 | High | [v4.7.2](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.7.2) (2026-05-18)<br/>**New features:** Gemma4 support for dense model (#2048) by [@jordimas](https://github.com/jordimas)<br/>**Fixes and improvements:** Gemma 3 model conversion fixes (#2037) by [@jordimas](https://github.com/jordimas); Update source ROCM version from 7.2 to 7.2.1 (#2030) by [@racedale](https://github.com/racedale); Free curand states before the thread is destroyed (#1912) by [@no1d](https://github.com |
| `4.7.1` | 2026-04-21 | Low | Imported from PyPI (4.7.1) |
| `v4.7.1` | 2026-02-04 | Low | **Fixes and improvements:** Fix Windows build (#2007) [@sssshhhhhh](https://github.com/sssshhhhhh) |

## Citation

- HTML: https://www.freshcrate.ai/projects/ctranslate2
- Markdown: https://www.freshcrate.ai/projects/ctranslate2.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/ctranslate2/deps

_Generated by freshcrate.ai. Indexes pypi releases for AI-agent ecosystem packages._