ctranslate2
Fast inference engine for Transformer models
Description
# CTranslate2

CTranslate2 is a C++ and Python library for efficient inference with Transformer models.

The project implements a custom runtime that applies many performance optimization techniques, such as weights quantization, layers fusion, and batch reordering, to [accelerate and reduce the memory usage](#benchmarks) of Transformer models on CPU and GPU.

The following model types are currently supported:

* Encoder-decoder models: Transformer base/big, M2M-100, NLLB, BART, mBART, Pegasus, T5, Whisper, T5Gemma
* Decoder-only models: GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT, Llama, Mistral, Gemma, CodeGen, GPTBigCode, Falcon, Qwen2
* Encoder-only models: BERT, DistilBERT, XLM-RoBERTa

Compatible models must first be converted into an optimized model format. The library includes converters for multiple frameworks:

* [OpenNMT-py](https://opennmt.net/CTranslate2/guides/opennmt_py.html)
* [OpenNMT-tf](https://opennmt.net/CTranslate2/guides/opennmt_tf.html)
* [Fairseq](https://opennmt.net/CTranslate2/guides/fairseq.html)
* [Marian](https://opennmt.net/CTranslate2/guides/marian.html)
* [OPUS-MT](https://opennmt.net/CTranslate2/guides/opus_mt.html)
* [Transformers](https://opennmt.net/CTranslate2/guides/transformers.html)

The project is production-oriented and comes with [backward compatibility guarantees](https://opennmt.net/CTranslate2/versioning.html), but it also includes experimental features related to model compression and inference acceleration.
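The weights quantization mentioned above can be illustrated with a small, self-contained sketch. This is a generic symmetric INT8 scheme for illustration only, not CTranslate2's actual implementation; the function names are hypothetical.

```python
# Generic sketch of symmetric per-tensor INT8 weight quantization.
# It illustrates why INT8 storage is ~4x smaller than FP32 (1 byte per
# value instead of 4). This is NOT CTranslate2's kernel, just the common scheme.

def quantize_int8(weights):
    """Map float weights to int8 values with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error is bounded by half the quantization step.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

In practice such schemes are applied per row or per channel rather than per tensor, which is one reason quantized models can keep accuracy loss minimal.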
## Key features

* **Fast and efficient execution on CPU and GPU**<br/>The execution [is significantly faster and requires fewer resources](#benchmarks) than general-purpose deep learning frameworks on supported models and tasks, thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanisms, etc.
* **Quantization and reduced precision**<br/>The model serialization and computation support weights with [reduced precision](https://opennmt.net/CTranslate2/quantization.html): 16-bit floating points (FP16), 16-bit brain floating points (BF16), 16-bit integers (INT16), 8-bit integers (INT8), and AWQ quantization (INT4).
* **Support for multiple CPU architectures**<br/>The project supports x86-64 and AArch64/ARM64 processors and integrates multiple backends that are optimized for these platforms: [Intel MKL](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html), [oneDNN](https://github.com/oneapi-src/oneDNN), [OpenBLAS](https://www.openblas.net/), [Ruy](https://github.com/google/ruy), and [Apple Accelerate](https://developer.apple.com/documentation/accelerate).
* **Automatic CPU detection and code dispatch**<br/>One binary can include multiple backends (e.g. Intel MKL and oneDNN) and instruction set architectures (e.g. AVX, AVX2) that are automatically selected at runtime based on the CPU information.
* **Parallel and asynchronous execution**<br/>Multiple batches can be processed in parallel and asynchronously using multiple GPUs or CPU cores.
* **Dynamic memory usage**<br/>The memory usage changes dynamically depending on the request size while still meeting performance requirements, thanks to caching allocators on both CPU and GPU.
* **Lightweight on disk**<br/>Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
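The batch reordering and padding removal mentioned in the feature list can be sketched in a few lines: sorting inputs by length before batching reduces the padding needed, so less compute is wasted on pad tokens. This is a generic illustration, not CTranslate2's actual scheduler, and all names are hypothetical.

```python
# Sketch of the "batch reordering" idea: group similar-length sequences
# together so each batch pads to a smaller maximum length.

def padded_cells(batch):
    """Total cells (real tokens + padding) when padding to the batch max length."""
    return len(batch) * max(len(seq) for seq in batch)

def make_batches(sequences, batch_size, reorder=True):
    """Split sequences into batches, optionally sorting by length first."""
    order = sorted(sequences, key=len) if reorder else list(sequences)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Toy inputs with alternating short and long sequences.
sequences = [["a"] * n for n in (1, 9, 2, 8, 3, 7, 4, 6)]

naive = sum(padded_cells(b) for b in make_batches(sequences, 4, reorder=False))
sorted_ = sum(padded_cells(b) for b in make_batches(sequences, 4, reorder=True))

assert sorted_ < naive  # reordering wastes fewer cells on padding
```

A real implementation also has to restore the original order of the results after decoding, which is why this is handled inside the runtime rather than left to the caller.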
* **Simple integration**<br/>The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs.
* **Configurable and interactive decoding**<br/>[Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
* **Tensor parallelism for distributed inference**<br/>Very large models can be split across multiple GPUs. Follow this [documentation](docs/parallel.md#model-and-tensor-parallelism) to set up the required environment.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

## Installation and usage

CTranslate2 can be installed with pip:

```bash
pip install ctranslate2
```

The Python module is used to convert models and can translate or generate text with a few lines of code:

```python
translator = ctranslate2.Translator(translation_model_path)  # path to a converted model
translator.translate_batch(tokens)  # tokens: batch of pre-tokenized inputs
```
Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 4.7.1 | Imported from PyPI (4.7.1) | Low | 4/21/2026 |
| v4.7.1 | ### Fixes and improvements * Fix Windows build (#2007) [@sssshhhhhh](https://github.com/sssshhhhhh) | Low | 2/4/2026 |
| v4.7.0 | ### New features * Introduce AMD GPU support with ROCm HIP (#1989) by [@sssshhhhhh](https://github.com/sssshhhhhh) * Compatibility with Transformers v5 (#1999) by [@jordimas](https://github.com/jordimas) ### Fixes and improvements * Assume less about whisper vocab (#2000) by [@sssshhhhhh](https://github.com/sssshhhhhh) * Use LLVM ThreadSanitizer instead of Google (#1993) by [@3manifold](https://github.com/3manifold) * Optimize all builds with parallel execution (#1992) by [@3manifold](ht | Low | 2/3/2026 |
| v4.6.3 | ## [v4.6.3](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.6.3) (2026-01-06) ### New features * T5Gemma model conversion and inference (#1962) by [@jordimas](https://github.com/jordimas) * Support for CUDA 12.8 (#1937, #1940) by [@Purfview](https://github.com/Purfview) * Conv1d pure CUDA implementation (#1949), makes cuDNN an optional dependency by [@jordimas](https://github.com/jordimas) * Add CUDA implementation for median filter (#1917) by [@ja2d8a4v](https://github.com/a2d8a | Low | 1/6/2026 |
| v4.6.2 | ### New features * Qwen 3 support (#1943) by [@jordimas](https://github.com/jordimas) * Gemma 3 text support (#1936) by [@jordimas](https://github.com/jordimas) ### Fixes and improvements * Fixed pkg_resources Deprecated Warning (#1911) by [@thawancomt](https://github.com/thawancomt) * Disable INT8 for sm120 - Blackwell GPUs (#1937) by [@Purfview](https://github.com/Purfview) * FIX: package libctranslate2.so in wheel to avoid build fail (#1920) by [@yzewei](https://github.com/yzewei) | Low | 12/5/2025 |
| v4.6.1 | ### New features * Python 3.14 support (#1926) * Support for CUDA 12.4 (#1925) * Update Intel oneAPI to version 2025.3 (#1931) | Low | 11/7/2025 |
| v4.6.0 | Note: The CTranslate2 Python package now supports Python 3.13 and drops support for Python 3.8. ### New features * Python 3.13 support (#1858) * Support returning hidden vectors in Wav2Vec2 and Wav2Vec2Bert models (#1867) * Add noexecstack linker flags (#1852 + #1861) * Support Qwen2 (#1820) * Eoleconv (#1832) * Add support for RobertModel (#1864) ### Fixes and improvements * Fix GitHub Actions (#1871) * Prevent double library def (#1818) | Low | 4/8/2025 |
| v4.5.0 | Note: The CTranslate2 Python package now supports cuDNN 9 and is no longer compatible with cuDNN 8. ### New features * Support Phi3 (#1800) * Support Mistral Nemo (#1785) * Support Wav2Vec2Bert ASR (#1778) ### Fixes and improvements * Upgrade to cuDNN 9 (#1803) * Fix logits vocab (#1786 + #1791) * Update doc AWQ (#1795) | Low | 10/22/2024 |
| v4.4.0 | **Removed**: Flash Attention support in the Python package due to significant package size increase with minimal performance gain. Note: Flash Attention remains supported in the C++ package with the `WITH_FLASH_ATTN` option. Flash Attention may be re-added in the future if substantial improvements are made. ### New features * Support Llama3 (#1751) * Support Gemma2 (#1772) * Add log probs for all tokens in vocab (#1755) * Grouped conv1d (#1749 + #1758) ### Fixes and improvements | Low | 9/9/2024 |
| v4.3.1 | Note: Because the project exceeded its size limit on PyPI (> 20 GB), the release v4.3.0 could not be pushed. ### Fixes and improvements * Improve the compilation (#1706 and #1705) * Fix position bias in tensor parallel mode (#1714) | Low | 6/11/2024 |
| v4.3.0 | ### New features * Support phi-3 (8k and 128k) (#1700 and #1680) ### Fixes and improvements * Fix regression Flash Attention (#1695) | Low | 5/17/2024 |
| v4.2.1 | Note: Because the package size grew beyond 100 MB, the release v4.2.0 could not be pushed. ### New features * Support load/unload for generator/Whisper Attention (#1670) ### Fixes and improvements * Fix Llama 3 (#1671) | Low | 4/24/2024 |
| v4.2.0 | ### New features * Support Flash Attention (#1651) * Implementation of gemm for FLOAT32 compute type with RUY backend (#1598) * Conv1D quantization on CPU only (the DNNL and CUDA backends are not supported) (#1601) ### Fixes and improvements * Fix tensor parallel bug (#1643) * Use BestSampler when temperature is 0 (#1659) * Fix Gemma bug (#1660) * Optimize loading/unloading time for Translator with cache (#1645) | Low | 4/10/2024 |
| v4.1.1 | ### Fixes and improvements * Fix classifiers in setup.py to push pypi package | Low | 3/12/2024 |
| v4.1.0 | ### New features * Support Gemma Model (#1631) * Support Tensor Parallelism (#1599) ### Fixes and improvements * Avoid initializing unused GPU (#1633) * Read very large tensor by chunk if the size > max value of int (#1636) * Update Readme | Low | 3/11/2024 |
| v4.0.0 | This major version introduces breaking changes while updating to CUDA 12. ## Breaking changes ### Python * Support CUDA 12 ## New features * Add method to_device() in class StorageView in Python to move data between host <-> device ## Fixes and improvements * Implement Conv1D with im2col and GEMM to improve performance * Get tokens in the range of the vocab size for Llama models * Fix loss of performance * Update cibuildwheel to 2.16.5 | Low | 2/15/2024 |
| v3.24.0 | ### New features * Support of new option offset to ignore token score of special tokens | Low | 1/9/2024 |
| v3.23.0 | ### New features * Support Phi model ### Fixes and improvements * Fix the conversion for whisper without the "alignment_heads" in the "generation_config.json" * Fix forward batch | Low | 12/5/2023 |
| v3.22.0 | ### New features * Support "sliding window" and "chunking input" for Mistral ### Fixes and improvements * Take into account the "generation_config.json" and fix "lang_ids" getter for Whisper converter * Accept callback even on "generate_tokens" method * Fix iomp5 linking with latest Intel oneAPI on Ubuntu * Fix "decoder_start_token_id" for T5 | Low | 11/22/2023 |
| v3.21.0 | ### New features * Minimal support for Mistral (loader and rotary extension for long sequences). No sliding window yet * Support Distil-Whisper * Support Whisper-large-v3 | Low | 11/9/2023 |
| v3.20.0 | ## New features * Update the Transformers converter to support more model architectures: * MixFormerSequential (used by microsoft/phi-1_5) * Accept batch inputs in methods `generate_tokens` * Add method `Generator.async_generate_tokens` to return an asynchronous generator compatible with `asyncio` ## Fixes and improvements * Remove the epsilon value in the softmax CPU kernel for consistency with other implementations * Optimize implementation of the Dynamic Time Warping (DTW) fun | Low | 9/18/2023 |
| v3.19.0 | ## Changes * Binary wheels for Python 3.7 are no longer built ## New features * Build wheels for Python 3.12 * Update the Transformers converter to support more model architectures: * Falcon-RW * DistilBERT * Llama with linear RoPE scaling (e.g. Vicuna v1.5) * Llama with a non default RoPE base period (e.g. CodeLlama) * Accept the token type IDs as inputs for encoder models * Add property `GenerationStepResult.hypothesis_id` to identify the different hypotheses when runni | Low | 8/31/2023 |
| v3.18.0 | ## Changes Converted models now use the same floating point precision as the original models. For example, a model saved in float16 will be converted to a float16 model. Before this change, the weights were cast to float32 by default. Similarly, selecting int8 keeps non-quantized weights in their original precision unless a more specific quantization type is selected: * int8_float32 * int8_float16 * int8_bfloat16 ## New features * Add property `compute_type` to model instance | Low | 8/3/2023 |
| v3.17.1 | ## Fixes and improvements * Fix an error when running models with the new `int8_bfloat16` computation type * Fix a vocabulary error when converting Llama 2 models with the Transformers converter * Update the Transformers converter to correctly convert Llama models using GQA * Stop the decoding when the generator returned by the method `generate_tokens` is closed | Low | 7/20/2023 |
| v3.17.0 | ## New features * Add new computation types: `bfloat16` and `int8_bfloat16` (require a GPU with Compute Capability 8.0 or above) * Support multi-query attention for encoder-decoder models * Allow converters to register weights as PyTorch tensors instead of Numpy arrays ## Fixes and improvements * Pass the flag `trust_remote_code` when loading the tokenizer in the Transformers converter * Improve performance of T5 models by reusing the same relative position bias in every layer * Wh | Low | 7/18/2023 |
| v3.16.1 | ## Fixes and improvements * Fix repeated outputs in version 3.16.0 when using `include_prompt_in_result=False` and a batch input with variable lengths: a typo in the code led to `min_length` being incorrectly applied * Update the Transformers converter to accept extra tokens for Falcon models * Release the Python GIL when loading the model * Initialize the rotary embeddings on the GPU instead of the CPU * Avoid a copy for the input features passed to the Whisper methods * Vectorize copy | Low | 7/3/2023 |
| v3.16.0 | ## New features * Update the Transformers converter to support more architectures: * Falcon-40B * XLM-RoBERTa * Add the generation option `sampling_topp` to enable top-p (nucleus) sampling * Save vocabulary files in the JSON format to better support tokens containing newlines or carriage returns ## Fixes and improvements * Fix the application of `min_length` and `max_length` when using `include_prompt_in_result=False` and a batch input with variable lengths: the length constrain | Low | 6/15/2023 |
| v3.15.1 | ## Fixes and improvements * Fix an error when using the new `static_prompt` argument in the methods `generate_tokens` and `generate_batch` * Improve the performance of models using ALiBi | Low | 6/9/2023 |
| v3.15.0 | ## New features * Initial support of encoder-only Transformer model via a new class `ctranslate2.Encoder` * Update the Transformers converter to support the Falcon models * Add a generation argument `static_prompt` to optimize the execution for models using system prompts: the model state for this prompt is cached and reused in future calls * Support early stopping in greedy search when the callback function returns `True` * Make the layer norm epsilon value configurable in the model conf | Low | 6/6/2023 |
| v3.14.0 | ## New features * Update the Transformers converter with new architectures: * CodeGen * GPTBigCode * LLaMa * MPT * Update the OpenNMT-py converter to support some recent options: * `layer_norm="rms"` * `max_relative_positions=-1` (rotary embeddings) * `max_relative_positions=-2` (ALiBi) * `pos_ffn_activation_fn="silu"` * Update the OpenNMT-tf converter to support models using different configurations for the encoder and decoder (e.g. post-norm in the encoder and pre- | Low | 5/26/2023 |
| v3.13.0 | ## New features * Support conversion of GPT-NeoX models with the Transformers converter * Extend the `end_token` argument to also accept a list of tokens * Add option `return_end_token` to include the end token in the results of the methods `generate_batch` and `translate_batch` (by default the end token is removed) * Expose the `callback` argument for the methods `generate_batch` and `translate_batch` to get early results from the decoding loop * Fallback to a custom threading implementa | Low | 4/26/2023 |
| v3.12.0 | ## New features * Add methods `Generator.generate_tokens` and `Translator.generate_tokens` returning a generator that yields tokens as soon as they are generated by the model (not compatible with beam search) * Improve performance of rotary embeddings on CPU with an alternative implementation that is enabled when setting `rotary_interleave=False` in the model specification (may require to permute QK weights) * Support a variable number of input frames in method `Whisper.align` to improve ba | Low | 4/17/2023 |
| v3.11.0 | ## Changes * The Python wheels for macOS ARM are now built with the Ruy backend to support INT8 computation. This will change the performance and results when loading an INT8 model and/or using the `auto` compute type. To keep the previous behavior, set `compute_type="float32"`. ## New features * Support conversion of the GPT-J architecture * Support conversion of models using rotary position embeddings * Apply the new OpenNMT-py option `decoder_start_token` * Add option `revision` i | Low | 4/6/2023 |
| v3.10.3 | ## Fixes and improvements * Fix a synchronization issue when the model input is a CUDA storage | Low | 3/30/2023 |
| v3.10.2 | ## Fixes and improvements * Select the correct device when copying a `StorageView` instance | Low | 3/27/2023 |
| v3.10.1 | ## Fixes and improvements * Add missing device setter in `Whisper.encode` | Low | 3/27/2023 |
| v3.10.0 | ## New features * Add `Generator` option `include_prompt_in_result` (`True` by default) * Add method `Whisper.encode` to only run the Whisper encoder * Add model properties `Whisper.device` and `Whisper.device_index` ## Fixes and improvements * Update the methods `Whisper.detect_language`, `Whisper.generate`, and `Whisper.align` to accept the encoder output * Fix a crash when running `Generator.forward` on GPU and the generator object is destroyed before the forward output * Fix par | Low | 3/24/2023 |
| v3.9.1 | ## Fixes and improvements * Fix missing alignments in the `Whisper.align` result due to a bug in the DTW implementation * Fix error when converting a Whisper model from a path | Low | 3/18/2023 |
| v3.9.0 | ## New features * Support BLOOM language models * Add method `Whisper.align` to return the text/audio alignment and implement word-level timestamps ## Fixes and improvements * Do not force `intra_threads` to 1 when loading a model on the GPU as some ops may still run on the CPU * Disable multithreading when copying a batch of small arrays | Low | 3/15/2023 |
| v3.8.0 | ## New features * Experimental support of AVX512 in manually vectorized functions: this code path is not enabled by default but can be enabled by setting the environment variable `CT2_FORCE_CPU_ISA=AVX512` * Add Transformers converter option `copy_files` to copy any files from the Hugging Face model to the converted model directory * Expose some Whisper parameters: * `max_initial_timestamp_index` * `suppress_blank` * `suppress_tokens` ## Fixes and improvements * Reduce conver | Low | 3/6/2023 |
| v3.7.0 | ## Changes * Rename the "float" compute type to "float32" for clarity. "float" is still accepted for backward compatibility. ## New features * Add the environment variable `CT2_CUDA_TRUE_FP16_GEMM`. This flag is enabled by default so that FP16 GEMMs are running in full FP16. When disabled, the compute type of FP16 GEMMs is set to FP32, which is what PyTorch and TensorFlow do by default. ## Fixes and improvements * Improve the numerical precision of Whisper models running in FP16 b | Low | 2/23/2023 |
| v3.6.0 | ## New features * Build the Windows Python wheels with cuDNN to enable GPU execution of Whisper models * Add the model attribute `Whisper.is_multilingual` ## Fixes and improvements * Reduce the beam search memory usage by not duplicating the decoder states that are the same in each beam (e.g. the projected memory keys and values) * Optimize the dot product attention during beam search by moving the query beam dimension to the time dimension * Fix support of English-only Whisper model | Low | 2/16/2023 |
| v3.5.1 | ## Fixes and improvements * Whisper: fix an incorrect timestamp rule that prevented timestamps from being generated in pairs * Whisper: ignore the EOS token when applying the length penalty to match the original implementation | Low | 2/13/2023 |
| v3.5.0 | ## New features * Add a patience factor for beam search to continue decoding until `beam_size * patience` hypotheses are finished, as described in [Kasai et al. 2022](https://arxiv.org/abs/2204.05424) * Implement all GELU variants and select them accordingly when converting models: * Tanh approximation (already implemented) * Sigmoid approximation * Reference implementation based on the CDF ## Fixes and improvements * Fix incorrect outputs of T5 models due to a bug in the CUDA | Low | 2/10/2023 |
| v3.4.0 | ## Fixes and improvements * Fix incorrect vocabulary in M2M100 models after conversion with `transformers>=4.24` * Fix incorrect model outputs when executing with very large batch sizes on GPU * Fix memory error in biased decoding: the vector of divergence was read and updated past its length * Allow setting `prefix_bias_beta` > 0 with `beam_size` == 1 * Prevent timestamps from decreasing during Whisper generation * Make some error messages more helpful when implementing a custom convert | Low | 2/3/2023 |
| v3.3.0 | ## New features * Support T5 models, including the variants T5v1.1 and mT5 * Support loading the model files from memory: * Python: see the `files` argument in the constructor of classes loading models * C++: see the `models::ModelMemoryReader` class ## Fixes and improvements * Improve the quantization accuracy of OPT models by applying the [SmoothQuant](https://github.com/mit-han-lab/smoothquant) technique during conversion (pre-computed activation scales should be passed to the | Low | 1/2/2023 |
| v3.2.0 | ## New features * Add decoding option `suppress_sequences` to prevent specific sequences of tokens from being generated * Add decoding option `end_token` to stop the decoding on a different token than the model EOS token * Allow returning multiple random hypotheses from greedy search + random sampling when setting `num_hypotheses` > 1 ## Fixes and improvements * Improve support for batch generation with the Whisper model: * Improve performance of batch generation with a context (we | Low | 12/12/2022 |
| v3.1.0 | ## Changes * The input prompt is no longer included in the result of `Whisper.generate` as it is usually not useful in a transcription loop * The default beam size in `Whisper.generate` is updated from 1 to 5 to match the default value in [openai/whisper](https://github.com/openai/whisper) * Generation options `min_length` and `no_repeat_ngram_size` now penalize the logits instead of the log probs which may change some scores * Raise a deprecation warning when reading the `TranslationResul | Low | 11/29/2022 |
| v3.0.2 | ## Fixes and improvements * Whisper: fix `generate` arguments that were not correctly passed to the model | Low | 11/14/2022 |
| v3.0.1 | ## Fixes and improvements * Whisper: do not implicitly add `<|startoftranscript|>` in `generate` since it is not always the first token | Low | 11/10/2022 |
| v3.0.0 | This major version integrates the Whisper speech recognition model published by OpenAI. It also introduces some breaking changes to remove deprecated usages and simplify some modules. ## Breaking changes ### General * Remove option `normalize_scores`: the scores are now always divided by `pow(length, length_penalty)` with `length_penalty` defaulting to 1 * Remove option `allow_early_exit`: the beam search now exits early only when no penalties are used ### Python * Rename some cl | Low | 11/7/2022 |
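The GELU variants listed in the v3.5.0 notes correspond to standard formulas. The sketch below uses the common textbook definitions for reference; it is not CTranslate2's kernel code.

```python
import math

# The three GELU variants from the v3.5.0 release notes, written out with
# their standard definitions: the exact CDF form, the tanh approximation,
# and the sigmoid approximation.

def gelu_exact(x):
    """Reference GELU based on the Gaussian CDF: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation (used e.g. by GPT-2)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    """Sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

# Both approximations stay close to the exact CDF form over typical inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(gelu_tanh(x) - gelu_exact(x)) < 1e-2
    assert abs(gelu_sigmoid(x) - gelu_exact(x)) < 5e-2
```

Selecting the matching variant at conversion time matters because a model trained with one approximation can produce slightly different outputs under another.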
