pdf2image
A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
Description
# pdf2image

[Documentation](https://belval.github.io/pdf2image)

A Python (3.7+) module that wraps pdftoppm and pdftocairo to convert PDF pages to PIL Image objects.

## How to install

`pip install pdf2image`

### Windows

Windows users will have to build or download poppler for Windows. I recommend [@oschwartz10612 version](https://github.com/oschwartz10612/poppler-windows/releases/), which is the most up-to-date. You will then have to add the `bin/` folder to [PATH](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/) or pass `poppler_path=r"C:\path\to\poppler-xx\bin"` as an argument to `convert_from_path`.

### Mac

Mac users will have to install [poppler](https://poppler.freedesktop.org/).

Installing using [Brew](https://brew.sh/):

```
brew install poppler
```

### Linux

Most distros ship with `pdftoppm` and `pdftocairo`. If they are not installed, use your package manager to install `poppler-utils`.

### Platform-independent (using `conda`)

1. Install poppler: `conda install -c conda-forge poppler`
2. Install pdf2image: `pip install pdf2image`

## How does it work?

```py
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
```

Then simply do:

```py
images = convert_from_path('/home/belval/example.pdf')
```

OR

```py
# Read the file inside a context manager so the handle is closed afterwards
with open('/home/belval/example.pdf', 'rb') as f:
    images = convert_from_bytes(f.read())
```

OR better yet

```py
import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
```

`images` will be a list of PIL Image objects, one for each page of the PDF document.
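The exception classes imported above are not used in the short snippets. As a minimal sketch of where they could fit, the following wraps `convert_from_path` and writes each page to a PNG file. The helper names `page_filename` and `save_pdf_pages`, the zero-padded naming scheme, and the choice of re-raised exception types are illustrative assumptions, not part of pdf2image. The pdf2image import sits inside the function only so that the naming helper can be used on its own.

```python
from pathlib import Path


def page_filename(stem, page_number, page_count, ext="png"):
    """Build a zero-padded file name such as 'report-003.png' (hypothetical scheme)."""
    width = len(str(page_count))
    return f"{stem}-{page_number:0{width}d}.{ext}"


def save_pdf_pages(pdf_path, out_dir, dpi=200, poppler_path=None):
    """Convert every page of pdf_path to a PNG in out_dir and return the paths."""
    # Imported lazily so page_filename stays usable without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    try:
        images = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
    except PDFInfoNotInstalledError as exc:
        # pdfinfo was not found: poppler is probably missing or not on PATH.
        raise RuntimeError("poppler not found; install it or pass poppler_path") from exc
    except (PDFPageCountError, PDFSyntaxError) as exc:
        # The page count or the PDF syntax could not be processed.
        raise ValueError(f"could not convert {pdf_path}: {exc}") from exc

    written = []
    for number, image in enumerate(images, start=1):
        target = out_dir / page_filename(pdf_path.stem, number, len(images))
        image.save(target, "PNG")
        written.append(target)
    return written
```

With these assumptions, `save_pdf_pages('/home/belval/example.pdf', 'pages')` would write one PNG per page under `pages/` and return the list of written paths.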
Here are the definitions:

`convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_annotations=False)`

`convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_annotations=False)`

## What's new?

- Allow users to hide annotations when using pdftoppm with `hide_annotations` (Thank you @StaticRocket)
- Fix console opening on Windows (Thank you @OhMyAgnes!)
- Add `timeout` parameter which raises `PDFPopplerTimeoutError` after the given number of seconds.
- Add `use_pdftocairo` parameter which forces `pdf2image` to use `pdftocairo`. Should improve performance.
- Fixed a bug where using `pdf2image` with multiple threads (but not multiple processes) would cause an exception
- `jpegopt` parameter allows for tuning of the output JPEG when using `fmt="jpeg"` (`-jpegopt` in pdftoppm CLI) (Thank you @abieler)
- `pdfinfo_from_path` and `pdfinfo_from_bytes`, which expose the output of the pdfinfo CLI
- `paths_only` parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
- `size` parameter allows you to define the shape of the resulting images (`-scale-to` in pdftoppm CLI)
  - `size=400` will fit the image to a 400x400 box, preserving aspect ratio
  - `size=(400, None)` will make the image 400 pixels wide, preserving aspect ratio
  - `size=(500, 500)` will resize the image to 500x500 pixels, not preserving aspect ratio
- `grayscale` parameter allows you to convert images to grayscale (`-gray` in pdftoppm CLI)
- `single_file` parameter allows you to convert the first PDF page only, without adding digits at the end of the `output_file`
- Allow the user to specify poppler's installation path with `poppler_path`

## Performance tips

- Using an output folder is significantly faster if you are using an SSD. Otherwise i/o usually becomes the bottleneck.
- Using multiple threads can give you some gains, but avoid more than 4 as this will cause an i/o bottleneck (even on my NVMe SSD!).
- If i/o is your bottleneck, using the JPEG format can lead to significant gains.
- The PNG format is pretty slow because of the compression.
- If you want to know the best settings (mos
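The `paths_only` parameter keeps memory use down by returning file paths instead of images. When the Image objects themselves are needed, a complementary approach is to convert a few pages at a time with `first_page` and `last_page`. The sketch below assumes that `pdfinfo_from_path` reports the page count under the `"Pages"` key. The helper names `page_ranges` and `iter_page_batches` are hypothetical and not part of pdf2image.

```python
def page_ranges(page_count, batch_size):
    """Yield inclusive, 1-based (first_page, last_page) pairs covering all pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def iter_page_batches(pdf_path, batch_size=10, poppler_path=None, **convert_kwargs):
    """Yield lists of PIL images, at most batch_size pages per list."""
    # Imported lazily so page_ranges stays usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    info = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)
    page_count = int(info["Pages"])  # assumed key for the page count
    for first, last in page_ranges(page_count, batch_size):
        # Only this batch is held in memory once the caller drops the previous one.
        yield convert_from_path(
            pdf_path,
            first_page=first,
            last_page=last,
            poppler_path=poppler_path,
            **convert_kwargs,
        )
```

A caller could then loop with `for batch in iter_page_batches('/home/belval/example.pdf', batch_size=5, fmt='jpeg', thread_count=4)` and process each batch before requesting the next. This trades one poppler invocation per batch for a bounded number of pages in memory at once.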
Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 1.17.0 | Imported from PyPI (1.17.0) | Low | 4/21/2026 |
| v1.17.0 | ## What's Changed * Update __init__.py to mark exports as public by @John-Schreiber in https://github.com/Belval/pdf2image/pull/276 * Add first and last page (-f and -l) to pdfinfo by @magnurud in https://github.com/Belval/pdf2image/pull/274 * Import memory_profiler only when it is enabled by @musicinmybrain in https://github.com/Belval/pdf2image/pull/269 * Fixed the issue: when the `single_file` is `True` and the `thread_count` is greater than 1, the `thread_output_file = next(output_file)` | Low | 1/7/2024 |
| v.1.16.3 | ## What's Changed * Add py.typed file (Thanks @PedroPerpetua!) **Full Changelog**: https://github.com/Belval/pdf2image/compare/v1.16.2...v.1.16.3 | Low | 2/26/2023 |
| v1.16.2 | ## What's Changed * Remove posix import by @Belval in https://github.com/Belval/pdf2image/pull/250 **Full Changelog**: https://github.com/Belval/pdf2image/compare/v1.16.1...v1.16.2 | Low | 12/31/2022 |
| v1.16.0 | Version 1.15 was never tagged but was released on PyPI. - Add `hide_annotations` flag to remove the default green "highlighting" of links in PDF | Low | 6/23/2021 |
| v1.14.0 | - Add timeout parameter for all functions which will kill the underlying process after a given time. - Add `rawdates` parameter which does not attempt to parse the date output from `pdfinfo`. | Low | 8/23/2020 |
| v1.13.1 | - Fix `convert_from_bytes` not having `use_pdftocairo`. | Low | 4/30/2020 |
| v1.13.0 | **DEPRECATED PLEASE USE 1.13.1** - Add `use_pdftocairo` parameter which forces `pdf2image` to use `pdftocairo` when rasterizing the PDF. This seems to improve performance when dealing with large and complex PDFs. | Low | 4/30/2020 |
| v1.12.1 | - Fixes version 1.12.0 on Windows - Version 1.12.0 will no longer be available on PyPI | Low | 2/17/2020 |
| v1.12.0 | - Fix an exception that would occur whenever `convert_from_path` was used with multiple threads (but not multiple processes). **This version was removed from PyPI as of 2020-02-17 since it introduces a deadlock on Windows** | Low | 2/10/2020 |
| v1.11.0 | - Add `jpegopt` option for finer control over output image quality when using `fmt='jpeg'` (Thank you @abieler) - Add public functions `pdfinfo_from_path` and `pdfinfo_from_bytes` which return a dictionary containing the parsed output of `pdfinfo` | Low | 12/19/2019 |
| v1.10.0 | - Add `paths_only=False` parameter, which returns image paths instead of image objects when set to `True` | Low | 11/4/2019 |
| v1.9.0 | - Add `size=None` parameter which uses either `-scale-to`, `-scale-to-x` or `-scale-to-y` in pdftoppm or pdftocairo - `size=400` fits the image to a 400x400 pixels box, preserving aspect ratio - `size=(400, None)` makes the image 400 pixels wide, preserving aspect ratio - `size=(None, 400)` makes the image 400 pixels high, preserving aspect ratio - `size=(400, 400)` makes the image 400 pixels by 400 pixels, not preserving aspect ratio | Low | 9/21/2019 |
| v1.8.0 | - **Drop support for python 2.7** - Add generator for file names - Add support for `pathlib.Path` objects | Low | 9/15/2019 |
| v1.7.1 | - Use [Black](https://github.com/psf/black) code formatting in source | Low | 9/3/2019 |
| v1.7.0 | - Add `grayscale` parameter which allows you to convert images to grayscale | Low | 8/27/2019 |
| v1.6.0 | - Add `single_file` parameter which allows you to convert the first PDF page only, without adding digits at the end of the `output_file` - Fix simplistic name matching that would sometimes fail when `output_file` was contained in the name of another file or directory of the `output_folder` | Low | 7/3/2019 |
| v1.5.4 | - Fix `first_page` greater than `last_page` throwing an error. From now on it will return an empty list | Low | 4/30/2019 |
| v1.5.3 | - Minor bugfix for `poppler_path` which would not find libpoppler on Linux. | Low | 4/28/2019 |
