# pdf_oxide

> The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. 

- **URL**: https://www.freshcrate.ai/projects/pdf_oxide
- **Author**: yfedoseev
- **Category**: RAG & Memory
- **Latest version**: `v0.3.60` (2026-06-04)
- **License**: Apache-2.0
- **Source**: https://github.com/yfedoseev/pdf_oxide
- **Homepage**: https://oxide.fyi
- **Language**: Rust
- **GitHub**: 630 stars, 72 forks
- **Registry**: github
- **Tags**: `data-extraction`, `document-processing`, `fast`, `image-extraction`, `llm`, `markdown`, `pdf`, `pdf-editor`, `rag`, `rust`

## Description

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v0.3.60` | 2026-06-04 | High | ### Added  - **`TextChar::ascent` and `TextChar::descent`** — glyph ascent and descent in device space (pre-multiplied by effective font size, matching the units of `advance_width` and `rendered_advance`). Sourced from the font's `FontDescriptor` (`/Ascent` / `/Descent`), with fallbacks to built-in metrics for the 14 standard PDF fonts and then Poppler-compatible defaults (0.95em / −0.35em). For Type0/CID fonts the values are now read from the CIDFont descendant's descriptor (§9.7.4) rather than |
| `v0.3.57` | 2026-05-30 | High | ### Added  - **`TextChar::rendered_advance`** — per-glyph cursor advance to the next character's origin, including character spacing (Tc) and word spacing (Tw) per the PDF Tx formula, distinct from the shape-only `advance_width`. Enables accurate word-boundary detection and cursor reconstruction. Thanks @haberman. (#602) - **Separation plate rendering** — `render_separations(page, dpi)` / `render_separation(page, ink_name, dpi)` (Rust + Python) emit one grayscale image per ink, pixel value = ink |
| `v0.3.54` | 2026-05-23 | High | ### Fixed  - **Hebrew / RTL visual-vs-logical detection   ([#537](https://github.com/yfedoseev/pdf_oxide/issues/537))**   — Hebrew PDFs that store text in visual order (the PDF   content stream draws glyphs left-to-right even though the   script reads right-to-left) now extract in correct logical   order. New per-RTL-run X-coordinate-monotonicity detector   gates the existing UAX #9 `bidi::reorder_visual_to_logical`   pass; logical-order PDFs (the pdfium `hebrew_mirrored.pdf`   test fixture and |
| `v0.3.49` | 2026-05-16 | High | ### Fixed  - **Linearized PDFs with a non-zero `%PDF-` header offset   ([#509](https://github.com/yfedoseev/pdf_oxide/issues/509))** — files   whose `%PDF-` header is preceded by leading bytes (e.g. a captive-   portal HTML redirect injected ahead of a Linearized PDF) are now read   instead of rejected with `Trailer missing /Root entry`. The xref-   offset shift for header-offset PDFs no longer requires the final   trailer to carry `/Root`; xref reconstruction now rejects a parsed-   but-`/Root` |
| `v0.3.46` | 2026-05-11 | High | ### Added  - **Raw RGBA pixel buffer, SIMD downscaling, and thread-safe rendering   ([#446](https://github.com/yfedoseev/pdf_oxide/issues/446),   [#481](https://github.com/yfedoseev/pdf_oxide/issues/481))** —   `page.render_pixmap()` (Python), `renderToPixmap()` (Node.js / Go),   and `Page.RenderToRgba()` (C#) expose the premultiplied RGBA8888   buffer directly from `tiny_skia::Pixmap::data()`, eliminating the   encode→decode roundtrip for callers that need raw pixels (PIL,   sharp, `System.Draw |
| `v0.3.44` | 2026-05-06 | High | ### Highlights  - **`pdf_oxide::crypto::CryptoProvider` trait** — new abstraction   that decouples PDF encryption and signature paths from any one   cryptography crate. Two providers ship out of the box:   - **`RustCryptoProvider`** (default): pure-Rust stack as before     (`sha2`, `aes`, `rsa`, `p256`, `p384`, `getrandom`, `md-5`,     `sha1`). Permits every algorithm PDF specs reference, including     the legacy MD5+RC4 path required by ISO 32000-1 R≤4 documents.   - **`AwsLcProvider`** (opt-i |
| `v0.3.40` | 2026-04-29 | High | ### Community contributors  This release exists because of the community. Special thanks to:  - **[@sparkyandrew](https://github.com/sparkyandrew)** — six detailed bug   reports (#382, #385, #386, #397, #401, #425) that drove the CJK font   subsetter, encryption, font-name handling, and now the image rendering   overhaul. Every report came with a reproduction case. Issue #425 specifically   identified four separate rendering bugs and raised the API design question   that led to `ImageContent::f |
| `v0.3.38` | 2026-04-23 | High | This release closes the "Rust-only `DocumentBuilder` gap": the fluent write-side builder, embedded fonts, the HTML+CSS pipeline, annotations, form-field creation, and low-level graphics primitives are now reachable from **Python, WASM, C#, Go, and Node/TypeScript** — the Rust implementation is the single source of truth and every binding is a thin translation layer. On top of that it lands the first cryptographic signature-verification path (RSA-PKCS#1 v1.5) across every binding and a pdf.js-par |
| `v0.3.37` | 2026-04-21 | High | ### API — `Pdf::from_html_css` (#248)  ```rust let font = std::fs::read("DejaVuSans.ttf")?; let mut pdf = Pdf::from_html_css(     "<h1>Hello</h1><p>World</p>",     "h1 { color: blue; font-size: 24pt }",     font, )?; pdf.save("out.pdf")?; ```  The whole feature: pass HTML + CSS + font bytes, get a paginated PDF back. Pure Rust, MIT/Apache only (no MPL transitive deps), `extract_text` round-trips byte-equal so produced PDFs participate in the existing test infrastructure.  End-to-end test suite a |
| `v0.3.36` | 2026-04-20 | High | ### Markdown structural extraction (#377)  The headline change of this release. `to_markdown()` previously consumed only the MCID *order* from `/StructTreeRoot` and then re-derived heading levels from font-size heuristics and list markers from glyph detection. For Word/Acrobat tagged PDFs whose body and heading text share a point size, this dropped every heading; for tagged lists where `LI → LBody → MCR` nests the actual content under a Span/P, this dropped every bullet; for tagged paragraphs wh |

## Citation

- HTML: https://www.freshcrate.ai/projects/pdf_oxide
- Markdown: https://www.freshcrate.ai/projects/pdf_oxide.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/pdf_oxide/deps

_Generated by freshcrate.ai. Indexes github releases for AI-agent ecosystem packages._
