
pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs.

Description

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

README

PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, WASM, CLI & AI

More language bindings coming in May 2026. Java, Ruby, PHP, Swift, and Kotlin are on the roadmap. Want another language? Open an issue and tell us.

The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.

New in v0.3.24: now available in Go, JavaScript / TypeScript, and C# / .NET, alongside the existing Python, Rust, and WASM bindings. Same Rust core, same 0.8 ms extraction speed, same 100% pass rate. See the language guides: Python · Go · JavaScript / TypeScript · C# / .NET · WASM

Quick Start

Python

from pdf_oxide import PdfDocument

# path can be str or pathlib.Path; use with for scoped access
doc = PdfDocument("paper.pdf")
# or: with PdfDocument("paper.pdf") as doc: ...
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

Install:

pip install pdf_oxide
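
The Python calls above work on one page at a time. As a minimal sketch of converting a whole document, the helper below assumes only the `page_count()` and `to_markdown(page, detect_headings=...)` methods that appear in this README. The name `document_to_markdown` and the page separator are illustrative and not part of pdf_oxide.

```python
def document_to_markdown(doc, detect_headings=True, separator="\n\n"):
    """Join per-page Markdown into one string.

    `doc` is any object exposing page_count() and
    to_markdown(page_index, detect_headings=...), as PdfDocument does in
    the README examples. This helper is hypothetical, not a pdf_oxide API.
    """
    pages = []
    for index in range(doc.page_count()):
        pages.append(doc.to_markdown(index, detect_headings=detect_headings))
    return separator.join(pages)
```

With the library installed, it would be called as `document_to_markdown(PdfDocument("paper.pdf"))`.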

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

Add to Cargo.toml:

[dependencies]
pdf_oxide = "0.3"

CLI

pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf

Install:

brew install yfedoseev/tap/pdf-oxide

MCP Server (for AI assistants)

# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

Why pdf_oxide?

  • Fast: 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
  • Reliable: 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
  • Complete: text extraction, image extraction, PDF creation, and editing in one library
  • Multi-platform: Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server for AI assistants
  • Permissive license: MIT / Apache-2.0, free to use in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.

Python Libraries

| Library | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |

Rust Libraries

| Library | Mean | p99 | Pass Rate | Text Extraction |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |
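
The Mean and p99 columns above summarize per-document timings. The benchmark harness itself is not shown here, so the following is only a sketch of how such figures can be derived from a list of timings. The nearest-rank percentile definition and the sample data are assumptions, not taken from the benchmark.

```python
import statistics

def summarize_timings(timings_ms):
    """Return (mean, p99) for per-document timings in milliseconds.

    p99 uses the nearest-rank method. Which percentile definition the
    benchmark uses is not stated, so this choice is an assumption.
    """
    if not timings_ms:
        raise ValueError("need at least one timing")
    ordered = sorted(timings_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n), 1-based rank
    return statistics.mean(ordered), ordered[rank - 1]

# Illustrative data only, not measurements from the benchmark.
sample = [0.5] * 98 + [2.0, 30.0]
mean_ms, p99_ms = summarize_timings(sample)
print(f"mean={mean_ms:.2f}ms p99={p99_ms:.1f}ms")
```

In this sample a single 30 ms outlier lifts the mean well above the typical 0.5 ms document while leaving p99 at 2 ms, which is why the two columns are best read together.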

Text Quality

99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. PDF Oxide extracts text from 7–10× more "hard" files than it misses vs any competitor.

Corpus

| Suite | PDFs | Pass Rate |
|---|---|---|
| veraPDF (PDF/A compliance) | 2,907 | 100% |
| Mozilla pdf.js | 897 | 99.2% |
| SafeDocs (targeted edge cases) | 26 | 100% |
| Total | 3,830 | 100% |

100% pass rate on all valid PDFs: the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
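
The relationship between the per-suite figures, the 7 non-passing files, and the 100% total can be checked with a little arithmetic. The sketch below uses only numbers from the corpus table and the note above. Attributing all 7 failures to the pdf.js suite is an inference from the 99.2% figure, not something stated outright.

```python
# Suite sizes from the corpus table.
suites = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}
non_passing = 7  # inferred to sit in the pdf.js suite (the only one below 100%)

total = sum(suites.values())
pdfjs_rate = (suites["Mozilla pdf.js"] - non_passing) / suites["Mozilla pdf.js"]
raw_rate = (total - non_passing) / total  # every file counted, broken or not
valid_total = total - non_passing         # broken fixtures excluded
valid_rate = valid_total / valid_total

print(f"total={total} pdf.js={pdfjs_rate:.1%} raw={raw_rate:.1%} valid-only={valid_rate:.0%}")
# prints: total=3830 pdf.js=99.2% raw=99.8% valid-only=100%
```

So the headline 100% refers to valid PDFs; counting the deliberately broken fixtures as failures gives 99.8%.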

Features

| Extract | Create | Edit |
|---|---|---|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |

Python API

from pdf_oxide import PdfDocument

# Path can be str or pathlib.Path; use "with PdfDocument(...) as doc" for context manager
doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count()}")
print(f"Version: {doc.version()}")

# 1. Scoped extraction (v0.3.14)
# Extract only from a specific area: (x, y, width, height)
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"{w.text} at {w.bbox}")
    # Access individual characters in the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)

# 3. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Inspect the adaptive thresholds before overriding
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}")

# Use a pre-tuned extraction profile for specific document types
from pdf_oxide import ExtractionProfile
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# 4. Table extraction (v0.3.14)
tables = doc.extract_tables(0)
for table in tables:
    print(f"Table with {table.row_count} rows")

# 5. Traditional extraction
text = doc.extract_text(0)
chars = doc.extract_chars(0)
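
The `word_gap_threshold` parameter above is described only as a gap in PDF points. To make the idea concrete, here is a minimal sketch of gap-based word grouping on a single line of characters. It is not pdf_oxide's algorithm (which the comments above describe as adaptive); the `Char` tuple and `group_words` function are hypothetical names used for illustration.

```python
from typing import List, NamedTuple

class Char(NamedTuple):
    text: str
    x0: float  # left edge, in PDF points
    x1: float  # right edge, in PDF points

def group_words(chars: List[Char], word_gap_threshold: float = 2.5) -> List[str]:
    """Split one line of characters into words wherever the horizontal
    gap between neighbours exceeds word_gap_threshold."""
    words: List[str] = []
    current = ""
    previous = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if previous is not None and ch.x0 - previous.x1 > word_gap_threshold:
            words.append(current)
            current = ""
        current += ch.text
        previous = ch
    if current:
        words.append(current)
    return words

line = [Char("P", 0, 6), Char("D", 6.5, 12), Char("F", 12.4, 18),
        Char("o", 22, 27), Char("k", 27.3, 32)]
print(group_words(line))       # ['PDF', 'ok']
print(group_words(line, 5.0))  # ['PDFok']
```

Raising the threshold merges words that sit close together; lowering it splits tightly kerned text, which is the trade-off the override exposes.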

Form Fields

# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
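
When filling many fields at once, it can help to check the names before writing anything. The sketch below uses only the three calls shown above (`get_form_fields()`, `set_form_field_value()`, `save()`). The `fill_form` helper and its choice to raise `KeyError` are assumptions for illustration; the README does not say how the library itself reacts to an unknown field name.

```python
def fill_form(doc, values, output_path):
    """Set several form fields, then save once.

    Rejects names that get_form_fields() does not report, so a typo
    fails loudly instead of being written or silently skipped.
    This helper is hypothetical, not a pdf_oxide API.
    """
    known = {field.name for field in doc.get_form_fields()}
    missing = sorted(set(values) - known)
    if missing:
        raise KeyError(f"unknown form fields: {missing}")
    for name, value in values.items():
        doc.set_form_field_value(name, value)
    doc.save(output_path)
```

A call would look like `fill_form(doc, {"employee_name": "Jane Doe", "wages": "85000.00"}, "filled.pdf")`.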

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Form Fields (Rust)

use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;

Installation

Python

pip install pdf_oxide

Wheels are available for Linux, macOS, and Windows. Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

JavaScript/WASM

npm install pdf-oxide-wasm

const { WasmPdfDocument } = require("pdf-oxide-wasm");

CLI

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall

MCP Server

brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo

Other languages

  • Go: go get github.com/yfedoseev/pdf_oxide/go (see go/README.md)
  • JavaScript / TypeScript (Node.js): npm install pdf-oxide (see js/README.md)
  • C# / .NET: dotnet add package PdfOxide (see csharp/README.md)

All three share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well, just with each language's native naming conventions.

CLI

22 commands for PDF processing directly from your terminal:

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields

Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
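
For scripting, the CLI can be driven from Python with `subprocess`. The sketch below builds an argument list from the subcommands and flags named above. Placing `--pages` and `--json` after the file path is an assumption, and the helper names are hypothetical.

```python
import subprocess

def pdf_oxide_command(subcommand, pdf_path, pages=None, as_json=False):
    """Build an argv list for the pdf-oxide CLI.

    Flag placement after the file path is an assumption of this sketch.
    """
    argv = ["pdf-oxide", subcommand, str(pdf_path)]
    if pages:
        argv += ["--pages", pages]
    if as_json:
        argv.append("--json")
    return argv

def run_pdf_oxide(argv):
    """Run the command and return stdout as text; raises on a non-zero exit."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_pdf_oxide(pdf_oxide_command("text", "report.pdf", pages="1-5"))` would return the text of the first five pages, provided the binary is on `PATH`.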

MCP Server

pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.

Add to your MCP client configuration:

{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally; no files leave your machine.
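
If the client configuration file already lists other MCP servers, the entry needs to be merged in rather than pasted over them. A minimal sketch of that merge, operating on the JSON text so it makes no assumption about where a given client stores its file; the function name is hypothetical:

```python
import json

# The server entry exactly as given in the configuration snippet above.
PDF_OXIDE_SERVER = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}

def add_pdf_oxide_server(config_text):
    """Return config JSON text with the pdf-oxide entry added.

    Existing servers and unrelated keys are preserved; an empty string
    is treated as an empty config.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["pdf-oxide"] = dict(PDF_OXIDE_SERVER)
    return json.dumps(config, indent=2)

existing = '{"mcpServers": {"other": {"command": "other-server"}}}'
print(add_pdf_oxide_server(existing))
```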

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build the shared library for Go, JS/TS, and C# bindings
cargo build --release --lib
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll

Use Cases

  • RAG / LLM pipelines: convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • AI assistants: give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
  • Document processing at scale: extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction: pull structured data from forms, tables, and layouts
  • Academic research: parse papers, extract citations, and process large corpora
  • PDF generation: create invoices, reports, certificates, and templated documents programmatically
  • PyMuPDF alternative: MIT licensed, 5× faster, no AGPL restrictions
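
For the RAG use case in the list above, the Markdown output usually has to be cut into chunks before indexing. The sketch below splits on headings using only the standard library. It is one simple chunking strategy, not something pdf_oxide provides, and it does not treat `#` lines inside fenced code specially.

```python
import re

def split_markdown_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at every ATX
    heading (a line beginning with 1-6 '#' characters and a space)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]

sample_md = "# Intro\nText.\n\n## Methods\nMore text."
print(split_markdown_by_heading(sample_md))
# ['# Intro\nText.', '## Methods\nMore text.']
```

Heading-based chunks keep each section's title next to its body, which tends to help retrieval. Calling `to_markdown` with `detect_headings=True`, as in the earlier examples, is what gives this splitter headings to work with.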

Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes (fast, MIT, multi-language), so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue; I read all of them.

Yury

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project, commercial or open-source, with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders

Release History

Version | Urgency | Date | Changes
v0.3.57 | High | 5/30/2026 | ### Added - **`TextChar::rendered_advance`**: per-glyph cursor advance to the next character's origin, including character spacing (Tc) and word spacing (Tw) per the PDF Tx formula, distinct from the shape-only `advance_width`. Enables accurate word-boundary detection and cursor reconstruction. Thanks @haberman. (#602) - **Separation plate rendering**: `render_separations(page, dpi)` / `render_separation(page, ink_name, dpi)` (Rust + Python) emit one grayscale image per ink, pixel value = ink…
v0.3.54 | High | 5/23/2026 | ### Fixed - **Hebrew / RTL visual-vs-logical detection ([#537](https://github.com/yfedoseev/pdf_oxide/issues/537))**: Hebrew PDFs that store text in visual order (the PDF content stream draws glyphs left-to-right even though the script reads right-to-left) now extract in correct logical order. New per-RTL-run X-coordinate-monotonicity detector gates the existing UAX #9 `bidi::reorder_visual_to_logical` pass; logical-order PDFs (the pdfium `hebrew_mirrored.pdf` test fixture and…
v0.3.49 | High | 5/16/2026 | ### Fixed - **Linearized PDFs with a non-zero `%PDF-` header offset ([#509](https://github.com/yfedoseev/pdf_oxide/issues/509))**: files whose `%PDF-` header is preceded by leading bytes (e.g. a captive-portal HTML redirect injected ahead of a Linearized PDF) are now read instead of rejected with `Trailer missing /Root entry`. The xref-offset shift for header-offset PDFs no longer requires the final trailer to carry `/Root`; xref reconstruction now rejects a parsed-but-`/Root`…
v0.3.46 | High | 5/11/2026 | ### Added - **Raw RGBA pixel buffer, SIMD downscaling, and thread-safe rendering ([#446](https://github.com/yfedoseev/pdf_oxide/issues/446), [#481](https://github.com/yfedoseev/pdf_oxide/issues/481))**: `page.render_pixmap()` (Python), `renderToPixmap()` (Node.js / Go), and `Page.RenderToRgba()` (C#) expose the premultiplied RGBA8888 buffer directly from `tiny_skia::Pixmap::data()`, eliminating the encode→decode roundtrip for callers that need raw pixels (PIL, sharp, `System.Draw…
v0.3.44 | High | 5/6/2026 | ### Highlights - **`pdf_oxide::crypto::CryptoProvider` trait**: new abstraction that decouples PDF encryption and signature paths from any one cryptography crate. Two providers ship out of the box: - **`RustCryptoProvider`** (default): pure-Rust stack as before (`sha2`, `aes`, `rsa`, `p256`, `p384`, `getrandom`, `md-5`, `sha1`). Permits every algorithm PDF specs reference, including the legacy MD5+RC4 path required by ISO 32000-1 R≤4 documents. - **`AwsLcProvider`** (opt-i…
v0.3.40 | High | 4/29/2026 | ### Community contributors This release exists because of the community. Special thanks to: - **[@sparkyandrew](https://github.com/sparkyandrew)**: six detailed bug reports (#382, #385, #386, #397, #401, #425) that drove the CJK font subsetter, encryption, font-name handling, and now the image rendering overhaul. Every report came with a reproduction case. Issue #425 specifically identified four separate rendering bugs and raised the API design question that led to `ImageContent::f…
v0.3.38 | High | 4/23/2026 | This release closes the "Rust-only `DocumentBuilder` gap": the fluent write-side builder, embedded fonts, the HTML+CSS pipeline, annotations, form-field creation, and low-level graphics primitives are now reachable from **Python, WASM, C#, Go, and Node/TypeScript**; the Rust implementation is the single source of truth and every binding is a thin translation layer. On top of that it lands the first cryptographic signature-verification path (RSA-PKCS#1 v1.5) across every binding and a pdf.js-par…
v0.3.37 | High | 4/21/2026 | ### API: `Pdf::from_html_css` (#248) ```rust let font = std::fs::read("DejaVuSans.ttf")?; let mut pdf = Pdf::from_html_css( "<h1>Hello</h1><p>World</p>", "h1 { color: blue; font-size: 24pt }", font, )?; pdf.save("out.pdf")?; ``` The whole feature: pass HTML + CSS + font bytes, get a paginated PDF back. Pure Rust, MIT/Apache only (no MPL transitive deps), `extract_text` round-trips byte-equal so produced PDFs participate in the existing test infrastructure. End-to-end test suite a…
v0.3.36 | High | 4/20/2026 | ### Markdown structural extraction (#377) The headline change of this release. `to_markdown()` previously consumed only the MCID *order* from `/StructTreeRoot` and then re-derived heading levels from font-size heuristics and list markers from glyph detection. For Word/Acrobat tagged PDFs whose body and heading text share a point size, this dropped every heading; for tagged lists where `LI → LBody → MCR` nests the actual content under a Span/P, this dropped every bullet; for tagged paragraphs wh…
v0.3.35 | High | 4/19/2026 | ### Text extraction correctness - **Adjacent narrow-glyph doublets no longer collapsed at small font sizes (#378, PR #379).** `TextExtractor::deduplicate_overlapping_chars` and `deduplicate_overlapping_spans` used a hardcoded 2 pt absolute threshold to detect duplicate glyphs from stroke+fill render passes. For narrow glyphs (`l`, `r`, `I`, `i`) in compact fonts at small sizes the per-glyph advance width drops to ≤ 2 pt (Helvetica `l` ≈ 2.5 pt at 9 pt), so legitimate adjacent double…
v0.3.34 | High | 4/18/2026 | ### API: Page abstraction (#371) All four language bindings now expose a page object so callers can iterate a document and call extraction methods on the page directly. Named consistently as `Page` in Python, Node.js, C#, and Go. ```python with PdfDocument("paper.pdf") as doc: for page in doc: # len(doc), doc[i], doc[-1] also work text = page.text md = page.markdown(detect_headings=True) ``` - **Python**: `Page` with lazy properties: `text`, `chars`, `words`,…
v0.3.33 | High | 4/17/2026 | ### Text extraction correctness - **ToUnicode CMap miss returns U+FFFD instead of ASCII ciphertext (#363).** Subset Type0 fonts whose ToUnicode CMap doesn't cover a CID now emit the replacement character instead of falling through to the Identity-H `cid-as-Unicode` path that produced strings like `%B+$%8A//$2*%01*1%6APP`. - **Intra-word TJ kerning no longer splits words (#365).** Letter-pair kerning of 0.10–0.20 em inside single words (`[(diffe) -150 (rent)]`) no longer triggers space insertion…
v0.3.32 | High | 4/16/2026 | ### Release pipeline - **Fix `x86_64-pc-windows-gnu` native-lib build failing the v0.3.31 release.** The new `scripts/shrink-staticlib.sh` introduced in v0.3.31 ran `objcopy --strip-debug` on every archive member. The MinGW cross-compile toolchain emits split-debug `.dwo` members that contain *only* DWARF sections; after stripping those sections the member has no sections left and objcopy aborted the whole archive with `'...rcgu.dwo' has no sections`, failing the job that produces the Go Window…
v0.3.30 | High | 4/12/2026 | --- ### Installation **Rust (crates.io)** ```bash cargo add pdf_oxide ``` **Python (PyPI)** ```bash pip install pdf_oxide ``` **JavaScript/WASM (npm)** ```bash npm install pdf-oxide-wasm ``` **CLI (Homebrew)** ```bash brew install yfedoseev/tap/pdf-oxide ``` **CLI (Scoop, Windows)** ```powershell scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide scoop install pdf-oxide ``` **CLI (Shell installer)** ```bash curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_…
v0.3.29 | Medium | 4/12/2026 | --- ### Installation **Rust (crates.io)** ```bash cargo add pdf_oxide ``` **Python (PyPI)** ```bash pip install pdf_oxide ``` **JavaScript/WASM (npm)** ```bash npm install pdf-oxide-wasm ``` **CLI (Homebrew)** ```bash brew install yfedoseev/tap/pdf-oxide ``` **CLI (Scoop, Windows)** ```powershell scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide scoop install pdf-oxide ``` **CLI (Shell installer)** ```bash curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_…
v0.3.28 | Medium | 4/12/2026 | --- ### Installation **Rust (crates.io)** ```bash cargo add pdf_oxide ``` **Python (PyPI)** ```bash pip install pdf_oxide ``` **JavaScript/WASM (npm)** ```bash npm install pdf-oxide-wasm ``` **CLI (Homebrew)** ```bash brew install yfedoseev/tap/pdf-oxide ``` **CLI (Scoop, Windows)** ```powershell scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide scoop install pdf-oxide ``` **CLI (Shell installer)** ```bash curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_…
v0.3.27 | Medium | 4/12/2026 | ### Language Bindings - **Go: migrate from cdylib to staticlib for self-contained binaries (#334)**: `pdf_oxide` now produces `libpdf_oxide.a` alongside the cdylib (new `staticlib` entry in `Cargo.toml`'s `crate-type`), and `go/pdf_oxide.go` links the archive directly via per-platform `#cgo ... LDFLAGS` with the exact system-library list rustc needs. The resulting Go binary is fully self-contained: no `LD_LIBRARY_PATH` / `DYLD_LIBRARY_PATH` / `PATH` configuration required. Windows x64 is prod…
v0.3.24 | Medium | 4/11/2026 | This release ships official bindings for JavaScript/TypeScript, Go, and C#, built on a shared C FFI layer. 100% Rust FFI parity across all three. ### Features - **JavaScript / TypeScript bindings** (`pdf-oxide` on npm): N-API native module with `Buffer`/`Uint8Array` input, `openWithPassword()`, worker thread pool, `Symbol.dispose`, rich error hierarchy, and complete API coverage: document editor, forms, rendering, signatures/TSA, compliance, annotations, extraction with bbox. Full TypeScript…
v0.3.23 | High | 4/10/2026 | ### Bug Fixes - **Text extraction: SIGABRT on pages with degenerate CTM coordinates (#308)**: extracting text from certain rotated dvips-generated pages (e.g., arXiv papers with `Page rot: 90`) caused a 38 petabyte allocation and SIGABRT. Degenerate CTM transforms produced text spans with bounding boxes ~19 quadrillion points wide, which blew up the column detection histogram in `detect_page_columns()`. Per PDF 32000-1:2008 §8.3.2.3, the visible page region is defined by MediaBox/CropBox, not…
v0.3.22 | Medium | 4/9/2026 | ### Breaking Changes None. All changes are backward-compatible. ### Features - **Thread-safe `PdfDocument`, Send + Sync (#302)**: replaced all 16 `RefCell<T>` with `Mutex<T>` and `Cell<usize>` with `AtomicUsize`. `PdfDocument` can now safely cross thread boundaries. Removes `unsendable` from `PdfDocument`, `FormField`, and `PdfPage` Python classes. Enables `asyncio.to_thread()`, free-threaded Python (cp314t), and thread pool usage without `RuntimeError`. Reported by @FireMasterK (#298). - *…
v0.3.21 | High | 4/6/2026 | ### Bug Fixes - **Log level now fully respected in Python (#283)**: `extract_log_debug!` / `extract_log_trace!` / etc. were printing to stderr directly via `eprintln!`, bypassing the `log` crate and therefore ignoring `pdf_oxide.set_log_level(...)` and Python's `logging.basicConfig(level=...)`. Messages like `[DEBUG] Parsing content stream for text extraction` and `[TRACE] Detected document script: Latin` leaked through at ERROR level. The macros now forward to `log::debug!` / `log::trace!` / …
v0.3.20 | Medium | 4/4/2026 | ### Table Extraction Engine Major rewrite of the table detection system, implementing the universal `Edges → Snap/Merge → Intersections → Cells → Groups` pipeline, the gold-standard approach used by Tabula, pdfplumber, and PyMuPDF, now in pure Rust. #### New Detection Capabilities - **Intersection-based table detection**: Finds H×V line crossings, builds cells from 4-corner rectangles, groups into tables via union-find. The gold-standard approach used by Tabula/pdfplumber/PyMuPDF, now in pur…
v0.3.19 | Medium | 4/3/2026 | ### Features - **`extract_page_text()` Single-Call DTO** (#268): New `PageText` struct returns spans, characters, and page dimensions from a single extraction pass, eliminating redundant content stream parsing. Available across Rust, Python, and WASM. - **Column-Aware Reading Order** (#270): New `extract_spans_with_reading_order()` method accepts a `ReadingOrder` parameter. `ReadingOrder::ColumnAware` uses XY-Cut spatial partitioning to detect columns and read each column top-to-bottom, fixin…
v0.3.18 | Medium | 4/2/2026 | ### Rendering Engine: Visual Parity Major rendering improvements achieving near-perfect visual fidelity across academic papers, government documents, CJK content, presentations, forms, and complex multi-layer PDFs. #### Font Rendering - **Correct Character Spacing**: Fixed proportional width resolution for CID, CFF, and TrueType subset fonts. Documents that previously rendered with monospace-like spacing now display with correct kerning and proportional widths. - **Embedded Font Support** …
v0.3.17 | Low | 3/9/2026 | ### Features - **Refined Table Detection**: The spatial table detector now requires at least **2 columns** to identify a region as a table. This significantly reduces false positives where single-column lists or bullet points were incorrectly wrapped in ASCII boxes. - **Optimized Text Extraction**: Refactored the internal extraction pipeline to eliminate redundant work when processing Tagged PDFs. The structure tree and page spans are now extracted once and shared across the detection and ren…
v0.3.16 | Low | 3/8/2026 | ### Features - **Smart Hybrid Table Extraction** (#206): Introduced a robust, zero-config visual detection engine that handles both bordered and borderless tables. - **Localized Grid Detection:** Uses Union-Find clustering to group vector paths into discrete table regions, enabling multiple tables per page. - **Visual Line Analysis:** Detects cell boundaries from actual drawing primitives (lines and rectangles), significantly improving accuracy for untagged PDFs. - **Visual Spans:*…
v0.3.15 | Low | 3/6/2026 | ### Features - **PDF Header/Footer Management API** (#207): Added a dedicated API for managing page artifacts across Rust, Python, and WASM. - **Add:** Ability to insert custom headers and footers with styling and placeholders via `PageTemplate`. - **Remove:** Heuristic detection engine to automatically identify and strip repeating artifacts. Includes modular methods: `remove_headers()`, `remove_footers()`, and `remove_artifacts()`. Prioritizes ISO 32000 spec-compliant `/Artifact` tags…
v0.3.14 | Low | 3/4/2026 | ### Features - **High-Level Rendering API** (#185, #190): added `Pdf::render_page()` to Rust, Python, and WASM. Supports rendering any page to `Image` (Png/Jpeg). Restored backward compatibility for Rust by maintaining the 1-argument `render_page` and adding `render_page_with_options`. - **Word and Line Extraction** (#185, #189): added `extract_words()` and `extract_text_lines()` to all bindings. Provides semantic grouping of characters with bounding boxes, font info, and styling (parity with…
v0.3.13 | Low | 3/3/2026 | ### Bug Fixes: Character Extraction (#186) Reported by **@cole-dda**: garbled output when using `extract_chars()` on PDFs with multi-byte encodings (CJK text, Type0 fonts). - **Multi-byte decoding in show_text**: fixed `extract_chars()` to correctly handle 2-byte and variable-width encodings (Identity-H/V, Shift-JIS, etc.). Previously, characters were processed byte-by-byte, causing multi-byte characters to be split and garbled. Now uses the same robust decoding logic as `extract_spans()`. …
v0.3.12 | Low | 3/2/2026 | ### Bug Fixes: Text Extraction (#181) Reported by **@Goldziher**: systematic evaluation across 10 PDFs covering word merging, encoding failures, and RTL text. - **CID font width calculation**: fixed text-to-user space conversion for CID fonts. Glyph widths were not correctly scaled, causing word boundary detection to merge adjacent words (`destinationmachine` → `destination machine`, `helporganizeas` → `help organize as`). - **Font-change word boundary detection**: when PDF font changes m…
v0.3.11 | Low | 3/1/2026 | ### New Features - **CLI with 22 subcommands and interactive REPL** (#176): standalone `pdf-oxide` binary with text/markdown/html extraction, merge, split, compress, encrypt/decrypt, search, images, rotate, crop, watermark, forms, bookmarks, and more. Interactive REPL with session persistence and autocomplete. OS-specific install scripts: `curl -fsSL oxide.fyi/install.sh | sh` (Linux/macOS) and `irm oxide.fyi/install.ps1 | iex` (Windows). - **MCP server for AI assistants** (#177): `pdf-oxide…
v0.3.10 | Low | 2/28/2026 | ### New Features - **WASM build support** (#151): WebAssembly bindings via wasm-bindgen. New `PdfDocument::open_from_bytes()` constructor enables browser-side PDF extraction. Internal reader changed from `BufReader<File>` to `BufReader<Cursor<Vec<u8>>>` for portability. - **Parallel page extraction** (#168): New `parallel` feature flag with rayon-based multi-threaded extraction. `ParallelExtractor` distributes pages across worker threads, each opening its own PdfDocument instance. Global fon…
v0.3.9 | Low | 2/25/2026 | ### Performance - **O(n²) string concat fix** (#135): Replaced `String::push_str()` accumulation in span merging with pre-allocated `Vec<&str>` joined at the end. Eliminates quadratic growth on pages with thousands of merged spans. - **Image-only content stream parser** (#113): New `parse_content_stream_images_only()` fast path that only extracts image operators (`Do`, `BI`), skipping text and graphics entirely. Used by `extract_images()` for 3-5× faster image extraction. - **Fingerprint-ba…
v0.3.8 | Low | 2/21/2026 | ### Performance - **Text-only content stream parser** (#110): New `parse_content_stream_text_only()` fast path skips graphics operators outside BT/ET blocks using byte-level scanning instead of full nom parsing. Only text-affecting operators are returned. - **Byte-level graphics scanner** (#112): Replaced nom-based operand loop with raw index arithmetic in `scan_graphics_region()`. Processes digits, dots, and whitespace at near-memcpy speed, skipping path coordinates without constructing any…
v0.3.7### Verified โ€” 3,829-PDF Corpus (v0.3.6 โ†’ v0.3.7) | Metric | v0.3.6 | v0.3.7 | Change | |--------|--------|--------|--------| | **Clean rate** | 95.7% | **99.6%** | 3,812 of 3,829 PDFs | | **Dirty PDFs** | 165 | **17** | **-90%** | Systematic benchmark testing across 3,829 real-world PDFs identified and fixed 13 text extraction issues. ### Added โ€” Parser & Decoders - **BrotliDecode stream filter** (PDF 2.0, ISO 32000-2:2020) โ€” New `BrotliDecoder` for PDFs using Brotli-compressed streams (#95Low2/20/2026
v0.3.6### Performance - **Bulk page tree cache** โ€” On first page access, the entire page tree is walked once and all pages are cached. Previously `get_page()` traversed from root for every uncached page โ€” O(n) per page, O(nยฒ) total for sequential access. Now O(1) per page after a single O(n) walk. - **isartor-6-1-12-t01-fail-a.pdf (10,000 pages): 55,667ms โ†’ 332ms (168ร— faster)** - Eliminates the last >5s PDF in the entire 3,830-file corpus - **Scan-for-object offset cache** (#44) โ€” When objects Low2/16/2026
v0.3.5## v0.3.5 Release This release delivers **major performance improvements**, **100% pass rate on 3,830 PDFs**, comprehensive error recovery for 28+ real-world PDF failures, and spec-correct rendering โ€” the biggest stability release to date. ### โšก Performance - **Font caching across pages** โ€” Document-level cache keyed by `ObjectRef` avoids re-parsing shared fonts on every page. For a 1000-page document sharing 20 fonts, this reduces font parsing from 40,000 operations to 20 - **Page object cachLow2/16/2026
v0.3.4## v0.3.4 Release This release delivers **PDF parsing robustness** for real-world malformed PDFs, **character-level text extraction** API, and **XObject path extraction** โ€” driven by community bug reports. ### ๐Ÿ”ง Fixed โ€” PDF Parsing Robustness (Issue #41) - **Header offset support** โ€” PDFs with binary prefixes or BOM headers now open successfully - Searches first 1024 bytes for `%PDF-` marker (PDF spec compliant) - Supports UTF-8 BOM, email headers, and other leading binary data - `parseLow2/13/2026
v0.3.1## v0.3.1 Release This release delivers **95% form field coverage** across Read/Create/Modify operations, comprehensive multimedia annotation support, and Python 3.8-3.14 compatibility via ABI3. ### ๐ŸŽฏ Form Field Coverage (95%) - **Hierarchical Fields**: Parent/child field structures (`address.street`, `address.city`) - `add_parent_field()`, `add_child_field()`, `add_form_field_hierarchical()` - Property inheritance between parent and child fields (FT, V, DV, Ff, DA, Q) - **Property ModifiLow1/14/2026
v0.3.0 · Low · 1/12/2026

## v0.3.0 Release

This release introduces the **Unified Pdf API**: one interface for extracting, creating, and editing PDFs. It also adds comprehensive PDF creation capabilities with tables, graphics, forms, and security.

### Unified Pdf API (Extract + Create + Edit)

- **Single API for all operations**
  - `Pdf::open("input.pdf")` - Open existing PDF for reading and editing
  - `Pdf::from_markdown(content)` - Create new PDF from Markdown
  - `Pdf::from_html(content)` - Create new PDF from…
v0.2.6 · Low · 1/10/2026

## v0.2.6 Release

This release adds comprehensive CJK (Chinese, Japanese, Korean) language support and enhances structure tree handling for Tagged PDFs.

### CJK Language Support

- **Predefined CMap support for CJK fonts** (PDF Spec Section 9.7.5.2)
  - Adobe-GB1 (Simplified Chinese) - ~500 common character mappings
  - Adobe-Japan1 (Japanese) - Hiragana, Katakana, Kanji mappings
  - Adobe-CNS1 (Traditional Chinese) - Bopomofo and CJK mappings
  - Adobe-Korea1 (Korean) - Hangul and Hanja map…
v0.2.5 · Low · 1/10/2026

## v0.2.5 Release

This release adds flexible image handling with embedding and export capabilities for HTML and Markdown conversion.

### Image Handling Features

- **Image Embedding** - Embed images as base64 in output (default)
  - HTML: `<img src="data:image/png;base64,...">`
  - Markdown: `![alt](data:image/png;base64,...)` (works in Obsidian, Typora, VS Code, Jupyter)
  - Portable - no external file dependencies
- **Image File Export** - Save images as separate files
  - Set embed_ima…
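The embedded-image output formats listed for v0.2.5 are ordinary base64 data URIs. The following Python sketch shows how such strings can be built with the standard library; `to_data_uri`, `html_img`, and `markdown_img` are hypothetical helpers for illustration and are not part of pdf_oxide.

```python
import base64
import html


def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def html_img(image_bytes: bytes, alt: str = "") -> str:
    # Escape the alt text so quotes cannot break out of the attribute.
    return f'<img src="{to_data_uri(image_bytes)}" alt="{html.escape(alt, quote=True)}">'


def markdown_img(image_bytes: bytes, alt: str = "") -> str:
    return f"![{alt}]({to_data_uri(image_bytes)})"


png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of any PNG file
uri = to_data_uri(png_signature)
md = markdown_img(png_signature, alt="figure")
tag = html_img(png_signature, alt="figure")
```

Embedding keeps the output self-contained at the cost of roughly a one-third size increase from base64, which is the usual trade-off against exporting images as separate files.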
v0.2.4 · Low · 1/10/2026

## v0.2.4 Release

This release fixes a critical text positioning bug and adds formula extraction capabilities.

### Bug Fixes

- **CTM (Current Transformation Matrix) handling** - Issue #11
  - CTM now correctly applied to text positions per PDF Spec Section 9.4.4
  - This fix affects text positioning across the entire library
  - Critical for production use with complex PDFs

### New Features

- **Structure Tree Enhancements**
  - `/Alt` (alternate description) parsing for accessibility tex…
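As a rough illustration of what applying the CTM to text positions means in v0.2.4, the Python sketch below composes a text matrix with a CTM using the six-number PDF matrix form `(a, b, c, d, e, f)` and the row-vector convention. It deliberately ignores font size, horizontal scaling, and rise, and `multiply` and `apply` are hypothetical helper names, not library functions.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)


def multiply(m1: Matrix, m2: Matrix) -> Matrix:
    """Return m1 x m2: the transform that applies m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform the point (x, y) by matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


text_matrix: Matrix = (1.0, 0.0, 0.0, 1.0, 100.0, 700.0)  # translate to (100, 700)
ctm: Matrix = (2.0, 0.0, 0.0, 2.0, 0.0, 0.0)              # page content scaled by 2

without_ctm = apply(text_matrix, 0.0, 0.0)
with_ctm = apply(multiply(text_matrix, ctm), 0.0, 0.0)
```

Skipping the CTM leaves the glyph origin at (100, 700) in this example, while the composed transform places it at (200, 1400), which is the kind of discrepancy a missing CTM would produce.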
v0.2.3 · Low · 1/7/2026

## v0.2.3 Release

This release fixes critical text positioning bugs and adds intelligent text processing for better extraction quality.

### Bug Fixes

- **BT/ET matrix reset** - Per PDF spec Section 9.4.1 (PR #10 by @drahnr)
  - Text matrices weren't being reset between text blocks, causing positions to accumulate
  - Now correctly resets transformation matrix at text block boundaries
- **Geometric spacing detection** - Markdown converter now uses proper spacing analysis (#5)
- Verbose log…
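The BT/ET fix in v0.2.3 can be pictured with a toy operator loop. This Python sketch tracks only translation via `Td` and uses a hypothetical `text_origins` function with a `reset_on_bt` switch to contrast the two behaviors; it is not the library's interpreter.

```python
from typing import List, Tuple


def text_origins(ops: List[tuple], reset_on_bt: bool = True) -> List[Tuple[float, float]]:
    """Return the text origin after each Td operator."""
    x = y = 0.0
    origins = []
    for op, *args in ops:
        if op == "BT" and reset_on_bt:
            x = y = 0.0  # each text object starts from the identity text matrix
        elif op == "Td":
            x += args[0]
            y += args[1]
            origins.append((x, y))
    return origins


ops = [
    ("BT",), ("Td", 10.0, 20.0), ("ET",),
    ("BT",), ("Td", 5.0, 5.0), ("ET",),
]
fixed = text_origins(ops, reset_on_bt=True)
buggy = text_origins(ops, reset_on_bt=False)
```

With the reset, the second text object starts from its own origin; without it, the second offset is added to the first, which matches the accumulation described in the release note.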
v0.2.2 · Low · 12/15/2025

## v0.2.2 Release

This release improves package discoverability with optimized keywords and metadata.

### Discoverability Improvements

- **Crate Keywords Optimization** - Better search results on crates.io
  - Enhanced metadata for common PDF operations
  - Improved categorization
  - Better findability for PDF extraction use cases

### Verification

- Metadata validation passed
- Search indexing updated
- Package discovery improved

### Installation

**Rust (crates.io)**

```bash
cargo …
```
v0.2.1 · Low · 12/15/2025

## v0.2.1 Release

This release fixes critical encrypted PDF support issues and improves CI/CD pipeline reliability.

### Bug Fixes

- **Encrypted stream decoding** (PR #2 and #3)
  - Fixed decryption ordering: decryption must happen before decompression
  - Fixed encryption handler initialization timing
  - Added Form XObject encryption support
  - PDFOxide now works with password-protected PDFs in production
- **CI/CD pipeline fixes** - Improved build reliability

### Community Contributors

@t…
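The ordering fix in v0.2.1 follows from how streams are written: data is compressed first and encrypted second, so a reader has to undo those steps in reverse. The Python sketch below demonstrates this with `zlib` and a toy XOR cipher. The XOR step is only a stand-in for a real PDF security handler, and `toy_cipher`, `write_stream`, and `read_stream` are hypothetical names.

```python
import zlib

KEY = 0x5A  # toy key; XOR is NOT real PDF encryption, only a placeholder


def toy_cipher(data: bytes) -> bytes:
    """Symmetric toy cipher: applying it twice restores the input."""
    return bytes(b ^ KEY for b in data)


def write_stream(plain: bytes) -> bytes:
    # Writer: compress first, then encrypt the compressed bytes.
    return toy_cipher(zlib.compress(plain))


def read_stream(stored: bytes) -> bytes:
    # Reader: decrypt first, then decompress.
    return zlib.decompress(toy_cipher(stored))


content = b"BT /F1 12 Tf (Hello) Tj ET"
stored = write_stream(content)
recovered = read_stream(stored)

# Decompressing before decrypting fails because the bytes are still ciphertext.
try:
    zlib.decompress(stored)
    wrong_order_failed = False
except zlib.error:
    wrong_order_failed = True
```

The failure in the wrong order is immediate here because the encrypted bytes no longer start with a valid zlib header.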
v0.1.4 · Low · 12/12/2025

## v0.1.4 Release

This release improves encrypted PDF handling and fixes documentation issues.

### Encryption Improvements

- **Encrypted stream decoding refinements** (PR #2)
  - Improved stream cipher handling
  - Better compatibility with various PDF encryption methods
- **Documentation and doctest fixes**
  - All examples updated and verified
  - docs.rs build fixed

### Verification

- All encryption tests pass
- Documentation verified
- Cross-platform compatibility confirmed

### I…
v0.1.3 · Low · 12/12/2025

## v0.1.3 Release

This release refines encrypted PDF handling with additional improvements for stream decryption.

### Encryption Refinements

- **Encrypted stream decoding improvements**
  - Enhanced algorithm for stream object decryption
  - Better compatibility with different PDF security handlers
  - Improved error handling for malformed encryption dictionaries

### Verification

- Encrypted PDF tests validated
- Stream decryption accuracy verified
- Multiple encryption method support co…
v0.1.2 · Low · 11/27/2025

## v0.1.2 Release

This release adds Python 3.13 support and includes GitHub sponsor configuration.

### Python Support

- **Python 3.13 Support** - Extended language version compatibility
  - PyO3 bindings updated for Python 3.13
  - Full feature parity with previous Python versions
  - Comprehensive testing on new Python version

### Community

- **GitHub Sponsor Configuration** - Support the project
  - Sponsor link added to repository
  - Community funding mechanism available

### Veri…
v0.1.1 · Low · 11/26/2025

## v0.1.1 Release

This release introduces cross-platform binary builds for Linux, macOS, and Windows.

### Platform Support

- **Cross-Platform Binaries** - Pre-built executables for all major platforms
  - **Linux**: x86_64 (glibc and musl), ARM64 support
  - **macOS**: Intel (x86_64) and Apple Silicon (ARM64)
  - **Windows**: x86_64 support
  - One-click installation - no build required

### Binary Tools

- `export_to_markdown` - PDF to Markdown conversion
- `export_to_text` - PDF to plai…
v0.1.0 · Low · 11/6/2025

## v0.1.0 Release - Initial Release

Welcome to PDFOxide - **The Complete PDF Toolkit for Rust**. This initial release brings spec-compliant PDF text extraction with intelligent reading order detection, Python bindings, and support for encrypted PDFs.

### Core Features

- **PDF Text Extraction**
  - Spec-compliant Unicode mapping per PDF Section 9.10
  - Intelligent reading order detection
  - Character-level positioning metadata
  - Support for embedded fonts and encoding
- Form Field Extrac…
