kreuzberg

bun csharp document-intelligence elixir ffi golang java metadata-extraction rag rust

Why this rank: strong adoption, recent release, healthy release cadence

Description

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and more, across 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via the CLI, REST API, or MCP server.

README

Kreuzberg

Extract text, metadata, and code intelligence from 91+ file formats and 248 programming languages at native speeds without needing a GPU.

Key Features

Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 248 programming languages via tree-sitter. Results are exposed in ExtractionResult.code_intelligence, with semantic chunking
Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 146 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through liter-llm
OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (146 vision model providers including local engines), extensible via plugin API
High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
Memory efficient – Streaming parsers for multi-GB files
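To illustrate why a tabular wire format such as TOON can be more compact than JSON, here is a minimal sketch. It is not Kreuzberg's serializer: `encode_table` is a hypothetical helper that handles only flat rows with identical keys, performs no quoting or escaping, and the sample `chunks` data is invented. It compares character counts, not tokens.

```python
import json


def encode_table(name, rows):
    """Encode a list of flat dicts with identical keys as a TOON-style table.

    Simplified for illustration: no quoting, escaping, or nested values.
    """
    if not rows:
        return f"{name}[0]:"
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("all rows must share the same keys in the same order")
    # The header carries the row count and the field names exactly once.
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


# Hypothetical sample data, not real Kreuzberg output.
chunks = [
    {"page": 1, "kind": "heading", "chars": 42},
    {"page": 1, "kind": "paragraph", "chars": 512},
    {"page": 2, "kind": "table", "chars": 230},
]

as_json = json.dumps({"chunks": chunks})
as_toon = encode_table("chunks", chunks)
print(as_toon)
print(len(as_json), "characters as JSON vs", len(as_toon), "in the tabular form")
```

The saving comes from stating the field names once in the header instead of repeating them in every object, so it grows with the number of uniform rows.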

Complete Documentation | Live Demo | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
Ruby – RubyGems package, idiomatic Ruby API, native bindings
PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
Elixir – Hex package, OTP integration, concurrent processing
R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

@kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
@kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

Go – Go module with FFI bindings, context-aware async
Java – Maven Central, Foreign Function & Memory API
C# – NuGet package, .NET 6.0+, full async/await support

Native:

Rust – Core library, flexible feature flags, zero-copy APIs
C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for ARM64 (Apple Silicon) on macOS.

Platform Support

Architecture coverage by language binding:

Language	Linux x86_64	Linux aarch64	macOS ARM64	Windows x64
Python	✅	✅	✅	✅
Node.js	✅	✅	✅	✅
WASM	✅	✅	✅	✅
Ruby	✅	✅	✅	-
R	✅	✅	✅	✅
Elixir	✅	✅	✅	✅
Go	✅	✅	✅	✅
Java	✅	✅	✅	✅
C#	✅	✅	✅	✅
PHP	✅	✅	✅	✅
Rust	✅	✅	✅	✅
C (FFI)	✅	✅	✅	✅
CLI	✅	✅	✅	✅
Docker	✅	✅	✅	-

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

1. Install ONNX Runtime 1.24+:
   - Linux: Download from ONNX Runtime releases (Debian packages may have older versions)
   - macOS: brew install onnxruntime
   - Windows: Download from ONNX Runtime releases
2. Use embeddings in your code; see the Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
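If you script your environment checks, the 1.24+ requirement can be tested with a small comparison like the sketch below. `parse_version` and `meets_minimum` are hypothetical helpers, not part of Kreuzberg, and how you obtain the installed version string (package manager output, release file name) is left to you.

```python
MINIMUM = (1, 24)  # minimum ONNX Runtime major.minor stated above


def parse_version(text):
    """Return (major, minor) from a version string such as '1.24.1'."""
    parts = text.strip().split(".")
    return tuple(int(p) for p in parts[:2])


def meets_minimum(text, minimum=MINIMUM):
    """True if the version is at least the required major.minor."""
    return parse_version(text) >= minimum


for candidate in ["1.24.1", "1.23.2", "1.9.0", "2.0"]:
    print(candidate, "ok" if meets_minimum(candidate) else "too old")
```

Comparing integer tuples rather than raw strings matters here: as text, "1.9.0" sorts after "1.24.1", which would wrongly accept an older release.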

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category	Formats	Capabilities
Word Processing	`.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt`, `.pages`	Full text, tables, lists, images, metadata, styles
Spreadsheets	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods`, `.numbers`	Sheet data, formulas, cell metadata, charts
Presentations	`.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.key`	Slides, speaker notes, images, metadata
PDF	`.pdf`	Text, tables, images, metadata, OCR support
eBooks	`.epub`, `.fb2`	Chapters, metadata, embedded resources
Database	`.dbf`	Table data extraction, field type support
Hangul	`.hwp`, `.hwpx`	Korean document format, text extraction

Images (OCR-Enabled)

Category	Formats	Features
Raster	`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`	OCR, table detection, EXIF metadata, dimensions, color space
Advanced	`.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`	Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector	`.svg`	DOM parsing, embedded text, graphics metadata

Web & Data

Category	Formats	Features
Markup	`.html`, `.htm`, `.xhtml`, `.xml`, `.svg`	DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data	`.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`	Schema detection, nested structures, validation
Text & Markdown	`.txt`, `.md`, `.markdown`, `.djot`, `.mdx`, `.rst`, `.org`, `.rtf`	CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text

Email & Archives

Category	Formats	Features
Email	`.eml`, `.msg`	Headers, body (HTML/plain), attachments, UTF-16 support
Archives	`.zip`, `.tar`, `.tgz`, `.gz`, `.7z`	Recursive extraction, nested archives, metadata

Academic & Scientific

Category	Formats	Features
Citations	`.bib`, `.ris`, `.nbib`, `.enw`, `.csl`	BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific	`.tex`, `.latex`, `.typ`, `.typst`, `.jats`, `.ipynb`	LaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing	`.fb2`, `.docbook`, `.dbk`, `.opml`	FictionBook, DocBook XML, OPML outlines
Documentation	`.pod`, `.mdoc`, `.troff`	Perl POD, man pages, troff

Complete Format Reference →
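As a rough illustration of how the tables above map files to categories, the sketch below routes a path by its extension. It is an assumption-laden simplification: `CATEGORIES` holds only a subset of the listed extensions, `categorize` is a hypothetical helper, and the library's own format detection is not necessarily extension-based.

```python
from pathlib import PurePath

# Subset of the extensions listed in the tables above.
CATEGORIES = {
    "Word Processing": {".docx", ".odt", ".pages"},
    "Spreadsheets": {".xlsx", ".xls", ".ods", ".numbers"},
    "PDF": {".pdf"},
    "Raster": {".png", ".jpg", ".jpeg", ".tiff"},
    "Email": {".eml", ".msg"},
    "Archives": {".zip", ".tar", ".tgz", ".gz", ".7z"},
}


def categorize(path):
    """Return the category for a path's final extension, or None if unknown."""
    suffix = PurePath(path).suffix.lower()  # case-insensitive match
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


for name in ["Report.PDF", "backup.tar.gz", "inbox.eml", "notes.xyz"]:
    print(name, "->", categorize(name))
```

Only the last suffix is inspected, so `backup.tar.gz` is classified through `.gz`. Extensions that appear in more than one table (such as `.svg` or `.fb2`) would need an explicit precedence rule.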

Code Intelligence (248 Languages)

Feature	Description
Structure Extraction	Functions, classes, methods, structs, interfaces, enums
Import/Export Analysis	Module dependencies, re-exports, wildcard imports
Symbol Extraction	Variables, constants, type aliases, properties
Docstring Parsing	Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
Diagnostics	Parse errors with line/column positions
Syntax-Aware Chunking	Split code by semantic boundaries, not arbitrary byte offsets
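The idea behind syntax-aware chunking can be sketched with Python's standard `ast` module standing in for tree-sitter. This is not Kreuzberg's chunker: `chunk_by_definition` is a hypothetical helper that handles only Python source and only top-level statements.

```python
import ast


def chunk_by_definition(source):
    """Return one chunk per top-level statement, never cutting a def or class in half."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        start = node.lineno
        # Decorators sit above the def/class line, so start the chunk there.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators)
        chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks


SAMPLE = '''import os

def first():
    return 1

class Box:
    def get(self):
        return 2
'''

chunks = chunk_by_definition(SAMPLE)
for index, chunk in enumerate(chunks):
    print(f"--- chunk {index} ---")
    print(chunk)
```

Because the boundaries come from the parse tree, a chunk always contains a whole function or class, whereas a fixed byte window could end in the middle of `Box.get`.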

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →
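The pattern of concurrent extraction with a configurable worker count can be sketched with the standard library alone. Nothing here is Kreuzberg's API: `extract_stub` is a stand-in for a real extraction call, and `batch_extract` and `max_workers` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_stub(document):
    """Stand-in for a real extraction call: decode bytes and count characters."""
    text = document.decode("utf-8", errors="replace")
    return {"content": text, "chars": len(text)}


def batch_extract(documents, max_workers=4):
    """Run the stub over all documents with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even when tasks finish out of order.
        return list(pool.map(extract_stub, documents))


docs = [b"first document", b"second", b"third one"]
results = batch_extract(docs, max_workers=2)
print([r["chars"] for r in results])
```

Keeping the results in input order makes it easy to pair each result with its source file, and the context manager releases the worker threads once the batch completes.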

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →
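Trying several candidate passwords in order can be sketched as follows. This is illustrative only: `open_stub` and `WrongPassword` are hypothetical stand-ins for a real PDF opener and its error type, and the document is modelled as a plain dictionary.

```python
class WrongPassword(Exception):
    """Raised by the stub when a password does not unlock the document."""


def open_stub(document, password):
    """Stand-in for a real PDF opener; returns the text if the password matches."""
    if password != document["password"]:
        raise WrongPassword(password)
    return document["text"]


def open_with_passwords(document, passwords):
    """Try each candidate in order and return the first successful result."""
    for candidate in passwords:
        try:
            return open_stub(document, candidate)
        except WrongPassword:
            continue  # fall through to the next candidate
    raise WrongPassword("no candidate password worked")


locked = {"password": "s3cret", "text": "quarterly figures"}
print(open_with_passwords(locked, ["guess", "s3cret", "unused"]))
```

The loop stops at the first password that works, so later candidates are never tried, and an exhausted list surfaces as an error rather than an empty result.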

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Installation Guide – Setup and dependencies
User Guide – Comprehensive usage guide
API Reference – Complete API documentation
Format Support – Supported file formats
OCR Backends – OCR engine setup
CLI Guide – Command-line usage
Migration Guides – Upgrading from other libraries

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.

Release History

Version	Changes	Urgency	Date
v4.9.9	Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/commits/v4.9.9	High	6/5/2026
v4.9.8	LTS patch release. Four targeted bug fixes plus dependency pinning so the branch builds against current crates.io releases. ### Fixed - #934: RTF hex byte escapes now honor `\ansicpgNNNN`, so CP1251 Cyrillic byte runs decode as readable text instead of Windows-1252 mojibake. - #937: `ExtractionConfig(cancel_token=…)` raised `TypeError: unexpected keyword argument 'cancel_token'` from Python despite the type stub declaring the kwarg. The `#[pyo3(signature = …)]` on `ExtractionConfig::_	High	5/17/2026
v4.9.7	Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/compare/v4.9.6...v4.9.7	High	5/8/2026
v4.9.6	Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/compare/v4.9.5...v4.9.6	High	5/8/2026
v4.9.5	## Fixed - #790: Fix GPU acceleration — kreuzberg now bundles CPU-only ONNX Runtime by default (zero-config). When a GPU execution provider (`cuda`, `tensorrt`, `coreml`) is explicitly requested via `AccelerationConfig` but unavailable, kreuzberg returns an error with setup instructions instead of silently falling back to CPU. `Auto` mode gracefully falls back to CPU with an info log. For GPU support, set `ORT_DYLIB_PATH` to a GPU-enabled ONNX Runtime. - #791: Fix DOCX OCR extraction —	High	4/23/2026
v4.9.3	See [CHANGELOG.md](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#493---2026-04-22) for full details.	High	4/22/2026
v4.9.2	## Fixed - Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors — cancellation was silently ignored in WASM builds - Propagate `Cancelled` error code (9) to all bindings — Go, C FFI, Python, TypeScript, Java, C#, and C API docs now include the new code - Fix PHP e2e embed tests calling instance methods statically — use procedural `\Kreuzberg\embed()` functions - Fix TypeScript e2e embed tests using wrong field names (`type`/`nam	High	4/19/2026
v4.9.1	## Fixed - #754: Preserve `_internal_bindings.pyi` type stub during wheel artifact cleanup — published wheels now include inline type information for the core binding module - Add missing `Default` impl for `PyCancellationToken` to satisfy clippy `new_without_default` lint - Improve download resilience for `eng.traineddata` in build script — increase retries from 3 to 5, add fallback URL via `raw.githubusercontent.com`, and increase timeout to 300s - Increase Task installer retry resilience	High	4/19/2026
v4.9.0	## What's Changed * Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in https://github.com/kreuzberg-dev/kreuzberg/pull/701 * chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/698 * chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/711 * fix: remove duplicate output_format key and fix numeric types in OCR metadat	High	4/18/2026
v4.8.6	## What's Changed * Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in https://github.com/kreuzberg-dev/kreuzberg/pull/701 * chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/698 * chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/711 * fix: remove duplicate output_format key and fix numeric types in OCR metadat	High	4/17/2026
v4.8.5	## What's Changed ### Added - LLM usage tracking — new `llm_usage` field on `ExtractionResult` captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Exposed across all 12 bindings. ### Fixed - Markdown chunker heading duplication when `prepend_heading_context` is enabled (#701) - Helm chart icon 404 on Artifact Hub — `.png` → `.svg` - Python wheel manylinux compliance — bumped to `ma	High	4/14/2026
v4.8.4	## What's Changed ### Added - Helm chart for Kubernetes deployment — minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695) - Helm lint and kubeconform pre-commit hooks — added `helm lint --strict` and `kubeconform` (k8s 1.28.0 schema validation) to pre-commit and CI pipeline. - Helm chart publish workflow — new `publish-helm.yaml` GitHub Actions workflow pushes versioned char	High	4/13/2026
v4.8.2	## Added - `HtmlOutputConfig` typed in all bindings — `html_output` config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core. ## Fixed - PDF: legitimate repeated content stripped during page merging regardless of `strip_repeating_text` flag — `deduplicate_paragraphs()` runs unconditionally, stripping brand names and other legitimately repeated	High	4/10/2026
benchmark-run-24180384454	Comparative benchmark results from workflow run [24180384454](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/24180384454). Commit: 2e10062f4424621cae8bd7844377c70d223e1724 Date: 2026-04-09	Medium	4/9/2026
v4.8.1	### Added - Styled HTML output — New `HtmlOutputConfig` on `ExtractionConfig` with 5 built-in themes (`default`, `github`, `dark`, `light`, `unstyled`), semantic `kb-` CSS class hooks on every structural element, CSS custom properties (`--kb-`), custom CSS injection (inline or file), and configurable class prefix. The existing `Html` output format is upgraded in-place when `html_output` is set (#633, #665) - 5 new CLI flags: `--html-theme`, `--html-css`, `--html-css-file`, `--html-class-p	Medium	4/9/2026
benchmark-run-24136316825	Comparative benchmark results from workflow run [24136316825](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/24136316825). Commit: cc69ae69bfda402b016b90fcfd84ccfdbf816c8d Date: 2026-04-08	Medium	4/8/2026
benchmark-run-24118817847	Comparative benchmark results from workflow run [24118817847](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/24118817847). Commit: c6a1aadd63e6c685ba06b998e6c8c98d83b13c16 Date: 2026-04-08	Medium	4/8/2026
v4.8.0	## What's Changed * fix: correct HWP tag constants and control character handling by @nuri-yoo in https://github.com/kreuzberg-dev/kreuzberg/pull/659 * feat: standalone text embedding API (#599) by @kh3rld in https://github.com/kreuzberg-dev/kreuzberg/pull/614 * feat: integrate liter-llm for VLM OCR, embeddings, and structured extraction by @Goldziher in https://github.com/kreuzberg-dev/kreuzberg/pull/662 ## New Contributors * @nuri-yoo made their first contribution in https://github.com/kreuzb	High	4/8/2026
v4.7.4	### Added - Re-added `--layout` boolean CLI flag for easy layout detection enablement (use `--layout` to enable with model defaults, `--layout false` to explicitly disable) - arXiv watermark/sidebar noise filtering for academic PDFs — strips LaTeX sidebar identifiers from extracted text - Second-tier cross-page repeating text detection — catches conference headers and journal running titles that repeat on >70% of pages but appear outside the margin zone - Figure/picture text suppression — text	Medium	4/6/2026
v4.7.3	## Fixed - Archive extraction SIGBUS crash on macOS ARM64 — ZIP, 7Z, TAR, and GZIP archive extraction crashed with SIGBUS (signal 10) in release builds due to miscompilation of unsafe code in `sevenz-rust2` and `zip` crates under `opt-level=3`. Reduced optimization level to 2 for these crates. This also fixes Elixir, R, Go, and C benchmark crashes when processing archive files. - Native-text PDF extraction fails when OCR backend unavailable (#646) — PDFs with extractable native text har	Medium	4/5/2026
v4.7.2	## What's Changed ### Added - E2E generator published mode — Generate standalone test apps against published registry versions for all 12 language bindings ### Changed - Global model cache (#641) — Models now download to platform-appropriate global cache directory instead of per-directory `.kreuzberg/` folders ### Fixed - Leptonica DPI crash (#606) — Images with 0 DPI caused C++ exception abort during preprocessing. Now validates and fixes DPI to 72 before preprocessing. Also disa	Medium	4/4/2026
v4.7.1	## [4.7.1] - 2026-04-03 ### Added - Tree-sitter grammar management CLI — New `kreuzberg tree-sitter` subcommand with `download`, `list`, `cache-dir`, and `clean` sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (`--groups web,systems,scripting`), or all (`--all`). Reads `[tree_sitter]` config from `kreuzberg.toml` with `--from-config`. - Tree-sitter grammar management API — New REST endpoints: `POST /grammars/download`, `GET /grammars/	Medium	4/4/2026
v4.7.0	## Highlights ### Code Intelligence — 248 Languages New `code_intelligence` field on `ExtractionResult` via [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack). Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics. Semantic code chunking that respects scope boundaries. Configure with `CodeContentMode`: `chunks`, `raw`, or `structure`. ### Benchmark-Driven Extraction Quality 350+ test documents across 23 formats with Structural F	Medium	4/3/2026
v4.6.3	## [4.6.3] - 2026-03-27 ### Added - Tower service layer (`service` module): Composable `ExtractionService` implementing `tower::Service` with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New `tower-service` feature flag, auto-enabled by `api` and `mcp`. `ExtractionServiceBuilder` provides ergonomic layer composition. - Semantic OpenTelemetry conventions (`telemetry` module): Formal `kreuzberg.*` attribute namespace with 30+ span attributes, metric name	Medium	3/27/2026
v4.6.2	## [4.6.2] - 2026-03-26 ### Added - PDF page rendering API (#583): New `render_pdf_page` function and `PdfPageIterator` for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call. ### Fixed - Table recognition coordinate mismatch on scanned PDFs (#582): Layout detection bboxes (640x640 model spac	Medium	3/26/2026
v4.6.1	## Fixes - OCR memory usage reduced 60-78%: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS. - PDF control character encoding artifacts: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other c	Medium	3/25/2026
v4.6.0	## v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes ### Added - Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own `ExtractionResult`. New `ArchiveEntry` type and `max_archive_depth` config. - YAML/JSON section chunker: New `ChunkerType::Yaml` with full hierarchy paths and auto-inference from metadata. - Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, attributes bag. Al	Medium	3/25/2026
v4.5.4	## [4.5.4] - 2026-03-23 ### Fixed - PDF image extraction panic on mismatched buffer lengths (#552): Replaced `assert!` with graceful error handling. Malformed PDF images are now skipped instead of panicking. Regression from v4.5.0. - `pdf` feature compilation without `layout-detection` (#550): `config.layout` reference gated behind `#[cfg(feature = "layout-detection")]`. - WASM module resolution in Supabase/Deno edge functions (#551): Added explicit `package.json` exports and Deno	Medium	3/23/2026
v4.5.3	## What's New ### SLANeXT Table Structure Recognition Alternative table structure backends alongside TATR. New `table_model` field on `LayoutDetectionConfig` selects the backend: \| Model \| Config Value \| Size \| Best For \| \|-------\|-------------\|------\|----------\| \| TATR \| `"tatr"` (default) \| 30 MB \| General-purpose, consistent results \| \| SLANeXT Wired \| `"slanet_wired"` \| 365 MB \| Bordered/gridlined tables \| \| SLANeXT Wireless \| `"slanet_wireless"` \| 365 MB \| Borderless tables \| \| SLANeXT A	Low	3/22/2026
v4.5.2	## Fixed - PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. `"s hall a b e active"` instead of `"shall be active"`). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (`font_size × 0.33` threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact. - **Markdown undersco	Low	3/21/2026
v4.5.1	See [CHANGELOG.md](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#451---2026-03-20) for release notes.	Low	3/21/2026
benchmark-run-23359982805	Comparative benchmark results from workflow run [23359982805](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/23359982805). Commit: d0624792f343e8dab8c7468dcc2a2c4930741157 Date: 2026-03-21	Low	3/21/2026
v4.5.0	See [CHANGELOG.md](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#450---2026-03-20) for full release notes.	Low	3/20/2026
v4.4.6	## Added - dBASE (.dbf) format support: Extract table data from dBASE files as markdown tables with field type support. - Hangul Word Processor (.hwp/.hwpx) support: Extract text content from HWP 5.0 documents (standard Korean document format). - Office template/macro format variants: Added support for `.docm`, `.dotx`, `.dotm`, `.dot` (Word), `.potx`, `.potm`, `.pot` (PowerPoint), `.xltx`, `.xlt` (Excel) formats. ## Fixed - DOCX image placeholders missing (#484): Extracting `	Low	3/13/2026
benchmark-run-23042674034	Comparative benchmark results from workflow run [23042674034](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/23042674034). Commit: 31ae19c77ebfd0cd6e809078c28a7cfb2388edeb Date: 2026-03-13	Low	3/13/2026
v4.4.4	## Fixed - CLI test app fixes: Fixed broken symlinks in CLI test documents, corrected `--format` to `--output-format` flag usage, fixed multipart form field name (`file=` → `files=`) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection. - Publish idempotency check scripts: Fixed `check_nuget.sh` and `check-nuget-version.sh` using bash 4+ `${var,,}` syntax incompatible with bash 3.x. Fixed `check_pypi.sh` and `check_packagist.sh` wri	Low	3/7/2026
v4.4.3	## Added - PDF image placeholder toggle: New `inject_placeholders` option on `ImageExtractionConfig` (default: `true`). Set to `false` to extract images as data without injecting `![image](...)` references into the markdown content. ## Fixed - Token reduction not applied ([#436](https://github.com/kreuzberg-dev/kreuzberg/issues/436)): Token reduction config was accepted but never executed during extraction. The pipeline now applies `reduce_tokens()` when `token_reduction.mode` is conf	Low	3/6/2026
v4.4.2	## Fixed - E2E element type assertions: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C# - Ruby PDF annotation extraction: Fixed `PdfAnnotation` and `PdfAnnotationBoundingBox` autoload and bounding box field name mismatch - WASM OCR blocking event loop: OCR now runs in a worker thread, keeping the main thread responsive - JPEG 2000 OCR decode failure: Shared `load_image_for_ocr()` helper with `hayro-jpeg2000`/`	Low	3/4/2026
benchmark-run-22610076103	Comparative benchmark results from workflow run [22610076103](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22610076103). Commit: 8b7e35c641e3a918146d73e4b954c6d3a5cb6bdf Date: 2026-03-03	Low	3/3/2026
v4.4.1	## Added - OCR table inlining into markdown content (#421): When `output_format = Markdown` and OCR detects tables, the markdown pipe tables are now inlined into `result.content` at their correct vertical positions instead of only appearing in `result.tables`. Adds `OcrTableBoundingBox` to `OcrTable` for spatial positioning. Sets `metadata.output_format = "markdown"` to signal pre-formatted content and skip re-conversion. - OCR table bounding boxes: OCR-detected tables now include bound	Low	3/1/2026
benchmark-run-22521432636	Comparative benchmark results from workflow run [22521432636](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22521432636). Commit: 978102f360273632db02791b07f4106af9af7408 Date: 2026-03-01	Low	3/1/2026
v4.4.0	## Added - R language bindings -- Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets. - PHP async extraction: Non-blocking extraction via `DeferredResult` pattern with Tokio thread pool. Includes `extractFileAsync()`, `extractBytesAsync()`, `batchExtractFilesAsync()`, `batchExtractBytesAsync()` across O	Low	2/28/2026
benchmark-run-22450525715	Comparative benchmark results from workflow run [22450525715](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22450525715). Commit: 310e9d5cd64ede2e12516d34a5471484d78fb88b Date: 2026-02-26	Low	2/26/2026
benchmark-run-22394565362	Comparative benchmark results from workflow run [22394565362](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22394565362). Commit: 47a391a555a76733d93507941671d83347de9bf9 Date: 2026-02-25	Low	2/25/2026
benchmark-run-22314146367	Comparative benchmark results from workflow run [22314146367](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22314146367). Commit: 1717eac8d1066271bbe3f3bc4d2371378a954299 Date: 2026-02-23	Low	2/23/2026
benchmark-run-22297171341	Comparative benchmark results from workflow run [22297171341](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22297171341). Commit: c72b316aff6d2c052409b0c398735f6920ab3231 Date: 2026-02-23 Note: Release created locally after CI aggregation step failed due to missing `gh` CLI on runner.	Low	2/23/2026
v4.3.8	## Added - MDX format support (`mdx` feature): Extract text from `.mdx` files, stripping JSX/import/export syntax while preserving markdown content, frontmatter, tables, and code fences - List supported formats API (#404): Query all supported file extensions and MIME types via `list_supported_formats()` in Rust, `GET /formats` REST endpoint, `list_formats` MCP tool, or `kreuzberg formats` CLI subcommand ## Fixed - PDF ligature corruption in CM/Type1 fonts: Added contextual ligatur	Low	2/22/2026
v4.3.7	## Added - NFC unicode normalization applied to all extraction outputs, ensuring consistent representation of composed characters across all backends (gated behind `quality` feature) - Configurable PDF page margin fractions (`top_margin_fraction`, `bottom_margin_fraction`) in `PdfConfig` - PDF annotation extraction with new `PdfAnnotation` type supporting `Text`, `Highlight`, `Link`, `Stamp`, `Underline`, `StrikeOut`, and `Other` annotation types - `extract_annotations` configuration option in	Low	2/20/2026
v4.3.6	## Added - Pdfium `PdfParagraph` object-based extraction: New markdown extraction path using pdfium's `PdfParagraph::from_objects()` for spatial text grouping, replacing raw page-object iteration. Provides accurate per-line baseline positions via `into_lines()` and styled text fragments with bold/italic/monospace detection. - Structure tree and content marks API in pdfium-render: New `ExtractedBlock`, `ContentRole`, and `PdfParagraph` types for tagged PDF semantic extraction. Structure	Low	2/19/2026
v4.3.5	## What's New ### Bounding Box Support - `bounding_box` on `Table` and `ExtractedImage`: Spatial position data (`BoundingBox` with `x0, y0, x1, y1`) now available on both types across all 10 language bindings (Rust, Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM). - Table bounding boxes computed from PDF character positions: During PDF extraction, table bounding boxes are calculated from constituent character positions for precise spatial layout. ### Inline Markdown Embeddin	Low	2/17/2026
benchmark-run-22075206511	Comparative benchmark results from workflow run [22075206511](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22075206511). Commit: a117bb06cad445045639e2e48694476eab38b2a5 Date: 2026-02-17	Low	2/17/2026
v4.3.4	## What's New in v4.3.4 ### Fixed - Node.js keyword extraction fields missing: The TypeScript `convertResult()` type converter was silently dropping `extractedKeywords`, `qualityScore`, and `processingWarnings` from NAPI results because it only copied explicitly listed fields. Added the missing field conversions. Also renamed the mismatched `keywords` property to `extractedKeywords` in the TypeScript types to match the NAPI binding definition. - **Windows PHP CI build failure (`crc::Table`	Low	2/16/2026
benchmark-run-22020443124	Comparative benchmark results from workflow run [22020443124](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22020443124). Commit: 865c45aec930fedfee97507fb9a0bc8b4d27a0a8 Date: 2026-02-14	Low	2/14/2026
v4.3.3	## What's New in v4.3.3 ### PaddleOCR Multi-Language Support (#388) - 106+ language support via 12 script families: PaddleOCR recognition models now cover english, chinese (simplified+traditional+japanese), latin, korean, east slavic (cyrillic), thai, greek, arabic, devanagari, tamil, telugu, and kannada script families. - Per-family recognition model architecture: Shared detection/classification models with per-family recognition models and dictionaries, downloaded on demand from Huggi	Low	2/14/2026
benchmark-run-21996566749	Comparative benchmark results from workflow run [21996566749](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/21996566749). Commit: 9ee7e78ac6a57ef32bc1467dfe2755b1ab693dbf Date: 2026-02-13	Low	2/13/2026
v4.3.2	## Fixed ### PHP 8.4 Requirement Update - Updated PHP requirement to 8.4+: All PHP composer.json files, CI workflows, and documentation now require PHP 8.4+ to support PHPUnit 13.0. This fixes CI validation and PHP workflow failures caused by PHPUnit 13.0 requiring PHP 8.4.1+. ### Elixir Publishing Workflow - Fixed macOS ARM64 build timeout: Increased timeout from 180 to 300 minutes (5 hours) for macOS ARM64 Elixir native library builds. The previous timeout caused incomplete builds an	Low	2/13/2026
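The v4.9.8 entry above describes RTF hex escapes being decoded with the code page declared by `\ansicpgNNNN`. The effect can be reproduced with a minimal sketch. It is not the library's RTF parser: `decode_rtf_hex_escapes` is a hypothetical helper that looks only at `\'hh` escapes and ignores every other RTF construct.

```python
import re


def decode_rtf_hex_escapes(rtf, default_codepage=1252):
    """Decode \\'hh escapes using the code page declared by \\ansicpg, if any."""
    match = re.search(r"\\ansicpg(\d+)", rtf)
    codepage = int(match.group(1)) if match else default_codepage
    # Collect the escaped bytes in document order.
    raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf))
    return raw.decode(f"cp{codepage}")


with_codepage = r"{\rtf1\ansi\ansicpg1251 \'cf\'f0\'e8\'e2\'e5\'f2}"
without_codepage = r"{\rtf1\ansi \'cf\'f0\'e8\'e2\'e5\'f2}"

print(decode_rtf_hex_escapes(with_codepage))     # Cyrillic text
print(decode_rtf_hex_escapes(without_codepage))  # Windows-1252 mojibake
```

The same six bytes read as a Russian word under CP1251 and as accented Latin letters under Windows-1252, which is why honoring the declared code page matters.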

