kreuzberg

Why this rank: Strong adoption, recent release, healthy release cadence

Description

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

README

Kreuzberg

Extract text, metadata, and code intelligence from 97+ file formats and 305 programming languages at native speeds without needing a GPU.

Key Features

  • Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 248 programming languages via tree-sitter. Results are exposed in ExtractionResult.code_intelligence, with semantic chunking
  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
  • 91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 146 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through liter-llm
  • OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (146 vision model providers including local engines), extensible via plugin API
  • High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
  • Flexible deployment – Use as a library, CLI tool, REST API server, or MCP server
  • TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
  • GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
  • HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
  • Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Live Demo | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
  • Elixir – Hex package, OTP integration, concurrent processing
  • R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs
  • C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for Apple Silicon (ARM64) on macOS.

Platform Support

Complete architecture coverage across all language bindings:

Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64
Python | ✅ | ✅ | ✅ | ✅
Node.js | ✅ | ✅ | ✅ | ✅
WASM | ✅ | ✅ | ✅ | ✅
Ruby | ✅ | ✅ | ✅ | -
R | ✅ | ✅ | ✅ | ✅
Elixir | ✅ | ✅ | ✅ | ✅
Go | ✅ | ✅ | ✅ | ✅
Java | ✅ | ✅ | ✅ | ✅
C# | ✅ | ✅ | ✅ | ✅
PHP | ✅ | ✅ | ✅ | ✅
Rust | ✅ | ✅ | ✅ | ✅
C (FFI) | ✅ | ✅ | ✅ | ✅
CLI | ✅ | ✅ | ✅ | ✅
Docker | ✅ | ✅ | ✅ | -

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.24+:

  2. Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category | Formats | Capabilities
Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt, .pages | Full text, tables, lists, images, metadata, styles
Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbers | Sheet data, formulas, cell metadata, charts
Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .key | Slides, speaker notes, images, metadata
PDF | .pdf | Text, tables, images, metadata, OCR support
eBooks | .epub, .fb2 | Chapters, metadata, embedded resources
Database | .dbf | Table data extraction, field type support
Hangul | .hwp, .hwpx | Korean document format, text extraction

Images (OCR-Enabled)

Category | Formats | Features
Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space
Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector | .svg | DOM parsing, embedded text, graphics metadata

Web & Data

Category | Formats | Features
Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation
Text & Markdown | .txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtf | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text

Email & Archives

Category | Formats | Features
Email | .eml, .msg | Headers, body (HTML/plain), attachments, UTF-16 support
Archives | .zip, .tar, .tgz, .gz, .7z | Recursive extraction, nested archives, metadata
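
To make "recursive extraction" of nested archives concrete, here is a minimal sketch using only Python's standard zipfile module. It is not Kreuzberg's implementation: the function names, the `max_depth` parameter, and the `!/` path separator are assumptions chosen for illustration, and only ZIP is handled.

```python
import io
import zipfile


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> content mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def list_archive_entries(data: bytes, max_depth: int = 3, _prefix: str = "") -> list[str]:
    """List file paths in a ZIP, opening nested ZIPs until max_depth archive levels are reached."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = _prefix + info.filename
            if info.filename.lower().endswith(".zip") and max_depth > 1:
                # Descend into the nested archive, spending one level of depth.
                nested = archive.read(info)
                entries.extend(list_archive_entries(nested, max_depth - 1, path + "!/"))
            else:
                # Regular file, or a nested archive beyond the depth limit.
                entries.append(path)
    return entries


inner = make_zip({"b.txt": b"hello"})
outer = make_zip({"a.txt": b"x", "inner.zip": inner})
print(list_archive_entries(outer))
```

A depth limit of this kind keeps deeply nested or maliciously crafted archives from causing unbounded recursion.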

Academic & Scientific

Category | Formats | Features
Citations | .bib, .ris, .nbib, .enw, .csl | BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific | .tex, .latex, .typ, .typst, .jats, .ipynb | LaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing | .fb2, .docbook, .dbk, .opml | FictionBook, DocBook XML, OPML outlines
Documentation | .pod, .mdoc, .troff | Perl POD, man pages, troff

Complete Format Reference →
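
As a rough illustration of how the categories above relate to file extensions, the sketch below maps a small subset of the listed extensions to their table category. The `CATEGORY_BY_EXTENSION` table and `detect_category` function are hypothetical names; Kreuzberg's own format detection is not shown here and may work differently (for example, by inspecting file contents).

```python
from pathlib import Path
from typing import Optional

# Illustrative subset copied from the format tables above,
# not Kreuzberg's internal registry.
CATEGORY_BY_EXTENSION = {
    ".docx": "Word Processing",
    ".odt": "Word Processing",
    ".xlsx": "Spreadsheets",
    ".pptx": "Presentations",
    ".pdf": "PDF",
    ".epub": "eBooks",
    ".png": "Raster",
    ".jpg": "Raster",
    ".html": "Markup",
    ".json": "Structured Data",
    ".md": "Text & Markdown",
    ".eml": "Email",
    ".zip": "Archives",
    ".bib": "Citations",
    ".ipynb": "Scientific",
}


def detect_category(path: str) -> Optional[str]:
    """Return the table category for a file name, or None if the extension is not in the subset."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the last extension
    return CATEGORY_BY_EXTENSION.get(suffix)


for name in ["Report.PDF", "notes.md", "data.unknown"]:
    print(name, "->", detect_category(name))
```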

Code Intelligence (248 Languages)

Feature | Description
Structure Extraction | Functions, classes, methods, structs, interfaces, enums
Import/Export Analysis | Module dependencies, re-exports, wildcard imports
Symbol Extraction | Variables, constants, type aliases, properties
Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
Diagnostics | Parse errors with line/column positions
Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets

Powered by tree-sitter-language-pack with dynamic grammar download. See TSLP documentation for the full language list.
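
The idea behind syntax-aware chunking can be sketched without tree-sitter. The example below uses Python's built-in ast module as a stand-in parser and splits a source string at top-level function and class boundaries. `chunk_by_definition` is a hypothetical name, and this is only an approximation of the technique, not Kreuzberg's chunker.

```python
import ast


def chunk_by_definition(source: str) -> list[tuple[str, str]]:
    """Split Python source into (name, text) chunks, one per top-level def or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            # Decorator lines above the definition are not included.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append((node.name, text))
    return chunks


example = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
for name, text in chunk_by_definition(example):
    print(f"--- {name} ---")
    print(text)
```

Because each chunk ends where a definition ends, no function or class body is cut in half, which is the property a byte-offset splitter cannot guarantee.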

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →
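
A minimal sketch of concurrent processing with configurable parallelism, using only the Python standard library. `fake_extract` is a hypothetical stand-in for a real extraction call, and `batch_extract` and `max_workers` are illustrative names rather than Kreuzberg's batch API, which is covered in the guide linked above.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_extract(path: str) -> str:
    """Hypothetical stand-in for a real extraction call; returns a dummy string."""
    return f"text from {path}"


def batch_extract(paths: list[str], max_workers: int = 4) -> list[str]:
    """Process all paths with at most max_workers tasks in flight, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs, not completion order.
        return list(pool.map(fake_extract, paths))


print(batch_extract(["a.pdf", "b.docx", "c.png"], max_workers=2))
```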

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →
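
The "multiple password attempts" behavior can be pictured as a simple loop over candidates. In the sketch below, `try_open` is a hypothetical callable that reports whether a password unlocks the document; it does not correspond to any Kreuzberg function, and the real options are described in the PDF configuration documentation.

```python
from typing import Callable, Iterable, Optional


def first_working_password(
    try_open: Callable[[str], bool], candidates: Iterable[str]
) -> Optional[str]:
    """Return the first candidate for which try_open succeeds, or None if all fail."""
    for password in candidates:
        if try_open(password):
            return password
    return None


# Demo with a fake document that only opens with "s3cret".
found = first_working_password(lambda p: p == "s3cret", ["guess", "s3cret", "other"])
print(found)
```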

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.

Release History

Version | Changes | Urgency | Date
v4.9.9 | **Full Changelog**: https://github.com/kreuzberg-dev/kreuzberg/commits/v4.9.9 | High | 6/5/2026
v4.9.8 | LTS patch release. Four targeted bug fixes plus dependency pinning so the branch builds against current crates.io releases. ### Fixed - **#934**: RTF hex byte escapes now honor `\ansicpgNNNN`, so CP1251 Cyrillic byte runs decode as readable text instead of Windows-1252 mojibake. - **#937**: `ExtractionConfig(cancel_token=…)` raised `TypeError: unexpected keyword argument 'cancel_token'` from Python despite the type stub declaring the kwarg. The `#[pyo3(signature = …)]` on `ExtractionConfig::_ | High | 5/17/2026
v4.9.7 | **Full Changelog**: https://github.com/kreuzberg-dev/kreuzberg/compare/v4.9.6...v4.9.7 | High | 5/8/2026
v4.9.6 | **Full Changelog**: https://github.com/kreuzberg-dev/kreuzberg/compare/v4.9.5...v4.9.6 | High | 5/8/2026
v4.9.5 | ## Fixed - **#790**: Fix GPU acceleration — kreuzberg now bundles CPU-only ONNX Runtime by default (zero-config). When a GPU execution provider (`cuda`, `tensorrt`, `coreml`) is explicitly requested via `AccelerationConfig` but unavailable, kreuzberg returns an error with setup instructions instead of silently falling back to CPU. `Auto` mode gracefully falls back to CPU with an info log. For GPU support, set `ORT_DYLIB_PATH` to a GPU-enabled ONNX Runtime. - **#791**: Fix DOCX OCR extraction — | High | 4/23/2026
v4.9.3 | See [CHANGELOG.md](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#493---2026-04-22) for full details. | High | 4/22/2026
v4.9.2 | ## Fixed - Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors — cancellation was silently ignored in WASM builds - Propagate `Cancelled` error code (9) to all bindings — Go, C FFI, Python, TypeScript, Java, C#, and C API docs now include the new code - Fix PHP e2e embed tests calling instance methods statically — use procedural `\Kreuzberg\embed()` functions - Fix TypeScript e2e embed tests using wrong field names (`type`/`nam | High | 4/19/2026
v4.9.1 | ## Fixed - **#754**: Preserve `_internal_bindings.pyi` type stub during wheel artifact cleanup — published wheels now include inline type information for the core binding module - Add missing `Default` impl for `PyCancellationToken` to satisfy clippy `new_without_default` lint - Improve download resilience for `eng.traineddata` in build script — increase retries from 3 to 5, add fallback URL via `raw.githubusercontent.com`, and increase timeout to 300s - Increase Task installer retry resilience | High | 4/19/2026
v4.9.0 | ## What's Changed * Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in https://github.com/kreuzberg-dev/kreuzberg/pull/701 * chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/698 * chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/711 * fix: remove duplicate output_format key and fix numeric types in OCR metadat | High | 4/18/2026
v4.8.6 | ## What's Changed * Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in https://github.com/kreuzberg-dev/kreuzberg/pull/701 * chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/698 * chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in https://github.com/kreuzberg-dev/kreuzberg/pull/711 * fix: remove duplicate output_format key and fix numeric types in OCR metadat | High | 4/17/2026
v4.8.5 | ## What's Changed ### Added - **LLM usage tracking** — new `llm_usage` field on `ExtractionResult` captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Exposed across all 12 bindings. ### Fixed - **Markdown chunker heading duplication** when `prepend_heading_context` is enabled (#701) - **Helm chart icon 404 on Artifact Hub** — `.png` → `.svg` - **Python wheel manylinux compliance** — bumped to `ma | High | 4/14/2026
v4.8.4 | ## What's Changed ### Added - **Helm chart for Kubernetes deployment** — minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695) - **Helm lint and kubeconform pre-commit hooks** — added `helm lint --strict` and `kubeconform` (k8s 1.28.0 schema validation) to pre-commit and CI pipeline. - **Helm chart publish workflow** — new `publish-helm.yaml` GitHub Actions workflow pushes versioned char | High | 4/13/2026
v4.8.2 | ## Added - **`HtmlOutputConfig` typed in all bindings** — `html_output` config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core. ## Fixed - **PDF: legitimate repeated content stripped during page merging regardless of `strip_repeating_text` flag** — `deduplicate_paragraphs()` runs unconditionally, stripping brand names and other legitimately repeated | High | 4/10/2026
benchmark-run-24180384454 | Comparative benchmark results from workflow run [24180384454](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/24180384454). **Commit:** 2e10062f4424621cae8bd7844377c70d223e1724 **Date:** 2026-04-09 | Medium | 4/9/2026
v4.8.1 | ### Added - **Styled HTML output** — New `HtmlOutputConfig` on `ExtractionConfig` with 5 built-in themes (`default`, `github`, `dark`, `light`, `unstyled`), semantic `kb-*` CSS class hooks on every structural element, CSS custom properties (`--kb-*`), custom CSS injection (inline or file), and configurable class prefix. The existing `Html` output format is upgraded in-place when `html_output` is set (#633, #665) - 5 new CLI flags: `--html-theme`, `--html-css`, `--html-css-file`, `--html-class-p | Medium | 4/9/2026
benchmark-run-24136316825 | Comparative benchmark results from workflow run [24136316825](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/24136316825). **Commit:** cc69ae69bfda402b016b90fcfd84ccfdbf816c8d **Date:** 2026-04-08 | Medium | 4/8/2026
benchmark-run-24118817847 | Comparative benchmark results from workflow run [24118817847](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/24118817847). **Commit:** c6a1aadd63e6c685ba06b998e6c8c98d83b13c16 **Date:** 2026-04-08 | Medium | 4/8/2026
v4.8.0 | ## What's Changed * fix: correct HWP tag constants and control character handling by @nuri-yoo in https://github.com/kreuzberg-dev/kreuzberg/pull/659 * feat: standalone text embedding API (#599) by @kh3rld in https://github.com/kreuzberg-dev/kreuzberg/pull/614 * feat: integrate liter-llm for VLM OCR, embeddings, and structured extraction by @Goldziher in https://github.com/kreuzberg-dev/kreuzberg/pull/662 ## New Contributors * @nuri-yoo made their first contribution in https://github.com/kreuzb | High | 4/8/2026
v4.7.4 | ### Added - Re-added `--layout` boolean CLI flag for easy layout detection enablement (use `--layout` to enable with model defaults, `--layout false` to explicitly disable) - arXiv watermark/sidebar noise filtering for academic PDFs — strips LaTeX sidebar identifiers from extracted text - Second-tier cross-page repeating text detection — catches conference headers and journal running titles that repeat on >70% of pages but appear outside the margin zone - Figure/picture text suppression — text | Medium | 4/6/2026
v4.7.3 | ## Fixed - **Archive extraction SIGBUS crash on macOS ARM64** — ZIP, 7Z, TAR, and GZIP archive extraction crashed with SIGBUS (signal 10) in release builds due to miscompilation of unsafe code in `sevenz-rust2` and `zip` crates under `opt-level=3`. Reduced optimization level to 2 for these crates. This also fixes Elixir, R, Go, and C benchmark crashes when processing archive files. - **Native-text PDF extraction fails when OCR backend unavailable** (#646) — PDFs with extractable native text har | Medium | 4/5/2026
v4.7.2 | ## What's Changed ### Added - **E2E generator published mode** — Generate standalone test apps against published registry versions for all 12 language bindings ### Changed - **Global model cache** (#641) — Models now download to platform-appropriate global cache directory instead of per-directory `.kreuzberg/` folders ### Fixed - **Leptonica DPI crash** (#606) — Images with 0 DPI caused C++ exception abort during preprocessing. Now validates and fixes DPI to 72 before preprocessing. Also disa | Medium | 4/4/2026
v4.7.1 | ## [4.7.1] - 2026-04-03 ### Added - **Tree-sitter grammar management CLI** — New `kreuzberg tree-sitter` subcommand with `download`, `list`, `cache-dir`, and `clean` sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (`--groups web,systems,scripting`), or all (`--all`). Reads `[tree_sitter]` config from `kreuzberg.toml` with `--from-config`. - **Tree-sitter grammar management API** — New REST endpoints: `POST /grammars/download`, `GET /grammars/ | Medium | 4/4/2026
v4.7.0 | ## Highlights ### Code Intelligence — 248 Languages New `code_intelligence` field on `ExtractionResult` via [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack). Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics. Semantic code chunking that respects scope boundaries. Configure with `CodeContentMode`: `chunks`, `raw`, or `structure`. ### Benchmark-Driven Extraction Quality 350+ test documents across 23 formats with Structural F | Medium | 4/3/2026
v4.6.3 | ## [4.6.3] - 2026-03-27 ### Added - **Tower service layer** (`service` module): Composable `ExtractionService` implementing `tower::Service` with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New `tower-service` feature flag, auto-enabled by `api` and `mcp`. `ExtractionServiceBuilder` provides ergonomic layer composition. - **Semantic OpenTelemetry conventions** (`telemetry` module): Formal `kreuzberg.*` attribute namespace with 30+ span attributes, metric name | Medium | 3/27/2026
v4.6.2 | ## [4.6.2] - 2026-03-26 ### Added - **PDF page rendering API** (#583): New `render_pdf_page` function and `PdfPageIterator` for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call. ### Fixed - **Table recognition coordinate mismatch on scanned PDFs** (#582): Layout detection bboxes (640x640 model spac | Medium | 3/26/2026
v4.6.1 | ## Fixes - **OCR memory usage reduced 60-78%**: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS. - **PDF control character encoding artifacts**: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other c | Medium | 3/25/2026
v4.6.0 | ## v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes ### Added - **Recursive archive extraction**: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own `ExtractionResult`. New `ArchiveEntry` type and `max_archive_depth` config. - **YAML/JSON section chunker**: New `ChunkerType::Yaml` with full hierarchy paths and auto-inference from metadata. - **Unified DocumentStructure**: Extended with 7 new node types, 4 annotation kinds, attributes bag. Al | Medium | 3/25/2026
v4.5.4 | ## [4.5.4] - 2026-03-23 ### Fixed - **PDF image extraction panic on mismatched buffer lengths** (#552): Replaced `assert!` with graceful error handling. Malformed PDF images are now skipped instead of panicking. Regression from v4.5.0. - **`pdf` feature compilation without `layout-detection`** (#550): `config.layout` reference gated behind `#[cfg(feature = "layout-detection")]`. - **WASM module resolution in Supabase/Deno edge functions** (#551): Added explicit `package.json` exports and Deno | Medium | 3/23/2026
v4.5.3 | ## What's New ### SLANeXT Table Structure Recognition Alternative table structure backends alongside TATR. New `table_model` field on `LayoutDetectionConfig` selects the backend: | Model | Config Value | Size | Best For | |-------|-------------|------|----------| | TATR | `"tatr"` (default) | 30 MB | General-purpose, consistent results | | SLANeXT Wired | `"slanet_wired"` | 365 MB | Bordered/gridlined tables | | SLANeXT Wireless | `"slanet_wireless"` | 365 MB | Borderless tables | | SLANeXT A | Low | 3/22/2026
v4.5.2 | ## Fixed - **PDF word splitting in extracted text**: Pdfium's text extraction inserted spurious spaces mid-word (e.g. `"s hall a b e active"` instead of `"shall be active"`). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (`font_size × 0.33` threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact. - **Markdown undersco | Low | 3/21/2026
v4.5.1 | See [CHANGELOG.md](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#451---2026-03-20) for release notes. | Low | 3/21/2026
benchmark-run-23359982805 | Comparative benchmark results from workflow run [23359982805](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/23359982805). **Commit:** d0624792f343e8dab8c7468dcc2a2c4930741157 **Date:** 2026-03-21 | Low | 3/21/2026
v4.5.0 | See [CHANGELOG.md](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#450---2026-03-20) for full release notes. | Low | 3/20/2026
v4.4.6 | ## Added - **dBASE (.dbf) format support**: Extract table data from dBASE files as markdown tables with field type support. - **Hangul Word Processor (.hwp/.hwpx) support**: Extract text content from HWP 5.0 documents (standard Korean document format). - **Office template/macro format variants**: Added support for `.docm`, `.dotx`, `.dotm`, `.dot` (Word), `.potx`, `.potm`, `.pot` (PowerPoint), `.xltx`, `.xlt` (Excel) formats. ## Fixed - **DOCX image placeholders missing (#484)**: Extracting ` | Low | 3/13/2026
benchmark-run-23042674034 | Comparative benchmark results from workflow run [23042674034](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/23042674034). **Commit:** 31ae19c77ebfd0cd6e809078c28a7cfb2388edeb **Date:** 2026-03-13 | Low | 3/13/2026
v4.4.4 | ## Fixed - **CLI test app fixes**: Fixed broken symlinks in CLI test documents, corrected `--format` to `--output-format` flag usage, fixed multipart form field name (`file=` → `files=`) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection. - **Publish idempotency check scripts**: Fixed `check_nuget.sh` and `check-nuget-version.sh` using bash 4+ `${var,,}` syntax incompatible with bash 3.x. Fixed `check_pypi.sh` and `check_packagist.sh` wri | Low | 3/7/2026
v4.4.3 | ## Added - **PDF image placeholder toggle**: New `inject_placeholders` option on `ImageExtractionConfig` (default: `true`). Set to `false` to extract images as data without injecting `![image](...)` references into the markdown content. ## Fixed - **Token reduction not applied** ([#436](https://github.com/kreuzberg-dev/kreuzberg/issues/436)): Token reduction config was accepted but never executed during extraction. The pipeline now applies `reduce_tokens()` when `token_reduction.mode` is conf | Low | 3/6/2026
v4.4.2 | ## Fixed - **E2E element type assertions**: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C# - **Ruby PDF annotation extraction**: Fixed `PdfAnnotation` and `PdfAnnotationBoundingBox` autoload and bounding box field name mismatch - **WASM OCR blocking event loop**: OCR now runs in a worker thread, keeping the main thread responsive - **JPEG 2000 OCR decode failure**: Shared `load_image_for_ocr()` helper with `hayro-jpeg2000`/` | Low | 3/4/2026
benchmark-run-22610076103 | Comparative benchmark results from workflow run [22610076103](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22610076103). **Commit:** 8b7e35c641e3a918146d73e4b954c6d3a5cb6bdf **Date:** 2026-03-03 | Low | 3/3/2026
v4.4.1 | ## Added - **OCR table inlining into markdown content** (#421): When `output_format = Markdown` and OCR detects tables, the markdown pipe tables are now inlined into `result.content` at their correct vertical positions instead of only appearing in `result.tables`. Adds `OcrTableBoundingBox` to `OcrTable` for spatial positioning. Sets `metadata.output_format = "markdown"` to signal pre-formatted content and skip re-conversion. - **OCR table bounding boxes**: OCR-detected tables now include bound | Low | 3/1/2026
benchmark-run-22521432636 | Comparative benchmark results from workflow run [22521432636](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22521432636). **Commit:** 978102f360273632db02791b07f4106af9af7408 **Date:** 2026-03-01 | Low | 3/1/2026
v4.4.0 | ## Added - **R language bindings** -- Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets. - **PHP async extraction**: Non-blocking extraction via `DeferredResult` pattern with Tokio thread pool. Includes `extractFileAsync()`, `extractBytesAsync()`, `batchExtractFilesAsync()`, `batchExtractBytesAsync()` across O | Low | 2/28/2026
benchmark-run-22450525715 | Comparative benchmark results from workflow run [22450525715](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22450525715). **Commit:** 310e9d5cd64ede2e12516d34a5471484d78fb88b **Date:** 2026-02-26 | Low | 2/26/2026
benchmark-run-22394565362 | Comparative benchmark results from workflow run [22394565362](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22394565362). **Commit:** 47a391a555a76733d93507941671d83347de9bf9 **Date:** 2026-02-25 | Low | 2/25/2026
benchmark-run-22314146367 | Comparative benchmark results from workflow run [22314146367](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22314146367). **Commit:** 1717eac8d1066271bbe3f3bc4d2371378a954299 **Date:** 2026-02-23 | Low | 2/23/2026
benchmark-run-22297171341 | Comparative benchmark results from workflow run [22297171341](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22297171341). **Commit:** c72b316aff6d2c052409b0c398735f6920ab3231 **Date:** 2026-02-23 **Note:** Release created locally after CI aggregation step failed due to missing `gh` CLI on runner. | Low | 2/23/2026
v4.3.8 | ## Added - **MDX format support** (`mdx` feature): Extract text from `.mdx` files, stripping JSX/import/export syntax while preserving markdown content, frontmatter, tables, and code fences - **List supported formats API** (#404): Query all supported file extensions and MIME types via `list_supported_formats()` in Rust, `GET /formats` REST endpoint, `list_formats` MCP tool, or `kreuzberg formats` CLI subcommand ## Fixed - **PDF ligature corruption in CM/Type1 fonts**: Added contextual ligatur | Low | 2/22/2026
v4.3.7 | ## Added - NFC unicode normalization applied to all extraction outputs, ensuring consistent representation of composed characters across all backends (gated behind `quality` feature) - Configurable PDF page margin fractions (`top_margin_fraction`, `bottom_margin_fraction`) in `PdfConfig` - PDF annotation extraction with new `PdfAnnotation` type supporting `Text`, `Highlight`, `Link`, `Stamp`, `Underline`, `StrikeOut`, and `Other` annotation types - `extract_annotations` configuration option in | Low | 2/20/2026
v4.3.6 | ## Added - **Pdfium `PdfParagraph` object-based extraction**: New markdown extraction path using pdfium's `PdfParagraph::from_objects()` for spatial text grouping, replacing raw page-object iteration. Provides accurate per-line baseline positions via `into_lines()` and styled text fragments with bold/italic/monospace detection. - **Structure tree and content marks API in pdfium-render**: New `ExtractedBlock`, `ContentRole`, and `PdfParagraph` types for tagged PDF semantic extraction. Structure | Low | 2/19/2026
v4.3.5 | ## What's New ### Bounding Box Support - **`bounding_box` on `Table` and `ExtractedImage`**: Spatial position data (`BoundingBox` with `x0, y0, x1, y1`) now available on both types across all 10 language bindings (Rust, Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM). - **Table bounding boxes computed from PDF character positions**: During PDF extraction, table bounding boxes are calculated from constituent character positions for precise spatial layout. ### Inline Markdown Embeddin | Low | 2/17/2026
benchmark-run-22075206511 | Comparative benchmark results from workflow run [22075206511](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22075206511). **Commit:** a117bb06cad445045639e2e48694476eab38b2a5 **Date:** 2026-02-17 | Low | 2/17/2026
v4.3.4 | ## What's New in v4.3.4 ### Fixed - **Node.js keyword extraction fields missing**: The TypeScript `convertResult()` type converter was silently dropping `extractedKeywords`, `qualityScore`, and `processingWarnings` from NAPI results because it only copied explicitly listed fields. Added the missing field conversions. Also renamed the mismatched `keywords` property to `extractedKeywords` in the TypeScript types to match the NAPI binding definition. - **Windows PHP CI build failure (`crc::Table` | Low | 2/16/2026
benchmark-run-22020443124 | Comparative benchmark results from workflow run [22020443124](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/22020443124). **Commit:** 865c45aec930fedfee97507fb9a0bc8b4d27a0a8 **Date:** 2026-02-14 | Low | 2/14/2026
v4.3.3 | ## What's New in v4.3.3 ### PaddleOCR Multi-Language Support (#388) - **106+ language support via 12 script families**: PaddleOCR recognition models now cover english, chinese (simplified+traditional+japanese), latin, korean, east slavic (cyrillic), thai, greek, arabic, devanagari, tamil, telugu, and kannada script families. - **Per-family recognition model architecture**: Shared detection/classification models with per-family recognition models and dictionaries, downloaded on demand from Huggi | Low | 2/14/2026
benchmark-run-21996566749 | Comparative benchmark results from workflow run [21996566749](https://github.com/kreuzberg-dev/kreuzberg/actions/runs/21996566749). **Commit:** 9ee7e78ac6a57ef32bc1467dfe2755b1ab693dbf **Date:** 2026-02-13 | Low | 2/13/2026
v4.3.2 | ## Fixed ### PHP 8.4 Requirement Update - **Updated PHP requirement to 8.4+**: All PHP composer.json files, CI workflows, and documentation now require PHP 8.4+ to support PHPUnit 13.0. This fixes CI validation and PHP workflow failures caused by PHPUnit 13.0 requiring PHP 8.4.1+. ### Elixir Publishing Workflow - **Fixed macOS ARM64 build timeout**: Increased timeout from 180 to 300 minutes (5 hours) for macOS ARM64 Elixir native library builds. The previous timeout caused incomplete builds an | Low | 2/13/2026
