kreuzberg

Description

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via the CLI, REST API, or MCP server.

README

Kreuzberg

Extract text, metadata, and code intelligence from 91+ file formats and 248 programming languages at native speeds without needing a GPU.

Key Features

  • Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 248 programming languages via tree-sitter. Results in ExtractionResult.code_intelligence with semantic chunking
  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
  • 91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 146 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through liter-llm
  • OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (146 vision model providers including local engines), extensible via plugin API
  • High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
  • GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
  • HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
  • Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Live Demo | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
  • Elixir – Hex package, OTP integration, concurrent processing
  • R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs
  • C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for Apple Silicon (ARM64) on macOS; see the Platform Support table for Windows coverage.

Platform Support

Precompiled-binary availability by language binding and platform:

Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64
Python | ✅ | ✅ | ✅ | ✅
Node.js | ✅ | ✅ | ✅ | ✅
WASM | ✅ | ✅ | ✅ | ✅
Ruby | ✅ | ✅ | ✅ | -
R | ✅ | ✅ | ✅ | ✅
Elixir | ✅ | ✅ | ✅ | ✅
Go | ✅ | ✅ | ✅ | ✅
Java | ✅ | ✅ | ✅ | ✅
C# | ✅ | ✅ | ✅ | ✅
PHP | ✅ | ✅ | ✅ | ✅
Rust | ✅ | ✅ | ✅ | ✅
C (FFI) | ✅ | ✅ | ✅ | ✅
CLI | ✅ | ✅ | ✅ | ✅
Docker | ✅ | ✅ | ✅ | -

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.24 or later.
  2. Use embeddings in your code; see the Embeddings Guide.

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category | Formats | Capabilities
Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt, .pages | Full text, tables, lists, images, metadata, styles
Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbers | Sheet data, formulas, cell metadata, charts
Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .key | Slides, speaker notes, images, metadata
PDF | .pdf | Text, tables, images, metadata, OCR support
eBooks | .epub, .fb2 | Chapters, metadata, embedded resources
Database | .dbf | Table data extraction, field type support
Hangul | .hwp, .hwpx | Korean document format, text extraction

Images (OCR-Enabled)

Category | Formats | Features
Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space
Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector | .svg | DOM parsing, embedded text, graphics metadata

Web & Data

Category | Formats | Features
Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation
Text & Markdown | .txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtf | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text

Email & Archives

Category | Formats | Features
Email | .eml, .msg | Headers, body (HTML/plain), attachments, UTF-16 support
Archives | .zip, .tar, .tgz, .gz, .7z | Recursive extraction, nested archives, metadata

Academic & Scientific

Category | Formats | Features
Citations | .bib, .ris, .nbib, .enw, .csl | BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific | .tex, .latex, .typ, .typst, .jats, .ipynb | LaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing | .fb2, .docbook, .dbk, .opml | FictionBook, DocBook XML, OPML outlines
Documentation | .pod, .mdoc, .troff | Perl POD, man pages, troff

Complete Format Reference →

Code Intelligence (248 Languages)

Feature | Description
Structure Extraction | Functions, classes, methods, structs, interfaces, enums
Import/Export Analysis | Module dependencies, re-exports, wildcard imports
Symbol Extraction | Variables, constants, type aliases, properties
Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
Diagnostics | Parse errors with line/column positions
Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets

Powered by tree-sitter-language-pack with dynamic grammar download. See TSLP documentation for the full language list.
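
To make "semantic boundaries" concrete, here is a minimal sketch that splits Python source at top-level function and class definitions. It uses Python's standard-library ast module as a stand-in for tree-sitter, so it handles Python input only and is not Kreuzberg's implementation or API. The name chunk_by_definitions is hypothetical.

```python
import ast


def chunk_by_definitions(source: str) -> list:
    """Split Python source into (name, text) chunks, one per top-level def/class.

    Illustrative only: uses the stdlib `ast` module, not tree-sitter.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append((node.name, text))
    return chunks


SAMPLE = '''import os

def greet(name):
    """Return a greeting."""
    return f"Hello, {name}"

class Counter:
    def __init__(self):
        self.n = 0
'''

for name, text in chunk_by_definitions(SAMPLE):
    print(f"--- {name} ({len(text.splitlines())} lines)")
```

Each chunk starts and ends on a definition boundary, so a function body is never cut in half the way a fixed byte-offset split could cut it.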

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →
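
As a rough illustration of concurrent processing with configurable parallelism, the sketch below fans a list of files out to a thread pool. It uses only the Python standard library. extract_text is a hypothetical placeholder that reads plain text, not a Kreuzberg function, and extract_batch and max_workers are names chosen for this example.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path: Path) -> str:
    """Hypothetical stand-in for a real extraction call; it only reads UTF-8 text."""
    return path.read_text(encoding="utf-8")


def extract_batch(paths: list, max_workers: int = 4) -> dict:
    """Extract many files concurrently; one failing file does not abort the rest."""

    def safe_extract(path: Path):
        try:
            return extract_text(path)
        except Exception as exc:  # keep the error as that file's result
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(safe_extract, paths)))


with tempfile.TemporaryDirectory() as tmp:
    files = []
    for i in range(3):
        p = Path(tmp) / f"doc{i}.txt"
        p.write_text(f"document {i}", encoding="utf-8")
        files.append(p)
    files.append(Path(tmp) / "missing.txt")  # fails, but the batch continues

    for path, result in extract_batch(files, max_workers=2).items():
        status = "error" if isinstance(result, Exception) else "ok"
        print(path.name, status)
```

Returning the exception as the per-file result is one possible design; it keeps a single unreadable document from discarding the work already done on the others.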

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →
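
The idea of multiple password attempts can be sketched generically: try each candidate in order and stop at the first one that works. The code below is an assumption-laden illustration, not Kreuzberg's API. WrongPassword, open_with_passwords, and fake_opener are hypothetical names, and no real PDF decryption takes place.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class WrongPassword(Exception):
    """Hypothetical error raised by an opener when a password is rejected."""


def open_with_passwords(opener: Callable[[Optional[str]], T], candidates: Iterable[str]) -> T:
    """Try no password first, then each candidate in order; return the first success."""
    for password in [None, *candidates]:
        try:
            return opener(password)
        except WrongPassword:
            continue  # fall back to the next candidate
    raise WrongPassword("no candidate password was accepted")


def fake_opener(password: Optional[str]) -> str:
    """Fake document opener for the demo: accepts only the password 's3cret'."""
    if password != "s3cret":
        raise WrongPassword(str(password))
    return "decrypted document"


print(open_with_passwords(fake_opener, ["guess", "s3cret"]))
```

Trying an empty password first means unencrypted files go through the same code path as encrypted ones.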

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.

Release History

Version | Changes | Urgency | Date
v4.9.2 | Fixed: cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors (cancellation was silently ignored in WASM builds); propagate `Cancelled` error code (9) to all bindings (Go, C FFI, Python, TypeScript, Java, C#, and C API docs now include the new code); fix PHP e2e embed tests calling instance methods statically (use procedural `\Kreuzberg\embed()` functions); fix TypeScript e2e embed tests using wrong field names (`type`/`nam... | High | 4/19/2026
v4.9.1 | Fixed: #754: preserve `_internal_bindings.pyi` type stub during wheel artifact cleanup (published wheels now include inline type information for the core binding module); add missing `Default` impl for `PyCancellationToken` to satisfy clippy `new_without_default` lint; improve download resilience for `eng.traineddata` in build script (increase retries from 3 to 5, add fallback URL via `raw.githubusercontent.com`, and increase timeout to 300s); increase Task installer retry resilience | High | 4/19/2026
v4.8.5 | Added: LLM usage tracking: new `llm_usage` field on `ExtractionResult` captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Exposed across all 12 bindings. Fixed: Markdown chunker heading duplication when `prepend_heading_context` is enabled (#701); Helm chart icon 404 on Artifact Hub (`.png` → `.svg`); Python wheel manylinux compliance (bumped to `ma... | High | 4/14/2026
v4.8.4 | Added: Helm chart for Kubernetes deployment: minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695); Helm lint and kubeconform pre-commit hooks: added `helm lint --strict` and `kubeconform` (k8s 1.28.0 schema validation) to pre-commit and CI pipeline; Helm chart publish workflow: new `publish-helm.yaml` GitHub Actions workflow pushes versioned char... | High | 4/13/2026
v4.8.2 | Added: `HtmlOutputConfig` typed in all bindings: `html_output` config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core. Fixed: PDF: legitimate repeated content stripped during page merging regardless of `strip_repeating_text` flag: `deduplicate_paragraphs()` runs unconditionally, stripping brand names and other legitimately repeated ... | High | 4/10/2026
v4.8.0 | What's Changed: fix: correct HWP tag constants and control character handling by @nuri-yoo in https://github.com/kreuzberg-dev/kreuzberg/pull/659; feat: standalone text embedding API (#599) by @kh3rld in https://github.com/kreuzberg-dev/kreuzberg/pull/614; feat: integrate liter-llm for VLM OCR, embeddings, and structured extraction by @Goldziher in https://github.com/kreuzberg-dev/kreuzberg/pull/662. New Contributors: @nuri-yoo made their first contribution in https://github.com/kreuzb... | High | 4/8/2026
