crawl-mcp

Crawl4AI MCP Server: Extract content from web pages, PDFs, Office docs, and YouTube videos with AI-powered summarization. 19 tools, token reduction, production-ready.

Crawl-MCP: Unofficial MCP Server for crawl4ai

⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

🌟 Key Features

  • 🔍 Google Search Integration: 7 optimized search genres with official Google operators
  • 🔍 Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
  • 🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
  • 🤖 AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
  • 🎬 YouTube Integration: Extract video transcripts and summaries without API keys
  • ⚡ Production Ready: 19 specialized tools with comprehensive error handling

🚀 Quick Start

Prerequisites (Required First)

  • Python 3.11 or later (FastMCP requires Python 3.11+)

Install system dependencies for Playwright:

Ubuntu 24.04 LTS (Manual Required):

# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip

python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
# sudo usually resets PATH, so call the venv's playwright by its full path
sudo "$(command -v playwright)" install-deps

Other Linux/macOS:

sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):

scripts/prepare_for_uvx_playwright.ps1

Installation

UVX (Recommended - Easiest):

# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Docker (Production-Ready):

# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp

Docker Features:

  • 🔧 Multi-Browser Support: Chromium, Firefox, Webkit headless browsers
  • 🐧 Google Chrome: Additional Chrome Stable for compatibility
  • ⚡ Optimized Performance: Pre-configured browser flags for Docker
  • 🔒 Security: Non-root user execution
  • 📦 Complete Dependencies: All required libraries included

Claude Desktop Setup

UVX Installation: Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}

Docker HTTP Mode:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}

For Japanese interface:

"env": {
  "CRAWL4AI_LANG": "ja"
}

📖 Documentation

Topic                    Description
Installation Guide       Complete installation instructions for all platforms
API Reference            Full tool documentation and usage examples
Configuration Examples   Platform-specific setup configurations
HTTP Integration         HTTP API access and integration methods
Advanced Usage           Power user techniques and workflows
Development Guide        Contributing and development setup

Language-Specific Documentation

  • English: docs/ directory
  • Japanese (日本語): docs/ja/ directory

🛠️ Tool Overview

Web Crawling (3)

  • crawl_url - Extract web page content with JavaScript support
  • deep_crawl_site - Crawl multiple pages from a site with configurable depth
  • crawl_url_with_fallback - Crawl with fallback strategies for anti-bot sites

Data Extraction (3)

  • intelligent_extract - Extract specific data from web pages using LLM
  • extract_entities - Extract entities (emails, phones, etc.) from web pages
  • extract_structured_data - Extract structured data using CSS selectors or LLM

YouTube (4)

  • extract_youtube_transcript - Extract YouTube transcripts with timestamps
  • batch_extract_youtube_transcripts - Extract transcripts from multiple YouTube videos (max 3)
  • get_youtube_video_info - Get YouTube video metadata and transcript availability
  • extract_youtube_comments - Extract YouTube video comments with pagination

Search (4)

  • search_google - Search Google with genre filtering
  • batch_search_google - Perform multiple Google searches (max 3)
  • search_and_crawl - Search Google and crawl top results
  • get_search_genres - Get available search genres

File Processing (3)

  • process_file - Convert PDF, Word, Excel, PowerPoint, ZIP to markdown
  • get_supported_file_formats - Get supported file formats and capabilities
  • enhanced_process_large_content - Process large content with chunking and BM25 filtering

Batch Operations (2)

  • batch_crawl - Crawl multiple URLs with fallback (max 3 URLs)
  • multi_url_crawl - Multi-URL crawl with pattern-based config (max 5 URL patterns)
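
Because batch_crawl accepts at most 3 URLs per call, a longer URL list has to be split on the client side. The sketch below shows one way to do that; chunk_urls is a hypothetical helper for illustration and is not part of crawl-mcp.

```python
from typing import Iterator, List


def chunk_urls(urls: List[str], max_per_call: int = 3) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each at most `max_per_call` long."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    for start in range(0, len(urls), max_per_call):
        yield urls[start:start + max_per_call]


# Placeholder URLs; each inner list would become one batch_crawl call.
urls = [f"https://example.com/page{i}" for i in range(1, 8)]
batches = list(chunk_urls(urls))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

The same helper can be reused with max_per_call=5 for multi_url_crawl pattern lists.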

💾 Persist Large Results to Disk (token-saver)

All information-gathering tools accept an optional output_path parameter that writes the full fetched content straight to disk and returns a slim metadata-only response. This lets an LLM fetch huge pages, long YouTube transcripts, or whole batches without blowing its context budget; it reads from the saved file only when needed.

How it works:

  • Single-file tools (e.g. crawl_url, extract_youtube_transcript) write one .md (or .json for JSON-kind tools). Pass an absolute file path; the extension is auto-added if omitted. An existing regular file at that path is rejected unless overwrite=true.
  • Batch tools (batch_crawl, multi_url_crawl, deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) expect an absolute directory path and write one .md per URL plus index.json. Any non-existent path is treated as a directory and created, including names containing dots such as /tmp/run.v1. If the path already exists as a regular file, the call is rejected. batch_crawl / multi_url_crawl keep their list return shape and embed an output_file key on each success item.
  • Request-dict tools (search_google, batch_search_google, search_and_crawl, batch_extract_youtube_transcripts) read the persistence keys directly from their request dict.
  • Common parameters: output_path (absolute; None or "" skips persistence), include_content_in_response (default false; when true, content is included in the response too, still subject to any content_limit/content_offset/max_content_per_page slicing), overwrite (default false).
  • Writes are atomic per file (temp file + os.replace); parent directories are auto-created; the full unsliced payload is persisted before any slicing or tool-internal truncation so the on-disk copy is always complete even when the response is sliced.
  • Batch dict tools (deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) skip per-item persistence for items that report success=false; these still appear in index.json with file: null so callers can reason about the attempt list.
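
The single-file rules above (absolute path, auto-added extension, overwrite guard, parent creation, temp file plus os.replace) can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's actual code: persist_text is a hypothetical name, and treating "no suffix" as "extension omitted" is a guess at how the check might work.

```python
import os
import tempfile
from pathlib import Path


def persist_text(output_path: str, text: str, overwrite: bool = False,
                 default_ext: str = ".md") -> Path:
    """Write `text` to `output_path` atomically and return the final path."""
    path = Path(output_path)
    if not path.is_absolute():
        raise ValueError("output_path must be absolute")
    if path.suffix == "":
        # Assumption: a path without a suffix means the extension was omitted.
        path = path.with_suffix(default_ext)
    if path.is_file() and not overwrite:
        raise FileExistsError(f"{path} exists; pass overwrite=True to replace it")
    path.parent.mkdir(parents=True, exist_ok=True)  # auto-create parent directories
    # Write to a temp file in the same directory, then swap it into place so
    # readers never observe a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return path


base = Path(tempfile.mkdtemp())
saved = persist_text(str(base / "out" / "article"), "# Title\n")
print(saved.name)  # article.md
```

Creating the temp file in the destination directory keeps both paths on the same filesystem, which is what lets os.replace act as an atomic rename.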

Markdown single-file example:

{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/long-article",
    "output_path": "/tmp/crawl_out/article.md"
  }
}

JSON structured extraction (extension auto-added):

{
  "tool": "extract_structured_data",
  "arguments": {
    "url": "https://example.com/products",
    "extraction_type": "css",
    "css_selectors": {"price": ".price", "name": "h1"},
    "output_path": "/tmp/crawl_out/products"
  }
}

Batch directory mode:

{
  "tool": "batch_crawl",
  "arguments": {
    "urls": ["https://a.example", "https://b.example"],
    "output_path": "/tmp/crawl_out/batch_run1"
  }
}

Each persisted markdown file begins with a YAML frontmatter block containing url, title, fetched_at, and source_tool so the artifact is self-describing.
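
A consumer that wants the metadata without loading a YAML library could split the frontmatter by hand. The sketch below assumes the block is delimited by `---` lines and holds flat `key: value` pairs; split_frontmatter is a hypothetical helper, and the sample values (including the fetched_at format) are placeholders rather than real tool output.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Return (metadata, body); metadata is empty if no frontmatter is found."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the document body.
            return meta, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    return {}, text  # no closing delimiter: treat the whole text as content


# Placeholder file content for illustration only.
sample = """---
url: https://example.com/long-article
title: "Example"
fetched_at: 2026-01-01T00:00:00Z
source_tool: crawl_url
---
# Heading
"""
meta, body = split_frontmatter(sample)
print(meta["url"], meta["source_tool"])
```

Nested or multi-line YAML values are out of scope for this parser; a full YAML library would be the safer choice if the frontmatter ever grows beyond flat pairs.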

🎯 Common Use Cases

Content Research:

search_and_crawl → extract_structured_data → analysis

Documentation Mining:

deep_crawl_site → batch processing → extraction

Media Analysis:

extract_youtube_transcript → summarization workflow

Site Mapping:

batch_crawl → multi_url_crawl → comprehensive data

🚨 Quick Troubleshooting

Installation Issues:

  1. Re-run the setup scripts with proper privileges
  2. Try the development installation method
  3. Check that browser dependencies are installed

Performance Issues:

  • Use wait_for_js: true for JavaScript-heavy sites
  • Increase timeout for slow-loading pages
  • Use extract_structured_data for targeted extraction

Configuration Issues:

  • Check JSON syntax in claude_desktop_config.json
  • Verify file paths are absolute
  • Restart Claude Desktop after configuration changes
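
The JSON syntax check can be automated with the standard library. The sketch below parses a config string and looks for the crawl-mcp entry in the shape shown in the Claude Desktop Setup section; check_mcp_config is a hypothetical helper, and the specific checks are assumptions about what is worth verifying, not an official validator.

```python
import json
from typing import List


def check_mcp_config(raw: str, server_name: str = "crawl-mcp") -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    servers = config.get("mcpServers")
    if not isinstance(servers, dict):
        return ['missing "mcpServers" object']
    entry = servers.get(server_name)
    if not isinstance(entry, dict):
        return [f'no "{server_name}" entry under "mcpServers"']
    problems: List[str] = []
    # The documented configs use either "command" (stdio) or "baseUrl" (http).
    if "command" not in entry and "baseUrl" not in entry:
        problems.append('entry has neither "command" nor "baseUrl"')
    return problems


good = '{"mcpServers": {"crawl-mcp": {"transport": "stdio", "command": "uvx"}}}'
bad = '{"mcpServers": {"crawl-mcp": {"command": "uvx"},}}'  # trailing comma
print(check_mcp_config(good))  # []
print(check_mcp_config(bad))
```

In practice the raw string would come from reading claude_desktop_config.json; the location of that file varies by platform and is not covered here.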

🏗️ Project Structure

  • Original Library: crawl4ai by unclecode
  • MCP Wrapper: This repository (walksoda)
  • Implementation: Unofficial third-party integration

📄 License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.

🤝 Contributing

See our Development Guide for contribution guidelines and development setup instructions.

Release History

v0.3.3 (6/6/2026, urgency: High)

What's Changed

Bug Fixes:
  • Fix `AttributeError` from `.strip()` on None content fields returned by crawl4ai (#24, #25)
  • Harden `search_and_crawl` failed-page detection so None content no longer crashes it
  • Accept markdown-only pages instead of treating them as failures
  • Guard content truncation `len()` against None in both the main and fallback paths

Internal:
  • Resolve `__version__` dynamically so the reported version no longer drifts from the release

Contributors:
  • @sota

v0.3.2 (5/17/2026, urgency: High)

What's Changed

Bug Fixes:
  • Fix `enhanced_process_large_content` crash when `content` is None (read `markdown` field first)
  • Surface YouTube Restricted Mode as `success=True` with structured warning instead of cryptic error
  • Add version upper bound to markitdown dependency (`<0.2`)

Security:
  • Pin urllib3>=2.7.0 to resolve CVE-2026-44431 and CVE-2026-44432

Dependencies:
  • Bump markitdown to 0.1.5 with `[pdf]` extra (replaces separate pdfminer-six pin)
  • Bump crawl4ai to `>=0. […]

v0.3.1 (4/29/2026, urgency: High)

What's Changed

New Features:
  • Support local file processing via `file://` URIs and absolute paths
  • Add `is_file_uri`, `is_local_path`, `file_uri_to_local_path` validators
  • `process_file` tool now accepts local file paths in addition to URLs

Bug Fixes:
  • Normalize `CRAWL4AI_BROWSER_TYPE` env var with `strip().lower()`
  • Use `CRAWL4AI_BROWSER_TYPE` to override default browser list

Security:
  • Add minimum version constraint for litellm dependency (>=1.83.7)
  • Resolve known crit […]

v0.3.0 (4/12/2026, urgency: High)

Release v0.3.0 - Output Persistence and Reliability Improvements

Overview: This release adds a new output_path option for persisting tool results to disk. It also includes reliability fixes for CrawlResponse handling, batch execution, and pagination.

New Features, output_path Option:
  • New `output_path` parameter available across all MCP tools
  • Persist tool results as files to disk for downstream processing
  • Useful for integrating crawl results into automated pipelines

readOn […]

v0.2.0 (3/1/2026, urgency: Low)

Release v0.2.0 - YouTube Comments Tool and Codebase Modularization

Overview: This release adds a new YouTube comment extraction tool. It also includes a major codebase refactoring for better maintainability. Security, reliability, and test coverage are improved across the project.

New Features, extract_youtube_comments Tool:
  • Extract YouTube video comments without API key using youtube-comment-downloader
  • Pagination support via `comment_offset` parameter for retrieving large comme […]

v0.1.7 (1/12/2026, urgency: Low)

What's Changed

Bug Fixes:
  • Fix `extract_media=True` Pydantic validation error
  • Fix Docker build compatibility for Debian bookworm/trixie

New Features:
  • Add Ollama LLM support for web content summarization
  • Add Anthropic and Ollama LLM support for file processing
  • Restore batch tools with rate limits: `batch_crawl` (max 5 URLs), `multi_url_crawl` (max 5 URL configurations), `batch_search_google` (max 3 queries), `batch_extract_youtube_transcripts` (max 3 URLs)

Dep […]

v0.1.6 (10/18/2025, urgency: Low)

Release v0.1.6 - Token Optimization and MCP Interface Refinement

Overview: This release focuses on optimizing token usage for Claude Code MCP integration and refining the MCP tool interface by removing batch operations that provide limited value in sequential processing contexts.

Major Updates, Token Usage Optimization:
  • Increased token limit: Response token limit raised from 20000 to 25000 for all crawling tools
  • Markdown-only response: New `include_cleaned_html […]

v0.1.5 (9/28/2025, urgency: Low)

Overview: This maintenance release focuses on performance optimizations, code quality improvements, and dependency updates to enhance the overall stability and reliability of the crawl-mcp server.

Major Updates, Performance Enhancements:
  • Core server optimizations: Improved response handling and resource management
  • Web crawling efficiency: Enhanced crawling performance and reliability
  • Tool utilities refinement: Optimized utility functions for better performanc […]

v0.1.4 (9/21/2025, urgency: Low)

Release v0.1.4 - Enhanced Search Filtering with Date-Based Filtering

Overview: This release introduces enhanced search filtering capabilities with date-based filtering support, AI summarization improvements, and better API parameter standardization across all search tools.

Major Updates, Enhanced Search Filtering:
  • Date-based filtering: New `recent_days` parameter for time-sensitive searches
  • Improved search accuracy: Removed deprecated 'recent' search genre for […]

v0.1.2 (8/25/2025, urgency: Low)

Overview: This release includes significant improvements and fixes, including FastMCP version adjustment for stability, enhanced crawling features, and a critical config loading fix that ensures proper execution from any directory.

Major Updates, FastMCP Version Adjustment & Project Restructure:
  • Downgraded FastMCP from 2.x to FastMCP 2.11.0 for improved stability
  • Reorganized project structure for better maintainability
  • Enhanced tool registration and MCP protocol handli […]

v0.1.1 (8/16/2025, urgency: Low)

Crawl4AI MCP Server v0.1.1: Bug fix release resolving critical import errors and adding comprehensive testing framework.

Fixes:
  • Fixed `batch_crawl` tool import error
  • Fixed `extract_structured_data` tool import error
  • Synchronized main server and DXT package versions

New Features:
  • Comprehensive test suite with 100% success rate for tested tools
  • Four testing modes: quick, comprehensive, category, and interactive
  • Enhanced FastMCP client with better error handling

v0.1.0 (8/16/2025, urgency: Low)

Crawl4AI MCP Server v0.1.0: Initial stable release of the Crawl4AI MCP (Model Context Protocol) Server.

Features:
  • Web crawling with JavaScript support
  • YouTube transcript extraction
  • Google search integration
  • Document processing (PDF, Word, Excel, PowerPoint)
  • Intelligent content extraction
  • Batch processing capabilities
  • Enhanced crawling with fallback strategies

Installation:
  • UVX package: `crawl4ai-dxt-correct`
  • Direct installation from source
