
crawl-mcp

Crawl4AI MCP Server: extract content from web pages, PDFs, Office docs, and YouTube videos with AI-powered summarization. 19 tools, smart token reduction, production-ready.


README

Crawl-MCP: Unofficial MCP Server for crawl4ai

⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

🌟 Key Features

  • πŸ” Google Search Integration - 7 optimized search genres with Google official operators
  • πŸ” Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
  • 🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
  • πŸ€– AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
  • 🎬 YouTube Integration: Extract video transcripts and summaries without API keys
  • ⚑ Production Ready: 19 specialized tools with comprehensive error handling

🚀 Quick Start

Prerequisites (Required First)

  • Python 3.11 or later (FastMCP requires Python 3.11+)
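
A quick way to confirm your interpreter meets this floor before installing (a trivial standalone check, not part of the project itself):

```python
import sys

# FastMCP requires Python 3.11 or later; check before installing.
def check_python(minimum=(3, 11), version=None):
    """Return True when `version` (default: the running interpreter)
    satisfies the required minimum."""
    version = version or sys.version_info[:2]
    return version >= minimum

if check_python():
    print("Python OK for FastMCP")
else:
    print("Upgrade to Python 3.11+")
```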

Install system dependencies for Playwright:

Ubuntu 24.04 LTS (Manual Required):

# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip

python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps

Other Linux/macOS:

sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):

scripts/prepare_for_uvx_playwright.ps1

Installation

UVX (Recommended - Easiest):

# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Docker (Production-Ready):

# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp

Docker Features:

  • 🔧 Multi-Browser Support: Chromium, Firefox, and WebKit headless browsers
  • 🐧 Google Chrome: Additional Chrome Stable install for compatibility
  • ⚡ Optimized Performance: Pre-configured browser flags for Docker
  • 🔒 Security: Non-root user execution
  • 📦 Complete Dependencies: All required libraries included

Claude Desktop Setup

UVX Installation: Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}

Docker HTTP Mode:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}

For Japanese interface:

"env": {
  "CRAWL4AI_LANG": "ja"
}

📖 Documentation

  • Installation Guide: Complete installation instructions for all platforms
  • API Reference: Full tool documentation and usage examples
  • Configuration Examples: Platform-specific setup configurations
  • HTTP Integration: HTTP API access and integration methods
  • Advanced Usage: Power user techniques and workflows
  • Development Guide: Contributing and development setup

Language-Specific Documentation

  • English: docs/ directory
  • Japanese (日本語): docs/ja/ directory

πŸ› οΈ Tool Overview

Web Crawling (3)

  • crawl_url - Extract web page content with JavaScript support
  • deep_crawl_site - Crawl multiple pages from a site with configurable depth
  • crawl_url_with_fallback - Crawl with fallback strategies for anti-bot sites

Data Extraction (3)

  • intelligent_extract - Extract specific data from web pages using LLM
  • extract_entities - Extract entities (emails, phones, etc.) from web pages
  • extract_structured_data - Extract structured data using CSS selectors or LLM

YouTube (4)

  • extract_youtube_transcript - Extract YouTube transcripts with timestamps
  • batch_extract_youtube_transcripts - Extract transcripts from multiple YouTube videos (max 3)
  • get_youtube_video_info - Get YouTube video metadata and transcript availability
  • extract_youtube_comments - Extract YouTube video comments with pagination

Search (4)

  • search_google - Search Google with genre filtering
  • batch_search_google - Perform multiple Google searches (max 3)
  • search_and_crawl - Search Google and crawl top results
  • get_search_genres - Get available search genres

File Processing (3)

  • process_file - Convert PDF, Word, Excel, PowerPoint, ZIP to markdown
  • get_supported_file_formats - Get supported file formats and capabilities
  • enhanced_process_large_content - Process large content with chunking and BM25 filtering

Batch Operations (2)

  • batch_crawl - Crawl multiple URLs with fallback (max 3 URLs)
  • multi_url_crawl - Multi-URL crawl with pattern-based config (max 5 URL patterns)

💾 Persist Large Results to Disk (token-saver)

All information-gathering tools accept an optional output_path parameter that writes the full fetched content straight to disk and returns a slim, metadata-only response. This lets an LLM fetch huge pages, long YouTube transcripts, or whole batches without exhausting its context budget; the saved file is read only when needed.

How it works:

  • Single-file tools (e.g. crawl_url, extract_youtube_transcript) write one .md file (or .json for JSON-kind tools). Pass an absolute file path; the extension is auto-added if omitted. An existing regular file at that path is rejected unless overwrite=true.
  • Batch tools (batch_crawl, multi_url_crawl, deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) expect an absolute directory path and write one .md per URL plus an index.json. Any non-existent path is treated as a directory and created, including names containing dots such as /tmp/run.v1. If the path already exists as a regular file, the call is rejected. batch_crawl and multi_url_crawl keep their list return shape and embed an output_file key on each success item.
  • Request-dict tools (search_google, batch_search_google, search_and_crawl, batch_extract_youtube_transcripts) read the persistence keys directly from their request dict.
  • Common parameters: output_path (absolute; None or "" skips persistence), include_content_in_response (default false; when true, content is also included in the response, still subject to any content_limit/content_offset/max_content_per_page slicing), and overwrite (default false).
  • Writes are atomic per file (temp file + os.replace), and parent directories are auto-created. The full unsliced payload is persisted before any slicing or tool-internal truncation, so the on-disk copy is always complete even when the response is sliced.
  • Batch dict tools (deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) skip per-item persistence for items that report success=false; these still appear in index.json with file: null so callers can reason about the attempt list.
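
The atomic-write behavior described above can be sketched roughly as follows. This is a minimal illustration of the temp-file + os.replace pattern, with a hypothetical persist_payload helper; the server's actual implementation is internal and may differ:

```python
import os
import tempfile

def persist_payload(output_path: str, payload: str, overwrite: bool = False) -> str:
    """Sketch of the persistence step: refuse to clobber an existing file
    unless overwrite is set, auto-create parent directories, and write
    atomically via a temp file swapped into place with os.replace."""
    if os.path.isfile(output_path) and not overwrite:
        raise FileExistsError(f"{output_path} exists; pass overwrite=true")
    parent = os.path.dirname(os.path.abspath(output_path))
    os.makedirs(parent, exist_ok=True)
    # Write the full payload to a temp file in the same directory, then
    # atomically replace so readers never observe a partially written file.
    fd, tmp = tempfile.mkstemp(dir=parent)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(payload)
        os.replace(tmp, output_path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return output_path
```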

Markdown single-file example:

{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/long-article",
    "output_path": "/tmp/crawl_out/article.md"
  }
}

JSON structured extraction (extension auto-added):

{
  "tool": "extract_structured_data",
  "arguments": {
    "url": "https://example.com/products",
    "extraction_type": "css",
    "css_selectors": {"price": ".price", "name": "h1"},
    "output_path": "/tmp/crawl_out/products"
  }
}

Batch directory mode:

{
  "tool": "batch_crawl",
  "arguments": {
    "urls": ["https://a.example", "https://b.example"],
    "output_path": "/tmp/crawl_out/batch_run1"
  }
}
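
For a batch run like the one above, the output directory ends up holding one markdown file per URL plus an index.json. The exact schema is not documented here; a plausible shape, based only on the fields this README mentions (file set to null for failed items), might look like:

```json
{
  "items": [
    {"url": "https://a.example", "success": true, "file": "a_example.md"},
    {"url": "https://b.example", "success": false, "file": null}
  ]
}
```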

Each persisted markdown file begins with a YAML frontmatter block containing url, title, fetched_at, and source_tool so the artifact is self-describing.
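
An illustrative header for such a file (all field values here are hypothetical; only the field names come from the README):

```markdown
---
url: "https://example.com/long-article"
title: "Example article title"
fetched_at: "2026-01-01T00:00:00Z"
source_tool: "crawl_url"
---

Extracted article content follows...
```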

🎯 Common Use Cases

Content Research:

search_and_crawl β†’ extract_structured_data β†’ analysis

Documentation Mining:

deep_crawl_site β†’ batch processing β†’ extraction

Media Analysis:

extract_youtube_transcript β†’ summarization workflow

Site Mapping:

batch_crawl β†’ multi_url_crawl β†’ comprehensive data

🚨 Quick Troubleshooting

Installation Issues:

  1. Re-run setup scripts with proper privileges
  2. Try development installation method
  3. Check browser dependencies are installed

Performance Issues:

  • Use wait_for_js: true for JavaScript-heavy sites
  • Increase timeout for slow-loading pages
  • Use extract_structured_data for targeted extraction
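
For example, a crawl_url call tuned for a JavaScript-heavy, slow-loading page might look like this. wait_for_js comes from the tip above; the exact name and unit of the timeout parameter are assumptions, so check the API Reference:

```json
{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/spa-page",
    "wait_for_js": true,
    "timeout": 60
  }
}
```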

Configuration Issues:

  • Check JSON syntax in claude_desktop_config.json
  • Verify file paths are absolute
  • Restart Claude Desktop after configuration changes

πŸ—οΈ Project Structure

  • Original Library: crawl4ai by unclecode
  • MCP Wrapper: This repository (walksoda)
  • Implementation: Unofficial third-party integration

📄 License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.

🤝 Contributing

See our Development Guide for contribution guidelines and development setup instructions.

Release History

  • v0.3.0 (High priority, 4/12/2026): Output persistence and reliability improvements. Adds an output_path parameter across all MCP tools so results can be persisted to disk for downstream pipelines, plus reliability fixes for CrawlResponse handling, batch execution, and pagination.
  • v0.2.0 (Low priority, 3/1/2026): YouTube comments tool and codebase modularization. Adds the extract_youtube_comments tool, which extracts comments without an API key via youtube-comment-downloader and supports pagination through comment_offset, alongside a major refactoring for maintainability and improved security, reliability, and test coverage.
