⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.
- Google Search Integration: 7 optimized search genres with Google official operators
- Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
- Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
- YouTube Integration: Extract video transcripts and summaries without API keys
- Production Ready: 19 specialized tools with comprehensive error handling
- Python 3.11 or later (FastMCP requires Python 3.11+)
Install system dependencies for Playwright:
Ubuntu 24.04 LTS (Manual Required):
# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
libxcomposite1 libxcursor1 libxdamage1 libxi6 \
fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps

Other Linux/macOS:
sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):
scripts/prepare_for_uvx_playwright.ps1

UVX (Recommended - Easiest):
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Docker (Production-Ready):
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp
# Build and run with Docker Compose (STDIO mode)
docker-compose up --build
# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http
# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp

Docker Features:
- Multi-Browser Support: Chromium, Firefox, Webkit headless browsers
- Google Chrome: Additional Chrome Stable for compatibility
- Optimized Performance: Pre-configured browser flags for Docker
- Security: Non-root user execution
- Complete Dependencies: All required libraries included
UVX Installation:
Add to your claude_desktop_config.json:
{
"mcpServers": {
"crawl-mcp": {
"transport": "stdio",
"command": "uvx",
"args": [
"--from",
"git+https://github.com/walksoda/crawl-mcp",
"crawl-mcp"
],
"env": {
"CRAWL4AI_LANG": "en"
}
}
}
}

Docker HTTP Mode:
{
"mcpServers": {
"crawl-mcp": {
"transport": "http",
"baseUrl": "http://localhost:8000"
}
}
}

For Japanese interface:
"env": {
"CRAWL4AI_LANG": "ja"
}

| Topic | Description |
|---|---|
| Installation Guide | Complete installation instructions for all platforms |
| API Reference | Full tool documentation and usage examples |
| Configuration Examples | Platform-specific setup configurations |
| HTTP Integration | HTTP API access and integration methods |
| Advanced Usage | Power user techniques and workflows |
| Development Guide | Contributing and development setup |
- crawl_url - Extract web page content with JavaScript support
- deep_crawl_site - Crawl multiple pages from a site with configurable depth
- crawl_url_with_fallback - Crawl with fallback strategies for anti-bot sites

- intelligent_extract - Extract specific data from web pages using LLM
- extract_entities - Extract entities (emails, phones, etc.) from web pages
- extract_structured_data - Extract structured data using CSS selectors or LLM

- extract_youtube_transcript - Extract YouTube transcripts with timestamps
- batch_extract_youtube_transcripts - Extract transcripts from multiple YouTube videos (max 3)
- get_youtube_video_info - Get YouTube video metadata and transcript availability
- extract_youtube_comments - Extract YouTube video comments with pagination

- search_google - Search Google with genre filtering
- batch_search_google - Perform multiple Google searches (max 3)
- search_and_crawl - Search Google and crawl top results
- get_search_genres - Get available search genres

- process_file - Convert PDF, Word, Excel, PowerPoint, ZIP to markdown
- get_supported_file_formats - Get supported file formats and capabilities
- enhanced_process_large_content - Process large content with chunking and BM25 filtering

- batch_crawl - Crawl multiple URLs with fallback (max 3 URLs)
- multi_url_crawl - Multi-URL crawl with pattern-based config (max 5 URL patterns)
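
Because batch_crawl accepts at most 3 URLs per call, a longer URL list has to be split across several calls. The following is a minimal client-side sketch of that split. The helper names (chunk_urls, build_batch_requests) are hypothetical and not part of this server; the payload shape simply mirrors the "tool" / "arguments" layout used by the JSON examples in this README.

```python
from typing import Iterator

MAX_URLS_PER_BATCH = 3  # limit stated in the tool list for batch_crawl


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_BATCH) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_batch_requests(urls: list[str]) -> list[dict]:
    """Build one batch_crawl request payload per chunk (hypothetical helper)."""
    return [
        {"tool": "batch_crawl", "arguments": {"urls": chunk}}
        for chunk in chunk_urls(urls)
    ]


if __name__ == "__main__":
    pages = [f"https://example.com/page{i}" for i in range(1, 8)]
    for request in build_batch_requests(pages):
        print(request["arguments"]["urls"])
```

Seven URLs produce three requests holding 3, 3, and 1 URLs, so every call stays within the documented limit.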
All information-gathering tools accept an optional output_path parameter that writes the full fetched content straight to disk and returns a slim metadata-only response. This lets an LLM fetch huge pages, long YouTube transcripts, or whole batches without blowing its context budget. Read from the saved file only when needed.
How it works:
- Single-file tools (e.g. crawl_url, extract_youtube_transcript) write one .md (or .json for JSON-kind tools). Pass an absolute file path; the extension is auto-added if omitted. An existing regular file at that path is rejected unless overwrite=true.
- Batch tools (batch_crawl, multi_url_crawl, deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) expect an absolute directory path and write one .md per URL plus index.json. Any non-existent path is treated as a directory and created, including names containing dots such as /tmp/run.v1. If the path already exists as a regular file, the call is rejected. batch_crawl / multi_url_crawl keep their list return shape and embed an output_file key on each success item.
- Request-dict tools (search_google, batch_search_google, search_and_crawl, batch_extract_youtube_transcripts) read the persistence keys directly from their request dict.
- Common parameters: output_path (absolute; None or "" skips persistence), include_content_in_response (default false; when true, content is included in the response too, still subject to any content_limit / content_offset / max_content_per_page slicing), overwrite (default false).
- Writes are atomic per file (temp file + os.replace); parent directories are auto-created; the full unsliced payload is persisted before any slicing or tool-internal truncation, so the on-disk copy is always complete even when the response is sliced.
- Batch dict tools (deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) skip per-item persistence for items that report success=false; these still appear in index.json with file: null so callers can reason about the attempt list.
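
The "temp file + os.replace" write described above is a common pattern for making sure readers never see a half-written file. The sketch below illustrates the general technique only; it is not this server's actual code. The function name atomic_write_text is hypothetical, and its overwrite check is modeled loosely on the overwrite parameter described in the list.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: str, content: str, overwrite: bool = False) -> Path:
    """Write `content` to `path` through a sibling temp file and os.replace."""
    target = Path(path)
    if not target.is_absolute():
        raise ValueError("output path must be absolute")
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists; pass overwrite=True to replace it")

    # Create missing parent directories before writing.
    target.parent.mkdir(parents=True, exist_ok=True)

    # The temp file sits next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        # Swap the finished temp file into place in a single step.
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Keeping the temp file in the destination directory matters: os.replace can only swap files atomically within a single filesystem.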
Markdown single-file example:
{
"tool": "crawl_url",
"arguments": {
"url": "https://example.com/long-article",
"output_path": "/tmp/crawl_out/article.md"
}
}

JSON structured extraction (extension auto-added):
{
"tool": "extract_structured_data",
"arguments": {
"url": "https://example.com/products",
"extraction_type": "css",
"css_selectors": {"price": ".price", "name": "h1"},
"output_path": "/tmp/crawl_out/products"
}
}

Batch directory mode:
{
"tool": "batch_crawl",
"arguments": {
"urls": ["https://a.example", "https://b.example"],
"output_path": "/tmp/crawl_out/batch_run1"
}
}

Each persisted markdown file begins with a YAML frontmatter block containing url, title, fetched_at, and source_tool so the artifact is self-describing.
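
A client reading a persisted file back may want to separate that frontmatter from the markdown body. The sketch below is a minimal standard-library parser. It assumes the block is delimited by `---` lines and holds flat `key: value` pairs, which this README does not spell out. The function name split_frontmatter and the sample values are illustrative only, and a full YAML parser would be more robust for anything beyond flat keys.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body) for text that starts with a '---' delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter: hand the text back unchanged

    metadata: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return metadata, "\n".join(lines[index + 1:])
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip().strip("\"'")
    return {}, text  # unterminated block: treat the whole text as body


if __name__ == "__main__":
    # Illustrative sample; the field values are made up for this sketch.
    sample = (
        "---\n"
        "url: https://example.com/long-article\n"
        'title: "Long Article"\n'
        "source_tool: crawl_url\n"
        "---\n"
        "# Long Article\n"
    )
    meta, body = split_frontmatter(sample)
    print(meta["source_tool"], len(body))
```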
Content Research:
search_and_crawl → extract_structured_data → analysis

Documentation Mining:
deep_crawl_site → batch processing → extraction

Media Analysis:
extract_youtube_transcript → summarization workflow

Site Mapping:
batch_crawl → multi_url_crawl → comprehensive data

Installation Issues:
- Re-run setup scripts with proper privileges
- Try development installation method
- Check browser dependencies are installed
Performance Issues:
- Use wait_for_js: true for JavaScript-heavy sites
- Increase timeout for slow-loading pages
- Use extract_structured_data for targeted extraction
Configuration Issues:
- Check JSON syntax in claude_desktop_config.json
- Verify file paths are absolute
- Restart Claude Desktop after configuration changes
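
The JSON syntax check can be scripted. The sketch below parses the config text and reports the first syntax error with its line and column, then runs a few structural checks. Those checks are modeled only on the two configuration examples in this README ("mcpServers", "command", "baseUrl") and are not an official Claude Desktop schema; the function name check_mcp_config is hypothetical.

```python
import json


def check_mcp_config(raw: str) -> list[str]:
    """Return human-readable problems found in the config text (empty if none)."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]

    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        return ['missing or empty "mcpServers" object']

    problems: list[str] = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f'server "{name}" must be a JSON object')
            continue
        # Assumption: an entry needs either a launch command or an HTTP base URL.
        if "command" not in entry and "baseUrl" not in entry:
            problems.append(f'server "{name}" has neither "command" nor "baseUrl"')
    return problems


if __name__ == "__main__":
    broken = '{"mcpServers": {"crawl-mcp": {"command": "uvx",}}}'  # trailing comma
    for problem in check_mcp_config(broken):
        print(problem)
```

To check a real file, read its contents with `pathlib.Path(...).read_text(encoding="utf-8")` and pass the result in; the location of claude_desktop_config.json depends on the platform.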
- Original Library: crawl4ai by unclecode
- MCP Wrapper: This repository (walksoda)
- Implementation: Unofficial third-party integration
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.
See our Development Guide for contribution guidelines and development setup instructions.
- crawl4ai - The underlying web crawling library
- Model Context Protocol - The standard this server implements
- Claude Desktop - Primary client for MCP servers
