freshcrate
Skin:/
Home > MCP Servers > cortex-scout

cortex-scout

A unified web extraction and stateful automation engine for AI. Replaces heavy testing frameworks with token-optimized browser control, deep research, and HITL.

Why this rank:Strong adoptionRelease freshnessHealthy release cadence

Description

A unified web extraction and stateful automation engine for AI. Replaces heavy testing frameworks with token-optimized browser control, deep research, and HITL.

README

CortexScout (cortex-scout) โ€” Search and Web Extraction Engine for AI Agents

CortexScout is the Deep Research & Web Extraction module within the Cortex-Works ecosystem.

Designed for agent workloads that require token-efficient web retrieval, reliable anti-bot handling, and optional Human-in-the-Loop (HITL) fallback.

MIT License Built with Rust MCP


Overview

CortexScout provides a single, self-hostable Rust binary that exposes search, extraction, and stateful browser automation capabilities over MCP (stdio) and an optional HTTP server. Output formats are structured and optimized for downstream LLM use.

It is built to handle the practical failure modes of web retrieval (rate limits, bot challenges, JavaScript-heavy pages) through progressive fallbacks: native retrieval โ†’ Chromium CDP rendering โ†’ Stateful E2E Testing โ†’ HITL workflows.


Tools (Capability Roster)

Area MCP Tools / Capabilities
Search web_search (URL discovery) or web_search(include_content=true) (search+content in one call)
Fetch and Crawl `web_fetch(mode="single"
Extraction extract_fields (primary structured extraction)
Automation scout_browser_automate / browser_automate (stateful omni-tool), scout_agent_profile_auth, scout_browser_close
Anti-bot handling CDP rendering, proxy rotation, block-aware retries
HITL visual_scout, `hitl_web_fetch(auth_mode="challenge"
Memory memory_search (LanceDB-backed research history)
Deep research deep_research (multi-hop search + scrape + synthesis)

Legacy names remain callable as compatibility aliases (web_search_json, web_fetch_batch, web_crawl, fetch_then_extract, human_auth_session). Agents should prefer the unified primary tools above.

Ecosystem Integration

While CortexScout runs as a standalone tool today, it is designed to integrate with CortexDB and CortexStudio for multi-agent scaling, shared retrieval artifacts, and centralized governance.


๐ŸŽญ The "Playwright Killer" (Stateful Browser Automation)

CortexScout includes a built-in, stateful CDP automation engine designed specifically for AI Agents, completely replacing heavy frameworks like Playwright or Cypress for E2E testing workflows.

  • The Silent Omni-Tool (scout_browser_automate): Instead of calling dozens of browser tools, agents pass one array of steps. The runtime now covers Playwright-style action families in one call: navigation, hover/click/type/wait, locator-driven actions, assertions, tabs, screenshots/PDF, file upload, form fill, dialog policy, coordinate mouse actions, route mocking, console/network capture, and cookie/storage CRUD.
  • Persistent Agent Profile: Automation runs silently in the background (--headless=new) using a dedicated isolated profile (~/.cortex-scout/agent_profile). It maintains cookies, localStorage, and session state across tool calls without causing SingletonLock collisions with your active desktop browser.
  • QA Mock, Trace, And Verification Engine: Agents can install route mocks (mock_api, route_list, unroute) with response header overrides/stripping, trace flows (trace_start, trace_stop, trace_export), capture console/network logs, checkpoint browser state, and run both CSS and locator-based assertions plus Playwright-style verification helpers.
  • The Agent Auth Portal (scout_agent_profile_auth): If the silent agent encounters a CAPTCHA or complex OAuth login (like Google/Microsoft) on a new domain, this tool launches the agent's profile in a visible window. You solve the CAPTCHA once, the cookies are saved, and the agent returns to silent automation forever.

Playwright-Style Coverage Map

Capability Area Cortex Scout Actions
Navigation and input navigate, navigate_back, click, hover, type, press_key, scroll, wait_for, wait_for_selector, wait_for_locator
Locator and verification click_locator, type_locator, assert, assert_locator, generate_locator, verify_element_visible, verify_text_visible, verify_list_visible, verify_value
Tabs and media tabs, resize, screenshot, snapshot, pdf_save, file_upload, fill_form, handle_dialog
Network and mocks network_tap, network_dump, network_state_set, mock_api, route_list, unroute
Browser state storage_clear, storage_state_export, storage_state_import, storage_checkpoint, storage_rollback, cookie_*, localstorage_*, sessionstorage_*
Low-level pointer control mouse_click_xy, mouse_down, mouse_move_xy, mouse_drag_xy, mouse_up, mouse_wheel

The main tradeoff versus raw Playwright MCP is packaging, not capability shape: Cortex Scout keeps the browser surface inside one stateful omni-tool so agents spend fewer turns and fewer tokens coordinating multi-step flows.

Anti-Bot Efficacy & Validation

This repository includes captured evidence artifacts that validate extraction and HITL flows against representative protected targets.

Target Protection Evidence Notes
LinkedIn Cloudflare + Auth JSON ยท Snippet Auth-gated listings extraction
Ticketmaster Cloudflare Turnstile JSON ยท Snippet Challenge-handled extraction
Airbnb DataDome JSON ยท Snippet Large result sets under bot controls
Upwork reCAPTCHA JSON ยท Snippet Protected listings retrieval
Amazon AWS Shield JSON ยท Snippet Search result extraction
nowsecure.nl Cloudflare JSON Manual return path validated

See proof/README.md for methodology and raw outputs.


Quick Start

Option A โ€” Prebuilt binaries

Download the latest release assets from GitHub Releases and run one of:

  • cortex-scout-mcp โ€” MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
  • cortex-scout โ€” optional HTTP server (default port 5000; override via --port, PORT, or CORTEX_SCOUT_PORT)

Health check (HTTP server):

./cortex-scout --port 5000
curl http://localhost:5000/health

Option B โ€” Build from source

Install protoc first. lance-encoding uses Protocol Buffers during the release build, so protoc must be on your PATH.

  • macOS: brew install protobuf
  • Ubuntu/Debian: sudo apt-get install -y protobuf-compiler
  • Fedora: sudo dnf install -y protobuf-compiler

Basic build (search, scrape, deep research, memory):

git clone https://github.com/cortex-works/cortex-scout.git
cd cortex-scout
cargo build --release --manifest-path mcp-server/Cargo.toml --bin cortex-scout-mcp

This works from the repository root because the manifest path is explicit.

Full build (includes hitl_web_fetch / visible-browser HITL):

cargo build --release --manifest-path mcp-server/Cargo.toml --all-features --bin cortex-scout-mcp

If you also want the optional HTTP server binary, build it explicitly with cargo build --release --bin cortex-scout.

Local MCP smoke test:

python3 publish/ci/smoke_mcp.py

This runs a newline-delimited JSON-RPC stdio session against the local cortex-scout-mcp binary and exercises the main public tools with safe example inputs.


MCP Integration (VS Code / Cursor / Claude Desktop)

Add a server entry to your MCP config.

VS Code (mcp.json โ€” global, or settings.json under mcp.servers):

// mcp.json (global): top-level key is "servers"
// settings.json (workspace): use "mcp.servers" instead
{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=warn",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}

Default behavior is direct/no-proxy. Add IP_LIST_PATH and PROXY_SOURCE_PATH only if you want proxy tools available. If you want proxy_control available without routing normal traffic through proxies, point IP_LIST_PATH at an empty ip.txt file and let agents populate it on demand.

Important: Always use RUST_LOG=warn, not info. At info level, the server emits hundreds of log lines per request to stderr, which can confuse MCP clients that monitor stderr.

Windows: Windows has no env command. Use the command+env object format instead โ€” see docs/IDE_SETUP.md.

With deep research (LLM synthesis via OpenRouter / any OpenAI-compatible API):

{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=warn",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "OPENAI_BASE_URL=https://openrouter.ai/api/v1",
        "OPENAI_API_KEY=sk-or-v1-...",
        "DEEP_RESEARCH_LLM_MODEL=moonshotai/kimi-k2.5",
        "DEEP_RESEARCH_ENABLED=1",
        "DEEP_RESEARCH_SYNTHESIS=1",
        "DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS=4096",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}

Multi-IDE guide: docs/IDE_SETUP.md


Configuration (cortex-scout.json)

Create cortex-scout.json in the same directory as the binary (or repository root). All fields are optional; environment variables act as fallback.

{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:1234/v1",
    "llm_api_key": "",
    "llm_model": "lfm2-2.6b",
    "synthesis_enabled": true,
    "synthesis_max_sources": 3,
    "synthesis_max_chars_per_source": 800,
    "synthesis_max_tokens": 1024
  }
}

Key Environment Variables

Core

Variable Default Description
RUST_LOG warn Log level. Keep warn for MCP stdio โ€” info floods stderr and confuses MCP clients
HTTP_TIMEOUT_SECS 30 Per-request read timeout (seconds)
HTTP_CONNECT_TIMEOUT_SECS 10 TCP connect timeout (seconds)
OUTBOUND_LIMIT 16 Max concurrent outbound HTTP connections
MAX_CONTENT_CHARS 10000 Max characters returned per scraped page

Browser / Anti-bot

Variable Default Description
CHROME_EXECUTABLE auto-detected Override path to Chromium/Chrome/Brave binary
SEARCH_CDP_FALLBACK true Retry search engine fetches via native Chromium CDP when blocked
SEARCH_TIER2_NON_ROBOT unset Set 1 to allow hitl_web_fetch as last-resort search escalation
MAX_LINKS 100 Max links followed per page crawl

Search

Variable Default Description
SEARCH_ENGINES google,bing,duckduckgo,brave Active engines (comma-separated)
SEARCH_MAX_ENGINES_PER_QUERY 3 Max engines queried per search before health-based rotation picks the next set
SEARCH_MAX_RESULTS_PER_ENGINE 10 Results per engine before merge/dedup
SEARCH_ENGINE_STAGGER_MS 125 Delay between per-engine launches to reduce bursty anti-bot triggers
SEARCH_COMMUNITY_TRIGGER_RESULTS 4 Only run Reddit/HN community expansion when primary search returns fewer than this many results
SEARCH_SHARED_CACHE true Share successful search results across concurrent Cortex Scout processes on the same host
SEARCH_SHARED_CACHE_TTL_SECS 300 TTL for the shared cross-process search cache
SEARCH_HOST_MIN_GAP_MS engine-tuned Cross-process minimum spacing between search-engine requests from the same host IP
SEARCH_HOST_MAX_GAP_MS engine-tuned Cross-process maximum spacing/jitter between search-engine requests from the same host IP
SCRAPE_HOST_MIN_GAP_MS 900 Cross-process minimum spacing between scrape requests to the same host
SCRAPE_HOST_MAX_GAP_MS 1800 Cross-process maximum spacing/jitter between scrape requests to the same host
CORTEX_SCOUT_HOST_GUARD_DISABLED false Set 1 only if you explicitly want to disable shared host-level throttling

Proxy

Variable Default Description
IP_LIST_PATH โ€” Optional path to ip.txt (one proxy per line: http://, socks5://). Leave unset to disable proxy support entirely, or point at an empty file to keep proxy tools available but inactive by default
PROXY_SOURCE_PATH โ€” Optional path to proxy_source.json (used by proxy_control grab)

Semantic Memory (LanceDB)

Variable Default Description
LANCEDB_URI โ€” Directory path for persistent research memory. Omit to disable
CORTEX_SCOUT_MEMORY_DISABLED 0 Set 1 to disable memory even when LANCEDB_URI is set
MODEL2VEC_MODEL built-in HuggingFace model ID or local path for embedding (e.g. minishlab/potion-base-8M)

Deep Research

Variable Default Description
DEEP_RESEARCH_ENABLED 1 Set 0 to disable the deep_research tool at runtime
OPENAI_API_KEY โ€” API key for LLM synthesis. Omit for key-less local endpoints (Ollama)
OPENAI_BASE_URL https://api.openai.com/v1 OpenAI-compatible endpoint (OpenRouter, Ollama, LM Studio, etc.)
DEEP_RESEARCH_LLM_MODEL gpt-4o-mini Model identifier (must be supported by the endpoint)
DEEP_RESEARCH_SYNTHESIS 1 Set 0 to skip LLM synthesis (search+scrape only)
DEEP_RESEARCH_HOP_TIMEOUT_SECS 90 Per-hop scrape timeout. When exceeded, deep_research returns partial results instead of hanging until the MCP caller times out
DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS 1024 Max tokens for synthesis response. Use 4096+ for large-context models
DEEP_RESEARCH_SYNTHESIS_MAX_SOURCES 8 Max source documents fed to LLM synthesis
DEEP_RESEARCH_SYNTHESIS_MAX_CHARS_PER_SOURCE 2500 Max characters extracted per source for synthesis

HTTP Server only

Variable Default Description
CORTEX_SCOUT_PORT / PORT 5000 Listening port for the HTTP server binary (cortex-scout)

Agent Best Practices

Recommended operational flow:

  1. Call memory_search before any new research run โ€” skip live fetching if similarity โ‰ฅ 0.60 and skip_live_fetch is true.
  2. For topic discovery use web_search for URL-only discovery, or web_search(include_content=true) to search and scrape top results in one round-trip.
  3. For known URLs use web_fetch(mode="single") with output_format="clean_json", and set query + strict_relevance=true to keep only relevant sections.
  4. On 403/429: call proxy_control with action:"grab" to refresh the proxy list, then retry with use_proxy:true.
  5. For auth-gated pages: run visual_scout when auth_risk_score >= 0.4, then use hitl_web_fetch(auth_mode="challenge") for CAPTCHA walls or hitl_web_fetch(auth_mode="auth") for login walls.
  6. For deep research: deep_research handles multi-hop search + scrape + LLM synthesis automatically. Tune depth (1โ€“3) and max_sources per run cost budget.
  7. For UI automation and E2E testing: use scout_browser_automate with step arrays for tabs, locator assertions, screenshots/PDF, route mocks, file uploads, and browser-state setup. If blocked by first-time login/CAPTCHA, call scout_agent_profile_auth, then resume automation.

FAQ

Why does deep_research with Ollama or qwen3.5 sometimes fail or fall back to heuristic mode?

Some reasoning-capable local models return OpenAI-compatible /v1/chat/completions responses with message.reasoning populated but message.content empty. Cortex Scout now retries local Ollama endpoints through native /api/chat with think:false when that pattern is detected.

Recommended config for local 4B-class Ollama models:

  • llm_api_key: "" in cortex-scout.json is valid and means "no auth required"
  • Keep synthesis_max_sources at 1-2
  • Keep synthesis_max_chars_per_source around 600-1000
  • Keep synthesis_max_tokens around 512-768

If you still see slow or unstable synthesis, reduce synthesis_max_sources before increasing token limits.

Why do I see Chromium profile lock errors?

Each headless request uses a unique temporary profile, so normal scraping and deep_research are safe from profile lock races. Only HITL flows (like hitl_web_fetch) using a real browser profile can hit a lock if you run them concurrently or have Brave/Chrome open on the same profile. To avoid: run HITL calls one at a time, and close all browser windows before reusing a profile.

Checklist:

  1. Use a recent build (2026-03-05 or newer)
  2. Avoid persistent profile paths unless you need a logged-in session
  3. Run HITL/profile flows sequentially
  4. Close all browser windows before reusing a profile
  5. Let Cortex Scout use its own temp profiles for concurrent research

My MCP client connects but tools fail or time out immediately. What should I check first?

Check these before anything else:

  1. Use RUST_LOG=warn, not info.
  2. On macOS/Linux env-style configs, pass the binary path directly after the env assignments. Do not insert "--" in mcp.json args.
  3. On Windows, do not use env; use command plus an env object.
  4. Make sure the binary path points to a current build.

Versioning and Changelog

See CHANGELOG.md.


License

MIT. See LICENSE.

Release History

VersionChangesUrgencyDate
v3.3.7### Changed - Updated MCP setup guidance so hard timeout env vars are treated as required guardrails in copy-pasteable MCP config examples, not optional tuning. ### Fixed - Fixed MCP tool sessions getting stuck on pathological fetches by enforcing hard per-tool timeouts in both HTTP and stdio dispatch, plus bounded timeouts for expensive scrape stages and browser launch/probe paths. - Fixed `web_fetch` on the LLVM `CodeGenerator.html` path by removing UTF-8-unsafe string slicing in the cleaner,High4/10/2026
v3.3.6### Changed - Updated MCP tool descriptions and agent guidance to clarify that proxy use is optional by default, balanced fetches stay on the non-proxy/native path unless blocking signals appear, and all tool responses now expose timing information. - `web_fetch` balanced-mode strategy now prefers the fast native HTTP path on normal server-rendered pages such as GitHub, only escalating into CDP earlier for proxy mode, high/aggressive mode, or known JS-heavy/problematic hosts. ### Added - Added High4/9/2026
v3.3.5### Changed - Updated agent guidance to prefer direct MCP-tool validation on realistic public URLs after rebuilds when verifying runtime behavior for release decisions. ### Fixed - Fixed `extract_fields` natural-language schema parsing so prompts like `fields: page_title, page_type, main_topics, summary` and `Return a JSON response with fields ...` now resolve to the requested strict output fields instead of drifting into generic auto-extraction keys. - Fixed extraction grounding so metadata-baMedium4/9/2026
v3.3.4### Added - Added cross-process host guard coordination for search engines and scrape hosts so multiple Cortex Scout processes on the same machine/IP space requests out instead of self-triggering rate limits. - Added shared cross-process search cache + singleflight locking so concurrent repos/agents can reuse live search results instead of duplicating the same upstream traffic. - Added regression tests for `deep_research` history URL extraction and timeout helper behavior. ### Changed - LoweredMedium4/9/2026
v3.3.3### Added - Added `publish/ci/smoke_deep_research.py` to sweep `deep_research` over MCP with coverage for every public parameter, clamp behavior, and invalid input handling. - Added `effective_config` to `deep_research` responses so MCP clients and smoke tests can inspect the clamped execution parameters that were actually used. ### Fixed - Fixed browser launches in root/restricted environments by applying explicit no-sandbox handling across automation sessions, visible auth sessions, and the rMedium4/8/2026
v3.3.2### Added - Expanded `scout_browser_automate` with broader Playwright-style parity: `navigate_back`, `hover`, `wait_for`, `resize`, `tabs`, `file_upload`, `fill_form`, `handle_dialog`, `pdf_save`, coordinate mouse actions, route inspection/removal, network state toggling, cookie/localStorage/sessionStorage CRUD, and verification helpers (`generate_locator`, `verify_*`). - Added richer `mock_api` controls with persistent route registry, method matching, custom response headers, delay simulation, Medium3/30/2026
v3.3.0### Added - Unified the public MCP tool surface around grouped calls: `web_search(include_content=true)`, `web_fetch(mode="single"|"batch"|"crawl")`, and `hitl_web_fetch(auth_mode="challenge"|"auth")`. - Expanded browser automation to behave more like a compact Playwright replacement, including nested flows, console capture, storage state helpers, and stronger auto-wait assertions. ### Changed - Refreshed usage docs, setup guides, and smoke coverage so the repository points agents at the currenMedium3/30/2026
v3.2.0### Added ## ๐ŸŽญ The "Playwright Killer" (Stateful Browser Automation) CortexScout includes a built-in, stateful CDP automation engine designed specifically for AI Agents, completely replacing heavy frameworks like Playwright or Cypress for E2E testing workflows. - **The Silent Omni-Tool (`scout_browser_automate`)**: Instead of calling dozens of tools, agents pass an array of `steps` (navigate, click, type, scroll, press_key, snapshot, screenshot). The entire sequence executes in a single LLLow3/17/2026
v3.1.3### Fixed - **Auth-wall false positives on public pages with login modals.** Pages like Discourse forum threads were incorrectly blocked with `NEED_HITL` because the password input in the header login modal triggered auth detection. Rewrote `detect_auth_wall_html` with a high/low-confidence selector split: - **High-confidence selectors** (e.g. `#login_field`, `.auth-form`, `#loginForm`) fire unconditionally. - **Low-confidence selectors** (e.g. `[type='password']`, generic `/loginLow3/14/2026
v3.1.2### Fixed - **CDP concurrent launches โ€” `SingletonLock` race condition (closes #7).** When multiple MCP tools triggered headless browser fetches simultaneously, all launched into the same default Chrome user-data dir, causing every instance after the first to fail with `"SingletonLock"`. Each CDP launch now gets an isolated `--user-data-dir` under a unique `/tmp/cortex-scout-cdp-XXXXXXXX` directory (cleaned up automatically after each request). Concurrent headless scraping now worLow3/5/2026
v3.1.1Built from macOS (Apple Silicon) using cargo-zigbuild. Low2/27/2026
v3.1.0 ## v3.1.0 (2026-02-24) ### Added - **`cortex-scout.json` file-based config loader** โ€” `ShadowConfig` struct loaded at startup from `cortex-scout.json` (cwd โ†’ `../cortex-scout.json` โ†’ `CORTEX_SCOUT_CONFIG` env path). All fields optional with env-var + hardcoded fallbacks; missing file is silently ignored. - **`ShadowDeepResearchConfig` sub-struct** with typed resolver methods providing 3-tier priority: JSON value โ†’ env var โ†’ hardcoded default for all 6 deep-research tunables (`llm_base_urlLow2/26/2026
v3.0.2## v3.0.2 (2026-02-24) ### Added - **`skip_live_fetch` machine-readable boolean** in `research_history` response โ€” each result entry now includes: - `skip_live_fetch` (`bool`): `true` only when the entry is a Scrape (not Search), similarity โ‰ฅ 0.60, `word_count` โ‰ฅ 50, and no sparse-content warnings. Agents should consume this field directly rather than re-implementing the cache-quality guard. - `word_count` (`u64 | null`): extracted from the stored `ScrapeResponse`; `null` for Search-tyLow2/21/2026
v3.0.0## v3.0.0 (2026-02-20) ### Added - **`human_auth_session` (The Nuclear Option)**: Launches a visible browser for human login/CAPTCHA solving. Captures and persists full authentication cookies to `~/.shadowcrawl/sessions/{domain}.json`. Enables full automation for protected URLs after a single manual session. - **Instruction Overlay**: `human_auth_session` now displays a custom green "ShadowCrawl" instruction banner on top of the browser window to guide users through complex auth walls. -Low2/20/2026
v2.5.0## v2.5.0 (2026-02-19) ### Added - **Markdown post-processor**: `normalize_markdown(text: String) -> String` unescapes token-wasting Markdown escapes, collapses excess blank lines, and dedupes navigation link spam. - **GitHub blob URL auto-rewrite**: `web_fetch` on `github.com/*/blob/*` URLs is transparently rewritten to `raw.githubusercontent.com` before fetching โ€” returns the raw file/source directly instead of GitHub's React SPA shell. - **GitHub SPA payload extraction**: `looks_like_Low2/19/2026
v2.4.3## v2.4.3 (2026-02-19) ### Feat - **GitHub blob URL auto-rewrite**: `web_fetch` on `github.com/*/blob/*` URLs is now transparently rewritten to `raw.githubusercontent.com` before fetching โ€” returns the raw file/source directly instead of GitHub's React SPA shell. - **GitHub SPA payload extraction**: `looks_like_spa` now detects GitHub's `react-app.embeddedData` script tag. `extract_spa_json_state` extracts `payload.blob.text`, `payload.readme`, `payload.issue.body`, `payload.pullRequest.bodyLow2/19/2026
v2.4.2## v2.4.2 (2026-02-19) ### MCP tool naming normalization (agent clarity) - Standardizes public MCP tool names to consistent verbs: - `web_search`, `web_search_json`, `web_fetch`, `web_fetch_batch`, `web_crawl`, `extract_fields`, `memory_search`, `proxy_control`, `hitl_web_fetch` - Adds intuitive aliases to prevent agent confusion and keep old prompts working: - `fetch_url`, `fetch_webpage`, `webpage_fetch` โ†’ `web_fetch` - `fetch_url_batch` โ†’ `web_fetch_batch` - `site_crawl` โ†’ `webLow2/19/2026
v2.4.0## ๐Ÿงฌ NeuroSiphon Token-Saving Pipeline (v2.4.0) ShadowCrawl v2.4.0 integrates **token-efficiency techniques** inspired by NeuroSiphon: - Repo: https://github.com/DevsHero/NeuroSiphon - Kill switch: set `SHADOWCRAWL_NEUROSIPHON=0` to disable all NeuroSiphon behaviors. These techniques focus on **returning only the most useful content to the agent**, avoiding token leaks (raw HTML, boilerplate imports, DOM scaffolding on SPAs). | Technique | Trigger | Output behavior | Benefit (tokenLow2/19/2026
v2.3.0v2.3.0: ShadowCrawl is now Zero-Docker! ๐Ÿฅท๐Ÿฆ€ This is the most significant update yet. We have officially removed all external Docker dependencies. ShadowCrawl is now a high-performance, single-binary "Cyborg" engine that runs natively on your machine. ๐ŸŒŸ What's New? Pure Rust Architecture: No more Docker Compose. No more SearXNG, Redis, Qdrant, or Browserless. Just one binary to rule them all. Native Browser Control: Transitioned from Browserless.io to chromiumoxide. ShadowCrawl now manaLow2/18/2026
v2.2.0Release v2.2.0: The God-Tier Meta-Search & Zero-Infra Update ๐Ÿฅท๐Ÿฆ€ ShadowCrawl has officially evolved. We have completely decapitated the legacy stack. No more SearXNG, no more Redis, and no more Qdrant. This release introduces a native Internal Meta-Search Engine and a Serverless Vector Memory system, making ShadowCrawl a lightweight, single-binary powerhouse for AI Agents. ๐Ÿš€ Key Highlights 1. Internal Meta-Search Engine (Bye-Bye SearXNG) We've integrated a high-performance multi-enginLow2/18/2026
v2.0.0-rcHere is the comprehensive release documentation for v2.0.0-rc, focusing on the breakthrough Human-In-The-Loop (HITL) integration and the "Non-Robot Search" engine. Release v2.0.0-rc: The HITL & Stealth Sovereignty Update ๐Ÿฅทโœจ This Release Candidate marks the most significant architectural shift in ShadowCrawl's history. We have moved beyond simple automated scraping into Cyborg Intelligence, introducing a hybrid model where AI agents and humans collaborate to bypass the world's toughest anti-Low2/14/2026
v1.1.0## Docker Image Published to: `ghcr.io/devshero/shadowcrawl:1.1.0` release: v1.1.0 quality/runtime hardening ### Changes - centralize shared content quality policy helpers - add runtime quality_mode support across structured/scrape/crawl/extract paths - implement strict_proxy_health with non-strict diagnostic fallback - improve proxy connection testing with HEAD->GET fallback - harden scrape outputs (raw HTML omitted by default in JSON/batch) - enrich search output with seLow2/13/2026
v1.0.1## Docker Image Published to: `ghcr.io/devshero/shadowcrawl:1.0.1` ### Changes - Rebrand ShadowCrawl - Build and push Docker image - ShadowCrawl MCP Server v1.0.1 ### Pull the image: ```bash docker pull ghcr.io/devshero/shadowcrawl:1.0.1 ``` **Full Changelog**: https://github.com/DevsHero/ShadowCrawl/compare/v0.3.0...v1.0.1Low2/13/2026
v0.3.0`docker pull ghcr.io/devshero/search-scrape:0.3.0`Low2/12/2026
v1.0.0## ๐Ÿš€ Release v1.0.0 (General Availability) The search-scrape project has evolved. v1.0.0 GA delivers a robust, self-hosted alternative to premium scraping APIs, specifically engineered for AI Agent workflows and MCP-native environments. ### ๐Ÿ“ฆ Docker Image The official image is now available on GitHub Container Registry: `ghcr.io/DevsHero/search-scrape:1.0.0` ๐Ÿ’Ž Key Features & Enhancements (Since v0.3.0) Unified MCP Surface: 100% parity between HTTP and stdio transport layers. No moLow2/12/2026
v0.2.0### v0.2.0 Integrated key enhancements while maintaining original performance: - crawl_website: Added recursive crawling for deep content extraction. - scrape_batch: Added concurrent scraping for better efficiency. - extract_structured: Added LLM-based structured JSON output. Special thanks to @lutfi238 for the excellent work on these features! Ref: https://github.com/lutfi238/search-scrape - Update documents - Update cargo packages - Github action workflows ## Docker Image PuLow2/10/2026

Dependencies & License Audit

Loading dependencies...

Similar Packages

git-mcp-rs๐Ÿ” Enable real-time exploration of GitHub repositories with this high-performance Model Context Protocol (MCP) server built in Rust.main@2026-06-01
engraphLocal knowledge graph for AI agents. Hybrid search + MCP server for Obsidian vaults.v1.7.2
PlexMCP-OSS๐ŸŒ Build a robust MCP gateway platform to enhance your Plex experience with seamless integration and reliable performance.main@2026-06-07
justoneapi-mcpProduction-ready MCP server exposing JustOneAPI endpoints to AI agents with raw JSON responses.main@2026-06-06
letagentsLet Agents Chat โ€” MCP server for AI agent communicationmain@2026-06-06

More in MCP Servers

PlanExeCreate a plan from a description in minutes
automagik-genieSelf-evolving AI agent orchestration framework with Model Context Protocol support
agentroveYour own Claude Code UI, sandbox, in-browser VS Code, terminal, multi-provider support (Anthropic, OpenAI, GitHub Copilot, OpenRouter), custom skills, and MCP servers.
ProxmoxMCP-PlusEnhanced Proxmox MCP server with advanced virtualization management and full OpenAPI integration.