
spider

Web crawler and scraper for Rust

README

Spider

The fastest web crawler and scraper for Rust.

Quick Start

# Cargo.toml
[dependencies]
spider = { version = "2", features = ["spider_cloud"] }

// src/main.rs
use spider::{
    configuration::{SpiderCloudConfig, SpiderCloudMode, SpiderCloudReturnFormat},
    tokio, // re-export
    website::Website,
};

#[tokio::main]
async fn main() {
    // Get your API key free at https://spider.cloud
    let config = SpiderCloudConfig::new("YOUR_API_KEY")
        .with_mode(SpiderCloudMode::Smart)
        .with_return_format(SpiderCloudReturnFormat::Markdown);

    let mut website = Website::new("https://example.com")
        .with_limit(10)
        .with_spider_cloud_config(config)
        .build()
        .unwrap();

    let mut rx = website.subscribe(16);

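    // Keep the handle so the task can be awaited before `main` returns.
    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            let url = page.get_url();
            let markdown = page.get_content();
            let status = page.status_code;

            println!("[{status}] {url}\n---\n{markdown}\n");
        }
    });

    website.crawl().await;
    // Closing the subscription ends the receive loop above.
    website.unsubscribe();
    // Wait for the printer task to drain any pages still in the channel.
    let _ = handle.await;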
}

Also supports headless Chrome, WebDriver, and AI automation.
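The receive loop in the Quick Start runs until the channel closes. As a rough, std-only analogue of that pattern (not spider's API), the sketch below swaps the async broadcast channel for `std::sync::mpsc` and a thread. `PageStub` and `stream_pages` are hypothetical names introduced only for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a crawled page (not spider's `Page` type).
#[derive(Debug, Clone)]
pub struct PageStub {
    pub url: String,
    pub status_code: u16,
    pub content: String,
}

/// Sends every page through a channel to a consumer thread and returns
/// the lines the consumer formatted, in arrival order.
pub fn stream_pages(pages: Vec<PageStub>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<PageStub>();

    // Consumer: mirrors the `while let Ok(page) = rx.recv()` loop.
    let consumer = thread::spawn(move || {
        let mut lines = Vec::new();
        while let Ok(page) = rx.recv() {
            lines.push(format!("[{}] {}", page.status_code, page.url));
        }
        lines
    });

    // Producer: stands in for the crawl that emits pages.
    for page in pages {
        tx.send(page).expect("consumer is still running");
    }

    // Dropping the sender plays the role of unsubscribing: `recv` now
    // returns `Err`, so the consumer loop ends.
    drop(tx);

    // Joining guarantees every page was handled before returning.
    consumer.join().expect("consumer thread panicked")
}
```

Joining the consumer before returning is the design point here: without it, the producer side could finish while pages are still queued.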

Install

Package | Command
spider | cargo add spider
spider_cli | cargo install spider_cli
spider-nodejs | npm i @spider-rs/spider-rs
spider-py | pip install spider_rs
Spider Cloud | Managed crawling, free credits on signup

License

MIT

Release History

Version | Changes | Urgency | Date
v2.48.13 | ## What's New **`spider authenticate` command** - Store your [Spider Cloud](https://spider.cloud) API key locally for remote crawls. ### Usage ```sh # Authenticate (stores key in ~/.spider/credentials) spider authenticate sk-your-key spider auth # alias, interactive prompt # Crawl via Spider Cloud (key auto-loaded) spider crawl -u https://example.com -o # Choose cloud mode spider crawl -u https://example.com --spider-cloud-mode smart -o spider crawl -u https://example.com --spider-cloud-mo | Medium | 3/31/2026
v2.48.4 | New `spider_mcp` crate - MCP server for Spider. ```bash cargo install spider_mcp ``` Setup (Claude Code `~/.claude/settings.json` or Claude Desktop config): ```json { "mcpServers": { "spider": { "command": "spider-mcp" } } } ``` Usage examples: ``` Scrape a page: "Fetch https://example.com as markdown" Crawl a site: "Crawl https://example.com up to 5 pages" Extract links: "Get all links from https://example.com" Transform HTML: "Convert this HTML to markdown: <h1>Hel | Medium | 3/25/2026
v2.48.2 | Race alternative browser engines alongside your primary crawl. Best HTML wins. ```rust use spider::configuration::{BackendEndpoint, BackendEngine, ParallelBackendsConfig}; let mut website = Website::new("https://example.com"); website.configuration.parallel_backends = Some(ParallelBackendsConfig { backends: vec![BackendEndpoint { engine: BackendEngine::LightPanda, endpoint: Some("ws://127.0.0.1:9222".to_string()), binary_path: None, protocol: None, }], | Medium | 3/25/2026
v2.47.75 | ## What's New - **PageData & Crawler trait abstractions** for extensible crawl pipelines - **Proxy support for LLM HTTP requests** (#378) - **Chrome remote_addr** via CDP `Network.responseReceived` - **Remote cache for Chrome responses** - dump & fallback support ## Performance - SIMD-accelerated byte scanning (memchr), unrolled FNV hash - Trie: `Box<str>` keys + manual byte-walk + memchr dot scan - Bloom filter bitmask addressing + inline early-exit - Zero-alloc DNS cache hits via `Arc<[Sock | Low | 3/20/2026
v2.47.51 | - NUMA thread pinning for multi-socket servers (`numa` feature) - zerocopy wire parsing for HTTP status lines, cache headers, DNS records (`zero_copy` feature) | Low | 3/19/2026
v2.47.50 | Zero-copy page passing (bytes::Bytes), mmap+hugepages bloom filter for URL dedup (`bloom` feature). | Low | 3/19/2026
v2.47.24 | io_uring TCP connect + lightweight background runtime - io_uring TCP connect: Socket + Connect opcodes for kernel-async TCP connects via the existing uring worker - Lightweight background runtime: Drops from multi-thread to current-thread tokio executor when io_uring is active - Public API: uring_fs::tcp_connect(addr), uring_fs::is_uring_enabled() - CI fixes: clippy unnecessary_cast, io_other_error, cargo fmt **Full Changelog**: https://github.com/spider-rs/spider/compare/ | Low | 3/15/2026
v2.45.28 | ### Agent Hardening - Cap LLM-controlled durations (Wait, ClickHold, SetViewport, OpenPage) - Add `js_escape()` for safe JS string interpolation in action handlers - Wrap `Navigate` and screenshot calls with timeouts - Use `PageWaitStrategy::Load` for `WaitForNavigation` instead of fixed sleep - Replace `eval_with_timeout` for Fill/Type/Clear actions with error propagation - Improve semaphore and logging diagnostics on error paths | Low | 3/2/2026
v2.45.24 | ## What's New ### Performance - **Cache-first fast path** - skip browser/HTTP entirely when cache has data (~5-50ms vs 1-3s) - **Deferred Chrome** - process multi-page crawls from cache before launching a browser - **Work-stealing (hedged requests)** - parallel retry for slow crawl requests - **io_uring** - StreamingWriter for high-throughput file I/O on Linux ### Agent - **Per-round model pool routing** - route cheap rounds to fast models, complex rounds to capable ones - **Comprehensive rout | Low | 2/21/2026
v2.45.20 | ## What's New ### Relevance Gate for Remote Multimodal Crawling Added a `relevance_gate` config that instructs the LLM to return a `"relevant": true|false` field in its JSON response. When a page is deemed irrelevant, its wildcard budget credit is refunded so the crawler discovers more relevant content. **New config fields:** - `relevance_gate: bool` - enables the feature - `relevance_prompt: Option<String>` - optional custom relevance criteria **How it works:** 1. When enabled, t | Low | 2/5/2026
v2.44.13 | ## What's New - **Spider Cloud integration** (`spider_cloud` feature) - optional proxy rotation, anti-bot bypass, and intelligent fallback via [spider.cloud](https://spider.cloud) - Modes: Proxy, Api, Unblocker, Fallback, Smart - Smart mode auto-detects Cloudflare challenges, CAPTCHAs, and bot protection then retries via `/unblocker` - **S3 skills loading** (`skills_s3` feature) - load agent skills from S3-compatible storage (AWS, MinIO, R2) - CLI: `--spider-cloud-key` and `--spider-cloud-m | Low | 2/5/2026
v2.43.20 | ## Spider v2.43.20 ### Changes - **fix(spider)**: Fix doctest and update chromey for adblock compatibility - **fix(search)**: Use reqwest::Client directly for cache feature compatibility - **chore(spider)**: Update spider_agent dependency to 0.4 ### spider_agent Integration The `agent` feature now uses spider_agent v0.4.0, which includes: - Smart caching with size-aware LRU eviction - High-performance chain execution with parallel step support - Batch processing for multiple items - Prefetch | Low | 2/3/2026
spider_agent-v0.4.0 | ## Spider Agent v0.4.0 ### Performance Optimizations This release adds several performance optimizations for automation workflows: #### Smart Caching - **SmartCache**: Size-aware LRU cache with automatic cleanup - Bounded memory usage with configurable limits - TTL-based expiration - Automatic cleanup on memory pressure - Statistics tracking (hits, misses, evictions) #### High-Performance Execution - **ChainExecutor**: Parallel step execution for automation chains - Analyzes depend | Low | 2/3/2026
v2.43.18 | ## Features ### Web Search Integration Add web search capabilities to Spider's RemoteMultimodalEngine with support for multiple search providers. #### Supported Providers - **Serper** (`search_serper`) - Google SERP API - **Brave** (`search_brave`) - Privacy-focused search - **Bing** (`search_bing`) - Microsoft Bing Web Search - **Tavily** (`search_tavily`) - AI-optimized search #### New Methods - `search()` - Search the web and return structured results - `search_and_extract()` - Search + fe | Low | 2/2/2026
v2.43.13 | ## 🤖 Advanced Agentic Automation Features This release adds comprehensive agentic automation capabilities to spider, making it a powerful tool for autonomous web interactions. ### Phase 1: Simplified Agentic APIs - `act(page, instruction)` - Execute single actions with natural language - `observe(page)` - Analyze page state and get structured observations - `extract_page(page, prompt, schema)` - Extract structured data from pages - `AutomationMemory` - In-memory state management for multi-rou | Low | 2/2/2026
v2.43.3 | ## Bug Fix - **fix(automation)**: Improve `best_effort_parse_json_object` parsing to handle LLM responses with reasoning text before JSON code blocks - Find ```json blocks anywhere in response (not just at boundaries) - Support JSON arrays in addition to objects - Better fallback parsing for various LLM response formats **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.43.2...v2.43.3 | Low | 2/2/2026
v2.43.2 | ## New Feature: Extraction Schema Support Add JSON Schema support for structured extraction in `RemoteMultimodalEngine`. ### `ExtractionSchema` Struct ```rust pub struct ExtractionSchema { pub name: String, // Schema name (e.g., "products") pub description: Option<String>, // What to extract pub schema: String, // JSON Schema definition pub strict: bool, // Enforce strict adherence } ``` ### Example Usage ```rust use spider::features::automation: | Low | 2/2/2026
v2.43.1 | ## Bug Fix - **fix(page)**: Add missing `remote_multimodal_usage` and `extra_remote_multimodal_data` fields to the decentralized `Page` struct for feature parity with the standard `Page` struct. **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.43.0...v2.43.1 | Low | 2/1/2026
v2.43.0 | ## What's New ### Token Usage Tracking for RemoteMultimodalEngine The remote multimodal automation engine now tracks and returns token usage conforming to the OpenAI API format: - `AutomationUsage` struct with `prompt_tokens`, `completion_tokens`, `total_tokens` - Usage is accumulated across all inference rounds - Stored on `Page.remote_multimodal_usage` ### Extraction Support New extraction capabilities for RemoteMultimodalEngine, similar to the OpenAI integration: - `extra_ai_data` - Ena | Low | 2/1/2026
v2.42.0 | ## WebDriver Support Full W3C WebDriver protocol support via `thirtyfour` crate for Selenium Grid, remote browsers, and cross-browser testing. ```rust use spider::website::Website; use spider::features::webdriver_common::{WebDriverConfig, WebDriverBrowser}; let mut website = Website::new("https://example.com") .with_webdriver( WebDriverConfig::new() .with_server_url("http://localhost:4444") .with_browser(WebDriverBrowser::Chrome) .with_headless( | Low | 2/1/2026
v2.41.1 | # v2.41.0 - WebDriver Support This release adds WebDriver support via the `thirtyfour` crate, enabling browser automation using the W3C WebDriver protocol. Connect to remote Selenium Grid, chromedriver, geckodriver, and more. | Low | 2/1/2026
v2.40.2 | ## What's Changed Solve web challenges, perform actions, and more with remote multimodal iterative automation. - **Remote Multimodal Engine** for Chrome automation using vision + LLM - Iterative automation loop: capture → infer plan → execute → re-capture → repeat - Unified `RemoteMultimodalConfigs` to configure: - API endpoint - Model selection - Prompts - Retry behavior - Capture strategies - Strict JSON automation plans: `{ "label": "...", "done": true|false, "ste | Low | 1/23/2026
v2.39.14 | ## What's Changed This release brings built-in Chrome Gemini Nano support and remote vision support. * Add `with_on_should_crawl_callback_closure` by @WilliamVenner in https://github.com/spider-rs/spider/pull/346 * feat(solver): add built in gemini nano support ## New Contributors * @WilliamVenner made their first contribution in https://github.com/spider-rs/spider/pull/346 **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.38.122...v2.39.14 | Low | 1/16/2026
v2.38.122 | ## What's Changed * fix(chrome): add automatic chrome executable detection by @yebei199 in https://github.com/spider-rs/spider/pull/343 * feat(gemini): add Gemini AI support for dynamic browser scripting by @swistaczek in https://github.com/spider-rs/spider/pull/344 * chore(smart): add mismatch cypher retry ## New Contributors * @yebei199 made their first contribution in https://github.com/spider-rs/spider/pull/343 * @swistaczek made their first contribution in https://github.com/spider-rs/spider- | Low | 1/2/2026
v2.38.109 | # What's Changed Fix smart mode lifecycles loading. **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.38.68...v2.38.109 | Low | 12/26/2025
v2.38.70 | # What's Changed Fix smart mode re-rendering document content. **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.38.44...v2.38.70 | Low | 12/7/2025
v2.38.46 | ## What's Changed * fix "real_browser" disabled by @rumpl in https://github.com/spider-rs/spider/pull/336 * fix builder methods wait for * fix headless http -> https upgrade cf * fix smart mode re-render tracking and content forwarding ## New Contributors * @rumpl made their first contribution in https://github.com/spider-rs/spider/pull/336 **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.37.180...v2.38.46 | Low | 12/5/2025
v2.37.180 | ## What's Changed * spider_cli: fix duplicated argument -r by @zazolabs in https://github.com/spider-rs/spider/pull/324 * chore(chrome): fix compile [#328] * spider_cli: fix download files url empty parse * feat(spider): add `with_max_bytes_allowed` to track global browser context bytes for session * chore(cli): add proxy_url [#330] ## New Contributors * @zazolabs made their first contribution in https://github.com/spider-rs/spider/pull/324 **Full Changelog**: https://github.com/sp | Low | 11/1/2025
v2.37.159 | # What's Changed Builder methods to bind local_address and network. * fix duration tracking [#304] * fix network interface platform checking **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.37.119...v2.37.159 | Low | 7/6/2025
v2.37.122 | ## What's Changed Major spoof emulations for chrome moved to `spider_fingerprint`. * chore(fingerprint): add navigator.hardwareConcurrency spoof * chore(examples): fix anti_bot with_user_agent * chore(fingerprint): fix device_pixel_ratio mac defaults * chore(fingerprint): fix hide chrome * chore(fingerprint): prep fingerprint canvas noise * chore(fingerprint): add profiles start * chore(fingerprint): add env section * chore(fingerprint): fix userAgentData getHighEntropyValues * ch | Low | 5/17/2025
v2.37.18 | # What's Changed The page streaming rewriter now handles built-in metadata extraction by default. You can access it by using `page.metadata` or `page.get_metadata()`. Some of the metadata properties are unused placeholders. * feat(page): add metadata extracting * chore(chrome): fix concurrent context creation * chore(chrome): bump cdp revision 1457408 ```rust /// Page-level metadata extracted from HTML. pub struct Metadata { /// The meta title pub title: Option<com | Low | 5/7/2025
v2.36.123 | # What's Changed Major fix for HTTP and smart mode requests, which added a Host header that prevented proper redirects. Fix OpenAI automation usage. * chore(website): fix client host header * chore(chrome,sitemap): fix sitemap handling xml * feat(antibot): add antibot detection * chore(chrome): fix viewport browser handling pages * chore(chrome): fix fingerprint execution script * chore(sitemap): add auto sitemap adding whitelisting **Full Changelog**: https://github.com/spider-rs/spider/com | Low | 4/18/2025
v2.36.67 | # What's Changed Fix XML parsing of initial links. * chore(real_browser,chrome): add missing chrome headers * chore(chrome): add real browser loading * chore(chrome): fix request_timeout default * chore(chrome): fix timeout subtracting **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.35.18...v2.36.67 | Low | 3/29/2025
v2.35.18 | # What's Changed * The [rquest](https://github.com/0x676e67/rquest) client support with the `rquest` feature flag. * New `website.with_emulation` for `rquest` emulation. * Bug fixes and improvements with chrome request timeout handling. ```rust /// Set the request emulation. This method does nothing if the `rquest` flag is not enabled. pub fn with_emulation(&mut self, emulation: Option<rquest_util::Emulation>) -> &mut Self { self.configuration.with_emulation(emulati | Low | 3/26/2025
v2.34.5 | # What's Changed Get a map of the requests and responses sent for headless. Responses: bytes transferred Requests: mono time Example: mapping ```json { "response_map": { "https://spider.cloud/_astro/page.V2R8AmkL.js": 0.0, "https://spider.cloud/_astro/FaqSection.93yW76zV.js": 0.0, "https://spider.cloud/_astro/AuthDropdownMarketing.BtXgMRKz.js": 0.0, "https://spider.cloud/fonts/berkeley-mono/WEB/BerkeleyMono-Italic.woff2": 0. | Low | 3/19/2025
v2.33.11 | # What's Changed Add `Website::with_crawl_timeout` builder method to add a max timeout for the crawl. This is useful when robots.txt can change the expected crawl durations. Example: ```rust use std::time::Duration; use spider::tokio; use spider::website::Website; use tokio::io::AsyncWriteExt; #[tokio::main] async fn main() { let mut website: Website = Website::new("https://spider.cloud").with_crawl_timeout(Some(Duration::from_millis(10))).build().unwrap(); let mut rx2 | Low | 3/14/2025
v2.33.1 | # What's Changed Remove `jemalloc` flag. This should be done at the top level of main. Add asset support for chrome media request. - feat(chrome): add asset handling pages [#275] **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.32.6...v2.33.1 | Low | 3/9/2025
v2.32.9 | # What's Changed Two new methods for thread-safe crawling and bootstrapping setup: `website.crawl_chrome_send` and `website.crawl_raw_send`. **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.32.6...v2.32.9 | Low | 3/8/2025
v2.32.6 | # What's Changed Chrome performance improved when deserializing the page and removed unused Bytes wrapper. Add items to the Chrome block list. **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.31.8...v2.32.6 | Low | 3/7/2025
v2.31.8 | # What's Changed 1. chore(chrome): fix wait_for events sequence 1. chore(chrome): add navigation network cancel **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.31.4...v2.31.8 | Low | 3/5/2025
v2.31.4 | # What's Changed Chrome now manages document reloading to prevent infinite page reloading through scripting. The `firewall` feature flag now also enables firewall protection via networking on Chrome, for an improved ad, tracking, and malicious website blocker. * chore(chrome): add infinite loop document reload protection * chore(chrome): add to block list * chore(chrome): add firewall feature flag. * perf(conf): remove box indirection proxies, whitelist, and blacklist * chore(chrome): | Low | 2/23/2025
v2.30.3 | # What's Changed Use the feature flag `firewall` to protect against malicious websites. Smart mode now lazy loads Chrome rendering. * feat(firewall): add start of spider_firewall * chore(smart): fix missing bytes transferred * feature(smart): add lazy load chrome * perf(bytes): remove BytesMut **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.27.66...v2.30.3 | Low | 2/16/2025
v2.27.66 | ## What's Changed * chore(cli): trigger help page on missing arguments by @pwnwriter in https://github.com/spider-rs/spider/pull/265 * chore(chrome): add connection retry ws * chore(smart): add initial http fallback * chore(website): add direct proxy control * chore(website): fix scrape hang [#268] ## New Contributors * @pwnwriter made their first contribution in https://github.com/spider-rs/spider/pull/265 **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.27.50...v2 | Low | 2/13/2025
v2.27.50 | # What's Changed Web pages are now normalized to prevent duplicate content, crawl traps, and similar pages from being crawled repeatedly. We can now crawl websites that target ports outside 80 and 443. 1. feat(page): add relative directory url handling 1. chore(website): fix relative page merging links 1. chore(serde): fix cron compile configuration 1. chore(chrome): update tokio-tungstenite@0.26 1. chore(page): add port validation links 1. chore(website): fix signature compile non disk featu | Low | 1/25/2025
v2.26.27 | # What's Changed 1. add auto find sitemap url on 404 or network error. 2. fix chrome_cache_hybrid compile. 3. add `cache_chrome_hybrid_mem` flag to use memory instead of disk. 4. fix q draining across website methods 5. fix crawl depth handling 6. fix worker init background connect 7. add proper status code from errors **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.26.1...v2.26.27 | Low | 1/18/2025
v2.26.1 | # What's Changed This release brings performance improvements by skipping URL parsing per page. You can now also pass in a second param to the page link methods to collect the links with a new domain target. Targeting the correct root domain for parsing the links is now handled across features. If you used `page::Page::take_url` directly you may need to call `page::Page::set_url_parsed_direct_empty()` first or the `page::Page::get_url_parsed()` method. 1. perf(cli): add page links dire | Low | 1/11/2025
v2.24.15 | # What's Changed Add a callback to perform validation using [spider::page::Page](https://docs.rs/spider/latest/spider/page/struct.Page.html). You can now use the `basic` feature flag to easily disable io-uring on Linux and still get the default features with `"default-features = false"`. 1. feat(website): add on_should_crawl_callback [#241] 1. feat(page): add blocked_crawl [#242] 1. chore(disk): fix cfg aho_corasick 1. chore(fs): remove tendril crate 1. chore(page): fix crawling initia | Low | 1/4/2025
v2.23.7 | # What's Changed Linux now uses [io_uring](https://github.com/tokio-rs/io-uring) for the DNS connect phase. If you do not have a recent version of Linux installed, disable the `io_uring` feature flag. * feat(io_uring): add io_uring for connect_phase linux * chore(fs): fix feature flag compile fs **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.22.19...v2.23.7 | Low | 12/31/2024
v2.22.19 | # What's Changed This release brings in SQLite for improved memory handling with the feature flags `disk_native_tls`, `disk`, and `disk_aws`. SQLite is used in a hybrid manner with memory in order to maintain performance. With disk handling and our string interning, crawled URLs can enter the billions of resources, or grow without limit with EFS attached. ## Other Changes * chore(website,page): fix concurrent initial scoped access to `lazy_static!` * chore(chrome): add more network | Low | 12/24/2024
v2.21.33 | # What's Changed Fix http crawling past first page. Fix safe handling abs urls. **Full Changelog**: https://github.com/spider-rs/spider/compare/v2.21.27...v2.21.33 | Low | 12/18/2024
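The v2.43.3 entry describes finding a fenced json block anywhere in an LLM response, plus array support and fallback parsing. The sketch below is one way to picture that idea using only the standard library. It is not spider's `best_effort_parse_json_object`; `extract_json_block` and `best_effort_json` are hypothetical names, and the code only locates the candidate text without validating it as JSON.

```rust
/// Returns the contents of the first fenced `json` block in `response`,
/// or `None` if there is no complete block.
pub fn extract_json_block(response: &str) -> Option<&str> {
    // Built at runtime so this example does not embed a literal fence.
    let fence = "`".repeat(3);
    let open = format!("{fence}json");

    // The block may start anywhere, e.g. after reasoning text.
    let start = response.find(&open)? + open.len();
    let rest = &response[start..];
    let end = rest.find(&fence)?;
    Some(rest[..end].trim())
}

/// Hypothetical fallback: prefer a fenced block, otherwise take the span
/// from the first `{` or `[` to the last matching closing character.
pub fn best_effort_json(response: &str) -> Option<&str> {
    if let Some(block) = extract_json_block(response) {
        return Some(block);
    }
    let start = response.find(|c: char| c == '{' || c == '[')?;
    let close = if response[start..].starts_with('{') { '}' } else { ']' };
    let end = response.rfind(close)?;
    if end < start {
        return None;
    }
    Some(&response[start..=end])
}
```

A real parser would hand the returned slice to a JSON library and fall through to the next strategy if parsing fails.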
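The v2.45.20 entry says an irrelevant page has its budget credit refunded so the crawl can reach more relevant content. A minimal sketch of that accounting, assuming a simple page-count budget, might look like the following. `CrawlBudget` and `crawl_with_gate` are hypothetical, and the boolean in each candidate stands in for the `"relevant"` field the LLM would return.

```rust
/// Hypothetical page budget; not spider's internal budget type.
pub struct CrawlBudget {
    limit: u32,
    remaining: u32,
}

impl CrawlBudget {
    pub fn new(limit: u32) -> Self {
        Self { limit, remaining: limit }
    }

    /// Spends one credit; returns false when the budget is exhausted.
    pub fn try_spend(&mut self) -> bool {
        if self.remaining == 0 {
            return false;
        }
        self.remaining -= 1;
        true
    }

    /// Gives one credit back, never exceeding the original limit.
    pub fn refund(&mut self) {
        self.remaining = (self.remaining + 1).min(self.limit);
    }
}

/// Visits candidates in order and returns the URLs judged relevant.
/// Each tuple is `(url, relevant)`.
pub fn crawl_with_gate<'a>(candidates: &[(&'a str, bool)], limit: u32) -> Vec<&'a str> {
    let mut budget = CrawlBudget::new(limit);
    let mut kept = Vec::new();
    for &(url, relevant) in candidates {
        if !budget.try_spend() {
            break; // budget exhausted, stop crawling
        }
        if relevant {
            kept.push(url);
        } else {
            budget.refund(); // irrelevant page does not count
        }
    }
    kept
}
```

With a limit of 2 and the sequence relevant, irrelevant, irrelevant, relevant, the two irrelevant pages are refunded, so both relevant pages fit in the budget.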
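Several entries mention a bloom filter for URL dedup, FNV hashing, and bitmask addressing (v2.47.50, v2.47.75). The following is a small in-memory sketch of those ideas, not the crate's mmap-backed implementation. `UrlBloom`, its four probes, and the seed constant are assumptions chosen for illustration.

```rust
/// 64-bit FNV-1a with a seed mixed into the offset basis.
fn fnv1a(bytes: &[u8], seed: u64) -> u64 {
    let mut hash = 0xcbf29ce484222325u64 ^ seed;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical bloom filter sized to a power of two, so a bit index is
/// computed with a mask instead of a modulo.
pub struct UrlBloom {
    bits: Vec<u64>,
    mask: u64,
}

impl UrlBloom {
    /// Creates a filter with `2^log2_bits` bits (clamped to 6..=30).
    pub fn new(log2_bits: u32) -> Self {
        let log2_bits = log2_bits.clamp(6, 30);
        let num_bits = 1u64 << log2_bits;
        Self {
            bits: vec![0; (num_bits / 64) as usize],
            mask: num_bits - 1,
        }
    }

    /// Records `url` and returns true if it was possibly seen before.
    /// False positives are possible; false negatives are not.
    pub fn check_and_insert(&mut self, url: &str) -> bool {
        let h1 = fnv1a(url.as_bytes(), 0);
        // Odd step so successive probes differ (double hashing).
        let h2 = fnv1a(url.as_bytes(), 0x9e3779b97f4a7c15) | 1;
        let mut seen = true;
        for i in 0..4u64 {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) & self.mask;
            let word = (bit >> 6) as usize;
            let flag = 1u64 << (bit & 63);
            if self.bits[word] & flag == 0 {
                seen = false;
                self.bits[word] |= flag;
            }
        }
        seen
    }
}
```

A crawler would call `check_and_insert` before queueing a link and skip the link when it returns true, accepting a small false-positive rate in exchange for constant memory.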
