
Spider


Website | Guides | API Docs | Chat

A web crawler and scraper, providing the building blocks for data curation workloads.

  • Concurrent
  • Streaming
  • Decentralization
  • Headless Chrome rendering
  • HTTP proxies
  • Cron jobs
  • Subscriptions (sketched below)
  • Smart mode
  • Anti-bot mitigation
  • Disk persistence
  • Privacy and efficiency through ad, analytics, and custom tiered network blocking
  • Blacklisting, whitelisting, and depth budgeting
  • Dynamic AI prompt scripting for headless workflows, with step caching
  • CSS/XPath scraping with spider_utils
  • HTML to markdown, text, and other transformations with spider_transformations
  • Changelog
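
The streaming and subscription features above work together: with the spider crate's sync feature enabled, Website::subscribe returns a broadcast receiver, so pages can be handled as they are fetched while the crawl runs concurrently. A minimal sketch, where the target URL and channel capacity are placeholders:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // Subscribe before crawling; the argument is the broadcast channel capacity.
    let mut rx = website.subscribe(16).unwrap();

    // Process pages as they stream in, concurrently with the crawl.
    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("fetched {}", page.get_url());
        }
    });

    website.crawl().await;

    // Dropping the channel lets the receiving task finish.
    website.unsubscribe();
    let _ = handle.await;
}
```

Unsubscribing after the crawl closes the channel, so the receiving task drains any remaining pages and exits cleanly.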

Getting Started

The simplest way to get started is the Spider Cloud hosted service. For local installation, see the spider or spider_cli directory. Spider is also available for Node.js via spider-nodejs and for Python via spider-py.
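
As a quick local sketch, a basic crawl with the spider crate looks like the following, using the crate's re-exported tokio runtime; https://example.com is a placeholder:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Configure a crawl rooted at the target domain.
    let mut website = Website::new("https://example.com");

    // Crawl the site, collecting every reachable link.
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```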

Benchmarks

See BENCHMARKS.

Examples

See EXAMPLES.

License

This project is licensed under the MIT license.

Contributing

See CONTRIBUTING.