# Spider

Website | Guides | API Docs | Chat
A web crawler and scraper, building blocks for data curation workloads.
- Concurrent (see the sketch below)
- Streaming
- Decentralization
- Headless Chrome Rendering
- HTTP Proxies
- Cron Jobs
- Subscriptions
- Smart Mode
- Anti-Bot mitigation
- Disk persistence
- Privacy and Efficiency through Ad, Analytics, and Custom Tiered Network Blocking
- Blacklisting, Whitelisting, and Budgeting Depth
- Dynamic AI Prompt Scripting Headless with Step Caching
- CSS/XPath Scraping with spider_utils
- HTML to markdown, text, and other transformations with spider_transformations
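
As a taste of the concurrent crawling listed above, here is a minimal sketch using the spider crate's `Website` type; the target URL is a placeholder, and the snippet assumes the crate's re-exported tokio runtime.

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Crawl a site concurrently; the URL here is a placeholder.
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    // Print every link discovered during the crawl.
    for link in website.get_links() {
        println!("{}", link.as_ref());
    }
}
```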
## Getting Started
The simplest way to get started is the Spider Cloud hosted service. For local installation, see the spider or spider_cli directories. You can also use spider from Node.js via spider-nodejs and from Python via spider-py.
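
Once installed locally, pages can also be streamed as they are fetched. The following is a hedged sketch of the subscription flow, assuming the spider crate's `sync` feature (enabled by default), which exposes `Website::subscribe`; the URL and channel capacity are placeholders.

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // Subscribe before crawling; pages stream in as they are fetched.
    // The capacity (16) is an arbitrary example value.
    let mut rx = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{}", page.get_url());
        }
    });

    website.crawl().await;

    // Drop the sender side so the receiving task can finish.
    website.unsubscribe();
    let _ = handle.await;
}
```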
## Benchmarks
See BENCHMARKS.
## Examples
See EXAMPLES.
## License
This project is licensed under the MIT license.
## Contributing
See CONTRIBUTING.