CrawlerScraper Documentation
Complete guide to the CrawlerScraper platform -- a full-featured web scraping, crawling, and data extraction system with LLM-powered intelligence and autonomous operation.
Getting Started
CrawlerScraper provides a unified API and web UI for extracting data from websites. All API requests require authentication via an API key.
Authentication
Include your API key in every request using the X-API-Key header:
curl -H "X-API-Key: sk-your-key-here" https://provider.thefoundationreport.com/api/scrape
Get your API key from the API Keys page in the sidebar, or create one via the API:
POST /auth/register
{ "name": "My Name", "email": "[email protected]", "password": "..." }
# Returns: { "user": {...}, "api_key": "sk-..." }
Base URL
All API endpoints are available at: https://provider.thefoundationreport.com/api/
OpenAPI schema is at /api/docs and /api/openapi.json.
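As a minimal sketch of the authentication scheme above, the following Python helper builds (but does not send) an authenticated request using only the standard library. The helper name `build_request` and the choice of POST whenever a JSON body is present are illustrative assumptions, not part of the platform.

```python
import json
import urllib.request

BASE_URL = "https://provider.thefoundationreport.com/api"


def build_request(path, api_key, payload=None):
    """Build (but do not send) a request carrying the X-API-Key header.

    Hypothetical helper: POST is used when a JSON payload is given, GET otherwise.
    """
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"X-API-Key": api_key}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers=headers,
        method="POST" if data is not None else "GET",
    )


req = build_request("/scrape", "sk-your-key-here", {"url": "https://example.com"})
# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes the header logic easy to check without touching the network.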
Scraping
Extract content from any URL. The system uses a 5-engine fallback chain with anti-bot detection to maximize success.
POST /scrape
{
"url": "https://example.com",
"output_format": "markdown", // markdown | html | raw_html | text
"engine": null, // auto-select, or: curl | crawl4ai | stealth | nodriver | camoufox
"cache": true, // use Redis/DB cache (default true)
"auto_detect": false, // LLM-powered site analysis
"classify": false, // add content classification
"normalize": false, // normalize extracted data
"download_images": false,
"max_images": 20,
"timeout": 30
}
Engine Escalation Chain
| Engine | Type | Speed | JS Support | Anti-Bot |
|---|---|---|---|---|
| curl | HTTP client | Fastest | No | None |
| crawl4ai | Headless browser | Fast | Yes | Basic |
| stealth | Crawl4AI + magic | Medium | Yes | Good |
| nodriver | Undetected Chrome | Slow | Yes | Strong |
| camoufox | Firefox + fingerprint | Slowest | Yes | Strongest |
When no engine is specified, the system tries the engines in the order listed above, starting with curl. If anti-bot protection is detected (Cloudflare, DataDome, PerimeterX), it automatically escalates to the next engine in the chain.
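The escalation behaviour can be pictured with the following sketch. It is not the platform's implementation: the fetchers are stubs, and the anti-bot check (`looks_blocked`, which only searches the page for vendor names) is a deliberately crude assumption standing in for real detection.

```python
ENGINE_ORDER = ["curl", "crawl4ai", "stealth", "nodriver", "camoufox"]
ANTI_BOT_MARKERS = ("cloudflare", "datadome", "perimeterx")


def looks_blocked(html):
    """Crude stand-in for anti-bot detection: look for vendor names in the page."""
    lowered = html.lower()
    return any(marker in lowered for marker in ANTI_BOT_MARKERS)


def scrape_with_escalation(url, fetchers, engine=None):
    """Try engines in order; return (engine_name, html) for the first unblocked page."""
    order = [engine] if engine else ENGINE_ORDER
    for name in order:
        try:
            html = fetchers[name](url)
        except Exception:
            continue  # treat an engine error like a block and move to the next engine
        if not looks_blocked(html):
            return name, html
    raise RuntimeError(f"all engines failed for {url}")


# Stub fetchers standing in for the real engines (hypothetical).
fetchers = {name: (lambda url: "<html>Checking your browser - Cloudflare</html>")
            for name in ENGINE_ORDER}
fetchers["stealth"] = lambda url: "<html><h1>Products</h1></html>"

used_engine, page = scrape_with_escalation("https://example.com", fetchers)
```

In this run the first two stub engines return a challenge page, so the result comes from the third engine in the chain.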
Caching
Results are cached in Redis (1h TTL) and PostgreSQL (24h TTL). A repeat request for the same URL within the TTL is served from cache instead of being scraped again. Set cache: false to bypass the cache. Cache statistics are available at GET /cache/stats.
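One way to picture the two cache tiers is the in-memory sketch below. It assumes a lookup checks the short-lived tier first and falls back to the long-lived one; that ordering, the class name `TwoTierCache`, and the use of plain dicts in place of Redis and PostgreSQL are all assumptions made for illustration.

```python
import time


class TwoTierCache:
    """In-memory model of a short-TTL tier backed by a long-TTL tier."""

    def __init__(self, fast_ttl=3600, slow_ttl=86400, clock=time.time):
        self.fast_ttl = fast_ttl    # 1 hour, as for Redis
        self.slow_ttl = slow_ttl    # 24 hours, as for PostgreSQL
        self.clock = clock          # injectable so expiry can be simulated
        self.fast = {}              # stands in for Redis
        self.slow = {}              # stands in for PostgreSQL

    def set(self, url, result):
        now = self.clock()
        self.fast[url] = (now + self.fast_ttl, result)
        self.slow[url] = (now + self.slow_ttl, result)

    def get(self, url):
        """Return (tier_name, result), or (None, None) on a miss."""
        now = self.clock()
        for tier_name, tier in (("fast", self.fast), ("slow", self.slow)):
            entry = tier.get(url)
            if entry and entry[0] > now:
                return tier_name, entry[1]
        return None, None


# Simulated clock: advance time by assigning to fake_now[0].
fake_now = [0.0]
cache = TwoTierCache(clock=lambda: fake_now[0])
cache.set("https://example.com", "cached page")
```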
Extraction
Add an extraction field to extract structured data:
"extraction": {
"strategy": "css", // css | xpath | llm | auto
"schema_def": { // for CSS strategy
"base_selector": "div.product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
}
// Or use "auto" strategy -- LLM infers the schema for you
Crawling
Crawl multiple pages starting from a URL. Supports BFS traversal and LLM-powered intelligent crawling.
POST /crawl
{
"start_url": "https://example.com",
"max_pages": 10, // 1-100
"max_depth": 2, // 1-10
"url_pattern": null, // regex filter for URLs
"same_domain_only": true,
"respect_robots_txt": true,
"use_sitemap": false, // seed queue from sitemap.xml
"crawl_goal": null // enables LLM-powered link scoring
}
Intelligent Crawling
When crawl_goal is set, the crawler uses a priority queue instead of BFS. An LLM scores each link by relevance (0-1) to your goal, and the crawler follows the highest-scored links first. Low-relevance links (below 0.1) are skipped entirely.
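The priority-queue traversal described above can be sketched with Python's `heapq`. The scoring function here is a lookup table standing in for the LLM, and `crawl_order`, `get_links`, and `score_link` are hypothetical names; only the 0.1 cutoff and the highest-score-first ordering come from the description.

```python
import heapq

MIN_SCORE = 0.1  # links scoring below this are skipped entirely


def crawl_order(start_url, get_links, score_link, max_pages=10):
    """Return URLs in visit order, following the highest-scored links first."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties FIFO.
    queue = [(-1.0, 0, start_url)]
    seen = {start_url}
    visited = []
    counter = 1
    while queue and len(visited) < max_pages:
        _, _, url = heapq.heappop(queue)
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            score = score_link(link)
            if score < MIN_SCORE:
                continue  # low-relevance link: never queued
            heapq.heappush(queue, (-score, counter, link))
            counter += 1
    return visited


# Toy site graph and scores standing in for real pages and LLM output.
site = {"/": ["/blog", "/pricing", "/legal"], "/pricing": ["/pricing/enterprise"]}
scores = {"/blog": 0.3, "/pricing": 0.9, "/legal": 0.05, "/pricing/enterprise": 0.8}
order = crawl_order("/", lambda u: site.get(u, []), lambda u: scores.get(u, 0.0))
```

Unlike BFS, the crawler here visits `/pricing/enterprise` (depth 2, score 0.8) before `/blog` (depth 1, score 0.3), and never visits `/legal`.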
Sitemap Support
Set use_sitemap: true to discover pages from robots.txt sitemaps and sitemap.xml. Sitemap URLs are seeded into the crawl queue alongside link-discovered pages.
Search
Search the web using multiple providers with optional result scraping.
POST /search
{
"query": "web scraping best practices",
"provider": null, // auto-select, or: duckduckgo | tavily | serper
"search_type": "text", // text | news | images
"max_results": 10,
"region": null, // e.g. "us-en"
"time_range": null, // day | week | month | year
"scrape_results": false // scrape each result page for full content
}
DuckDuckGo is free and requires no API key. Tavily and Serper require keys set in environment variables.
Research
Autonomous multi-step research that searches, scrapes, and extracts structured data to fill your schema.
POST /research
{
"goal": "Compare top 5 CRM platforms for small businesses",
"output_schema": {
"name": "string",
"pricing": "string",
"features": ["string"],
"rating": "number",
"source_url": "string"
},
"strategy": "c", // a = deterministic, b = LLM-guided, c = hybrid
"max_results_per_query": 5,
"max_iterations": 3 // strategy B only
}
| Strategy | Speed | Cost | Quality | How It Works |
|---|---|---|---|---|
| A (Deterministic) | Fastest | Lowest | Basic | Goal as query, LLM only for extraction |
| B (LLM-Guided) | Slowest | Highest | Best | LLM generates queries, scores relevance, iterates |
| C (Hybrid) | Medium | Medium | Good | LLM generates queries, deterministic pipeline |
Reverse Image Search
Find where an image appears online using Google Lens, Yandex, and TinEye.
POST /reverse-image-search (multipart/form-data)
Fields:
image_url: "https://..." // OR
image_file: <file> // OR
image_base64: "data:..."
providers: "google,yandex" // google | yandex | tineye
max_results: 10
Intelligence (LLM-Powered)
These features require SCRAPER_LLM_API_KEY to be configured.
URL Analysis
Auto-detect site type and optimal scraping strategy. Uses a 25+ site registry (Amazon, GitHub, Reddit, etc.) with LLM fallback.
POST /intelligence/analyze-url
{ "url": "https://amazon.com/dp/B09V3KXJPB" }
// Returns: { site_type: "ecommerce", recommended_engine: "crawl4ai", confidence: "high" }
Schema Inference
Auto-generate extraction schemas from page content:
POST /intelligence/infer-schema
{ "url": "https://example.com/products", "goal": "extract product data" }
// Returns suggested ExtractionConfig + sample data preview
Natural Language Commands
Express scraping intent in plain English:
POST /intelligence/nl
{ "command": "scrape all product prices from books.toscrape.com" }
// Returns: { request_type: "scrape", request: {...}, explanation: "..." }
POST /intelligence/nl/execute // Parse AND execute in one call
{ "command": "search for AI startups in Europe" }
Content Classification
Set classify: true on scrape requests to get category, tags, language, sentiment, and entity extraction in the response.
Data Normalization
Set normalize: true to standardize extracted prices, dates, addresses, and phone numbers into consistent formats.
Chat Assistant
Interactive scraping sessions via conversation. The chat maintains context across messages -- scrape a URL, then ask to extract data from it, then export.
POST /chat
{ "message": "Scrape https://example.com", "session_id": null }
// Returns: { session_id: "uuid", message: "...", tool_used: "scrape", tool_result: {...} }
Available tools in chat: scrape, crawl, search, extract, export, analyze, infer_schema.
Sessions are stored in Redis with a 30-minute TTL. Retrieve history via GET /chat/{session_id}/history.
Autonomous Research Agents
Agents plan multi-step research, execute it, adapt on failure, and synthesize findings into a final report.
POST /agents/research
{
"goal": "Comprehensive comparison of 5 CRM platforms",
"output_format": "report", // structured | report | comparison_table
"budget": {
"max_pages": 50,
"max_llm_calls": 20,
"max_time_minutes": 10
}
}
// Returns: { job_id: "uuid", status: "pending" }
// Poll GET /jobs/{job_id} or use /jobs/{job_id}/stream for SSE
The agent follows 4 phases: Planning (LLM generates 3-8 step plan), Execution (search, scrape, extract), Adaptation (re-plan on failures), Synthesis (consolidate into report with key findings).
Background Jobs
Long-running tasks (crawl, research, agent) run as background jobs via the arq worker queue.
| Endpoint | Description |
|---|---|
| POST /jobs/crawl | Submit background crawl job |
| POST /jobs/research | Submit background research job |
| POST /jobs/bulk-scrape | Scrape multiple URLs in background |
| GET /jobs | List your jobs (filter by status) |
| GET /jobs/{id} | Get job status + progress |
| DELETE /jobs/{id} | Cancel a running job |
| GET /jobs/{id}/stream | SSE progress stream |
| GET /jobs/{id}/export?format=csv | Export job results |
Job statuses: pending (queued), running (in progress), completed, failed, cancelled.
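A polling client only needs to know which statuses are terminal. The sketch below assumes the job object returned by GET /jobs/{id} carries a `status` field with the values listed above; `wait_for_job` and `fetch_job` are hypothetical names, and the fetch function is injected so the loop can be exercised without a server.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(job_id, fetch_job, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll fetch_job(job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(interval)  # still pending or running: wait before the next poll
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")


# Canned responses standing in for successive GET /jobs/{id} calls.
responses = iter([{"status": "pending"}, {"status": "running"}, {"status": "completed"}])
final = wait_for_job("demo-job", lambda job_id: next(responses), sleep=lambda seconds: None)
```

For long jobs the SSE stream avoids repeated polling; this loop is the simpler fallback.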
Schedules & Change Detection
Create recurring scrape schedules with automatic content change detection.
POST /schedules
{
"name": "Monitor pricing page",
"schedule_type": "scrape", // scrape | crawl | research
"cron_expression": "0 * * * *", // every hour
"request_data": { "url": "https://example.com/pricing", "output_format": "text" }
}
After each run, the system compares content hashes with the previous snapshot. If changed, a diff is generated and a change_detected event is dispatched to webhooks.
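Hash-based change detection can be sketched as follows. The use of SHA-256 and a unified diff are assumptions chosen for illustration; the platform's actual hash function and diff format are not specified here.

```python
import difflib
import hashlib


def content_hash(text):
    """Hash the snapshot text (SHA-256 is an assumption, not a documented choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_change(previous, current):
    """Return None if the snapshots match, else a unified diff as a string."""
    if content_hash(previous) == content_hash(current):
        return None
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)


old_snapshot = "Pro plan: $20/month\nTeam plan: $50/month"
new_snapshot = "Pro plan: $25/month\nTeam plan: $50/month"
diff = detect_change(old_snapshot, new_snapshot)
```

Comparing hashes first keeps the common no-change case cheap; the diff is only computed when the hashes differ.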
| Endpoint | Description |
|---|---|
| POST /schedules | Create schedule |
| GET /schedules | List all schedules |
| PUT /schedules/{id} | Update (name, cron, pause/resume) |
| DELETE /schedules/{id} | Delete schedule |
| GET /schedules/{id}/snapshots | View change history |
| GET /schedules/{id}/diff/{snap_id} | View specific diff |
Common cron expressions: 0 * * * * (hourly), 0 9 * * * (daily 9am), 0 9 * * 1 (weekly Monday), */30 * * * * (every 30 min).
Data Pipelines
Route scraped data to external destinations automatically.
| Destination | Config |
|---|---|
| Webhook | {"url": "https://..."} |
| S3 / MinIO | {"bucket": "...", "access_key": "...", "secret_key": "..."} |
| Database | {"connection_url": "postgresql+asyncpg://...", "table": "..."} |
| Google Sheets | {"spreadsheet_id": "...", "service_account": {...}} |
Pipelines support transforms: field mapping (field_map), field filtering (include_fields, exclude_fields), and row filtering (filters with operators eq/ne/gt/lt/contains).
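To make the transform options concrete, here is a sketch of how they might compose. The filter shape (`{"field", "op", "value"}`) and the order of application (row filters, then field filtering, then renaming) are assumptions; only the option names and operators come from the description above.

```python
OPERATORS = {
    "eq": lambda a, b: a == b,
    "ne": lambda a, b: a != b,
    "gt": lambda a, b: a > b,
    "lt": lambda a, b: a < b,
    "contains": lambda a, b: b in a,
}


def passes(row, flt):
    """A row missing the filtered field never passes."""
    if flt["field"] not in row:
        return False
    return OPERATORS[flt["op"]](row[flt["field"]], flt["value"])


def apply_transforms(rows, field_map=None, include_fields=None,
                     exclude_fields=None, filters=None):
    """Filter rows, drop unwanted fields, then rename fields via field_map."""
    result = []
    for row in rows:
        if not all(passes(row, flt) for flt in (filters or [])):
            continue
        kept = {
            key: value
            for key, value in row.items()
            if (include_fields is None or key in include_fields)
            and key not in (exclude_fields or [])
        }
        result.append({(field_map or {}).get(key, key): value
                       for key, value in kept.items()})
    return result


rows = [
    {"name": "test", "value": 42, "debug": "x"},
    {"name": "other", "value": 7, "debug": "y"},
]
shipped = apply_transforms(
    rows,
    field_map={"name": "title"},
    exclude_fields=["debug"],
    filters=[{"field": "value", "op": "gt", "value": 10}],
)
```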
POST /pipelines/{id}/trigger // Manual trigger with data
{ "data": [{"name": "test", "value": 42}] }
Webhooks & Notifications
Receive HTTP callbacks when events occur.
POST /webhooks
{
"url": "https://your-server.com/hook",
"events": ["change_detected", "job_completed", "job_failed"]
}
Webhook payloads are signed with HMAC-SHA256. Verify using the X-Webhook-Signature header (format: sha256=hex_digest) and the secret returned on creation.
Failed deliveries retry up to 3 times. Check delivery status via GET /notifications.
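On the receiving side, signature verification might look like the sketch below. It assumes the signature is computed over the raw request body bytes with the webhook secret as the HMAC key; confirm that against a real delivery before relying on it. `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(secret, body, signature_header):
    """Check an X-Webhook-Signature value of the form sha256=<hex digest>.

    Assumes the digest covers the raw request body bytes.
    """
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))


# Self-contained demonstration with a made-up secret and payload.
demo_secret = "whsec-demo"
demo_body = b'{"event": "job_completed"}'
demo_header = "sha256=" + hmac.new(demo_secret.encode("utf-8"), demo_body, hashlib.sha256).hexdigest()
```

Verify against the bytes exactly as received; re-serializing parsed JSON can change whitespace or key order and break the comparison.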
Export & Batch Operations
Export Data
POST /export
{
"format": "csv", // json | csv | excel | ndjson
"data": [{"col1": "val1", "col2": "val2"}],
"filename": "my-data.csv"
}
// Returns file download
Batch Scrape
POST /v1/scrape/batch
{
"urls": ["https://example.com", "https://httpbin.org/html"],
"output_format": "text",
"max_concurrent": 5
}
// Returns: { total: 2, success: 2, failed: 0, results: [...] }
Extraction Templates
Save extraction schemas for reuse across scrapes. Templates can be matched to URLs via regex patterns.
POST /templates
{
"name": "Product scraper",
"description": "Extract titles and prices from product listings",
"strategy": "css",
"config": {
"schema_def": {
"base_selector": "div.product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"}
]
}
},
"url_pattern": "books\\.toscrape\\.com" // regex for auto-suggest (backslashes escaped for JSON)
}
| Endpoint | Description |
|---|---|
| POST /templates | Create a template |
| GET /templates | List your templates (filter by strategy) |
| GET /templates/{id} | Get template details |
| PUT /templates/{id} | Update template |
| DELETE /templates/{id} | Delete template |
| POST /templates/{id}/use | Increment usage counter |
| GET /templates/suggest/{url} | Find templates matching a URL |
When scraping, use a template's config as the extraction field in your scrape request.
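Template suggestion by URL can be sketched as a regex search over stored patterns. Whether the service uses search or full-match semantics is not stated, so `re.search` here is an assumption, as is the helper name `suggest_templates`.

```python
import re


def suggest_templates(url, templates):
    """Return names of templates whose url_pattern matches anywhere in the URL."""
    return [
        template["name"]
        for template in templates
        if template.get("url_pattern") and re.search(template["url_pattern"], url)
    ]


# Example template records (only the fields needed for matching).
templates = [
    {"name": "Product scraper", "url_pattern": r"books\.toscrape\.com"},
    {"name": "News article", "url_pattern": r"news\.example\.org"},
    {"name": "No pattern", "url_pattern": None},
]
matches = suggest_templates("https://books.toscrape.com/catalogue/page-2.html", templates)
```

Escaping the dots matters: an unescaped `.` would also match hosts such as `booksXtoscrape.com`.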
Bulk URL Upload
Upload a file containing URLs to scrape them all at once.
POST /bulk/upload (multipart/form-data)
Fields:
file: <CSV, JSON, or TXT file>
output_format: "markdown" // markdown | html | text
mode: "sync" // sync (up to 50 inline) | async (up to 500 as job)
Supported Formats
| Format | How URLs Are Detected |
|---|---|
| CSV | Any cell containing https:// is extracted |
| JSON | Array of strings, or array of objects with "url" field |
| TXT | One URL per line (https://...) |
Sync mode returns results inline (capped at 50 URLs). Async mode creates a background job trackable on the Jobs page.
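The detection rules in the table above can be approximated client-side, for example to preview which URLs a file will yield before uploading it. This is a sketch of those rules under the assumption that whole cells and whole lines are taken as URLs; `extract_urls` is a hypothetical helper, not the server's parser.

```python
import csv
import io
import json


def extract_urls(content, fmt):
    """Extract URLs from file content according to the per-format rules."""
    fmt = fmt.lower()
    if fmt == "csv":
        # Any cell containing https:// is taken as a URL
        rows = csv.reader(io.StringIO(content))
        return [cell.strip() for row in rows for cell in row if "https://" in cell]
    if fmt == "json":
        # Array of strings, or array of objects with a "url" field
        urls = []
        for item in json.loads(content):
            if isinstance(item, str):
                urls.append(item)
            elif isinstance(item, dict) and "url" in item:
                urls.append(item["url"])
        return urls
    if fmt == "txt":
        # One URL per line; other lines are ignored
        return [line.strip() for line in content.splitlines()
                if line.strip().startswith("https://")]
    raise ValueError(f"unsupported format: {fmt}")


csv_urls = extract_urls("name,url\nExample,https://example.com\nBin,https://httpbin.org/html\n", "csv")
json_urls = extract_urls('["https://a.example", {"url": "https://b.example"}, {"other": 1}]', "json")
txt_urls = extract_urls("https://a.example\n\n# comment\nhttps://b.example\n", "txt")
```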
Error Codes
All API errors return JSON with a detail field explaining the issue.
| Code | Name | When It Happens | What To Do |
|---|---|---|---|
| 200 | OK | Request succeeded | Process the response normally |
| 401 | Unauthorized | Missing, invalid, or deactivated API key | Check X-API-Key header; create a new key if expired |
| 403 | Forbidden | API key doesn't have access to this resource | Verify the resource belongs to your account |
| 404 | Not Found | Resource (job, schedule, template) doesn't exist | Check the ID; the resource may have been deleted |
| 422 | Validation Error | Request body has invalid fields | Check the detail field for which parameter failed validation |
| 429 | Rate Limited | Too many requests per minute, or daily quota exceeded | Wait for Retry-After seconds; upgrade quota if needed |
| 500 | Internal Error | Server-side bug or unexpected failure | Retry once; report if persistent |
| 502 | Bad Gateway | Target website unreachable or returned an error | The scrape target may be down; try a different engine |
| 503 | Service Unavailable | Redis or job queue is down | Infrastructure issue; wait and retry |
Validation Error Details
422 responses include specific field-level errors:
{
"detail": [
{
"loc": ["body", "url"],
"msg": "URLs targeting private/internal networks are not allowed",
"type": "value_error"
}
]
}
Common Validation Rules
| Field | Rule |
|---|---|
| url / start_url | Max 4096 chars; no private IPs (127.*, 10.*, 192.168.*) |
| js_code | Max 5000 chars; no require(), import, process.*, eval() |
| timeout | 1-120 seconds |
| max_pages | 1-100 |
| max_depth | 1-10 |
| max_results | 1-50 |
| query | 1-500 characters |
| crawl_goal | Max 2000 characters |
| max_images | 0-100 |
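Checking a URL locally before submitting it can save a round trip that would end in a 422. The sketch below covers only the length limit and IP-literal hosts using the standard `ipaddress` module; hostnames that resolve to private addresses are out of scope, and the server's actual checks may be stricter. `validate_url` is a hypothetical helper.

```python
import ipaddress
from urllib.parse import urlsplit

MAX_URL_LENGTH = 4096


def validate_url(url):
    """Return an error message, or None if the URL passes these local checks."""
    if len(url) > MAX_URL_LENGTH:
        return "URL exceeds 4096 characters"
    host = urlsplit(url).hostname
    if not host:
        return "URL has no host"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return None  # a hostname, not an IP literal; DNS resolution is not checked here
    if address.is_private or address.is_loopback:
        return "URLs targeting private/internal networks are not allowed"
    return None


blocked = validate_url("http://192.168.1.1/admin")
allowed = validate_url("https://example.com/pricing")
```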
Rate Limiting
When rate limited (429), the response includes a Retry-After header with the number of seconds to wait. Per-key limits are configurable by admins via the API key rate_limit_rpm and daily_quota fields.
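A client can honour Retry-After with a small wrapper such as the sketch below. The `send` callable, its `(status, headers, body)` return shape, and the one-second fallback when the header is missing are assumptions for illustration; the header is treated as a number of seconds, as described above.

```python
import time


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() and retry on 429, waiting Retry-After seconds between attempts.

    send() must return (status_code, headers_dict, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(float(headers.get("Retry-After", 1)))
    return status, body  # still rate limited after the final attempt


# Canned responses: one rate-limited reply, then success.
responses = iter([(429, {"Retry-After": "2"}, None), (200, {}, {"ok": True})])
waits = []
status, body = call_with_retry(lambda: next(responses), sleep=waits.append)
```

Injecting `sleep` keeps the wrapper testable; in production the default `time.sleep` applies.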
API Reference
All endpoints require X-API-Key header unless noted. Admin endpoints use the master key.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | None | Engine availability check |
| POST | /auth/register | None | Create account + API key |
| POST | /auth/login | None | Login, get session key |
| GET | /auth/me | API Key | Current user info |
| POST | /auth/keys | API Key | Create new API key |
| POST | /scrape | API Key | Scrape a single URL |
| POST | /crawl | API Key | Crawl from a starting URL |
| POST | /search | API Key | Web search |
| POST | /reverse-image-search | API Key | Reverse image search |
| POST | /research | API Key | Autonomous research |
| POST | /intelligence/nl | API Key | Parse NL command |
| POST | /intelligence/nl/execute | API Key | Parse + execute NL command |
| POST | /intelligence/analyze-url | API Key | Analyze URL type |
| POST | /intelligence/infer-schema | API Key | Auto-generate schema |
| POST | /intelligence/classify | API Key | Classify URL content |
| POST | /chat | API Key | Chat message |
| GET | /chat/{id}/history | API Key | Chat session history |
| POST | /agents/research | API Key | Autonomous agent (async) |
| POST | /agents/research/sync | API Key | Autonomous agent (sync) |
| POST | /jobs/crawl | API Key | Submit crawl job |
| POST | /jobs/research | API Key | Submit research job |
| POST | /jobs/bulk-scrape | API Key | Submit bulk scrape job |
| GET | /jobs | API Key | List jobs |
| GET | /jobs/{id} | API Key | Job status + progress |
| DELETE | /jobs/{id} | API Key | Cancel job |
| GET | /jobs/{id}/stream | API Key | SSE progress stream |
| POST | /schedules | API Key | Create schedule |
| GET | /schedules | API Key | List schedules |
| PUT | /schedules/{id} | API Key | Update schedule |
| DELETE | /schedules/{id} | API Key | Delete schedule |
| GET | /schedules/{id}/snapshots | API Key | Snapshot history |
| GET | /schedules/{id}/diff/{snap} | API Key | View diff |
| GET | /dashboard/overview | API Key | Dashboard stats |
| GET | /dashboard/activity | API Key | Activity time series |
| GET | /dashboard/changes | API Key | Recent changes |
| GET | /dashboard/engine-health | API Key | Engine success rates |
| GET | /dashboard/jobs-queue | API Key | Job queue status |
| POST | /webhooks | API Key | Register webhook |
| GET | /webhooks | API Key | List webhooks |
| DELETE | /webhooks/{id} | API Key | Delete webhook |
| GET | /notifications | API Key | Notification log |
| POST | /pipelines | API Key | Create pipeline |
| GET | /pipelines | API Key | List pipelines |
| PUT | /pipelines/{id} | API Key | Update pipeline |
| DELETE | /pipelines/{id} | API Key | Delete pipeline |
| POST | /pipelines/{id}/trigger | API Key | Manual trigger |
| POST | /export | API Key | Export data as file |
| GET | /jobs/{id}/export | API Key | Export job results |
| POST | /v1/scrape/batch | API Key | Batch scrape URLs |
| GET | /cache/stats | None | Cache statistics |
| POST | /templates | API Key | Create extraction template |
| GET | /templates | API Key | List templates |
| GET | /templates/{id} | API Key | Get template |
| PUT | /templates/{id} | API Key | Update template |
| DELETE | /templates/{id} | API Key | Delete template |
| POST | /templates/{id}/use | API Key | Record template usage |
| GET | /templates/suggest/{url} | API Key | Auto-suggest templates |
| GET | /user/pinned | API Key | Get pinned results |
| POST | /user/pinned | API Key | Pin a result |
| DELETE | /user/pinned | API Key | Unpin a result |
| GET | /user/dashboard-config | API Key | Get dashboard settings |
| PUT | /user/dashboard-config | API Key | Update dashboard settings |
| POST | /bulk/upload | API Key | Upload URLs file for bulk scrape |
| GET | /history/logs | API Key | Your request history |
| GET | /history/stats | API Key | Your usage statistics |
Configuration
Environment variables (prefix: SCRAPER_):
| Variable | Default | Description |
|---|---|---|
| SCRAPER_DATABASE_URL | sqlite:///./data/scraper.db | Database connection string |
| SCRAPER_REDIS_URL | redis://localhost:6379/0 | Redis connection for caching + jobs |
| SCRAPER_ADMIN_MASTER_KEY | (required) | Admin API master key |
| SCRAPER_LLM_API_KEY | | OpenAI key for intelligence features |
| SCRAPER_LLM_PROVIDER | openai/gpt-4o-mini | LLM model (OpenAI, vLLM, Cloudflare) |
| SCRAPER_LLM_BASE_URL | | Custom LLM endpoint |
| SCRAPER_PROXIES | | Comma-separated proxy list |
| SCRAPER_RESIDENTIAL_PROXIES | | Premium residential proxies |
| SCRAPER_DEFAULT_TIMEOUT | 30 | Request timeout (seconds) |
| SCRAPER_MAX_CONCURRENT | 5 | Max parallel scrapes |
| SCRAPER_ESCALATION_ENABLED | true | Enable engine fallback chain |
| SCRAPER_RATE_LIMIT_RPM | 60 | Global rate limit (requests/min) |
| SCRAPER_TAVILY_API_KEY | | Tavily search API key |
| SCRAPER_SERPER_API_KEY | | Serper search API key |
| SCRAPER_TINEYE_API_KEY | | TinEye reverse image API key |
Version: 0.4.0