CrawlerScraper Documentation

Complete guide to the CrawlerScraper platform -- a full-featured web scraping, crawling, and data extraction system with LLM-powered intelligence and autonomous operation.

Getting Started

CrawlerScraper provides a unified API and web UI for extracting data from websites. All API requests require authentication via an API key.

Authentication

Include your API key in every request using the X-API-Key header:

curl -X POST https://provider.thefoundationreport.com/api/scrape \
  -H "X-API-Key: sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Get your API key from the API Keys page in the sidebar, or register an account via the API, which returns a key:

POST /auth/register
{ "name": "My Name", "email": "you@example.com", "password": "..." }
# Returns: { "user": {...}, "api_key": "sk-..." }

Base URL

All API endpoints are available at: https://provider.thefoundationreport.com/api/

OpenAPI schema is at /api/docs and /api/openapi.json.
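
As a minimal sketch of the authentication scheme above, the following Python helper builds a JSON POST request carrying the X-API-Key header. The helper names build_request and call_api are illustrative, not part of the platform, and the assumption that every response body is JSON is ours.

```python
import json
import urllib.request

# Base URL as stated in this guide (without the trailing slash).
BASE_URL = "https://provider.thefoundationreport.com/api"


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def call_api(path: str, payload: dict, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the response, assuming it is JSON."""
    request = build_request(path, payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as call_api("/scrape", {"url": "https://example.com"}, "sk-your-key-here") would then send the same request as the curl example.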

Scraping

Extract content from any URL. The system uses a 5-engine fallback chain with anti-bot detection to maximize success.

POST /scrape
{
  "url": "https://example.com",
  "output_format": "markdown",    // markdown | html | raw_html | text
  "engine": null,                  // auto-select, or: curl | crawl4ai | stealth | nodriver | camoufox
  "cache": true,                   // use Redis/DB cache (default true)
  "auto_detect": false,            // LLM-powered site analysis
  "classify": false,               // add content classification
  "normalize": false,              // normalize extracted data
  "download_images": false,
  "max_images": 20,
  "timeout": 30
}

Engine Escalation Chain

Engine	Type	Speed	JS Support	Anti-Bot
curl	HTTP client	Fastest	No	None
crawl4ai	Headless browser	Fast	Yes	Basic
stealth	Crawl4AI + magic	Medium	Yes	Good
nodriver	Undetected Chrome	Slow	Yes	Strong
camoufox	Firefox + fingerprint	Slowest	Yes	Strongest

When no engine is specified, the system tries the engines in the order listed above, starting with curl. If it detects anti-bot protection (Cloudflare, DataDome, PerimeterX), it automatically escalates to the next engine in the chain.
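
The escalation behaviour can be pictured with a short Python sketch. This is not the platform's implementation: the fetch callable and the string-matching looks_blocked check are hypothetical stand-ins, and real anti-bot detection is certainly more involved.

```python
from typing import Callable, Optional

# Engine order taken from the table above.
ENGINE_CHAIN = ["curl", "crawl4ai", "stealth", "nodriver", "camoufox"]
# Illustrative markers only; an assumption, not the real detection logic.
ANTI_BOT_MARKERS = ("cloudflare", "datadome", "perimeterx")


def looks_blocked(html: str) -> bool:
    """Crude stand-in for anti-bot detection: look for vendor names in the page."""
    lowered = html.lower()
    return any(marker in lowered for marker in ANTI_BOT_MARKERS)


def scrape_with_escalation(
    url: str, fetch: Callable[[str, str], str]
) -> tuple[Optional[str], Optional[str]]:
    """Try each engine in order; return (engine, html) for the first clean result."""
    for engine in ENGINE_CHAIN:
        try:
            html = fetch(engine, url)
        except Exception:
            continue  # engine failed outright, move on to the next one
        if not looks_blocked(html):
            return engine, html
    return None, None  # every engine failed or was blocked
```

Passing fetch in as a parameter keeps the loop independent of any particular engine.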

Caching

Results are cached in Redis (1h TTL) and PostgreSQL (24h TTL). A repeat request for the same URL is served from cache without re-fetching the page. Set cache: false to bypass the cache. Cache statistics are available at GET /cache/stats.

Extraction

Add an extraction field to extract structured data:

"extraction": {
  "strategy": "css",          // css | xpath | llm | auto
  "schema_def": {             // for CSS strategy
    "base_selector": "div.product",
    "fields": [
      {"name": "title", "selector": "h2", "type": "text"},
      {"name": "price", "selector": ".price", "type": "text"},
      {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
  }
}
// Or use "auto" strategy -- LLM infers the schema for you

Crawling

Crawl multiple pages starting from a URL. Supports BFS traversal and LLM-powered intelligent crawling.

POST /crawl
{
  "start_url": "https://example.com",
  "max_pages": 10,            // 1-100
  "max_depth": 2,             // 1-10
  "url_pattern": null,        // regex filter for URLs
  "same_domain_only": true,
  "respect_robots_txt": true,
  "use_sitemap": false,       // seed queue from sitemap.xml
  "crawl_goal": null          // enables LLM-powered link scoring
}

Intelligent Crawling

When crawl_goal is set, the crawler uses a priority queue instead of BFS. An LLM scores each link by relevance (0-1) to your goal, and the crawler follows the highest-scored links first. Low-relevance links (below 0.1) are skipped entirely.
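
The priority-queue traversal described above can be sketched in Python as follows. The get_links and score_link callables are hypothetical placeholders for page fetching and LLM scoring; only the 0.1 cutoff and the highest-score-first ordering come from the text.

```python
import heapq
from typing import Callable, Iterable

MIN_SCORE = 0.1  # links scoring below this are skipped, per the text above


def crawl_by_priority(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    score_link: Callable[[str], float],
    max_pages: int = 10,
) -> list[str]:
    """Visit pages in descending relevance order and return the visit order."""
    queue = [(-1.0, start_url)]  # heapq is a min-heap, so scores are negated
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        _, url = heapq.heappop(queue)
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            score = score_link(link)
            if score >= MIN_SCORE:
                heapq.heappush(queue, (-score, link))
    return visited
```

Unlike BFS, a highly scored link found deep in the site is visited before a weakly scored link found on the start page.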

Sitemap Support

Set use_sitemap: true to discover pages from robots.txt sitemaps and sitemap.xml. Sitemap URLs are seeded into the crawl queue alongside link-discovered pages.
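
For illustration, sitemap discovery of this kind can be approximated with the Python standard library. The function names are our own; the sketch assumes a plain urlset sitemap in the standard sitemaps.org namespace and does not follow sitemap index files.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs listed on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]
```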

Search

Search the web using multiple providers with optional result scraping.

POST /search
{
  "query": "web scraping best practices",
  "provider": null,            // auto-select, or: duckduckgo | tavily | serper
  "search_type": "text",      // text | news | images
  "max_results": 10,
  "region": null,              // e.g. "us-en"
  "time_range": null,          // day | week | month | year
  "scrape_results": false      // scrape each result page for full content
}

DuckDuckGo is free and requires no API key. Tavily and Serper require keys set in environment variables.

Research

Autonomous multi-step research that searches, scrapes, and extracts structured data to fill your schema.

POST /research
{
  "goal": "Compare top 5 CRM platforms for small businesses",
  "output_schema": {
    "name": "string",
    "pricing": "string",
    "features": ["string"],
    "rating": "number",
    "source_url": "string"
  },
  "strategy": "c",            // a = deterministic, b = LLM-guided, c = hybrid
  "max_results_per_query": 5,
  "max_iterations": 3          // strategy B only
}

Strategy	Speed	Cost	Quality	How It Works
A (Deterministic)	Fastest	Lowest	Basic	Goal as query, LLM only for extraction
B (LLM-Guided)	Slowest	Highest	Best	LLM generates queries, scores relevance, iterates
C (Hybrid)	Medium	Medium	Good	LLM generates queries, deterministic pipeline

Reverse Image Search

Find where an image appears online using Google Lens, Yandex, and TinEye.

POST /reverse-image-search  (multipart/form-data)
Fields:
  image_url: "https://..."    // OR
  image_file: <file>          // OR
  image_base64: "data:..."
  providers: "google,yandex"  // google | yandex | tineye
  max_results: 10

Intelligence (LLM-Powered)

These features require SCRAPER_LLM_API_KEY to be configured.

URL Analysis

Auto-detect site type and optimal scraping strategy. Uses a registry of 25+ known sites (Amazon, GitHub, Reddit, etc.) and falls back to an LLM for URLs the registry does not cover.

POST /intelligence/analyze-url
{ "url": "https://amazon.com/dp/B09V3KXJPB" }
// Returns: { site_type: "ecommerce", recommended_engine: "crawl4ai", confidence: "high" }

Schema Inference

Auto-generate extraction schemas from page content:

POST /intelligence/infer-schema
{ "url": "https://example.com/products", "goal": "extract product data" }
// Returns suggested ExtractionConfig + sample data preview

Natural Language Commands

Express scraping intent in plain English:

POST /intelligence/nl
{ "command": "scrape all product prices from books.toscrape.com" }
// Returns: { request_type: "scrape", request: {...}, explanation: "..." }

POST /intelligence/nl/execute    // Parse AND execute in one call
{ "command": "search for AI startups in Europe" }

Content Classification

Set classify: true on scrape requests to get category, tags, language, sentiment, and entity extraction in the response.

Data Normalization

Set normalize: true to standardize extracted prices, dates, addresses, and phone numbers into consistent formats.

Chat Assistant

Interactive scraping sessions via conversation. The chat maintains context across messages -- scrape a URL, then ask to extract data from it, then export.

POST /chat
{ "message": "Scrape https://example.com", "session_id": null }
// Returns: { session_id: "uuid", message: "...", tool_used: "scrape", tool_result: {...} }

Available tools in chat: scrape, crawl, search, extract, export, analyze, infer_schema.

Sessions are stored in Redis with a 30-minute TTL. Retrieve history via GET /chat/{session_id}/history.

Autonomous Research Agents

Agents plan multi-step research, execute it, adapt on failure, and synthesize findings into a final report.

POST /agents/research
{
  "goal": "Comprehensive comparison of 5 CRM platforms",
  "output_format": "report",   // structured | report | comparison_table
  "budget": {
    "max_pages": 50,
    "max_llm_calls": 20,
    "max_time_minutes": 10
  }
}
// Returns: { job_id: "uuid", status: "pending" }
// Poll GET /jobs/{job_id} or use /jobs/{job_id}/stream for SSE

The agent works in four phases: Planning (the LLM generates a 3-8 step plan), Execution (search, scrape, extract), Adaptation (re-plan on failures), and Synthesis (consolidate the results into a report with key findings).

Background Jobs

Long-running tasks (crawl, research, agent) run as background jobs via the arq worker queue.

Endpoint	Description
POST /jobs/crawl	Submit background crawl job
POST /jobs/research	Submit background research job
POST /jobs/bulk-scrape	Scrape multiple URLs in background
GET /jobs	List your jobs (filter by status)
GET /jobs/{id}	Get job status + progress
DELETE /jobs/{id}	Cancel a running job
GET /jobs/{id}/stream	SSE progress stream
GET /jobs/{id}/export?format=csv	Export job results

Job statuses: pending (queued), running (in progress), completed, failed, cancelled.
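
A client that polls GET /jobs/{id} might look like the Python sketch below. The get_job callable is a hypothetical wrapper around that endpoint, and the polling interval and timeout are arbitrary choices; only the status names come from the list above.

```python
import time
from typing import Callable

# Statuses after which a job will not change again.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    job_id: str,
    get_job: Callable[[str], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status, then return its last state."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r} after {timeout}s")
        sleep(interval)
```

For long jobs, the SSE stream avoids repeated polling; this loop is the simpler fallback.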

Schedules & Change Detection

Create recurring scrape schedules with automatic content change detection.

POST /schedules
{
  "name": "Monitor pricing page",
  "schedule_type": "scrape",   // scrape | crawl | research
  "cron_expression": "0 * * * *",   // every hour
  "request_data": { "url": "https://example.com/pricing", "output_format": "text" }
}

After each run, the system compares the content hash with that of the previous snapshot. If the hash has changed, a diff is generated and a change_detected event is dispatched to webhooks.
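
The hash-then-diff step can be illustrated in a few lines of Python. SHA-256 and the unified diff format are assumptions made for this sketch; the guide does not say which hash or diff format the platform uses.

```python
import difflib
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Hash a snapshot's content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_change(previous: str, current: str) -> Optional[str]:
    """Return a unified diff if the content changed, or None if it did not."""
    if content_hash(previous) == content_hash(current):
        return None
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)
```

Comparing hashes first means the more expensive diff is only computed when something actually changed.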

Endpoint	Description
POST /schedules	Create schedule
GET /schedules	List all schedules
PUT /schedules/{id}	Update (name, cron, pause/resume)
DELETE /schedules/{id}	Delete schedule
GET /schedules/{id}/snapshots	View change history
GET /schedules/{id}/diff/{snap_id}	View specific diff

Common cron expressions: 0 * * * * (hourly), 0 9 * * * (daily 9am), 0 9 * * 1 (weekly Monday), */30 * * * * (every 30 min).

Data Pipelines

Route scraped data to external destinations automatically.

Destination	Config
Webhook	{"url": "https://..."}
S3 / MinIO	{"bucket": "...", "access_key": "...", "secret_key": "..."}
Database	{"connection_url": "postgresql+asyncpg://...", "table": "..."}
Google Sheets	{"spreadsheet_id": "...", "service_account": {...}}

Pipelines support transforms: field mapping (field_map), field filtering (include_fields, exclude_fields), and row filtering (filters with operators eq/ne/gt/lt/contains).
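
To make the transform options concrete, here is a hedged Python sketch of how they could be applied to a list of rows. The filter shape ({"field", "op", "value"}) and the order of operations (filter, then rename, then include/exclude) are assumptions; only the option names and operators come from the text above.

```python
import operator

# Operator names from the text above, mapped to Python comparisons.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "lt": operator.lt,
    "contains": lambda actual, expected: expected in actual,
}


def row_passes(row: dict, filters: list) -> bool:
    """A row passes when every filter matches (the filter shape is assumed)."""
    return all(
        f["field"] in row and OPERATORS[f["op"]](row[f["field"]], f["value"])
        for f in filters
    )


def apply_transforms(rows, field_map=None, include_fields=None,
                     exclude_fields=None, filters=None):
    """Filter rows, rename fields, then keep or drop fields by name."""
    result = []
    for row in rows:
        if not row_passes(row, filters or []):
            continue
        row = {(field_map or {}).get(key, key): value for key, value in row.items()}
        if include_fields is not None:
            row = {k: v for k, v in row.items() if k in include_fields}
        if exclude_fields:
            row = {k: v for k, v in row.items() if k not in exclude_fields}
        result.append(row)
    return result
```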

POST /pipelines/{id}/trigger     // Manual trigger with data
{ "data": [{"name": "test", "value": 42}] }

Webhooks & Notifications

Receive HTTP callbacks when events occur.

POST /webhooks
{
  "url": "https://your-server.com/hook",
  "events": ["change_detected", "job_completed", "job_failed"]
}

Webhook payloads are signed with HMAC-SHA256. Verify using the X-Webhook-Signature header (format: sha256=hex_digest) and the secret returned on creation.

Failed deliveries retry up to 3 times. Check delivery status via GET /notifications.
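
A receiver can check the signature with Python's hmac module, as sketched below. The sketch assumes the HMAC is computed over the raw request body with the webhook secret as the key; the guide specifies the header format but not the exact signed content, so verify this against a real delivery.

```python
import hashlib
import hmac
from typing import Optional


def expected_signature(secret: str, body: bytes) -> str:
    """Compute the header value in the documented format: sha256=hex_digest."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_webhook(secret: str, body: bytes, header_value: Optional[str]) -> bool:
    """Compare signatures in constant time; a missing header never verifies."""
    expected = expected_signature(secret, body).encode("utf-8")
    received = (header_value or "").encode("utf-8")
    return hmac.compare_digest(expected, received)
```

Verify against the raw bytes as received, before any JSON parsing, because re-serializing the payload can change whitespace and break the digest.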

Export & Batch Operations

Export Data

POST /export
{
  "format": "csv",     // json | csv | excel | ndjson
  "data": [{"col1": "val1", "col2": "val2"}],
  "filename": "my-data.csv"
}
// Returns file download

Batch Scrape

POST /v1/scrape/batch
{
  "urls": ["https://example.com", "https://httpbin.org/html"],
  "output_format": "text",
  "max_concurrent": 5
}
// Returns: { total: 2, success: 2, failed: 0, results: [...] }

Extraction Templates

Save extraction schemas for reuse across scrapes. Templates can be matched to URLs via regex patterns.

POST /templates
{
  "name": "Product scraper",
  "description": "Extract titles and prices from product listings",
  "strategy": "css",
  "config": {
    "schema_def": {
      "base_selector": "div.product",
      "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
      ]
    }
  },
  "url_pattern": "books\\.toscrape\\.com"    // regex for auto-suggest; backslashes must be escaped in JSON
}

Endpoint	Description
POST /templates	Create a template
GET /templates	List your templates (filter by strategy)
GET /templates/{id}	Get template details
PUT /templates/{id}	Update template
DELETE /templates/{id}	Delete template
POST /templates/{id}/use	Increment usage counter
GET /templates/suggest/{url}	Find templates matching a URL

When scraping, use a template's config as the extraction field in your scrape request.
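
The URL matching behind template suggestions can be approximated client-side. This Python sketch assumes an unanchored regex search (re.search) against the full URL; the server's actual matching rules may differ.

```python
import re


def matching_templates(url: str, templates: list) -> list:
    """Return the templates whose url_pattern regex matches the given URL."""
    return [
        template
        for template in templates
        if template.get("url_pattern") and re.search(template["url_pattern"], url)
    ]
```

Escaping the dots in a pattern matters: an unescaped "." matches any character, so the pattern would also match look-alike hosts.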

Bulk URL Upload

Upload a file containing URLs to scrape them all at once.

POST /bulk/upload  (multipart/form-data)
Fields:
  file: <CSV, JSON, or TXT file>
  output_format: "markdown"          // markdown | html | text
  mode: "sync"                       // sync (up to 50 inline) | async (up to 500 as job)

Supported Formats

Format	How URLs Are Detected
CSV	Any cell containing https:// is extracted
JSON	Array of strings, or array of objects with "url" field
TXT	One URL per line (https://...)

Sync mode returns results inline (capped at 50 URLs). Async mode creates a background job trackable on the Jobs page.
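
The detection rules in the table can be mirrored in a small Python helper, for example to preview which URLs a file will yield before uploading it. Choosing the parser by file extension is an assumption of this sketch, as are the function names.

```python
import csv
import io
import json


def urls_from_file(name: str, text: str) -> list[str]:
    """Extract URLs from CSV, JSON, or TXT content, following the table above."""
    suffix = name.rsplit(".", 1)[-1].lower()
    if suffix == "csv":
        # Any cell containing https:// counts as a URL.
        return [cell.strip() for row in csv.reader(io.StringIO(text))
                for cell in row if "https://" in cell]
    if suffix == "json":
        urls = []
        for item in json.loads(text):
            if isinstance(item, str):
                urls.append(item)
            elif isinstance(item, dict) and "url" in item:
                urls.append(item["url"])
        return urls
    # TXT: one URL per line.
    return [line.strip() for line in text.splitlines()
            if line.strip().startswith("https://")]


def choose_mode(url_count: int) -> str:
    """Pick sync for up to 50 URLs and async for up to 500."""
    if url_count <= 50:
        return "sync"
    if url_count <= 500:
        return "async"
    raise ValueError("more than 500 URLs; split the file")
```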

Error Codes

All API errors return JSON with a detail field explaining the issue.

Code	Name	When It Happens	What To Do
200	OK	Request succeeded	Process the response normally
401	Unauthorized	Missing, invalid, or deactivated API key	Check X-API-Key header; create a new key if expired
403	Forbidden	API key doesn't have access to this resource	Verify the resource belongs to your account
404	Not Found	Resource (job, schedule, template) doesn't exist	Check the ID; the resource may have been deleted
422	Validation Error	Request body has invalid fields	Check the detail field for which parameter failed validation
429	Rate Limited	Too many requests per minute, or daily quota exceeded	Wait for Retry-After seconds; upgrade quota if needed
500	Internal Error	Server-side bug or unexpected failure	Retry once; report if persistent
502	Bad Gateway	Target website unreachable or returned an error	The scrape target may be down; try a different engine
503	Service Unavailable	Redis or job queue is down	Infrastructure issue; wait and retry

Validation Error Details

422 responses include specific field-level errors:

{
  "detail": [
    {
      "loc": ["body", "url"],
      "msg": "URLs targeting private/internal networks are not allowed",
      "type": "value_error"
    }
  ]
}

Common Validation Rules

Field	Rule
url / start_url	Max 4096 chars; no private IPs (127.*, 10.*, 192.168.*)
js_code	Max 5000 chars; no require(), import, process.*, eval()
timeout	1-120 seconds
max_pages	1-100
max_depth	1-10
max_results	1-50
query	1-500 characters
crawl_goal	Max 2000 characters
max_images	0-100
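
Some of these rules can be checked client-side before sending a request, which avoids a round trip that would end in a 422. The Python sketch below covers the URL and numeric-range rules only. It checks IP literals but does not resolve hostnames, and the error messages are our own wording, not the server's.

```python
import ipaddress
from urllib.parse import urlsplit

MAX_URL_LENGTH = 4096
# Inclusive ranges copied from the table above.
RANGES = {
    "timeout": (1, 120),
    "max_pages": (1, 100),
    "max_depth": (1, 10),
    "max_results": (1, 50),
    "max_images": (0, 100),
}


def validate_url(field: str, url: str) -> list[str]:
    """Check the length and private-address rules for one URL field."""
    errors = []
    if len(url) > MAX_URL_LENGTH:
        errors.append(f"{field}: longer than {MAX_URL_LENGTH} characters")
    host = urlsplit(url).hostname or ""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # a hostname, not an IP literal; not resolved here
    if address is not None and (address.is_private or address.is_loopback):
        errors.append(f"{field}: private or loopback address")
    return errors


def validate_request(body: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the body looks valid."""
    errors = []
    for field in ("url", "start_url"):
        if field in body:
            errors.extend(validate_url(field, body[field]))
    for field, (low, high) in RANGES.items():
        value = body.get(field)
        if value is not None and not low <= value <= high:
            errors.append(f"{field}: must be between {low} and {high}")
    return errors
```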

Rate Limiting

When a request is rate limited (429), the response includes a Retry-After header with the number of seconds to wait. Admins can configure per-key limits via the rate_limit_rpm and daily_quota fields on each API key.
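
A client can honour Retry-After with a small retry loop, sketched here in Python. The send callable is a hypothetical function returning (status, headers, body); the attempt limit and the one-second fallback when the header is missing are arbitrary choices.

```python
import time
from typing import Any, Callable


def send_with_retry(
    send: Callable[[], tuple],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Call send(); on a 429, wait Retry-After seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, body
        # Retry-After is documented as a number of seconds.
        sleep(float(headers.get("Retry-After", 1)))
    raise AssertionError("unreachable")
```

If the last attempt is still rate limited, the 429 is returned to the caller rather than retried indefinitely.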

API Reference

All endpoints require X-API-Key header unless noted. Admin endpoints use the master key.

Method	Path	Auth	Description
GET	/health	None	Engine availability check
POST	/auth/register	None	Create account + API key
POST	/auth/login	None	Login, get session key
GET	/auth/me	API Key	Current user info
POST	/auth/keys	API Key	Create new API key
POST	/scrape	API Key	Scrape a single URL
POST	/crawl	API Key	Crawl from a starting URL
POST	/search	API Key	Web search
POST	/reverse-image-search	API Key	Reverse image search
POST	/research	API Key	Autonomous research
POST	/intelligence/nl	API Key	Parse NL command
POST	/intelligence/nl/execute	API Key	Parse + execute NL command
POST	/intelligence/analyze-url	API Key	Analyze URL type
POST	/intelligence/infer-schema	API Key	Auto-generate schema
POST	/intelligence/classify	API Key	Classify URL content
POST	/chat	API Key	Chat message
GET	/chat/{id}/history	API Key	Chat session history
POST	/agents/research	API Key	Autonomous agent (async)
POST	/agents/research/sync	API Key	Autonomous agent (sync)
POST	/jobs/crawl	API Key	Submit crawl job
POST	/jobs/research	API Key	Submit research job
POST	/jobs/bulk-scrape	API Key	Submit bulk scrape job
GET	/jobs	API Key	List jobs
GET	/jobs/{id}	API Key	Job status + progress
DELETE	/jobs/{id}	API Key	Cancel job
GET	/jobs/{id}/stream	API Key	SSE progress stream
POST	/schedules	API Key	Create schedule
GET	/schedules	API Key	List schedules
PUT	/schedules/{id}	API Key	Update schedule
DELETE	/schedules/{id}	API Key	Delete schedule
GET	/schedules/{id}/snapshots	API Key	Snapshot history
GET	/schedules/{id}/diff/{snap_id}	API Key	View diff
GET	/dashboard/overview	API Key	Dashboard stats
GET	/dashboard/activity	API Key	Activity time series
GET	/dashboard/changes	API Key	Recent changes
GET	/dashboard/engine-health	API Key	Engine success rates
GET	/dashboard/jobs-queue	API Key	Job queue status
POST	/webhooks	API Key	Register webhook
GET	/webhooks	API Key	List webhooks
DELETE	/webhooks/{id}	API Key	Delete webhook
GET	/notifications	API Key	Notification log
POST	/pipelines	API Key	Create pipeline
GET	/pipelines	API Key	List pipelines
PUT	/pipelines/{id}	API Key	Update pipeline
DELETE	/pipelines/{id}	API Key	Delete pipeline
POST	/pipelines/{id}/trigger	API Key	Manual trigger
POST	/export	API Key	Export data as file
GET	/jobs/{id}/export	API Key	Export job results
POST	/v1/scrape/batch	API Key	Batch scrape URLs
GET	/cache/stats	None	Cache statistics
POST	/templates	API Key	Create extraction template
GET	/templates	API Key	List templates
GET	/templates/{id}	API Key	Get template
PUT	/templates/{id}	API Key	Update template
DELETE	/templates/{id}	API Key	Delete template
POST	/templates/{id}/use	API Key	Record template usage
GET	/templates/suggest/{url}	API Key	Auto-suggest templates
GET	/user/pinned	API Key	Get pinned results
POST	/user/pinned	API Key	Pin a result
DELETE	/user/pinned	API Key	Unpin a result
GET	/user/dashboard-config	API Key	Get dashboard settings
PUT	/user/dashboard-config	API Key	Update dashboard settings
POST	/bulk/upload	API Key	Upload URLs file for bulk scrape
GET	/history/logs	API Key	Your request history
GET	/history/stats	API Key	Your usage statistics

Configuration

Environment variables (prefix: SCRAPER_):

Variable	Default	Description
SCRAPER_DATABASE_URL	sqlite:///./data/scraper.db	Database connection string
SCRAPER_REDIS_URL	redis://localhost:6379/0	Redis connection for caching + jobs
SCRAPER_ADMIN_MASTER_KEY	(required)	Admin API master key
SCRAPER_LLM_API_KEY		API key for the LLM provider (OpenAI by default); required for intelligence features
SCRAPER_LLM_PROVIDER	openai/gpt-4o-mini	LLM provider and model (OpenAI, vLLM, Cloudflare)
SCRAPER_LLM_BASE_URL		Custom LLM endpoint
SCRAPER_PROXIES		Comma-separated proxy list
SCRAPER_RESIDENTIAL_PROXIES		Premium residential proxies
SCRAPER_DEFAULT_TIMEOUT	30	Request timeout (seconds)
SCRAPER_MAX_CONCURRENT	5	Max parallel scrapes
SCRAPER_ESCALATION_ENABLED	true	Enable engine fallback chain
SCRAPER_RATE_LIMIT_RPM	60	Global rate limit (requests/min)
SCRAPER_TAVILY_API_KEY		Tavily search API key
SCRAPER_SERPER_API_KEY		Serper search API key
SCRAPER_TINEYE_API_KEY		TinEye reverse image API key
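
As an illustration of how these variables and defaults fit together, the Python sketch below loads a subset of them into a typed settings object. The Settings class, the load_settings function, and the boolean parsing rule are assumptions of this example, not the platform's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    admin_master_key: str
    default_timeout: int
    max_concurrent: int
    escalation_enabled: bool
    rate_limit_rpm: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read SCRAPER_* variables, applying the defaults from the table above."""
    env = os.environ if env is None else env
    master_key = env.get("SCRAPER_ADMIN_MASTER_KEY")
    if not master_key:
        raise RuntimeError("SCRAPER_ADMIN_MASTER_KEY is required")
    enabled = env.get("SCRAPER_ESCALATION_ENABLED", "true").strip().lower()
    return Settings(
        database_url=env.get("SCRAPER_DATABASE_URL", "sqlite:///./data/scraper.db"),
        redis_url=env.get("SCRAPER_REDIS_URL", "redis://localhost:6379/0"),
        admin_master_key=master_key,
        default_timeout=int(env.get("SCRAPER_DEFAULT_TIMEOUT", "30")),
        max_concurrent=int(env.get("SCRAPER_MAX_CONCURRENT", "5")),
        escalation_enabled=enabled in {"1", "true", "yes"},  # parsing rule is assumed
        rate_limit_rpm=int(env.get("SCRAPER_RATE_LIMIT_RPM", "60")),
    )
```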

Version: 0.4.0