CrawlerScraper Documentation

Complete guide to the CrawlerScraper platform -- a full-featured web scraping, crawling, and data extraction system with LLM-powered intelligence and autonomous operation.

Getting Started

CrawlerScraper provides a unified API and web UI for extracting data from websites. All API requests require authentication via an API key.

Authentication

Include your API key in every request using the X-API-Key header:

curl -X POST https://provider.thefoundationreport.com/api/scrape \
  -H "X-API-Key: sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Get your API key from the API Keys page in the sidebar, or create one via the API:

POST /auth/register
{ "name": "My Name", "email": "you@example.com", "password": "..." }
# Returns: { "user": {...}, "api_key": "sk-..." }

Base URL

All API endpoints are available at: https://provider.thefoundationreport.com/api/

OpenAPI schema is at /api/docs and /api/openapi.json.
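
For Python clients, the header and base URL above can be wrapped in a small helper. This is a minimal sketch using only the standard library; the function name build_request is hypothetical, and the /scrape body used in the example is the smallest one the Scraping section suggests should work.

```python
import json
import urllib.request

BASE_URL = "https://provider.thefoundationreport.com/api"


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a JSON POST request that carries the X-API-Key header."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("/scrape", {"url": "https://example.com"}, "sk-your-key-here")
    print(req.get_method(), req.full_url)
    # Sending performs a real network call; uncomment to try it.
    # with urllib.request.urlopen(req, timeout=30) as resp:
    #     print(json.load(resp))
```

The helper only constructs the request, so you can inspect the URL and headers before sending anything.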

Scraping

Extract content from any URL. The system uses a 5-engine fallback chain with anti-bot detection to maximize success.

POST /scrape
{
  "url": "https://example.com",
  "output_format": "markdown",    // markdown | html | raw_html | text
  "engine": null,                  // auto-select, or: curl | crawl4ai | stealth | nodriver | camoufox
  "cache": true,                   // use Redis/DB cache (default true)
  "auto_detect": false,            // LLM-powered site analysis
  "classify": false,               // add content classification
  "normalize": false,              // normalize extracted data
  "download_images": false,
  "max_images": 20,
  "timeout": 30
}

Engine Escalation Chain

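| Engine | Type | Speed | JS Support | Anti-Bot |
|---|---|---|---|---|
| curl | HTTP client | Fastest | No | None |
| crawl4ai | Headless browser | Fast | Yes | Basic |
| stealth | Crawl4AI + magic | Medium | Yes | Good |
| nodriver | Undetected Chrome | Slow | Yes | Strong |
| camoufox | Firefox + fingerprint | Slowest | Yes | Strongest |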

When no engine is specified, the system tries the engines in the order listed above, starting with curl. If anti-bot protection is detected (Cloudflare, DataDome, PerimeterX), it automatically escalates to the next engine in the chain.
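
The escalation behaviour can be pictured as a loop over the engine list. The sketch below is illustrative only: the fetch callback, the looks_blocked heuristic, and the choice to treat an engine error like a block are all assumptions, not the platform's actual detection logic.

```python
from typing import Callable, Optional

ENGINES = ["curl", "crawl4ai", "stealth", "nodriver", "camoufox"]
ANTI_BOT_MARKERS = ("cloudflare", "datadome", "perimeterx")  # assumed heuristic


def looks_blocked(html: str) -> bool:
    """Crude stand-in for anti-bot detection: look for vendor names in the page."""
    lowered = html.lower()
    return any(marker in lowered for marker in ANTI_BOT_MARKERS)


def scrape_with_escalation(
    url: str,
    fetch: Callable[[str, str], str],
    engine: Optional[str] = None,
) -> tuple[str, str]:
    """Return (engine_used, html); escalate only when no engine is pinned."""
    chain = [engine] if engine else ENGINES
    for name in chain:
        try:
            html = fetch(name, url)
        except Exception:
            continue  # assumption: a failed engine is treated like a blocked one
        if not looks_blocked(html):
            return name, html
    raise RuntimeError(f"all engines failed or were blocked for {url}")
```

Passing the fetch function in as a parameter keeps the loop testable without a browser or network access.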

Caching

Results are cached in Redis (1h TTL) and PostgreSQL (24h TTL). A repeat request for the same URL within the TTL is served from cache without re-scraping. Set cache: false to bypass the cache. Check cache statistics at GET /cache/stats.
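
A two-tier lookup of this kind can be sketched with plain dictionaries standing in for Redis and the database. The class and function names are hypothetical, and whether a cache: false request still writes its result back to the cache is an assumption made here for simplicity.

```python
import time
from typing import Callable, Optional

REDIS_TTL = 60 * 60       # 1 hour, as stated above
DB_TTL = 24 * 60 * 60     # 24 hours, as stated above


class TwoTierCache:
    """In-memory stand-in: `fast` plays Redis, `slow` plays the database."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.fast: dict[str, tuple[float, str]] = {}
        self.slow: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        now = self.clock()
        # Check the short-lived tier first, then fall back to the long-lived one.
        for tier, ttl in ((self.fast, REDIS_TTL), (self.slow, DB_TTL)):
            entry = tier.get(url)
            if entry is not None and now - entry[0] < ttl:
                return entry[1]
        return None

    def put(self, url: str, content: str) -> None:
        stamped = (self.clock(), content)
        self.fast[url] = stamped
        self.slow[url] = stamped


def scrape_cached(url: str, fetch: Callable[[str], str],
                  cache: TwoTierCache, use_cache: bool = True) -> str:
    """Serve from cache when allowed; otherwise fetch and store the result."""
    if use_cache:
        hit = cache.get(url)
        if hit is not None:
            return hit
    content = fetch(url)
    cache.put(url, content)  # assumption: bypassed requests still refresh the cache
    return content
```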

Extraction

Add an extraction field to the scrape request to extract structured data:

"extraction": {
  "strategy": "css",          // css | xpath | llm | auto
  "schema_def": {             // for CSS strategy
    "base_selector": "div.product",
    "fields": [
      {"name": "title", "selector": "h2", "type": "text"},
      {"name": "price", "selector": ".price", "type": "text"},
      {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
  }
}
// Or use "auto" strategy -- LLM infers the schema for you

Crawling

Crawl multiple pages starting from a URL. Supports BFS traversal and LLM-powered intelligent crawling.

POST /crawl
{
  "start_url": "https://example.com",
  "max_pages": 10,            // 1-100
  "max_depth": 2,             // 1-10
  "url_pattern": null,        // regex filter for URLs
  "same_domain_only": true,
  "respect_robots_txt": true,
  "use_sitemap": false,       // seed queue from sitemap.xml
  "crawl_goal": null          // enables LLM-powered link scoring
}

Intelligent Crawling

When crawl_goal is set, the crawler uses a priority queue instead of BFS. An LLM scores each link by relevance (0-1) to your goal, and the crawler follows the highest-scored links first. Low-relevance links (below 0.1) are skipped entirely.
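
The priority-queue traversal described above can be sketched with heapq. Here get_links and score_link are hypothetical callbacks standing in for page fetching and LLM scoring, and depth limits are left out to keep the example short.

```python
import heapq
from typing import Callable, Iterable

MIN_RELEVANCE = 0.1  # links scoring below this are skipped


def crawl_by_priority(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    score_link: Callable[[str], float],
    max_pages: int = 10,
) -> list[str]:
    """Visit pages highest-score first and return them in visit order."""
    # heapq is a min-heap, so scores are negated to pop the best link first.
    queue: list[tuple[float, str]] = [(-1.0, start_url)]
    seen = {start_url}
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        _, url = heapq.heappop(queue)
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            score = score_link(link)
            if score >= MIN_RELEVANCE:
                heapq.heappush(queue, (-score, link))
    return visited
```

With BFS the crawler would visit every first-level link before going deeper; here a highly relevant second-level page can be visited before a weakly relevant first-level one.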

Sitemap Support

Set use_sitemap: true to discover pages from robots.txt sitemaps and sitemap.xml. Sitemap URLs are seeded into the crawl queue alongside link-discovered pages.
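
Sitemap discovery of this kind usually involves two parsing steps: reading Sitemap: lines from robots.txt and collecting loc entries from the sitemap XML. The sketch below shows both with the standard library; the function names are hypothetical and fetching the files is left to the caller.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs declared on `Sitemap:` lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

A sitemap index also uses loc elements, so the same function returns the child sitemap URLs, which can then be fetched and parsed in turn.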

Search

Search the web using multiple providers, optionally scraping each result page for full content.

POST /search
{
  "query": "web scraping best practices",
  "provider": null,            // auto-select, or: duckduckgo | tavily | serper
  "search_type": "text",      // text | news | images
  "max_results": 10,
  "region": null,              // e.g. "us-en"
  "time_range": null,          // day | week | month | year
  "scrape_results": false      // scrape each result page for full content
}

DuckDuckGo is free and requires no API key. Tavily and Serper require keys set in environment variables.

Research

Autonomous multi-step research that searches, scrapes, and extracts structured data to fill your schema.

POST /research
{
  "goal": "Compare top 5 CRM platforms for small businesses",
  "output_schema": {
    "name": "string",
    "pricing": "string",
    "features": ["string"],
    "rating": "number",
    "source_url": "string"
  },
  "strategy": "c",            // a = deterministic, b = LLM-guided, c = hybrid
  "max_results_per_query": 5,
  "max_iterations": 3          // strategy B only
}

| Strategy | Speed | Cost | Quality | How It Works |
|---|---|---|---|---|
| A (Deterministic) | Fastest | Lowest | Basic | Goal as query, LLM only for extraction |
| B (LLM-Guided) | Slowest | Highest | Best | LLM generates queries, scores relevance, iterates |
| C (Hybrid) | Medium | Medium | Good | LLM generates queries, deterministic pipeline |

Reverse Image Search

Find where an image appears online using Google Lens, Yandex, and TinEye.

POST /reverse-image-search  (multipart/form-data)
Fields:
  image_url: "https://..."    // OR
  image_file: <file>          // OR
  image_base64: "data:..."
  providers: "google,yandex"  // google | yandex | tineye
  max_results: 10

Intelligence (LLM-Powered)

These features require SCRAPER_LLM_API_KEY to be configured.

URL Analysis

Auto-detect site type and optimal scraping strategy. Uses a 25+ site registry (Amazon, GitHub, Reddit, etc.) with LLM fallback.

POST /intelligence/analyze-url
{ "url": "https://amazon.com/dp/B09V3KXJPB" }
// Returns: { site_type: "ecommerce", recommended_engine: "crawl4ai", confidence: "high" }

Schema Inference

Auto-generate extraction schemas from page content:

POST /intelligence/infer-schema
{ "url": "https://example.com/products", "goal": "extract product data" }
// Returns suggested ExtractionConfig + sample data preview

Natural Language Commands

Express scraping intent in plain English:

POST /intelligence/nl
{ "command": "scrape all product prices from books.toscrape.com" }
// Returns: { request_type: "scrape", request: {...}, explanation: "..." }

POST /intelligence/nl/execute    // Parse AND execute in one call
{ "command": "search for AI startups in Europe" }

Content Classification

Set classify: true on scrape requests to get category, tags, language, sentiment, and entity extraction in the response.

Data Normalization

Set normalize: true to standardize extracted prices, dates, addresses, and phone numbers into consistent formats.

Chat Assistant

Interactive scraping sessions via conversation. The chat maintains context across messages -- scrape a URL, then ask to extract data from it, then export.

POST /chat
{ "message": "Scrape https://example.com", "session_id": null }
// Returns: { session_id: "uuid", message: "...", tool_used: "scrape", tool_result: {...} }

Available tools in chat: scrape, crawl, search, extract, export, analyze, infer_schema.

Sessions are stored in Redis with a 30-minute TTL. Retrieve history via GET /chat/{session_id}/history.

Autonomous Research Agents

Agents plan multi-step research, execute it, adapt on failure, and synthesize findings into a final report.

POST /agents/research
{
  "goal": "Comprehensive comparison of 5 CRM platforms",
  "output_format": "report",   // structured | report | comparison_table
  "budget": {
    "max_pages": 50,
    "max_llm_calls": 20,
    "max_time_minutes": 10
  }
}
// Returns: { job_id: "uuid", status: "pending" }
// Poll GET /jobs/{job_id} or use /jobs/{job_id}/stream for SSE

The agent runs in four phases: Planning (the LLM generates a plan of 3-8 steps), Execution (search, scrape, extract), Adaptation (re-plan on failures), and Synthesis (consolidate findings into a report with key findings).
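
Because the agent endpoint returns a job_id immediately, a client typically polls until the job finishes. A minimal polling sketch follows; get_job is a hypothetical callback that performs GET /jobs/{job_id} and returns the decoded JSON, and the terminal statuses are the ones listed under Background Jobs.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    job_id: str,
    get_job: Callable[[str], dict],
    interval: float = 2.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status, then return its last payload."""
    for _ in range(max_polls):
        job = get_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

The default of 300 polls at 2 seconds matches the 10-minute budget in the example request; adjust both to your own max_time_minutes.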

Background Jobs

Long-running tasks (crawl, research, agent) run as background jobs via the arq worker queue.

| Endpoint | Description |
|---|---|
| POST /jobs/crawl | Submit background crawl job |
| POST /jobs/research | Submit background research job |
| POST /jobs/bulk-scrape | Scrape multiple URLs in background |
| GET /jobs | List your jobs (filter by status) |
| GET /jobs/{id} | Get job status + progress |
| DELETE /jobs/{id} | Cancel a running job |
| GET /jobs/{id}/stream | SSE progress stream |
| GET /jobs/{id}/export?format=csv | Export job results |

Job statuses: pending (queued), running (in progress), completed, failed, cancelled.
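
For the SSE stream, a client needs to split the response into events. The sketch below parses standard Server-Sent Events framing from an iterable of text lines; it assumes each event's data field holds a JSON object, which this documentation does not confirm, and the status and progress keys in the usage note are likewise illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event; a blank line ends an event."""
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format allows one optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Comment lines (":...") and other fields (event:, id:) are ignored here.
```

A caller could feed this the decoded lines of the /jobs/{id}/stream response and stop once an event reports a terminal status.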

Schedules & Change Detection

Create recurring scrape schedules with automatic content change detection.

POST /schedules
{
  "name": "Monitor pricing page",
  "schedule_type": "scrape",   // scrape | crawl | research
  "cron_expression": "0 * * * *",   // every hour
  "request_data": { "url": "https://example.com/pricing", "output_format": "text" }
}

After each run, the system compares content hashes with the previous snapshot. If changed, a diff is generated and a change_detected event is dispatched to webhooks.
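
The hash-then-diff step can be illustrated as follows. The choice of SHA-256 and a unified diff is an assumption made for this sketch, as is the shape of the returned event; the documentation does not specify either.

```python
import difflib
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Fingerprint a snapshot (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_change(previous: str, current: str) -> Optional[dict]:
    """Return a change_detected event with a unified diff, or None if unchanged."""
    if content_hash(previous) == content_hash(current):
        return None
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return {"event": "change_detected", "diff": "\n".join(diff)}
```

Comparing hashes first means the diff is only computed when something actually changed, which keeps hourly schedules cheap on stable pages.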

| Endpoint | Description |
|---|---|
| POST /schedules | Create schedule |
| GET /schedules | List all schedules |
| PUT /schedules/{id} | Update (name, cron, pause/resume) |
| DELETE /schedules/{id} | Delete schedule |
| GET /schedules/{id}/snapshots | View change history |
| GET /schedules/{id}/diff/{snap_id} | View specific diff |

Common cron expressions: 0 * * * * (hourly), 0 9 * * * (daily 9am), 0 9 * * 1 (weekly Monday), */30 * * * * (every 30 min).
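
To check what a cron expression will do before saving a schedule, a small matcher is enough. This sketch handles only the forms used above (a literal number, comma lists, `*`, and `*/n`), not ranges or names, and it is not the scheduler the platform uses.

```python
from datetime import datetime


def _field_matches(field: str, value: int, low: int) -> bool:
    """Match one cron field; `low` is the smallest legal value for that field."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return (value - low) % int(field[2:]) == 0
    return value in {int(part) for part in field.split(",")}


def cron_matches(expression: str, moment: datetime) -> bool:
    """True if `moment` falls on a minute selected by the 5-field expression."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0; Python's weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    checks = [
        (minute, moment.minute, 0),
        (hour, moment.hour, 0),
        (day, moment.day, 1),
        (month, moment.month, 1),
        (weekday, cron_weekday, 0),
    ]
    return all(_field_matches(field, value, low) for field, value, low in checks)
```

For example, cron_matches("0 9 * * 1", datetime(2024, 1, 1, 9, 0)) is true because 1 January 2024 was a Monday.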

Data Pipelines

Route scraped data to external destinations automatically.

| Destination | Config |
|---|---|
| Webhook | {"url": "https://..."} |
| S3 / MinIO | {"bucket": "...", "access_key": "...", "secret_key": "..."} |
| Database | {"connection_url": "postgresql+asyncpg://...", "table": "..."} |
| Google Sheets | {"spreadsheet_id": "...", "service_account": {...}} |

Pipelines support transforms: field mapping (field_map), field filtering (include_fields, exclude_fields), and row filtering (filters with operators eq/ne/gt/lt/contains).
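
The three transform types can be pictured as a small function over a list of rows. The config keys come from the paragraph above, but the filter shape ({"field", "op", "value"}) and the order of operations (filter, then rename, then include/exclude) are assumptions made for this sketch.

```python
OPERATORS = {
    "eq": lambda a, b: a == b,
    "ne": lambda a, b: a != b,
    "gt": lambda a, b: a > b,
    "lt": lambda a, b: a < b,
    "contains": lambda a, b: b in a,
}


def apply_transforms(rows: list[dict], config: dict) -> list[dict]:
    """Filter rows, rename fields via field_map, then include/exclude fields."""
    field_map = config.get("field_map", {})
    include = config.get("include_fields")          # None means keep everything
    exclude = set(config.get("exclude_fields", []))
    filters = config.get("filters", [])
    output = []
    for row in rows:
        keep = all(
            f["field"] in row and OPERATORS[f["op"]](row[f["field"]], f["value"])
            for f in filters
        )
        if not keep:
            continue
        renamed = {field_map.get(key, key): value for key, value in row.items()}
        output.append({
            key: value
            for key, value in renamed.items()
            if (include is None or key in include) and key not in exclude
        })
    return output
```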

POST /pipelines/{id}/trigger     // Manual trigger with data
{ "data": [{"name": "test", "value": 42}] }

Webhooks & Notifications

Receive HTTP callbacks when events occur.

POST /webhooks
{
  "url": "https://your-server.com/hook",
  "events": ["change_detected", "job_completed", "job_failed"]
}

Webhook payloads are signed with HMAC-SHA256. Verify using the X-Webhook-Signature header (format: sha256=hex_digest) and the secret returned on creation.
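
On the receiving side, verification might look like the sketch below. It assumes the signature is computed over the raw request body with the webhook secret as the HMAC key; confirm this against a real delivery before relying on it.

```python
import hashlib
import hmac


def sign_payload(secret: str, body: bytes) -> str:
    """Compute the expected header value in the sha256=hex_digest format."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: str, body: bytes, header_value: str) -> bool:
    """Compare the received X-Webhook-Signature value in constant time."""
    expected = sign_payload(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, header_value.encode("utf-8"))
```

Verify against the exact bytes received, before any JSON parsing or re-serialization, since even a whitespace change alters the digest.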

Failed deliveries are retried up to 3 times. Check delivery status via GET /notifications.

Export & Batch Operations

Export Data

POST /export
{
  "format": "csv",     // json | csv | excel | ndjson
  "data": [{"col1": "val1", "col2": "val2"}],
  "filename": "my-data.csv"
}
// Returns file download

Batch Scrape

POST /v1/scrape/batch
{
  "urls": ["https://example.com", "https://httpbin.org/html"],
  "output_format": "text",
  "max_concurrent": 5
}
// Returns: { total: 2, success: 2, failed: 0, results: [...] }

Extraction Templates

Save extraction schemas for reuse across scrapes. Templates can be matched to URLs via regex patterns.

POST /templates
{
  "name": "Product scraper",
  "description": "Extract titles and prices from product listings",
  "strategy": "css",
  "config": {
    "schema_def": {
      "base_selector": "div.product",
      "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
      ]
    }
  },
  "url_pattern": "books\\.toscrape\\.com"    // regex for auto-suggest (backslashes escaped for JSON)
}

| Endpoint | Description |
|---|---|
| POST /templates | Create a template |
| GET /templates | List your templates (filter by strategy) |
| GET /templates/{id} | Get template details |
| PUT /templates/{id} | Update template |
| DELETE /templates/{id} | Delete template |
| POST /templates/{id}/use | Increment usage counter |
| GET /templates/suggest/{url} | Find templates matching a URL |

When scraping, use a template's config as the extraction field in your scrape request.
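
The matching and reuse steps can be sketched as below. Two points are assumptions: that url_pattern is matched anywhere in the URL (re.search rather than a full match), and that the template's strategy is merged with its config to form the extraction field.

```python
import re


def suggest_templates(url: str, templates: list[dict]) -> list[dict]:
    """Return templates whose url_pattern regex is found anywhere in the URL."""
    matches = []
    for template in templates:
        pattern = template.get("url_pattern")
        if pattern and re.search(pattern, url):
            matches.append(template)
    return matches


def scrape_request_from_template(url: str, template: dict) -> dict:
    """Build a /scrape body that reuses the template's strategy and config."""
    extraction = {"strategy": template["strategy"], **template["config"]}
    return {"url": url, "extraction": extraction}
```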

Bulk URL Upload

Upload a file containing URLs to scrape them all at once.

POST /bulk/upload  (multipart/form-data)
Fields:
  file: <CSV, JSON, or TXT file>
  output_format: "markdown"          // markdown | html | text
  mode: "sync"                       // sync (up to 50 inline) | async (up to 500 as job)

Supported Formats

| Format | How URLs Are Detected |
|---|---|
| CSV | Any cell containing https:// is extracted |
| JSON | Array of strings, or array of objects with "url" field |
| TXT | One URL per line (https://...) |

Sync mode returns results inline (capped at 50 URLs). Async mode creates a background job trackable on the Jobs page.
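
To preview which URLs a file will yield before uploading it, the detection rules above can be approximated locally. This sketch follows the table as written; the server's actual parser may differ in edge cases such as quoting or http:// URLs.

```python
import csv
import io
import json


def extract_urls(filename: str, text: str) -> list[str]:
    """Apply the per-format URL detection rules to a file's text content."""
    suffix = filename.rsplit(".", 1)[-1].lower()
    if suffix == "csv":
        cells = (cell.strip() for row in csv.reader(io.StringIO(text)) for cell in row)
        return [cell for cell in cells if "https://" in cell]
    if suffix == "json":
        items = json.loads(text)
        # Each item is either a bare string or an object with a "url" field.
        return [item if isinstance(item, str) else item["url"] for item in items]
    if suffix == "txt":
        lines = (line.strip() for line in text.splitlines())
        return [line for line in lines if line.startswith("https://")]
    raise ValueError(f"unsupported file type: {filename}")
```

Counting the result also tells you whether the file fits sync mode (up to 50 URLs) or needs async mode (up to 500).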

Error Codes

All API errors return JSON with a detail field explaining the issue.

| Code | Name | When It Happens | What To Do |
|---|---|---|---|
| 200 | OK | Request succeeded | Process the response normally |
| 401 | Unauthorized | Missing, invalid, or deactivated API key | Check X-API-Key header; create a new key if expired |
| 403 | Forbidden | API key doesn't have access to this resource | Verify the resource belongs to your account |
| 404 | Not Found | Resource (job, schedule, template) doesn't exist | Check the ID; the resource may have been deleted |
| 422 | Validation Error | Request body has invalid fields | Check the detail field for which parameter failed validation |
| 429 | Rate Limited | Too many requests per minute, or daily quota exceeded | Wait for Retry-After seconds; upgrade quota if needed |
| 500 | Internal Error | Server-side bug or unexpected failure | Retry once; report if persistent |
| 502 | Bad Gateway | Target website unreachable or returned an error | The scrape target may be down; try a different engine |
| 503 | Service Unavailable | Redis or job queue is down | Infrastructure issue; wait and retry |

Validation Error Details

422 responses include specific field-level errors:

{
  "detail": [
    {
      "loc": ["body", "url"],
      "msg": "URLs targeting private/internal networks are not allowed",
      "type": "value_error"
    }
  ]
}

Common Validation Rules

| Field | Rule |
|---|---|
| url / start_url | Max 4096 chars; no private IPs (127.*, 10.*, 192.168.*) |
| js_code | Max 5000 chars; no require(), import, process.*, eval() |
| timeout | 1-120 seconds |
| max_pages | 1-100 |
| max_depth | 1-10 |
| max_results | 1-50 |
| query | 1-500 characters |
| crawl_goal | Max 2000 characters |
| max_images | 0-100 |
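
Several of these rules are easy to check on the client before sending a request, which avoids a round trip that would end in a 422. The sketch below covers the numeric ranges and the private-IP rule for literal IP addresses only; it does not resolve hostnames, and the error strings are illustrative rather than the server's own.

```python
import ipaddress
from urllib.parse import urlsplit

# Inclusive ranges copied from the table above.
RANGES = {
    "timeout": (1, 120),
    "max_pages": (1, 100),
    "max_depth": (1, 10),
    "max_results": (1, 50),
    "max_images": (0, 100),
}


def _is_private_host(url: str) -> bool:
    host = urlsplit(url).hostname or ""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False  # a hostname, not an IP literal; DNS is not resolved here


def validate_request(body: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    for key in ("url", "start_url"):
        url = body.get(key)
        if url is None:
            continue
        if len(url) > 4096:
            errors.append(f"{key}: longer than 4096 characters")
        if _is_private_host(url):
            errors.append(f"{key}: private or internal address")
    for key, (low, high) in RANGES.items():
        if key in body and not low <= body[key] <= high:
            errors.append(f"{key}: must be between {low} and {high}")
    return errors
```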

Rate Limiting

When rate limited (429), the response includes a Retry-After header with the number of seconds to wait. Per-key limits are configurable by admins via the API key rate_limit_rpm and daily_quota fields.
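
A client can honour Retry-After with a short retry loop. In this sketch, send is a hypothetical callback returning (status, headers, body); the cap of three attempts and the one-second fallback when the header is missing are arbitrary choices, not documented behaviour.

```python
import time
from typing import Any, Callable


def send_with_retry(
    send: Callable[[], tuple[int, dict, Any]],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Call `send` until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            break
        # Wait as long as the server asked; fall back to 1 second if absent.
        sleep(float(headers.get("Retry-After", 1)))
    return status, body
```

If the daily quota is exhausted rather than the per-minute limit, retrying will keep returning 429, so the final status is returned to the caller instead of looping forever.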

API Reference

All endpoints require the X-API-Key header unless the Auth column says None. Admin endpoints use the master key.

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | None | Engine availability check |
| POST | /auth/register | None | Create account + API key |
| POST | /auth/login | None | Login, get session key |
| GET | /auth/me | API Key | Current user info |
| POST | /auth/keys | API Key | Create new API key |
| POST | /scrape | API Key | Scrape a single URL |
| POST | /crawl | API Key | Crawl from a starting URL |
| POST | /search | API Key | Web search |
| POST | /reverse-image-search | API Key | Reverse image search |
| POST | /research | API Key | Autonomous research |
| POST | /intelligence/nl | API Key | Parse NL command |
| POST | /intelligence/nl/execute | API Key | Parse + execute NL command |
| POST | /intelligence/analyze-url | API Key | Analyze URL type |
| POST | /intelligence/infer-schema | API Key | Auto-generate schema |
| POST | /intelligence/classify | API Key | Classify URL content |
| POST | /chat | API Key | Chat message |
| GET | /chat/{id}/history | API Key | Chat session history |
| POST | /agents/research | API Key | Autonomous agent (async) |
| POST | /agents/research/sync | API Key | Autonomous agent (sync) |
| POST | /jobs/crawl | API Key | Submit crawl job |
| POST | /jobs/research | API Key | Submit research job |
| POST | /jobs/bulk-scrape | API Key | Submit bulk scrape job |
| GET | /jobs | API Key | List jobs |
| GET | /jobs/{id} | API Key | Job status + progress |
| DELETE | /jobs/{id} | API Key | Cancel job |
| GET | /jobs/{id}/stream | API Key | SSE progress stream |
| POST | /schedules | API Key | Create schedule |
| GET | /schedules | API Key | List schedules |
| PUT | /schedules/{id} | API Key | Update schedule |
| DELETE | /schedules/{id} | API Key | Delete schedule |
| GET | /schedules/{id}/snapshots | API Key | Snapshot history |
| GET | /schedules/{id}/diff/{snap} | API Key | View diff |
| GET | /dashboard/overview | API Key | Dashboard stats |
| GET | /dashboard/activity | API Key | Activity time series |
| GET | /dashboard/changes | API Key | Recent changes |
| GET | /dashboard/engine-health | API Key | Engine success rates |
| GET | /dashboard/jobs-queue | API Key | Job queue status |
| POST | /webhooks | API Key | Register webhook |
| GET | /webhooks | API Key | List webhooks |
| DELETE | /webhooks/{id} | API Key | Delete webhook |
| GET | /notifications | API Key | Notification log |
| POST | /pipelines | API Key | Create pipeline |
| GET | /pipelines | API Key | List pipelines |
| PUT | /pipelines/{id} | API Key | Update pipeline |
| DELETE | /pipelines/{id} | API Key | Delete pipeline |
| POST | /pipelines/{id}/trigger | API Key | Manual trigger |
| POST | /export | API Key | Export data as file |
| GET | /jobs/{id}/export | API Key | Export job results |
| POST | /v1/scrape/batch | API Key | Batch scrape URLs |
| GET | /cache/stats | None | Cache statistics |
| POST | /templates | API Key | Create extraction template |
| GET | /templates | API Key | List templates |
| GET | /templates/{id} | API Key | Get template |
| PUT | /templates/{id} | API Key | Update template |
| DELETE | /templates/{id} | API Key | Delete template |
| POST | /templates/{id}/use | API Key | Record template usage |
| GET | /templates/suggest/{url} | API Key | Auto-suggest templates |
| GET | /user/pinned | API Key | Get pinned results |
| POST | /user/pinned | API Key | Pin a result |
| DELETE | /user/pinned | API Key | Unpin a result |
| GET | /user/dashboard-config | API Key | Get dashboard settings |
| PUT | /user/dashboard-config | API Key | Update dashboard settings |
| POST | /bulk/upload | API Key | Upload URLs file for bulk scrape |
| GET | /history/logs | API Key | Your request history |
| GET | /history/stats | API Key | Your usage statistics |

Configuration

Environment variables (prefix: SCRAPER_):

| Variable | Default | Description |
|---|---|---|
| SCRAPER_DATABASE_URL | sqlite:///./data/scraper.db | Database connection string |
| SCRAPER_REDIS_URL | redis://localhost:6379/0 | Redis connection for caching + jobs |
| SCRAPER_ADMIN_MASTER_KEY | (required) | Admin API master key |
| SCRAPER_LLM_API_KEY | | OpenAI key for intelligence features |
| SCRAPER_LLM_PROVIDER | openai/gpt-4o-mini | LLM model (OpenAI, vLLM, Cloudflare) |
| SCRAPER_LLM_BASE_URL | | Custom LLM endpoint |
| SCRAPER_PROXIES | | Comma-separated proxy list |
| SCRAPER_RESIDENTIAL_PROXIES | | Premium residential proxies |
| SCRAPER_DEFAULT_TIMEOUT | 30 | Request timeout (seconds) |
| SCRAPER_MAX_CONCURRENT | 5 | Max parallel scrapes |
| SCRAPER_ESCALATION_ENABLED | true | Enable engine fallback chain |
| SCRAPER_RATE_LIMIT_RPM | 60 | Global rate limit (requests/min) |
| SCRAPER_TAVILY_API_KEY | | Tavily search API key |
| SCRAPER_SERPER_API_KEY | | Serper search API key |
| SCRAPER_TINEYE_API_KEY | | TinEye reverse image API key |
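
To show how the prefix and defaults fit together, here is a sketch of a settings loader covering a subset of the variables. The Settings class and load_settings function are hypothetical; the defaults are taken from the table, and the accepted spellings for boolean values are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    admin_master_key: str
    llm_api_key: Optional[str]
    default_timeout: int
    max_concurrent: int
    escalation_enabled: bool
    rate_limit_rpm: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read SCRAPER_-prefixed variables, applying the documented defaults."""

    def get(name: str, default: Optional[str] = None) -> Optional[str]:
        return env.get(f"SCRAPER_{name}", default)

    master_key = get("ADMIN_MASTER_KEY")
    if not master_key:
        raise RuntimeError("SCRAPER_ADMIN_MASTER_KEY is required")
    return Settings(
        database_url=get("DATABASE_URL", "sqlite:///./data/scraper.db"),
        redis_url=get("REDIS_URL", "redis://localhost:6379/0"),
        admin_master_key=master_key,
        llm_api_key=get("LLM_API_KEY"),
        default_timeout=int(get("DEFAULT_TIMEOUT", "30")),
        max_concurrent=int(get("MAX_CONCURRENT", "5")),
        # Assumed truthy spellings; the real parser may accept others.
        escalation_enabled=get("ESCALATION_ENABLED", "true").lower() in ("1", "true", "yes"),
        rate_limit_rpm=int(get("RATE_LIMIT_RPM", "60")),
    )
```

Passing a plain dictionary instead of os.environ makes it easy to check a deployment's configuration in a test.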

Version: 0.4.0