★★★★★ 5.0 on Capterra

AI training data scraping API. Bulk corpus collection for fine-tuning.

The data infrastructure under your LLM training pipeline. Bulk fetch open-web pages at datacenter speed, return LLM-clean Markdown (boilerplate stripped) or structured JSON (custom schema). Not a curated dataset marketplace — bring your URL list, we hand back tokenizer-ready text.

LLM-clean Markdown · failed requests cost 0 · ~10B tokens/month on Business plan

# LLM-clean Markdown — boilerplate stripped, ready for tokenization.
$ curl -G 'https://api.scrapingant.com/v2/markdown' \
    --data-urlencode 'x-api-key=YOUR_KEY' \
    --data-urlencode 'url=https://example.com/long-form-article'
# → { "url": "…", "markdown": "# Heading\n\n…" }
# Stream a URL list into a JSONL training set.
import requests, json

source_urls = ["https://example.com/article-1", "https://example.com/article-2"]

with open("corpus.jsonl", "a") as f:
    for url in source_urls:
        r = requests.get("https://api.scrapingant.com/v2/markdown", params={
            "x-api-key": "YOUR_KEY",
            "url": url,
        }, timeout=120)
        if r.ok:
            f.write(json.dumps({
                "text": r.json()["markdown"],
                "source": url,
            }) + "\n")
# Load with: datasets.load_dataset('json', data_files='corpus.jsonl')
# Turn FAQ pages into instruction-tuning pairs — list syntax for multiple Q&As.
$ curl -G 'https://api.scrapingant.com/v2/extract' \
    --data-urlencode 'x-api-key=YOUR_KEY' \
    --data-urlencode 'url=https://example.com/faq' \
    --data-urlencode 'extract_properties=faqs(list: question, answer)'
# → { "faqs": [{ "question": "…", "answer": "…" }, …] } ready for SFT.
Three formats, three training shapes

Markdown, JSON, HTML. One per training shape.

Different training stages need different shapes. /v2/markdown for pre-training corpora — boilerplate stripped, headings preserved, ready for the tokenizer. /v2/extract for instruction-tuning datasets — pass a schema like question, answer or title, body, label and get JSONL-ready records back. /v2/general returns raw rendered HTML when your downstream pipeline does its own parsing (rare in modern LLM stacks, but supported).

  • Same key, switch endpoint by path — no separate plan per output format
  • Markdown output is deterministic — sha256 dedupe catches exact duplicates before you spend tokenizer compute
  • JSONL-friendly — pipe /v2/markdown straight into datasets.load_dataset
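Before handing the JSONL off to `datasets.load_dataset`, a stdlib-only shape check can catch malformed records early. This is a sketch, not part of the API: `validate_jsonl` is a hypothetical helper, and it assumes one `{"text", "source"}` object per line, as produced by the Python example above.

```python
import json

# Verify every line is a standalone JSON object with a "text" field,
# the shape datasets.load_dataset('json', ...) expects for a text corpus.
def validate_jsonl(lines):
    records = [json.loads(line) for line in lines]
    bad = sum(1 for r in records if "text" not in r)
    if bad:
        raise ValueError(f"{bad} records missing 'text'")
    return records

sample = ['{"text": "# Heading\\n\\nBody", "source": "https://example.com/a"}']
records = validate_jsonl(sample)
print(records[0]["source"])  # https://example.com/a
```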
| URL | Result | Credits |
| --- | --- | --- |
| example.com/article-A | 200 · 2,124 tokens | 1 |
| example.com/article-B | 404 · not found | 0 |
| example.com/article-C | 200 · 3,402 tokens | 1 |
| example.com/article-D | timeout · retry queue | 0 |

4 fetches · 2 succeeded · total: 2 credits — only successful fetches bill; retries are free.
Failed crawls don't bloat budgets

Million-page crawls. Only successes bill.

Bulk-corpus crawls hit dead URLs constantly — sites disappear, articles get pulled, paywalls swap in. ScrapingAnt charges zero credits for failed requests: DNS errors, timeouts, 4xx/5xx responses never bill. That keeps the per-token cost predictable when your training pipeline needs a million pages and the source list is 40% stale. Datacenter proxies handle the burst concurrency; failed fetches don't inflate the bill.

  • Failed = 0 credits — retries are free across the whole pipeline
  • 1 credit per static fetch — Business Pro ($599 / 8M credits) covers ~8M pages/month
  • Burst concurrency without per-key caps — your worker pool sets the throughput
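The billing rule above reduces to simple arithmetic. A hypothetical accounting sketch, assuming HTTP 2xx/3xx counts as success and `None` stands for a network-level failure (DNS error, timeout):

```python
def credits_billed(statuses):
    """1 credit per successful static fetch; failed requests bill 0."""
    return sum(1 for s in statuses if s is not None and 200 <= s < 400)

assert credits_billed([200, 404, 200, None, 503]) == 2

# The 40%-stale scenario from the text: a million-page source list
# only spends credits on the ~600K URLs that still resolve.
pages, stale = 1_000_000, 0.40
assert int(pages * (1 - stale)) == 600_000
```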
Request: extract_properties=faqs(list: question, answer) → /v2/extract response, 200 OK: { "faqs": [{ "question": "How do I integrate…", "answer": "Use the API key …" }, …] } — free-form schema, JSON out, no per-site parsers.
Free-form schema · /v2/extract

Free-form schema. Parser-free JSON.

/v2/extract takes a free-form extract_properties string — a comma-separated list of fields you want from the page — and returns matching JSON. Pass title, content for article corpora, product title, price(number), reviews(list: review title, content) for catalog mining, or faqs(list: question, answer) to turn FAQ pages into SFT pairs. No XPath, no selectors, no per-site parsers; the AI maps your fields onto the page's Markdown.

  • Nested list: syntax for repeated structures — pull every Q/A, every product, every comment
  • Field names come straight back as JSON keys — your schema is your contract
  • Same key, same credit pool, same fetch+render pipeline as /v2/markdown and /v2/general
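To make the SFT hand-off concrete, here is a sketch that reshapes a /v2/extract FAQ response into chat-style training records. The response dict is hypothetical but matches the shape shown above; the `messages` layout is one common SFT convention, not the only one.

```python
# Hypothetical /v2/extract response for
# extract_properties = "faqs(list: question, answer)"
response = {
    "faqs": [
        {"question": "How do I authenticate?", "answer": "Pass x-api-key."},
        {"question": "What formats are returned?", "answer": "Markdown, JSON, HTML."},
    ]
}

# One chat-style SFT record per Q/A pair.
sft_records = [
    {"messages": [
        {"role": "user", "content": faq["question"]},
        {"role": "assistant", "content": faq["answer"]},
    ]}
    for faq in response["faqs"]
]
print(len(sft_records))  # 2
```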

Six training workloads teams build on top.

Same API, same credit pool — different ways of slicing the open web into training data.

Pre-training corpus collection

Bulk fetch open-web pages into LLM-clean Markdown. Stream into JSONL for HuggingFace `datasets`, then tokenize. Boilerplate already stripped — no `BeautifulSoup` + readability post-pass.

Instruction-tuning Q&A pairs

FAQ pages, support knowledge bases, regulatory Q&A — turn them into `{question, answer}` JSONL with /v2/extract. Direct SFT input, no manual annotation.

Domain-specific fine-tuning

Legal opinions, medical guidelines, financial filings, scientific abstracts — narrow corpora that public LLMs underweight. Bring the URL list, get the corpus.

RAG knowledge-base ingestion

Crawl documentation, product manuals, public knowledge bases. Markdown chunks land directly in a vector store — no parsing, no chunking pre-processor needed.
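A minimal sketch of heading-based chunking, assuming the Markdown from /v2/markdown keeps `#`/`##`/`###` headings as in the examples above; embedding and vector-store insertion are left out.

```python
import re

def chunk_by_headings(markdown: str):
    """Split LLM-clean Markdown into heading-delimited chunks for a vector store."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "# API Reference\nIntro text.\n\n## Auth\nUse x-api-key.\n\n## Limits\nBurst concurrency."
chunks = chunk_by_headings(doc)
assert len(chunks) == 3 and chunks[1].startswith("## Auth")
```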

Synthetic-data seed corpora

Seed your data-augmentation pipeline with diverse real-web samples. Markdown output keeps the LLM-generation step focused on transforming, not parsing.

Evaluation-set construction

Pull benchmark sources, hold-out test corpora, fresh-from-the-web eval slices for contamination-free LLM evaluation. Date-stamped fetches lock the snapshot.
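One way to lock the snapshot, sketched with a hypothetical record layout — the `fetched_at` field is an assumption, not an API output; stamp it yourself at fetch time so a later contamination check can filter by training-data cutoff date.

```python
import datetime

def eval_record(url, markdown, fetched=None):
    """Hypothetical eval-slice record: the fetch date travels with the text."""
    fetched = fetched or datetime.date.today()
    return {"text": markdown, "source": url, "fetched_at": fetched.isoformat()}

rec = eval_record("https://example.com/news", "# Fresh article", datetime.date(2026, 5, 12))
assert rec["fetched_at"] == "2026-05-12"
```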

Pricing

Industry-leading pricing that scales with your business.

Compare plans side by side. Every tier includes 10,000 free credits to start.
| Plan | Enthusiast | Startup (★ Most Popular) | Business | Business Pro | Custom |
| --- | --- | --- | --- | --- | --- |
| Price | $19/mo | $49/mo | $249/mo | $599/mo | $699+/mo |
| Monthly API credits | 100,000 | 500,000 | 3,000,000 | 8,000,000 | 10M+ |
| Support channel | Email | Priority email | Priority email | Priority email | Priority + dedicated |
| Integration help | Docs only | Custom code snippets | Debug sessions | Priority debug sessions | Full enterprise onboarding |
| Expert assistance | | ✓ | ✓ | ✓ | ✓ |
| Custom proxy pools | | | ✓ | ✓ | ✓ |
| Custom anti-bot avoidances | | | ✓ | ✓ | ✓ |
| Dedicated account manager | | | ✓ | ✓ | ✓ |
Hit your limit mid-month?
Restart your plan instantly — no waiting for the next billing cycle. Credits refresh the moment you pay, so scraping never has to stop.
10,000 free credits every month
No credit card required
Pay only for successful scrapes — failed requests cost 0
Customers

What teams are saying.

From solo developers shipping side projects to enterprise pipelines at Fortune 500s.

★★★★★ 5.0 on Capterra →
★★★★★

“Onboarding and API integration was smooth and clear. Everything works great. The support was excellent.”

Illia K.
Android Software Developer
★★★★★

“Great communication with co-founders helped me to get the job done. Great proxy diversity and good price.”

Andrii M.
Senior Software Engineer
★★★★★

“This product helps me to scale and extend my business. The setup is easy and support is really good.”

Dmytro T.
Senior Software Engineer
FAQ

AI training data API FAQ.

Anything else? Talk to us — we read every email.

What is an AI training data scraping API?

An AI training data scraping API is a managed endpoint that takes URL lists and returns LLM-ready text — clean Markdown for pre-training corpora, structured JSON for instruction-tuning datasets — with concurrency and cost predictability that hold up at million-page scale. ScrapingAnt's /v2/markdown strips nav, ads, footers, and comments so your tokenizer sees signal, not boilerplate. /v2/extract returns parser-free JSON when you need structured question/answer or schema-conformant training records.

Can I use scraped web data to train commercial LLMs?

The legal landscape is unsettled and varies by jurisdiction. In the US, the hiQ Labs v. LinkedIn Ninth Circuit precedent protects scraping of publicly accessible web data, but downstream redistribution of copyrighted material is a separate question. Common Crawl, RefinedWeb, and many open-source pre-training corpora are built from scraped data. We don't provide legal advice — licensing, fair-use claims, and downstream redistribution are your responsibility.

How does this compare to Common Crawl?

Common Crawl ships quarterly broad-coverage snapshots — strong for general pre-training corpora, weaker when you need (a) fresh content, (b) a specific URL list, or (c) JS-rendered pages. ScrapingAnt fetches what you ask, when you ask, with optional browser=true for JS-rendered targets. Many teams use both: Common Crawl as the foundation, ScrapingAnt for the narrow fresh slice (recent news, domain-specific corpora, missing JS-rendered pages).

How many tokens does a typical fetch produce?

A long-form article in /v2/markdown output averages 2,000–4,000 tokens after tokenization (varies by tokenizer). At 1 credit per static fetch, the Business plan ($249/month, 3M credits) covers roughly 3M articles — call it ~10B tokens of pre-training corpus per month. Heavier browser=true fetches halve that throughput. Failed fetches cost zero credits.
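The arithmetic above checks out as a range — a quick sanity calculation under the stated assumptions (1 credit per static fetch, 2,000–4,000 tokens per article):

```python
credits = 3_000_000        # Business plan, $249/mo
low = credits * 2_000      # conservative tokens per article
high = credits * 4_000     # generous tokens per article
assert (low, high) == (6_000_000_000, 12_000_000_000)
# "~10B tokens/month" sits inside the 6B-12B range, toward the upper half.
```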

Does it deduplicate across URLs?

No — deduplication is your downstream concern. sha256 on the response body is a cheap exact-match gate; for near-duplicate detection (paraphrased press releases, mirrored content, syndicated articles), use a MinHash / LSH pipeline. The reason we don't dedupe in-API: training-data pipelines have wildly different dedup thresholds and the right call depends on your downstream tokenizer.
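The exact-match gate mentioned above is a few lines of stdlib Python; near-duplicate detection (MinHash/LSH) is a separate, heavier pipeline.

```python
import hashlib

def dedupe_exact(records):
    """Drop records whose text sha256 was already seen; keep first occurrence."""
    seen, out = set(), []
    for rec in records:
        h = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(rec)
    return out

records = [{"text": "same body"}, {"text": "same body"}, {"text": "other"}]
deduped = dedupe_exact(records)
assert len(deduped) == 2
```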

Markdown vs raw HTML — which should I store?

For pre-training and SFT corpora: Markdown — boilerplate strip is critical, tokenizer doesn't need <div class="ad-wrapper"> noise. For RAG knowledge bases: also Markdown — vector stores chunk cleaner on heading structure. For research that needs render fidelity (UI layout extraction, ad-placement studies): raw HTML via /v2/general. Most LLM teams use Markdown for 95%+ of pages.

Can I get paywalled or auth-bound content?

Paywall handling is your responsibility — you need to be on the right side of access. For pages where you have legitimate auth (subscription accounts, customer logins, OAuth-bound content), pass cookies via cookies= param and sticky sessions via proxy_country + session identifier to keep the same egress IP across multi-page flows. The API doesn't bypass paywalls; it executes the request you supply.
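Sketching the request construction with stdlib URL encoding — the `cookies` and `proxy_country` parameter names follow the description above, and the session cookie value is a placeholder for one you obtained legitimately.

```python
from urllib.parse import urlencode

params = {
    "x-api-key": "YOUR_KEY",
    "url": "https://example.com/members/report",
    "cookies": "session_id=YOUR_SESSION_COOKIE",  # legitimate auth only
    "proxy_country": "US",                        # keep a consistent egress region
}
request_url = "https://api.scrapingant.com/v2/markdown?" + urlencode(params)
print(request_url.split("?")[0])  # https://api.scrapingant.com/v2/markdown
```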

Talk to us

Building a domain-specific LLM?

Custom volume pricing for million-page corpora, dedicated datacenter pools, JSON-schema design help for instruction-tuning datasets, or migration help from in-house bulk scrapers — drop us a line and a real human gets back within a few hours.

“Our clients are pleasantly surprised by the response speed of our team.”

Oleg Kulyk
Founder, ScrapingAnt

A real human replies within a few hours · we don't share your email


Ready to scrape the web?

10,000 free credits every month. No credit card. Pay only for successful requests.

Sign up in under 30 seconds — no card, no commitment.