AI training data scraping API. Bulk corpus collection for fine-tuning.
The data infrastructure under your LLM training pipeline. Bulk fetch open-web pages at datacenter speed, return LLM-clean Markdown (boilerplate stripped) or structured JSON (custom schema). Not a curated dataset marketplace — bring your URL list, we hand back tokenizer-ready text.
LLM-clean Markdown · failed requests cost 0 · ~10B tokens/month on Business plan
# LLM-clean Markdown — boilerplate stripped, ready for tokenization.
$ curl -G 'https://api.scrapingant.com/v2/markdown' \
    --data-urlencode 'x-api-key=YOUR_KEY' \
    --data-urlencode 'url=https://example.com/long-form-article'
# → { "url": "…", "markdown": "# Heading\n\n…" }

# Stream a URL list into a JSONL training set.
import requests, json

with open("corpus.jsonl", "a") as f:
    for url in source_urls:
        r = requests.get("https://api.scrapingant.com/v2/markdown", params={
            "x-api-key": "YOUR_KEY",
            "url": url,
        })
        if r.ok:
            f.write(json.dumps({
                "text": r.json()["markdown"],
                "source": url,
            }) + "\n")
# Load with: datasets.load_dataset('json', data_files='corpus.jsonl')

# Turn FAQ pages into instruction-tuning pairs — list syntax for multiple Q&As.
$ curl -G 'https://api.scrapingant.com/v2/extract' \
    --data-urlencode 'x-api-key=YOUR_KEY' \
    --data-urlencode 'url=https://example.com/faq' \
    --data-urlencode 'extract_properties=faqs(list: question, answer)'
# → { "faqs": [{ "question": "…", "answer": "…" }, …] } — ready for SFT.

Why LLM teams build on us.
The infrastructure layer, not a curated dataset. Bring your URL list — we hand back text the tokenizer can use directly.
LLM-clean Markdown out of the box
/v2/markdown strips nav, ads, footers. No BeautifulSoup+readability pre-pass.
/v2/markdown →

Datacenter concurrency for bulk corpora
Millions of pages on one key, predictable per-page cost. Datacenter pool handles the burst.
How bulk works →

JSON-schema extraction for SFT
/v2/extract returns parser-free {question, answer} JSON ready for instruction tuning.
/v2/extract →

Markdown, JSON, HTML. One per training shape.
Different training stages need different shapes. /v2/markdown for pre-training corpora — boilerplate stripped, headings preserved, ready for the tokenizer. /v2/extract for instruction-tuning datasets — pass a schema like question,answer or title,body,label and get JSONL-ready records back. /v2/general for raw HTML when your downstream pipeline does its own parsing (rare in modern LLM stacks, but supported).
- Same key, switch endpoint by path — no separate plan per output format
- Markdown output is deterministic — sha256 dedupe before token billing makes economic sense
- JSONL-friendly — pipe /v2/markdown output straight into datasets.load_dataset
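The sha256 dedupe gate mentioned above fits in a few lines of stdlib Python — a minimal sketch; the record shape ({"text", "source"}) follows the JSONL example earlier, and dedupe_exact is a hypothetical helper name, not part of the API:

```python
import hashlib

def dedupe_exact(docs):
    """Drop exact duplicates by sha256 of the Markdown body.

    `docs` is an iterable of {"text": ..., "source": ...} records, as
    produced by the /v2/markdown JSONL loop above. Because the Markdown
    output is deterministic, identical pages hash identically.
    """
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

records = [
    {"text": "# Same article", "source": "https://a.example/1"},
    {"text": "# Same article", "source": "https://b.example/mirror"},  # exact mirror
    {"text": "# Different article", "source": "https://a.example/2"},
]
unique = list(dedupe_exact(records))
# mirrors collapse: 2 unique docs survive out of 3
```

Run this before tokenization and the mirrored page never reaches the training set.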
Million-page crawls. Only successes bill.
Bulk-corpus crawls hit dead URLs constantly — sites disappear, articles get pulled, paywalls swap in. ScrapingAnt charges zero credits for failed requests: DNS errors, timeouts, 4xx/5xx responses never bill. That keeps the per-token cost predictable when your training pipeline needs a million pages and the source list is 40% stale. Datacenter proxies handle the burst concurrency; failed fetches don't inflate the bill.
- Failed = 0 credits — retries are free across the whole pipeline
- 1 credit per static fetch — Business Pro ($599 / 8M credits) covers ~8M pages/month
- Burst concurrency without per-key caps — your worker pool sets the throughput
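Because failed requests cost 0 credits, a retry loop only spends wall-clock time, never budget. A minimal sketch — fetch_with_retry and the (ok, body) tuple shape are illustrative; `fetch` stands in for your own thin wrapper around the /v2/markdown call:

```python
import time

def fetch_with_retry(fetch, url, attempts=3, backoff=1.0):
    """Retry a fetch up to `attempts` times with exponential backoff.

    Failed fetches never bill, so retrying stale or flaky URLs is free
    in credit terms. `fetch` is any callable returning (ok, body).
    """
    for attempt in range(attempts):
        ok, body = fetch(url)
        if ok:
            return body
        time.sleep(backoff * (2 ** attempt))  # back off between free retries
    return None  # give up; the dead URL cost nothing

# Demo with a fake fetch that fails once, then succeeds:
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    return (calls["n"] >= 2, "# Recovered page")

body = fetch_with_retry(flaky_fetch, "https://example.com/article", backoff=0.0)
# → "# Recovered page" after one free retry
```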
Free-form schema. Parser-free JSON.
/v2/extract takes a free-form extract_properties string — a comma-separated list of fields you want from the page — and returns matching JSON. Pass title, content for article corpora, product title, price(number), reviews(list: review title, content) for catalog mining, or faqs(list: question, answer) to turn FAQ pages into SFT pairs. No XPath, no selectors, no per-site parsers; the AI maps your fields onto the page's Markdown.
- Nested list: syntax for repeated structures — pull every Q/A, every product, every comment
- Field names come straight back as JSON keys — your schema is your contract
- Same key, same credit pool, same fetch+render pipeline as /v2/markdown and /v2/general
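Flattening an extract response into SFT records takes a few lines — a sketch assuming the response shape shown in the curl example above; the instruction/output record keys are a common SFT convention, not an API contract:

```python
import json

def faqs_to_sft(extract_response):
    """Turn a /v2/extract `faqs(list: question, answer)` response into
    instruction-tuning records — one JSON object per Q&A pair."""
    for faq in extract_response.get("faqs", []):
        yield {"instruction": faq["question"], "output": faq["answer"]}

response = {  # shape matches the curl example above
    "faqs": [
        {"question": "What formats do you return?",
         "answer": "Markdown, JSON, or raw HTML."},
        {"question": "Do failed requests bill?",
         "answer": "No — failed fetches cost 0 credits."},
    ]
}
lines = [json.dumps(rec) for rec in faqs_to_sft(response)]
# each line is one JSONL record, ready to append to an SFT training file
```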
Six training workloads teams build on top.
Same API, same credit pool — different ways of slicing the open web into training data.
Pre-training corpus collection
Bulk fetch open-web pages into LLM-clean Markdown. Stream into JSONL for HuggingFace datasets, then tokenize. Boilerplate already stripped — no BeautifulSoup + readability post-pass.
Instruction-tuning Q&A pairs
FAQ pages, support knowledge bases, regulatory Q&A — turn them into {question, answer} JSONL with /v2/extract. Direct SFT input, no manual annotation.
Domain-specific fine-tuning
Legal opinions, medical guidelines, financial filings, scientific abstracts — narrow corpora that public LLMs underweight. Bring the URL list, get the corpus.
RAG knowledge-base ingestion
Crawl documentation, product manuals, public knowledge bases. Markdown chunks land directly in a vector store — no parsing, no chunking pre-processor needed.
Synthetic-data seed corpora
Seed your data-augmentation pipeline with diverse real-web samples. Markdown output keeps the LLM-generation step focused on transforming, not parsing.
Evaluation-set construction
Pull benchmark sources, hold-out test corpora, fresh-from-the-web eval slices for contamination-free LLM evaluation. Date-stamped fetches lock the snapshot.
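For the RAG workload above, heading-aware chunking is a few lines of stdlib Python — a minimal sketch (chunk_by_headings is a hypothetical helper; production chunkers also handle code fences and tables):

```python
def chunk_by_headings(markdown, max_chars=2000):
    """Split LLM-clean Markdown into vector-store chunks on heading
    boundaries, merging small sections up to `max_chars` characters."""
    # 1. Split into sections, one per heading.
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # 2. Greedily merge adjacent small sections into chunks.
    chunks, buf = [], ""
    for section in sections:
        if buf and len(buf) + len(section) > max_chars:
            chunks.append(buf)
            buf = ""
        buf = (buf + "\n" + section).strip() if buf else section
    if buf:
        chunks.append(buf)
    return chunks

doc = "# Setup\nInstall the SDK.\n## Auth\nPass x-api-key.\n# Usage\nCall /v2/markdown."
chunks = chunk_by_headings(doc, max_chars=40)
# each chunk starts at a heading boundary — clean embedding input
```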
Pricing
Industry leading pricing that scales with your business.
| Plans | Enthusiast | Startup (★ Most Popular) | Business | Business Pro | Custom |
|---|---|---|---|---|---|
| Price | $19/mo | $49/mo | $249/mo | $599/mo | $699+/mo |
| Monthly API credits | 100,000 | 500,000 | 3,000,000 | 8,000,000 | 10M+ |
| Support channel | Priority email | Priority email | Priority email | Priority + dedicated | |
| Integration help | Docs only | Custom code snippets | Debug sessions | Priority debug sessions | Full enterprise onboarding |
| Expert assistance | — | ✓ | ✓ | ✓ | ✓ |
| Custom proxy pools | — | — | ✓ | ✓ | ✓ |
| Custom anti-bot avoidances | — | — | ✓ | ✓ | ✓ |
| Dedicated account manager | — | — | ✓ | ✓ | ✓ |
| | Start Free | Start Free → | Start Free | Start Free | Talk to Sales |
What teams are saying.
From solo developers shipping side projects to enterprise pipelines at Fortune 500s.
★★★★★ 5.0 on Capterra →

★★★★★ “Onboarding and API integration was smooth and clear. Everything works great. The support was excellent.”
★★★★★“Great communication with co-founders helped me to get the job done. Great proxy diversity and good price.”
★★★★★“This product helps me to scale and extend my business. The setup is easy and support is really good.”
What is an AI training data scraping API?
An AI training data scraping API is a managed endpoint that takes URL lists and returns LLM-ready text — clean Markdown for pre-training corpora, structured JSON for instruction-tuning datasets — with the concurrency and cost predictability that million-page jobs demand. ScrapingAnt's /v2/markdown strips nav, ads, footers, and comments so your tokenizer sees signal, not boilerplate. /v2/extract returns parser-free JSON when you need structured question/answer records or schema-conformant training data.
Can I use scraped web data to train commercial LLMs?
The legal landscape is unsettled and varies by jurisdiction. In the US, the Ninth Circuit's hiQ Labs v. LinkedIn rulings held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act, but downstream redistribution of copyrighted material is a separate question. Common Crawl, RefinedWeb, and many open-source pre-training corpora are built from scraped data. We don't provide legal advice — licensing, fair-use claims, and downstream redistribution are your responsibility.
How does this compare to Common Crawl?
Common Crawl ships quarterly broad-coverage snapshots — strong for general pre-training corpora, weaker when you need (a) fresh content, (b) a specific URL list, or (c) JS-rendered pages. ScrapingAnt fetches what you ask, when you ask, with optional browser=true for JS-rendered targets. Many teams use both: Common Crawl as the foundation, ScrapingAnt for the narrow fresh slice (recent news, domain-specific corpora, missing JS-rendered pages).
How many tokens does a typical fetch produce?
A long-form article in /v2/markdown output averages 2,000–4,000 tokens after tokenization (varies by tokenizer). At 1 credit per static fetch, the Business plan ($249/month, 3M credits) covers roughly 3M articles — call it ~10B tokens of pre-training corpus per month. Heavier browser=true fetches halve that throughput. Failed fetches cost zero credits.
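The arithmetic behind that answer, spelled out — the 3,000 tokens/article average is the midpoint assumption from the 2,000–4,000 range above:

```python
# Back-of-envelope corpus sizing from the figures in the FAQ above.
credits_per_month = 3_000_000   # Business plan
tokens_per_article = 3_000      # midpoint of the 2,000–4,000 range (assumption)
pages = credits_per_month       # 1 credit per static fetch
tokens = pages * tokens_per_article
# → 9_000_000_000 tokens/month, i.e. the "~10B tokens" headline figure
```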
Does it deduplicate across URLs?
No — deduplication is your downstream concern. sha256 on the response body is a cheap exact-match gate; for near-duplicate detection (paraphrased press releases, mirrored content, syndicated articles), use a MinHash / LSH pipeline. The reason we don't dedupe in-API: training-data pipelines have wildly different dedup thresholds and the right call depends on your downstream tokenizer.
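For readers who haven't seen MinHash, a toy estimator — a sketch that uses stdlib sha1 in place of proper universal hashing; a production pipeline would use a tuned library plus LSH banding to avoid all-pairs comparison:

```python
import hashlib

def shingles(text, k=3):
    """k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One min-hash per seeded hash function; seeded sha1 stands in for
    universal hashing in this toy sketch."""
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the old bridge"  # near-dup
c = "completely unrelated press release about quarterly earnings results today"
sig = minhash_signature
near = estimated_jaccard(sig(shingles(a)), sig(shingles(b)))
far = estimated_jaccard(sig(shingles(a)), sig(shingles(c)))
# near-duplicates score high; unrelated text scores ~0
```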
Markdown vs raw HTML — which should I store?
For pre-training and SFT corpora: Markdown — boilerplate strip is critical, tokenizer doesn't need <div class="ad-wrapper"> noise. For RAG knowledge bases: also Markdown — vector stores chunk cleaner on heading structure. For research that needs render fidelity (UI layout extraction, ad-placement studies): raw HTML via /v2/general. Most LLM teams use Markdown for 95%+ of pages.
Can I get paywalled or auth-bound content?
Paywall handling is your responsibility — you need to be on the right side of access. For pages where you have legitimate auth (subscription accounts, customer logins, OAuth-bound content), pass cookies via cookies= param and sticky sessions via proxy_country + session identifier to keep the same egress IP across multi-page flows. The API doesn't bypass paywalls; it executes the request you supply.
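Building the authed request parameters can be sketched like this — note the name=value;name2=value2 cookie string format is an assumption to verify against the current docs, and build_authed_params is a hypothetical helper, not part of the API:

```python
def build_authed_params(url, cookie_pairs, api_key="YOUR_KEY", country="US"):
    """Build /v2/markdown query params that forward your own session
    cookies for pages where you have legitimate auth."""
    return {
        "x-api-key": api_key,
        "url": url,
        # join cookies into a single string; exact format is an assumption
        "cookies": ";".join(f"{k}={v}" for k, v in cookie_pairs.items()),
        "proxy_country": country,  # keep egress geography stable across a session
    }

params = build_authed_params(
    "https://example.com/members/handbook",
    {"sessionid": "abc123", "csrftoken": "xyz789"},
)
# then: requests.get("https://api.scrapingant.com/v2/markdown", params=params)
```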
Building a domain-specific LLM?
Custom volume pricing for million-page corpora, dedicated datacenter pools, JSON-schema design help for instruction-tuning datasets, or migration help from in-house bulk scrapers — drop us a line and a real human gets back within a few hours.
“Our clients are pleasantly surprised by the response speed of our team.”