Web archival API. Snapshot the public web at burst concurrency.
The data infrastructure under your archival pipeline — news aggregators, regulatory monitors, public-records archives, OSINT collection. LLM-clean Markdown output, deterministic snapshots for cheap diffing, datacenter-proxy concurrency for nightly bursts.
Datacenter speed · failed requests cost 0 · 50K pages daily on Business plan
# LLM-clean Markdown snapshot — strips nav, ads, footers.
$ curl -G 'https://api.scrapingant.com/v2/markdown' \
    --data-urlencode 'x-api-key=YOUR_KEY' \
    --data-urlencode 'url=https://example.com/news/article' \
    --data-urlencode 'browser=true'
# → "# Headline\n\nClean Markdown body, ready to archive."
# Archive every URL to a date-prefixed S3 path.
import hashlib
import requests, boto3
from datetime import date

s3 = boto3.client("s3")
today = date.today().isoformat()
for url in source_urls:
    r = requests.get("https://api.scrapingant.com/v2/markdown", params={
        "x-api-key": "YOUR_KEY",
        "url": url,
    })
    # Stable per-URL key: Python's built-in hash() is salted per process.
    key = f"archive/{today}/{hashlib.sha256(url.encode()).hexdigest()}.md"
    s3.put_object(Bucket=BUCKET, Key=key, Body=r.text)
// Skip storage if today's snapshot matches yesterday's.
import crypto from 'crypto';

const r = await fetch(
  'https://api.scrapingant.com/v2/markdown?' +
    new URLSearchParams({ 'x-api-key': KEY, url })
);
const body = await r.text();
const hash = crypto.createHash('sha256').update(body).digest('hex');
if (hash !== yesterdayHash[url]) {
  await archive.put(url, today, body);
}
Why archival teams build on us.
The infrastructure layer, not a curated dataset. Bring your URL list — we handle the fetch, render, and proxy.
Deterministic Markdown snapshots
Same URL, same boilerplate-stripped Markdown day after day. Content hash gates storage writes — diff-friendly.
See /v2/markdown →
Burst concurrency on datacenter pool
Nightly archival runs in minutes, not hours. Datacenter proxies for low-block targets, residential for the rest.
How concurrency works →
Fresher than quarterly snapshots
Common Crawl ships once a quarter. ScrapingAnt fetches what you ask, when you ask — nightly, hourly, on push.
Output formats →
HTML, Markdown, JSON. Pick your storage shape.
Different archival workflows need different shapes. /v2/markdown strips boilerplate for downstream LLM ingestion, full-text search, and content-hash diffs — most teams default to this. /v2/general ships raw rendered HTML when you need byte-fidelity or render-history audits. /v2/extract returns structured JSON when your archive is feeding an analyst dashboard with headline / date / body fields.
- Same URL, same key — switch output by changing the endpoint path
- Markdown is byte-stable when source content is unchanged — sha256 dedupe is reliable
- Failed fetches cost 0 credits — retries don't bloat nightly archive budgets
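A minimal sketch of that endpoint switch: same key and URL, only the path changes. The endpoint paths and the x-api-key / url parameters come from the examples above; everything else is illustrative.

```python
import requests

BASE = "https://api.scrapingant.com/v2"
params = {"x-api-key": "YOUR_KEY", "url": "https://example.com/news/article"}

markdown = requests.get(f"{BASE}/markdown", params=params).text  # boilerplate-stripped Markdown
raw_html = requests.get(f"{BASE}/general", params=params).text   # raw rendered HTML, byte-fidelity
# /v2/extract returns structured JSON; see the schema sketch in the FAQ below.
```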
Regional press. Archived from the right country.
Regional outlets serve region-locked content and paywalls based on egress IP. proxy_country=DE archives German press from German IPs; proxy_country=BR hits Brazilian local outlets without VPN tooling. Residential proxies when an outlet blocks datacenter ranges; sticky sessions when you need multi-page archival across paywalled flows.
- 2M+ residential IPs across 100+ countries — region-locked outlets work out of the box
- Datacenter pool for low-block targets (gov sites, SEC EDGAR) — predictable cost
- One credit covers fetch + render + proxy — no separate residential plan to track
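A sketch of a region-pinned fetch using the proxy_country parameter named above; the outlet URLs and the store() helper are placeholders.

```python
import requests

# German outlet from a German egress IP, Brazilian outlet from Brazil.
for url, country in [
    ("https://example.de/politik/artikel", "DE"),
    ("https://example.com.br/noticias/materia", "BR"),
]:
    r = requests.get("https://api.scrapingant.com/v2/markdown", params={
        "x-api-key": "YOUR_KEY",
        "url": url,
        "proxy_country": country,  # egress country for region-locked content
    })
    store(url, r.text)  # placeholder for your own storage layer
```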
Fresher than quarterly. Fetch when you ask.
Common Crawl is the gold standard for broad-web corpora, but it ships once a quarter — useless when you need yesterday's news article archived today. ScrapingAnt fetches the URLs you specify, when you specify them. Schedule from your own cron, Airflow, or Temporal: nightly, hourly, on push. Same key across all schedules, one credit pool, your own storage.
- You pick the URL list — narrow, on-topic, fresh, not "what the crawler happened to find"
- You pick the cadence — nightly archival, hourly news-mention sweeps, on-push regulatory diffs
- JS rendering on demand via browser=true — Common Crawl doesn't render JavaScript
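One way to wire that cadence, sketched as a thread-pooled nightly sweep any scheduler (cron, Airflow, Temporal) can call; run_nightly, the worker count, and the bucket name are illustrative.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from datetime import date

import boto3
import requests

def fetch_one(url: str) -> tuple[str, str]:
    r = requests.get("https://api.scrapingant.com/v2/markdown",
                     params={"x-api-key": "YOUR_KEY", "url": url})
    r.raise_for_status()  # a failed fetch costs 0 credits, so surfacing it is cheap
    return url, r.text

def run_nightly(source_urls: list[str], bucket: str = "your-archive-bucket") -> None:
    s3 = boto3.client("s3")
    today = date.today().isoformat()
    with ThreadPoolExecutor(max_workers=20) as pool:  # burst the nightly run
        for url, body in pool.map(fetch_one, source_urls):
            key = f"archive/{today}/{hashlib.sha256(url.encode()).hexdigest()}.md"
            s3.put_object(Bucket=bucket, Key=key, Body=body)
```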
Six archival workloads teams build.
Same API, same credit pool — different ways of slicing the snapshot pipeline underneath.
News-mention monitoring
Sweep press outlets, niche publications, and trade journals on a schedule. Capture full-article Markdown for downstream sentiment, summarisation, or analyst dashboards.
Regulatory filing surveillance
Track SEC EDGAR new filings, UK Companies House director changes, and other government public-records endpoints. Daily diffs land in your alerting pipeline.
Public-records archives
Court dockets, property records, municipal meeting minutes — pages that ship plain HTML and rarely change but matter when they do.
Press-release distribution
Mirror official press-release feeds and corporate newsrooms before posts get edited or pulled. Deterministic output makes diffing across days trivial.
Government open data crawls
Open-data portals, FOIA-disclosed document indexes, public benefit-program data. Burst concurrency keeps weekly archive runs short.
Historical-snapshot pipelines
Build your own time-machine for the pages you care about. Daily or hourly fetches, stored by date — no waiting for Common Crawl quarterlies.
Pricing
Industry-leading pricing that scales with your business.
| Plans | Enthusiast ($19/mo) | Startup ★ Most Popular ($49/mo) | Business ($249/mo) | Business Pro ($599/mo) | Custom ($699+/mo) |
|---|---|---|---|---|---|
| Monthly API credits | 100,000 | 500,000 | 3,000,000 | 8,000,000 | 10M+ |
| Support channel | Priority email | Priority email | Priority email | Priority + dedicated | |
| Integration help | Docs only | Custom code snippets | Debug sessions | Priority debug sessions | Full enterprise onboarding |
| Expert assistance | — | ✓ | ✓ | ✓ | ✓ |
| Custom proxy pools | — | — | ✓ | ✓ | ✓ |
| Custom anti-bot avoidances | — | — | ✓ | ✓ | ✓ |
| Dedicated account manager | — | — | ✓ | ✓ | ✓ |
| | Start Free | Start Free | Start Free | Start Free | Talk to Sales |
What teams are saying.
From solo developers shipping side projects to enterprise pipelines at Fortune 500s.
★★★★★ 5.0 on Capterra →
★★★★★ “Onboarding and API integration was smooth and clear. Everything works great. The support was excellent.”
★★★★★ “Great communication with co-founders helped me to get the job done. Great proxy diversity and good price.”
★★★★★ “This product helps me to scale and extend my business. The setup is easy and support is really good.”
What is a web archival API?
A web archival API is a managed endpoint that takes URLs and returns clean, storable snapshots — Markdown, HTML, or structured JSON — at the concurrency and schedule your pipeline needs. ScrapingAnt's /v2/markdown returns LLM-clean Markdown stripped of nav, ads, and boilerplate; /v2/general returns raw rendered HTML; /v2/extract returns structured JSON with no parser for you to maintain. Same key, same datacenter-proxy pool for burst archival runs.
How is this different from Wayback Machine or Common Crawl?
Wayback Machine is a read-only public archive — you can pull from it, but you can't archive your specific URL set on your own schedule. Common Crawl ships quarterly snapshots — useful for broad corpora, useless if you need yesterday's news article archived today. ScrapingAnt fetches what you ask, when you ask, at the freshness your pipeline demands. You own the storage; we hand back clean snapshots.
Can I run nightly archive jobs?
Yes — and that's the most common pattern. Kick off the crawl from your scheduler (cron, Airflow, Temporal), pass the URL list, write the Markdown or JSON to date-prefixed storage. Datacenter proxies handle the burst at predictable cost; failed fetches don't bill. Most teams have a working nightly archive in under a day.
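Because failed fetches are not billed, a blunt retry wrapper is cheap insurance for flaky outlets. A sketch; the attempt count and backoff are arbitrary choices.

```python
import time
import requests

def fetch_with_retry(url: str, attempts: int = 3) -> str | None:
    for i in range(attempts):
        r = requests.get("https://api.scrapingant.com/v2/markdown",
                         params={"x-api-key": "YOUR_KEY", "url": url})
        if r.ok:
            return r.text
        time.sleep(2 ** i)  # failed attempts cost 0 credits, only time is spent
    return None  # leave the gap for the next scheduled run
```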
Are SEC filings and public records safe to scrape?
Yes — SEC EDGAR, Companies House, and US public-records sites publish their data on the open web specifically for redistribution; scraping them is the canonical use case. Adjacent regulatory-filing workflows for sales prospecting are documented on the lead generation scraping API page. We don't provide legal advice; you're responsible for compliance with the specific source's terms.
How much does archiving 50,000 pages daily cost?
A static-HTML or Markdown fetch is about 1 credit. 50K daily × 30 days ≈ 1.5M credits/month — fits comfortably in the Business plan ($249/month, 3M credits) with retry headroom. If you need browser=true for JS-heavy outlets, double the credit math and consider Business Pro ($599 / 8M credits). Failed fetches cost zero credits.
What output format should I store?
For text-search and downstream LLM ingestion: /v2/markdown — strips boilerplate, preserves headings, fits cleanly in vector stores or full-text indexes. For exact-byte archival or render-fidelity audits: /v2/general raw HTML. For structured analyst dashboards: /v2/extract with a schema like headline,publish_date,author,body. Storage is yours — S3, Postgres, object store of choice.
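For the /v2/extract case, a sketch of the request shape. The endpoint path and field list come from the answer above, but the name of the schema parameter (shown as extract_properties) is an assumption; check the API reference for the exact signature.

```python
import requests

r = requests.get("https://api.scrapingant.com/v2/extract", params={
    "x-api-key": "YOUR_KEY",
    "url": "https://example.com/news/article",
    # Parameter name is assumed for illustration; fields match the schema above.
    "extract_properties": "headline,publish_date,author,body",
})
record = r.json()  # structured JSON for the analyst dashboard
```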
Do snapshots stay byte-stable across days?
Markdown output is deterministic across days when the source content hasn't changed — same boilerplate-stripping logic, same heading extraction. That makes sha256(body) a cheap "did anything change" gate before writing storage. Raw HTML varies more (dynamic timestamps, ad slots, CSRF tokens) — use it when you need byte-fidelity, but expect false-positive change signals.
Building a news or filing archive?
Custom volume pricing, dedicated regional pools, scheduled-crawl orchestration, historical-snapshot backfills, or migration help from in-house scrapers — drop us a line and a real human gets back within a few hours.
“Our clients are pleasantly surprised by the response speed of our team.”