LLM-ready Markdown. Web pages in one call.
A Markdown extraction API for LLM workflows. Pass a URL to /v2/markdown, get back clean, token-efficient Markdown — ~5-10× fewer tokens than raw HTML. No HTML cleanup, no XPath, no boilerplate. Drop straight into RAG pipelines, fine-tunes, or live agent context.
10 credits per call · failed requests cost 0 · cancel anytime
# Any URL → clean LLM-ready Markdown
```shell
$ curl -G 'https://api.scrapingant.com/v2/markdown' \
  --data-urlencode 'url=https://example.com/article' \
  -H 'x-api-key: YOUR_API_KEY'
```

```python
import requests

r = requests.get(
    'https://api.scrapingant.com/v2/markdown',
    params={'url': 'https://example.com/article'},
    headers={'x-api-key': 'YOUR_API_KEY'},
)
doc = r.json()
print(doc['markdown'])  # clean Markdown, no boilerplate
```

```javascript
const res = await fetch(
  'https://api.scrapingant.com/v2/markdown?' +
    new URLSearchParams({
      url: 'https://example.com/article',
    }),
  { headers: { 'x-api-key': 'YOUR_API_KEY' } },
);
const { markdown } = await res.json();
console.log(markdown); // ready to feed your LLM
```

Why Markdown.
Token-efficient, structurally lossless, and the input format every modern LLM was trained on.
Boilerplate stripped
No scripts, ads, navs, comments, or footers. Just the content you came for.
See the cleanup →
~9× fewer tokens
Tighter context windows. Lower inference cost. Faster embeddings.
Token math →
Live or bulk
Sub-second for agents. Parallel fanout for fine-tune corpora. Same endpoint.
How it works →
Drop the noise. Keep the signal.
Every page passes through a content-first extraction step that strips scripts, ads, sponsored blocks, navigation, cookie banners, comments, and footers. What's left is the article body — headings, paragraphs, lists, links, code, and tables — formatted as clean Markdown.
- No HTML cleanup pass needed in your pipeline
- Headings, lists, links, and tables preserved
- Image alt text retained for vision-aware models
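Because the output keeps its heading structure, it can be chunked for RAG without any HTML parsing. A minimal sketch (chunk_by_heading is an illustrative helper, not part of the API):

```python
import re

def chunk_by_heading(markdown: str) -> list[str]:
    """Split cleaned Markdown into chunks, one per h1/h2 heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # start a new chunk at each "# " or "## " heading
        if re.match(r'#{1,2} ', line) and current:
            chunks.append('\n'.join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append('\n'.join(current).strip())
    return chunks

doc = "# Title\n\nIntro paragraph.\n\n## Section\n\nBody text."
print(chunk_by_heading(doc))  # two chunks, split at "## Section"
```

Each chunk starts at a heading, so embeddings line up with the page's own sections.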
Tokens that count stay in your context.
A typical article page costs ~80k tokens as raw HTML and around 9k as cleaned Markdown — roughly a 9× reduction. That means tighter context windows, lower embedding cost, and more documents per query in your RAG retriever.
- ~9× smaller than raw HTML on typical articles
- Cleaner chunks tokenize and embed predictably
- Lower inference + retrieval costs at scale
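The token math works out as follows, a back-of-the-envelope sketch using the figures above (the $3-per-million-token price is an illustrative assumption, not a quoted rate):

```python
html_tokens = 80_000      # typical article page as raw HTML (figure from above)
markdown_tokens = 9_000   # same page as cleaned Markdown

reduction = html_tokens / markdown_tokens
print(f"~{reduction:.1f}x fewer tokens")  # ~8.9x

# at an illustrative $3 per million input tokens, per-page savings:
price_per_token = 3 / 1_000_000
savings = (html_tokens - markdown_tokens) * price_per_token
print(f"${savings:.4f} saved per page")  # $0.2130
```

At corpus scale the same ratio applies to embedding and retrieval costs.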
On-demand for agents. Bulk for fine-tunes.
Same endpoint, two modes. Agents call /v2/markdown synchronously to read a page mid-thought — sub-second response, ready-to-tokenize Markdown back. Training pipelines fan out URL lists in parallel; we don't cap concurrency, so you can collect Markdown at whatever rate your worker pool can drive. For MCP-aware clients (Claude, Cursor, Windsurf), the same Markdown tool is exposed through the ScrapingAnt MCP server as get_web_page_markdown — and through Claude Code with a single claude mcp add command.
- Drop straight into LangChain / LlamaIndex / your custom retriever
- No concurrency cap — push as many parallel calls as your code can drive
- Failed fetches don't cost credits — clean fanout retries
Same anti-bot. Same proxy fleet.
Every /v2/markdown call routes through the same headless Chrome cluster, rotating proxies, TLS fingerprinting, and CAPTCHA avoidance that backs the JavaScript rendering API. JavaScript-rendered pages, Cloudflare / Akamai-protected sites, and SPA targets all return clean Markdown — same call, same reliability. Switch to residential proxies via proxy_type=residential for tougher targets.
- Real headless Chrome — handles JS-rendered content
- Rotating proxies + CAPTCHA avoidance out of the box
- Switch to proxy_type=residential for tougher targets — same call
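Switching proxy pools is one extra query parameter. A sketch that only builds the request URL locally, so nothing is sent:

```python
from urllib.parse import urlencode

base = 'https://api.scrapingant.com/v2/markdown'
params = {
    'url': 'https://example.com/article',
    'proxy_type': 'residential',  # datacenter routing is the default
}
request_url = f'{base}?{urlencode(params)}'
print(request_url)  # pass this URL to any HTTP client, with the x-api-key header
```

Everything else about the call stays the same; only the credit cost adjusts for residential routing.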
What teams build with it.
If your pipeline ends in an LLM, your input format probably starts here.
RAG corpora
Build retrieval indexes from public web sources. Markdown chunks tokenize cleanly, embed predictably.
Talk to us →
Fine-tuning datasets
Bulk-fetch domain content — each call returns clean Markdown, ready to collect into your fine-tune pipeline.
Talk to us →
Agent context
Give your agent live access to web pages without parsing. One call, ready-to-read Markdown.
Talk to us →
Knowledge-base sync
Crawl docs, blogs, or wikis on a schedule and store as Markdown — diff-friendly, version-controllable.
Talk to us →
News & blog ingestion
Daily crawls of publishers, newsletters, or forums — clean text in, no HTML soup in your pipeline.
Talk to us →
Agent web tools
Give your agent a real-time read-the-web tool. Sub-second responses, clean Markdown ready to reason over.
Talk to us →
Pricing
Industry leading pricing that scales with your business.
| Plans | Enthusiast | Startup (★ Most Popular) | Business | Business Pro | Custom |
|---|---|---|---|---|---|
| Price | $19/mo | $49/mo | $249/mo | $599/mo | $699+/mo |
| Monthly API credits | 100,000 | 500,000 | 3,000,000 | 8,000,000 | 10M+ |
| Support channel | Priority email | Priority email | Priority email | Priority + dedicated | |
| Integration help | Docs only | Custom code snippets | Debug sessions | Priority debug sessions | Full enterprise onboarding |
| Expert assistance | — | | | | |
| Custom proxy pools | — | — | | | |
| Custom anti-bot avoidances | — | — | | | |
| Dedicated account manager | — | — | | | |
| | Start Free | Start Free | Start Free | Start Free | Talk to Sales |
What teams are saying.
From solo developers shipping side projects to enterprise pipelines at Fortune 500s.
★★★★★ 5.0 on Capterra →
★★★★★ “Onboarding and API integration was smooth and clear. Everything works great. The support was excellent.”
★★★★★ “Great communication with co-founders helped me to get the job done. Great proxy diversity and good price.”
★★★★★ “This product helps me to scale and extend my business. The setup is easy and support is really good.”
Frequently asked questions.
Still curious? Get in touch with our team — we usually reply within hours.
What is LLM-ready data extraction?
LLM-ready data extraction is the workflow of fetching a web page, stripping the boilerplate (navigation, ads, scripts, footers, cookie banners), and returning the article content as clean Markdown that an LLM can consume directly — no HTML cleanup, no XPath rules, no token waste. ScrapingAnt's /v2/markdown endpoint is a Markdown extraction API: pass a URL, get back token-efficient Markdown ready for RAG pipelines, fine-tunes, or live agent context. Built on the same Headless Chrome and proxy stack as our JavaScript rendering API.
What does the Markdown extraction response look like?
JSON with two main fields: url (the URL we fetched) and markdown (the cleaned page content). Status code 200 on success. The Markdown preserves headings, lists, links, code blocks, and tables — but strips scripts, ads, navigation, footers, cookie banners, and other boilerplate. Same shape across every URL, so your RAG chunker doesn't need per-source rules.
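Given that fixed shape, a consumer needs only a couple of checks. A sketch against a hard-coded sample body (the sample content is illustrative):

```python
def extract_markdown(status_code: int, body: dict) -> str:
    """Validate a /v2/markdown response and return the Markdown payload."""
    if status_code != 200:
        raise RuntimeError(f'fetch failed with status {status_code}')
    if 'markdown' not in body:
        raise ValueError('unexpected response shape')
    return body['markdown']

# sample matching the documented shape: url + markdown fields
sample = {
    'url': 'https://example.com/article',
    'markdown': '# Example Article\n\nFirst paragraph.',
}
print(extract_markdown(200, sample))
```

Because the shape is identical across URLs, this one validator covers every source in a pipeline.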
How is /v2/markdown different from /v2/general?
The /v2/general endpoint returns the rendered HTML — useful when you want full control over parsing or DOM access.
/v2/markdown takes the same rendered page and returns clean LLM-ready Markdown. ~5-10× fewer tokens than raw HTML, no boilerplate to strip, no XPath selectors to maintain. Same proxy / anti-bot stack underneath; different output format.
How is LLM-ready Markdown different from the AI data scraper?
Different shapes for different jobs. /v2/markdown returns the whole page as cleaned Markdown — best when you want full content for RAG indexing or fine-tuning. The AI data scraper (/v2/extract) returns typed JSON keyed to a plain-English schema you describe — best when you know exactly which fields you need (price, rating, address). Pick by output shape, not by capability.
Does Markdown extraction handle JavaScript-rendered pages?
Yes — every request runs through real headless Chrome by default. Single-page apps, lazy-loaded content, and React / Vue / Next.js pages all return clean Markdown of the final visible content. Same engine that powers our JavaScript rendering API.
Can I run /v2/markdown for live agents and bulk corpus jobs?
Yes — same endpoint, both modes. Agents call /v2/markdown synchronously and get a sub-second response. Training pipelines fan out URL lists in parallel; there's no concurrency cap, so you push as many simultaneous calls as your worker pool can drive. Same auth, same response shape, same credit cost per call. For agent integrations, the same Markdown tool is also exposed through the ScrapingAnt MCP server as get_web_page_markdown. Markdown endpoint docs →
How are credits charged for LLM-ready Markdown?
Each /v2/markdown call costs the same as a regular /v2/general request — based on your proxy choice. Default datacenter routing is 10 credits per request; switch to residential proxies when the source page is anti-bot-protected and the cost adjusts. Failed requests cost 0. Every account starts with 10,000 free credits per month, no card required.
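The credit arithmetic made concrete, using the figures above (10 credits per datacenter call, 10,000 free monthly credits):

```python
free_credits = 10_000   # free monthly allowance per account
cost_per_call = 10      # datacenter routing, per call

free_calls = free_credits // cost_per_call
print(free_calls)       # 1000 pages/month on the free tier

# failed requests cost 0, so only successful fetches draw down the balance
successes = 120
spent = successes * cost_per_call
print(spent)            # 1200 credits
```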
Is LLM-ready Markdown good enough for fine-tuning?
For most public-web content, yes — the Markdown output is human-readable and tokenizes cleanly. For very structured datasets (catalogs, schema-rich pages), combine it with the AI data scraper so each row carries explicit fields. Mixed approaches work great for RAG: Markdown for narrative pages, typed JSON for tables.
Building an LLM pipeline at scale?
Volume crawls, custom extraction schemas, dedicated capacity, or a one-off RAG corpus — drop us a line and a real human gets back within a few hours.
“Our clients are pleasantly surprised by the response speed of our team.”