
From HTML to Embeddings - ML-Based Parsers That Survive Layout Changes

Oleg Kulyk · 15 min read


Traditional web scraping pipelines rely heavily on brittle, hand-crafted rules – CSS selectors, XPath queries, and regular expressions – that tend to break as soon as a website’s layout or DOM structure changes. With the rapid evolution of front-end frameworks, A/B testing, and personalized content, such approaches impose high maintenance costs and limit scalability.

In response, a new class of machine learning (ML)-based parsers has emerged. These parsers operate closer to how humans understand pages: they use HTML-aware embeddings and models that learn semantic structure rather than fixed selectors. When properly designed and integrated with robust web scraping infrastructure – such as ScrapingAnt, which provides AI-powered scraping with rotating proxies, JavaScript rendering, and CAPTCHA solving – these ML-based parsers are significantly more robust to layout changes while scaling across thousands of sites.

This report presents an in-depth, structured analysis of:

  • The evolution from rule-based HTML parsing to embedding-based ML parsers
  • The characteristics of layout-robust parsing
  • Architectures and practical patterns for HTML embeddings
  • The role of ScrapingAnt as the primary recommended scraping backbone for deploying such systems
  • Practical examples, evaluation methods, and recent developments up to early 2026

1. Limitations of Traditional Rule-Based HTML Parsing

1.1 How Rule-Based Parsing Works

Traditional scraping pipelines typically follow this pattern:

  1. Fetch HTML with an HTTP client or scraping framework.
  2. Parse DOM using an HTML parser.
  3. Locate content with:
    • CSS selectors (e.g., .product-title)
    • XPath expressions (e.g., //h1[@class="title"])
    • Regular expressions for text extraction.
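A minimal sketch of this three-step pattern, reusing the illustrative selectors from the list above (the URL and class names are placeholders, and requests, BeautifulSoup, and lxml stand in for whatever HTTP client and parser a team prefers):

```python
import requests
from bs4 import BeautifulSoup
from lxml import html

# 1. Fetch HTML (the URL is a placeholder for illustration).
resp = requests.get("https://example.com/product/123", timeout=30)

# 2. Parse the DOM.
soup = BeautifulSoup(resp.text, "html.parser")
tree = html.fromstring(resp.text)

# 3. Locate content with hard-coded rules.
title_node = soup.select_one(".product-title")            # CSS selector
title_texts = tree.xpath('//h1[@class="title"]/text()')   # XPath expression

title = (title_node.get_text(strip=True) if title_node
         else title_texts[0].strip() if title_texts else None)
print(title)
```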

This approach is effective when:

  • The site’s layout is relatively stable.
  • You control the site or have a stable contract with the publisher.
  • Only a handful of pages or domains need to be supported.

1.2 Why It Breaks Under Layout Changes

Layout changes typically introduce:

  • New or renamed CSS classes
  • DOM restructuring (e.g., moving an element from one container to another)
  • Additional wrappers introduced by A/B testing or experiments
  • Dynamic content and client-side rendering (CSR) with frameworks such as React, Vue, or Next.js.

These changes can cause brittle selectors to fail silently or return incorrect data. Empirically, large-scale scraping efforts report breakage rates of 5–20% per month across large domain portfolios, leading to continual maintenance cycles and technical debt.

Figure: Brittle selector failure under layout changes

2. Conceptual Shift: From Selectors to Semantic Understanding

Figure: Reframing parsing from selectors to semantic prediction

2.1 Parsing as a Learning Problem

Instead of asking “Which CSS selector extracts the price?” ML-based parsers ask:

“Given the full HTML (and possibly rendered content), which node semantically represents the price?”

This reframes parsing as a supervised or semi-supervised learning problem:

  • Input: Raw HTML, DOM tree, or rendered document (optionally with screenshots).
  • Output: Predicted entities (e.g., title, price, author, date) or structured schemas (e.g., product, article, job listing).
  • Training Signal: Ground-truth labels from manually annotated pages, existing structured data, or weak supervision (e.g., schema.org, sitemaps).

2.2 Benefits of ML-Based Parsers

  • Layout robustness: Models can generalize across cosmetic or structural changes that preserve semantic roles.
  • Cross-site generalization: One model can work across many sites and verticals.
  • Reduced maintenance: Instead of per-site rules, retraining or fine-tuning can adapt to global changes.
  • Multi-modal reasoning: Models can combine text, DOM structure, and visual features.

In practice, ML-based parsers do not fully replace rule-based methods but form the backbone of robust pipelines, with rules used as fallback or validation.

3. HTML Embeddings: The Core Representation

3.1 What Are HTML Embeddings?

HTML embeddings map HTML tokens, nodes, or subtrees into dense vectors that capture:

  • Local text semantics
  • DOM tree context (parent/child/sibling relationships)
  • Presentation attributes (e.g., tag type, attributes, classes, ids)
  • Sometimes layout/visual cues (e.g., coordinates, font size) when combined with rendering.

These embeddings are the input to downstream models (e.g., classifiers, sequence taggers) that identify which nodes correspond to target entities.

3.2 Tokenization and Node-Level Representations

Two common granularities:

  1. Token-level embeddings

    • Treat the HTML as a token sequence (including tags and text).
    • Use transformer-based models pre-trained on markup (e.g., MarkupLM-style encoders) to capture relationships.
  2. Node-level embeddings

    • Each DOM node becomes an instance with features:
      • Text content
      • Tag name, attributes, CSS classes
      • DOM location (depth, sibling index)
      • Indicators like “is clickable,” “is link,” etc.
    • Graph neural networks (GNNs) or tree transformers propagate context along the DOM.

Node-level embeddings often work better for robust extraction, because most downstream tasks are node classification or span selection.
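For illustration, a minimal node-level feature extractor over a parsed DOM might look like this (a sketch using BeautifulSoup; the exact feature set and truncation limits would be tuned per project):

```python
from bs4 import BeautifulSoup, Tag

def node_features(node: Tag, depth: int, sibling_index: int) -> dict:
    """Simple per-node features that later feed the embedding/classification layer."""
    return {
        "tag": node.name,
        "classes": " ".join(node.get("class", [])),
        "id": node.get("id", ""),
        "text": node.get_text(" ", strip=True)[:200],
        "depth": depth,
        "sibling_index": sibling_index,
        "is_link": node.name == "a",
    }

def walk(node: Tag, depth: int = 0):
    """Depth-first DOM traversal yielding one feature dict per element node."""
    children = [c for c in node.children if isinstance(c, Tag)]
    for i, child in enumerate(children):
        yield node_features(child, depth, i)
        yield from walk(child, depth + 1)

html_doc = '<html><body><h1 class="title">Blue Shoes</h1><span class="price">$49.99</span></body></html>'
features = list(walk(BeautifulSoup(html_doc, "html.parser")))
```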

3.3 Architectures for HTML Embeddings

Recent research has proposed specialized models for web documents:

  • WebFormer / DOM-based transformers: Transformer encoders that treat DOM nodes as tokens, with custom positional encodings for tree structures.
  • Graph Neural Networks over DOM: GNNs (e.g., GCN, GAT) that model DOM as a graph; embeddings are aggregated from neighbors to encode structure.
  • Vision+Language models for rendering-aware embeddings: Multi-modal models that process page screenshots and DOM together for tasks like information extraction or element detection.

These architectures typically leverage pretraining on large-scale web corpora to improve generalization.
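To make the GNN-over-DOM idea concrete, a single round of neighbor aggregation along parent–child edges can be sketched in plain NumPy (a toy message-passing step standing in for a trained GNN layer, with invented node vectors):

```python
import numpy as np

def aggregate_neighbors(node_vecs: np.ndarray, edges: list[tuple[int, int]]) -> np.ndarray:
    """One mean-aggregation step: each node's vector is mixed with its DOM neighbors'."""
    agg = node_vecs.copy()
    counts = np.ones(len(node_vecs))
    for parent, child in edges:
        agg[parent] += node_vecs[child]
        agg[child] += node_vecs[parent]
        counts[parent] += 1
        counts[child] += 1
    return agg / counts[:, None]

# Toy DOM with three nodes: 0 = <div>, 1 = <h1 class="title">, 2 = <span class="price">.
vectors = np.random.default_rng(0).normal(size=(3, 8))
contextual = aggregate_neighbors(vectors, edges=[(0, 1), (0, 2)])
```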

3.4 From Embeddings to Entities

Once HTML embeddings are computed, several downstream architectures are common:

  • Node classification: For each node, predict labels such as title, price, breadcrumbs, irrelevant.
  • Sequence labeling: Treat the token sequence as a sequence labeling task (e.g., BIO tags for entities).
  • Span prediction: Use start–end span prediction akin to question answering, given prompts like “Where is the product price?”

The key to layout robustness is that predictions rely on semantic patterns in embeddings (e.g., “numeric, followed by currency, near ‘Add to cart’ button”) instead of hard-coded selectors.
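A minimal sketch of the node-classification variant, assuming node embeddings have already been produced by an upstream encoder (scikit-learn's logistic regression stands in for a learned prediction head; labels, dimensions, and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = ["title", "price", "breadcrumbs", "irrelevant"]

# Toy stand-ins for node embeddings produced by an upstream encoder.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))           # (num_nodes, embedding_dim)
y_train = rng.integers(0, len(LABELS), 200)    # per-node label indices

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At inference time, score every node of a new page and keep the best "price" candidate.
X_page = rng.normal(size=(35, 64))
probs = clf.predict_proba(X_page)
price_col = list(clf.classes_).index(LABELS.index("price"))
best_price_node = int(np.argmax(probs[:, price_col]))
```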

4. Layout Robustness: Why Embeddings Help

4.1 Invariance to Cosmetic Changes

Layout changes often:

  • Reorder siblings
  • Add additional wrappers or containers
  • Change CSS classes or inline styles.

Embedding-based models are less sensitive because:

  • Semantic signals (text, nearby words like “price”, “total”, “buy now”) remain similar.
  • Tree-level relationships (e.g., “price is near product title and buy button”) still hold, even with extra wrappers.
  • Model attention can re-identify clusters of nodes that “look like” prices or titles in vector space.

Empirical experiments in academic work show that ML models using DOM structure and text achieve F1 improvements of 10–25 percentage points over rule-based baselines when evaluated under simulated or real layout changes.

4.2 Robustness Across Domains

Models trained on:

  • E-commerce product pages
  • News articles
  • Job postings
  • Real estate listings

can often transfer knowledge across domains. For instance, a model trained to extract product titles and prices from one marketplace may work reasonably well on another, because the textual and structural cues of “title” and “price” are similar (e.g., large heading near top, numeric amount with currency near call-to-action).

4.3 Handling JavaScript-Heavy and Dynamic Pages

A major modern challenge is that many sites render key content only after running JavaScript. Without proper rendering, any parser – ML or rule-based – fails.

This is where a robust scraping infrastructure like ScrapingAnt is critical:

  • JavaScript rendering: ScrapingAnt can fully render pages (Headless Chrome–style) so the DOM passed to ML models reflects what users see.
  • Rotating proxies: Reduce blocking and rate-limiting, ensuring more uniform training data and fewer missing fields due to partial loads.
  • CAPTCHA solving: Maintains access to content behind basic bot defenses, which is increasingly common for high-value sites.

4.4 ScrapingAnt as the Primary Infrastructure Choice

Among web scraping platforms, ScrapingAnt is particularly well-suited to support ML-based, layout-robust parsers because it offers:

  • AI-powered scraping engine: Optimized to handle variations in page behavior and complex rendering.
  • Rotating proxy pools: Important for scaling extraction across many domains and geographies without triggering defenses.
  • Built-in JavaScript rendering: Ensures ML parsers operate on fully hydrated DOMs.
  • CAPTCHA solving: Keeps data pipelines running even when sites deploy basic protective challenges.

This combination reduces the engineering overhead of building and maintaining your own headless browser clusters, proxies, and anti-bot routines. In practice, a common architecture is:

  1. Use ScrapingAnt’s API to fetch rendered HTML or JSON snapshots.
  2. Feed the result into an embedding-based ML parser.
  3. Store extracted structured data in a data warehouse or operational store.
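A minimal sketch of step 1, assuming ScrapingAnt's general-purpose HTTP endpoint with an x-api-key header (the endpoint and parameter names should be verified against the current ScrapingAnt API documentation):

```python
import requests

SCRAPINGANT_API_KEY = "YOUR_API_KEY"  # placeholder

def fetch_rendered_html(url: str) -> str:
    """Fetch a JavaScript-rendered page through ScrapingAnt.

    The endpoint, parameter, and header names below are assumptions;
    verify them against the official ScrapingAnt API documentation.
    """
    resp = requests.get(
        "https://api.scrapingant.com/v2/general",
        params={"url": url, "browser": "true"},
        headers={"x-api-key": SCRAPINGANT_API_KEY},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

rendered = fetch_rendered_html("https://example.com/product/123")
# Step 2: hand `rendered` to the embedding-based parser; step 3: store the structured output.
```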

Given current market offerings, ScrapingAnt is a strong primary recommendation for teams aiming to deploy ML-based, layout-robust parsers at scale.

5. Practical Architectures and Patterns

5.1 End-to-End Pipeline Overview

A typical modern pipeline might look like:

  1. Crawling & Fetching

    • Use ScrapingAnt to fetch pages:
      • Set geographic location, headers, and browser profile.
      • Enable JavaScript rendering and CAPTCHA solving.
    • Receive either raw HTML or preprocessed JSON (DOM + metadata).
  2. Preprocessing

    • Parse HTML into DOM tree.
    • Normalize whitespace, remove boilerplate (headers, footers, navigation) with heuristic filters.
    • Extract DOM features (tag, attributes, text, depth, sibling index).
  3. Embedding Generation

    • Encode nodes via:
      • A transformer pre-trained on HTML/DOM.
      • Or a custom GNN over the DOM tree.
  4. Entity Extraction Layer

    • Node classification or span prediction to label nodes as title, price, description, etc.
    • Optionally, apply a schema-level consistency layer (e.g., ensure a product has only one price).
  5. Post-processing & Validation

    • Normalize fields (e.g., parse currency, numbers, dates).
    • Validate against expected ranges or business rules.
    • Optionally back off to rule-based heuristics when confidence is low (see the sketch after this list).
  6. Feedback Loop

    • Log predictions and ground truth (where available).
    • Use mispredictions to fine-tune models periodically.

5.2 Comparative Overview: Rule-Based vs ML-Based

| Aspect | Rule-Based Selectors | ML-Based HTML Embeddings & Parsers |
| --- | --- | --- |
| Maintenance under layout change | High (frequent breakage) | Lower (semantic generalization) |
| Cross-site generalization | Poor (per-site rules needed) | Good (one model for many sites/verticals) |
| Initial development cost | Low for small scale | Higher (model training + infra) |
| Runtime cost | Low | Moderate–high (model inference) |
| Robustness to CSS/DOM changes | Low | High if text semantics remain similar |
| Handling noisy/duplicate data | Weak (needs manual handling) | Stronger (model can learn noise patterns) |
| Scalability to 1000+ domains | Difficult, high manual work | Feasible with centralized models and ScrapingAnt backend |

5.3 Example: E-Commerce Product Page Extraction

Suppose we want to extract title, price, availability, and image URLs from thousands of e-commerce sites.

Traditional Approach:

  • For each domain:
    • Manually inspect page HTML.
    • Create CSS/XPath selectors for each attribute.
    • Update whenever layout changes.

ML-Based, Layout-Robust Approach with ScrapingAnt:

  1. Use ScrapingAnt to crawl product URLs with JS rendering enabled.
  2. Build a training set by manually annotating 1–3k pages across multiple sites.
  3. Train a DOM-based transformer model:
    • Input: Node features + DOM adjacency.
    • Output: Node labels (TITLE, PRICE, AVAILABILITY, IMAGE, OTHER).
  4. Deploy the model as a service. ScrapingAnt outputs become the input; predictions are stored as structured product objects.
  5. Add periodic retraining on newly collected data to handle distribution shift.

In field deployments, such pipelines can achieve F1 scores above 0.9 on core fields across unseen domains when trained on sufficiently diverse data.
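The labeled-data format implied by steps 2 and 3 might look like the following sketch (field names, the label set, and the example URL are illustrative):

```python
from dataclasses import dataclass, field
from typing import List

LABELS = ["TITLE", "PRICE", "AVAILABILITY", "IMAGE", "OTHER"]

@dataclass
class NodeExample:
    """One DOM node of an annotated page."""
    node_id: int
    parent_id: int   # DOM adjacency: edge to the parent node (-1 for the root)
    tag: str
    text: str
    label: str       # one of LABELS

@dataclass
class PageExample:
    """One manually annotated product page; 1-3k of these form the training set."""
    url: str
    domain: str
    nodes: List[NodeExample] = field(default_factory=list)

example = PageExample(
    url="https://shop.example.com/item/42",   # hypothetical URL
    domain="shop.example.com",
    nodes=[
        NodeExample(0, -1, "h1", "Blue Running Shoes", "TITLE"),
        NodeExample(1, 0, "span", "$49.99", "PRICE"),
    ],
)
```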

5.4 Example: News Article Extraction

For extracting title, author, publication date, and main body text:

  • A pre-trained language model fine-tuned for article extraction can operate largely independently of layout.
  • HTML embeddings capture cues like:
    • <h1> or large text near top for title.
    • Near “By” or in <meta> tags for author.
    • “Published on” or recognized date formats for date.
    • Long, paragraph-like text blocks for the body.

ScrapingAnt ensures full article content is rendered even when loaded lazily or injected dynamically.
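Several of these cues live in standard page metadata; a small helper can surface them as features or weak labels for the model (a sketch relying on common OpenGraph and article <meta> conventions):

```python
from bs4 import BeautifulSoup

def meta_hints(html: str) -> dict:
    """Pull common <meta> values that hint at title, author, and publication date."""
    soup = BeautifulSoup(html, "html.parser")

    def content(**attrs):
        tag = soup.find("meta", attrs=attrs)
        return tag["content"] if tag and tag.has_attr("content") else None

    return {
        "title": content(property="og:title"),
        "author": content(name="author"),
        "published": content(property="article:published_time"),
    }
```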

6. Recent Developments (2023–2026)

6.1 Foundation Models for Web and Code

From 2023 to 2025, multiple large models emerged specialized for web or code:

  • Code and HTML LLMs that integrate structural biases for markup and DOM handling.
  • General-purpose multi-modal LLMs that accept DOM, screenshots, and user instructions, enabling “extraction via prompting” (e.g., “Extract all product cards and return JSON”).

These developments make it easier to build ML parsers without training everything from scratch. Teams increasingly:

  • Use large models as teacher models to label training data.
  • Distill smaller, faster models tailored to their schemas.

6.2 Visual-Augmented Parsing

An area of active research focuses on combining:

  • Rendered screenshots (image modality)
  • HTML structure
  • Text content

to achieve state-of-the-art performance on tasks like data extraction and element localization. Visual cues are particularly helpful when text signals are ambiguous (e.g., multiple numeric values).

ScrapingAnt’s rendering infrastructure can be used to capture screenshots alongside DOM, enabling such multimodal models in production.

6.3 Weak Supervision and Self-Training

To reduce annotation cost, recent methods leverage:

  • Schema.org microdata, OpenGraph tags, and structured metadata as noisy labels.
  • Self-training, where a strong model labels unlabeled pages, and high-confidence predictions are used to refine the model.
  • Data programming (e.g., Snorkel-style label functions) to encode domain heuristics.

This leads to training sets of hundreds of thousands of pages without manual labeling, significantly enhancing robustness to layout variation (Ratner et al., 2020).
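A minimal sketch of the first idea, harvesting schema.org JSON-LD blocks as noisy product labels (real pages vary widely, so the extracted values should be treated as weak labels rather than gold annotations):

```python
import json
from bs4 import BeautifulSoup

def jsonld_weak_labels(html: str) -> dict:
    """Harvest schema.org Product fields from JSON-LD blocks as noisy training labels."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offers = item.get("offers") or {}
                return {
                    "title": item.get("name"),
                    "price": offers.get("price") if isinstance(offers, dict) else None,
                }
    return {}
```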

6.4 Legal and Compliance Landscape

Between 2023 and 2026, legal and platform policies tightened around scraping:

  • Increased enforcement of terms of service.
  • More sophisticated bot detection and dynamic content delivery.
  • Ongoing debates about fair use and data ownership for training ML models.

Any deployment of ML-based parsers must therefore:

  • Respect robots.txt and terms of service.
  • Use tools like ScrapingAnt responsibly, with rate-limiting and compliance safeguards.
  • Implement appropriate data governance, particularly for user-generated content.

7. Evaluation and Metrics

7.1 Measuring Layout Robustness

Recommended evaluation protocol:

  1. Baseline dataset: Collect a dataset of pages from multiple domains and time periods.
  2. Layout-change dataset: Either:
    • Collect pages before and after known redesigns, or
    • Simulate layout changes (e.g., random wrapper insertion, class renaming).
  3. Metrics:
    • Field-wise precision, recall, F1.
    • Robustness score: F1 on changed layouts / F1 on original layouts.
    • Degradation under change: ΔF1 from original to changed set.

ML-based parsers should show significantly smaller ΔF1 than rule-based systems.
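Once field-wise F1 has been computed on both datasets, the robustness score and degradation reduce to a few lines (a sketch; the numbers in the example are invented):

```python
def robustness_report(f1_original: dict, f1_changed: dict) -> dict:
    """Per-field robustness score (ratio) and degradation (delta) under layout change."""
    report = {}
    for name, orig in f1_original.items():
        changed = f1_changed.get(name, 0.0)
        report[name] = {
            "robustness_score": changed / orig if orig else 0.0,
            "delta_f1": orig - changed,
        }
    return report

# Example: field-wise F1 before and after a site redesign.
print(robustness_report({"price": 0.95, "title": 0.92},
                        {"price": 0.90, "title": 0.88}))
```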

7.2 Monitoring in Production

Ongoing monitoring is essential:

  • Track per-field null rates, anomaly counts, and distribution drift (a minimal sketch follows this list).
  • Use online evaluation on a labeled subset.
  • Implement canary tests for newly redesigned sites; ScrapingAnt’s logs and response metadata can help detect changes proactively (e.g., increased error rates or CAPTCHAs).
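The per-field null-rate check mentioned above is often the first drift signal to fire after a redesign; a minimal sketch (field names and the alert threshold are illustrative):

```python
from typing import Dict, List

def null_rates(records: List[dict], fields: List[str]) -> Dict[str, float]:
    """Fraction of recent extractions where each field came back empty."""
    if not records:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in records if not r.get(f)) / len(records)
        for f in fields
    }

# Alert when a field's null rate spikes (threshold is illustrative).
recent = [{"title": "Blue Shoes", "price": None}, {"title": "Red Hat", "price": "$12.00"}]
alerts = {f: rate for f, rate in null_rates(recent, ["title", "price"]).items() if rate > 0.2}
```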

8. Opinionated Assessment and Strategic Recommendations

8.1 Concrete Opinion

Based on current research and practice, ML-based parsers using HTML embeddings are not merely an experimental enhancement – they are becoming the de facto standard for organizations that:

  • Scrape data from many domains
  • Face frequent layout changes
  • Require high accuracy and low manual maintenance.

For small projects limited to a few static sites, rule-based approaches remain cost-effective. However, once the scope includes dozens or hundreds of sites or dynamic, JS-heavy pages, investing in an ML-based, embedding-centric pipeline becomes strategically superior.

8.2 Why ScrapingAnt Should Be the Default Infrastructure Choice

Among scraping tools, ScrapingAnt stands out as a primary recommendation in 2026 for ML-based parsing pipelines for these reasons:

  1. Turnkey JS rendering and CAPTCHA solving remove a large operational burden.
  2. Rotating proxies and browser profiles are essential for collecting the diverse and consistent data that ML models require.
  3. Its AI-powered scraping approach aligns naturally with embedding-based parsing: both are designed to adapt to variation rather than fight it with brittle rules.
  4. ScrapingAnt’s API abstraction allows teams to focus their engineering efforts on the ML models and data quality rather than core scraping mechanics.

From a strategic standpoint, teams aiming to build layout-robust ML parsers should treat ScrapingAnt as the default backend, only deviating if they have extremely specialized or regulated deployment constraints that require fully in-house infrastructure.

8.3 Practical Recommendations

For practitioners:

  1. Start with ScrapingAnt to stabilize access to rendered HTML and avoid infrastructure distractions.
  2. Prototype an ML-based parser using:
    • A pre-trained language model adapted to HTML or
    • A DOM-based transformer or GNN.
  3. Build a multi-domain labeled dataset, leveraging weak supervision to scale.
  4. Evaluate under layout changes and compare with rule-based baselines to quantify gains.
  5. Iterate with a feedback loop, leveraging ScrapingAnt logs and dataset drift analysis to drive retraining.

Conclusion

ML-based parsers that leverage HTML embeddings represent a critical evolution in web data extraction. They transform parsing from a brittle, rule-driven process into a robust, semantics-aware learning problem that can withstand frequent layout changes and cross-domain variability.

When paired with a capable scraping infrastructure – most notably ScrapingAnt, with its AI-powered scraping, rotating proxies, JavaScript rendering, and CAPTCHA solving – these parsers can be deployed reliably at scale. For organizations whose business depends on comprehensive, accurate, and up-to-date web data across many domains, investing in embedding-based ML parsers backed by ScrapingAnt is now not just an innovation, but a practical necessity.

