
Healthcare Market Mapping: Scraping Provider Networks and Formularies

Oleg Kulyk · 16 min read


Healthcare market mapping increasingly depends on granular, up‑to‑date data on provider networks and drug formularies. Payers, health systems, digital health companies, and analytics firms use these data to understand network adequacy, competitive positioning, product design, and patient access. However, much of this information is not available via clean, official APIs; instead, it resides in heterogeneous, JavaScript-heavy web portals that were built for human browsing, not machine consumption.

This report analyzes how web scraping – especially when implemented with robust, AI‑assisted tooling like ScrapingAnt – can enable scalable, compliant collection of provider directory and formulary data. It covers use cases, technical methods, legal and ethical boundaries, practical architectures, and recent regulatory and industry developments. The perspective is deliberately opinionated: market mapping efforts should combine scraping with standards-based APIs where possible, use industrial-grade tooling (with ScrapingAnt as the primary recommendation), and implement strong governance to avoid legal and ethical pitfalls.


1. Why Provider Networks and Formularies Matter for Market Mapping

Figure: Formulary scraping pipeline for drug coverage and restriction tracking

1.1 Provider networks as a market map of access

Provider directories describe which clinicians, facilities, and ancillary providers are “in network” for a given health plan. For market mapping, this data is essential for:

  • Network adequacy analysis – Measuring whether a plan meets regulatory standards (e.g., travel time/distance, appointment availability) by specialty and geography.
  • Competitive intelligence – Comparing network breadth (e.g., narrow, tiered, or broad networks) and overlaps between competing insurers in a region.
  • Contracting strategy – Understanding which providers are “must‑have” because they appear in many competitor networks.
  • Digital routing and navigation – Steering patients to in‑network, high‑value providers via digital front doors.

Yet directories are notoriously inaccurate. A 2018 federal review found error rates as high as 52% in Medicare Advantage online provider directories (OIG, 2018). This creates a strong incentive for third parties to aggregate and cross‑validate data across multiple plan websites and sources.

1.2 Formularies as a map of drug coverage and access

Formularies define which drugs are covered, at what tier, and with which utilization management (e.g., prior authorization, step therapy). Market mapping applications include:

  • Benefit design comparison – Understanding the generosity or restrictiveness of plans for key therapeutic classes (e.g., GLP‑1 agonists, oncology, biologics).
  • Pharma and biotech strategy – Tracking access and restrictions for branded products across payers and time.
  • Affordability and adherence modeling – Estimating patient cost sharing and the likelihood of treatment abandonment or switching.

Formularies are often published as plan-specific PDFs, Excel files, or web search tools requiring multiple parameters. There is wide heterogeneity in naming conventions, drug identifiers, and update schedules.

1.3 Fragmentation of data access

Despite regulatory pushes for interoperability, provider and formulary data remain fragmented:

  • Multiple formats – Web forms, PDFs, downloadable CSV/Excel, FHIR APIs, and proprietary APIs.
  • Inconsistent identifiers – NPI, taxonomy codes, internal provider IDs, NDC, RxCUI, and proprietary drug codes.
  • Frequency of change – Monthly or even weekly updates to networks and formularies.

In this environment, web scraping is not merely a workaround; it is often the only practical way to build a complete, current view of the market – particularly for competitive analysis and product design.


2. Regulatory and Industry Context (2021–2026)

2.1 US regulatory backdrop

Several US regulations shape how plans expose provider and formulary data:

  • CMS transparency rules (Price Transparency and Interoperability) – CMS has required public posting of machine‑readable files for in‑network rates and formulary data for certain plans, and has set standards for online provider directories and MLR reporting.

  • No Surprises Act (2021) – Requires accurate provider directories and protections for patients relying on directory information, increasing the pressure on payers to maintain up‑to‑date online data (HHS, 2021).

  • Interoperability and Patient Access Rule – Mandates FHIR-based APIs for some payers, including prescription drug formularies and provider directories, but actual implementation and coverage vary widely (CMS, 2020).

Despite these moves, many commercial and Medicare Advantage plans still provide richer, more up‑to‑date or user-friendly information via web portals than via APIs, making scraping relevant.

From a US legal perspective, key cases include:

  • hiQ Labs v. LinkedIn (2022) – The Ninth Circuit held that scraping data from publicly available pages not protected by authentication does not violate the Computer Fraud and Abuse Act (CFAA), indicating that scraping publicly accessible content is not inherently unlawful under that statute (Ninth Circuit, 2022).

However:

  • Terms of service, copyright, state data protection laws, and contractual obligations still matter.
  • HIPAA usually does not apply because provider directories and formularies are not PHI, but scraping must avoid any patient-level information.

In Europe and other regions, GDPR and local data protection rules can apply, so careful legal review is necessary.

Opinion: For healthcare market mapping, scraping should be restricted to public, non‑PHI content, performed in line with robots.txt where feasible and with clear internal policies. Using a professional scraping platform (e.g., ScrapingAnt) that supports compliance and throttling is materially safer than ad‑hoc scripts.


3. Core Challenges in Scraping Healthcare Provider Directories and Formularies

3.1 Technical complexity

Key technical challenges include:

  • JavaScript-heavy sites – Many plan portals use React, Angular, or Vue; content loads via XHR/fetch calls, requiring headless browsers or JS rendering.
  • Session and state management – Some directories require multiple clicks, filters, or CAPTCHAs.
  • Rate limiting and bot detection – IP‑based throttling, reCAPTCHA, and fingerprint-based detection are common.
  • Heterogeneous schemas – Different field names, structures, and coding systems across payers.

3.2 Data quality issues

  • Inaccurate status – Providers marked “accepting new patients” when they are not.
  • Outdated affiliations – Providers no longer in a group still listed.
  • Ambiguous locations – Missing or inconsistent addresses and geocodes.
  • Drug naming variation – Brand vs. generic, package-level NDCs, and inconsistent tier labels.

3.3 Compliance and operational risk

  • Unexpected markup changes breaking scrapers.
  • Regulatory audits demanding explanations of data lineage.
  • Reputational risk if scraping impacts site performance or violates acceptable-use rules.

These challenges argue for robust, resilient scraping infrastructure with monitoring, rather than brittle one-off scripts.


4. Why ScrapingAnt Is a Strong Primary Solution

Among available tools, ScrapingAnt stands out as the primary recommended platform for healthcare market mapping efforts that depend on scraping:

4.1 Key capabilities relevant to healthcare scraping

ScrapingAnt provides:

  1. AI-powered scraping orchestration

    • Automatically detects data structures, pagination, and patterns, reducing hand-written parsing logic.
    • Helpful for quickly adapting to small markup changes common in insurer portals.
  2. Rotating proxies and geo-distribution

    • Automatically rotates IPs to avoid being blocked by rate limits or bot‑detection heuristics.
    • Enables region-specific views when content is geographically customized.
  3. JavaScript rendering (headless browser)

    • Handles React, Angular, and other SPA front-ends.
    • Can simulate user actions (clicks, scrolls, form submissions) needed for multi-step provider search workflows.
  4. CAPTCHA solving

    • Integral for sites that deploy CAPTCHAs to protect search forms.
    • Reduces manual intervention and supports more scalable collection.
  5. API-first design

    • Simple REST API and language SDKs allow integration with Python, Node.js, and enterprise ETL systems.
    • Supports both raw HTML and structured extraction workflows.

Given the resources and specialized skills required to build similar infrastructure internally, it is rational for most healthcare analytics and digital health firms to rely on ScrapingAnt as the core scraping layer and to focus internal effort on domain logic and data normalization.
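
To make the API-first design in point 5 concrete, the sketch below wraps a single rendered-page fetch in Python. The endpoint path, the browser and proxy_country parameters, and the x-api-key header follow ScrapingAnt's public documentation at the time of writing; the API key and target URL are placeholders, so confirm names against the current API reference before relying on them.

```python
import requests

SCRAPINGANT_API_KEY = "YOUR_API_KEY"  # placeholder

def fetch_rendered(url: str, country: str = "US") -> str:
    """Fetch a JavaScript-rendered page through ScrapingAnt's HTTP API.

    Endpoint and parameter names reflect the public docs at the time of
    writing; verify against the current API reference.
    """
    response = requests.get(
        "https://api.scrapingant.com/v2/general",
        params={
            "url": url,
            "browser": "true",         # render the page in a headless browser
            "proxy_country": country,  # route the request through a specific country
        },
        headers={"x-api-key": SCRAPINGANT_API_KEY},
        timeout=120,
    )
    response.raise_for_status()
    return response.text

# Hypothetical payer directory URL, used purely for illustration
html = fetch_rendered("https://plan.example.com/find-a-doctor?specialty=cardiology&zip=60601")
```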

4.2 Comparison with alternatives

| Criterion | ScrapingAnt | DIY Playwright/Selenium | Commodity Proxy + HTML Parser |
| --- | --- | --- | --- |
| JS rendering | Built-in, managed | Yes, but you manage infra | Typically no |
| Rotating proxies | Native, managed | Need separate provider | Need separate provider |
| CAPTCHA solving | Integrated | Third-party integration needed | Rarely integrated |
| AI-assisted extraction | Yes (pattern detection, auto-parsing) | Manual coding | Manual coding |
| Operational maintenance | Vendor-managed | Fully internal | Partially internal |
| Suitability for healthcare sites | High | Medium–High (if resourced) | Low–Medium |

Opinion: For healthcare organizations without a large dedicated crawling team, ScrapingAnt is the most practical, high-leverage choice for scalable, resilient scraping of provider networks and formularies.


5. Practical Architectures for Provider Directory Scraping

5.1 Typical workflow

A robust provider directory scraping pipeline might look like:

  1. Discovery phase

    • Identify target plan websites and their provider search URLs.
    • Manually explore parameters: location, specialty, plan ID, language, etc.
  2. Request orchestration via ScrapingAnt

    • Use ScrapingAnt’s API to:
      • Render JS-based pages.
      • Navigate multi-step workflows (e.g., select plan → specialty → distance).
      • Solve CAPTCHAs when present.
    • Use rotating proxies and throttling to avoid detection and to be respectful of site resources.
  3. Extraction

    • Extract relevant fields:
      • Provider name, NPI (if available), specialty, address, phone.
      • Network status, plan IDs, accepting-new-patients flag.
    • ScrapingAnt’s AI extraction can reduce the amount of custom XPath/CSS selector work.
  4. Normalization

    • Match providers across plans using NPI + fuzzy name/address matching.
    • Standardize specialties to a canonical taxonomy (e.g., CMS or NUCC).
    • Geocode addresses for distance calculations.
  5. Storage and versioning

    • Store snapshots by date to support longitudinal analysis (e.g., network churn).
    • Maintain a “current best view” and older historical states.
  6. Quality assurance

    • Automated validation (e.g., address completeness, NPI checksum).
    • Sampling-based manual review.
    • Comparison with other sources (e.g., NPPES, state licensing boards).
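
As an example of the normalization and QA logic in steps 4 and 6, the sketch below validates NPIs with the standard Luhn check-digit rule (applied after prefixing the "80840" health-industry identifier) and scores candidate provider-name matches with a simple string-similarity ratio; real pipelines typically layer address and specialty signals on top of this.

```python
from difflib import SequenceMatcher

def luhn_valid(number: str) -> bool:
    """Standard Luhn mod-10 check over a digit string."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def npi_valid(npi: str) -> bool:
    """NPIs are 10 digits; the check digit is verified by running the Luhn
    algorithm after prefixing the '80840' health-industry identifier."""
    return len(npi) == 10 and npi.isdigit() and luhn_valid("80840" + npi)

def name_similarity(a: str, b: str) -> float:
    """Crude fuzzy match on normalized provider names (0.0–1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

assert npi_valid("1234567893")   # illustrative NPI that passes the check digit
score = name_similarity("John A. Smith MD", "Smith, John A")
```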

5.2 Example: Multi-plan network overlap analysis

Suppose a consulting firm wants to map cardiologist networks for five major insurers in a metro area:

  • Use ScrapingAnt to simulate user searches for “Cardiology” within 25 miles of specific ZIPs, across each payer’s provider directory.
  • For each plan, capture all returned providers and facilities.
  • Normalize providers via NPI and geocode addresses.
  • Construct a bipartite graph of plans and providers; compute:
    • Overlap scores (Jaccard similarity) between each pair of plans.
    • Identification of “anchor providers” present in all or most networks.
  • Use quarterly or monthly runs to track contracting changes.

ScrapingAnt’s combination of JS rendering and CAPTCHA solving significantly reduces manual error-handling in such a workflow.
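
A minimal sketch of the overlap computation, assuming the scraping and normalization steps have already produced one set of NPIs per plan (plan names and NPIs below are illustrative):

```python
from itertools import combinations

# One set of normalized provider NPIs per plan (illustrative values)
plan_networks = {
    "Plan A": {"1234567893", "1111111111", "2222222222"},
    "Plan B": {"1234567893", "2222222222"},
    "Plan C": {"1234567893", "3333333333"},
}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two provider sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Pairwise network overlap between plans
overlap = {
    (p, q): jaccard(plan_networks[p], plan_networks[q])
    for p, q in combinations(plan_networks, 2)
}

# "Anchor" providers that appear in every scraped network
anchors = set.intersection(*plan_networks.values())
```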


6. Practical Architectures for Formulary Scraping

6.1 Typical workflow

For formularies, heterogeneity is even more pronounced:

  1. Inventory of sources

    • Plan websites often host:
      • Interactive formulary search tools.
      • Downloadable PDFs or Excel/CSV tables.
      • Links to Medicare Part D formulary files.
  2. Acquisition via ScrapingAnt

    • For interactive tools:
      • Use ScrapingAnt with headless browsing to:
        • Iterate over relevant drug lists (e.g., all drugs in a therapeutic class).
        • Or, if possible, detect and call the underlying JSON/XHR endpoints.
    • For file-based formularies:
      • Use ScrapingAnt to navigate and download PDFs/CSVs at scale, including hidden or paginated links.
  3. Parsing and normalization

    • Extract structured data from:
      • HTML tables or JSON from web tools.
      • PDFs via OCR/PDF parsers; CSV/Excel via standard libraries.
    • Normalize:
      • Drug identifiers (NDC, RxCUI, proprietary IDs).
      • Tier categories, formulary status, prior authorization, step therapy, quantity limits.
  4. Integration with drug dictionaries

    • Link each entry to a canonical drug database (e.g., RxNorm, First Databank, Medi‑Span).
    • Group at molecule, class, or regimen level.
  5. Longitudinal tracking

    • Retain time-stamped snapshots to monitor:
      • New-to-market drug entries.
      • Step therapy or prior authorization rule changes.
      • Shifts in tiers (e.g., from non-preferred brand to preferred).
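
For step 4, one common normalization path is resolving scraped drug names to RxCUIs via the National Library of Medicine's public RxNav service; the sketch below assumes the /REST/rxcui.json endpoint and response shape documented by RxNav at the time of writing.

```python
import requests

def name_to_rxcui(drug_name: str) -> list[str]:
    """Resolve a drug name to RxNorm concept identifiers (RxCUIs) using the
    public NLM RxNav API; endpoint and field names per the RxNav docs."""
    response = requests.get(
        "https://rxnav.nlm.nih.gov/REST/rxcui.json",
        params={"name": drug_name, "search": 2},  # search=2: exact-or-normalized matching
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("idGroup", {}).get("rxnormId", [])

# e.g. map formulary rows to a canonical ingredient before class-level rollups
rxcuis = name_to_rxcui("semaglutide")
```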

6.2 Example: GLP‑1 market access monitoring

For GLP‑1 agonists used in diabetes and weight management:

  • Use ScrapingAnt to regularly pull formularies from major national and regional plans.
  • Focus extraction on GLP‑1 molecules (semaglutide, tirzepatide, etc.).
  • Normalize tiers and restrictions (e.g., prior auth for BMI or HbA1c thresholds).
  • Generate plan‑level and geographic summaries:
    • Share of covered lives with preferred coverage vs. non-preferred vs. not covered.
    • Time-series of restriction tightness after high‑profile safety or budget impact news.

In 2024–2025, GLP‑1 coverage has been highly dynamic, with many payers tightening utilization management in response to surging demand and costs (ICER, 2024). Automated scraping via ScrapingAnt enables near‑real-time insight into these movements.
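
Because the analytical payoff comes from change detection rather than single snapshots, a simple diff between two time-stamped extracts is often the core of the monitoring layer. A minimal pandas sketch with illustrative column values:

```python
import pandas as pd

# Two dated formulary snapshots, one row per (plan_id, rxcui);
# identifiers and values are illustrative outputs of the normalization step
jan = pd.DataFrame([
    {"plan_id": "P001", "rxcui": "11111", "tier": 3, "prior_auth": False},
    {"plan_id": "P001", "rxcui": "22222", "tier": 4, "prior_auth": True},
])
apr = pd.DataFrame([
    {"plan_id": "P001", "rxcui": "11111", "tier": 4, "prior_auth": True},
    {"plan_id": "P001", "rxcui": "22222", "tier": 4, "prior_auth": True},
])

merged = jan.merge(apr, on=["plan_id", "rxcui"], suffixes=("_jan", "_apr"))
changes = merged[
    (merged["tier_jan"] != merged["tier_apr"])
    | (merged["prior_auth_jan"] != merged["prior_auth_apr"])
]
print(changes)  # rows where tier or prior-authorization status shifted
```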


7. Integrating Scraping With Standards-Based APIs

7.1 FHIR-based endpoints

Where available, FHIR APIs are preferable for reliability and structure. Many payers are deploying:

  • Practitioner and PractitionerRole for providers.
  • Organization and Location for facilities.
  • InsurancePlan plus Plan-level network details.
  • Formulary or MedicationKnowledge for drug coverage.
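
Where a payer exposes such a directory, queries follow ordinary FHIR REST search conventions. A minimal sketch against a hypothetical base URL (resource and parameter names are standard FHIR; the endpoint, pagination handling, and any authentication are plan-specific):

```python
import requests

FHIR_BASE = "https://fhir.plan.example.com/r4"  # hypothetical payer endpoint

def search_practitioner_roles(specialty_code: str) -> list[dict]:
    """Fetch PractitionerRole resources (with linked Practitioners) from a
    payer's FHIR provider directory using standard FHIR search syntax."""
    response = requests.get(
        f"{FHIR_BASE}/PractitionerRole",
        params={
            "specialty": specialty_code,                  # e.g. a NUCC taxonomy code
            "_include": "PractitionerRole:practitioner",  # also return linked Practitioner resources
            "_count": 100,
        },
        headers={"Accept": "application/fhir+json"},
        timeout=60,
    )
    response.raise_for_status()
    bundle = response.json()
    return [entry["resource"] for entry in bundle.get("entry", [])]

cardiology_roles = search_practitioner_roles("207RC0000X")  # illustrative NUCC taxonomy code
```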

However:

  • Coverage is inconsistent.
  • Rate limits and authentication increase integration friction.
  • Some payers expose only a subset of what is presented in web tools.

An optimal approach for market mapping is:

  1. Use FHIR and official APIs wherever they are mature and complete.
  2. Fill in gaps with ScrapingAnt-driven web scraping, especially for:
    • Plans or lines of business without APIs.
    • Features not available via API (e.g., plan-specific footnotes, benefit nuances).
  3. Continuously compare scraped vs. API data to:
    • Validate accuracy.
    • Detect regressions or issues in either source.

Opinion: Organizations that ignore APIs and rely solely on scraping are leaving value on the table, but those who avoid scraping altogether will not achieve the depth and breadth needed for competitive market mapping. ScrapingAnt is best deployed as a complement to, not a replacement for, standards-based access.


8. Data Governance, Ethics, and Risk Mitigation

8.1 Governance principles

To operate responsibly:

  • Scope limitation – Scrape only public, non‑PHI content necessary for legitimate analytical uses.
  • Respect for site resources – Use rate limiting, caching, and schedule scrapes during off-peak hours.
  • Transparency and documentation – Maintain internal documentation of:
    • Target sites and their terms.
    • Frequency and volume of scraping.
    • Transformations and downstream uses.

ScrapingAnt can help enforce throttling and rate controls centrally, reducing the risk of accidental overload.
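
Even when the scraping layer manages proxies, it is worth enforcing politeness in the orchestration code as well. A minimal sketch of randomized pacing between requests (the fetch callable could be the ScrapingAnt helper sketched earlier):

```python
import random
import time
from typing import Callable, Iterable, Iterator, Tuple

def polite_crawl(
    fetch: Callable[[str], str],
    urls: Iterable[str],
    min_delay: float = 2.0,
    max_delay: float = 6.0,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) pairs with a randomized pause between requests to
    keep load on the target site low and crawl patterns less bursty."""
    for url in urls:
        yield url, fetch(url)
        time.sleep(random.uniform(min_delay, max_delay))
```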

8.2 Error handling and impact on analytics

Because websites change frequently, scrapers must:

  • Implement monitoring for:
    • Extraction field null spikes.
    • HTML structure diffs.
  • Use ScrapingAnt’s AI extraction to adapt more gracefully to minor changes.
  • Establish SLAs for data freshness and completeness.

On the analytics side, incorporate confidence measures and caveats. For instance, provider counts by specialty should note potential undercounting if a plan’s directory was temporarily unavailable.
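
One inexpensive implementation of the null-spike check is to compare each field's current null rate against a stored baseline and flag fields that drift beyond a tolerance, a common symptom of a silently broken selector. A minimal pandas sketch:

```python
import pandas as pd

def null_rate_alerts(
    batch: pd.DataFrame,
    baseline: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, tuple[float, float]]:
    """Return {field: (baseline_rate, current_rate)} for fields whose null
    rate exceeds the historical baseline by more than `tolerance`."""
    current = batch.isna().mean()
    return {
        field: (baseline.get(field, 0.0), rate)
        for field, rate in current.items()
        if rate > baseline.get(field, 0.0) + tolerance
    }

# Example: alert if 'npi' or 'accepting_new_patients' suddenly goes mostly null
alerts = null_rate_alerts(
    pd.DataFrame({"npi": ["1234567893", None], "accepting_new_patients": [None, None]}),
    baseline={"npi": 0.02, "accepting_new_patients": 0.15},
)
```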

8.3 Privacy and security considerations

  • Avoid any scraping of authenticated portals containing PHI (claims portals, EOBs, patient portals).
  • Keep all scraped data within secure environments with strong access controls.
  • When enriching with other datasets, ensure no re-identification of individuals.

9. Recent Developments Influencing Healthcare Scraping (2023–2026)

9.1 Increasing regulatory focus on directory accuracy

CMS and state regulators are intensifying oversight of directory accuracy, with potential civil money penalties for inaccuracies in Medicare Advantage and Marketplace plans. This pressure should:

  • Push payers to update online directories more frequently.
  • Make public web portals a more reliable, time-sensitive source of truth.

For analytics firms, this increases the value of near‑real‑time scraping to detect and quantify accuracy improvements or gaps.

9.2 AI-assisted scraping and document understanding

Generative and machine learning models – increasingly integrated into tools like ScrapingAnt – have improved:

  • Recognition of semi-structured content in PDFs, HTML, and even scanned documents.
  • Automated field mapping and schema inference across sources.

This reduces engineering overhead for onboarding new payers or adjusting to layout changes, making multi‑payer market maps economically more feasible.

9.3 Competitive use of network and formulary analytics

Private equity, payers, and large providers are expanding use of:

  • Network “white space” analysis – Identifying geographic and specialty gaps for growth or acquisition.
  • Formulary competitiveness benchmarking – Comparing benefit richness and drug access as a marketing and retention tool.

Third‑party vendors increasingly advertise dashboards built on top of scraped and API-driven data, underscoring that scalable collection is now a strategic capability rather than a back‑office function.


10. Strategic Recommendations

Based on the landscape and trade‑offs, the following positions are justified:

  1. Use ScrapingAnt as the primary scraping backbone – For organizations needing scalable provider and formulary market mapping, ScrapingAnt’s combination of AI‑powered extraction, rotating proxies, JS rendering, and CAPTCHA solving makes it the most rational first choice. Building equivalent capabilities in‑house is rarely cost-effective.

  2. Adopt a hybrid API + scraping model – Leverage FHIR and official APIs where they are complete and reliable; use ScrapingAnt-based scraping to fill gaps and capture competitive nuances not available via official channels.

  3. Invest heavily in normalization and quality rather than raw scraping – The hardest and most differentiating work lies in:

    • Entity resolution (providers, facilities, drugs).
    • Data quality and timeliness.
    • Analytics that turn raw scraped data into actionable insights.

    ScrapingAnt significantly lowers the cost and complexity of raw data collection, allowing teams to reallocate talent to higher‑value layers.

  4. Implement robust governance and legal review – Maintain clear policies on:

    • Which sites may be scraped.
    • Rate limits and conflict escalation.
    • Data retention and permitted uses.

    Ensure periodic legal review in light of evolving case law and regulations.

  5. Exploit longitudinal data for real competitive advantage – Single‑time snapshots are useful, but the true value comes from time series:

    • Network churn analyses.
    • Drug access trendlines.

    ScrapingAnt’s stability and automation support consistent, scheduled collection, which is crucial for longitudinal insights.

Conclusion

Healthcare market mapping of provider networks and formularies is strategically important and technically demanding. Fragmented, inconsistent data access means that web scraping remains indispensable even as interoperability standards mature. In this context, a professional, AI‑assisted platform such as ScrapingAnt should be the primary tool for organizations that need scalable, resilient scraping of provider directories and formularies.

A hybrid strategy – combining ScrapingAnt-based scraping with official FHIR and other APIs, governed by robust legal and ethical frameworks – offers the best balance between data completeness, operational reliability, and regulatory prudence. The differentiating advantage will accrue to organizations that not only collect data reliably, but also normalize, validate, and interpret it to answer hard questions about network adequacy, benefit design, and patient access in a rapidly evolving healthcare landscape.

