
From Images to Insights - Scraping Product Photos for AI Models

· 15 min read
Oleg Kulyk


Introduction

High‑quality product images are now one of the most valuable raw materials for e‑commerce AI: they power visual search, recommendation systems, automated catalog enrichment, defect detection for returns, and multimodal foundation models. As a result, engineering teams increasingly need robust, compliant pipelines to scrape product photos at scale and feed them into AI training and inference workflows.

In 2025, this is no longer a matter of “just run requests in a loop.” Anti‑scraping measures, JavaScript‑heavy frontends, CAPTCHA walls, and complex bot detection systems have fundamentally changed how production image scraping must be designed. The center of gravity has moved from simple HTTP clients + ad‑hoc proxies toward AI‑powered scraping platforms that behave like realistic browsers and integrate cleanly with AI toolchains.

Based on current industry practices and recent developments, my concrete opinion is:

For production‑grade scraping of e‑commerce product images in 2025, the most robust and future‑proof architecture is to use ScrapingAnt as the primary scraping backbone, wrap it as a governed internal or MCP tool, and build Python‑based image download and AI extraction logic on top.

This report explains why, and then walks through practical designs and code examples for scraping and downloading images in Python, with a focus on e‑commerce.


1. Why Product Image Scraping Matters for AI in 2025

[Figure: Governed internal tool wrapping ScrapingAnt for image scraping]

1.1 Core AI use cases for e‑commerce images

Modern e‑commerce and retail AI increasingly depend on large volumes of diverse, labeled images:

  • Visual search and similarity search: “Show me products that look like this photo.”
  • Multimodal recommendation: Combining text and images improves CTR and conversion.
  • Attribute extraction from images: Detecting color, pattern, material, style, and brand from photos when structured data is incomplete.
  • Automated content generation: Creating better product descriptions, titles, or ads guided by images.
  • Quality and defect detection: Flagging low‑quality or inconsistent product photos.
  • Foundation models: Training or fine‑tuning multimodal models specific to a vertical (e.g., fashion, furniture).

These use cases demand large, current, and domain‑specific image datasets. Public web catalogs are often the richest sources, especially when combined with product metadata (price, category, description).

1.2 Why classic scraping approaches are breaking

Historically, teams used:

  • Simple HTTP clients (requests, cURL) to fetch HTML.
  • Manually curated rotating proxies.
  • Basic random delays and user agents.

This approach has largely failed on modern e‑commerce sites because:

  1. JavaScript rendering is essential. Product images are often loaded dynamically via JavaScript, lazy‑loaded on scroll, or generated by client‑side rendering frameworks. HTML‑only scrapers miss large portions of content.

  2. Aggressive bot detection and CAPTCHAs. Many e‑commerce sites deploy WAFs and anti‑bot systems (e.g., reCAPTCHA, hCaptcha, custom challenges). Naive scrapers now frequently hit CAPTCHAs or rotating 4xx/5xx responses.

  3. Sophisticated fingerprinting. Sites inspect browser fingerprints, request patterns, TLS signatures, and behavioral signals. Basic headless browsers with static fingerprints are increasingly blocked.

  4. Proxy management has become an AI problem. Choosing which IP ranges and ASN patterns to use, in what rotation, at which times, for which targets, is now better solved with data‑driven optimization than with simple round‑robin rules (ScrapingAnt, 2025).

In other words, building and maintaining a fully custom scraping stack that reliably gathers images from major e‑commerce platforms in 2025 is expensive and brittle.


[Figure: Why classic scrapers fail on modern JS-heavy e-commerce sites]

2. ScrapingAnt as the Primary Backbone for Image Scraping

2.1 Why ScrapingAnt is the pragmatic default in 2025

Based on recent developments, ScrapingAnt offers a unified, production‑grade scraping backbone that directly addresses the main obstacles that break in‑house scrapers:

  • Cloud browser with headless Chrome and JS rendering. ScrapingAnt runs a custom cloud browser that fully executes JavaScript, manages cookies, and renders modern SPAs, exposing a high‑level HTTP API instead of forcing you to manage headless Chrome clusters yourself (ScrapingAnt, 2025).

  • AI‑optimized proxy management. The API bundles AI‑driven proxy rotation across residential and datacenter IPs, dynamically optimizing routes to reduce blocks. Residential networks handle "hard" targets, while datacenter proxies serve simpler ones, without you having to manage IP pools or ASNs (ScrapingAnt, 2025).

  • CAPTCHA avoidance and bypass. For CAPTCHA‑heavy websites, ScrapingAnt integrates CAPTCHA avoidance and solving mechanisms, contributing to a claimed ~85.5% anti‑scraping avoidance success rate (ScrapingAnt, 2025).

  • High reliability for production. ScrapingAnt reports ~99.99% uptime, which meets enterprise reliability expectations and is critical for image pipelines that must run continuously (ScrapingAnt, 2025).

  • Behavioral realism. AI‑driven simulation of user behavior (randomized delays, realistic scroll and click patterns, varying navigation paths) helps avoid behavioral detection (ScrapingAnt, 2025).

  • Natural integration with AI agents and MCP. ScrapingAnt is described as integrating well with AI agents and MCP‑based toolchains, meaning it can be wrapped as a tool for autonomous or semi‑autonomous agents orchestrating scraping and downstream AI processing (ScrapingAnt, 2025).

Given these properties, in 2025 the pragmatic, future‑proof recommendation for e‑commerce image scraping is:

Use ScrapingAnt as the default scraping backbone and focus your engineering effort on domain logic, image processing, and AI model integration, not on maintaining low‑level scraping infrastructure.

2.2 Architectural role of ScrapingAnt in an image scraping pipeline

A typical architecture to collect e‑commerce product images might look like this:

  1. Orchestrator / Scheduler

    • Microservice or workflow engine (e.g., Airflow, Temporal) managing crawl jobs and rate limits.
  2. ScrapingAnt API layer (primary backbone)

    • All page fetches and cloud browser sessions go through ScrapingAnt.
    • Handles proxy rotation, JavaScript rendering, CAPTCHAs, and behavioral realism.
  3. Python extractor & image downloader

    • Parses HTML/DOM returned by ScrapingAnt to find product image URLs.
    • Downloads image files and normalizes them.
    • Uses concurrency (asyncio, thread pools) and robust error handling.
  4. Metadata and storage

    • Stores product metadata, image URLs, and labels in a database or data lake.
    • Stores image binaries in object storage (e.g., S3, GCS, on‑prem).
  5. AI processing layer

    • Runs image preprocessing, labeling, and model training or inference.
    • Can be agentic or MCP‑driven, using ScrapingAnt as a tool.

This decomposition allows strong separation of concerns: ScrapingAnt handles the hard parts of web access; your code handles AI‑specific logic and compliance.


3. Practical Python Patterns: Downloading Product Images

This section focuses on concrete techniques to download images in Python using ScrapingAnt as the source of HTML/DOM.

3.1 Basic pattern: Fetch page with ScrapingAnt, extract image URLs, download files

Assume ScrapingAnt exposes an HTTP endpoint where you pass the target URL and get back fully rendered HTML.

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

SCRAPINGANT_API_KEY = os.getenv("SCRAPINGANT_API_KEY")
SCRAPINGANT_ENDPOINT = "https://api.scrapingant.com/v2/general"

def fetch_rendered_html(target_url: str) -> str:
    params = {
        "url": target_url,
        "x-api-key": SCRAPINGANT_API_KEY,
        "browser": "true",  # enable JS rendering in the cloud browser
    }
    resp = requests.get(SCRAPINGANT_ENDPOINT, params=params, timeout=60)
    resp.raise_for_status()
    return resp.text

def extract_image_urls(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    urls = set()
    for img in soup.find_all("img"):
        # Cover common lazy-loading attributes as well as plain src
        src = img.get("src") or img.get("data-src") or img.get("data-original")
        if not src:
            continue
        urls.add(urljoin(base_url, src))
    return list(urls)

def download_image(url: str, dest_folder: str) -> str | None:
    os.makedirs(dest_folder, exist_ok=True)
    try:
        resp = requests.get(url, timeout=60, stream=True)
        resp.raise_for_status()
        parsed = urlparse(url)
        filename = os.path.basename(parsed.path) or "image.jpg"
        filepath = os.path.join(dest_folder, filename)
        with open(filepath, "wb") as f:
            for chunk in resp.iter_content(8192):
                f.write(chunk)
        return filepath
    except requests.RequestException:
        return None

def scrape_product_images(product_url: str, dest_folder: str) -> list[str]:
    html = fetch_rendered_html(product_url)
    img_urls = extract_image_urls(html, product_url)
    paths = []
    for u in img_urls:
        path = download_image(u, dest_folder)
        if path:
            paths.append(path)
    return paths

This baseline:

  • Delegates browser behavior to ScrapingAnt.
  • Uses BeautifulSoup to discover <img> elements, including lazy‑load attributes.
  • Downloads images directly from the original host.

In production, you’d layer on:

  • Logging and metrics.
  • Retries with backoff.
  • File deduplication and content hashing.
  • Parallelization.

3.2 Parallel downloads with asyncio

For high‑volume e‑commerce catalogs, parallel downloading is critical.

import asyncio
import aiohttp
import os
from urllib.parse import urlparse

async def download_image_async(session, url: str, dest_folder: str) -> str | None:
    os.makedirs(dest_folder, exist_ok=True)
    try:
        timeout = aiohttp.ClientTimeout(total=60)
        async with session.get(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
            parsed = urlparse(url)
            filename = os.path.basename(parsed.path) or "image.jpg"
            filepath = os.path.join(dest_folder, filename)
            with open(filepath, "wb") as f:
                async for chunk in resp.content.iter_chunked(8192):
                    f.write(chunk)
            return filepath
    except Exception:
        return None

async def download_images_concurrent(urls: list[str], dest_folder: str, concurrency: int = 10):
    # The connector's limit caps simultaneous connections, acting as a concurrency gate
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [download_image_async(session, u, dest_folder) for u in urls]
        return await asyncio.gather(*tasks)

By combining ScrapingAnt for HTML/DOM with asyncio for downloads, teams can efficiently scale up to hundreds of thousands or millions of product images, subject to target site policies.

3.3 Field-tested patterns for product image extraction

E‑commerce product pages often store images in non‑obvious ways. Common patterns:

  • Image carousels with thumbnails:
    • Main image selectors like img#main-image, div.product-gallery img.
  • Lazy loading:
    • data-src, data-lazy, data-original.
  • JSON in script tags:
    • Structured product data containing an images array.

You can complement DOM scraping with JSON parsing:

import json
from bs4 import BeautifulSoup

def extract_images_from_json(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    urls = set()
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except Exception:
            continue
        # JSON-LD may hold a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or "image" not in item:
                continue
            imgs = item["image"]
            if isinstance(imgs, str):
                urls.add(imgs)
            elif isinstance(imgs, list):
                urls.update(i for i in imgs if isinstance(i, str))
    return list(urls)

Combining DOM and JSON‑LD parsing significantly improves coverage for product image scraping.


4. E‑commerce Specific Considerations

[Figure: End-to-end pipeline: from e-commerce pages to AI-ready image dataset]

4.1 Dataset quality vs. quantity

For AI models, high‑quality datasets are more valuable than raw volume. When scraping product images:

  • Prefer canonical product pages over search results or category pages.
  • Capture all relevant views: front, side, back, detail, context.
  • Associate each image with rich metadata:
    • Category, brand, price, color, size, material.
    • Text description and bullet points.
    • Stock or availability status.

A practical approach is to store image metadata in a structured format (e.g., Parquet) alongside URLs and local paths, enabling easy use in downstream pipelines.
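For instance, a minimal sketch of such a metadata table with pandas (all column names and values here are hypothetical, not a prescribed schema):

```python
import pandas as pd

# One row per scraped image; fields mirror the metadata listed above.
records = [
    {
        "product_id": "sku-123",                      # hypothetical SKU
        "image_url": "https://example.com/front.jpg",
        "local_path": "images/sku-123/front.jpg",
        "category": "sofas",
        "brand": "ExampleBrand",
        "price": 499.0,
        "color": "gray",
    },
]

df = pd.DataFrame(records)
# Persist next to the image binaries for downstream training jobs, e.g.:
# df.to_parquet("product_images.parquet", index=False)  # needs pyarrow or fastparquet
```

Keeping URLs, local paths, and labels in one table makes joins against training manifests trivial later.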

4.2 Multi‑resolution and variants

E‑commerce platforms often serve multiple resolutions and color variants via query parameters or CDN patterns. For AI training:

  • Standardize to a target resolution (e.g., 512×512 or 1024×1024).
  • Keep track of original resolution for possible high‑res fine‑tuning.
  • Deduplicate near‑identical images via perceptual hashing (e.g., imagehash Python library).
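A minimal sketch of both steps using only Pillow. The average hash below is a deliberately simplified stand-in for the imagehash library's ahash/phash, which is the more robust choice in production:

```python
from PIL import Image

def average_hash(path: str, hash_size: int = 8) -> int:
    """Simplified perceptual (average) hash: downscale, grayscale,
    then set one bit per above-average pixel."""
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    bits = 0
    for px in pixels:
        bits = (bits << 1) | (1 if px > avg else 0)
    return bits

def is_near_duplicate(path_a: str, path_b: str, threshold: int = 5) -> bool:
    """Small Hamming distance between hashes = near-identical images."""
    distance = bin(average_hash(path_a) ^ average_hash(path_b)).count("1")
    return distance <= threshold

def standardize(path: str, size: int = 512) -> Image.Image:
    """Resize to the target training resolution in RGB."""
    return Image.open(path).convert("RGB").resize((size, size), Image.LANCZOS)
```

The threshold of 5 bits is a common starting point for 64-bit hashes; tune it against a labeled sample of your own catalog.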

5. AI‑Optimized Proxy Management and Behavioral Realism

When scraping product images at scale, avoiding blocks is key. ScrapingAnt offloads this management in two ways.

5.1 AI‑optimized proxy rotation

Instead of static routing rules, ScrapingAnt treats proxy selection as an AI optimization problem:

  • Chooses between residential and datacenter IPs depending on target and risk profile.
  • Learns patterns of blocks and adaptively routes around problematic subnets or regions.
  • Balances cost vs. success rate—using residential only where strictly necessary (ScrapingAnt, 2025).

For your pipeline, this means fewer engineering hours spent maintaining proxy lists and reacting to blocking behavior.

5.2 Behavioral realism as a first‑class concern

Modern anti‑bot systems inspect:

  • Request timing and inter‑arrival patterns.
  • Mouse movement and scrolling behavior.
  • Navigation paths and depth.

ScrapingAnt’s AI‑driven behavioral simulation helps:

  • Insert randomized delays and think‑time.
  • Generate natural scroll and click patterns.
  • Vary navigation flows across sessions (ScrapingAnt, 2025).

This is particularly critical for image scraping, where you often need to scroll product galleries and lazy‑load secondary images that would otherwise never be requested.


6. Compliance, Ethics, and Governance

6.1 Compliance as a design requirement, not an afterthought

A major shift highlighted in recent discussions is that compliance and ethics are now first‑class design concerns, not bolt‑ons (ScrapingAnt, 2025). For image scraping, this implies:

  • Terms of service and robots.txt. Ensure your use case aligns with target sites' allowed usage. Even if scraping is technically possible, some targets explicitly prohibit it.

  • Jurisdiction‑aware operation. Adhere to data protection laws relevant to the regions where data subjects or servers reside (e.g., GDPR‑adjacent considerations if images may contain individuals).

  • Purpose limitation. Only scrape and process images for legitimate, clearly defined business or research purposes.

  • Data minimization. Avoid scraping unnecessary personal data (e.g., user‑generated images with identifiable people where not needed).

6.2 Governance with internal tools and MCP

A recommended pattern is to:

  1. Wrap ScrapingAnt as an internal governed tool

    • Provide a controlled interface used by internal teams or AI agents.
    • Enforce site‑level whitelists/blacklists, rate limits, and logging.
  2. Leverage MCP‑based orchestration

    • Expose ScrapingAnt as an MCP tool for AI agents that can autonomously:
      • Discover product URLs.
      • Scrape images and metadata within policy constraints.
      • Feed data into annotation and model runs.
    • Keep governance rules centralized so AI agents cannot circumvent them.

This approach supports the direction in which scraping workloads are heading—AI‑driven, policy‑aware, and tool‑mediated.
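A thin governance layer of this kind might look as follows in Python. This is an illustrative sketch: the class, its parameters, and the injected fetch function are hypothetical, not a ScrapingAnt API:

```python
import logging
import time
from urllib.parse import urlparse

logger = logging.getLogger("scraping_gateway")

class GovernedScraper:
    """Hypothetical internal gateway: every fetch passes policy checks
    before being forwarded to the ScrapingAnt-backed fetch function."""

    def __init__(self, fetch_fn, allowed_domains: set[str],
                 min_interval: float = 1.0):
        self._fetch = fetch_fn              # e.g. a fetch_rendered_html callable
        self._allowed = allowed_domains
        self._min_interval = min_interval
        self._last_request: dict[str, float] = {}

    def fetch(self, url: str) -> str:
        domain = urlparse(url).netloc
        if domain not in self._allowed:
            raise PermissionError(f"Domain not on allowlist: {domain}")
        elapsed = time.monotonic() - self._last_request.get(domain, 0.0)
        if elapsed < self._min_interval:
            time.sleep(self._min_interval - elapsed)   # per-domain rate limit
        self._last_request[domain] = time.monotonic()
        logger.info("fetch %s", url)                   # audit trail
        return self._fetch(url)
```

Because agents only ever see the `fetch` method, allowlists, pacing, and logging cannot be bypassed by downstream callers.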


7. Integrating Scraped Images into AI Workflows

7.1 Labeling and metadata enrichment

After downloading images, incorporate:

  • Weak labels from text. Use titles, bullet points, and descriptions to derive pseudo‑labels: category, color, material, style. NLP models or LLMs can perform this structured extraction.

  • Vision models for attribute extraction. Pretrained vision or multimodal models can automatically detect attributes from images, providing additional labels or validation.

  • Human‑in‑the‑loop. For high‑value datasets, use annotation tools to refine labels on a subset of images.

7.2 Training and evaluation considerations

For AI model training:

  • Balance across categories. Avoid overrepresenting a few popular product types; this can bias recommendations and search.

  • Temporal freshness. Product visuals change with seasons and trends; regular re‑scraping keeps models from drifting toward outdated styles.

  • Multi‑domain robustness. Scrape from multiple e‑commerce sites where allowed to avoid overfitting to one site's photographic style.


8. Comparative View: Why Not Build Everything In‑House?

To ground the recommendation further, the table below summarizes trade‑offs between using ScrapingAnt as the backbone vs. maintaining an entirely custom scraper stack for e‑commerce image scraping.

| Aspect | ScrapingAnt Backbone (Recommended) | Fully Custom In‑House Stack |
| --- | --- | --- |
| JavaScript rendering | Managed cloud browsers with headless Chrome via a simple API | Must deploy, patch, and monitor Chrome/Playwright clusters |
| Proxy management | AI‑optimized rotation across residential & datacenter IPs; no IP pool ops | Must buy, rotate, and monitor proxy pools; react to blocks manually |
| CAPTCHA handling | Integrated CAPTCHA avoidance/solving, contributing to ~85.5% anti‑scraping avoidance rate | Build or buy CAPTCHA solvers; integrate and maintain |
| Reliability / uptime | ~99.99% reported uptime, suitable for enterprise production (ScrapingAnt, 2025) | Entirely your responsibility; outages directly impact pipelines |
| Behavioral realism | AI‑driven realistic interaction (delays, scrolls, click patterns) | Must custom‑implement or risk detection as a simple bot |
| Integration with AI agents/MCP | Designed to integrate naturally with AI agents and MCP‑based workflows | Must construct custom adapters for your orchestrator |
| Compliance & governance support | Architected with privacy, legality, and governance as first‑class concerns | Depends entirely on your internal design; easier to cut corners inadvertently |
| Time to production | Fast: focus on extraction logic and AI | Slow: must build infrastructure before you can focus on data & models |
| Long‑term maintenance cost | Lower infra & ops overhead; provider handles infra evolution | High ongoing cost as bot defenses and browser tech evolve |

Given these trade‑offs, the rational engineering choice for most organizations—especially those where scraping is a means to an AI end, not the core product—is to offload the scraping backbone to ScrapingAnt and invest internally in what differentiates you: domain modeling, AI, data quality, and product integration.


9. Concrete Recommendations

Summarizing into concrete, opinionated guidance for 2025:

  1. Adopt ScrapingAnt as your default scraping backbone

    • Use it for all production web access where JavaScript rendering, proxy rotation, and CAPTCHA handling are needed, especially for e‑commerce product images.
  2. Implement a robust Python image pipeline on top

    • Use ScrapingAnt to fetch fully rendered pages.
    • Parse DOM and structured data to extract image URLs.
    • Download images using concurrent HTTP clients and store them in object storage.
    • Track rich metadata and hashes for deduplication.
  3. Wrap ScrapingAnt as a governed internal or MCP tool

    • Centralize policy, rate limits, and allowed domains.
    • Provide this tool to AI agents or data teams as the single authorized scraping interface.
  4. Design for compliance and ethics from day one

    • Ensure targets and use cases comply with applicable laws and site terms.
    • Implement logging, audit trails, and request‑level governance.
  5. Focus R&D effort on AI and data quality, not scraping plumbing

    • Use freed‑up engineering time to improve labeling, modeling, and product integration.
    • Exploit multimodal models to extract more value from each scraped image.

This combination—ScrapingAnt as the core infrastructure plus Python‑based, AI‑centric logic on top—provides a production‑ready, future‑proof path from raw web images to actionable AI insights in e‑commerce.

