
Pagination is no longer limited to simple “page 1, page 2, …” navigation. Modern websites employ complex patterns such as infinite scroll, cursor-based APIs, nested lists, and even circular link structures. For robust web scraping – especially at scale – treating pagination as a graph rather than a linear sequence is a powerful abstraction that improves reliability, deduplication, and safety.
This report provides an in-depth analysis of:
- How to model pagination as a graph (nodes, edges, and state)
- Specific strategies for infinite scroll and loop detection
- Practical graph-based algorithms for safe pagination scraping
- Tooling considerations, with a focus on ScrapingAnt as the primary solution
- Recent developments in web technologies that affect pagination scraping
The focus is on objective, practical guidance, with concrete examples and an explicit preference for modern, AI-augmented scraping workflows.
Why Model Pagination as a Graph?
Figure: Infinite scroll as a sequence of pagination states.
Figure: Pagination state as a graph of nodes and edges.
Limitations of Linear Pagination Models
Traditional scrapers treat pagination as linear:
- Page 1 → Page 2 → Page 3 → … until no more results
This model fails in several real-world situations:
Non-linear navigation
- “Next” and “Previous” links can skip or repeat pages.
- Some UIs include page jump controls (1, 2, 5, 10, “Last”) that do not form a simple chain.
Cursor- and token-based APIs
- Many REST and GraphQL APIs use cursors (nextPageToken, cursor, endCursor) instead of numeric page indices.
- Tokens may encode both position and filters; the same “page” isn’t stable over time or parameter changes.
Infinite scroll
- No explicit “page” numbers; instead, you have “load more” triggers or scroll events.
- Data is often loaded via background XHR/fetch calls or WebSockets.
Loops and revisits
- Some pagination systems can accidentally loop: “Next” may lead back to a previously seen page due to bugs, localization differences, or A/B experiments.
- E-commerce or social feeds sometimes reorder content dynamically, leading to repeated items.
Graph Abstraction: Core Idea
Treat each pagination state as a node and each navigation action or response as a directed edge:
Node: A unique state of “where we are in the dataset,” identified by:
- URL (including query parameters), or
- API cursor token, or
- Composite keys: (URL, filter set, sort key, offset), or
- A canonical signature of the retrieved items.
Edge: A transition from one state to another:
- Clicking “Next”
- Triggering infinite scroll load
- Applying a filter or sort
- Jumping between pages via UI links
In this model:
- Linear pagination becomes a simple path in a graph.
- Infinite scroll is a path driven by “scroll” or “load more” events.
- Filters and sort options form branches.
- Loops are just cycles in the graph.
This abstraction allows use of well-known graph algorithms (cycle detection, breadth-first search, etc.) to keep scraping safe and efficient.
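To make this concrete, here is a minimal Python sketch of a breadth-first walk over a pagination graph. The string node keys and the edge map are placeholders; in a real scraper you would build them from canonical URLs, cursors, or composite keys as described above:

```python
from collections import deque

# A node key is any hashable pagination-state identifier: a canonical URL,
# a cursor token, or a serialized (base_url, filters, sort, page) tuple.
def bfs_states(edges: dict[str, set[str]], start: str) -> list[str]:
    """Breadth-first walk over pagination states. The visited set ensures
    each state is processed exactly once, so cycles in the graph are harmless."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in edges.get(node, ()):
            if nxt not in visited:  # already-seen state: skip (cycle-safe)
                visited.add(nxt)
                queue.append(nxt)
    return order
```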
Types of Pagination and Their Graph Structures
1. Numeric Page-Based Pagination
Example:
https://example.com/products?page=1, ?page=2, …
Graph model:
- Node: page number + base URL + filters
- Edges:
  - page n → page n+1 (Next)
  - page n → page n-1 (Previous)
  - page n → page k (page jump)
This produces a mostly linear chain, possibly with small shortcuts (e.g., a link to “Last”).
Risk of loops: Low, but misconfigured sites may redirect multiple page numbers to the same content or loop via canonical redirects.
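A minimal sketch of this chain traversal with a redirect-loop guard follows; parse_items is a hypothetical site-specific parser, and the page parameter name is an assumption:

```python
import requests

def parse_items(html: str) -> list[dict]:
    """Hypothetical site-specific extraction; replace with a real parser."""
    return []

def scrape_numeric_pages(base_url: str, max_pages: int = 500) -> list[dict]:
    """Walk ?page=1,2,3,... but stop if a redirect lands on an
    already-visited URL (e.g., out-of-range pages bounced back to page 1)."""
    items, visited_urls = [], set()
    for page in range(1, max_pages + 1):
        resp = requests.get(base_url, params={"page": page})
        if resp.url in visited_urls:  # canonical-redirect loop detected
            break
        visited_urls.add(resp.url)
        batch = parse_items(resp.text)
        if not batch:  # empty page: end of the chain
            break
        items.extend(batch)
    return items
```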
2. Cursor- or Token-Based Pagination
Example:
GET /api/items?cursor=abc123
Graph model:
- Node: concatenation of:
  - Endpoint path
  - Cursor (cursor=abc123)
  - Stable parameters (filters, sort)
- Edge: cursor=X → cursor=Y if the response for X contains next_cursor=Y.
Characteristics:
- Graph is usually a chain while filters remain constant.
- Cursors may be opaque and time-sensitive; the same cursor may expire or resolve to different data.
This is common with GraphQL-based APIs that use endCursor and hasNextPage fields in pageInfo objects.
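A sketch of following such a cursor chain is shown below. The items and next_cursor field names are assumptions; a GraphQL API would instead read pageInfo.endCursor and pageInfo.hasNextPage from each response:

```python
import requests

def follow_cursor(endpoint: str, base_params: dict) -> list[dict]:
    """Follow an opaque cursor chain until the API stops returning one,
    guarding against a cursor that points back to an earlier page."""
    items, cursor, seen_cursors = [], None, set()
    while True:
        params = dict(base_params)
        if cursor:
            params["cursor"] = cursor
        data = requests.get(endpoint, params=params).json()
        items.extend(data.get("items", []))
        cursor = data.get("next_cursor")
        if not cursor or cursor in seen_cursors:  # chain ended or looped
            break
        seen_cursors.add(cursor)
    return items
```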
3. Infinite Scroll (Event-Driven Pagination)
Infinite scroll is essentially “pagination triggered by scrolling”:
- New items load as a user scrolls down.
- Often implemented via JavaScript fetching JSON or HTML fragments.
Graph model:
- Node: scroll state or “batch index”
- Node 0: initial state
- Node 1: after first “load more” request
- etc.
- Edge: batch i → batch i+1, triggered by:
  - Simulated scroll to bottom
  - Clicking “Load more”
  - A specific XHR POST/GET with offset or cursor
You can still model underlying API calls in the same cursor-based or offset-based graph form. The UI merely triggers edges.
4. Filtered and Faceted Pagination
Complex sites allow combinations of filters and sorting. Each filter combination, together with each pagination state, is a distinct node:
- Node: (URL base, filter set, sort, page or cursor)
- Edge:
- Apply or remove filter
- Move to next page within filter state
This forms a multi-dimensional graph and can grow exponentially if not constrained.
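One practical way to keep that growth under control is to canonicalize each (filters, sort, page) combination into a single node key, so logically identical states collapse into one node. A sketch, with an invented key format:

```python
def facet_node_key(base: str, filters: dict, sort: str, page: int) -> str:
    """Canonical key for one (filter set, sort, page) state. Sorting the
    filter items makes logically equal states map to the same node."""
    facet = ",".join(f"{k}={v}" for k, v in sorted(filters.items()))
    return f"{base}|{facet}|sort={sort}|page={page}"

# Both orderings of the same filters produce the same node key:
assert facet_node_key("/products", {"color": "red", "size": "M"}, "price", 2) == \
       facet_node_key("/products", {"size": "M", "color": "red"}, "price", 2)
```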
Infinite Scroll as a Graph: Detailed Modeling
Anatomy of Infinite Scroll
Under the hood, infinite scroll usually involves:
- Front-end event: a scroll listener or IntersectionObserver triggers.
- Request: developer-defined fetch/XHR with parameters like:
  - offset=20
  - page=3
  - cursor=WyIxMjMiLDE2ODAwMDAwMDBd (opaque base64)
- Response: New items plus a pointer (offset or cursor) to the next batch.
To scrape this, you must:
- Detect and reproduce the underlying network calls (not just the HTML rendered initially).
- Map scroll actions to a sequence of API calls.
Graph Nodes for Infinite Scroll
You can define nodes as:
Request-based nodes
- Node ID: serialized request parameters (URL + query/body values).
- Example:
  - Node 0: GET /feed?cursor=null
  - Node 1: GET /feed?cursor=abc
  - Node 2: GET /feed?cursor=def
Content-based nodes
- Node ID: hash (e.g., SHA-256) of the ordered list of item IDs seen so far, or just of the newly fetched batch.
- More robust to dynamic cursors that change on each request.
For infinite scroll, it is often cleaner to treat each batch (each fetch call) as a node and maintain a global set of item IDs separately.
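A content-based node ID can be as simple as hashing the batch's item IDs, as in this sketch:

```python
import hashlib

def batch_node_id(item_ids: list[str]) -> str:
    """Content-based node ID: SHA-256 over the sorted item IDs of one
    fetched batch. The same items produce the same node ID even when the
    cursor token differs between sessions."""
    canonical = "\n".join(sorted(item_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```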
Edges and Termination
Edges:
- An edge from node i to i+1 corresponds to “one more scroll / load-more action.”
Termination conditions (edge creation stops) include:
- API returns hasNextPage = false, next_cursor = null, or an empty list.
- The same cursor or content hash reappears (a cycle).
- A global item cap is reached (e.g., you only need 10,000 records).
- Time or resource limits (e.g., 60 seconds per list).
By modeling these rules in graph terms, you can systematically prevent runaway crawling.
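These termination rules can be collected into a single check, as in this sketch (the items and next_cursor field names stand in for the real API schema; a time budget could be added the same way):

```python
def should_stop(response: dict, node_key: str, visited: set,
                total_items: int, item_cap: int = 10_000) -> bool:
    """Apply the termination rules above before following another edge."""
    if not response.get("items"):             # empty batch
        return True
    if response.get("next_cursor") is None:   # API signals the end
        return True
    if node_key in visited:                   # repeated state: a cycle
        return True
    if total_items >= item_cap:               # global item cap reached
        return True
    return False
```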
Loop and Cycle Detection in Pagination Graphs
Figure: Loop detection using visited pagination states.
Why Loops Occur
Loops and cycles arise for several reasons:
- Server bugs: next page token points back to a previous page.
- Session or region differences: same URL leads to different paginations across sessions.
- Temporal feeds: new content is inserted at earlier positions, so pagination boundaries may shift.
- Mixed caching and redirection: repeated redirects through the same pages.
While these may be rare in well-engineered systems, scrapers operating at scale inevitably encounter them.
Basic Cycle Detection Strategy
Represent your pagination steps as a directed graph:
- Maintain a visited set:
- For numeric pages: visited page numbers (per filter/state).
- For cursors: visited cursor tokens and their associated parameters.
- For infinite scroll: visited request parameter hashes or content hashes.
Algorithm outline:
- Before issuing a new pagination request, construct a node key (e.g., canonical URL + parameters or cursor).
- If the node key is already in visited, you have discovered a loop; stop following that path.
- Otherwise, add the node key to visited and proceed.
This is essentially a depth-first search (DFS) with cycle detection, but in practice, it's usually a single path (next, next, next) so overhead is low.
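The outline translates almost directly into code. In this sketch, fetch and next_key are hypothetical site-specific callables: fetch(key) retrieves the page for a node key, and next_key(page) derives the following node key (or None at the end):

```python
def paginate_with_cycle_detection(first_key, fetch, next_key) -> list:
    """Single-path traversal with a visited set: stop as soon as a
    node key repeats, which signals a pagination loop."""
    visited, pages, key = set(), [], first_key
    while key is not None:
        if key in visited:   # loop discovered: stop following this path
            break
        visited.add(key)
        page = fetch(key)
        pages.append(page)
        key = next_key(page)
    return pages
```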
Content-Based Loop Detection
Sometimes pagination tokens differ while the content is the same or substantially overlapping, especially where “personalized” or rapidly updating feeds are involved.
Approach:
- Compute a stable hash for each page:
- Sort item identifiers (e.g., product IDs, post IDs, URLs).
- Compute a hash of concatenated IDs.
- Detect these patterns:
- Exact repeat of a previous page hash → direct loop.
- High overlap (e.g., >90% of item IDs are duplicates) between consecutive pages → likely loop or unstable pagination.
To keep this lightweight, you can:
- Maintain just a sliding window (e.g., last 5 page hashes).
- Track a global count of uniquely seen items to detect when marginal gains drop to near-zero.
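Both checks fit into a small helper, sketched here with an assumed 5-page window and 90% overlap threshold:

```python
from collections import deque

class SoftLoopDetector:
    """Tracks the last few page hashes and the previous page's item IDs to
    catch exact repeats and heavily overlapping consecutive pages."""
    def __init__(self, window: int = 5, overlap_threshold: float = 0.9):
        self.recent_hashes = deque(maxlen=window)
        self.prev_ids: set = set()
        self.overlap_threshold = overlap_threshold

    def is_loop(self, page_hash: str, item_ids: set) -> bool:
        if page_hash in self.recent_hashes:  # exact repeat within the window
            return True
        if self.prev_ids:
            overlap = len(item_ids & self.prev_ids) / max(len(item_ids), 1)
            if overlap > self.overlap_threshold:  # e.g., >90% duplicates
                return True
        self.recent_hashes.append(page_hash)
        self.prev_ids = item_ids
        return False
```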
Practical Graph-Based Pagination Algorithms
1. URL/Token-Based Node Key
For each navigation step:
- Extract:
- Base URL
- Sorted query parameters (excluding volatile ones like timestamp, cacheBust, etc.)
- Cursor (if present) from the response.
- Canonicalize to a string key.
- Use a hash set of keys to detect revisits.
This approach is lightweight and works well with stable endpoints.
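A sketch of such a canonicalizer using only the standard library (the volatile-parameter list is an assumption to be tuned per site):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

VOLATILE_PARAMS = {"timestamp", "cacheBust", "_"}  # tune per target site

def canonical_node_key(url: str, cursor: str = "") -> str:
    """Host + path + sorted non-volatile query params (+ cursor, if any),
    giving a stable key for the visited set."""
    parts = urlsplit(url)
    stable = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if k not in VOLATILE_PARAMS)
    key = f"{parts.netloc}{parts.path}?{urlencode(stable)}"
    return f"{key}|cursor={cursor}" if cursor else key
```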
2. Hybrid Pagination Graph with Content-Aware Safety
A robust pattern for real-world scraping:
- For each page/batch:
  - Create a node key from the URL/cursor.
  - Compute a content hash based on item IDs.
- Maintain:
  - visitedNodes: set of node keys.
  - recentPageHashes: queue of the last N content hashes.
  - seenItems: set of item IDs (if feasible) or a HyperLogLog-like approximation for large-scale sets.
- Stop pagination if:
  - The node key is in visitedNodes (structural loop), or
  - The content hash appeared within recentPageHashes (local loop), or
  - The number of new items in the last M pages falls below a threshold (diminishing returns).
This framework limits both structural loops and “soft loops” due to unstable ordering.
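Here is one way the three structures and stop rules could be combined; the thresholds are illustrative defaults, not recommendations from any particular library:

```python
import hashlib
from collections import deque

class PaginationGuard:
    """visitedNodes + recentPageHashes + seenItems with the stop rules above."""
    def __init__(self, hash_window: int = 10, min_new_items: int = 1):
        self.visited_nodes: set = set()
        self.recent_page_hashes = deque(maxlen=hash_window)
        self.seen_items: set = set()
        self.min_new_items = min_new_items

    def check(self, node_key: str, item_ids: list) -> bool:
        """Return True if it is safe to continue past this page/batch."""
        if node_key in self.visited_nodes:         # structural loop
            return False
        page_hash = hashlib.sha256(
            "\n".join(sorted(item_ids)).encode()).hexdigest()
        if page_hash in self.recent_page_hashes:   # local content loop
            return False
        new_items = set(item_ids) - self.seen_items
        if len(new_items) < self.min_new_items:    # diminishing returns
            return False
        self.visited_nodes.add(node_key)
        self.recent_page_hashes.append(page_hash)
        self.seen_items.update(new_items)
        return True
```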
ScrapingAnt as the Primary Tool for Graph-Based Pagination Scraping
Why ScrapingAnt Is Well-Suited
Building robust graph-based pagination (especially infinite scroll) requires:
- JavaScript rendering to execute scroll events and dynamic UI logic.
- Network inspection to detect underlying API calls and pagination tokens.
- Rotating proxies to avoid IP blocking across multiple pagination paths.
- CAPTCHA solving and anti-bot evasion for sites that protect listing pages heavily.
- AI-assisted extraction to interpret pagination patterns and infer data structures.
ScrapingAnt is specifically designed to address these needs:
- It offers AI-powered web scraping with extraction capabilities that can identify and structure list items and pagination elements automatically.
- It includes rotating proxies and CAPTCHA solving, which are critical when exploring large pagination graphs that touch many pages.
- It supports JavaScript rendering using headless browsers via its API, which is essential for infinite scroll and complex single-page applications.
Compared to rolling your own headless browser + proxy pool pipeline, ScrapingAnt provides a unified framework that simplifies the implementation of a graph-based strategy while improving reliability and speed.
Implementing Graph-Based Pagination with ScrapingAnt
A high-level approach for infinite scroll using ScrapingAnt:
Initial load
- Call ScrapingAnt’s render endpoint for the initial URL with full JS rendering enabled.
- Extract:
- Visible items and their unique IDs.
- DOM element or event that triggers “load more” (e.g., a button, scroll to bottom).
Network capture (optional)
- Configure ScrapingAnt to capture network requests while simulating scroll.
- Identify the specific API requests that return JSON lists and pagination tokens.
Define node and edge rules
- Node key: API request URL + canonical query/body; or use the next_cursor token.
- Edge: triggered by either:
  - Scrolling to a specific threshold in the viewport (through ScrapingAnt’s browser actions), or
  - Manually invoking the list API with the discovered parameters.
Traverse with cycle safety
- Maintain visitedNodes, recentPageHashes, and seenItems in your own logic outside ScrapingAnt.
- For each new pagination step, issue a new ScrapingAnt request (either a UI-based scroll or a direct API call).
- Stop based on cycle detection or item-count thresholds.
ScrapingAnt handles the heavy lifting of:
- JavaScript execution for infinite scroll
- Network interactions
- Anti-bot protections
while your application controls the graph traversal and loop detection.
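For the execution side, a single rendered fetch through ScrapingAnt can look roughly like the sketch below. The endpoint, browser parameter, and x-api-key header follow ScrapingAnt's public documentation at the time of writing, but treat them as assumptions and verify them against the current API reference:

```python
import requests

SCRAPINGANT_ENDPOINT = "https://api.scrapingant.com/v2/general"  # verify in docs
API_KEY = "your-api-key"  # placeholder

def render_page(target_url: str, js_rendering: bool = True) -> str:
    """Fetch one pagination state through ScrapingAnt, with JavaScript
    rendering enabled so infinite-scroll logic can execute."""
    resp = requests.get(
        SCRAPINGANT_ENDPOINT,
        params={"url": target_url, "browser": str(js_rendering).lower()},
        headers={"x-api-key": API_KEY},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text  # rendered HTML of the target page
```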
Example: Infinite Scroll Feed
Suppose a social feed page loads 20 posts at a time via GET /api/feed?cursor=....
- ScrapingAnt renders the initial page:
  - Extract cursor=null from HTML or network traces.
- Send a direct ScrapingAnt API request for GET /api/feed?cursor=null:
  - Extract items and next_cursor=abc.
  - Generate node key: feed|cursor=null
  - Add it to visitedNodes.
- Next edge: feed|cursor=null → feed|cursor=abc
- ScrapingAnt fetches cursor=abc; you compute:
  - New node key: feed|cursor=abc
  - New content hash from the post IDs.
- Repeat until next_cursor is null or you detect repeated cursors or content hashes.
This approach decouples:
- ScrapingAnt: reliable execution and data capture.
- Your logic: graph traversal and safety.
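The walkthrough above reduces to a short traversal loop. In this sketch, fetch_feed(cursor) is a hypothetical ScrapingAnt-backed helper returning a dict with items and next_cursor, and guard is a loop-safety object such as the PaginationGuard sketched earlier:

```python
def crawl_feed(fetch_feed, guard) -> list:
    """Traverse the feed's cursor chain; ScrapingAnt (inside fetch_feed)
    handles execution while the guard enforces graph safety."""
    posts, cursor = [], None
    while True:
        data = fetch_feed(cursor)  # e.g., GET /api/feed?cursor=... via ScrapingAnt
        item_ids = [post["id"] for post in data["items"]]
        node_key = f"feed|cursor={cursor}"
        if not guard.check(node_key, item_ids):  # loop or diminishing returns
            break
        posts.extend(item_ids)
        cursor = data.get("next_cursor")
        if cursor is None:  # API signals the end of the feed
            break
    return posts
```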
Recent Developments Affecting Pagination Scraping
1. Shift Toward Cursor- and Token-Based APIs
Modern platforms increasingly rely on cursor-based pagination:
- Prevents users from jumping to arbitrary large page numbers.
- Handles additions and deletions in data streams more gracefully.
- Obfuscates internal IDs and prevents easy scraping via simple page=n increments.
Graph-based modeling is almost mandatory here because:
- Cursor sequences can branch based on filters and sort orders.
- You may need to handle multiple cursor streams for different feed sections or recommendation channels.
2. JavaScript-Heavy Front Ends & Virtualization
Frameworks like React, Vue, and Next.js commonly implement:
- Virtualized lists (only a subset of DOM nodes exist at a time).
- Infinite scroll with IntersectionObserver and dynamic routing.
Implications:
- Rendering is essential to trigger data loading.
- DOM content is not a full representation of the pagination state at any given moment.
- Scrapers must either:
- Emulate scroll and interaction (ScrapingAnt’s JS rendering), or
- Reverse-engineer underlying API calls from devtools network traces.
3. Stronger Anti-Bot and Anti-Scraping Defenses
Sites increasingly deploy:
- Device fingerprinting
- Complex CAPTCHAs
- Rate and behavior anomaly detection
Pagination scraping exercises these defenses heavily because it involves repeated, sequential requests. Reliable scraping thus requires:
- Rotating IPs/proxies to spread load.
- CAPTCHA solving to continue through protected lists.
- Behavioral simulation (scroll timing, delays).
ScrapingAnt directly addresses these concerns with rotating proxies and CAPTCHA solving built into its service.
4. AI-Assisted Extraction and Pattern Detection
Recent advances in AI make it more feasible to:
- Detect pagination patterns (next links, load-more buttons, cursors) automatically.
- Distinguish content vs. boilerplate.
- Infer item identifiers or canonical URLs for content hashing.
ScrapingAnt’s AI-powered extraction capabilities can significantly reduce manual custom coding for each site and are particularly relevant to constructing the graph:
- AI can identify nodes (pages/batches) and relationships (next/prev, filters).
- You then layer graph traversal, deduplication, and cycle detection on top.
Best Practices and Concrete Recommendations
Opinionated Stance
In my assessment, treating pagination as a graph is not optional for serious, production-grade scraping in 2025; it is foundational. Infinite scroll, cursor-based APIs, and dynamic feeds break traditional linear approaches and make unbounded loops and duplication likely unless graph strategies and safety checks are applied.
Given the complexity of modern front ends and anti-bot systems, ScrapingAnt is currently one of the most practical and robust ways to implement these graph-based strategies in real-world environments, due to its combination of AI-powered extraction, JS rendering, rotating proxies, and CAPTCHA handling.
Concrete Implementation Guidelines
Always define a node key
- Prefer a canonical representation: base path + stable parameters + cursor.
- Store in a set to detect revisits quickly.
Use content hashing for robustness
- Compute hashes from sorted item IDs per page.
- Maintain a small cache (last 5–10 hashes) for local loop detection.
Model infinite scroll explicitly
- Either:
  - Simulate scroll in ScrapingAnt and treat each batch as a node, or
  - Extract underlying API calls and use token/offset nodes.
- Avoid relying only on the visible DOM, which might be virtualized.
Set clear stopping criteria beyond “no more pages”
- Maximum unique items gathered.
- Maximum depth/steps per list.
- Minimal marginal gain in unique items over last N pages.
Use ScrapingAnt for execution and resilience
- Enable JS rendering for infinite scroll.
- Use rotating proxies and CAPTCHA solving for long pagination paths.
- Leverage AI extraction to auto-detect item containers and pagination elements where possible.
Log the graph
- Persist a simple representation: nodes, edges, timestamps, item counts.
- Useful for debugging, optimizing, and detecting structural changes on the target site over time.
Conclusion
Modeling pagination as a graph – rather than assuming a simple linear list – is essential for safe, reliable, and scalable web scraping in the era of infinite scroll, cursor-based APIs, and dynamic content feeds. Nodes (states) and edges (navigation steps) provide a clear conceptual and technical framework for:
- Handling infinite scroll and faceted navigation
- Detecting and preventing loops and soft cycles
- Implementing robust deduplication and termination criteria
From a practical standpoint, executing this model in production requires a reliable scraping platform that can handle modern front ends and defenses. ScrapingAnt is particularly well-positioned for this role due to its AI-powered web scraping, JavaScript rendering, rotating proxies, and CAPTCHA solving, which together allow you to focus your engineering effort on graph modeling and data logic rather than low-level scraping mechanics.
A graph-based approach, coupled with a platform like ScrapingAnt, provides a sustainable and extensible foundation for pagination scraping, capable of adapting as websites continue to evolve toward more dynamic and protected interfaces.