Scrape a Dynamic Website with Python

The Internet grows fast, and modern websites often load content dynamically to provide the best user experience. On the other hand, this makes it harder to extract data from such pages, because scraping them requires executing the page's JavaScript in the page context. Let's review several conventional techniques for extracting data from dynamic websites using Python.

What is a dynamic website?

A dynamic website is a website that can update or load content after the initial HTML load. The browser receives basic HTML with JavaScript and then loads the content by running that JavaScript code. This approach speeds up page loading and avoids reloading the same layout every time you open a new page.

Usually, dynamic websites use AJAX to load content dynamically, or even the whole site is based on a Single-Page Application (SPA) technology.

In contrast, a static website delivers all of the requested content in the initial page load.

A great example of a static website is example.com:

Example Site

The whole content of this website is loaded as plain HTML during the initial page load.

To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It will not include any request to get information, just a render of a different HTML after the page load:

<html>
<head>
    <title>Dynamic Web Page Example</title>
    <script>
        window.addEventListener("DOMContentLoaded", function() {
            document.getElementById("test").innerHTML = "I ❤️ ScrapingAnt"
        }, false);
    </script>
</head>
<body>
    <div id="test">Web Scraping is hard</div>
</body>
</html>

All we have here is an HTML file with a single <div> in the body that contains the text "Web Scraping is hard". After the page loads, that text is replaced with the text generated by the JavaScript:

<script>
    window.addEventListener("DOMContentLoaded", function() {
        document.getElementById("test").innerHTML = "I ❤️ ScrapingAnt"
    }, false);
</script>

To prove this, let's open this page in the browser and observe a dynamically replaced text:

Dynamic web page example

Alright, so the browser displays a text, and HTML tags wrap this text.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.

Extract data from a dynamic web page

BeautifulSoup is one of the most popular Python libraries across the Internet for HTML parsing. Almost 80% of web scraping Python tutorials use this library to extract required content from the HTML.

Let's use BeautifulSoup for extracting the text inside <div> from our sample above.

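from bs4 import BeautifulSoup
import os


# Open the local test file; UTF-8 is stated explicitly because the page contains an emoji
with open(os.path.join(os.getcwd(), "test.html"), encoding="utf-8") as test_file:
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(test_file, "html.parser")
print(soup.find(id="test").get_text())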

This code snippet uses the os library to open our test HTML file (test.html) from the local directory and creates a BeautifulSoup instance stored in the soup variable. Using soup, we find the tag with the id test and extract the text from it.

In the screenshot from the first part of the article, we've seen that the content of the test page is I ❤️ ScrapingAnt, but the code snippet output is the following:

Web Scraping is hard

The result differs from what we expected (unless you've already worked out what is going on). From BeautifulSoup's perspective everything is correct: it parsed the data from the provided HTML file. But we want the same result the browser renders. The difference comes from the dynamic JavaScript, which is not executed during HTML parsing.

We need the HTML to be run in a browser to see the correct values and then be able to capture those values programmatically.
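To make the point concrete, here is a minimal, dependency-free sketch that uses the standard library's html.parser instead of BeautifulSoup (an assumption made only to keep the example self-contained). The TextById class is a hypothetical helper, not part of any library. It shows that an HTML parser sees the <script> element but treats its body as inert text, so the <div> keeps its original content:

```python
from html.parser import HTMLParser

# Same markup as the test page above, inlined so the sketch is self-contained.
TEST_HTML = """<html>
<head>
    <title>Dynamic Web Page Example</title>
    <script>
        window.addEventListener("DOMContentLoaded", function() {
            document.getElementById("test").innerHTML = "I ❤️ ScrapingAnt"
        }, false);
    </script>
</head>
<body>
    <div id="test">Web Scraping is hard</div>
</body>
</html>"""


class TextById(HTMLParser):
    """Collect the text inside the element with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0            # > 0 while we are inside the target element
        self.chunks = []          # text fragments found inside the target
        self.script_seen = False  # set when a <script> tag is encountered

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_seen = True
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Script bodies also arrive here, but only as plain text; nothing runs them.
        if self.depth:
            self.chunks.append(data)


parser = TextById("test")
parser.feed(TEST_HTML)
static_text = "".join(parser.chunks).strip()
print(static_text)  # prints: Web Scraping is hard
```

The parser did encounter the script (script_seen is True), yet the extracted text is still the static one, which is exactly what BeautifulSoup reported.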

Below you can find four different ways to execute a dynamic website's JavaScript and provide valid data for an HTML parser: Selenium, Pyppeteer, Playwright, and a Web Scraping API.

Selenium: web scraping with a webdriver

Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.

To use Selenium with Chrome/Chromium, we'll need to download a webdriver from the repository and place it into the project folder. Don't forget to install Selenium itself by executing:

pip install selenium

The Selenium setup and scraping flow is as follows:

define and setup Chrome path variable
define and setup Chrome webdriver path variable
define browser launch arguments (to use headless mode, proxy, etc.)
instantiate a webdriver with defined above options
load a webpage via instantiated webdriver

In code, it looks like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import os


# Instantiate options
opts = Options()
# opts.add_argument("--headless")  # Uncomment if the headless version is needed
opts.binary_location = "<path to Chrome executable>"

# Set the location of the webdriver
chrome_driver = os.path.join(os.getcwd(), "<Chrome webdriver filename>")

# Instantiate a webdriver (Selenium 4 takes the driver path via a Service object)
driver = webdriver.Chrome(service=Service(executable_path=chrome_driver), options=opts)

# Load the HTML page; a local file needs the file:// scheme
driver.get("file://" + os.getcwd() + "/test.html")

# Parse processed webpage with BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find(id="test").get_text())

# Close the browser and stop the webdriver process
driver.quit()

And finally, we'll receive the required result:

I ❤️ ScrapingAnt

Using Selenium to scrape dynamic websites with Python is not complicated, and it lets you choose a specific browser and version. However, it consists of several moving parts that have to be maintained, and the code itself contains boilerplate such as the browser and webdriver setup.

I like to use Selenium for my web scraping projects, but you can find easier ways to extract data from dynamic web pages below.

Pyppeteer: Python headless Chrome

Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript library for automating (headless) Chrome/Chromium. It can do mostly the same things as Puppeteer, but from Python instead of Node.js.

Puppeteer is a high-level API to control headless Chrome, so it allows you to automate actions you would otherwise do manually in the browser: copy a page's text, download images, save a page as HTML or PDF, etc.

To install Pyppeteer you can execute the following command:

pip install pyppeteer

The usage of Pyppeteer for our needs is much simpler than Selenium:

import asyncio
from bs4 import BeautifulSoup
from pyppeteer import launch
import os


async def main():
    # Launch the browser
    browser = await launch()

    # Open a new browser page
    page = await browser.newPage()

    # Create a URI for our test file
    page_path = "file://" + os.getcwd() + "/test.html"

    # Open our test file in the opened page
    await page.goto(page_path)
    page_content = await page.content()

    # Process extracted content with BeautifulSoup
    soup = BeautifulSoup(page_content, "html.parser")
    print(soup.find(id="test").get_text())

    # Close browser
    await browser.close()


asyncio.run(main())

I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.

As we can expect, the result is the following:

I ❤️ ScrapingAnt
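A side note on the page_path line: building the file:// URI by string concatenation works for simple paths, but it can break on paths with spaces or on Windows drive letters. A slightly more defensive sketch, assuming you are free to use pathlib, could look like this. The local_file_uri helper is a hypothetical name introduced only for illustration:

```python
from pathlib import Path


def local_file_uri(filename, base_dir=None):
    """Return a file:// URI for `filename` inside `base_dir` (default: current directory)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    # resolve() makes the path absolute, which as_uri() requires;
    # as_uri() also percent-encodes characters such as spaces.
    return (base / filename).resolve().as_uri()


page_path = local_file_uri("test.html")
print(page_path)
```

The resulting string can be passed to page.goto() in place of the concatenated path.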

We did it again, and this time we didn't have to worry about finding, downloading, and connecting a webdriver to a browser. However, Pyppeteer looks abandoned and is not properly maintained. This may change in the near future, but I'd suggest looking at a more powerful library.

Playwright: Chromium, Firefox and WebKit browser automation

Playwright can be considered an extended Puppeteer, as it supports more browser types (Chromium, Firefox, and WebKit) for automating modern web app testing and scraping. You can use the Playwright API from JavaScript and TypeScript, Python, C#, and Java. Best of all, the Python version is supported by the original Playwright maintainers.

The API is almost the same as Pyppeteer's, but it comes in both sync and async versions.

Installation is as simple as always:

pip install playwright
playwright install

Let's rewrite the previous example using Playwright.

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright
import os

# Use sync version of Playwright
with sync_playwright() as p:
    # Launch the browser
    browser = p.chromium.launch()

    # Open a new browser page
    page = browser.new_page()

    # Create a URI for our test file
    page_path = "file://" + os.getcwd() + "/test.html"

    # Open our test file in the opened page
    page.goto(page_path)
    page_content = page.content()

    # Process extracted content with BeautifulSoup
    soup = BeautifulSoup(page_content, "html.parser")
    print(soup.find(id="test").get_text())

    # Close browser
    browser.close()

As is tradition, we can observe our beloved output:

I ❤️ ScrapingAnt
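For comparison, the async flavour of the same flow can be sketched as follows. This is a minimal sketch, not the only way to structure it: render_html and render_html_sync are hypothetical helper names, and the Playwright import sits inside the function purely so the module can be loaded even where Playwright is not installed.

```python
import asyncio


async def render_html(url):
    """Open `url` in headless Chromium and return the rendered HTML."""
    # Imported lazily (an assumption of this sketch) so the module loads without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            # Always close the browser, even if navigation fails
            await browser.close()


def render_html_sync(url):
    """Convenience wrapper for callers that are not already inside an event loop."""
    return asyncio.run(render_html(url))
```

Calling render_html_sync with the file:// URI of test.html should return HTML that can be handed to BeautifulSoup in the same way as page_content in the sync example.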

We've gone through several different data extraction methods with Python, but is there a more straightforward way to do this job? How can we scale our solution and scrape data with several threads?

Meet the web scraping API!

Web Scraping API

The ScrapingAnt web scraping API lets you scrape dynamic websites with a single API call. It already handles headless Chrome and rotating proxies, so the response already contains the JavaScript-rendered content. ScrapingAnt's proxy pool prevents blocking and provides a consistently high data extraction success rate.

Using a web scraping API is the simplest option and requires only basic programming skills.

You do not need to maintain the browser, library, proxies, webdrivers, or any other aspect of a web scraper, so you can focus on the most exciting part of the work: data analysis.

As the web scraping API runs on cloud servers, we have to serve our file somewhere to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html

The final test URL for scraping the dynamic web data looks like this: https://kami4ka.github.io/dynamic-website-example/

The scraping code itself is the simplest of all four described tools. We'll use the ScrapingAntClient library to access the web scraping API.

Let's install it first:

pip install scrapingant-client

And use the installed library:

from bs4 import BeautifulSoup
from scrapingant_client import ScrapingAntClient

# Define URL with a dynamic web content
url = "https://kami4ka.github.io/dynamic-website-example/"

# Create a ScrapingAntClient instance
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')

# Get the HTML page rendered content
page_content = client.general_request(url).content

# Parse content with BeautifulSoup
soup = BeautifulSoup(page_content, "html.parser")
print(soup.find(id="test").get_text())

Note

To get your API token, please visit the Login page to authorize in the ScrapingAnt User panel. It's free.

And the result is still the required one.

I ❤️ ScrapingAnt

All the headless browser magic happens in the cloud, so you only need to make an API call to get the result.

Check out the documentation for more info about ScrapingAnt API.
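Coming back to the question of scraping with several threads: because each page is now a single blocking call, a thread pool from the standard library is one simple way to fan out requests. The sketch below makes a few assumptions. scrape_many is a hypothetical helper, fake_fetch is a placeholder standing in for a real network call, and the example.com URLs are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_many(urls, fetch, max_workers=4):
    """Fetch every URL in a thread pool and return a {url: result} dict.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Placeholder for a real request; returns predictable HTML for the demo
    return f"<div id='test'>rendered {url}</div>"


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
pages = scrape_many(urls, fake_fetch)
for url, html in pages.items():
    print(url, len(html))
```

With the client from the example above, fetch could be something like lambda url: client.general_request(url).content. Whether a single client instance can be shared safely between threads is not verified here, so creating one client per call may be the more cautious choice.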

Summary

Today we've checked four free tools that allow scraping dynamic websites with Python: Selenium, Pyppeteer, Playwright, and the ScrapingAnt web scraping API. All of them use a headless browser (or an API backed by a headless browser) under the hood to correctly render the JavaScript inside an HTML page. Choose the handiest one for your project.

Happy web scraping, and don't forget to use proxies to avoid blocking 🚀
