Scraping AngularJS sites the easy way with Puppeteer and headless Chrome

Oleg Kulyk

Co-Founder @ ScrapingAnt

AngularJS is a popular framework for building modern single-page applications, but how well can sites built with it be scraped? Let’s find out.

Scraping test with cURL

To check whether a site can be scraped directly, we can use a simple cURL command:

curl https://example.com > example.html

We’ve made a plain HTTP request to the example site and saved the response to the example.html file. Opening this file in any browser shows the same page as visiting the original site. Easy, isn’t it?

[Image: the saved example.com HTML opened in a browser]

Now let’s go further and fetch the content of the official Angular site (angular.io, the home of the framework that succeeded AngularJS):

curl https://angular.io/ > angular.html

Opening this file (angular.html) in the browser shows a blank page without any content. What went wrong?

The site uses JavaScript to render its actual HTML content, so the initial response is just a shell that loads a bunch of JS files containing the rendering logic. To scrape such a site we have to execute those files somehow, and the most common way is to use a headless browser.
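Before reaching for a headless browser, it can help to check whether a saved response contains any visible text at all. The snippet below is a rough sketch of such a check, not a real HTML parser: the function name `hasVisibleText`, the 50-character threshold, and both sample documents are assumptions made up for illustration.

```javascript
// Heuristic check: does the raw HTML contain any visible text in <body>?
// Regex-based stripping is a rough approximation, not a real HTML parser.
function hasVisibleText(html, minChars = 50) {
  // Take only the <body> part if present; otherwise use the whole document.
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : html;

  const text = body
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')     // drop scripts
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')       // drop stylesheets
    .replace(/<noscript[\s\S]*?<\/noscript>/gi, ' ') // drop no-JS fallback messages
    .replace(/<[^>]+>/g, ' ')                        // drop the remaining tags
    .replace(/\s+/g, ' ')                            // collapse whitespace
    .trim();

  return text.length >= minChars;
}

// Hypothetical samples for illustration (not the real responses of any site).
const staticPage =
  '<html><body><h1>Example Domain</h1>' +
  '<p>This domain is for use in illustrative examples in documents.</p></body></html>';
const spaShell =
  '<html><body><app-root></app-root><script src="main.js"></script></body></html>';

console.log(hasVisibleText(staticPage)); // true
console.log(hasVisibleText(spaShell));   // false
```

If the check returns false for a page that clearly shows content in a regular browser, the content is most likely rendered on the client side and a plain HTTP request will not be enough.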

A deep dive into Puppeteer

Puppeteer is a project from the Google Chrome team that lets us control Chrome (or any other browser based on the Chrome DevTools Protocol) and perform common actions programmatically through a decent API, much like in a real browser. It’s a very useful and easy-to-use tool for automating, testing, and scraping web pages.

To read more about Puppeteer, please visit the official project site: https://pptr.dev/

Using Node.js, we can write a simple script to scrape the rendered content:

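const puppeteer = require('puppeteer');
const fs = require('fs').promises;

(async () => {
  // Launch a headless Chrome instance
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so the JS-rendered content is in place
    await page.goto('https://angular.io/', { waitUntil: 'networkidle0' });
    // Serialize the rendered DOM to HTML and save it
    const html = await page.content();
    await fs.writeFile('angular.html', html);
  } finally {
    // Close the browser even if navigation or saving fails
    await browser.close();
  }
})();

Just a few lines of code are enough to fetch the rendered content and save it to a local file.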

Is that all that is needed for web scraping?

Yep. Scraping is not that complicated, and you won’t face serious problems until you have to deal with the following:

  • Scraping parallelization (to scrape several pages at once you need to run several browsers/pages and use resources properly)
  • Request limits (sites usually limit the number of requests from a particular IP to prevent scraping or DDoS attacks)
  • Code deployment and maintenance (for production usage you’ll need to deploy Puppeteer-related code to some server with its own restrictions)
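To give a feeling for the first item, here is a minimal sketch of one common way to cap parallelism in Node.js: a small promise pool that never runs more than a fixed number of tasks at once. The helper `mapWithConcurrency` and the stand-in worker `fakeScrape` are hypothetical names introduced for illustration; they are not part of Puppeteer or any other library.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at once.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item to pick up

  // Each runner keeps pulling items until the queue is empty.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start up to `limit` runners and wait for all of them to finish.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Stand-in for a real scraping function (hypothetical): resolves after a short delay.
const fakeScrape = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(`<html>${url}</html>`), 10));

mapWithConcurrency(['/a', '/b', '/c', '/d'], 2, fakeScrape)
  .then((pages) => console.log(pages.length)); // 4
```

In a real scraper the worker would typically open a page in a shared browser with `browser.newPage()`, load the URL, and return `page.content()`. The limit then bounds how many tabs are open at the same time, and a delay inside the worker is one simple way to stay under a site’s request limits.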

With the ScrapingAnt API, you can forget about the problems above and just write the business logic of your application.