AngularJS site scraping. Easy deal?

AngularJS is a common framework for building modern single-page applications, but how easy is it to scrape sites built with it? Let’s find out.

CURL test

To check whether a site can be scraped directly, we can use a simple curl command:


curl https://example.com > example.html

Here we made a plain HTTP request to example.com and saved the response to the file example.html. Opening this file in a browser shows the same page as opening the original site. Looks easy, doesn’t it?

example.com saved HTML in browser

So let’s go further and fetch the content of the official Angular site (angular.io is the home of Angular, the successor to AngularJS):


curl https://angular.io/ > angular.html

After opening this file (angular.html) in the browser, we see a blank page without any content. What went wrong?

This site uses JavaScript to render its HTML on the client: the initial response is little more than a set of JS files containing the rendering logic. To scrape such a site we have to execute those files somehow, and the most common way to do that is with a headless browser.
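The blank-page effect described above can also be detected programmatically. Below is a rough heuristic sketch, not a definitive test: it strips scripts, styles, and tags from a raw HTML response and measures how much visible text remains. The function names and the 200-character threshold are assumptions made up for this illustration, and a regex-based stripper is no substitute for a real HTML parser.

```javascript
// Heuristic sketch: estimate how much human-visible text a raw HTML string contains.
// Regex stripping is approximate; it is only meant for a quick sanity check.
function visibleTextLength(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop script elements and their code
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop style elements and their CSS
    .replace(/<!--[\s\S]*?-->/g, ' ')            // drop HTML comments
    .replace(/<[^>]+>/g, ' ')                    // drop all remaining tags
    .replace(/\s+/g, ' ')                        // collapse whitespace
    .trim()
    .length;
}

// Hypothetical threshold (an assumption, tune it per site).
const MIN_VISIBLE_CHARS = 200;

// Returns true when the response looks like an empty shell that still needs JS rendering.
function probablyNeedsJsRendering(html, minChars = MIN_VISIBLE_CHARS) {
  return visibleTextLength(html) < minChars;
}
```

Feeding it the contents of example.html and angular.html from the curl tests above should show the difference: a server-rendered page keeps most of its text, while a client-rendered shell is left with almost nothing.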

A deep dive into Puppeteer

Puppeteer is a project from the Google Chrome team that lets us control Chrome (or any other browser based on the Chrome DevTools Protocol) and perform common actions programmatically through a decent API, much like in a real browser. It’s a useful and easy-to-use tool for automating, testing, and scraping web pages.

To read more about Puppeteer please visit the official project site: https://pptr.dev/

Using Node.js, we can write a simple script to scrape the rendered content:


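const puppeteer = require('puppeteer');
const fs = require('fs').promises;

(async () => {
    // Start a headless Chrome instance and open a new tab
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering has a chance to finish
    await page.goto('https://angular.io/', { waitUntil: 'networkidle0' });
    // Serialize the rendered DOM (not the initial response) and save it
    const html = await page.content();
    await fs.writeFile('angular.html', html);
    await browser.close();
})();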

Only 10 lines of code allow us to fetch the site content and save it to a local file.
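Saving the whole page is often only the first step; usually a specific value is needed. The following is a minimal sketch of reading one element from the rendered page, assuming a Puppeteer `Page` object is passed in. The helper name `scrapeText`, the 10-second timeout, and the `h1` selector are placeholders chosen for illustration, not something the site guarantees.

```javascript
// Sketch: wait for client-side rendering, then read the text of one element.
// `page` is expected to be a Puppeteer Page (or any object with the same three methods).
async function scrapeText(page, url, selector, timeoutMs = 10000) {
  // 'networkidle0' resolves once there have been no network connections for a short while
  await page.goto(url, { waitUntil: 'networkidle0' });
  // Fail with a timeout error instead of hanging if the element never appears
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // The callback runs inside the browser against the first matching element
  return page.$eval(selector, (el) => el.textContent.trim());
}

// Example wiring (not invoked automatically). The 'h1' selector is a placeholder.
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    console.log(await scrapeText(page, 'https://angular.io/', 'h1'));
  } finally {
    // Always release the browser, even if scraping throws
    await browser.close();
  }
}
```

Passing the page in as a parameter keeps the helper independent of how the browser is launched, which also makes it easy to reuse one browser for many URLs.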

Is that all that’s needed?

Yep. Scraping is not that complicated a process, and you won’t face any problems until you have to deal with the following:

  • Scraping parallelization (to scrape several pages at once, you need to run several browsers or pages and manage resources properly)
  • Request limits (sites usually limit the number of requests from a particular IP to prevent scraping or DDoS attacks)
  • Code deployment and maintenance (for production use, you’ll need to deploy Puppeteer-related code to a server with its own restrictions)
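To give a feeling for the first two points, here is a small sketch of a concurrency-limited task runner with an optional pause between tasks. The helper name `mapWithConcurrency` and its `limit` and `delayMs` options are hypothetical, invented for this example. A fixed pause is only a crude form of throttling and does not guarantee that a site will not block your IP.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `worker` over `items` with at most `limit` tasks in flight at once.
// Each runner optionally pauses `delayMs` after a task to stay under request limits.
async function mapWithConcurrency(items, worker, { limit = 2, delayMs = 0 } = {}) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all runners

  async function runner() {
    while (next < items.length) {
      const index = next++; // claim an item synchronously before awaiting
      results[index] = await worker(items[index], index);
      if (delayMs > 0) await sleep(delayMs);
    }
  }

  // Start no more runners than there are items, and at least one
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results; // same order as `items`
}
```

With Puppeteer, the worker would typically open a new page in a shared browser, scrape one URL, and close the page again, so `limit` caps the number of open tabs and with it the memory and CPU usage.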

With the ScrapingAnt API, you can forget about all the problems above and just write the business logic for your application.
