AngularJS is a popular framework for building modern Single Page Applications, but can sites built with it be scraped? Let’s find out.
To check whether we can scrape a site directly, we can use a simple cURL command:
So, we’ve made a simple HTTP request to the example site and saved the result to the example.html file. Opening this file in any browser shows the same content as visiting the original site directly. Easy, isn’t it?
So, we can go further and get the site content of the official AngularJS site:
After opening this file (angular.html) in the browser, we will see a blank page without any content. What went wrong? AngularJS builds the page in the browser by executing JavaScript, while cURL only downloads the initial HTML shell and never runs any scripts, so the rendered content is missing. To get it, we need a tool that actually executes the page, which is where Puppeteer comes in.
Puppeteer is a project from the Google Chrome team that lets us control Chrome (or any other browser based on the Chrome DevTools Protocol) and perform common actions programmatically, much like in a real browser, through a decent API. It’s a super useful and easy tool for automating, testing, and scraping web pages.
To read more about Puppeteer please visit the official project site: https://pptr.dev/
Using NodeJS, we can write a simple script to scrape the rendered content:
Only about 10 lines of code allow us to fetch the rendered site content and save it to a local file.
Yep. Scraping is not that complicated a process, and you won’t face any problems until you have to deal with the following:
- Scraping parallelization (for scraping several pages at once you need to run several browsers/pages and utilize resources properly)
- Request limits (sites usually limit the number of requests from a particular IP to prevent scraping or DDoS attacks)
- Code deploy and maintenance (for production usage you’ll need to deploy Puppeteer-related code to some server with its own restrictions)
By using ScrapingAnt API you can forget about all the problems above and just write the business logic for your application.