While communicating with users of our web scraping API, we've found that many of them extract the whole web page's text for further data manipulation.
This approach is interesting, as it simplifies data extraction: you can just pick the particular line you need from the text, or apply a RegExp to it.
Let's walk through several methods of extracting a whole page's text with NodeJS and Puppeteer.
The most common way
The first method is the most common one and can be found in most Stack Overflow answers. We've also used this exact mechanism in our web scraping API for a while.
It is based on extraction via the
innerText property. More info about the property can be found here.
To demonstrate the extraction, we'll use the example.com website with the following HTML content:
Let's open this website using Puppeteer and extract the text:
This code snippet will log the extracted text data to the console output:
As we can observe, we've received the same result as if we had selected and copied the text manually.
Still, these actions are not identical to a copy-paste from the page, and that leads to a problem: some sites (like Bing) use display optimizations that prevent us from extracting the whole text.
You can try it with the presented code snippet and the
count parameter of the Bing request set to
30, or just with the following URL:
We've also found that a standard manual copy-paste works well on such pages.
So let's do the copy-paste programmatically!
This method doesn't actually perform a copy-paste. Instead, it performs a
Ctrl+A (select all) action and returns the text content of the selection.
To make it work, we also execute the custom JS inside the webpage context:
The output is almost the same:
The output differs slightly from the
innerText variant, but it mimics the select-copy-paste action from a browser window. It also allows extracting data from sites that use display optimizations.
Some websites don't necessarily need to be scraped with Puppeteer (or any other browser-based tool like Playwright) — they can simply be retrieved with an HTTP client.
To extract text content from such a website, we can use a third-party library based on HTML parsing.
htmlparser2 - a powerful Swiss Army knife for HTML manipulation.
To install it via npm, just execute the following command:
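A plausible install command for the two libraries used below (axios is installed alongside, as the final scraper depends on it):

```shell
npm install htmlparser2 axios
```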
The final text scraper will use
axios to get the HTML content and the mentioned text extraction library to get the text from the HTML:
The output will differ from the first two variants, as our code does not operate in a browser (view) context. The library also allows us to get more information from the website: link URLs, image URLs, etc.
Read more about
This article covered one of the primary web scraping techniques - text extraction. Working with text is essential in data mining, as most information on the web is transferred as text characters.
As usual, we recommend you extend your web scraping knowledge using our articles:
Happy Web Scraping, and don't forget to save your temporary web scraping progress to avoid data loss 💾