While communicating with users of our web scraping API, we've found that many of them extract the whole web page's text for further data manipulation.
This approach is interesting, as it simplifies data extraction: you can just pick the particular line you need from the text, or apply a RegExp to it.
Let's walk through several methods of extracting a whole page's text with NodeJS and Puppeteer.
The most common way
The first method is the most common one and can be found in most Stack Overflow answers. We've also used this exact mechanism in our web scraping API for a while.
It is based on extraction via the
innerText property. More info about the property can be found here.
To demonstrate the extraction, we'll use the example.com website with the following HTML content:
Let's open this website using Puppeteer and extract the text:
This code snippet will log the extracted text data to the console output:
As we can observe, we've received the same result as if we had selected and copied the text manually.
Still, these actions are not identical to a copy-paste from the page, and that leads to a problem: some sites (like Bing) use display optimizations that prevent us from extracting the whole text.
You can try it with the presented code snippet and the
count parameter of the Bing request set to
30, or just with the following URL:
We've also found that a standard manual copy-paste works well on such pages.
So let's do the copy-paste programmatically!
This method doesn't actually perform a copy-paste. Instead, it performs a
Ctrl+A (select all) action and returns the text content of the selection.
To make it work, we also execute the custom JS inside the webpage context:
The output is almost the same:
The output differs slightly from the
innerText variant, but it mimics the select-copy-paste action from a browser window. It also allows extracting data from sites that use display optimizations.
Some websites don't necessarily need to be scraped with Puppeteer (or any other browser-based tool like Playwright) — they can simply be retrieved with an HTTP client.
To extract text content from such a website, we can use a third-party library based on HTML parsing.
htmlparser2 - a powerful Swiss Army knife for HTML manipulation.
To install it via npm, just execute the following command:
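A plausible install command for the two libraries used below (axios is installed alongside, as the final scraper depends on it):

```shell
npm install htmlparser2 axios
```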
The final text scraper will use
axios to get the HTML content and the mentioned text extraction library to get the text from the HTML:
The output will differ from the first two variants, as our code does not operate in a browser (view) context. The library also allows us to get more information from the website: link URLs, image URLs, etc.
Read more about
This article covered one of the primary web scraping techniques - text extraction. Working with text is essential in data mining, as most information on the web is transferred as text characters.
As usual, we recommend you extend your web scraping knowledge using our articles:
Happy Web Scraping, and don't forget to save your temporary web scraping progress to avoid data loss 💾