Web Scraping with Deno

Dynamic languages are helpful tools for web scraping. Scripting allows users to rapidly tie together complex systems or libraries and express ideas without dealing with memory management or build systems.

JavaScript is the most widely used dynamic language, running on every device with a web browser, and Node.js has proved to be a very successful JavaScript runtime. However, some of its early design decisions became hard to change once it had a large user base, so Deno was created to address them. Let's find out how to scrape static and dynamic websites with Deno.

What is Deno?

Deno is a simple, modern, and secure runtime for executing JavaScript and TypeScript outside of the web browser.

The basic idea of Deno is to provide a standalone tool for quickly scripting complex functionality. It's not just a Node.js fork; it's a wholly re-implemented runtime. It ships as a single executable file and knows how to fetch external code out of the box.

Deno brings to life several concepts that are missing in Node.js:

TypeScript support out of the box
Security: no file, network, or environment access, unless explicitly enabled
Top-level await (no async wrapper function needed)
Built-in utilities like a dependency inspector and a code formatter

And that's just the beginning!
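To make the security point from the list above more concrete: network access does not have to be all-or-nothing, because the --allow-net flag also accepts a comma-separated list of hosts. The sketch below derives such a flag from the URLs a script intends to call. The helper name allowNetFlag and the script name main.ts are assumptions made for illustration only.

```typescript
// Hypothetical helper: build a host-scoped --allow-net flag from target URLs.
function allowNetFlag(urls: string[]): string {
  // URL.host keeps a non-default port (e.g. "localhost:8000").
  const hosts = new Set(urls.map((u) => new URL(u).host));
  return `--allow-net=${Array.from(hosts).join(",")}`;
}

const flag = allowNetFlag([
  "https://example.com",
  "https://example.com/about",
  "https://api.scrapingant.com/v1/general",
]);

console.log(`deno run ${flag} main.ts`);
// prints "deno run --allow-net=example.com,api.scrapingant.com main.ts"
```

Scoping the permission this way means a script can only talk to the hosts you listed, which is a reasonable default for a scraper with a known set of targets.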

To implement a basic HTTP server, you only have to write four lines of code:

import { serve } from "https://deno.land/std@0.50.0/http/server.ts";

for await (const req of serve({ port: 8000 })) {
  req.respond({ body: "Hello World\n" });
}

How cool is that! We can observe inspiration from Go and Rust. And what about the performance?

A hello-world Deno HTTP server does about 25k requests per second with a max latency of 1.3 milliseconds. A comparable Node program does 34k requests per second with a rather erratic max latency between 2 and 300 milliseconds.

These numbers were measured around the time of the Deno release, so Deno's performance is likely to improve as the project matures.

Everything looks impressive so far, so let's get started with web scraping in Deno.

Making requests: built-in fetch module

A basic web scraping technique is extracting the HTML content from a given URL. This gives us the web page's markup for further parsing, saving, or postprocessing.

To obtain information from the web server, we have to make an HTTP call to the target server and receive the response with the needed HTML content of the web page.

Deno supports many native JavaScript web APIs, and the Fetch API is one of them. It makes request handling easy and dependency-free.

We'll start by creating a file named scraper.ts so we can use top-level async/await. Its content is the following:

const url = 'https://example.com';

try {
  const res = await fetch(url);
  const html = await res.text();

  console.log(html)
} catch(error) {
  console.log(error);
}

As Deno is secure by default, we have to run this application with the network access flag --allow-net:

deno run --allow-net scraper.ts

This code snippet retrieves the HTML content from the example.com website and prints it to the console.

As expected, the result is the following:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
         /* CSS removed by the author to save space */
    </style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
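One detail worth keeping in mind with the snippet above: fetch only rejects on network failures, so a 404 or 500 response still resolves and its error page would be treated as regular HTML. A minimal sketch of a status check follows. The names fetchHtml, Fetcher, and TextResponse are hypothetical, and the fetcher is passed in as a parameter so the helper can be exercised without network access.

```typescript
// Minimal shape of the response we rely on; the standard Response satisfies it.
interface TextResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

type Fetcher = (url: string) => Promise<TextResponse>;

// Hypothetical helper: fetch a page and fail loudly on non-2xx statuses.
async function fetchHtml(url: string, fetcher: Fetcher): Promise<string> {
  const res = await fetcher(url);
  if (!res.ok) {
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return res.text();
}
```

In a Deno script this could be called as fetchHtml('https://example.com', fetch) inside the existing try/catch block, so a failed request ends up in the catch branch instead of being parsed.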

Great! Let's find out how we can extract a piece of specific information from the provided HTML.

Deno DOM: traverse HTML with a Deno-specific module

Deno DOM is a library that lets you traverse HTML with the familiar JavaScript DOM methods.

We pass the retrieved HTML to the parseFromString method of a new DOMParser instance and then select the required HTML node using a query selector:

import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

const url = 'https://example.com';

try {
  const res = await fetch(url);
  const html = await res.text();
  const document = new DOMParser().parseFromString(html, 'text/html');

  // Optional chaining guards against a missing document or <h1> element
  const pageHeader = document?.querySelector('h1')?.textContent;

  console.log(pageHeader)
} catch(error) {
  console.log(error);
}

And the result is the following (also expected 🙂):

Example Domain
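The same approach extends from text to attributes. For instance, href values collected from anchor elements (with the standard querySelectorAll and getAttribute DOM methods, assuming the parser exposes them) are often relative and need resolving against the page URL. The sketch below shows one way to do that with the built-in URL class. The helper name resolveLinks and the sample inputs are illustrative assumptions.

```typescript
// Hypothetical helper: turn raw href values into absolute, de-duplicated URLs.
function resolveLinks(hrefs: string[], baseUrl: string): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    try {
      const absolute = new URL(href, baseUrl);
      // Keep only http(s) links; this skips mailto:, javascript:, etc.
      if (absolute.protocol === "http:" || absolute.protocol === "https:") {
        absolute.hash = ""; // ignore in-page anchors
        seen.add(absolute.href);
      }
    } catch {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return Array.from(seen);
}

const links = resolveLinks(
  ["https://www.iana.org/domains/example", "/about", "/about#team", "mailto:hi@example.com"],
  "https://example.com/",
);

console.log(links);
// prints [ "https://www.iana.org/domains/example", "https://example.com/about" ]
```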

Awesome! We've learned about a Deno-specific library, but what about something more popular and widespread in the JS web scraping community?

Cheerio: core jQuery for the server

Cheerio can be used with Deno too!

It parses markup and provides an API for traversing and manipulating the resulting data structure. Its API closely mirrors jQuery's, which many developers find one of the most straightforward ways to parse and manipulate the DOM.

Let's rewrite the previous example using Cheerio:

import { cheerio } from "https://deno.land/x/cheerio@1.0.4/mod.ts";

const url = 'https://example.com';

try {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html)  

  const pageHeader = $('h1').text();

  console.log(pageHeader)
} catch(error) {
  console.log(error);
}

We'll receive the same result!
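When a selector matches a larger element than a single heading, the string returned by .text() keeps the indentation and line breaks of the source markup. A small normalization step, sketched here with a hypothetical normalizeText helper and a made-up sample string, makes such output easier to store or compare.

```typescript
// Hypothetical helper: collapse runs of whitespace and trim the ends.
function normalizeText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

const raw = "\n    Example Domain\n    This domain is for use in\n    illustrative examples.\n";

console.log(normalizeText(raw));
// prints "Example Domain This domain is for use in illustrative examples."
```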

OK, we're good with static websites, but what if a web resource uses JavaScript to render dynamic content? How do we execute the JS code embedded in the scraped HTML?

Puppeteer: headless Chrome for Deno (port)

Puppeteer is a library that offers a simple and efficient API and enables you to control Google’s Chrome or Chromium browser. The goal of Puppeteer is to provide an easy way of controlling a headless web browser to be useful in end-to-end testing and web scraping real-world web applications. So it will be helpful for our needs with JavaScript execution abilities.

Let's consider a test file for scraping:

<html>
<head>
   <title>Dynamic Web Page Example</title>
   <script>
       window.addEventListener("DOMContentLoaded", function() {
           document.getElementById("test").innerHTML="I ❤️ ScrapingAnt"
       }, false);
   </script>
</head>
<body>
   <div id="test">Web Scraping is hard</div>
</body>
</html>

It is published as a GitHub page: https://kami4ka.github.io/dynamic-website-example/

As we can see, the div initially contains the text Web Scraping is hard, but while the page loads, the text changes to I ❤️ ScrapingAnt because of the following JS code:

<script>
    window.addEventListener("DOMContentLoaded", function() {
        document.getElementById("test").innerHTML="I ❤️ ScrapingAnt"
    }, false);
</script>

To check it out, just open this page in your browser.

Unfortunately, the previous approach won't scrape this page properly, as this Cheerio scraper:

import { cheerio } from "https://deno.land/x/cheerio@1.0.4/mod.ts";

const url = 'https://kami4ka.github.io/dynamic-website-example/';

try {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html)  

  const pageText = $('div').text();

  console.log(pageText)
} catch(error) {
    console.log(error);
}

returns

Web Scraping is hard

How do we fix it? Let's use Puppeteer to open this page in a real headless browser and execute the page's JavaScript code:

import puppeteer from "https://deno.land/x/puppeteer@9.0.0/mod.ts";
import { cheerio } from "https://deno.land/x/cheerio@1.0.4/mod.ts";

const url = 'https://kami4ka.github.io/dynamic-website-example/';

try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    const html = await page.content();

    const $ = cheerio.load(html);

    const pageText = $('div').text();

    console.log(pageText)

    // Close the browser so the script can exit
    await browser.close();
} catch(error) {
    console.log(error);
}

In this sample, we launch a browser to render the JavaScript and then use Cheerio to parse the rendered HTML.

To launch this script, we first have to install Puppeteer's headless Chrome using the following command:

PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/puppeteer@9.0.0/install.ts

In order to use Firefox:

PUPPETEER_PRODUCT=firefox deno run -A --unstable https://deno.land/x/puppeteer@9.0.0/install.ts

And then you'll be able to run the entire web scraper application:

deno run -A --unstable scraper.ts

And finally we'll get a result:

I ❤️ ScrapingAnt
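The Puppeteer example above reads page.content() right after page.goto(), which is enough for this test page because its script runs while the page loads. On pages that insert content later (for example after an XHR call), it is usually safer to wait for a specific element first. The following is a sketch under that assumption. The helper renderedHtml and the PageLike interface are hypothetical names, with PageLike listing only the Puppeteer page methods used here.

```typescript
// Minimal subset of Puppeteer's Page used here; a real Page satisfies it.
interface PageLike {
  goto(url: string): Promise<unknown>;
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
  content(): Promise<string>;
}

// Hypothetical helper: load a page and return its HTML once `selector` exists.
async function renderedHtml(
  page: PageLike,
  url: string,
  selector: string,
  timeoutMs = 10000,
): Promise<string> {
  await page.goto(url);
  // Rejects if the element does not appear within timeoutMs.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  return page.content();
}
```

With the script above, the call would look like const html = await renderedHtml(page, url, '#test'); and the returned HTML can be handed to Cheerio as before.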

We've learned about static and dynamic website scraping and gone through several popular web scraping libraries, but what if there is a simpler way to get data from a JavaScript-heavy website? And what do you do if a site blocks your scraping bot or applies rate limiting?

Meet the Web Scraping API!

Web Scraping API: language agnostic solution

ScrapingAnt web scraping API provides the capability to scrape dynamic websites with only a single API call.

It already handles headless Chrome and rotating proxies, so the response already contains the JavaScript-rendered content. ScrapingAnt's proxy pool helps prevent blocking and keeps the data extraction success rate consistently high.

Using a web scraping API is the simplest option and requires only basic programming skills.

We'll use the Fetch API to access the web scraping API, and the rewritten code for dynamic website scraping looks like the following:

import { cheerio } from "https://deno.land/x/cheerio@1.0.4/mod.ts";

// encodeURIComponent is built in and also escapes "?", "&" and "=" in the target URL
const encodedUrl = encodeURIComponent('https://kami4ka.github.io/dynamic-website-example/');

try {
  const response = await fetch('https://api.scrapingant.com/v1/general?url=' + encodedUrl, {
    method: 'GET',
    headers: {
        'x-api-key': '<YOUR_API_TOKEN>'
    },
  });

  const data = await response.json();

  const $ = cheerio.load(data.content);

  const pageText = $('div').text();

  console.log(pageText)
} catch(error) {
  console.log(error);
}

note

To get your API token, please visit the Login page and sign in to the ScrapingAnt user panel. It's free.

We've sent an HTTP request to the ScrapingAnt API with a url query parameter that specifies which URL should be scraped. The ScrapingAnt service opened this page in a browser through one of its 30,000 proxies, rendered the JavaScript, and returned the content in the response. Then we processed it with Cheerio, and the result is:

I ❤️ ScrapingAnt

Yay! 🎉
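Because the target address travels inside the url query parameter, it has to be percent-encoded, otherwise a target URL that carries its own query string would be split into separate API parameters. One dependency-free way to build the request URL is the standard URLSearchParams class, sketched below. The endpoint is the one used in the example above, while the helper name buildApiUrl is an assumption.

```typescript
// Endpoint taken from the example above; the helper name is hypothetical.
const API_ENDPOINT = "https://api.scrapingant.com/v1/general";

function buildApiUrl(targetUrl: string): string {
  // URLSearchParams percent-encodes characters such as ":", "/", "?" and "&".
  const params = new URLSearchParams({ url: targetUrl });
  return `${API_ENDPOINT}?${params.toString()}`;
}

console.log(buildApiUrl("https://kami4ka.github.io/dynamic-website-example/"));
// prints "https://api.scrapingant.com/v1/general?url=https%3A%2F%2Fkami4ka.github.io%2Fdynamic-website-example%2F"
```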

Summary

Today we've learned the basic concepts of web scraping with Deno, tried several libraries, and learned a proper way of avoiding blocks while scraping.

Should you use Deno for your hobby project or even at work? I guess you should give it a chance. Just remember that Deno has been under development for just two years, while Node has been under development for over a decade. It may not be as polished as Node.js, but it may evolve into something even bigger and better with the proper amount of interest.

Happy web scraping, and don't forget to update your headless browser 🌍