Dynamic languages are helpful tools for web scraping. Scripting allows users to rapidly tie together complex systems or libraries and express ideas without dealing with memory management or build systems.
The basic idea of Deno is to provide a standalone tool for quickly scripting complex functionality. Also, it's not just a NodeJS fork - it's a wholly re-implemented runtime. It ships as a single executable file and can fetch external code on its own.
Deno brings to life several concepts that are missing in NodeJS:
- TypeScript support out of the box
- Security: no file, network, or environment access, unless explicitly enabled
- Global async-await
- Built-in utilities like a dependency inspector and a code formatter
And that's just the beginning!
To implement a basic HTTP server, you'll have to write four lines of code:
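As a minimal sketch, assuming a recent Deno release that ships the built-in Deno.serve function (older releases used a serve helper imported from the standard library), such a server could look like this. The handler name is illustrative.

```typescript
// Every request gets the same plain-text greeting.
function handler(_req: Request): Response {
  return new Response("Hello World\n");
}

// Under Deno, one more line starts the server (it listens on port 8000 by default):
//   Deno.serve(handler);
```

The handler is an ordinary function that maps a Request to a Response, so it can be tried out without opening a port.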
How cool is that! We can observe inspiration from Go and Rust. And what about the performance?
A hello-world Deno HTTP server does about 25k requests per second with a max latency of 1.3 milliseconds. A comparable Node program does 34k requests per second with a rather erratic max latency between 2 and 300 milliseconds.
These numbers were measured at the time of the Deno project release, so Deno's performance is likely to improve in the near future.
Everything looks impressive so far, so let's get started with web scraping in Deno.
A basic web scraping technique is extracting the HTML content from a given URL. It gives us the web page information for further parsing, saving, or postprocessing.
To obtain information from the web server, we have to make an HTTP call to the target server and receive a response with the HTML content of the web page.
We'll start by creating a file named scraper.ts so that we can use the full power of global async/await. Its content would be the following:
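A minimal sketch of what scraper.ts could contain. The fetchHtml helper, the FetchLike type, and the optional second parameter are our own names, added so the function can be exercised without network access; by default it uses the built-in fetch.

```typescript
// Minimal shape of the response we rely on; the built-in fetch satisfies it.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  text(): Promise<string>;
}>;

// Download a page and return its HTML as a string.
async function fetchHtml(url: string, fetchImpl: FetchLike = fetch): Promise<string> {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return await res.text();
}

// In scraper.ts, global async/await lets us call it at the top level:
//   console.log(await fetchHtml("https://example.com"));
```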
As Deno is secure by default, we have to run this application with the network access flag, --allow-net (for example, deno run --allow-net scraper.ts).
This code snippet retrieves HTML content from the example.com website and outputs it to the console.
As expected, the result is the following:
Great! Let's find out how we can extract a specific piece of information from the provided HTML.
We need to pass the retrieved HTML to a parser (using the parseFromString method of a new DOMParser instance) and then select the required HTML node with a query selector:
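A minimal sketch of that step, assuming the third-party deno_dom library loaded from deno.land/x (the import URL is an assumption and, in practice, should be pinned to a specific version). The HTML string is a stand-in for the page we fetched earlier, and the h1 selector is only an example.

```typescript
import { DOMParser } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

// Stand-in for the HTML retrieved from the target page.
const html = "<html><body><h1>Example Domain</h1><p>Some text</p></body></html>";

// Parse the markup into a document; optional chaining covers a failed parse.
const doc = new DOMParser().parseFromString(html, "text/html");

// Select the first <h1> node and read its text.
const heading = doc?.querySelector("h1")?.textContent ?? null;

console.log(heading);
```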
And the result is the following (also expected 🙂):
Awesome! We've learned about a Deno-specific library, but what about something more popular and widespread in the JS web scraping community?
Cheerio can be used with Deno too!
It parses markup and provides an API for traversing and manipulating the resulting data structure. Its API is similar to jQuery's, which many developers find a straightforward way to parse and manipulate the DOM.
Let's rewrite the previous example using Cheerio:
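A minimal sketch, assuming Cheerio is loaded through Deno's npm: specifier, which is available in newer Deno releases (other import paths exist). The HTML string again stands in for the fetched page.

```typescript
import * as cheerio from "npm:cheerio";

// Stand-in for the HTML retrieved from the target page.
const html = "<html><body><h1>Example Domain</h1><p>Some text</p></body></html>";

// Load the markup and query it with a jQuery-style selector.
const $ = cheerio.load(html);
const heading = $("h1").text();

console.log(heading);
```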
We'll receive the same result!
Let's consider a test file for scraping:
It can be found as a GitHub page: https://kami4ka.github.io/dynamic-website-example/
As we can observe, it has the text "Web Scraping is hard" inside a div, but while the page renders, the text changes to "I ❤️ ScrapingAnt" through the following JS code inside:
To check it out, just open this page in your browser.
Unfortunately, the previous scraping code won't scrape this page properly, because Cheerio parses only the HTML returned by the server and does not execute the page's JavaScript.
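A headless browser such as Puppeteer does run the page's scripts before we read the HTML. The sketch below shows the flow using Puppeteer's newPage, goto, content, and close methods. The BrowserLike and PageLike interfaces and the scrapeRendered helper are our own names, declared so the flow works with any Puppeteer-compatible browser object; how Puppeteer itself is imported under Deno is left as an assumption.

```typescript
// The small part of Puppeteer's Page and Browser API this sketch relies on.
interface PageLike {
  goto(
    url: string,
    options?: { waitUntil?: "load" | "networkidle0" | "networkidle2" },
  ): Promise<unknown>;
  content(): Promise<string>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Open the page, let its JavaScript run, and return the rendered HTML.
async function scrapeRendered(browser: BrowserLike, url: string): Promise<string> {
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so scripts can update the DOM.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    // Always release the browser, even if navigation fails.
    await browser.close();
  }
}

// With a real browser (assuming `puppeteer` has been imported):
//   const html = await scrapeRendered(
//     await puppeteer.launch(),
//     "https://kami4ka.github.io/dynamic-website-example/",
//   );
```

The returned HTML can then be passed to the same parsing code as before.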
To launch this script, we first have to install Puppeteer's headless Chrome using the following command:
In order to use Firefox:
And then you'll be able to run the entire web scraper application:
And finally, we'll get the result:
Meet the Web Scraping API!
The ScrapingAnt web scraping API can scrape dynamic websites with a single API call.
Using a web scraping API is the simplest option and requires only basic programming skills.
We'll use the Fetch API to access the web scraping API, and the rewritten code for dynamic website scraping looks like the following:
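A minimal sketch of such a call. The endpoint, the url query parameter, and the x-api-key header are assumptions that should be checked against the ScrapingAnt documentation; buildRequestUrl and scrapeWithApi are our own helper names.

```typescript
// Assumed endpoint; check the ScrapingAnt documentation for the current one.
const SCRAPINGANT_ENDPOINT = "https://api.scrapingant.com/v2/general";

// Build the request URL: the target page goes into the `url` query parameter.
function buildRequestUrl(targetUrl: string): string {
  const endpoint = new URL(SCRAPINGANT_ENDPOINT);
  endpoint.searchParams.set("url", targetUrl);
  return endpoint.toString();
}

// Ask the API to render the page and return the resulting HTML.
async function scrapeWithApi(targetUrl: string, apiKey: string): Promise<string> {
  const res = await fetch(buildRequestUrl(targetUrl), {
    headers: { "x-api-key": apiKey }, // assumed authentication header
  });
  if (!res.ok) {
    throw new Error(`API request failed with status ${res.status}`);
  }
  return await res.text();
}

// Usage with global async/await (replace the placeholder with a real key):
//   const html = await scrapeWithApi(
//     "https://kami4ka.github.io/dynamic-website-example/",
//     "<YOUR_API_KEY>",
//   );
```

Building the URL with searchParams takes care of percent-encoding the target address, which is easy to get wrong when concatenating strings by hand.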
We have sent an HTTP request to ScrapingAnt API with a
Today we've learned the basic concepts of web scraping with Deno, checked several libraries, and learned a proper way of avoiding blocks while scraping.
Should you use Deno for your hobby project or even at work? I guess you should give it a chance. Just remember that Deno has been under development for just two years, while Node has been under development for over a decade. It may not be as polished as NodeJS, but with the proper amount of interest it may evolve into something even bigger and better.
As usual, helpful links for further reading:
- Deno Official Website
- Puppeteer Documentation
- ScrapingAnt Documentation
Happy web scraping, and don't forget to update your headless browser 🌍