In this mini-tutorial, let's take a look at how we can build a simple web scraper with Node.js, Express, Axios and Cheerio. This is for beginners who are new to Node.js and want to learn about web scraping in general.
Let's get to it!
Please make sure you have Node.js installed on your machine.
Create a new folder, open it in your preferred terminal, and run this command:
npm init
This will create a package.json file which will contain some information about our project, including our dependencies.
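If you accept all the defaults, the generated file will look something like this (npm derives the name from your folder, so yours will differ, and the other fields may too):
{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}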
Now you need to install the packages mentioned above. I like to install them all at once by running this shorthand command:
npm i express axios cheerio
This should install all our dependencies, and you should see them appear in the package.json file. Next, we will create an index.js file. This is where we will write our code for this project.
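Once the install finishes, a dependencies section similar to this should appear in package.json (your version numbers will almost certainly differ):
"dependencies": {
  "axios": "^1.6.0",
  "cheerio": "^1.0.0-rc.12",
  "express": "^4.18.2"
}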
Now let's set up a command to run our local server. You do this by going into the package.json file and adding the "start" line shown below to the scripts section:
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1",
"start": "nodemon index.js" //add this line
},
This is so that when you run npm start in your terminal, it runs "nodemon index.js".
TIP: Nodemon is a very useful package that listens for changes to your code and reloads the server automatically. If you don't have it installed, run npm i -g nodemon
Let's Get Coding
Let us start by creating our local server. We do this by importing Express and using the .listen() function.
const PORT = 8000; // this is the port our local server will be available at
const express = require('express');
const app = express(); // initializing Express
app.listen(PORT, () => console.log(`Server is running at port: ${PORT}`));
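Run npm start and, assuming nodemon is installed, your terminal should print something along these lines (the exact version and watch paths will vary):
[nodemon] 2.0.22
[nodemon] watching path(s): *.*
[nodemon] starting `node index.js`
Server is running at port: 8000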
What we want to do next is use Axios to fetch the HTML markup from the page we are trying to scrape.
const PORT = 8000; // this is the port our local server will be available at
const express = require('express');
const axios = require('axios');
const app = express(); // initializing Express
const url = 'https://techcabal.com/'; // this is the website I will be using
axios(url).then(response => {
const htmlMarkup = response.data;
console.log(htmlMarkup)
}).catch(err => {
//catching errors
console.log({
error: err
})
})
app.listen(PORT, ()=> console.log(`Server is running at port: ${PORT}`));
Save your code and look at your terminal. You should see the long HTML markup that Axios fetched from the website. Let's extract data from this markup we just got, shall we?
This is where Cheerio comes in. Cheerio helps us pick out HTML elements on a web page: it works by parsing markup, and it provides an API for traversing and manipulating the resulting data structure. Now we will use it to parse and extract data from the markup we just fetched.
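Before we plug Cheerio into our scraper, here is a tiny self-contained sketch of how it works on its own (the HTML string is made up purely for illustration; the loaded selector function is conventionally named $):
const cheerio = require('cheerio');

// load() parses the markup and returns a jQuery-like selector function
const $ = cheerio.load('<ul><li class="fruit">Apple</li><li class="fruit">Mango</li></ul>');

// select elements by class and read their text, just like in jQuery
$('.fruit').each(function () {
  console.log($(this).text()); // logs "Apple", then "Mango"
});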
const PORT = 8000; // this is the port our local server will be available at
const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');
const app = express(); // initializing Express
const url = 'https://techcabal.com/'; // this is the website I will be using
axios(url).then(response => {
const htmlMarkup = response.data;
const _parsedHTML = cheerio.load(htmlMarkup) // here is where we parse our markup
const articles = []
//extracting data you want from the website
_parsedHTML('.article-list-item').each(function () {
const title = _parsedHTML(this).find('.article-list-desc').find('a').text().replace( /[\r\n]+/gm, "")
const category = _parsedHTML(this).find('.article-list-pretitle').find('a').text()
const url = _parsedHTML(this).find('.article-list-desc').find('a').attr('href')
//appending data to existing array
articles.push({
title,
category,
url
})
})
console.log(articles) //logging data to console
}).catch(err => {
//catching errors
console.log({
error: err
})
})
app.listen(PORT, ()=> console.log(`Server is running at port: ${PORT}`));
Here is what we did. We inspected the website using the dev tools and found the article element we wanted to extract data from, so we used Cheerio to find all elements with the class .article-list-item. For each element found on the page, the callback function grabs the article's title, category and URL using Cheerio's methods, as seen above. Save this, and you should see an array of all the articles with their extracted data logged in your terminal.
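To make those selectors concrete, the markup being targeted presumably looks roughly like this (simplified and reconstructed purely for illustration; the site's real markup may differ):
<div class="article-list-item">
  <div class="article-list-pretitle">
    <a href="https://techcabal.com/category-page/">Category name</a>
  </div>
  <div class="article-list-desc">
    <a href="https://techcabal.com/some-article/">Article title</a>
  </div>
</div>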
Conclusion
Great! Now we have been able to extract some data from a website. You can extend your research by looking at Cheerio's docs, and you can also get the full code for this tutorial here on GitHub.
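As one possible next step, since Express is already set up but only used for .listen(), you could expose the scraped articles through a route. Here is a minimal sketch reusing the scraping logic from above; the /articles path is just a name picked for illustration:
app.get('/articles', (req, res) => {
  axios(url).then(response => {
    const $ = cheerio.load(response.data); // parse the fetched markup
    const articles = [];
    $('.article-list-item').each(function () {
      articles.push({
        title: $(this).find('.article-list-desc').find('a').text().replace(/[\r\n]+/gm, ''),
        category: $(this).find('.article-list-pretitle').find('a').text(),
        url: $(this).find('.article-list-desc').find('a').attr('href')
      });
    });
    res.json(articles); // send the scraped data back as JSON
  }).catch(err => res.status(500).json({ error: err.message }));
});
With this in index.js (above the app.listen line), visiting http://localhost:8000/articles in your browser should return the articles as JSON.
Cheers!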