Web Scraping for beginners (How to scrape data from a website)

In this mini-tutorial, let's take a look at how to build a simple web scraper with Node.js, Express, Axios, and Cheerio. This is for beginners who are new to Node.js and want to learn about web scraping in general.

Let's get to it 😉

Please make sure you have Node.js installed on your machine.

Create a new folder, open it in your preferred terminal, and run this command:

npm init

This will create a package.json file, which contains some information about our project, including our dependencies.
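For reference, the generated package.json will look something like this (the exact values depend on what you type at the prompts; "web-scraper" here is just a placeholder name):

{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}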

Now you need to install the packages mentioned above. I like to install them all at once by running this shorthand command:

npm i express axios cheerio

This should install all our dependencies, and you should see them appear in the package.json file. Next, we will create an index.js file. This is where we will be writing our code for this project.
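After installing, a dependencies section should appear in package.json, roughly like this (your version numbers will almost certainly differ):

"dependencies": {
  "axios": "^1.6.0",
  "cheerio": "^1.0.0-rc.12",
  "express": "^4.18.2"
}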

Now let's run our local server. You do this by going into the package.json file and adding a start script to the scripts section.

"scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "nodemon index.js" //add this line
  },

This is so that when you run npm start in your terminal, it runs "nodemon index.js".

TIP: Nodemon is a very useful package that watches for changes to your code and reloads the server automatically. If you don't have it installed, run npm i -g nodemon
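If you'd prefer not to install nodemon globally, you can add it as a dev dependency instead; npm run scripts can find locally installed binaries, so the start script above will still work:

npm i -D nodemon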

Let's Get Coding

Let us start by creating our local server. We do this by importing Express and using the .listen() function.

const PORT = 8000; // the port our local server will be available at
const express = require("express");

const app = express(); // initializing express

app.listen(PORT, () => console.log(`Server is running at port: ${PORT}`));

What we want to do next is use Axios to get the HTML markup from the page we are trying to scrape:

const PORT = 8000; // the port our local server will be available at
const express = require("express");
const axios = require("axios");

const app = express(); // initializing express

const url = "https://techcabal.com/"; // this is the website I will be using

axios(url)
    .then(response => {
        const htmlMarkup = response.data;
        console.log(htmlMarkup);
    })
    .catch(err => {
        // catching errors
        console.log({ error: err });
    });

app.listen(PORT, () => console.log(`Server is running at port: ${PORT}`));

Save your code and look at your terminal. You should see the long HTML markup Axios sent back to us from the website. Let's extract data from this markup we just got, shall we?

This is why we need Cheerio. Cheerio helps us pick out HTML elements on a web page: it parses markup and provides an API for traversing and manipulating the resulting data structure. Now we will use it to parse and extract data from the markup we just fetched.
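Before the full example, here is a tiny standalone sketch (using made-up markup) of how Cheerio works: load() parses a string of HTML and returns a querying function, conventionally named $.

const cheerio = require("cheerio");

// load some sample markup; load() returns a querying function
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

console.log($("h2.title").text()); // prints "Hello world"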

const PORT = 8000; // the port our local server will be available at
const express = require("express");
const axios = require("axios");
const cheerio = require("cheerio");

const app = express(); // initializing express

const url = "https://techcabal.com/"; // this is the website I will be using

axios(url)
    .then(response => {
        const htmlMarkup = response.data;
        const _parsedHTML = cheerio.load(htmlMarkup); // here is where we parse our markup
        const articles = [];

        // extracting the data we want from the website
        _parsedHTML(".article-list-item").each(function () {
            const title = _parsedHTML(this).find(".article-list-desc").find("a").text().replace(/[\r\n]+/gm, "");
            const category = _parsedHTML(this).find(".article-list-pretitle").find("a").text();
            const url = _parsedHTML(this).find(".article-list-desc").find("a").attr("href");

            // appending the data to the articles array
            articles.push({
                title,
                category,
                url
            });
        });

        console.log(articles); // logging the extracted data to the console
    })
    .catch(err => {
        // catching errors
        console.log({ error: err });
    });

app.listen(PORT, () => console.log(`Server is running at port: ${PORT}`));

Here is what we did. We inspected the website using the browser's dev tools and found the article elements we wanted to extract data from. We then used Cheerio to find all elements with the class .article-list-item, and for each element found on the page, the callback function extracts the article's title, category, and URL using Cheerio's methods, as seen above. Save this, and you should see an array of all the articles and their extracted data.
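One thing worth noticing: we imported Express but only used it to start the server. As a possible extension (a sketch, not part of the original code, reusing the same selectors), you could wrap the scraping logic in a route so the data is served as JSON instead of just logged:

// a possible extension: expose the scraped articles as a JSON endpoint
app.get("/articles", (req, res) => {
    axios(url)
        .then(response => {
            const _parsedHTML = cheerio.load(response.data);
            const articles = [];
            _parsedHTML(".article-list-item").each(function () {
                articles.push({
                    title: _parsedHTML(this).find(".article-list-desc").find("a").text().replace(/[\r\n]+/gm, ""),
                    category: _parsedHTML(this).find(".article-list-pretitle").find("a").text(),
                    url: _parsedHTML(this).find(".article-list-desc").find("a").attr("href")
                });
            });
            res.json(articles); // send the array back as JSON
        })
        .catch(err => res.status(500).json({ error: err.message }));
});

With this in place, visiting http://localhost:8000/articles in your browser would return the same array as JSON.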

Conclusion

Great! We have now extracted some data from a website. You can take your research further by looking at Cheerio's docs, and you can also get the full code for this tutorial here on GitHub. Cheers!