Programming lesson
Building a Web Scraper with Node.js: A Step-by-Step Tutorial for Cosc484 Assignment 7
Learn how to build a web scraper using Node.js, Axios, Cheerio, and Nodemailer. This tutorial guides you through scraping a music chart website and emailing the songs by the artists you specify, as required for Cosc484 Assignment 7.
Introduction to Web Scraping in Node.js
Web scraping is a technique for extracting data from websites. In this tutorial, you'll build a web scraper that fetches the top rap songs from PopVortex and emails the chart entries for the artists you specify. The project combines three Node.js modules: Axios for HTTP requests, Cheerio for HTML parsing, and Nodemailer for sending email. By the end, you'll have a foundation for scraping server-rendered pages and automating data extraction. Note that Axios and Cheerio only see the HTML the server returns; they do not run the page's JavaScript, so content rendered in the browser needs a different tool.
Prerequisites
- Node.js installed (v14 or later)
- Basic knowledge of JavaScript and Node.js
- A Gmail account for sending emails (with 2-factor authentication and an app password)
Project Setup
Create a new directory for your project and initialize it with npm init -y. Install the required modules:
npm install axios cheerio nodemailer
Create a file named artists.js and a credentials.json file with the following structure:
{
  "from": "[email protected]",
  "to": "[email protected]",
  "sender email": "[email protected]",
  "sender password": "your-app-password"
}
Reading Credentials
Use fs.readFileSync to read the JSON file and JSON.parse to parse it, then extract the email credentials. Use the exact keys specified in the assignment.
const fs = require('fs');
const credentials = JSON.parse(fs.readFileSync('credentials.json', 'utf8'));
const from = credentials['from'];
const to = credentials['to'];
const senderEmail = credentials['sender email'];
const senderPassword = credentials['sender password'];
Command-Line Arguments
Read the artist names from process.argv. If no artists are provided, exit without sending an email.
const args = process.argv.slice(2);
if (args.length === 0) {
  console.log('No artists specified. Exiting.');
  process.exit(0);
}
Scraping the Website
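Use Axios to fetch the HTML of the target URL. Then use Cheerio to parse the HTML and extract the top 25 songs and their artists. The CSS selectors in the code depend on the page's markup, so inspect the live page and adjust them if the scraper returns no songs.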
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeSongs() {
  const url = 'http://www.popvortex.com/music/charts/top-rap-songs.php';
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);
  const songs = [];
  $('.chart-content .chart-item').each((i, el) => {
    if (i >= 25) return false; // returning false stops .each after the top 25
    const artist = $(el).find('.artist').text().trim();
    const song = $(el).find('.title').text().trim();
    songs.push({ artist, song });
  });
  return songs;
}
Filtering Songs by Specified Artists
For each specified artist, check if they appear in the artist name (including features). Collect matching songs.
function filterSongs(songs, artists) {
  const wanted = artists.map(artist => artist.toLowerCase());
  // Keep each song once, even when several requested artists appear on it.
  return songs.filter(song => {
    const chartArtist = song.artist.toLowerCase();
    return wanted.some(artist => chartArtist.includes(artist));
  });
}
Formatting the Email
If matches are found, format the email subject and body. The subject lists the artists in alphabetical order with proper formatting. The body uses bold for artists and italics for song titles.
function formatEmail(matches, artists) {
  // Copy before sorting so the caller's array is not reordered.
  const sortedArtists = [...artists].sort();
  let subject = 'Your artist(s) are: ';
  if (sortedArtists.length === 1) {
    subject += sortedArtists[0];
  } else if (sortedArtists.length === 2) {
    subject += sortedArtists.join(' and ');
  } else {
    subject += sortedArtists.slice(0, -1).join(', ') + ', and ' + sortedArtists.slice(-1);
  }
  // The body is sent as HTML: bold artist, italic song title, one entry per line.
  let body = '';
  matches.forEach(match => {
    body += `<b>${match.artist}</b>: <i>${match.song}</i><br>\n`;
  });
  return { subject, body };
}
Sending the Email
Use Nodemailer to send the email. Ensure you use the app password from Gmail.
const nodemailer = require('nodemailer');
async function sendEmail(subject, body) {
  const transporter = nodemailer.createTransport({
    service: 'gmail',
    auth: {
      user: senderEmail,
      pass: senderPassword
    }
  });
  const mailOptions = {
    from: from,
    to: to,
    subject: subject,
    html: body
  };
  await transporter.sendMail(mailOptions);
  console.log('Email sent successfully');
}
Putting It All Together
Create an async main function that orchestrates the scraping, filtering, and email sending.
async function main() {
  const songs = await scrapeSongs();
  const matches = filterSongs(songs, args);
  if (matches.length === 0) {
    console.log('No matching artists found. No email sent.');
    process.exit(0);
  }
  const { subject, body } = formatEmail(matches, args);
  await sendEmail(subject, body);
}
main().catch(err => {
  console.error(err);
  process.exitCode = 1; // signal failure to the shell
});
Testing the Scraper
Run the script with sample artists. For example:
node artists.js Drake Migos
If the artists are found, you'll receive an email with the subject "Your artist(s) are: Drake and Migos" and a list of their songs.
Handling Edge Cases
- No artists specified: Exit without sending an email.
- Artists not found: Do not send an email.
- Case sensitivity: The search is case-insensitive.
- Multi-word artists: The assignment suggests not worrying about them, but you can handle them by quoting each name on the command line (for example, node artists.js "Post Malone"), so the shell passes it to the script as a single argument.
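The edge cases above can be partly handled when the arguments are read. The following is a minimal sketch, assuming a hypothetical helper named parseArtists that trims each argument, drops empty ones, and removes case-insensitive duplicates. A name quoted in the shell arrives as one element of process.argv, so multi-word artists need no extra work here.

```javascript
// Hypothetical helper: clean up the artist names taken from process.argv.
function parseArtists(argv) {
  const seen = new Set();
  const artists = [];
  // The first two entries are the node binary and the script path.
  for (const raw of argv.slice(2)) {
    const name = raw.trim();
    const key = name.toLowerCase();
    if (name === '' || seen.has(key)) continue; // skip blanks and repeats
    seen.add(key);
    artists.push(name);
  }
  return artists;
}

// Simulates: node artists.js Drake "Post Malone" drake
const example = parseArtists(['node', 'artists.js', 'Drake', 'Post Malone', 'drake']);
console.log(example); // contains 'Drake' and 'Post Malone'
```

In the script itself you would call parseArtists(process.argv) and exit early if the result is empty.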
Conclusion
You've built a functional web scraper that extracts music chart data and sends customized emails. This project demonstrates the power of Node.js for automation and data extraction. Expand it by adding more features like scraping other websites or scheduling periodic emails.
Additional Tips
Be respectful of the website's rate limits: add a delay between requests if you scrape multiple pages. For production use, consider environment variables for sensitive credentials instead of a JSON file.
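Both tips can be sketched with the Node.js standard library alone. The environment variable names (MAIL_FROM, MAIL_TO, SENDER_EMAIL, SENDER_PASSWORD) and the helpers loadCredentials, sleep, and fetchSequentially are assumptions made for illustration, not part of the assignment or of any library.

```javascript
// Hypothetical variable names; choose whatever fits your deployment.
const REQUIRED = ['MAIL_FROM', 'MAIL_TO', 'SENDER_EMAIL', 'SENDER_PASSWORD'];

// Reads the credentials from an environment object (process.env by default)
// and fails loudly if any value is missing.
function loadCredentials(env = process.env) {
  const missing = REQUIRED.filter(name => !env[name]);
  if (missing.length > 0) {
    throw new Error('Missing environment variables: ' + missing.join(', '));
  }
  return {
    from: env.MAIL_FROM,
    to: env.MAIL_TO,
    senderEmail: env.SENDER_EMAIL,
    senderPassword: env.SENDER_PASSWORD
  };
}

// Resolves after ms milliseconds; await it between requests.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Calls fetchPage for each URL in turn, pausing delayMs between requests.
async function fetchSequentially(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs);
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

In this tutorial's script, fetchPage could be a function that wraps axios.get and returns response.data. The one-second default delay is an arbitrary starting point; adjust it to whatever the site's terms allow.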