Node.js Crawler Example

码农世界 2024-05-28 Backend

1. Installation:

npm install cheerio
npm install axios

2. Introduction:

2.1 cheerio: features and uses:

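HTML parsing and manipulation: Cheerio loads an HTML string into memory and converts it into a DOM tree that you can traverse and modify, which makes an HTML document easy to parse and manipulate.
jQuery-like API: Cheerio provides selectors and manipulation methods modeled on jQuery, so you can use CSS selectors and DOM methods to work with an HTML document, for example to find elements, change attributes, or add styles.
Lightweight: Unlike jQuery running in a browser, Cheerio is a lightweight library for server-side Node.js. It parses and manipulates HTML efficiently without running a full browser engine, which also means it does not execute the page's JavaScript.
Convenient data extraction: Cheerio makes it easy to pull the data you need out of an HTML document, for example when scraping page content or analyzing HTML structure, so it is commonly used in web crawlers and data-scraping tasks.

Modular: Cheerio combines with other Node.js modules and tools, such as HTTP request libraries (Axios, or the now-deprecated request) and file-system operations, to build more complex tasks.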

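2.2 Using axios for network requests

2.3 Using fs for file operations: write the fetched data to a specified folder

Key points involved:

response.data.pipe(): when the request is made with responseType: 'stream', response.data is a readable stream, and pipe() forwards its data into a writable stream.

fs.createWriteStream(): creates a writable stream that writes the incoming data to a file.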
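As a rough sketch of how these two calls fit together, the example below pipes a readable stream into a file and waits for the write to complete. The helper name `saveStream` and the temporary file name are assumptions made for this illustration, and an in-memory stream stands in for `response.data` so that no network access is needed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Readable } = require('stream');

// Pipe any readable stream into a file and resolve once the file is fully written.
function saveStream(readable, filePath) {
    return new Promise((resolve, reject) => {
        const writer = fs.createWriteStream(filePath);
        readable.on('error', reject);   // the source failed
        writer.on('error', reject);     // the file could not be written
        writer.on('finish', () => resolve(filePath)); // all data flushed to the file
        readable.pipe(writer);
    });
}

// Usage: an in-memory stream plays the role of response.data.
const demoPath = path.join(os.tmpdir(), 'pipe-demo.txt');
saveStream(Readable.from(['hello ', 'stream']), demoPath)
    .then((savedPath) => console.log(fs.readFileSync(savedPath, 'utf8')));
```

Waiting for the writer's `finish` event, rather than the reader's `end` event, is what signals that the file on disk is complete.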

3. Example:

        const cheerio = require('cheerio');
        const axios = require('axios');
        const fs = require('fs');
        const path = require('path');
        const headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'
        };
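        // Download one image to filePath, resolving once the file is fully written.
        const downloadImage = async (url, filePath) => {
            // createWriteStream does not create missing directories, so do it here.
            fs.mkdirSync(path.dirname(filePath), { recursive: true });
            const response = await axios({
                url: url,
                method: 'GET',
                headers: headers,
                responseType: 'stream'
            });
            const writer = fs.createWriteStream(filePath);
            response.data.pipe(writer);
            return new Promise((resolve, reject) => {
                // 'finish' fires after all data has been flushed to the file.
                writer.on('finish', resolve);
                writer.on('error', reject);
                response.data.on('error', reject);
            });
        };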
        // Crawl options.page list pages and download every image found on each.
        const crawler = async (options) => {
            for (let i = 1; i <= options.page; i++) {
                // Page 1 is the base URL; later pages follow the list_<n>.html pattern.
                const url = i === 1 ? options.url : `${options.url}list_${i}.html`;
                console.log(url);
                try {
                    const response = await axios.get(url, {
                        headers: headers
                    });
                    const $ = cheerio.load(response.data);
                    const imageElements = $('.pics img');
                    imageElements.each((index, element) => {
                        const src = $(element).attr('src');
                        if (src) {
                            // Resolve relative src values against the page URL.
                            const imageUrl = new URL(src, url).href;
                            const imageName = `${i}-${index}.jpg`;
                            const imagePath = path.join(__dirname, 'img', imageName);
                            // Downloads are started without awaiting, so they run concurrently.
                            downloadImage(imageUrl, imagePath)
                                .then(() => {
                                    console.log(`${i} ---- ${index}`, imageUrl, 'Downloaded successfully.');
                                })
                                .catch((error) => {
                                    console.error(`${i} ---- ${index}`, imageUrl, 'Download failed. Error:', error);
                                });
                        }
                    });
                } catch (err) {
                    console.error('Error fetching or parsing the page:', err);
                }
            }
        };
        crawler({
            url: 'http://www.duoziwang.com/head/gexing/',
            page: 10
        });
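
The example starts each download without waiting for it, so the loop moves on to the next page while images are still being saved. One possible variation, sketched below, waits for all downloads of a page to settle before continuing. The helper names `pageUrl` and `runPage`, the task shape, and the `example.com` URL are assumptions made for this illustration; fake tasks stand in for real downloads.

```javascript
// Build the URL of page i using the same pattern as the crawler:
// page 1 is the base URL itself, later pages are list_<i>.html under it.
function pageUrl(baseUrl, i) {
    return i === 1 ? baseUrl : `${baseUrl}list_${i}.html`;
}

// Run every download task for one page and count successes and failures.
// `tasks` is an array of functions that each return a promise (hypothetical shape).
async function runPage(tasks) {
    const results = await Promise.allSettled(tasks.map((task) => task()));
    const failed = results.filter((r) => r.status === 'rejected').length;
    return { ok: results.length - failed, failed };
}

// Usage with fake tasks standing in for downloadImage calls.
const fakeTasks = [
    () => Promise.resolve('saved'),
    () => Promise.reject(new Error('404')),
];
runPage(fakeTasks).then((summary) => {
    console.log(pageUrl('http://example.com/head/', 2), summary);
});
```

`Promise.allSettled` is used here instead of `Promise.all` so that one failed image does not hide the outcome of the others on the same page.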