
Scraping a Website with Python, Scrapy, and MongoDB

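Introduction

Data has become a new commodity, and an expensive one. As people create a seemingly limitless amount of content online, the volume of data on websites keeps growing, and many startups are built on ideas that need this data. Unfortunately, time and money constraints mean they cannot always produce it themselves.
One popular solution to this problem is web crawling and scraping. Growing demand for data in machine learning applications has made web crawling very popular. A web crawler reads a website's source, which makes it easy to find the content to extract.
However, crawlers are inefficient because they grab everything inside an HTML tag, and the developer then has to validate and clean the data. This is where tools like Scrapy come in. Scrapy is a web scraper rather than a simple crawler, so it is more selective about the kinds of data it collects.
In the following sections, you will learn about Scrapy, Python's most popular scraping framework, and how to use it.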

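Introduction to Scrapy

Scrapy is a fast, high-level web crawling framework written in Python. It is free and open source, and it is used for large-scale web scraping.
Scrapy relies on spiders, which determine how a site (or a group of sites) is scraped for the information you want. Spiders are classes that define how you want to scrape a site and how to extract structured data from a set of pages.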

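Getting started

As with any other Python project, it is best to create a separate virtual environment so that the libraries do not clutter the existing base environment. This article assumes that you have Python 3.3 or later installed, which is the release that introduced the venv module. Recent Scrapy releases need a considerably newer Python 3, so check the Scrapy installation guide for the current minimum.

1. Create a virtual environment

This article uses a virtual environment named .venv. You are free to change the name, but make sure you use the same one throughout the project.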

mkdir web-scraper
cd web-scraper
python3 -m venv .venv

2. Activate the virtual environment

On Windows, use the following command:

.venv\Scripts\activate

On Linux and macOS:

source .venv/bin/activate

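This command activates the new virtual environment. Because it is new, it contains no packages yet, so you have to install all the required libraries.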
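
If it is unclear whether the activation took effect, the interpreter itself can be asked. The following is a minimal sketch; the helper name in_virtualenv is hypothetical and only uses the standard sys module.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

Running this with the python executable from .venv should report True; running it with the system interpreter should report False.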

3. Set up Scrapy

Scrapy is a framework with several dependencies, and pip installs them automatically along with it:

pip install scrapy

For platform-specific installation details, follow the official Scrapy documentation.
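
A quick way to confirm that the installation landed in the active environment is to look up the package metadata. This is a small sketch that assumes Python 3.8 or later (for importlib.metadata); installed_version is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when Scrapy is installed, otherwise None.
print("scrapy version:", installed_version("scrapy"))
```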

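Scraping LogRocket's articles

Note: LogRocket is only the example site here; you can substitute another one, such as https://blog.csdn.net/low5252 or https://weibo.com/. The CSS selectors used below are specific to the LogRocket blog and would have to be adapted.

The best way to understand any framework is to learn by doing. With that said, let's scrape LogRocket's featured articles and their respective comments.

Basic setup

Let's start by creating a blank project: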

scrapy startproject logrocket

Next, create your first spider with the following commands:

cd logrocket
scrapy genspider feature_article blog.logrocket.com

Let's look at the resulting directory structure:

web-scraper
├── .venv
└── logrocket
    ├── logrocket
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       └── feature_article.py
    └── scrapy.cfg

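Writing the first spider

Now that the project is set up, let's write our first spider, which will scrape all the featured articles from the LogRocket blog.

Open the spiders/feature_article.py file.

Let's go step by step, starting with fetching the featured articles from the blog page: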

import scrapy


class FeatureArticleSpider(scrapy.Spider):
    name = 'feature_article'
    allowed_domains = ['blog.logrocket.com']
    start_urls = ['http://blog.logrocket.com']

    def parse(self, response):
        feature_articles = response.css("section.featured-posts div.card")
        for article in feature_articles:
            article_dict = {
                # get(default="") avoids an AttributeError from .strip() when nothing matches
                "heading": article.css("h2.card-title a::text").get(default="").strip(),
                "url": article.css("h2.card-title a::attr(href)").extract_first(),
                "author": article.css("span.author-meta span.post-name a::text").extract_first(),
                "published_on": article.css("span.author-meta span.post-date::text").extract_first(),
                "read_time": article.css("span.readingtime::text").extract_first(),
            }
            yield article_dict

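As the code above shows, scrapy.Spider defines several attributes and methods:

name, the spider's name, which must be unique within the project
allowed_domains, the list of domains the spider is allowed to crawl
start_urls, the list of URLs where the crawl begins
parse(), which is called to handle the response to each request. It typically parses the response, extracts the data, and yields it as a dict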
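
One detail worth keeping in mind inside parse(): a selector lookup returns None when nothing matches, so calling a string method such as strip() directly on the result can raise an AttributeError. A small helper is one way to guard against that. The sketch below is independent of Scrapy, and clean_text is a hypothetical name.

```python
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace; pass None through unchanged."""
    if value is None:
        return None
    stripped = value.strip()
    # Treat whitespace-only strings as missing values as well.
    return stripped or None


print(clean_text("  Understanding React  "))
print(clean_text(None))
```

In a spider, each extracted value could be passed through such a helper before it is placed in the dict.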

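Selecting the right CSS elements

When scraping, it is important to know the best way to uniquely identify the element you want to extract.
The best approach is to inspect the element in the browser. The developer tools (right-click and choose Inspect) show the HTML structure at a glance.
A browser extension for XPath is also recommended as a quick way to locate a specific element.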
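
To get a feel for how a path expression pins down an element, the idea can be tried on a small document without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath), and the markup is invented for illustration rather than copied from the LogRocket blog.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup that loosely mirrors a list of article cards.
HTML = """
<section class="featured-posts">
  <div class="card" id="post-1">
    <h2 class="card-title"><a href="https://example.com/a">First post</a></h2>
  </div>
  <div class="card" id="post-2">
    <h2 class="card-title"><a href="https://example.com/b">Second post</a></h2>
  </div>
</section>
"""

root = ET.fromstring(HTML)
articles = []
# ".//" searches below the current element, so each lookup stays scoped to it.
for card in root.findall(".//div[@class='card']"):
    link = card.find(".//h2[@class='card-title']/a")
    articles.append(
        {"id": card.get("id"), "heading": link.text, "url": link.get("href")}
    )

print(articles)
```

The same habit carries over to Scrapy: select the repeating container first, then query relative to each container so that fields from different articles do not get mixed up.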

Running the first spider

Run the spider above with the following command:

scrapy crawl feature_article

The output should include all the featured articles, for example:

...
...
{'heading': 'Understanding React’s ', 'url': 'https://blog.logrocket.com/understanding-react-useeffect-cleanup-function/', 'author': 'Chimezie Innocent', 'published_on': 'Oct 27, 2021', 'read_time': '6 min read'}
2021-11-09 19:00:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.logrocket.com/>
...
...
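
The values in this output are plain strings. If typed values are wanted later on, they can be normalized after extraction. The following is a sketch under two assumptions: dates always follow the "Oct 27, 2021" pattern with English month abbreviations, and read times follow the "6 min read" pattern. Both function names are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_published_on(value: str) -> date:
    """Convert a string such as 'Oct 27, 2021' into a date object."""
    # %b matches the abbreviated month name in the current (assumed English) locale.
    return datetime.strptime(value.strip(), "%b %d, %Y").date()


def parse_read_minutes(value: str) -> Optional[int]:
    """Extract the number of minutes from a string such as '6 min read'."""
    match = re.search(r"(\d+)\s*min", value)
    return int(match.group(1)) if match else None


record = {"published_on": "Oct 27, 2021", "read_time": "6 min read"}
print(parse_published_on(record["published_on"]))  # 2021-10-27
print(parse_read_minutes(record["read_time"]))     # 6
```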

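Introducing items

The main goal of scraping is to extract unstructured data and convert it into meaningful, structured data. Items provide a dict-like API along with some useful extra features. The Scrapy documentation covers items in more detail.
Let's create the first item to describe an article by its attributes. Here we define it with a dataclass.
Edit items.py as follows: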

from dataclasses import dataclass

@dataclass
class LogrocketArticleItem:
    _id: str
    heading: str
    url: str
    author: str
    published_on: str
    read_time: str
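
A dataclass item behaves like a small typed record, and the standard library can turn it into a plain dict, which is the shape a document store expects. The sketch below uses a shortened, hypothetical ArticleSketch class with made-up values instead of the full item.

```python
from dataclasses import asdict, dataclass


@dataclass
class ArticleSketch:
    """Hypothetical, shortened stand-in for LogrocketArticleItem."""
    _id: str
    heading: str
    url: str


article = ArticleSketch(
    _id="post-1",
    heading="Example heading",
    url="https://example.com/post-1",
)

# Fields are available as attributes ...
print(article.heading)
# ... and asdict() produces a dict keyed by field name.
document = asdict(article)
print(document)
```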

Then update spiders/feature_article.py as follows:

import scrapy
from ..items import LogrocketArticleItem

class FeatureArticleSpider(scrapy.Spider):
    name = 'feature_article'
    allowed_domains = ['blog.logrocket.com']
    start_urls = ['http://blog.logrocket.com']

    def parse(self, response):
        feature_articles = response.css("section.featured-posts div.card")
        for article in feature_articles:
            article_obj = LogrocketArticleItem(
                _id = article.css("::attr('id')").extract_first(),
                heading = article.css("h2.card-title a::text").extract_first(),
                url = article.css("h2.card-title a::attr(href)").extract_first(),
                author = article.css("span.author-meta span.post-name a::text").extract_first(),
                published_on = article.css("span.author-meta span.post-date::text").extract_first(),
                read_time = article.css("span.readingtime::text").extract_first(),
            )
            yield article_obj

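Getting the comments for each post

Let's go deeper into building the spider. To get the comments for each article, you need to request each article's URL and then extract the comments.
To do that, first add an item for comments to items.py: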

@dataclass
class LogrocketArticleCommentItem:
    _id: str
    author: str
    content: str
    published: str

Now that the comment item is ready, edit spiders/feature_article.py as follows:

import scrapy
from ..items import (
    LogrocketArticleItem,
    LogrocketArticleCommentItem
)

class FeatureArticleSpider(scrapy.Spider):
    name = 'feature_article'
    allowed_domains = ['blog.logrocket.com']
    start_urls = ['http://blog.logrocket.com']

    def get_comments(self, response):
        """
        The callback method gets the response from each article url.
        It fetches the article comment obj, creates a list of comments, and returns dict with the list of comments and article id.
        """
        article_comments = response.css("ol.comment-list li")
        comments = list()
        for comment in article_comments:
            comment_obj = LogrocketArticleCommentItem(
                _id = comment.css("::attr('id')").extract_first(),
                # special case: author can be inside `a` or `b` tag, so using xpath;
                # the leading "." keeps the query relative to this comment instead of the whole page
                author = comment.xpath("string(.//div[@class='comment-author vcard']//b)").get(),
                # special case: there can be multiple p tags, so take the string value of the whole content div
                content = comment.xpath("string(.//div[@class='comment-content'])").get(),
                published = comment.css("div.comment-metadata a time::text").extract_first(),
            )
            comments.append(comment_obj)

        yield {"comments": comments, "article_id": response.meta.get("article_id")}

    def get_article_obj(self, article):
        """
        Creates an ArticleItem by populating the item values.
        """
        article_obj = LogrocketArticleItem(
            _id = article.css("::attr('id')").extract_first(),
            heading = article.css("h2.card-title a::text").extract_first(),
            url = article.css("h2.card-title a::attr(href)").extract_first(),
            author = article.css("span.author-meta span.post-name a::text").extract_first(),
            published_on = article.css("span.author-meta span.post-date::text").extract_first(),
            read_time = article.css("span.readingtime::text").extract_first(),
        )
        return article_obj

    def parse(self, response):
        """
        Main Method: loop through each article and yield the article.
        Also raises a request with the article url and yields the same.
        """
        feature_articles = response.css("section.featured-posts div.card")
        for article in feature_articles:
            article_obj = self.get_article_obj(article)
            # yield the article object
            yield article_obj
            # yield the comments for the article
            yield scrapy.Request(
                url = article_obj.url,
                callback = self.get_comments,
                meta={
                    "article_id": article_obj._id,
                }
            )

Now run the spider above with the same command:

scrapy crawl feature_article

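Saving the data in MongoDB

Now that we have the right data, let's save it in a database. We will use MongoDB to store the scraped items.

Initial steps

After installing MongoDB on your system, install PyMongo with pip. PyMongo is a Python library that contains tools for interacting with MongoDB.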

pip3 install pymongo

Next, add the new MongoDB-related settings to settings.py:

# MONGO DB SETTINGS
MONGO_HOST="localhost"
MONGO_PORT=27017
MONGO_DB_NAME="logrocket"
MONGO_COLLECTION_NAME="featured_articles"
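
PyMongo's MongoClient accepts either a host and port pair or a single mongodb:// connection URI. If a URI is preferred, it can be derived from the two settings above. This is a minimal sketch for a local, unauthenticated server; build_mongo_uri is a hypothetical helper.

```python
def build_mongo_uri(host: str, port: int) -> str:
    """Build a basic MongoDB connection URI without credentials or options."""
    return f"mongodb://{host}:{port}"


MONGO_HOST = "localhost"
MONGO_PORT = 27017

print(build_mongo_uri(MONGO_HOST, MONGO_PORT))
```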

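Pipeline management

The spider is now set up to crawl and parse the HTML, and the database settings are in place.
Next, we have to connect the two through a pipeline in pipelines.py: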

from itemadapter import ItemAdapter
import pymongo
from scrapy.utils.project import get_project_settings
from .items import (
    LogrocketArticleCommentItem,
    LogrocketArticleItem
)
from dataclasses import asdict

settings = get_project_settings()

class MongoDBPipeline:
    def __init__(self):
        conn = pymongo.MongoClient(
            settings.get('MONGO_HOST'),
            settings.get('MONGO_PORT')
        )
        db = conn[settings.get('MONGO_DB_NAME')]
        self.collection = db[settings['MONGO_COLLECTION_NAME']]

    def process_item(self, item, spider):
        if isinstance(item, LogrocketArticleItem):  # article item
            # Collection.update() was removed in PyMongo 4;
            # replace_one() performs the same whole-document upsert.
            self.collection.replace_one({"_id": item._id}, asdict(item), upsert=True)
        else:  # dict carrying the comments for one article
            comments = [asdict(comment) for comment in item.get("comments")]
            self.collection.update_one(
                {"_id": item.get("article_id")},
                {"$set": {"comments": comments}},
                upsert=True,
            )

        return item
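
The pipeline writes two kinds of item into the same collection, both keyed by the article id: an article is stored as a whole document, and its comments are merged into that document later. To make the merge behaviour easier to follow without a running database, here is a sketch with an in-memory stand-in. InMemoryCollection and its methods are hypothetical and only imitate the upsert semantics; they are not part of PyMongo.

```python
from typing import Any, Dict


class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection, keyed by _id."""

    def __init__(self) -> None:
        self.documents: Dict[Any, Dict[str, Any]] = {}

    def replace(self, _id: Any, document: Dict[str, Any]) -> None:
        """Upsert by storing the whole document under the given id."""
        self.documents[_id] = dict(document)

    def set_fields(self, _id: Any, fields: Dict[str, Any]) -> None:
        """Upsert by merging fields into the document, creating it if missing."""
        self.documents.setdefault(_id, {"_id": _id}).update(fields)


store = InMemoryCollection()
# The article item arrives first and is stored as a whole document.
store.replace("post-1", {"_id": "post-1", "heading": "Example heading"})
# The comments arrive later and are merged into the same document.
store.set_fields("post-1", {"comments": [{"author": "reader", "content": "Nice"}]})
# Comments for an article that was never stored still create a document.
store.set_fields("post-2", {"comments": []})

print(store.documents)
```

Because both writes are upserts on the same key, the order in which Scrapy delivers the comment items relative to each other does not matter; a whole-document replace, however, would discard comments that were merged before it.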

Register the pipeline in settings.py:

USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
ITEM_PIPELINES = {'logrocket.pipelines.MongoDBPipeline': 100}

Final test

Run the crawl command once more and check that the items are pushed to the database correctly:

scrapy crawl feature_article

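Summary

In this guide, you learned how to write a basic spider in Scrapy and persist the scraped data in a database (MongoDB). This is only the tip of the iceberg of what Scrapy can do as a web scraping tool, and there is much more to learn beyond what we covered here.
I hope this article gave you the basics of Scrapy and the motivation to dig deeper into this wonderful scraping framework.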