[Python Knowledge Base] Python + Selenium + Scrapy: An Example of Scraping, Analyzing, and Storing CSDN Blog Statistics


To keep the example simple, this article covers only the legacy CSDN blog homepage and a subset of its statistics.

Legacy CSDN blog homepages all have URLs of the form "https://blog.csdn.net/" + uid. This makes things easy: once the user ID is known, the same URL prefix can be reused.
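
As a minimal sketch of this URL scheme, the homepage address can be built by appending the user ID to the fixed prefix. The helper name `build_blog_url` is hypothetical and does not appear in the original code; the percent-encoding step is an added precaution, not something the original does.

```python
from urllib.parse import quote

BLOG_PREFIX = 'https://blog.csdn.net/'


def build_blog_url(uid):
    """Build a legacy blog homepage URL from a user ID (hypothetical helper)."""
    if not uid:
        raise ValueError('uid must be a non-empty string')
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return BLOG_PREFIX + quote(uid, safe='')


print(build_blog_url('qq_21264377'))
```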

Define a simple CSDN user ID class in csdnuser.py:

"""
@author: MR.N
@create: 2021-07-22 Thur. 12:12
"""


class CSDNUser:

    def __init__(self):
        self.uid = None
        self.blog = None

    def __del__(self):
        self.uid = None

    def get_uid(self):
        return self.uid

    def set_uid(self, uid=None):
        self.uid = uid

    def __str__(self):
        # str() keeps this safe while uid or blog is still None
        return str(self.uid) + ': ' + str(self.blog)

Define the CSDN loader class:

#!/usr/bin/env python3
from csdnuser import CSDNUser


class CSDNLoader:

    def __init__(self):
        self.user = CSDNUser()
        self.blog_prefix = 'https://blog.csdn.net/'

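    def __del__(self):
        # Runs when the instance is about to be destroyed
        print('[GC]', 'called')
        self.user = None
        self.blog_prefix = None

    def __delete__(self, instance):
        # Note: __delete__ is a descriptor hook, not a destructor, so it is
        # not invoked when a CSDNLoader instance is destroyed.
        # ...
        print('[GC-1]', 'called')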

    def set_uid(self, uid=None):
        self.user.set_uid(uid)

    def get_uid(self):
        return self.user.get_uid()

Import the Selenium and Scrapy modules, installed beforehand with pip install [package]:

from selenium import webdriver
from scrapy.selector import Selector

Use Selenium + Firefox + geckodriver to fetch the page source of a personal blog homepage:

class CSDNLoader:
    # ...
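    def load_index(self, timeout=6):
        if self.get_uid() is None or self.blog_prefix is None:
            print('[init-err]', 'uid/blog is none')
            return ''
        driver = None
        page_source = ''
        try:
            options = webdriver.FirefoxOptions()
            # Run Firefox without a window. The options.headless setter is
            # deprecated in newer Selenium 4 releases, so pass the flag instead.
            options.add_argument('-headless')
            driver = webdriver.Firefox(options=options)
            driver.set_page_load_timeout(timeout)
            driver.set_script_timeout(timeout)
            url = self.blog_prefix + self.get_uid()
            try:
                driver.get(url)
                # Only affects later element lookups; it does not wait for the page.
                driver.implicitly_wait(timeout)
            except Exception as err:
                # A load timeout may still leave a usable partial page.
                print('[load err]', err, 'handled.')
            page_source = driver.page_source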
        except Exception as err:
            print('[error]', err)
        finally:
            if driver is not None:
                try:
                    driver.quit()
                finally:
                    driver = None
        return page_source

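Use the "Inspect" tool in the right-click menu of Firefox or Chrome to find the HTML node path of the blog statistics. After converting that path to XPath, use Scrapy's Selector to retrieve the target nodes and their values.

XPath usage is not complicated. The simplest form is: //tag[@attribute="value"]/.../@attribute. If the expression ends with @attribute, getall() returns the list of attribute values of the final nodes; otherwise it returns the list of the final nodes themselves. get() can be used instead to retrieve a single final node or its attribute.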

    def explain(self, page_source=None):
        attrs = []
        if page_source is not None and page_source != '':
            sel = Selector(text=page_source)
            # XPath of the blog statistics nodes
            attrs = sel.xpath('//div[@id="asideProfile"]/'
                              'div[@class="data-info d-flex item-tiling"]/'
                              'dl[@class="text-center"]/@title').getall()
        else:
            print('[error]', 'Empty page source!')

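On the legacy page layout, the XPath above returns, in order: the number of original posts, the weekly ranking, the overall ranking (author leaderboard), the total number of visits, and the blog level.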

        # ...
        articles = attrs[0]
        weekly_ranking = attrs[1]
        total_ranking = attrs[2]
        visits = attrs[3]
        blog_level = attrs[4]
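
Indexing attrs directly raises an IndexError if the page layout changes and fewer values come back. A defensive variant could look like the sketch below. The names `FIELDS` and `to_stats` are hypothetical, and the sample values are copied from the example output further down; the sketch assumes the values arrive in the order described above.

```python
FIELDS = ('articles', 'weekly_ranking', 'total_ranking', 'visits', 'blog_level')


def to_stats(attrs):
    """Map the extracted title values to named fields (hypothetical helper).

    Returns None when fewer values than expected were extracted.
    """
    if attrs is None or len(attrs) < len(FIELDS):
        return None
    # zip pairs each field name with the value at the same position.
    return dict(zip(FIELDS, attrs))


stats = to_stats(['216', '6259', '12990', '49412', '5'])
print(stats['visits'])  # 49412
print(to_stats([]))     # None
```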

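One-off statistics are of limited use, so it is better to save the collected data. The next time the statistics are gathered, they can then be compared with the previous values.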

        # Previously saved values (the loading code is elided in the original)
        last_articles = ...
        last_weekly_ranking = ...
        last_total_ranking = ...
        last_visits = ...
        last_blog_level = ...
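
The original does not show how the previous values are stored or loaded. One possible approach, offered purely as an assumption, is a small JSON file. The helper names `save_stats` and `load_stats`, the file name, and the sample values are all hypothetical.

```python
import json
import os
import tempfile


def save_stats(path, stats):
    """Write the statistics dict to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)


def load_stats(path):
    """Return the previously saved dict, or {} if none is available."""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: treat as "no history".
        return {}


# Demo in a temporary directory; a real script would use a fixed path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stats.json')
    first_run = load_stats(path)  # nothing saved yet
    save_stats(path, {'articles': '216', 'visits': '49327'})
    last = load_stats(path)
    last_visits = last.get('visits')
```

Returning an empty dict on the first run lets the caller use `last.get(...)` without special-casing a missing file.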

Compare the new data with the old:

class CSDNLoader:
    # ...
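    def delta(self, a, b, reverse=False):
        """Return the change from b to a, e.g. '85↑'; '' if unchanged or invalid."""
        try:
            res = int(a) - int(b)
        except (TypeError, ValueError):
            return ''
        if res == 0:
            return ''
        # reverse=True flips the arrows, which is useful when a smaller
        # value is an improvement (e.g. a ranking).
        went_up = (res > 0) != reverse
        return str(abs(res)) + ('↑' if went_up else '↓')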
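
A few calls illustrate the behaviour. The function below re-implements the same rule as a module-level function rather than a method, purely for demonstration, and the input values are illustrative.

```python
def delta(a, b, reverse=False):
    """Standalone copy of the comparison rule, for demonstration only."""
    try:
        res = int(a) - int(b)
    except (TypeError, ValueError):
        return ''
    if res == 0:
        return ''
    went_up = (res > 0) != reverse
    return str(abs(res)) + ('↑' if went_up else '↓')


print(delta('49412', '49327'))              # 85↑
print(delta('6259', '6300', reverse=True))  # 41↑
print(repr(delta('216', '216')))            # ''
print(repr(delta('n/a', '216')))            # ''
```

With reverse=True, a ranking that drops numerically from 6300 to 6259 is reported with an up arrow, since a lower rank number means a better position.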

The result looks something like this:

[next-updated] 2021-08-19 Thu. 21:48:57
[uid] qq_21264377
[articles] 216?
[weekly-ranking] 6259?
[total-ranking] 12990?
[visits] 49412 85↑
[blog-level] 5?
[last-updated] 2021-08-19 Thu. 12:09:04


Added: 2021-08-20 15:03:18 Updated: 2021-08-20 15:04:34
