To keep the example simple, only the old-style CSDN blog homepage and a subset of its statistics are covered here.
Old-style CSDN blog homepages all follow the pattern "https://blog.csdn.net/" + uid, which is easy to work with: once you have a user's ID, the URL prefix can be reused.
Define a simple CSDN user ID class in csdnuser.py:
"""
@author: MR.N
@create: 2021-07-22 Thur. 12:12
"""
class CSDNUser:
def __init__(self):
self.uid = None
self.blog = None
def __del__(self):
self.uid = None
def get_uid(self):
return self.uid
def set_uid(self, uid=None):
self.uid = uid
def __str__(self):
return self.uid + ': ' + self.blog + ''
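A quick sanity check of the class (the uid is the one from the sample output at the end of this section):

user = CSDNUser()
user.set_uid('qq_21264377')
print(user)            # -> qq_21264377: None  (blog not set yet)
print(user.get_uid())  # -> qq_21264377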
Define the CSDN loader class:
#!/usr/bin/env python3
from csdnuser import CSDNUser


class CSDNLoader:
    def __init__(self):
        self.user = CSDNUser()
        self.blog_prefix = 'https://blog.csdn.net/'

    def __del__(self):
        print('[GC]', 'called')
        self.user = None
        self.blog_prefix = None

    def __delete__(self, instance):
        # Note: __delete__ belongs to the descriptor protocol and is only
        # invoked when the class is used as a data descriptor; it is not a
        # garbage-collection hook.
        print('[GC-1]', 'called')

    def set_uid(self, uid=None):
        self.user.set_uid(uid)

    def get_uid(self):
        return self.user.get_uid()
Import the Selenium and Scrapy modules, installed beforehand with pip install [package]:
from selenium import webdriver
from scrapy.selector import Selector
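For reference, the corresponding install commands are (geckodriver itself is a separate download and must be on the PATH for Firefox):

pip install selenium
pip install scrapy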
Use Selenium + Firefox + geckodriver to fetch the page source of the blog homepage:
class CSDNLoader:
    # ...
    def load_index(self, timeout=6):
        if self.get_uid() is None or self.blog_prefix is None:
            print('[init-err]', 'uid/blog is none')
            return ''
        driver = None
        page_source = ''
        try:
            options = webdriver.FirefoxOptions()
            options.headless = True  # run Firefox without a visible window
            driver = webdriver.Firefox(options=options)
            driver.set_page_load_timeout(timeout)
            driver.set_script_timeout(timeout)
            url = self.blog_prefix + self.get_uid()
            try:
                driver.get(url)
                driver.implicitly_wait(timeout)
            except Exception as err:
                # A page-load timeout is not fatal; whatever has rendered
                # so far is still available via page_source.
                print('[load err]', err, 'handled.')
            page_source = driver.page_source
        except Exception as err:
            print('[error]', err)
        finally:
            if driver is not None:
                try:
                    driver.quit()
                finally:
                    driver = None
        return page_source
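Putting the loader together, a minimal sketch of fetching the page source, assuming geckodriver is on the PATH:

if __name__ == '__main__':
    loader = CSDNLoader()
    loader.set_uid('qq_21264377')
    html = loader.load_index(timeout=6)
    print(len(html), 'characters of page source')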
Use the "Inspect" tool in the right-click menu of Firefox or Chrome to find the HTML node path of the blog statistics, convert it to an XPath, and extract the target nodes and their values with Scrapy's Selector. XPath usage is not complicated; the simplest form is //tag[@attr="value"]/.../@attr. When the expression ends in @attr, getall() returns the list of that attribute's values on the matched leaf nodes; otherwise it returns the list of the nodes themselves. get() fetches just the first matching node or attribute (a minimal get()/getall() demo follows the method below).
class CSDNLoader:
    # ...
    def explain(self, page_source=None):
        if page_source is not None and page_source != '':
            sel = Selector(text=page_source)
            # XPath of the blog statistics nodes
            attrs = sel.xpath('//div[@id="asideProfile"]/'
                              'div[@class="data-info d-flex item-tiling"]/'
                              'dl[@class="text-center"]/@title').getall()
            return attrs
        else:
            print('[error]', 'Empty page source!')
            return []
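As a minimal, self-contained illustration of the get()/getall() distinction described above (the HTML snippet here is made up for the demo and is not CSDN's actual markup):

from scrapy.selector import Selector

demo = Selector(text='<dl class="text-center" title="216"><dd>216</dd></dl>'
                     '<dl class="text-center" title="49412"><dd>49412</dd></dl>')
print(demo.xpath('//dl[@class="text-center"]/@title').getall())  # ['216', '49412']
print(demo.xpath('//dl[@class="text-center"]').getall())         # the <dl> nodes themselves
print(demo.xpath('//dl[@class="text-center"]/@title').get())     # '216'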
The XPath above yields, in order: the number of original posts, the weekly ranking, the total ranking (author leaderboard), the total visit count, and the blog level.
# ...
# attrs holds the five @title values returned by explain(), in this order:
articles = attrs[0]
weekly_ranking = attrs[1]
total_ranking = attrs[2]
visits = attrs[3]
blog_level = attrs[4]
A one-off snapshot of the statistics is of little value on its own, so it is better to save the data obtained; the next run can then be compared against it to show the differences (a persistence sketch follows the placeholders below).
last_articles = ...        # values from the previous run
last_weekly_ranking = ...
last_total_ranking = ...
last_visits = ...
last_blog_level = ...
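How the values are stored is left open above; a minimal sketch, assuming a local JSON file (the file name csdn_stats.json and the helper names are made up for illustration):

import json
import os

STATS_FILE = 'csdn_stats.json'  # hypothetical file name

def save_stats(stats):
    # stats: a dict such as {'articles': '216', 'visits': '49412', ...}
    with open(STATS_FILE, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False)

def load_last_stats():
    # Returns the dict saved by the previous run, or {} on the first run.
    if not os.path.exists(STATS_FILE):
        return {}
    with open(STATS_FILE, encoding='utf-8') as f:
        return json.load(f)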
Compare the new data against the old:
class CSDNLoader:
    # ...
    def delta(self, a, b, reverse=False):
        # Returns a signed difference such as '85↑' or '3↓', or '' when
        # the value is unchanged or not comparable. reverse=True flips
        # the arrows for metrics where a smaller number is better.
        try:
            res = int(a) - int(b)
            if res == 0:
                return ''
            if res > 0:
                return str(res) + ('↓' if reverse else '↑')
            return str(-res) + ('↑' if reverse else '↓')
        except (TypeError, ValueError):
            return ''
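A sketch of how delta() could drive the report shown next; the field labels match the sample output, but the report() helper and its dict keys are made up for illustration, and treating rankings as reverse metrics (smaller is better) is an assumption:

from datetime import datetime

def report(loader, current, last):
    print('[next-updated]', datetime.now().strftime('%Y-%m-%d %a. %H:%M:%S'))
    print('[uid]', loader.get_uid())
    print('[articles]', current['articles'],
          loader.delta(current['articles'], last.get('articles', current['articles'])))
    # Rankings improve as the number shrinks, hence reverse=True (assumption)
    print('[weekly-ranking]', current['weekly_ranking'],
          loader.delta(current['weekly_ranking'],
                       last.get('weekly_ranking', current['weekly_ranking']), reverse=True))
    print('[visits]', current['visits'],
          loader.delta(current['visits'], last.get('visits', current['visits'])))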
The output looks something like this:
[next-updated] 2021-08-19 Thu. 21:48:57
[uid] qq_21264377
[articles] 216
[weekly-ranking] 6259?
[total-ranking] 12990?
[visits] 49412 85↑
[blog-level] 5
[last-updated] 2021-08-19 Thu. 12:09:04