Scraping deck snapshots from the official Gwent site with the selenium library and XPath
1 Objective
Use Python and a few libraries to scrape decks from the official site, storing the deck snapshots and links in separate directories.
2 Approach
- Python is a scripting language, which makes it well suited to highly repetitive work;
- With the selenium library we can instantiate a browser object that runs in the background to pull pages and download videos, images, and other files;
- Libraries such as BeautifulSoup can parse the DOM and extract what we need, but here we only need the information of a few buttons on the page, and the PlayGwent site is laid out very cleanly, so pressing F12 and copying each element's XPath directly is enough (a minimal sketch follows).
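The sketch below shows that workflow end to end, assuming chromedriver is on PATH and the selenium 3 find_element_by_xpath API used in section 4; the XPath is the one for the first library tab, copied from DevTools:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome with no visible window
driver = webdriver.Chrome(options=options)
driver.get('https://www.playgwent.com/en/decks/')
# XPath copied via DevTools (F12 -> right-click element -> Copy -> Copy XPath)
driver.find_element_by_xpath('//*[@id="root"]/div/div[2]/div[1]/ul/li[1]').click()
driver.quit()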
3 Overall Design
- Obtain the XPaths separately for each library and each faction;
Figure 1. library
Figure 2. faction
- Initialize date, deck_url, libraries_xpath_dict, and factions_xpath_dict: date comes straight from a library function call, deck_url is the address of the PlayGwent official site's deck page, libraries_xpath_dict is a dict of name-xpath pairs for the three library buttons, and factions_xpath_dict is a dict of name-xpath pairs for the all button plus the six faction buttons;
- First create the directories: use the os library and the mk_dirs method to create directories organized by date, library, and faction (a sample of the resulting layout is sketched below);
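For example, with page = 1 a single day's run produces a layout like this (the date value is just an illustration):

decks/
    2021-09-18/                (date)
        treading decks/
            all/
                1.png ... 10.png
                deck links.txt
            Skellige/
            ...
        new arrivals/
            ...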
- For each library and faction, call the get_gwent_decks method to grab the deck snapshots and links; the page parameter passed in also decides how many pages are scraped per call. The returned links list is passed to the mk_links_txt method, which writes the links into a txt file (a sample is shown after Figure 3; this step could also be folded into get_gwent_decks);
Figure 3. Actual result
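The generated deck links.txt numbers each captured link, one per line (URLs elided here):

1: https://www.playgwent.com/en/decks/...
2: https://www.playgwent.com/en/decks/...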
- Each scraping run first initializes the XPath of every deck entry and the name and location for the png snapshots; it then sets up the driver: adds a userAgent, enables headless mode, and resizes the driver window so a full-length screenshot can be taken (this actually happens after startup). It then opens the target page, clicks the cookies button, selects the library and faction, grabs the deck snapshots and links, and finally returns the list of links.
Figure 4. use cookies
4 Implementation
import os
import time
import datetime
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options
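# today's date in YYYY-MM-DD form, used as the per-day directory under decks/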
date = str(datetime.datetime.now()).split()[0]
deck_url = 'https://www.playgwent.com/en/decks/'
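# name -> XPath for the library tabs; the top decks tab is skipped on purpose
# (see the get_gwent_decks docstring)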
libraries_xpath_dict = {
'treading decks': '//*[@id="root"]/div/div[2]/div[1]/ul/li[1]',
'new arrivals': '//*[@id="root"]/div/div[2]/div[1]/ul/li[2]'
}
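# name -> XPath for the all filter and the six faction filters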
factions_xpath_dict = {
'all': '//*[@id="root"]/div/section/div[1]/div[1]/div[1]',
'Skellige': '//*[@id="root"]/div/section/div[1]/div[1]/div[2]',
'Scoia\'tael': '//*[@id="root"]/div/section/div[1]/div[1]/div[3]',
'Monsters': '//*[@id="root"]/div/section/div[1]/div[1]/div[4]',
'Northern Realms': '//*[@id="root"]/div/section/div[1]/div[1]/div[5]',
'Nilfgaard': '//*[@id="root"]/div/section/div[1]/div[1]/div[6]',
'Syndicate': '//*[@id="root"]/div/section/div[1]/div[1]/div[7]'
}
def mk_dirs(library, faction):
    # makedirs builds the whole decks/<date>/<library>/<faction> chain at once
    path = fr'decks/{date}/{library}/{faction}'
    os.makedirs(path, exist_ok=True)
    return path
def get_gwent_decks(library, faction, page):
    """Grab deck snapshots and the corresponding links.

    :param library: which library tab to browse; top decks is skipped because
        some of its decks are far too old
    :param faction: faction filter to apply
    :param page: how many result pages to scrape
    :return: list of deck links
    """
    links = []
    # each results page lists ten decks, sitting in div[2]..div[11]
    elements = [f'//*[@id="root"]/div/section/div[2]/div[{x}]' for x in range(2, 12)]
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    # the DesiredCapabilities userAgent key is a PhantomJS idiom; Chrome takes
    # the user agent as a command-line switch instead
    chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) '
                                'Chrome/93.0.4577.63 Safari/537.36')
    driver = webdriver.Chrome(options=chrome_options)
    driver.implicitly_wait(20)
    driver.get(url=deck_url)
    driver.refresh()
    # dismiss the Cookiebot consent dialog, which otherwise intercepts every click
    driver.find_element_by_xpath('//*[@id="CybotCookiebotDialogBodyButtonAccept"]').click()
    # stretch the window to the full document size so screenshots capture the whole page
    scroll_width = driver.execute_script('return document.body.parentNode.scrollWidth')
    scroll_height = driver.execute_script('return document.body.parentNode.scrollHeight')
    driver.set_window_size(scroll_width, scroll_height)
    driver.find_element_by_xpath(libraries_xpath_dict.get(library)).click()
    driver.find_element_by_xpath(factions_xpath_dict.get(faction)).click()
    for p in range(page):
        for i in range(10):
            try:
                driver.find_element_by_xpath(elements[i]).click()
                time.sleep(2)
                links.append(str(driver.current_url))
                driver.save_screenshot(fr'decks/{date}/{library}/{faction}/{p * 10 + i + 1}.png')
                driver.back()
                driver.implicitly_wait(20)
            except (IOError, WebDriverException) as err:
                print(err)
        if p < page - 1:
            # the second <a> in the pager is the "next page" button
            driver.find_element_by_xpath('//*[@id="root"]/div/section/div[3]/a[2]').click()
    driver.quit()
    return links
def mk_links_txt(save_path, links):
    # write one numbered line per captured deck link
    with open(fr'{save_path}/deck links.txt', 'w') as file:
        for i, link in enumerate(links, start=1):
            file.write(f'{i}: {link}\n')
def main():
    page = 1
    libraries = ['treading decks', 'new arrivals']
    factions = ['all', 'Skellige', 'Scoia\'tael', 'Monsters', 'Northern Realms', 'Nilfgaard', 'Syndicate']
    for library in libraries:
        for faction in factions:
            # mk_dirs creates all parent directories, so one call per leaf suffices
            save_path = mk_dirs(library, faction)
            links = get_gwent_decks(library, faction, page)
            mk_links_txt(save_path, links)
if __name__ == '__main__':
main()
5 Shortcomings and Limitations
I wrote this little crawler on a whim and never really revised it, so some problems remain. For example, the version of selenium I used no longer supports PhantomJS, so I fell back to Chrome; but Chrome cannot take a true full-page screenshot the way PhantomJS could, so I resort to resizing the window, and when someone's deck description runs extremely long it still does not fit in the frame. Pairing an older selenium release with PhantomJS would yield perfect deck snapshots.
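A possible middle ground without PhantomJS, sketched below under the assumption of a recent Chrome/chromedriver pair and a selenium build that exposes execute_cdp_cmd (untested for this crawler): the DevTools Page.captureScreenshot command can render past the viewport, which avoids the window-resizing trick entirely.

import base64
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.playgwent.com/en/decks/')
# captureBeyondViewport asks Chrome to render the whole document, not just the viewport
shot = driver.execute_cdp_cmd('Page.captureScreenshot', {'captureBeyondViewport': True})
with open('full_page.png', 'wb') as f:
    f.write(base64.b64decode(shot['data']))
driver.quit()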