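VI. Page Analysis
1. Define the scraping target
Looking at the target site, what we want to scrape is the images, so we need to find each image's URL.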
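2. Extract the useful information from the page source
To scrape the information we need, we first fetch the page source and then extract from it. (In PyCharm, press Ctrl+F to search.)
The first img tag we find (its src contains qrcode, which usually means a QR code) is not what we need, so we keep looking through the img tags. What we are after is the src attribute of the right img tag.
To download the complete set this way, we would need the detail-page URL of every image, then visit each detail page (which also means choosing an image resolution), extract the src there, and download the images one by one. That is slow and tedious.
So we look more closely at the page for the first image of the set and try to find information about the other images, hoping to fetch the whole set in one go. The whole set appears as thumbnails in the strip below "壁纸下载" (wallpaper download), so the page source must contain information about all of these images, and we can extract the entire set from the page source at once.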
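As a rough illustration of the first step (collecting img src values and skipping the QR code), here is a minimal sketch using only the standard library. The HTML fragment and the example.com URLs are made up for the demonstration, and the rule "a QR code has qrcode in its URL" is an assumption based on the observation above, not something the site guarantees.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, skipping QR codes."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: QR-code images have "qrcode" somewhere in their URL.
        if src and "qrcode" not in src.lower():
            self.sources.append(src)


# Hypothetical HTML fragment, not copied from the real page.
sample_html = """
<div><img src="https://example.com/static/qrcode.png"></div>
<div><img src="https://example.com/wallpaper/1.jpg"></div>
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.sources)
```

On a real page the same collector could be fed resp.text, but as noted above this only yields the images present on that one page.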
In page source, <script> tags generally hold the page's scripts. One of them here contains the image resolutions:
var userid = get_cookie('zol_userid');
var deskPicArr = {"list":[
{"picId":"114228","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosWILofpAAVFdZ1aXQ0AACFOwLz__wABUWN167.jpg"},
{"picId":"114226","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YoryIZomlAARkWvUPATgAACFOwH5-iUABGRy471.jpg"},
{"picId":"114227","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosCIb64VAASZqPCMZE4AACFOwJ7v8IABJnA485.jpg"},
{"picId":"114229","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosqIXgdTAAVjcpuyCSoAACFOwOKpp8ABWOK311.jpg"},
{"picId":"114230","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9Yos6IHodkAAjdOsbssJ8AACFPAAhyQIACN1S454.jpg"},
{"picId":"114231","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YotSIZPSPAAjfAbBy5BYAACFPAFFNJYACN8Z978.jpg"},
{"picId":"114232","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YotmIcOetAAb6VYxMGX4AACFPAJtMAMABvpt418.jpg"},
{"picId":"114233","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9Yot6ICVIVAARje2492PsAACFPANr_wcABGOT211.jpg"}
]};
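The script text above can be turned into a Python dictionary by cutting out the object literal with a regular expression and parsing it as JSON. The following is a minimal sketch on a shortened, made-up stand-in for the real script (the example.com URL and the single entry are assumptions for illustration). It also shows that json.loads turns the escaped `\/` sequences back into plain slashes.

```python
import json
import re

# Shortened, hypothetical stand-in for the real <script> content.
sample_script = r'''
var userid = get_cookie('zol_userid');
var deskPicArr = {"list":[{"picId":"1","oriSize":"4800x3840","imgsrc":"http:\/\/example.com\/t_s##SIZE##\/a.jpg"}]};
'''

# Capture everything from the opening brace up to the first "};".
pattern = re.compile(r"var deskPicArr\s*=\s*(?P<data>\{.*?\});", re.S)
match = pattern.search(sample_script)

# The captured text is valid JSON, so json.loads can parse it directly.
data = json.loads(match.group("data"))
first = data["list"][0]
print(first["oriSize"])
print(first["imgsrc"])
```

On a real page, pattern.search can return None when the variable is missing, so checking the match before calling group is a sensible precaution.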
In this dictionary, the imgsrc key holds the image URL, and all eight images of the set are present. Now look at the value of imgsrc:
http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosWILofpAAVFdZ1aXQ0AACFOwLz__wABUWN167.jpg
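Opening this URL as it stands shows nothing. The ##SIZE## part of t_s##SIZE## is evidently a resolution parameter, and once a resolution is filled in, the image loads.
After all this searching and observing, we can now take the image URLs from var deskPicArr inside the script tag in the page source: extract each imgsrc, fill in the resolution we want, and download the whole set in one pass. The code is as follows: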
import json
import os
import re

import requests

url = "https://desk.zol.com.cn/bizhi/9374_114228_2.html"
resp = requests.get(url)
resp.encoding = "gbk"

# Capture the object assigned to deskPicArr inside the <script> tag.
obj = re.compile(r"var deskPicArr.*?=(?P<deskPicArr>.*?);", re.S)
result = obj.search(resp.text)
deskPicStr = result.group("deskPicArr")
deskPic = json.loads(deskPicStr)

# Create the output folder if it does not exist yet.
os.makedirs("picture", exist_ok=True)

for item in deskPic['list']:
    oriSize = item.get("oriSize")
    imgsrc = item.get("imgsrc")
    # Fill the ##SIZE## placeholder with the original resolution.
    imgsrc = imgsrc.replace("##SIZE##", oriSize)
    name = imgsrc.split("/")[-1]
    resp_img = requests.get(imgsrc)
    with open(f"picture/{name}", mode="wb") as f:
        f.write(resp_img.content)
The images do download, but the script also reports an error. A blog post that addresses the problem: https://blog.csdn.net/wancongconga/article/details/111030335
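Separately from that error, the ##SIZE## substitution can be isolated into a small helper that also lets you pick a resolution from resAll instead of always using oriSize. This is only a sketch: fill_size is a hypothetical helper name, the sample entry uses a made-up example.com URL, and whether the image server accepts every listed resolution is an assumption that has not been checked here.

```python
def fill_size(item, preferred=None):
    """Return the item's imgsrc with ##SIZE## replaced by a resolution.

    Uses `preferred` when the item lists it in resAll, otherwise oriSize.
    """
    size = preferred if preferred in item.get("resAll", []) else item["oriSize"]
    return item["imgsrc"].replace("##SIZE##", size)


# Hypothetical entry shaped like one element of deskPicArr["list"].
item = {
    "oriSize": "4800x3840",
    "resAll": ["1920x1080", "1366x768"],
    "imgsrc": "http://example.com/t_s##SIZE##/a.jpg",
}

print(fill_size(item, "1920x1080"))  # listed in resAll, so it is used
print(fill_size(item, "640x480"))    # not listed, falls back to oriSize
```

Falling back to oriSize keeps the behaviour of the original script whenever the requested resolution is not offered.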
3. Exercise
Scrape all the Minions wallpapers on this page.
import json
import os
import re

import requests
from lxml import etree

url = "https://desk.zol.com.cn/dongman/xiaohuangren/"
string = "https://desk.zol.com.cn"
resp = requests.get(url)
resp.encoding = "gbk"

# Collect the relative detail-page links of the wallpaper sets on the list page.
et = etree.HTML(resp.text)
result = et.xpath("//ul[@class='pic-list2 clearfix']/li/a/@href")
obj = re.compile(r"var deskPicArr.*?=(?P<deskPicArr>.*?);", re.S)

# Create the output folder if it does not exist yet.
os.makedirs("picture", exist_ok=True)

# Visit result[1] through result[7], one detail page per wallpaper set.
for i in range(1, 8):
    url1 = string + result[i]
    resp_page = requests.get(url1)
    resp_page.encoding = "gbk"
    result_img = obj.search(resp_page.text)
    deskPicStr = result_img.group("deskPicArr")
    deskPic = json.loads(deskPicStr)
    for item in deskPic['list']:
        oriSize = item.get("oriSize")
        imgsrc = item.get("imgsrc")
        # Fill the ##SIZE## placeholder with the original resolution.
        imgsrc = imgsrc.replace("##SIZE##", oriSize)
        name = imgsrc.split("/")[-1]
        resp_img = requests.get(imgsrc)
        with open(f"picture/{name}", mode="wb") as f:
            f.write(resp_img.content)
Hmm, the same error as above shows up again.
But the scrape succeeded~
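One small variation on the exercise code: instead of concatenating the site prefix by hand and looping over a fixed index range, the detail-page URLs can be built with urljoin for every link the XPath query returns. A minimal sketch follows; the hrefs list is a hypothetical stand-in for the XPath result, and its second entry is invented purely for illustration.

```python
from urllib.parse import urljoin

list_url = "https://desk.zol.com.cn/dongman/xiaohuangren/"

# Hypothetical hrefs in the shape the XPath query is expected to return.
hrefs = ["/bizhi/9374_114228_2.html", "/bizhi/0000_000000_2.html"]

# urljoin resolves each relative link against the list page's URL.
detail_urls = [urljoin(list_url, href) for href in hrefs]

for detail_url in detail_urls:
    print(detail_url)
```

Iterating over the list itself avoids an IndexError when the page holds fewer links than a hard-coded range expects.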