6. Page analysis
1. Define the scraping target
Looking at the target site, what we want to scrape is the images, so we need to find each image's URL. ![image](https://img-blog.csdnimg.cn/7425c803d1064d35b9b0bfba90a4c354.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)
2. Extract the useful information from the page source
To scrape the information we need, we first fetch the page source and then extract what we want from it. (In PyCharm, search with Ctrl+F.) ![image](https://img-blog.csdnimg.cn/072b4129fe2340ecac5158100e1f329c.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)
The image sits in an `img` tag (an `img` whose URL contains `qrcode` is usually a QR code). ![image](https://img-blog.csdnimg.cn/97b3aef8c8904d4ca44b6d1f93cfe61f.png) ![image](https://img-blog.csdnimg.cn/dde929530e9b4cbda1ca29b44f30befc.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16) That one is not what we need, so keep looking through the other `img` tags. ![image](https://img-blog.csdnimg.cn/2f56567f93c9449493951381d354d6ac.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16) ![image](https://img-blog.csdnimg.cn/7ad1c8e11f1f4f90922b258bfaedd50b.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16) So what we need is the `src` attribute of the `img` tag.

To download the whole set this way, we would need the detail-page URL of every image, visit each detail page (which also means choosing an image resolution), extract the `src` from each one, and download the images one by one.

That is slow and tedious. Let's look more closely at the page for the first image in the set and try to find information about the other images, so that we can grab the whole set in one go. The row beneath "壁纸下载" (Wallpaper download) shows thumbnails for the entire set, so the page source must contain information about those images, and we can extract the whole set from the page source in a single pass. ![image](https://img-blog.csdnimg.cn/397ff9affe76473da258bd8c2f061423.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)
In page source, `<script>` tags generally hold the page's scripts. Here we can see the image resolutions inside one of them: ![image](https://img-blog.csdnimg.cn/8b73d5df366e4f188a979f6697a342a2.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)
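The "collect every `img` tag's `src` and skip the QR code" step above can be sketched roughly as follows. This is only an illustration: it uses the standard-library `html.parser` instead of a real page fetch, and `SAMPLE_HTML`, `ImgSrcParser`, and `wallpaper_srcs` are made-up names, not anything taken from the ZOL page.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Hypothetical markup standing in for the real page source
SAMPLE_HTML = """
<div><img src="https://example.com/static/qrcode.png"></div>
<div><img src="https://example.com/wallpapers/a_1920x1080.jpg"></div>
"""


def wallpaper_srcs(html):
    """Return img src values, dropping anything that looks like a QR code."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [s for s in parser.srcs if "qrcode" not in s.lower()]


srcs = wallpaper_srcs(SAMPLE_HTML)
print(srcs)
```

Filtering on the substring `qrcode` is just the heuristic mentioned above; on a real page you would still want to eyeball the results.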
var userid = get_cookie('zol_userid');
var deskPicArr = {"list":[{"picId":"114228","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosWILofpAAVFdZ1aXQ0AACFOwLz__wABUWN167.jpg"},{"picId":"114226","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YoryIZomlAARkWvUPATgAACFOwH5-iUABGRy471.jpg"},{"picId":"114227","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosCIb64VAASZqPCMZE4AACFOwJ7v8IABJnA485.jpg"},{"picId":"114229","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosqIXgdTAAVjcpuyCSoAACFOwOKpp8ABWOK311.jpg"},{"picId":"114230","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9Yos6IHodkAAjdOsbssJ8AACFPAAhyQIACN1S454.jpg"},{"picId":"114231","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YotSIZPSPAAjfAbBy5BYAACFPAFFNJYACN8Z978.jpg"},{"picId":"114232","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YotmIcOetAAb6VYxMGX4AACFPAJtMAMABvpt418.jpg"},{"picId":"114233","oriSize":"4800x3840","resAll":["4096x2160","2880x1800","2560x1600","2560x1440","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9Yot6ICVIVAARje2492PsAACFPANr_wcABGOT211.jpg"}]};
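Each entry in this dictionary's `list` has an `imgsrc` key whose value is the image URL, and all eight images of the set are there. Now take a closer look at the content of `imgsrc`.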
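To see how such a `var deskPicArr = {...};` assignment can be turned into Python data without touching the network, here is a small sketch. `SAMPLE_SCRIPT` is a shortened, made-up stand-in for the real script text (the host `img.example.com` and the file names are placeholders), and `parse_desk_pic_arr` is a hypothetical helper name. The regular expression is a slightly stricter variant chosen for this sketch.

```python
import json
import re

# Shortened, hypothetical stand-in for the inline script shown above
SAMPLE_SCRIPT = r'''
var userid = get_cookie('zol_userid');
var deskPicArr = {"list":[{"picId":"1","oriSize":"4800x3840","resAll":["1920x1080","1024x768"],"imgsrc":"http:\/\/img.example.com\/t_s##SIZE##\/a.jpg"},{"picId":"2","oriSize":"4800x3840","resAll":["1920x1080","1024x768"],"imgsrc":"http:\/\/img.example.com\/t_s##SIZE##\/b.jpg"}]};
'''

# Capture the object literal between "var deskPicArr =" and the closing "};"
DESK_PIC_RE = re.compile(
    r"var\s+deskPicArr\s*=\s*(?P<deskPicArr>\{.*?\})\s*;", re.S
)


def parse_desk_pic_arr(script_text):
    """Return the 'list' entries of deskPicArr as Python dicts."""
    match = DESK_PIC_RE.search(script_text)
    if match is None:
        raise ValueError("deskPicArr not found in the given text")
    # The literal happens to be valid JSON; json.loads also turns "\/" into "/"
    return json.loads(match.group("deskPicArr"))["list"]


items = parse_desk_pic_arr(SAMPLE_SCRIPT)
for item in items:
    print(item["picId"], item["imgsrc"])
```

This only works because the object literal happens to be valid JSON. A script that used single quotes or unquoted keys would need a different approach.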
http:\/\/desk-fd.zol-img.com.cn\/t_s##SIZE##\/g6\/M00\/01\/06\/ChMkKV9YosWILofpAAVFdZ1aXQ0AACFOwLz__wABUWN167.jpg
Opening that URL directly shows nothing. The `##SIZE##` in `t_s##SIZE##` turns out to be a placeholder for the resolution. ![image](https://img-blog.csdnimg.cn/a6eff951b96844fcaee676d05a96b097.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16) Once a resolution is filled in, the image loads successfully. ![image](https://img-blog.csdnimg.cn/40ca725fc9044e50b851e73215b267fc.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)

After all this searching and observing, we now know how to get the whole set in one pass: extract `var deskPicArr` from the `script` tag in the page source, take each `imgsrc` from it, and fill in the resolution we want. The code is as follows:
import json
import os
import re

import requests

url = "https://desk.zol.com.cn/bizhi/9374_114228_2.html"
resp = requests.get(url)
resp.encoding = "gbk"  # decode the page as GBK
obj = re.compile(r"var deskPicArr.*?=(?P<deskPicArr>.*?);", re.S)  # grab the object assigned to deskPicArr
result = obj.search(resp.text)
deskPicStr = result.group("deskPicArr")
deskPic = json.loads(deskPicStr)  # json.loads also turns "\/" into "/"
os.makedirs("picture", exist_ok=True)  # make sure the output folder exists
for item in deskPic['list']:
    oriSize = item.get("oriSize")
    imgsrc = item.get("imgsrc")
    imgsrc = imgsrc.replace("##SIZE##", oriSize)  # fill in the original resolution
    name = imgsrc.split("/")[-1]  # use the last path segment as the file name
    resp_img = requests.get(imgsrc)
    with open(f"picture/{name}", mode="wb") as f:
        f.write(resp_img.content)
The images are downloaded, but an error is reported, as shown here: ![image](https://img-blog.csdnimg.cn/b751aec549744e4ca569d606ba84a6be.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16) A blog post on how to resolve it: https://blog.csdn.net/wancongconga/article/details/111030335
3. Exercise
Scrape all the Minions wallpapers on this page. ![image](https://img-blog.csdnimg.cn/7125beec34b5447f9295f14f56461d7b.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)
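Separately from that error, the `##SIZE##` substitution can be pulled out into a small helper that also lets you pick a resolution other than `oriSize`. This is a sketch under assumptions: `build_image_url` is a hypothetical name, the sample item uses a placeholder host, and restricting the choice to `oriSize` plus the values in `resAll` is my guess at what is sensible, not something the site is known to enforce.

```python
def build_image_url(item, size=None):
    """Fill the ##SIZE## placeholder in item['imgsrc'].

    With size=None the item's oriSize is used, as in the script above.
    An explicit size must be oriSize or appear in resAll (an assumption).
    """
    if size is None:
        size = item["oriSize"]
    allowed = set(item.get("resAll", [])) | {item["oriSize"]}
    if size not in allowed:
        raise ValueError(f"{size} is not offered for picId {item.get('picId')}")
    return item["imgsrc"].replace("##SIZE##", size)


# Hypothetical entry shaped like one element of deskPicArr["list"]
sample_item = {
    "picId": "1",
    "oriSize": "4800x3840",
    "resAll": ["1920x1080", "1024x768"],
    "imgsrc": "http://img.example.com/t_s##SIZE##/a.jpg",
}

print(build_image_url(sample_item))
print(build_image_url(sample_item, "1920x1080"))
```

Downloading at a smaller resolution than `oriSize` keeps the files smaller, which can be handy while you are still testing the scraper.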
import json
import os
import re

import requests
from lxml import etree

url = "https://desk.zol.com.cn/dongman/xiaohuangren/"
string = "https://desk.zol.com.cn"  # site root, prepended to the relative links
resp = requests.get(url)
resp.encoding = "gbk"
et = etree.HTML(resp.text)
result = et.xpath("//ul[@class='pic-list2 clearfix']/li/a/@href")  # detail-page links
obj = re.compile(r"var deskPicArr.*?=(?P<deskPicArr>.*?);", re.S)
os.makedirs("picture", exist_ok=True)  # make sure the output folder exists
for i in range(1, 8):  # entries 1 to 7 of the link list; adjust to the page
    url1 = string + result[i]
    resp_page = requests.get(url1)
    resp_page.encoding = "gbk"
    result_img = obj.search(resp_page.text)
    deskPicStr = result_img.group("deskPicArr")
    deskPic = json.loads(deskPicStr)
    for item in deskPic['list']:  # every image in this wallpaper set
        oriSize = item.get("oriSize")
        imgsrc = item.get("imgsrc")
        imgsrc = imgsrc.replace("##SIZE##", oriSize)
        name = imgsrc.split("/")[-1]
        resp_img = requests.get(imgsrc)
        with open(f"picture/{name}", mode="wb") as f:
            f.write(resp_img.content)
Hmm, the same error as above shows up again.
![image](https://img-blog.csdnimg.cn/09e6fc085b8f4c738b75daf4a33be551.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16) The scrape succeeded! ![image](https://img-blog.csdnimg.cn/37217b129bad495980ba7f55e82f4241.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5Z6D5Zy-5qG26YeM5Lmf5oy65aW9,size_20,color_FFFFFF,t_70,g_se,x_16)
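One spot in the exercise that could be made a little sturdier is the link handling: the script glues the site root onto each relative link by hand and loops over a fixed index range. A possible alternative, sketched here with the standard library only (so `html.parser` stands in for the `lxml` XPath, and it collects every `<a>` inside the list rather than only `li/a`), is to gather all the links and resolve them with `urllib.parse.urljoin`. `SAMPLE_LIST_HTML`, `ListLinkParser`, and `detail_urls` are made-up names, and the sample links are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <ul class="... pic-list2 ...">."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0  # greater than 0 while inside the target <ul>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            if self.ul_depth or "pic-list2" in classes:
                self.ul_depth += 1
        elif tag == "a" and self.ul_depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth:
            self.ul_depth -= 1


# Hypothetical markup standing in for the list page
SAMPLE_LIST_HTML = """
<ul class="nav"><li><a href="/about.html">about</a></li></ul>
<ul class="pic-list2 clearfix">
  <li><a href="/bizhi/1001_1.html">set 1</a></li>
  <li><a href="/bizhi/1002_1.html">set 2</a></li>
</ul>
"""


def detail_urls(list_html, base_url):
    """Return absolute detail-page URLs for every link in the picture list."""
    parser = ListLinkParser()
    parser.feed(list_html)
    return [urljoin(base_url, href) for href in parser.hrefs]


urls = detail_urls(SAMPLE_LIST_HTML, "https://desk.zol.com.cn/dongman/xiaohuangren/")
for u in urls:
    print(u)
```

Iterating over whatever links were found avoids an `IndexError` when a page has fewer entries than the hard-coded range expects.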