As shown in the figure, the goal is to crawl all of the images on the page, not the text. Inspecting the page with a packet-capture tool shows that every image sits inside a div with class="thumb", so the image srcs can be extracted with the regular expression
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
The code below crawls the images on the current page:
import requests
import re
import os

if __name__ == '__main__':
    # Create the output directory on the first run
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 '
                      'Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400'
    }
    url = 'https://www.qiushibaike.com/imgrank/'
    page_text = requests.get(url=url, headers=headers).text
    # re.S lets '.' match newlines, so each multi-line div.thumb block is matched
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    for src in img_src_list:
        # The srcs on the page are protocol-relative, so prepend the scheme
        src = 'https:' + src
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        imgPath = './qiutuLibs/' + img_name
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded successfully!')
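The key to the extraction above is the re.S flag, which lets '.' cross line breaks so a multi-line div.thumb block can be matched. A minimal, self-contained check against a hypothetical snippet that mimics the page structure described above (the sample URLs are made up):

```python
import re

# Hypothetical HTML mimicking the div.thumb structure on the page
sample = '''
<div class="thumb">
<a href="/article/1"><img src="//pic.example.com/a.jpg" alt="pic a"></a>
</div>
<div class="thumb">
<a href="/article/2"><img src="//pic.example.com/b.jpg" alt="pic b"></a>
</div>
'''

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

# With re.S, '.' also matches newlines, so each block is matched
print(re.findall(ex, sample, re.S))
# → ['//pic.example.com/a.jpg', '//pic.example.com/b.jpg']

# Without re.S, '.' stops at newlines and nothing matches
print(re.findall(ex, sample))
# → []
```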
To crawl the data page by page, we need to build a URL template. Clicking through to page 2 and page 3 of Qiushibaike and observing how the address bar changes, the template we build can be
url = 'https://www.qiushibaike.com/imgrank/page/%d/'
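The %d placeholder takes the page number via Python's % string formatting, which can be checked quickly:

```python
url = 'https://www.qiushibaike.com/imgrank/page/%d/'

# Filling in the page number produces each page's address
for pageNum in range(1, 4):
    print(url % pageNum)
# → https://www.qiushibaike.com/imgrank/page/1/
# → https://www.qiushibaike.com/imgrank/page/2/
# → https://www.qiushibaike.com/imgrank/page/3/
```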
The complete code is as follows:
import requests
import re
import os

if __name__ == '__main__':
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 '
                      'Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400'
    }
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    # Crawl pages 1 and 2; widen the range to fetch more pages
    for pageNum in range(1, 3):
        new_url = url % pageNum
        page_text = requests.get(url=new_url, headers=headers).text
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        for src in img_src_list:
            src = 'https:' + src
            img_data = requests.get(url=src, headers=headers).content
            img_name = src.split('/')[-1]
            imgPath = './qiutuLibs/' + img_name
            with open(imgPath, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded successfully!')
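One fragile spot in the code above is naming each file with src.split('/')[-1]: if an image src ever carries a query string (e.g. a resize parameter), that string ends up in the filename. A possible refinement, not part of the original code, is to build the save path with urlparse and os.path.join; the helper name and sample URLs below are hypothetical:

```python
import os
from urllib.parse import urlparse

def local_path(src, out_dir='./qiutuLibs'):
    """Turn an image src into a local save path, dropping any query string."""
    # urlparse(...).path is the URL path without '?...' suffixes,
    # and basename keeps only the final path component
    img_name = os.path.basename(urlparse(src).path)
    return os.path.join(out_dir, img_name)

# Query string is stripped from the resulting filename
print(local_path('https://pic.example.com/article/a.jpg?imageView=1'))
```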