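Environment setup
Parsing principle: an HTML document is represented as a tree
- Instantiate an etree object and load the page source to be parsed into it
- Call the etree object's xpath method with an xpath expression to locate tags and extract data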
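The two steps above can be sketched offline on a small inline document. The `page_text` string here is a made-up stand-in for a downloaded page, not something from the original notes.

```python
from lxml import etree

# Hypothetical page source standing in for a downloaded page.
page_text = """
<html>
  <body>
    <div id="main">
      <p>first</p>
      <p>second</p>
    </div>
  </body>
</html>
"""

# Step 1: load the page source into an etree object.
tree = etree.HTML(page_text)

# Step 2: locate tags and extract data with an xpath expression.
paragraphs = tree.xpath('//div[@id="main"]/p/text()')
print(paragraphs)
```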
Instantiating the etree object
- etree.parse('filename'): loads a local HTML document into the object
- etree.HTML('page_text'): loads page data fetched from a website into the object
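A minimal sketch of both loading routes follows. It assumes a throwaway temporary file in place of a real local document. It also passes `etree.HTMLParser()` to `etree.parse`, because without an explicit parser `etree.parse` uses the XML parser, which tends to reject real-world HTML such as an unclosed `<br>`.

```python
import os
import tempfile

from lxml import etree

# Hypothetical HTML used for both loading routes.
html_source = "<html><body><h1>Local page</h1><br></body></html>"

# Write the HTML to a temporary file to stand in for a local document.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as fp:
    fp.write(html_source)
    file_path = fp.name

try:
    # Route 1: load a local file; the HTML parser tolerates the unclosed <br>.
    tree_from_file = etree.parse(file_path, parser=etree.HTMLParser())
finally:
    os.remove(file_path)

# Route 2: load page source that is already held in a string.
tree_from_text = etree.HTML(html_source)

file_titles = tree_from_file.xpath('//h1/text()')
text_titles = tree_from_text.xpath('//h1/text()')
print(file_titles, text_titles)
```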
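Tag locating
- Leftmost ==/==: an xpath expression that begins with / must locate the target tag starting from the root tag
- Non-leftmost ==/==: denotes one level of the hierarchy
- Non-leftmost ==//==: denotes any number of levels
- Attribute locating: tagName[@attrName='value']
- Index locating: tag[index]; indexes start at 1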
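The locating rules above can be tried on a small inline document. The markup and the class names `song` and `tang` are invented for illustration.

```python
from lxml import etree

# Hypothetical markup used only to exercise each locating rule.
page_text = """
<html>
  <body>
    <div class="song">
      <ul>
        <li><a href="/a">A</a></li>
        <li><a href="/b">B</a></li>
      </ul>
    </div>
    <div class="tang"><a href="/c">C</a></div>
  </body>
</html>
"""
tree = etree.HTML(page_text)

# Leading /: the path is walked from the root tag, one level per /.
from_root = tree.xpath('/html/body/div')

# // in the middle: any number of levels between html and a.
multi_level = tree.xpath('/html//a')

# Attribute locating: only the div whose class is "song".
by_attr = tree.xpath('//div[@class="song"]')

# Index locating: indexes start at 1, so li[2] is the second li.
second_li_link = tree.xpath('//div[@class="song"]/ul/li[2]/a')

print(len(from_root), len(multi_level), len(by_attr), second_li_link[0].text)
```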
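Extracting text
- /text(): the text directly inside the tag
- //text(): all text inside the tag, including the text of nested tags
Extracting attributes
- /@attrName: the value of the named attribute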
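A short sketch of the difference between the two text forms, plus attribute extraction, using an invented fragment:

```python
from lxml import etree

# Hypothetical fragment with direct text, nested text and an attribute.
page_text = '<div class="intro">Hello <b>xpath</b> world<a href="/next">next</a></div>'
tree = etree.HTML(page_text)

# /text(): only the text nodes that are direct children of the div.
direct_text = tree.xpath('//div[@class="intro"]/text()')

# //text(): every text node under the div, nested tags included.
all_text = tree.xpath('//div[@class="intro"]//text()')

# /@href: the value of the href attribute.
link = tree.xpath('//div[@class="intro"]/a/@href')

print(direct_text)
print(all_text)
print(link)
```

Every call returns a list, which is why the scraping code in these notes indexes the result with `[0]`.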
Using xpath to scrape image names and image data
https://pic.netbian.com/4kdongman/
import os

import pymongo
import requests
from lxml import etree

# Directory where the downloaded images are stored
dirName = 'images'
if not os.path.exists(dirName):
    os.mkdir(dirName)

# MongoDB collection that records each image's title and URL
client = pymongo.MongoClient(host='localhost', port=27017)
db = client.images
collection = db.pic_netbian_com

url = 'https://pic.netbian.com/4kdongman/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.3538.77 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
response.encoding = 'GBK'
html = response.text
tree = etree.HTML(html)

# Each li holds one image
li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
for li in li_list:
    # ./ makes the expression relative to the current li
    title = li.xpath('./a/img/@alt')[0].replace(' ', '') + '.jpg'
    img_src = 'https://pic.netbian.com/' + li.xpath('./a/img/@src')[0]
    # .content is the raw bytes of the image
    img_data = requests.get(url=img_src, headers=headers).content
    imgPath = dirName + '/' + title
    with open(imgPath, 'wb') as fp:
        fp.write(img_data)
    data = {
        'title': title,
        'src': img_src
    }
    result = collection.insert_one(data)
    print(title, 'saved')
client.close()
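Requirement: what if the extracted fragment has to keep its HTML tags?
- Use bs4: when bs4 locates a tag, what it returns renders as the string for that tag, markup included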
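A minimal sketch of the bs4 route, assuming the `beautifulsoup4` package is installed. The fragment and the class name `content` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the "content" div is wanted, tags included.
page_text = '<div class="content"><p>Hello <b>world</b></p></div><div>other</div>'

soup = BeautifulSoup(page_text, 'html.parser')

# find() returns a Tag object for the first matching element.
tag = soup.find('div', class_='content')

# Converting the Tag to str yields the markup of that element and its children.
fragment = str(tag)
print(fragment)
```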
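How can an xpath expression be made more general?
Separate sub-expressions with the pipe character |. The combined expression works whether both sub-expressions match or only one of them does, and it returns everything that either side matches.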
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.3538.77 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata'
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)

# Hot cities and all cities sit at different depths, so each needs its own expression
hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text()')
print(hot_cities)
print(all_cities)
General version
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.3538.77 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata'
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)

# One expression covers both layouts: | returns whatever either side matches
cc = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //div[@class="bottom"]/ul/div[2]/li/a/text()')
print(cc)
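The pipe behaviour can also be checked offline. The sketch below uses invented markup with two hypothetical containers, `hot` and `all`, rather than the real page structure. It shows that the same expression still works on a page where only one side matches.

```python
from lxml import etree

# One expression with two alternatives joined by |.
CITY_XPATH = '//div[@class="hot"]/ul/li/a/text() | //div[@class="all"]/ul/li/a/text()'

# Hypothetical page where both containers are present.
both_page = """
<div class="hot"><ul><li><a>Beijing</a></li><li><a>Shanghai</a></li></ul></div>
<div class="all"><ul><li><a>Anshan</a></li></ul></div>
"""

# Hypothetical page where only the "hot" container exists.
hot_only_page = """
<div class="hot"><ul><li><a>Beijing</a></li></ul></div>
"""

# Both sides match: the result holds the nodes from either side.
both_cities = etree.HTML(both_page).xpath(CITY_XPATH)

# Only the left side matches: the expression still returns its nodes.
hot_only_cities = etree.HTML(hot_only_page).xpath(CITY_XPATH)

print(both_cities)
print(hot_only_cities)
```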
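Anti-scraping strategy: lazy loading
Site: 站长素材, high-definition images
Pseudo-attribute: src2 turns into src only once the image is scrolled into the browser's visible area
Anti-scraping mechanism: image lazy loading, widely used on image-heavy sites
The img pseudo-attribute becomes the real attribute only when the image is displayed inside the browser's visible area. A request sent with requests has no visible area, so the value to parse is always the img pseudo-attribute (the image URL).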
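A minimal sketch of reading the pseudo-attribute, run on invented markup rather than the real site. The container id, the example.com URLs and the fallback to `src` are assumptions for illustration; the notes only state that `src2` holds the image address before the image becomes visible.

```python
from lxml import etree

# Hypothetical markup as a requests call might receive it: the first img
# still carries the pseudo-attribute src2, the second has a normal src.
page_text = """
<div id="container">
  <div><a href="/tupian/1.html"><img src2="https://example.com/img/1_s.jpg" alt="one"></a></div>
  <div><a href="/tupian/2.html"><img src="https://example.com/img/2_s.jpg" alt="two"></a></div>
</div>
"""
tree = etree.HTML(page_text)

image_urls = []
for img in tree.xpath('//div[@id="container"]//img'):
    # Prefer the pseudo-attribute; fall back to src when it is absent (assumption).
    address = img.get('src2') or img.get('src')
    if address:
        image_urls.append(address)

print(image_urls)
```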