Data Parsing
Study notes summarized from this Bilibili course.
All code is for learning purposes only; commercial use is prohibited, and the content will be removed upon any infringement claim.
I. Overview of Data Parsing
1. Review of focused crawlers:
A focused crawler scrapes specified data content within a page.
(1) Coding workflow (a minimal sketch follows this list):
- Specify the url
- Send the request
- Get the response data
- Parse the data
- Persist the parsed partial data
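A minimal sketch of these five steps, assuming a hypothetical example.com URL; the parsing step itself is what the rest of these notes cover:
import requests

url = 'https://example.com/list.html'      # 1. specify the url (hypothetical)
response = requests.get(url=url)           # 2. send the request
page_text = response.text                  # 3. get the response data (page source)
# 4. parse the data here with regex / bs4 / xpath (see the sections below)
# 5. persist the parsed partial data
with open('./result.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)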
2. Types of data parsing:
- Regular expressions (regex)
- bs4 (only usable in Python)
- xpath (the focus)
3. Overview of the data parsing principle (a small example follows this list)
- The partial text content to be parsed is stored either between tags or in the attributes of those tags.
- Locate the specified tag.
- Extract (parse) the data stored in the tag or in the tag's attributes.
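For instance, in the hypothetical fragment below (illustration only), text sits between tags while the URL sits in an attribute:
# hypothetical HTML fragment, for illustration only
html = '<div class="pic"><img src="/img/cat.jpg" alt="cat"></div><p>caption text</p>'
# 'caption text' is stored between the <p> tags (tag text);
# '/img/cat.jpg' is stored in the src attribute of the <img> tag.
# Parsing = locate the target tag first, then extract its text or attribute value.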
II. Scraping Image Data
1. Introductory code (how to fetch and store image data)
- What we store is the image itself, which is a chunk of binary data.
- requests.get(url=url).content
import requests

url = 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/542/542.jpg'
# .content returns the response body as binary data (bytes)
img_data_test = requests.get(url=url).content
# write the binary data to a local file in 'wb' mode
with open('./test.jpg', 'wb') as fp:
    fp.write(img_data_test)
print("over!!!")
2. Regex parsing
(1) Common regular expressions:
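The original list here was an image; as a rough, non-exhaustive reminder, the pieces actually used in this chapter are sketched below:
import re

# .   any character except \n (with re.S it also matches \n)
# *   zero or more of the preceding token; .*? is the non-greedy form
# \d  a digit   \w  a word character   \s  whitespace
# ()  a capture group: re.findall returns only the captured parts
text = "<img src='a.jpg'><img src='b.jpg'>"
print(re.findall(r"<img src='(.*?)'>", text, re.S))   # ['a.jpg', 'b.jpg']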
(2) Requirement: scrape all of the image data under the '爆笑' section of www.ratoo.net.
(3) Analysis:
- First use a general-purpose crawl to fetch the whole page (the focused crawler builds on top of the general crawler), then use focused crawling to parse the partial data (the images) out of that page.
(4) Code (the UA spoofing string is truncated here; substitute a real one):
import requests
import re
import os

# create the output directory if it does not exist yet
if not os.path.exists('./picLibs'):
    os.mkdir('./picLibs')

url = 'https://www.ratoo.net/a/baoxiao/'
headers = {
    'User-Agent': 'Mozil....'   # truncated in the original notes; fill in a real User-Agent
}
# fetch the whole page first (general-purpose crawl)
page_content = requests.get(url=url, headers=headers).text
# regex that captures the image src inside each <div class="pic1"> block
ex = '<div class="pic1">.*?<img src=\'(.*?)\' border.*?</div>'
img_list = re.findall(ex, page_content, re.S)
for src in img_list:
    src = 'https:' + src
    data = requests.get(url=src, headers=headers).content
    img_name = src.split('/')[-1]
    img_path = './picLibs/' + img_name
    with open(img_path, 'wb') as fp:
        fp.write(data)
    print(img_name, 'downloaded successfully!')
print("Scraping finished")
(5) Extended requirement:
- Under this '爆笑' section there are many more page numbers; each one corresponds to a separate page, and every page has its own image data. The code above only scrapes the images behind the first page's URL. How do we extend it to also scrape the images on pages 2, 3, 4, ...?
- Compare the URLs of the different pages carefully: only the page number in 9_x changes.
import requests
import re
import os

if not os.path.exists('./picLibs'):
    os.mkdir('./picLibs')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4750.0 Safari/537.36'
}
# generic URL template: only the page number after 9_ changes
url = 'https://www.ratoo.net/a/baoxiao/list_9_%d.html'
for pageNum in range(1, 6):
    new_url = url % pageNum   # fill the page number into the template
    page_content = requests.get(url=new_url, headers=headers).text
    ex = '<div class="pic1">.*?<img src=\'(.*?)\' border.*?</div>'
    img_list = re.findall(ex, page_content, re.S)
    for src in img_list:
        src = 'https:' + src
        data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        img_path = './picLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(data)
        print(img_name, 'downloaded successfully!')
print("Scraping finished")
3. Data parsing with bs4
(1) Recall the data parsing principle: locate the specified tag, then extract the data stored in the tag or in its attributes (see I.3 above).
(2) The bs4 parsing principle:
- 1. Instantiate a BeautifulSoup object and load the page source into it.
- 2. Locate tags and extract data by calling the relevant attributes and methods of the BeautifulSoup object.
(3) Environment setup:
pip install bs4
pip install lxml
If downloads are slow, switch the pip source to a domestic mirror (Aliyun etc.); an example follows.
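For example (assuming the Aliyun PyPI mirror; any other mirror works the same way via pip's -i option):
pip install bs4 -i https://mirrors.aliyun.com/pypi/simple/
pip install lxml -i https://mirrors.aliyun.com/pypi/simple/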
(4) How to instantiate a BeautifulSoup object:
- Step 1 (import): from bs4 import BeautifulSoup (import the BeautifulSoup class from the bs4 module)
- Step 2 (instantiate), in one of two ways:
  1. Load a local HTML file object:
     fp = open('./test.html', 'r', encoding='utf-8')
     soup = BeautifulSoup(fp, 'lxml')
  2. Load page source fetched from the internet (response comes from requests):
     page_text = response.text
     soup = BeautifulSoup(page_text, 'lxml')
- Methods and attributes provided for data parsing (a sketch follows):
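The original notes showed these as a screenshot; a brief sketch of the commonly used calls (the class names song/tang belong to the course's hypothetical test.html):
from bs4 import BeautifulSoup

fp = open('./test.html', 'r', encoding='utf-8')   # hypothetical local page from the course
soup = BeautifulSoup(fp, 'lxml')

soup.div                              # the first div tag in the document
soup.find('div')                      # same as soup.div
soup.find('div', class_='song')       # locate a tag by attribute
soup.find_all('a')                    # all matching tags, returned as a list
soup.select('.tang > ul > li > a')    # CSS selector; > is one level, a space spans multiple levels
soup.a.text                           # all text under the tag (same as get_text())
soup.a.string                         # only the tag's direct text
soup.a['href']                        # value of the tag's href attribute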
(5) bs4 practice project (scrape the chapters of 三国演义 from shicimingju.com)
Note: the site returns Chinese text, so set the response encoding to utf-8 before reading .text, otherwise the content comes out garbled:
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozill...'   # truncated in the original notes; fill in a real User-Agent
}
url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'   # avoid garbled Chinese
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')
# locate every chapter <li> in the table of contents
li_list = soup.select('.book-mulu > ul > li')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for li in li_list:
    title = li.a.string
    detail_url = 'https://www.shicimingju.com' + li.a['href']
    detail_response = requests.get(url=detail_url, headers=headers)
    detail_response.encoding = 'utf-8'
    detail_text = detail_response.text
    # parse the detail page and pull out the chapter content
    soup_detail = BeautifulSoup(detail_text, 'lxml')
    div_tagClass = soup_detail.find('div', class_='chapter_content')
    content = div_tagClass.get_text()
    fp.write(title + ':' + content + '\n')
    print(title, 'chapter scraped')
fp.close()
print('All done')
4. xpath parsing: the most commonly used, most convenient and efficient parsing method, and the most general-purpose one.
(1) The xpath parsing principle:
1. Instantiate an etree object and load the page source to be parsed into it.
2. Call the etree object's xpath method with an xpath expression to locate tags and capture content.
(2) Environment setup:
pip install lxml (installs the parser | make sure you install into the right Python environment | can also be installed directly from PyCharm)
(3) How to instantiate an etree object: from lxml import etree (a minimal sketch follows this list)
1. Load the source of a local HTML document into an etree object:
- etree.parse(filePath): returns an etree object
2. Or load source fetched from the internet into the object:
- etree.HTML(page_text)
- then call xpath('xpath expression') on the object
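A minimal, self-contained sketch combining both steps (the HTML string stands in for a page fetched with requests):
from lxml import etree

# stand-in for requests.get(...).text
page_text = '<html><body><div class="song"><p>hello</p></div></body></html>'
tree = etree.HTML(page_text)
print(tree.xpath('//div[@class="song"]/p/text()'))   # prints ['hello']; xpath always returns a list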
(4) xpath expressions
- First make sure an etree object has been instantiated with the source to be parsed loaded into it | parse() for a local file | HTML() for source fetched from the web
from lxml import etree
tree = etree.parse('test.html')
- Then apply the xpath expression: tree.xpath('xpath expression')
- / : when leftmost, start locating from the root node | otherwise it denotes one level of hierarchy
- // : denotes multiple levels | when leftmost, start from any position and find all matches
- Attribute-based location: //div[@class="song"] | general form: tag[@attrName="attrValue"]
- Index-based location: //div[@class="song"]/p[3] | indexing starts at 1
- Getting text (the text stored between the tags) | newlines, \t and spaces also count as text content
- /text(): you must locate the tag that directly holds the text (its direct text) | returns only the tag's direct text content
- //text(): returns the tag's non-direct text content (all the text under it)
- Getting an attribute (the attribute value of the located tag): /@attrName
r = tree.xpath('/html/head/title')                          # hierarchical location starting from the root
r = tree.xpath('/html/body/div')
r = tree.xpath('/html//div')                                # // skips intermediate levels
r = tree.xpath('//div')                                     # every div anywhere in the document
r = tree.xpath('//div[@class="song"]')                      # attribute-based location
r = tree.xpath('//div[@class="song"]/p[3]')                 # index-based location (1-based)
r = tree.xpath('//div[@class="tang"]//li[5]/a/text()')[0]   # get the direct text of a tag
r = tree.xpath('//div[@class="song"]/img/@src')             # get an attribute value
(5) xpath practice 1: scrape second-hand housing listings from 58.com
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozi....36'   # truncated in the original notes; fill in a real User-Agent
}
url = 'https://bj.58.com/ershoufang/?PGTID=0d100000-0000-119a-7933-11a4db6d83bf&ClickID=2'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# each listing is a div.property inside section.list
div_lists = tree.xpath('//section[@class="list"]/div[@class="property"]')
fp = open('58.txt', 'w', encoding='utf-8')
for div in div_lists:
    # ./ makes the expression relative to the current div
    title = div.xpath('./a//div[@class="property-content"]//div[@class="property-content-title"]/h3/text()')[0]
    fp.write(title + '\n')
fp.close()
print('over!!!')
(6) xpath practice 2: parse and download image data
import requests
from lxml import etree
import os

headers = {
    'User-Agent': '.......'   # fill in a real User-Agent
}
url = 'https://pic.netbian.com/4kbeijing/'
response = requests.get(url=url, headers=headers)
page_text = response.text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]//li')
if not os.path.exists('./pic'):
    os.mkdir('./pic')
for li in li_list:
    img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
    # the page is gbk-encoded; re-encode the mis-decoded name to fix garbled Chinese
    img_name = img_name.encode('iso-8859-1').decode('gbk')
    img_data = requests.get(url=img_src, headers=headers).content
    img_path = 'pic/' + img_name
    with open(img_path, 'wb') as fp:
        fp.write(img_data)
    print(img_name, 'downloaded successfully')
print('All scraping finished')
(7) xpath practice 3: parse out all city names from aqistudy.cn
import requests
from lxml import etree

headers = {
    'User-Agent': '...'   # fill in a real User-Agent
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# the | operator unions two xpath expressions: hot cities and all cities are matched in one pass
a_city_name = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
all_city_name = []
for a in a_city_name:
    name = a.xpath('./text()')[0]
    all_city_name.append(name)
print(all_city_name, len(all_city_name))
''' The same task split into two steps; the version above merges them into one unified expression
hot_li_list = tree.xpath('//div[@class="bottom"]/ul/li')
all_city_names = []   # stores the hot-city names
for li in hot_li_list:
    hot_city_name = li.xpath('./a/text()')[0]
    all_city_names.append(hot_city_name)   # append to the list
# parse the names of all the cities
city_names_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
for li in city_names_list:
    city_name = li.xpath('./a/text()')[0]
    all_city_names.append(city_name)
print(all_city_names, len(all_city_names))
'''
(8) xpath comprehensive practice: scrape the free résumé templates from 站长素材 (sc.chinaz.com)
import requests
import os
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4750.0 Safari/537.36'
}
# URL template for page 2 onwards; page 1 uses its own URL
url = 'https://sc.chinaz.com/jianli/free_%d.html'
if not os.path.exists('./muban'):
    os.mkdir('./muban')

# step 1: collect the detail-page URL of every template on the first two pages
url_list = []
for pageNum in range(1, 3):
    if pageNum == 1:
        new_url = 'https://sc.chinaz.com/jianli/free.html'
    else:
        new_url = url % pageNum
    page_text = requests.get(url=new_url, headers=headers).text
    tree = etree.HTML(page_text)
    url_list_xpath = tree.xpath('//div[@id="main"]//a/@href')
    for i in url_list_xpath:
        detail_url = 'https:' + i
        url_list.append(detail_url)

# step 2: visit each detail page and download the template archive
for i in url_list:
    p_t = requests.get(url=i, headers=headers).text
    tree1 = etree.HTML(p_t)
    url_rar = tree1.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')[0]
    model_name = tree1.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
    # re-encode the mis-decoded title to fix garbled Chinese, strip spaces, then append the archive file name
    model_name = './muban/' + model_name.encode('iso-8859-1').decode('utf-8').replace(" ", "") + str(url_rar).split('/')[-1]
    rar_data = requests.get(url=url_rar, headers=headers).content
    with open(model_name, 'wb') as fp:
        fp.write(rar_data)
    print(model_name, 'download finished')
print('Everything is done')
More practice projects will be added later…