Web Scraping
Author: Ychhh_
Background
Types of crawlers
Anti-crawling mechanisms
- Portal sites can deploy dedicated policies to stop crawler programs from harvesting their data.
- Anti-anti-crawling strategy: work around those anti-crawling policies so the data can still be obtained.
Related protocols
The requests module
What requests does
Simulates a browser sending requests.
UA spoofing (countering an anti-crawling mechanism)
If a portal site detects that the request carrier is a script such as requests rather than a real browser, it may refuse to serve the page, so the crawler masquerades as a browser through the User-Agent header.
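A minimal sketch of UA spoofing with requests (the target URL below is only a placeholder):

```python
import requests

# Pretend to be a normal browser instead of the default "python-requests/x.y" UA
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}
response = requests.get("https://www.example.com", headers=headers)
print(response.status_code)
```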
Focused crawling
Data-parsing approaches
bs4
- General parsing workflow: 1. locate the tag; 2. extract the data from the tag's attributes or text.
- bs4 parsing workflow:
  1. Instantiate a BeautifulSoup object and load the page source into it.
  2. Call the BeautifulSoup object's attributes and methods to locate tags and extract data.
- Locating tags:
  - soup.tagName: returns the first occurrence of that tag.
  - soup.find():
    1. find(tagName): equivalent to soup.tagName
    2. find(tagName, class_ / attr / id …): locate by attribute
  - soup.find_all(): returns every matching tag as a list; it accepts the same attribute filters.
  - soup.select():
    1. tag selectors
    2. hierarchy selectors: "parent > child" descends exactly one level; a space (' ') crosses any number of levels.
    - Attention: for the same query, find() and select() do not return the same kind of object.
- Getting the text inside a tag (a short sketch contrasting these calls follows the code sample below):
  - soup.text
  - soup.string
  - soup.get_text()
- Code sample (crawling Romance of the Three Kingdoms):

```python
import requests
import json
from bs4 import BeautifulSoup

if __name__ == "__main__":
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }

    response = requests.get(url=url, headers=headers)
    response.encoding = response.apparent_encoding
    """
    r.encoding is derived from the charset in the response headers and falls back to
    iso-8859-1 when none is given, while r.apparent_encoding is guessed from the page
    content itself. Setting r.encoding = r.apparent_encoding avoids garbled text.
    """
    html = response.text
    soup = BeautifulSoup(html, 'lxml')

    # grab every chapter link from the table of contents
    muluList = soup.select(".book-mulu a")
    muluRecord = []
    for mulu in muluList:
        muluRecord.append(mulu.text)
    pageNum = len(muluRecord)

    dataTotalUrl = "https://www.shicimingju.com/book/sanguoyanyi/%d.html"
    for i, title in enumerate(muluRecord):
        dataUrl = dataTotalUrl % (i + 1)
        response = requests.get(url=dataUrl, headers=headers)
        response.encoding = response.apparent_encoding
        dataHtml = response.text

        dataSoup = BeautifulSoup(dataHtml, 'lxml')
        data = dataSoup.find("div", class_="chapter_content").text
        data = data.replace(" ", "\n")

        path = r"C:\Users\Y_ch\Desktop\spider_test\data\text\sanguo\\" + title + ".txt"
        with open(path, 'w', encoding="utf-8") as fp:
            fp.write(data)
        print("Chapter %d downloaded" % (i + 1))
```
xpath
CAPTCHA recognition
- CAPTCHAs are an anti-crawling mechanism used by portal sites.
- Crawl the CAPTCHA image, then feed it to a third-party CAPTCHA-recognition service to decode it.
- Code sample:

```python
import json
import requests
from lxml import etree
from verication import vercation

if __name__ == "__main__":
    url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    response = requests.get(url=url, headers=headers)

    # locate the CAPTCHA <img> and build its absolute URL
    tree = etree.HTML(response.text)
    varication_path = tree.xpath("//img[@id=\"imgCode\"]/@src")
    picUrl = "https://so.gushiwen.cn" + varication_path[0]
    pic = requests.get(url=picUrl, headers=headers).content

    print(vercation(pic=pic))
```
```python
# verication.py: wrapper around the Chaojiying CAPTCHA-recognition SDK
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of a wrongly recognised CAPTCHA
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


def vercation(pic, picCode=1902, picMoudle=None):
    chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')
    if picMoudle is None:
        # pic already holds the raw image bytes
        return chaojiying.PostPic(pic, picCode)["pic_str"]
    else:
        # pic is a path on disk, so read the bytes first
        im = open(pic, 'rb').read()
        return chaojiying.PostPic(im, picCode)["pic_str"]
```
Proxies
Asynchronous crawling
Ways to crawl asynchronously
The selenium module
Handling iframes
Action chains
Headless browsers
Evading selenium detection
The Scrapy framework
Getting started with Scrapy
Persistent storage of Scrapy data
- Scrapy persistent storage:
  - Terminal-based storage:
    `scrapy crawl <spiderName> -o <outputFile>`
    Note:
    1. Only the **return value** of the parse function can be stored, and only to a **local file (not directly to a database)**.
    2. Only certain file types are supported: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
    (A short export sketch follows this list.)
  - Pipeline-based storage:
    - Coding workflow:
      - Parse the data.
      - Define the corresponding fields in the item class to hold the data.
      - Pack the parsed data into the item.
      - Submit the item object to the pipeline for persistent storage.
      - Save the data in the pipeline class's process_item method.
      - Enable the pipeline in the settings file.
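Terminal-based export only needs the parse return value. A minimal sketch, where the spider name "qiubai" and the output file are placeholders and the XPath mirrors the pipeline example below:

```python
# Run with:  scrapy crawl qiubai -o qiubai.csv
def parse(self, response):
    all_data = []
    for content in response.xpath("//div[@class=\"content\"]/span/text()").extract():
        all_data.append({"content": content})
    return all_data  # only this return value is written to the output file
```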
Local save example

```python
import scrapy


class QiuabaiproItem(scrapy.Item):
    content = scrapy.Field()
```

```python
class QiuabaiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print("start")
        self.fp = open("./qiubi.txt", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = item["content"]
        self.fp.write(content)
        return item

    def close_spider(self, spider):
        print("finish")
        self.fp.close()
```

```python
ITEM_PIPELINES = {
    'qiuabaiPro.pipelines.QiuabaiproPipeline': 300,
}
```
```python
def parse(self, response):
    div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
    for content in div_list:
        item = QiuabaiproItem()  # the item class defined in items.py above
        item["content"] = content
        yield item               # submit each item to the pipeline for storage
```
Database save example

```python
import pymysql


class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='localhost', port=3307, user="root", passwd="ych3362632", db="test", charset="utf8")

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            print(len(item["name"]))
            self.cursor.execute("insert into spider (`name`) values (\"%s\")" % item["name"])
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

```python
ITEM_PIPELINES = {
    'qiuabaiPro.pipelines.QiuabaiproPipeline': 300,
    # a lower number means higher priority, so the MySQL pipeline runs after the file pipeline
    'qiuabaiPro.pipelines.MysqlPipeline': 301,
}
```
Storage summary
Crawling every page of a site
The five core components
(Diagram of Scrapy's five core components; original image not preserved.)
Passing parameters between requests
Using the image pipeline class (ImagesPipeline)
- Use the ImagesPipeline class from scrapy.pipelines.images to fetch and download images automatically from their URLs.
- Override the relevant methods of ImagesPipeline.
- Configure the image storage path in settings.

```python
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class ImageLine(ImagesPipeline):

    def get_media_requests(self, item, info):
        # issue a request for the image URL stored in the item
        yield scrapy.Request(item["src"][0])

    def file_path(self, request, response=None, info=None, *, item=None):
        # file name used under IMAGES_STORE
        return item["name"][0] + ".jpg"

    def item_completed(self, results, item, info):
        # hand the item on to any later pipeline
        return item
```

```python
ITEM_PIPELINES = {
    'imagePro.pipelines.ImageLine': 300,
}
# root directory where the downloaded images are saved
IMAGES_STORE = "./data/pic/beauty"
```
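For reference, a minimal sketch of the item the pipeline above assumes; the class name ImageItem is a placeholder, and only the src and name fields are taken from the code above:

```python
import scrapy


class ImageItem(scrapy.Item):
    src = scrapy.Field()   # image URL consumed by get_media_requests
    name = scrapy.Field()  # file name consumed by file_path
```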
Using downloader middlewares:
- Intercepting requests:
  - UA spoofing: process_request

```python
def process_request(self, request, spider):
    request.headers["User-Agent"] = xxx  # substitute a browser User-Agent here
    return None
```

  - Proxy IPs: process_exception

```python
def process_exception(self, request, exception, spider):
    request.meta["proxy"] = xxx  # e.g. "http://ip:port"
    return request               # re-schedule the failed request through the proxy
```

- Intercepting responses:
  - Tampering with the response data / response object: process_response

```python
from scrapy.http import HtmlResponse  # needed to build the replacement response


def process_response(self, request, response, spider):
    if request.url in spider.href_list:
        # render the dynamically loaded page with the spider's selenium browser
        bro = spider.bro
        bro.get(request.url)
        page_text = bro.page_source
        new_response = HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)
        return new_response
    else:
        return response
```
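The middleware only takes effect after it is enabled in settings.py. A minimal sketch, assuming a project named middlePro and a downloader-middleware class named MiddleproDownloaderMiddleware (both names are placeholders):

```python
DOWNLOADER_MIDDLEWARES = {
    # path and priority of the downloader middleware defined above
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}
```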