1. Classification and role of Scrapy middleware
1.1 Types of Scrapy middleware
Based on where they sit in the Scrapy run flow, middlewares fall into two types: 1. downloader middleware 2. spider middleware
1.2 Role of Scrapy middleware: pre-process request and response objects
1. Replace or process headers and cookies 2. Use proxy IPs, etc. 3. Customize requests. By default, both kinds of middleware live in the same middlewares.py file. Spider middleware is used in the same way as downloader middleware and its functionality largely overlaps, so downloader middleware is what is normally used.
2. How to use downloader middleware
We learn how to use middleware through downloader middleware. Writing a Downloader Middleware is just like writing a pipeline: define a class, then enable it in settings.
Default methods of a Downloader Middleware:
process_request(self, request, spider):
Called for every request that passes through the downloader middleware.
1. Return None: a method with no return statement also returns None. The request is handed on to the downloader, or passed by the engine to the process_request methods of the other downloader middlewares further along the chain.
2. Return a Response object: no request is sent; the response is returned to the engine.
3. Return a Request object: the request object is handed back to the scheduler via the engine, and the process_request methods of the remaining downloader middlewares are skipped.
Summary: None: if every downloader middleware returns None, the request is finally handed to the downloader. Request: if a request is returned, it is handed to the scheduler. Response: if a response is returned, it is handed to the spider for parsing.
process_response(self, request, response, spider):
Called when the downloader has finished the HTTP request and passes the response to the engine.
1. Return a Response: it is passed via the engine to the spider, or to the process_response methods of the other downloader middlewares further along the chain.
2. Return a Request object: it is handed back to the scheduler via the engine to be requested again, and the remaining process_request methods are skipped.
Middleware is enabled in settings.py; the smaller the weight value, the earlier it runs.
Summary: Request: if a request is returned, it is handed to the scheduler. Response: the response object is handed to the spider for parsing.
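To make the two hooks concrete, here is a minimal skeleton of a downloader middleware (the class name is illustrative) showing the default "pass-through" behaviour of both methods:

class ExampleDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Returning None lets the request continue through the remaining
        # middlewares and on to the downloader; returning a Response or a
        # Request here would short-circuit the flow as described above.
        return None

    def process_response(self, request, response, spider):
        # Returning the response passes it on towards the engine/spider;
        # returning a Request would send it back to the scheduler instead.
        return response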
3. Writing a downloader middleware that sets a random User-Agent
3.1 Crawl Douban Top 250
1. Create the project
scrapy startproject Douban
2. Define the item model (items.py)
class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
3. Create the spider
cd Douban
scrapy genspider movie douban.com
4. Replace the start URL (movie.py)
start_urls = ['https://movie.douban.com/top250']
5. Get the movie list
movie_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]')
print(len(movie_list))
6. Run and test
scrapy crawl movie
The crawl fails: robots.txt cannot be fetched and the default User-Agent is identified as a crawler.
7. Configure a USER_AGENT (settings.py)
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
8. Full spider code (movie.py)
import scrapy
from Douban.items import DoubanItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # print the User-Agent actually used for this request
        print(response.request.headers['User-Agent'])
        movie_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]')
        # print(len(movie_list))  # 25
        for movie in movie_list:
            item = DoubanItem()
            item['name'] = movie.xpath('./div[1]/a/span[1]/text()').extract_first()
            yield item
        # follow the "next page" link if it exists
        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[2]/span[3]/a/@href').extract_first()
        if next_url is not None:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(url=next_url)
3.2 Complete the code in middlewares.py
1. Delete the generated boilerplate in middlewares.py
2. Add a USER_AGENT_LIST to settings.py
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7 ",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.8 (KHTML, like Gecko) Beamrise/17.2.0.9 Chrome/17.0.939.0 Safari/535.8 ",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/18.6.872.0 Safari/535.2 UNTRUSTED/1.0 3gpp-gba"
]
3. Enable it: turn on the custom downloader middleware in settings.py, the same way a pipeline is enabled
DOWNLOADER_MIDDLEWARES = {
'Douban.middlewares.RandomUserAgent': 543,
}
4. The random User-Agent middleware (middlewares.py)
import random
from Douban.settings import USER_AGENT_LIST

# Define a middleware class
class RandomUserAgent(object):
    def process_request(self, request, spider):
        # print(request.headers['User-Agent'])
        # pick a random User-Agent from the list in settings.py
        ua = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = ua
4. Using proxy IPs
Both free and paid proxy IPs can be used. 1. Add a PROXY_LIST to settings.py
PROXY_LIST = [
    {"ip_port": "123.207.53.84:16816", "user_passwd": "morganna_mode_g:ggc22qxp"},
    {"ip_port": "27.191.60.100:3256"},
]
2. Enable the proxy middleware
DOWNLOADER_MIDDLEWARES = {
'Douban.middlewares.RandomProxy': 543,
}
3. The random proxy middleware (middlewares.py)
import random
import base64
from Douban.settings import PROXY_LIST

class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        print(proxy)
        if 'user_passwd' in proxy:
            # Encode the credentials; in Python 3 base64 works on bytes, so encode first
            b64_up = base64.b64encode(proxy['user_passwd'].encode())
            # Set the authentication header
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_up.decode()
            # Set the proxy
            request.meta['proxy'] = proxy['ip_port']
        else:
            # Set the proxy (no authentication needed)
            request.meta['proxy'] = proxy['ip_port']
4. Check whether the proxy IP is usable. When proxy IPs are in use, the proxy's health can be checked in the downloader middleware's process_response() method; if the proxy does not work, another proxy IP can be used instead.
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status != 200:
            request.dont_filter = True  # let the re-sent request re-enter the scheduler queue
            return request
        return response
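The snippet above only re-queues the failed request. A slightly fuller sketch (assuming the PROXY_LIST from settings.py; the class name is illustrative) that also switches to a different proxy before retrying:

import random
from Douban.settings import PROXY_LIST

class ProxyCheckMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status != 200:
            # pick another proxy and re-send the request
            proxy = random.choice(PROXY_LIST)
            request.meta['proxy'] = proxy['ip_port']
            request.dont_filter = True  # let the request re-enter the scheduler queue
            return request
        return response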
5. Using selenium in a middleware
Take the PM2.5 historical data site (air quality history query) as an example: the requests need to be swapped out, because after crawling only a few records the crawler gets blocked. Set a User-Agent in settings.py:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
5.1 The spider code
# -*- coding: utf-8 -*-
import scrapy
from AQI.items import AqiItem
import time

class AqiSpider(scrapy.Spider):
    name = 'aqi'
    allowed_domains = ['aqistudy.cn']
    host = 'https://www.aqistudy.cn/historydata/'
    start_urls = [host]

    # Parse the response of the start url
    def parse(self, response):
        # Get the list of city urls
        url_list = response.xpath('//div[@class="bottom"]/ul/div[2]/li/a/@href').extract()
        # Iterate over part of the list
        for url in url_list[45:48]:
            city_url = response.urljoin(url)
            # Request the city detail page
            yield scrapy.Request(city_url, callback=self.parse_month)

    # Parse the response of the city detail page
    def parse_month(self, response):
        # Get the list of monthly detail urls
        url_list = response.xpath('//ul[@class="unstyled1"]/li/a/@href').extract()
        # Iterate over part of the url list
        for url in url_list[30:31]:
            month_url = response.urljoin(url)
            # Request the monthly detail page
            yield scrapy.Request(month_url, callback=self.parse_day)

    # Parse the data on the daily detail page
    def parse_day(self, response):
        print(response.url, '######')
        # Get all data nodes
        node_list = response.xpath('//tr')
        city = response.xpath('//div[@class="panel-heading"]/h3/text()').extract_first().split('2')[0]
        # Iterate over the data nodes
        for node in node_list:
            # Create the item container
            item = AqiItem()
            # Fill in the fixed fields first
            item['city'] = city
            item['url'] = response.url
            item['timestamp'] = time.time()
            # The data fields
            item['date'] = node.xpath('./td[1]/text()').extract_first()
            item['AQI'] = node.xpath('./td[2]/text()').extract_first()
            item['LEVEL'] = node.xpath('./td[3]/span/text()').extract_first()
            item['PM2_5'] = node.xpath('./td[4]/text()').extract_first()
            item['PM10'] = node.xpath('./td[5]/text()').extract_first()
            item['SO2'] = node.xpath('./td[6]/text()').extract_first()
            item['CO'] = node.xpath('./td[7]/text()').extract_first()
            item['NO2'] = node.xpath('./td[8]/text()').extract_first()
            item['O3'] = node.xpath('./td[9]/text()').extract_first()
            # for k, v in item.items():
            #     print(k, v)
            # print('##########################')
            # Return the item to the engine
            yield item
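The spider above assumes an AqiItem with the fields used in parse_day; a minimal sketch of AQI/items.py:

import scrapy

class AqiItem(scrapy.Item):
    city = scrapy.Field()
    url = scrapy.Field()
    timestamp = scrapy.Field()
    date = scrapy.Field()
    AQI = scrapy.Field()
    LEVEL = scrapy.Field()
    PM2_5 = scrapy.Field()
    PM10 = scrapy.Field()
    SO2 = scrapy.Field()
    CO = scrapy.Field()
    NO2 = scrapy.Field()
    O3 = scrapy.Field()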
5.2 Using selenium in middlewares.py
from selenium import webdriver
import time
from scrapy.http import HtmlResponse

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        url = request.url
        # only render the daily-data pages with selenium
        if 'daydata' in url:
            driver = webdriver.Chrome()
            driver.get(url)
            time.sleep(3)
            data = driver.page_source
            driver.close()
            # Build a response object from the rendered page and return it,
            # so the request never reaches the downloader
            res = HtmlResponse(url=url, body=data, encoding='utf-8', request=request)
            return res
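As with the earlier middlewares, this one has to be enabled in settings.py (assuming the project is named AQI, as the import in the spider suggests; the weight value is illustrative):

DOWNLOADER_MIDDLEWARES = {
    'AQI.middlewares.SeleniumMiddleware': 543,
}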