[Python Knowledge Base] Scrapy: Scraping Image Data and Downloader Middleware


ImagesPipeline

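    Scraping string data vs. image data with Scrapy

        Strings: parse the value with XPath and submit it to a pipeline for persistent storage.

        Images: XPath only yields the image's src address; a separate request must then be sent to that address to fetch the image's binary data.

    ImagesPipeline:

        Parse the img src attribute value and submit it to the pipeline. The pipeline then requests that address, fetches the image's binary data, and stores it persistently for us.

    Requirement: scrape the high-resolution images from 站长素材 (sc.chinaz.com)

    Workflow:

        Parse the data (the image addresses)

        Submit the item holding the image address to the designated pipeline class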

import scrapy
from imgpro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # When parsing, use the pseudo-attribute src2 instead of src
            src = div.xpath('./div/a/img/@src2').extract_first()
            if src is None:
                continue  # this div has no image address; skip it
            src = 'https:' + src
            print(src)
            item = ImgproItem()
            item['src'] = src
            yield item
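
The spider above prepends `https:` by hand, which only works when every `src2` value is protocol-relative (`//host/path`). As a hedged alternative, the standard library's `urljoin` resolves protocol-relative, root-relative, and already absolute addresses alike; inside a spider, `response.urljoin(src)` does the same thing against the response URL. The image addresses below are made-up examples, not values taken from the site, and `absolute_image_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://sc.chinaz.com/tupian/'


def absolute_image_url(page_url, src):
    """Resolve an <img> address against the page it was found on."""
    if not src:
        return None  # attribute missing or empty
    return urljoin(page_url, src.strip())


if __name__ == '__main__':
    samples = [
        '//img.example.com/a.jpg',        # protocol-relative
        '/static/b.jpg',                  # root-relative
        'https://img.example.com/c.jpg',  # already absolute
        None,                             # attribute not present
    ]
    for raw in samples:
        print(raw, '->', absolute_image_url(PAGE_URL, raw))
```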

        In the pipelines file, define a custom pipeline class that subclasses ImagesPipeline and overrides:

            get_media_requests()

            file_path()

            item_completed()

import scrapy
from scrapy.pipelines.images import ImagesPipeline

# class ImgproPipeline:
#     def process_item(self, item, spider):
#         return item


class imgspipleline(ImagesPipeline):
    # Request the image data from the image address stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Return the file name (relative to IMAGES_STORE) under which the image is saved
    def file_path(self, request, response=None, info=None, *, item=None):
        url = request.url
        img_name = url.split('/')[-1]
        return img_name

    def item_completed(self, results, item, info):
        return item  # hand the item to the next pipeline class to be executed
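
The `file_path()` method above takes everything after the last `/` in the URL as the file name. That is fine for plain image URLs, but a query string would end up in the name. A minimal sketch of a more defensive variant using only the standard library follows; `image_name_from_url`, its `default` fallback, and the sample URL are hypothetical and not part of the original project.

```python
import posixpath
from urllib.parse import urlparse


def image_name_from_url(url, default='unnamed'):
    """Return the last path segment of url, without query string or fragment."""
    path = urlparse(url).path          # drops '?query' and '#fragment'
    name = posixpath.basename(path)    # text after the final '/'
    return name or default             # fall back when the path ends in '/'


if __name__ == '__main__':
    url = 'https://img.example.com/pics/cat.jpg?size=large'
    print(url.split('/')[-1])          # naive split keeps the query string
    print(image_name_from_url(url))    # path-based name drops it
```

Inside the pipeline, `file_path()` could return `image_name_from_url(request.url)` instead of the result of the manual split.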

        In the settings file:

            Set the directory where images are stored

            Enable the pipeline: the custom pipeline class

ITEM_PIPELINES = {
    'imgpro.pipelines.imgspipleline': 300,
}



# Directory where the downloaded images are stored
IMAGES_STORE = './IMGS'


Middleware

    Downloader middleware: sits between the engine and the downloader

        Role: intercepts every request and response issued anywhere in the project

        Intercepting requests:

            User-Agent spoofing: process_request

            Proxy IP: process_exception, then return request


import random


class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    user_agents = [
		'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
		'Opera/8.0 (Windows NT 5.1; U; en)',
		'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
		'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
		'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
		'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
		'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
		'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
		'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
		'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
		'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
		'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
		'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
		'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
		'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
		'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
		'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
	]
    # Candidate proxies for http and https requests
    proxy_http = [
        '117.27.113.33:9999',
        '36.102.169.189:8080',
    ]
    proxy_https = [
        '120.83.49.90:90000',  # as given in the source; 90000 exceeds the valid port range (max 65535)
        '95.18.112.214:35508',
    ]

    # Intercept requests
    def process_request(self, request, spider):
        # User-Agent spoofing
        request.headers['User-Agent'] = random.choice(self.user_agents)
        # Fixed proxy, set here only to verify that proxying works
        request.meta['proxy'] = 'http://123.96.18.192:8080'
        return None

    # Intercept all responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        # Switch to a random proxy that matches the request's scheme
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.proxy_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.proxy_https)
        return request  # resend the corrected request
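
The middleware class above only takes effect once it is enabled in the settings file. A minimal sketch, assuming the project package is named `middlepro` (inferred from the class name, not stated in the original) and the class lives in `middlewares.py`; 543 is the priority that Scrapy's generated project template suggests.

```python
# settings.py (fragment)
# The package name 'middlepro' is an assumption; use your own project's name.
DOWNLOADER_MIDDLEWARES = {
    'middlepro.middlewares.MiddleproDownloaderMiddleware': 543,
}
```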

        Intercepting responses:

            Tamper with the response data, or replace the response object

            Requirement: scrape the news data from NetEase News (网易新闻)

                1. Parse the URLs of the five main section pages from the NetEase News home page (these are not dynamically loaded)

                2. The news headlines in each section are dynamically loaded

                3. Parse the URL of each news detail page, fetch that page's source, and parse the content
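
Step 2 is where tampering with the response comes in: the downloader returns the section page without its dynamically loaded headlines, so `process_response` can swap in a rendered copy. The original does not say how the rendered page is obtained; one common approach, assumed here, is a browser driver such as Selenium. In this sketch `RenderedResponseMiddleware`, `spider.section_urls`, and `spider.browser` are hypothetical names: the spider is assumed to collect the section URLs in step 1 and to hold an object that offers WebDriver's `get(url)` method and `page_source` attribute.

```python
class RenderedResponseMiddleware:
    """Replace section-page responses with browser-rendered HTML.

    Assumptions (not in the original): the spider exposes `section_urls`
    (URLs whose content is loaded dynamically) and `browser`, an object
    with Selenium WebDriver's `get(url)` method and `page_source` attribute.
    """

    def process_response(self, request, response, spider):
        section_urls = getattr(spider, 'section_urls', ())
        if request.url not in section_urls:
            return response  # static page: pass it through unchanged

        # Let the browser load the page and run its JavaScript
        spider.browser.get(request.url)
        rendered_html = spider.browser.page_source

        # replace() returns a copy of the response with the given attributes
        # swapped; passing `encoding` assumes a text (HTML) response
        return response.replace(body=rendered_html, encoding='utf-8')
```

Returning a response object from `process_response` hands it on toward the spider, so the section callback sees the rendered headlines instead of the empty container.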

CrawlSpider: a class, a subclass of Spider

    Whole-site crawling

        With Spider: send the requests manually

        With CrawlSpider

    Using CrawlSpider

        Create a project