Suning (SN) case study
Requirement: crawl every book and book-category record on Suning's online bookstore, plus the price shown on each book's linked detail page. URL: http://snbook.suning.com/web/trd-fl/999999/0.htm Goal: practise the points covered so far.
```python
import scrapy
import re
from copy import deepcopy


class SuningSpider(scrapy.Spider):
    name = 'suning'
    allowed_domains = ['suning.com']
    start_urls = ['http://snbook.suning.com/web/trd-fl/999999/0.htm']

    def parse(self, response):
        # each <li> holds one big category plus its sub-categories
        li_list = response.xpath("//ul[@class='ulwrap']/li")
        for li in li_list:
            item = {}
            item["b_cate"] = li.xpath("./div[1]/a/text()").extract_first()
            a_list = li.xpath("./div[2]/a")
            for a in a_list:
                item["s_href"] = a.xpath("./@href").extract_first()
                item["s_cate"] = a.xpath("./text()").extract_first()
                if item["s_href"] is not None:
                    item["s_href"] = "http://snbook.suning.com/" + item["s_href"]
                    yield scrapy.Request(
                        item["s_href"],
                        callback=self.parse_book_list,
                        # deepcopy so concurrent requests do not share one dict
                        meta={"item": deepcopy(item)}
                    )

    def parse_book_list(self, response):
        item = deepcopy(response.meta["item"])
        li_list = response.xpath("//div[@class='filtrate-books list-filtrate-books']/ul/li")
        for li in li_list:
            item["book_name"] = li.xpath(".//div[@class='book-title']/a/@title").extract_first()
            item["book_img"] = li.xpath(".//div[@class='book-img']//img/@src").extract_first()
            if item["book_img"] is None:
                # lazily loaded images keep the real address in @src2
                item["book_img"] = li.xpath(".//div[@class='book-img']//img/@src2").extract_first()
            item["book_author"] = li.xpath(".//div[@class='book-author']/a/text()").extract_first()
            item["book_press"] = li.xpath(".//div[@class='book-publish']/a/text()").extract_first()
            item["book_desc"] = li.xpath(".//div[@class='book-descrip c6']/text()").extract_first()
            item["book_href"] = li.xpath(".//div[@class='book-title']/a/@href").extract_first()
            yield scrapy.Request(
                item["book_href"],
                callback=self.parse_book_detail,
                meta={"item": deepcopy(item)}
            )

        # pagination: page numbers live in inline JavaScript in the page source
        page_count = int(re.findall("var pagecount=(.*?);", response.body.decode())[0])
        current_page = int(re.findall("var currentPage=(.*?);", response.body.decode())[0])
        if current_page < page_count:
            next_url = item["s_href"] + "?pageNumber={}&sort=0".format(current_page + 1)
            yield scrapy.Request(
                next_url,
                callback=self.parse_book_list,
                meta={"item": response.meta["item"]}
            )

    def parse_book_detail(self, response):
        item = response.meta["item"]
        # the price is rendered by JavaScript, so pull it out of the raw body with a regex
        item["book_price"] = re.findall("\"bp\":'(.*?)',", response.body.decode())
        item["book_price"] = item["book_price"][0] if len(item["book_price"]) > 0 else None
        print(item)
```
CrawlSpider
Requirement: crawl all blog experts on CSDN and their articles. URL: http://blog.csdn.net/experts.html Goal: learn how CrawlSpider is used through this kind of crawler.
Command to generate a CrawlSpider: `scrapy genspider -t crawl csdn "csdn.cn"`. Study the demo below:
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re


class CfSpider(CrawlSpider):
    name = 'cf'
    allowed_domains = ['circ.gov.cn']
    start_urls = ['http://www.circ.gov.cn/web/site0/tab5240/module14430/page1.htm']

    rules = (
        # detail pages: their responses are handed to parse_item
        Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+\.htm'), callback='parse_item'),
        # list pages: keep following them so every page of the listing is visited
        Rule(LinkExtractor(allow=r'/web/site0/tab5240/module14430/page\d+\.htm'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        item["title"] = re.findall("<!--TitleStart-->(.*?)<!--TitleEnd-->", response.body.decode())[0]
        item["publish_date"] = re.findall(r"发布时间:(20\d{2}-\d{2}-\d{2})", response.body.decode())[0]
        print(item)
```
Carrying cookies to simulate login
Why simulate login at all? To obtain cookies, so that pages only visible after logging in can be crawled.
Review: how does requests simulate login? 1. Request the page while carrying cookies directly. 2. Find the login endpoint and send a POST request; the session stores the cookie. How does selenium simulate login? Locate the corresponding input elements, type in the credentials, and click the login button.
Scrapy likewise has two ways to simulate login: 1. carry the cookie directly; 2. find the URL the POST request goes to, attach the login data, and send the request.
Typical scenarios for carrying cookies directly: 1. The cookie has a very long lifetime, which is common on loosely maintained sites. 2. All the data can be fetched before the cookie expires. 3. In combination with another program, e.g. use selenium to log in and save the resulting cookies locally, then have Scrapy read the local cookies before sending its requests (a sketch follows).
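A minimal sketch of that third scenario, assuming a separate program (for example a selenium login script) has already dumped the cookies to a local file; the file name `cookies.json` and its name-to-value layout are assumptions, not part of the original demo:

```python
import json

import scrapy


class LocalCookieSpider(scrapy.Spider):
    # hypothetical spider; it only illustrates reading locally saved cookies
    name = "local_cookie"
    start_urls = ["http://www.renren.com/327550029/profile"]

    def start_requests(self):
        # cookies.json is assumed to hold {"cookie_name": "cookie_value", ...},
        # written by another program after it logged in
        with open("cookies.json") as f:
            cookies = json.load(f)
        yield scrapy.Request(self.start_urls[0], cookies=cookies, callback=self.parse)

    def parse(self, response):
        print(response.url)
```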
We define start_urls inside the spider, but which part of Scrapy actually takes start_urls and turns it into requests?
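The answer is the spider's own `start_requests` method: the base `scrapy.Spider` class builds one request per URL in `start_urls`, roughly as in the simplified sketch below (behaviour only, not the exact library source). Overriding `start_requests`, as the renren spider further down does, is therefore the natural place to attach cookies to the very first request:

```python
import scrapy


class DefaultStartRequestsSketch(scrapy.Spider):
    # simplified sketch of what scrapy.Spider does with start_urls by default
    name = "default_sketch"
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # one request per URL; the responses go to self.parse
            yield scrapy.Request(url, dont_filter=True)

    def parse(self, response):
        pass
```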
Next question: how do we know the cookie really is being passed along between the different parse functions?
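One straightforward way to check is to turn on Scrapy's built-in cookie logging in `settings.py`; the downloader then logs the cookies sent with each request and received with each response, so you can watch the login cookie travel along with the follow-up requests:

```python
# settings.py
COOKIES_DEBUG = True  # log cookies sent with every request and received with every response
```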
```python
import scrapy
import re


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/327550029/profile']

    def start_requests(self):
        # cookie string copied from a logged-in browser session
        cookies = "anonymid=jcokuqturos8ql; depovince=GW; jebecookies=f90c9e96-78d7-4f74-b1c8-b6448492995b|||||; _r01_=1; JSESSIONID=abcx4tkKLbB1-hVwvcyew; ick_login=ff436c18-ec61-4d65-8c56-a7962af397f4; _de=BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5; p=90dea4bfc79ef80402417810c0de60989; first_login_flag=1; ln_uact=mr_mao_hacker@163.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn421/20171230/1635/main_JQzq_ae7b0000a8791986.jpg; t=24ee96e2e2301bf2c350d7102956540a9; societyguester=24ee96e2e2301bf2c350d7102956540a9; id=327550029; xnsid=e7f66e0b; loginfrom=syshome; ch_id=10016"
        # turn "k1=v1; k2=v2" into the dict that scrapy.Request expects
        cookies = {i.split("=")[0]: i.split("=")[1] for i in cookies.split("; ")}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        # the username only appears on the logged-in page, so this confirms the cookie worked
        print(re.findall("毛兆军", response.body.decode()))
        yield scrapy.Request(
            "http://www.renren.com/327550029/profile?v=info_timeline",
            callback=self.parse_detail
        )

    def parse_detail(self, response):
        # the cookie set on the first request is carried over to this request automatically
        print(re.findall("毛兆军", response.body.decode()))
```
Simulating login by sending a POST request
```python
import scrapy
import re


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # the login form contains hidden fields that must be posted back as well
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        post_data = dict(
            login="noobpythoner",
            password="zhoudawei123",
            authenticity_token=authenticity_token,
            utf8=utf8,
            commit=commit
        )
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata=post_data,
            callback=self.after_login
        )

    def after_login(self, response):
        # if login succeeded, the username appears somewhere in the page
        print(re.findall("noobpythoner|NoobPythoner", response.body.decode()))
```
Automatic login with FormRequest.from_response
```python
import scrapy
import re


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response locates the form in the page and fills in the hidden fields for us
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"login": "noobpythoner", "password": "zhoudawei123"},
            callback=self.after_login
        )

    def after_login(self, response):
        print(re.findall("noobpythoner|NoobPythoner", response.body.decode()))
```
Middlewares
Usage: writing a downloader middleware works just like writing a pipeline: define a class, then enable it in settings.
Default methods of a downloader middleware:
- `process_request(self, request, spider)`: called for every request as it passes through the downloader middleware.
- `process_response(self, request, response, spider)`: called when the downloader has finished the HTTP request and is passing the response back to the engine.
```python
import random


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a random User-Agent from the list defined in settings
        ua = random.choice(spider.settings.get("USER_AGENTS_LIST"))
        request.headers["User-Agent"] = ua


class CheckUserAgent:
    def process_response(self, request, response, spider):
        # confirm the header was actually replaced, then pass the response on
        print(request.headers["User-Agent"])
        return response
```
Settings (settings.py)
```python
DOWNLOADER_MIDDLEWARES = {
    'circ.middlewares.RandomUserAgentMiddleware': 543,
    'circ.middlewares.CheckUserAgent': 544,
}

USER_AGENTS_LIST = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]
```
Key points
### Scrapy data flow
- Scheduler -> request -> engine -> downloader middleware -> downloader
- The downloader sends the request and obtains a response -> response -> downloader middleware -> engine -> spider middleware -> spider
- Spider extracts data -> engine -> pipeline
- Spider extracts URLs -> builds requests -> spider middleware -> engine -> scheduler
### How Scrapy sends requests, and which parameters a request can carry
- scrapy.Request(url, callback, meta, dont_filter)
- dont_filter=True means a URL that has already been requested will be requested again, i.e. the duplicate filter is bypassed (see the sketch below)
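A minimal sketch of those parameters in use (the spider name and URL here are placeholders, not taken from the examples above):

```python
import scrapy


class RequestDemoSpider(scrapy.Spider):
    name = "request_demo"                 # hypothetical spider
    start_urls = ["http://example.com"]   # hypothetical URL

    def parse(self, response):
        yield scrapy.Request(
            response.url,                     # which URL to fetch
            callback=self.parse_next,         # which function handles the response
            meta={"item": {"key": "value"}},  # data handed over to the callback
            dont_filter=True,                 # bypass the duplicate-URL filter
        )

    def parse_next(self, response):
        print(response.meta["item"])          # {'key': 'value'}
```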
### How Scrapy passes data from one parse function to another, and why
- The fields of a single item are usually spread over several pages (category page, list page, detail page), so the partially filled item has to travel with the request to the next callback
- meta is a dict: meta = {"item": item}
- Read it back in the callback with response.meta["item"]
### What Item is in Scrapy and how to use it
- An Item is a class that inherits from scrapy.Item; each field is declared as name = scrapy.Field()
- The Item declares which fields we intend to scrape
- It is used exactly like a dict
- When inserting into MongoDB, convert it first with dict(item) (see the sketch below)
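A minimal sketch of defining and using an Item (the field names are illustrative, not taken from a real project):

```python
import scrapy


class BookItem(scrapy.Item):
    # declare every field we intend to scrape
    book_name = scrapy.Field()
    book_price = scrapy.Field()


item = BookItem()
item["book_name"] = "Example Book"   # used exactly like a dict
item["book_price"] = "29.80"
print(dict(item))                    # plain dict, e.g. for inserting into MongoDB
```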
### What open_spider and close_spider are in a pipeline
- open_spider runs once, and only once, when the spider starts
- close_spider runs once, and only once, when the spider closes (see the sketch below)
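A minimal sketch of a pipeline that uses both hooks, assuming pymongo is installed and a local MongoDB instance is running (the database and collection names are placeholders):

```python
from pymongo import MongoClient


class MongoPipeline:
    def open_spider(self, spider):
        # runs once, when the spider starts: open the connection here
        self.client = MongoClient("localhost", 27017)
        self.collection = self.client["book_db"]["books"]

    def process_item(self, item, spider):
        # runs for every item: convert the Item to a plain dict before inserting
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # runs once, when the spider closes: release the connection here
        self.client.close()
```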
### Using CrawlSpider
- Create the spider: scrapy genspider -t crawl <spider_name> <allowed_domain>
- Set start_urls; the corresponding responses are passed through rules to extract further URLs
- Fill in rules by adding Rule entries, e.g. ` Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+\.htm'), callback='parse_item'),`
- Notes:
  - Incomplete URLs are completed automatically by CrawlSpider before being requested
  - Do not define a parse method; CrawlSpider needs it for its own internal logic
  - callback: the responses for URLs extracted by the link extractor are handed to this function
  - follow: whether the responses for URLs extracted by the link extractor are themselves run through the rules again