Requirements
Crawl the universities and their corresponding majors from the given link and store them in an Excel spreadsheet.
Twists and Turns
This is my second crawler. I previously wrote one that scraped the Douban Top 250 movies; if you are interested, see [link] and [link], the former using the Scrapy framework and the latter implemented with the requests package. At first I assumed this task would be about the same as the last one, since the page looks simple, but it turned out otherwise. My initial idea was to crawl the "招生院校" (admitting institutions) page and then follow the "查看专业" (view majors) a-tag links to reach the information we actually want. Unfortunately, that page is dynamic HTML whose content is driven by JavaScript, so that approach was a dead end. Then I noticed that the requests actually follow a pattern:
https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1
Everything before the ? is the request endpoint, and what follows are the parameters, so this is clearly a GET request.
In the query string yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1, observation shows that yx_h is the institution code and zykl_h is the subject category (1 for liberal arts, 2 for science). The other two parameters are identical across every request, so there is no telling what they actually mean.
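As a quick illustration, here is a minimal sketch of how such a URL could be assembled programmatically (the helper name and the use of urllib are my own additions, not part of the project):

from urllib.parse import urlencode

def build_url(school_code, category):
    # category: 1 = liberal arts, 2 = science; pc_h and jhlb_h are kept fixed
    # because their meaning is unclear but they never change across requests
    params = {'yx_h': school_code, 'pc_h': '11', 'jhlb_h': '00', 'zykl_h': category}
    return 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?' + urlencode(params)

print(build_url(10001, 1))
# https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1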
So all we really have to do is keep requesting this address, slowly stepping the institution code from 10001 up to 19027. But what about the other parameter? We can simply run the program twice: the first run requests all the liberal-arts data, the second all the science data. Then another problem showed up: some schools have no liberal-arts or no science data on this site, and if that case is not handled, the 500 error kills the program outright. My idea for this was: when we hit a 500, skip it, add 1 to the institution code, and move on to the next request. The idea is clear enough, but when it came to actually putting it into code I was stuck. (With a package like requests you control most of the flow and changes are easy, as in the sketch after this list; but this time I was using the Scrapy framework and simply could not find where to plug this logic in.) Solutions:
- Set the Request's errback parameter and define our own error-handling logic (I did not think of this at the time; I only discovered it later).
- At the time, my dear senior put together a for loop that catches the exception in the middle without handling it, which barely got the problem solved (though I still could not understand why that works).
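For comparison, a minimal sketch of the skip-on-500 idea using the plain requests package (variable names and the parsing placeholder are illustrative, not the project's actual code):

import requests

BASE = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
OTHER = '&pc_h=11&jhlb_h=00&zykl_h=1'

for school_code in range(10001, 19028):
    resp = requests.get(BASE + str(school_code) + OTHER, timeout=10)
    if resp.status_code != 200:
        # e.g. a 500 when the school has no data for this category: just skip it
        continue
    # ... parse resp.text and collect the rows here ...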
Another issue: when scraping the table content, XPath Helper matched the rows just fine, but the program parsed nothing. Cause: the browser inserts a tbody element under the table tag while rendering; it is not part of the HTML the server actually returns. Solution: remove every tbody from the XPath.
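Concretely (the exact paths here are only for illustration), the difference looks like this:

# XPath copied from the browser's rendered DOM: matches nothing in Scrapy,
# because tbody is injected by the browser and absent from the raw HTML
response.xpath('/html/body/table/tbody/tr[4]/td/table/tbody/tr[@align]')

# Same path with every tbody removed: matches the rows as expected
response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')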
The Code
Core spider file: eduInformation.py (Solution 1):
import scrapy
from scrapy import Request
from scrapy.downloadermiddlewares.retry import get_retry_request
from scrapy.spidermiddlewares.httperror import HttpError
from scrapy.utils.response import response_status_message
from twisted.internet.error import TimeoutError, TCPTimedOutError, ConnectionRefusedError
from twisted.web._newclient import ResponseFailed, ResponseNeverReceived
from scrapy_dir.eduSpider.eduSpider.items import EduspiderItem


class EduinformationSpider(scrapy.Spider):
    name = 'eduInformation'
    allowed_domains = ['www.eeagd.edu.cn']
    other = '&pc_h=11&jhlb_h=00&zykl_h=1'
    url = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
    school_code = 10001
    start_urls = [url + str(school_code) + other]

    def parse(self, response):
        # each row of the result table is a <tr align=...> element
        content_lists = response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')
        for content_list in content_lists:
            item = EduspiderItem()
            items = content_list.xpath('./td')
            item["institution_code"] = items[0].xpath('./text()')[0].extract()
            item["institution_name"] = items[1].xpath('./a/text()')[0].extract()
            item["discipline_code"] = items[2].xpath('./text()')[0].extract()
            item["discipline_name"] = str(items[3].xpath('./a/text()')[0].extract()).strip()
            item["batch"] = items[4].xpath('./text()')[0].extract()
            item["plan_type"] = items[5].xpath('./text()')[0].extract()
            item["admission_type"] = items[6].xpath('./text()')[0].extract()
            yield item
        # chain the next institution code onto this request
        self.school_code += 1
        if self.school_code < 10011:
            next_url = self.url + str(self.school_code) + self.other
            yield scrapy.Request(next_url, callback=self.parse, errback=self.errback_parse, dont_filter=True)

    def errback_parse(self, failure):
        request = failure.request
        if failure.check(HttpError):
            # an HTTP error (e.g. 500 when the school has no data for this category):
            # log it, skip this code, and continue the chain with the next one
            response = failure.value.response
            self.logger.error(
                'errback <%s> %s , response status:%s' %
                (request.url, failure.value, response_status_message(response.status))
            )
            self.school_code += 1
            next_url = self.url + str(self.school_code) + self.other
            new_request = Request(url=next_url, callback=self.parse, errback=self.errback_parse, dont_filter=True)
            # get_retry_request returns a retry-ready copy of the request
            # (or None once the retry limit is exceeded); returning it from the
            # errback keeps the spider going
            new_request_or_none = get_retry_request(
                new_request,
                spider=self,
                reason='retry',
            )
            return new_request_or_none
        elif failure.check(ResponseFailed):
            self.logger.error('errback <%s> ResponseFailed' % request.url)
        elif failure.check(ConnectionRefusedError):
            self.logger.error('errback <%s> ConnectionRefusedError' % request.url)
        elif failure.check(ResponseNeverReceived):
            self.logger.error('errback <%s> ResponseNeverReceived' % request.url)
        elif failure.check(TCPTimedOutError, TimeoutError):
            self.logger.error('errback <%s> TimeoutError' % request.url)
        else:
            self.logger.error('errback <%s> OtherError' % request.url)
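A further option, not used here but worth noting: Scrapy can also be told to hand non-200 responses to the regular callback instead of treating them as errors, by setting handle_httpstatus_list = [500] on the spider (or HTTPERROR_ALLOWED_CODES in settings); parse could then check response.status itself and simply produce nothing for the error pages.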
Solution 2:
import scrapy
from scrapy_dir.eduSpider.eduSpider.items import EduspiderItem


class EduinformationSpider(scrapy.Spider):
    name = 'eduInformation'
    allowed_domains = ['www.eeagd.edu.cn']
    other = '&pc_h=11&jhlb_h=00&zykl_h=1'
    url = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
    start_urls = ["https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp"]

    def parse(self, response):
        for school_code in range(10001, 10010):
            try:
                next_url = self.url + str(school_code) + self.other
                yield scrapy.Request(next_url, callback=self.parse_detail, dont_filter=True)
            except:
                continue

    def parse_detail(self, response):
        content_lists = response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')
        for content_list in content_lists:
            item = EduspiderItem()
            items = content_list.xpath('./td')
            item["institution_code"] = items[0].xpath('./text()')[0].extract()
            item["institution_name"] = items[1].xpath('./a/text()')[0].extract()
            item["discipline_code"] = items[2].xpath('./text()')[0].extract()
            item["discipline_name"] = str(items[3].xpath('./a/text()')[0].extract()).strip()
            item["batch"] = items[4].xpath('./text()')[0].extract()
            item["plan_type"] = items[5].xpath('./text()')[0].extract()
            item["admission_type"] = items[6].xpath('./text()')[0].extract()
            yield item
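My best guess, after the fact, at why this works: the try/except itself does almost nothing, but in this version every request is yielded up front and independently of the others, so when one of them comes back as a 500 Scrapy just logs the failure and drops that single request while the rest carry on. In the chained approach, each successful response was responsible for producing the next request, so one unhandled 500 broke the chain and the spider simply ran out of work.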
Item definition: items.py
import scrapy


class EduspiderItem(scrapy.Item):
    institution_code = scrapy.Field()
    institution_name = scrapy.Field()
    discipline_code = scrapy.Field()
    discipline_name = scrapy.Field()
    batch = scrapy.Field()
    plan_type = scrapy.Field()
    admission_type = scrapy.Field()
Downloader middleware: middlewares.py
import random
import base64

from scrapy import signals
from scrapy_dir.eduSpider.eduSpider.settings import USER_AGENTS, PROXIES


class EduspiderSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class EduspiderDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class RandomUserAgent(object):
    # pick a random User-Agent from settings for every outgoing request
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)


class RandomProxy(object):
    # route each request through a random proxy from settings,
    # adding Basic auth when the proxy requires credentials
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_passwd'] is None:
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # b64encode needs bytes, and the header value must be a string again
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']
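Note that only RandomUserAgent is actually enabled in settings.py below. If the proxy middleware were wanted as well, it would also need an entry in DOWNLOADER_MIDDLEWARES, along these lines (the priority value is just an example):

DOWNLOADER_MIDDLEWARES = {
    'eduSpider.middlewares.RandomUserAgent': 1,
    'eduSpider.middlewares.RandomProxy': 2,
}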
Persistence pipeline: pipelines.py
import xlwt


class EduspiderPipeline:
    def __init__(self):
        self.workbook = xlwt.Workbook('utf-8')
        self.sheet = self.workbook.add_sheet('demo', cell_overwrite_ok=True)
        # header row
        self.sheet.write(0, 0, "院校代码")
        self.sheet.write(0, 1, "院校")
        self.sheet.write(0, 2, "专业代码")
        self.sheet.write(0, 3, "招生专业")
        self.sheet.write(0, 4, "专业批次")
        self.sheet.write(0, 5, "计划类别")
        self.sheet.write(0, 6, "招生科类")
        self.index = 1

    def process_item(self, item, spider):
        self.sheet.write(self.index, 0, item["institution_code"])
        self.sheet.write(self.index, 1, item["institution_name"])
        self.sheet.write(self.index, 2, item["discipline_code"])
        self.sheet.write(self.index, 3, item["discipline_name"])
        self.sheet.write(self.index, 4, item["batch"])
        self.sheet.write(self.index, 5, item["plan_type"])
        self.sheet.write(self.index, 6, item["admission_type"])
        self.index += 1
        # saving after every item is wasteful but keeps partial data on disk
        # even if the crawl dies halfway through
        self.workbook.save("demo.xls")
        return item
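If saving the workbook on every item turns out to be too slow, a possible variant (a sketch, not what the project actually does) is to remove the save from process_item and instead add a close_spider method to EduspiderPipeline, which Scrapy calls once when the crawl finishes:

    def close_spider(self, spider):
        # write the file a single time, once all items have been processed
        self.workbook.save("demo.xls")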
Core configuration: settings.py
BOT_NAME = 'eduSpider'
SPIDER_MODULES = ['eduSpider.spiders']
NEWSPIDER_MODULE = 'eduSpider.spiders'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'eduSpider.pipelines.EduspiderPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'eduSpider.middlewares.RandomUserAgent': 1,
}
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]
PROXIES = [
    {'ip_port': '111.8.60.9:8123', 'user_passwd': 'user1:pass1'},
    {'ip_port': '101.71.27.120:80', 'user_passwd': 'user2:pass2'},
    {'ip_port': '122.96.59.104:80', 'user_passwd': 'user3:pass3'},
    {'ip_port': '122.224.249.122:8088', 'user_passwd': 'user4:pass4'},
]
Project launcher: run.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'eduInformation'])
Excel sorting script (the crawled data comes back out of order, so it needs a tidy-up pass):
import xlrd
import xlwt

result = xlwt.Workbook('utf-8')
result_sheet = result.add_sheet('0', cell_overwrite_ok=True)

wb = xlrd.open_workbook('./demo.xls')
sheet = wb.sheet_by_index(0)

# read every data row (skipping the header), dropping exact duplicates
content_list = []
for i in range(1, sheet.nrows):
    content_dic = {}
    content_dic["institution_code"] = sheet.cell(i, 0).value
    content_dic["institution_name"] = sheet.cell(i, 1).value
    content_dic["discipline_code"] = sheet.cell(i, 2).value
    content_dic["discipline_name"] = sheet.cell(i, 3).value
    content_dic["batch"] = sheet.cell(i, 4).value
    content_dic["plan_type"] = sheet.cell(i, 5).value
    content_dic["admission_type"] = sheet.cell(i, 6).value
    if content_dic not in content_list:
        content_list.append(content_dic)

# sort by institution code, ascending
content_list = sorted(content_list, key=lambda r: r['institution_code'], reverse=False)

result_sheet.write(0, 0, "院校代码")
result_sheet.write(0, 1, "院校")
result_sheet.write(0, 2, "专业代码")
result_sheet.write(0, 3, "招生专业")
result_sheet.write(0, 4, "专业批次")
result_sheet.write(0, 5, "计划类别")
result_sheet.write(0, 6, "招生科类")

# write rows starting at row 1; the range runs to len(content_list) + 1
# so the last record is not silently dropped
for i in range(1, len(content_list) + 1):
    result_sheet.write(i, 0, content_list[i - 1]["institution_code"])
    result_sheet.write(i, 1, content_list[i - 1]["institution_name"])
    result_sheet.write(i, 2, content_list[i - 1]["discipline_code"])
    result_sheet.write(i, 3, content_list[i - 1]["discipline_name"])
    result_sheet.write(i, 4, content_list[i - 1]["batch"])
    result_sheet.write(i, 5, content_list[i - 1]["plan_type"])
    result_sheet.write(i, 6, content_list[i - 1]["admission_type"])

result.save("文史.xls")
Results