[Crawler] Scraping admission institution and major information with Scrapy

Requirement

Scrape the universities in the linked page together with their corresponding majors, and save the result as an Excel spreadsheet.

Bumps along the way

This is my second crawler; the first one scraped the Douban Top 250 movies (see that post via the link if you're interested). This project uses the Scrapy framework, whereas the earlier one was built on the requests package.
At first I assumed this task would be about as easy as the previous one, since the page looks simple, but it turned out otherwise. My initial idea was to crawl the list-of-institutions page and then follow the link target of each "查看专业" (view majors) <a> tag to reach the information we actually want. Unfortunately, the page is dynamic HTML: the relevant content is generated by JavaScript, so that approach was a dead end. Looking more closely, though, the requests do follow a clear pattern:

https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1
The first part is the request endpoint and what follows is the query string, so this is clearly a GET request.
In the query string yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1, yx_h is the institution code and zykl_h is the subject category (1 for liberal arts, 2 for science and engineering). The other two parameters are identical across every request I looked at, so I couldn't tell what they actually mean.
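
A quick sketch of how these URLs can be assembled (the helper name build_url is my own, not part of the project):

from urllib.parse import urlencode

BASE = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp'

def build_url(school_code, category):
    # category: 1 = liberal arts, 2 = science/engineering;
    # pc_h and jhlb_h keep the constant values seen in every observed request
    params = {'yx_h': school_code, 'pc_h': '11', 'jhlb_h': '00', 'zykl_h': category}
    return BASE + '?' + urlencode(params)

print(build_url(10001, 1))
# -> https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1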

So the plan is simply to request this address over and over, incrementing the institution code from 10001 up to 19027. What about the category parameter? We can just run the program twice: once to fetch all the liberal-arts data and once for all the science-and-engineering data.
Then another problem showed up: some schools have no liberal-arts (or no science-and-engineering) data on this site, and if nothing handles that, the request comes back as a 500 error and the crawl simply stops (in the chained design used below, each response's parse is what issues the next request, so a single failed request ends the whole chain). My plan was straightforward: when a 500 comes back, skip it, add 1 to the institution code, and move on to the next request. The idea is clear enough; turning it into code is where I got stuck. (With a package like requests you control every step, so the change is easy, but with the Scrapy framework I just couldn't find where to hook this in.)
解决

  1. Set the Request's errback parameter and write our own error-handling logic (I didn't think of this at the time and only discovered it later; a related built-in option, handle_httpstatus_list, is sketched right after this list).
  2. Back then, my dear senior wrapped the requests in a for loop with a try/except that catches exceptions and ignores them, which more or less solved the problem (I still couldn't explain why it worked; in hindsight the except probably never fires at all, and what really helps is that the loop issues every request independently, so one 500 only loses that single response instead of breaking a request chain).
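
For reference, Scrapy can also be told to hand error statuses straight to the callback instead of filtering them out; a minimal sketch of that route (my own, not what this post ended up using):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # 500 responses are passed to parse() instead of being dropped by the HttpError middleware
    handle_httpstatus_list = [500]

    def parse(self, response):
        if response.status == 500:
            return  # this school has no data for the requested category; skip it
        # ... normal extraction goes here ...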

Another problem: when extracting the table contents, an XPath that worked fine in the XPath Helper extension matched nothing in the program.
Cause: the browser inserts a tbody element under every table when it builds the DOM; it is not part of the original page source.
Fix: remove every tbody step from the XPath.
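
Concretely, the path copied from the browser looks roughly like the first line below (with the injected tbody), while the second is what matches the raw HTML Scrapy actually downloads:

# as shown by the browser / XPath Helper (tbody injected by the browser):
/html/body/table/tbody/tr[4]/td/table/tbody/tr[@align]
# what works against the raw HTML:
/html/body/table/tr[4]/td/table/tr[@align]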

Code

Core spider file: eduInformation.py
Solution 1

import scrapy
from scrapy import Request
from scrapy.downloadermiddlewares.retry import get_retry_request
from scrapy_dir.eduSpider.eduSpider.items import EduspiderItem

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import TimeoutError, TCPTimedOutError, ConnectionRefusedError
from twisted.web._newclient import ResponseFailed, ResponseNeverReceived
from scrapy.utils.response import response_status_message  # turns an HTTP status code into a readable message


class EduinformationSpider(scrapy.Spider):
    name = 'eduInformation'
    allowed_domains = ['www.eeagd.edu.cn']
    other = '&pc_h=11&jhlb_h=00&zykl_h=1'
    url = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
    school_code=10001
    start_urls = [url + str(school_code) + other]

    def parse(self, response):
        content_lists = response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')
        for content_list in content_lists:
            item = EduspiderItem()
            items = content_list.xpath('./td')
            item["institution_code"] = items[0].xpath('./text()')[0].extract()
            item["institution_name"] = items[1].xpath('./a/text()')[0].extract()
            item["discipline_code"] = items[2].xpath('./text()')[0].extract()
            item["discipline_name"] = str(items[3].xpath('./a/text()')[0].extract()).strip()
            item["batch"] = items[4].xpath('./text()')[0].extract()
            item["plan_type"] = items[5].xpath('./text()')[0].extract()
            item["admission_type"] = items[6].xpath('./text()')[0].extract()
            yield item
        self.school_code += 1
        if self.school_code < 10011:  # capped for testing; institution codes actually run up to 19027
            next_url = self.url + str(self.school_code) + self.other
            yield scrapy.Request(next_url, callback=self.parse, errback=self.errback_parse, dont_filter=True)

    def errback_parse(self, failure):
        request = failure.request

        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error(
                'errback <%s> %s , response status:%s' %
                (request.url, failure.value, response_status_message(response.status))
            )

            # skip this institution and move on to the next code
            self.school_code += 1
            next_url = self.url + str(self.school_code) + self.other
            new_request = Request(url=next_url, callback=self.parse,
                                  errback=self.errback_parse, dont_filter=True)
            new_request_or_none = get_retry_request(
                new_request,
                spider=self,
                reason='retry',
            )
            return new_request_or_none


        elif failure.check(ResponseFailed):
            self.logger.error('errback <%s> ResponseFailed' % request.url)

        elif failure.check(ConnectionRefusedError):
            self.logger.error('errback <%s> ConnectionRefusedError' % request.url)

        elif failure.check(ResponseNeverReceived):
            self.logger.error('errback <%s> ResponseNeverReceived' % request.url)

        elif failure.check(TCPTimedOutError, TimeoutError):
            self.logger.error('errback <%s> TimeoutError' % request.url)

        else:
            self.logger.error('errback <%s> OtherError' % request.url)
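
A note on solution 1: get_retry_request() (available since Scrapy 2.5) returns either a copy of the given request marked for retrying or None once the retry limit is exhausted, so the errback above either schedules the request for the next institution code or gives up quietly.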



Solution 2

import scrapy

from scrapy_dir.eduSpider.eduSpider.items import EduspiderItem


class EduinformationSpider(scrapy.Spider):
    name = 'eduInformation'
    allowed_domains = ['www.eeagd.edu.cn']
    other = '&pc_h=11&jhlb_h=00&zykl_h=1'
    url = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
    start_urls = ["https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp"]

    def parse(self, response):
        for school_code in range(10001, 10010):  # limited for testing; the full range would be range(10001, 19028)
            try:
                next_url = self.url + str(school_code) + self.other
                yield scrapy.Request(next_url, callback=self.parse_detail, dont_filter=True)
            except:
                # in practice this except never fires: yielding a Request only schedules it,
                # and any download error surfaces later inside Scrapy's engine
                continue

    def parse_detail(self, response):

        content_lists = response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')
        for content_list in content_lists:
            item = EduspiderItem()
            items = content_list.xpath('./td')
            item["institution_code"] = items[0].xpath('./text()')[0].extract()
            item["institution_name"] = items[1].xpath('./a/text()')[0].extract()
            item["discipline_code"] = items[2].xpath('./text()')[0].extract()
            item["discipline_name"] = str(items[3].xpath('./a/text()')[0].extract()).strip()
            item["batch"] = items[4].xpath('./text()')[0].extract()
            item["plan_type"] = items[5].xpath('./text()')[0].extract()
            item["admission_type"] = items[6].xpath('./text()')[0].extract()
            yield item


Item definitions: items.py


import scrapy


class EduspiderItem(scrapy.Item):
    institution_code = scrapy.Field()
    institution_name = scrapy.Field()
    discipline_code = scrapy.Field()
    discipline_name = scrapy.Field()
    batch = scrapy.Field()
    plan_type = scrapy.Field()
    admission_type = scrapy.Field()
    pass

Downloader middleware: middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

from scrapy_dir.eduSpider.eduSpider.settings import USER_AGENTS, PROXIES


class EduspiderSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class EduspiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


import random
import base64



# pick a random User-Agent for each outgoing request
class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)

        request.headers.setdefault("User-Agent", useragent)

class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)

        if proxy['user_passwd'] is None:
            # proxy that needs no authentication
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # base64-encode the user:password pair (b64encode needs bytes, hence encode/decode)
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode()).decode()
            # set the Proxy-Authorization header expected by the proxy server
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']

Persistence component: pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import xlwt


class EduspiderPipeline:
    def __init__(self):
        # create the .xls workbook
        self.workbook = xlwt.Workbook('utf-8')
        # add a worksheet and write the header row
        self.sheet = self.workbook.add_sheet('demo', cell_overwrite_ok=True)
        self.sheet.write(0, 0, "院校代码")
        self.sheet.write(0, 1, "院校")
        self.sheet.write(0, 2, "专业代码")
        self.sheet.write(0, 3, "招生专业")
        self.sheet.write(0, 4, "专业批次")
        self.sheet.write(0, 5, "计划类别")
        self.sheet.write(0, 6, "招生科类")
        self.index = 1

    def process_item(self, item, spider):
        self.sheet.write(self.index, 0, item["institution_code"])
        self.sheet.write(self.index, 1, item["institution_name"])
        self.sheet.write(self.index, 2, item["discipline_code"])
        self.sheet.write(self.index, 3, item["discipline_name"])
        self.sheet.write(self.index, 4, item["batch"])
        self.sheet.write(self.index, 5, item["plan_type"])
        self.sheet.write(self.index, 6, item["admission_type"])
        self.index+=1
        self.workbook.save("demo.xls")
        return item
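
The pipeline above rewrites demo.xls on every item. A minimal variant (my own sketch, not part of the original project) that buffers rows and saves the workbook once when the spider closes:

import xlwt

class EduspiderXlsPipeline:
    # same columns as EduspiderPipeline, but the file is written only once
    HEADERS = ["院校代码", "院校", "专业代码", "招生专业", "专业批次", "计划类别", "招生科类"]
    FIELDS = ["institution_code", "institution_name", "discipline_code",
              "discipline_name", "batch", "plan_type", "admission_type"]

    def open_spider(self, spider):
        self.workbook = xlwt.Workbook('utf-8')
        self.sheet = self.workbook.add_sheet('demo', cell_overwrite_ok=True)
        for col, header in enumerate(self.HEADERS):
            self.sheet.write(0, col, header)
        self.index = 1

    def process_item(self, item, spider):
        for col, field in enumerate(self.FIELDS):
            self.sheet.write(self.index, col, item[field])
        self.index += 1
        return item

    def close_spider(self, spider):
        # save once, after the last item has been processed
        self.workbook.save("demo.xls")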

Core configuration file: settings.py

# Scrapy settings for eduSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'eduSpider'

SPIDER_MODULES = ['eduSpider.spiders']
NEWSPIDER_MODULE = 'eduSpider.spiders'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3

ITEM_PIPELINES = {
   'eduSpider.pipelines.EduspiderPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
   #'eduSpider.middlewares.EduspiderDownloaderMiddleware': 543,
    'eduSpider.middlewares.RandomUserAgent': 1,
    #'eduSpider.middlewares.RandomProxy': 100
}

USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]
PROXIES = [
    {'ip_port': '111.8.60.9:8123', 'user_passwd': 'user1:pass1'},
    {'ip_port': '101.71.27.120:80', 'user_passwd': 'user2:pass2'},
    {'ip_port': '122.96.59.104:80', 'user_passwd': 'user3:pass3'},
    {'ip_port': '122.224.249.122:8088', 'user_passwd': 'user4:pass4'},
]




# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'eduSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules


# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16



# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# SPIDER_MIDDLEWARES = {
#    'eduSpider.middlewares.EduspiderSpiderMiddleware': 543,
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html


# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html


# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Project launcher: run.py

from scrapy.cmdline import execute

# entry point for the crawler: running this file starts the spider
execute(['scrapy', 'crawl', 'eduInformation'])

Script that sorts the Excel output (the rows come back from the crawl out of order, so they need tidying up):

# imports
import xlrd
import xlwt

result = xlwt.Workbook('utf-8')
# add a worksheet for the sorted output
result_sheet = result.add_sheet('0', cell_overwrite_ok=True)

# open the scraped workbook (the path must already exist)
wb = xlrd.open_workbook('./demo.xls')
sheet = wb.sheet_by_index(0)
content_list=[]
for i in range(1,sheet.nrows):
    content_dic={}
    content_dic["institution_code"]=sheet.cell(i,0).value
    content_dic["institution_name"]=sheet.cell(i,1).value
    content_dic["discipline_code"]=sheet.cell(i,2).value
    content_dic["discipline_name"]=sheet.cell(i,3).value
    content_dic["batch"]=sheet.cell(i,4).value
    content_dic["plan_type"]=sheet.cell(i,5).value
    content_dic["admission_type"]=sheet.cell(i,6).value
    if content_dic not in content_list:
        content_list.append(content_dic)
content_list = sorted(content_list, key=lambda r: r['institution_code'],reverse=False)

result_sheet.write(0,0,"院校代码")
result_sheet.write(0,1,"院校")
result_sheet.write(0,2,"专业代码")
result_sheet.write(0,3,"招生专业")
result_sheet.write(0,4,"专业批次")
result_sheet.write(0,5,"计划类别")
result_sheet.write(0,6,"招生科类")
for i in range(1, len(content_list) + 1):  # row 0 is the header; without the +1 the last record is dropped
    result_sheet.write(i,0,content_list[i-1]["institution_code"])
    result_sheet.write(i,1,content_list[i-1]["institution_name"])
    result_sheet.write(i,2,content_list[i-1]["discipline_code"])
    result_sheet.write(i,3,content_list[i-1]["discipline_name"])
    result_sheet.write(i,4,content_list[i-1]["batch"])
    result_sheet.write(i,5,content_list[i-1]["plan_type"])
    result_sheet.write(i,6,content_list[i-1]["admission_type"])
result.save("文史.xls")
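
For what it's worth, the deduplicate-and-sort step can also be done with pandas; this sketch is my own and assumes pandas plus the xlrd and openpyxl engines are installed (it writes .xlsx rather than .xls):

import pandas as pd

# read the scraped .xls (pandas delegates the legacy .xls format to xlrd)
df = pd.read_excel("./demo.xls")
# drop duplicate rows and sort by the institution-code column written by the pipeline
df = df.drop_duplicates().sort_values("院校代码")
# write the cleaned result (openpyxl handles .xlsx)
df.to_excel("文史.xlsx", index=False)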

Result

(screenshot of the resulting Excel spreadsheet omitted)
