Requirements
Crawl the universities and their corresponding majors from the given link and store them in an Excel spreadsheet.
Twists and Turns
This is my second crawler. I previously wrote one that scraped the Douban Top 250 movies; if you are interested, see [link] and [link], the former using the Scrapy framework and the latter implemented with the requests package. At first I assumed this task would be about the same as the last one, since the page looks simple, but it turned out otherwise. My initial idea was to crawl the "招生院校" (admitting institutions) page and then follow the "查看专业" (view majors) a-tag links to reach the information we actually want. Unfortunately, that page is dynamic HTML whose content is driven by JavaScript, so that approach was a dead end. Then I noticed that the requests actually follow a pattern:
https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1
Everything before the ? is the request endpoint, and what follows are the parameters, so this is clearly a GET request.
In the query string yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1, observation shows that yx_h is the institution code and zykl_h is the subject category (1 for liberal arts, 2 for science). The other two parameters are identical across every request, so there is no telling what they actually mean.
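As a quick illustration, here is a minimal sketch of how such a URL could be assembled programmatically (the helper name and the use of urllib are my own additions, not part of the project):

from urllib.parse import urlencode

def build_url(school_code, category):
    # category: 1 = liberal arts, 2 = science; pc_h and jhlb_h are kept fixed
    # because their meaning is unclear but they never change across requests
    params = {'yx_h': school_code, 'pc_h': '11', 'jhlb_h': '00', 'zykl_h': category}
    return 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?' + urlencode(params)

print(build_url(10001, 1))
# https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h=10001&pc_h=11&jhlb_h=00&zykl_h=1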
So all we really have to do is keep requesting this address, slowly stepping the institution code from 10001 up to 19027. But what about the other parameter? We can simply run the program twice: the first run requests all the liberal-arts data, the second all the science data. Then another problem showed up: some schools have no liberal-arts or no science data on this site, and if that case is not handled, the 500 error kills the program outright. My idea for this was: when we hit a 500, skip it, add 1 to the institution code, and move on to the next request. The idea is clear enough, but when it came to actually putting it into code I was stuck. (With a package like requests you control most of the flow and changes are easy, as in the sketch after this list; but this time I was using the Scrapy framework and simply could not find where to plug this logic in.) Solutions:
- Set the Request's errback parameter and define our own error-handling logic (I did not think of this at the time; I only discovered it later).
- At the time, my dear senior put together a for loop that catches the exception in the middle without handling it, which barely got the problem solved (though I still could not understand why that works).
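For comparison, a minimal sketch of the skip-on-500 idea using the plain requests package (variable names and the parsing placeholder are illustrative, not the project's actual code):

import requests

BASE = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
OTHER = '&pc_h=11&jhlb_h=00&zykl_h=1'

for school_code in range(10001, 19028):
    resp = requests.get(BASE + str(school_code) + OTHER, timeout=10)
    if resp.status_code != 200:
        # e.g. a 500 when the school has no data for this category: just skip it
        continue
    # ... parse resp.text and collect the rows here ...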
Another issue: when scraping the table content, XPath Helper matched the rows just fine, but the program parsed nothing. Cause: the browser inserts a tbody element under the table tag while rendering; it is not part of the HTML the server actually returns. Solution: remove every tbody from the XPath.
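Concretely (the exact paths here are only for illustration), the difference looks like this:

# XPath copied from the browser's rendered DOM: matches nothing in Scrapy,
# because tbody is injected by the browser and absent from the raw HTML
response.xpath('/html/body/table/tbody/tr[4]/td/table/tbody/tr[@align]')

# Same path with every tbody removed: matches the rows as expected
response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')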
The Code
Core spider file: eduInformation.py (Solution 1):
import scrapy
from scrapy import Request
from scrapy.downloadermiddlewares.retry import get_retry_request
from scrapy.spidermiddlewares.httperror import HttpError
from scrapy.utils.response import response_status_message
from twisted.internet.error import TimeoutError, TCPTimedOutError, ConnectionRefusedError
from twisted.web._newclient import ResponseFailed, ResponseNeverReceived
from scrapy_dir.eduSpider.eduSpider.items import EduspiderItem


class EduinformationSpider(scrapy.Spider):
    name = 'eduInformation'
    allowed_domains = ['www.eeagd.edu.cn']
    other = '&pc_h=11&jhlb_h=00&zykl_h=1'
    url = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
    school_code = 10001
    start_urls = [url + str(school_code) + other]

    def parse(self, response):
        # each row of the result table is a <tr align=...> element
        content_lists = response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')
        for content_list in content_lists:
            item = EduspiderItem()
            items = content_list.xpath('./td')
            item["institution_code"] = items[0].xpath('./text()')[0].extract()
            item["institution_name"] = items[1].xpath('./a/text()')[0].extract()
            item["discipline_code"] = items[2].xpath('./text()')[0].extract()
            item["discipline_name"] = str(items[3].xpath('./a/text()')[0].extract()).strip()
            item["batch"] = items[4].xpath('./text()')[0].extract()
            item["plan_type"] = items[5].xpath('./text()')[0].extract()
            item["admission_type"] = items[6].xpath('./text()')[0].extract()
            yield item
        # chain the next institution code onto this request
        self.school_code += 1
        if self.school_code < 10011:
            next_url = self.url + str(self.school_code) + self.other
            yield scrapy.Request(next_url, callback=self.parse, errback=self.errback_parse, dont_filter=True)

    def errback_parse(self, failure):
        request = failure.request
        if failure.check(HttpError):
            # an HTTP error (e.g. 500 when the school has no data for this category):
            # log it, skip this code, and continue the chain with the next one
            response = failure.value.response
            self.logger.error(
                'errback <%s> %s , response status:%s' %
                (request.url, failure.value, response_status_message(response.status))
            )
            self.school_code += 1
            next_url = self.url + str(self.school_code) + self.other
            new_request = Request(url=next_url, callback=self.parse, errback=self.errback_parse, dont_filter=True)
            # get_retry_request returns a retry-ready copy of the request
            # (or None once the retry limit is exceeded); returning it from the
            # errback keeps the spider going
            new_request_or_none = get_retry_request(
                new_request,
                spider=self,
                reason='retry',
            )
            return new_request_or_none
        elif failure.check(ResponseFailed):
            self.logger.error('errback <%s> ResponseFailed' % request.url)
        elif failure.check(ConnectionRefusedError):
            self.logger.error('errback <%s> ConnectionRefusedError' % request.url)
        elif failure.check(ResponseNeverReceived):
            self.logger.error('errback <%s> ResponseNeverReceived' % request.url)
        elif failure.check(TCPTimedOutError, TimeoutError):
            self.logger.error('errback <%s> TimeoutError' % request.url)
        else:
            self.logger.error('errback <%s> OtherError' % request.url)
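A further option, not used here but worth noting: Scrapy can also be told to hand non-200 responses to the regular callback instead of treating them as errors, by setting handle_httpstatus_list = [500] on the spider (or HTTPERROR_ALLOWED_CODES in settings); parse could then check response.status itself and simply produce nothing for the error pages.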
Solution 2:
import scrapy
from scrapy_dir.eduSpider.eduSpider.items import EduspiderItem


class EduinformationSpider(scrapy.Spider):
    name = 'eduInformation'
    allowed_domains = ['www.eeagd.edu.cn']
    other = '&pc_h=11&jhlb_h=00&zykl_h=1'
    url = 'https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp?yx_h='
    start_urls = ["https://www.eeagd.edu.cn/lzks/yxzycx/yxzy.jsp"]

    def parse(self, response):
        for school_code in range(10001, 10010):
            try:
                next_url = self.url + str(school_code) + self.other
                yield scrapy.Request(next_url, callback=self.parse_detail, dont_filter=True)
            except:
                continue

    def parse_detail(self, response):
        content_lists = response.xpath('/html/body/table/tr[4]/td/table/tr[@align]')
        for content_list in content_lists:
            item = EduspiderItem()
            items = content_list.xpath('./td')
            item["institution_code"] = items[0].xpath('./text()')[0].extract()
            item["institution_name"] = items[1].xpath('./a/text()')[0].extract()
            item["discipline_code"] = items[2].xpath('./text()')[0].extract()
            item["discipline_name"] = str(items[3].xpath('./a/text()')[0].extract()).strip()
            item["batch"] = items[4].xpath('./text()')[0].extract()
            item["plan_type"] = items[5].xpath('./text()')[0].extract()
            item["admission_type"] = items[6].xpath('./text()')[0].extract()
            yield item
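My best guess, after the fact, at why this works: the try/except itself does almost nothing, but in this version every request is yielded up front and independently of the others, so when one of them comes back as a 500 Scrapy just logs the failure and drops that single request while the rest carry on. In the chained approach, each successful response was responsible for producing the next request, so one unhandled 500 broke the chain and the spider simply ran out of work.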
Item definition: items.py
import scrapy


class EduspiderItem(scrapy.Item):
    institution_code = scrapy.Field()
    institution_name = scrapy.Field()
    discipline_code = scrapy.Field()
    discipline_name = scrapy.Field()
    batch = scrapy.Field()
    plan_type = scrapy.Field()
    admission_type = scrapy.Field()
Downloader middleware: middlewares.py
import random
import base64

from scrapy import signals
from scrapy_dir.eduSpider.eduSpider.settings import USER_AGENTS, PROXIES


class EduspiderSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class EduspiderDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class RandomUserAgent(object):
    # pick a random User-Agent from settings for every outgoing request
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)


class RandomProxy(object):
    # route each request through a random proxy from settings,
    # adding Basic auth when the proxy requires credentials
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_passwd'] is None:
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # b64encode needs bytes, and the header value must be a string again
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']
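Note that only RandomUserAgent is actually enabled in settings.py below. If the proxy middleware were wanted as well, it would also need an entry in DOWNLOADER_MIDDLEWARES, along these lines (the priority value is just an example):

DOWNLOADER_MIDDLEWARES = {
    'eduSpider.middlewares.RandomUserAgent': 1,
    'eduSpider.middlewares.RandomProxy': 2,
}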
Persistence pipeline: pipelines.py
import xlwt


class EduspiderPipeline:
    def __init__(self):
        self.workbook = xlwt.Workbook('utf-8')
        self.sheet = self.workbook.add_sheet('demo', cell_overwrite_ok=True)
        # header row
        self.sheet.write(0, 0, "院校代码")
        self.sheet.write(0, 1, "院校")
        self.sheet.write(0, 2, "专业代码")
        self.sheet.write(0, 3, "招生专业")
        self.sheet.write(0, 4, "专业批次")
        self.sheet.write(0, 5, "计划类别")
        self.sheet.write(0, 6, "招生科类")
        self.index = 1

    def process_item(self, item, spider):
        self.sheet.write(self.index, 0, item["institution_code"])
        self.sheet.write(self.index, 1, item["institution_name"])
        self.sheet.write(self.index, 2, item["discipline_code"])
        self.sheet.write(self.index, 3, item["discipline_name"])
        self.sheet.write(self.index, 4, item["batch"])
        self.sheet.write(self.index, 5, item["plan_type"])
        self.sheet.write(self.index, 6, item["admission_type"])
        self.index += 1
        # saving after every item is wasteful but keeps partial data on disk
        # even if the crawl dies halfway through
        self.workbook.save("demo.xls")
        return item
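If saving the workbook on every item turns out to be too slow, a possible variant (a sketch, not what the project actually does) is to remove the save from process_item and instead add a close_spider method to EduspiderPipeline, which Scrapy calls once when the crawl finishes:

    def close_spider(self, spider):
        # write the file a single time, once all items have been processed
        self.workbook.save("demo.xls")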
Core configuration: settings.py
BOT_NAME = 'eduSpider'
SPIDER_MODULES = ['eduSpider.spiders']
NEWSPIDER_MODULE = 'eduSpider.spiders'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'eduSpider.pipelines.EduspiderPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'eduSpider.middlewares.RandomUserAgent': 1,
}
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]
PROXIES = [
    {'ip_port': '111.8.60.9:8123', 'user_passwd': 'user1:pass1'},
    {'ip_port': '101.71.27.120:80', 'user_passwd': 'user2:pass2'},
    {'ip_port': '122.96.59.104:80', 'user_passwd': 'user3:pass3'},
    {'ip_port': '122.224.249.122:8088', 'user_passwd': 'user4:pass4'},
]
Project launcher: run.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'eduInformation'])
Excel sorting script (the crawled data comes back out of order, so it needs a tidy-up pass):
import xlrd
import xlwt

result = xlwt.Workbook('utf-8')
result_sheet = result.add_sheet('0', cell_overwrite_ok=True)

wb = xlrd.open_workbook('./demo.xls')
sheet = wb.sheet_by_index(0)

# read every data row (skipping the header), dropping exact duplicates
content_list = []
for i in range(1, sheet.nrows):
    content_dic = {}
    content_dic["institution_code"] = sheet.cell(i, 0).value
    content_dic["institution_name"] = sheet.cell(i, 1).value
    content_dic["discipline_code"] = sheet.cell(i, 2).value
    content_dic["discipline_name"] = sheet.cell(i, 3).value
    content_dic["batch"] = sheet.cell(i, 4).value
    content_dic["plan_type"] = sheet.cell(i, 5).value
    content_dic["admission_type"] = sheet.cell(i, 6).value
    if content_dic not in content_list:
        content_list.append(content_dic)

# sort by institution code, ascending
content_list = sorted(content_list, key=lambda r: r['institution_code'], reverse=False)

result_sheet.write(0, 0, "院校代码")
result_sheet.write(0, 1, "院校")
result_sheet.write(0, 2, "专业代码")
result_sheet.write(0, 3, "招生专业")
result_sheet.write(0, 4, "专业批次")
result_sheet.write(0, 5, "计划类别")
result_sheet.write(0, 6, "招生科类")

# write rows starting at row 1; the range runs to len(content_list) + 1
# so the last record is not silently dropped
for i in range(1, len(content_list) + 1):
    result_sheet.write(i, 0, content_list[i - 1]["institution_code"])
    result_sheet.write(i, 1, content_list[i - 1]["institution_name"])
    result_sheet.write(i, 2, content_list[i - 1]["discipline_code"])
    result_sheet.write(i, 3, content_list[i - 1]["discipline_name"])
    result_sheet.write(i, 4, content_list[i - 1]["batch"])
    result_sheet.write(i, 5, content_list[i - 1]["plan_type"])
    result_sheet.write(i, 6, content_list[i - 1]["admission_type"])

result.save("文史.xls")
Results