Deep crawling with CrawlSpider
What is CrawlSpider:
CrawlSpider is itself a spider, a subclass of Spider, so it is more powerful than a plain Spider. The extra capability is link extraction: it extracts the specified links from a page according to a set of rules.
The link extractor:
LinkExtractor(
    allow = xxx,            # regex(es): only extract links whose URL matches
    deny = xxx,             # regex(es): skip links whose URL matches
    restrict_xpaths = xxx,  # only extract links from page regions matched by these XPath expressions
    restrict_css = xxx,     # only extract links from page regions matched by these CSS selectors
    deny_domains = xxx,     # domains whose links are never extracted
)
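For orientation, a small sketch of how these parameters can be combined; the XPath and the excluded domain below are illustrative assumptions, not values from the project that follows:

from scrapy.linkextractors import LinkExtractor

pager_links = LinkExtractor(
    allow=r'/text/page/\d+/',                     # keep only pagination URLs
    restrict_xpaths='//ul[@class="pagination"]',  # hypothetical pager container
    deny_domains=('ads.example.com',),            # hypothetical domain to skip
)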
Commands to run:
- scrapy startproject news
- cd news
- scrapy genspider -t crawl qiubai www.qiushibaike.com
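These commands produce the standard Scrapy project skeleton (file names follow the default template):

news/
├── scrapy.cfg
└── news/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── qiubai.py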
Example code:
In items.py:
import scrapy


class NewsItem(scrapy.Item):
    icon_url = scrapy.Field()
    username = scrapy.Field()
    age = scrapy.Field()
    content = scrapy.Field()
    haha_count = scrapy.Field()
    comment_count = scrapy.Field()
In settings.py:
BOT_NAME = 'news'
LOG_LEVEL = 'ERROR'
SPIDER_MODULES = ['news.spiders']
NEWSPIDER_MODULE = 'news.spiders'
USER_AGENT = 'replace with your own User-Agent'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.6
ITEM_PIPELINES = {
    'news.pipelines.NewsPipeline': 300,
}
In pipelines.py:
import json

from itemadapter import ItemAdapter


class NewsPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.fp = open('qiubai.txt', 'w', encoding='utf8')

    def process_item(self, item, spider):
        # serialize each item as one JSON line
        dic = dict(item)
        strin = json.dumps(dic, ensure_ascii=False)
        self.fp.write(strin + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
In qiubai.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from news.items import NewsItem


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        # follow every pagination link and parse each page with parse_item
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        content_div = response.xpath('//*[@id="content"]/div/div[2]/div')
        for content_d in content_div:
            item = NewsItem()
            icon_url = content_d.xpath('.//div/a/img/@src').extract_first()
            icon_url = 'https:' + icon_url
            username = content_d.xpath('.//div/a[2]/h2/text()').extract_first().strip('\n')
            age = content_d.xpath('.//div/div/text()').extract_first()
            content = content_d.xpath('.//a[1]/div[@class="content"]/span[1]').xpath('string(.)').extract_first()
            haha_count = content_d.xpath('.//div[2]/span[1]/i/text()').extract_first()
            comment_count = content_d.xpath('.//div[2]/span[2]/a/i/text()').extract_first()
            item['icon_url'] = icon_url
            item['username'] = username
            item['age'] = age
            item['content'] = content.strip('\n')
            item['haha_count'] = haha_count
            item['comment_count'] = comment_count
            yield item
No matter how the link extractor extracts links, duplicate links are deduplicated automatically. In scrapy shell you can try it like this:
from scrapy.linkextractors import LinkExtractor
link = LinkExtractor(allow=r'/text/page/\d+/')
link.extract_links(response)
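extract_links() returns a list of Link objects with the duplicates already removed; a rough shell session (sketch only, output omitted):

$ scrapy shell https://www.qiushibaike.com/text/
>>> from scrapy.linkextractors import LinkExtractor
>>> link = LinkExtractor(allow=r'/text/page/\d+/')
>>> links = link.extract_links(response)   # list of Link objects, deduplicated
>>> [l.url for l in links]                 # absolute URLs of the matched pagination links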
Notes:
- One link extractor corresponds to one rule parser (Rule); multiple link extractors correspond to multiple Rules. For example:
link1 = LinkExtractor(allow=r'/text/page/\d+/')
link2 = LinkExtractor(allow=r'/text/page/\d+/')
rules = (
    # each extractor gets its own Rule; in practice link2 would normally use a different pattern or callback
    Rule(link1, callback='parse_item', follow=True),
    Rule(link2, callback='parse_item', follow=True),
)
- To implement deep (multi-level) crawling, the rules are combined with manually yielded scrapy.Request() calls; see the sketch below.
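A minimal sketch of that pattern: the Rule handles the horizontal pagination, while scrapy.Request() follows each entry to a detail page and carries the half-filled item along via meta. The detail-page XPaths and the spider name here are assumptions for illustration, not part of the project above:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from news.items import NewsItem


class QiubaiDeepSpider(CrawlSpider):
    name = 'qiubai_deep'
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        # horizontal level: pagination links
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for content_d in response.xpath('//*[@id="content"]/div/div[2]/div'):
            item = NewsItem()
            item['username'] = content_d.xpath('.//div/a[2]/h2/text()').extract_first()
            # hypothetical link to the entry's detail page (assumed XPath)
            detail_url = response.urljoin(content_d.xpath('.//a[1]/@href').extract_first())
            # vertical (deep) level: a manual Request, passing the item through meta
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        # assumed XPath for the full text on the detail page
        item['content'] = response.xpath('//div[@class="content"]').xpath('string(.)').extract_first()
        yield item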