Introduction
First, here is the video link: see the video for a detailed walkthrough.
Code
Directory structure:
first.py
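import scrapy

from Boss.items import BossItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        # Each child <div> of the container holds one thumbnail.
        div_list = response.xpath('//div[@id = "container"]/div')
        for div in div_list:
            # The image address is read from the src2 attribute, not src.
            src2 = div.xpath('./div/a/img/@src2').extract_first()
            if src2 is None:
                continue  # skip entries without an image address
            # Drop the thumbnail suffix after '_' to get the full-size URL.
            src = ('http:' + src2).split('_')[0] + '.jpg'
            item = BossItem()
            item['srcs'] = src
            yield item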
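The URL rewriting inside parse can be tried on its own, without running the crawler. The following is a minimal sketch: the function name to_full_size_url and the sample address are hypothetical, chosen only to illustrate the thumbnail naming pattern the spider assumes (a "_suffix" before the extension).

```python
def to_full_size_url(src2: str) -> str:
    """Mirror the spider's logic: add a scheme and drop the thumbnail suffix."""
    url = "http:" + src2      # src2 is protocol-relative, e.g. //host/path
    base = url.split("_")[0]  # keep everything before the first underscore
    return base + ".jpg"


# Hypothetical sample value, not taken from the real site.
sample = "//img.example.com/pic/apic12345_s.jpg"
print(to_full_size_url(sample))  # http://img.example.com/pic/apic12345.jpg
```

Note that split('_')[0] cuts at the first underscore anywhere in the address, so a URL with an underscore in its host or directory names would be truncated too early. In that case url.rsplit('_', 1)[0] would probably be the safer choice.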
items.py
import scrapy


class BossItem(scrapy.Item):
    srcs = scrapy.Field()
pipelines.py
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class imgsPipleLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue one download request for the image URL stored on the item.
        print(item['srcs'])
        yield scrapy.Request(item['srcs'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the saved file after the last segment of the image URL.
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        # Pass the item on to any later pipeline.
        return item
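file_path above uses the last path segment of the request URL as the file name. A small standalone sketch of that idea follows. image_name_from_url is a hypothetical helper, and unlike the pipeline code it also drops any query string, which is an addition of this sketch rather than something the original does.

```python
from posixpath import basename
from urllib.parse import urlparse


def image_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    return basename(urlparse(url).path)


# Hypothetical sample URL.
print(image_name_from_url("http://img.example.com/pic/apic12345.jpg?v=2"))
# apic12345.jpg
```

The value returned by file_path is treated as a path relative to IMAGES_STORE, so with the settings in this post the sample above would end up at ./imaged/apic12345.jpg.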
settings.py
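BOT_NAME = 'Boss'
SPIDER_MODULES = ['Boss.spiders']
NEWSPIDER_MODULE = 'Boss.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
# Follow HTTP redirects when downloading media files.
MEDIA_ALLOW_REDIRECTS = True
ROBOTSTXT_OBEY = False
# Only log messages at ERROR level or above.
LOG_LEVEL = 'ERROR'
# Enable the custom images pipeline.
ITEM_PIPELINES = {
    'Boss.pipelines.imgsPipleLine': 300,
}
# Directory where downloaded images are stored.
IMAGES_STORE = './imaged'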
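Problems I ran into
1. Problem 1: importing the items module failed at runtime with a "no module" error
Mark the crawler project's root directory as a resources root: select the directory, right-click, and choose Mark Directory as > Resources Root.
2. Problem 2: the image folder was created, but it stayed empty
Set MEDIA_ALLOW_REDIRECTS = True in settings.py.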
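When the folder stays empty, it can also help to inspect the results argument that Scrapy passes to item_completed. It is a list of (success, info) pairs, where info is a dict with keys such as url and path for a successful download, and a failure object otherwise. The sketch below is a hypothetical helper (summarize_results is not part of Scrapy) that works on plain tuples, so it can be tried without a crawl. A plain Exception stands in for the failure object here.

```python
def summarize_results(results):
    """Split media-pipeline results into saved paths and failures."""
    saved = [info["path"] for ok, info in results if ok]
    failed = [info for ok, info in results if not ok]
    return saved, failed


# Hand-made sample data in the same shape as the results list.
sample = [
    (True, {"url": "http://img.example.com/a.jpg", "path": "a.jpg"}),
    (False, Exception("download failed")),  # stand-in for a failure object
]
saved, failed = summarize_results(sample)
print(saved, len(failed))  # ['a.jpg'] 1
```

Calling such a helper from item_completed and printing the failures is one way to see why nothing was saved. Because LOG_LEVEL is set to 'ERROR' in settings.py, messages below that level are suppressed, so download warnings may not show up in the console.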