[Python Knowledge Base] Scrapy Basics


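0. Using the shell

The Scrapy shell is typically used to test page parsing before writing the spider.

scrapy shell <url>  # fetch the URL and open an interactive shell (IPython is used when it is installed)

scrapy shell http://quotes.toscrape.com

The shell provides the following objects and functions:

request  # the request that was sent

response  # the result of the request; it is mostly parsed with XPath

fetch()  # sends another request and updates request and response

shelp()  # prints the help text

Locating data

txtList = response.xpath("//div[@class='quote']/span[1]/text()")  # selector list for the text nodes matching //div[@class='quote']/span[1]

txtList = response.xpath("//div[@class='quote']/span[1]/text()").extract()  # the values of those nodes, as a list of strings

General pattern: the path ends in text() or @attrname, and the result is read with extract(), extract_first() or re(regex). Note that XPath contains() tests for a substring, not a regular expression.

response.xpath("//tag[@attr='value']/tag[contains(@attr, 'substring')]/tag[index]/text()").extract()

response.xpath("//li/a/text()").re("next")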
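The difference between extract(), extract_first() and re() is easier to see on a small input. The sketch below is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports a subset of XPath (no text() step, so the text is read from the .text attribute), purely so that it runs without Scrapy installed. The HTML fragment is a made-up stand-in shaped like the quotes page.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like a quotes.toscrape.com page.
HTML = """
<html><body>
  <div class="quote"><span>Quote one</span><span>by <small>Author A</small></span></div>
  <div class="quote"><span>Quote two</span><span>by <small>Author B</small></span></div>
  <ul><li class="next"><a href="/page/2/">next page</a></li></ul>
</body></html>
"""

root = ET.fromstring(HTML)

# Analogue of response.xpath("//div[@class='quote']/span[1]/text()").extract():
# every match, as a list of strings.
texts = [span.text for span in root.findall(".//div[@class='quote']/span[1]")]

# Analogue of .extract_first(): the first match, or None when nothing matches.
first = texts[0] if texts else None
missing = root.find(".//div[@class='nope']/span[1]")

# Analogue of response.xpath("//li/a/text()").re("next"):
# apply a regular expression to each matched text and collect the hits.
link_texts = [a.text for a in root.findall(".//li/a")]
hits = [m for text in link_texts for m in re.findall("next", text)]

print(texts)    # ['Quote one', 'Quote two']
print(first)    # Quote one
print(missing)  # None
print(hits)     # ['next']
```

In the Scrapy shell the same checks would be made directly on response, which also accepts the full text() and @attr steps.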

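1. Switch to the working drive

2. Create the project

scrapy startproject projectName

3. Create the spider from inside the project directory

cd projectName

scrapy genspider spidername quotes.toscrape.com

4. Open the code in an editor

In Spyder, create a new project, choose "Existing directory", and select the project directory created above.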

5.0 Define the item to store; the following goes in items.py

import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    message = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

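5.1 Write the spider; the following goes in spidername.py

Open the "spiders" folder and edit spidername.py.

Put the URLs to visit first in start_urls = [].

Parse the page in parse(self, response).

To get editor auto-completion on response, you can temporarily bind a response variable:

response = scrapy.http.HtmlResponse(url="http://quotes.toscrape.com")  # temporary, for auto-completion only; comment it out once the code is written

Information = response.xpath("//tag/tag[@attr='value']/text()")

Selecting nodes and their attributes

"//" selects descendants at any depth: the match can be a direct child or a deeper descendant such as a great-grandchild.

"/" selects direct children only.

"//div"  # every div element in the document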
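A small nested document makes the "//" versus "/" distinction concrete. This is a sketch using the standard library's xml.etree.ElementTree rather than Scrapy, so paths are written relative to the root element (".//div", "./div"); the document and its id values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one div nested inside another, two levels apart.
DOC = """
<body>
  <div id="outer">
    <section>
      <div id="inner"><p>deep</p></div>
    </section>
  </div>
</body>
"""

root = ET.fromstring(DOC)

# "//div": div elements at any depth below the starting node.
all_divs = [d.get("id") for d in root.findall(".//div")]

# "/div": only div elements that are direct children of the starting node.
child_divs = [d.get("id") for d in root.findall("./div")]

# The two can be mixed: any div at any depth, then only its direct p children.
direct_p = [p.text for p in root.findall(".//div/p")]

print(all_divs)    # ['outer', 'inner']
print(child_divs)  # ['outer']
print(direct_p)    # ['deep']
```

The outer div has no direct p child, so only the paragraph under the inner div is returned by the last query.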

Common patterns (contains() matches a substring, not a regular expression):

response.xpath("//tag[@attr='value']/a[contains(@href, 'image')]/tag[index]/text()").extract()

response.xpath("//tag[@attr='value']/tag[index]/text()").extract_first()

response.xpath("//tag[@attr='value']/tag[index]/@attrname").re(regex)

response.xpath("//tag[@attr='value']/tag[index]/@attrname").extract_first()

Information[0].extract()  # text of the first node

Information.extract_first()  # text of the first node in the list, or None if the list is empty; Information is a selector list

yield dict  # dict is a dictionary of the form {key: value}

url = response.xpath("//tag//tag[index]/@attrname").extract_first()

yield scrapy.http.Request(url, callback=method)  # returns a request; Scrapy downloads it and passes the response to method. The default callback is parse, so it can be omitted

For example:

for divNode in divList:
    msg = divNode.xpath("span[1]/text()").extract_first()
    author = divNode.xpath("span[2]/small/text()").extract_first()
    yield {'message': msg, 'author': author}

If an item class has been defined, yield the item instead:

yield QuoteItem(message=msg, author=author, tags=keywords)

The complete code is as follows:

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # response = scrapy.http.HtmlResponse(url="http://quotes.toscrape.com")  # temporary, for auto-completion only
        # Each quote on the page sits in its own <div class="quote">
        divList = response.xpath("//div[@class='quote']")
        for divNode in divList:
            msg = divNode.xpath("span[1]/text()").extract_first()
            author = divNode.xpath("span[2]/small/text()").extract_first()
            yield {'message': msg, 'author': author}

        # Follow the link to the next page, if there is one
        href = response.xpath("//li[@class='next']/a/@href").extract_first()
        if href is not None:
            yield scrapy.http.Request("http://quotes.toscrape.com" + href)
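The spider builds the next-page URL by concatenating the site root and the href, which works only while the href starts with "/". A more general approach is to resolve the href against the URL of the current page. The sketch below shows this with the standard library's urllib.parse.urljoin; next_page_url is a hypothetical helper name, and the sample hrefs are assumptions.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href replaces the whole path of the current page.
print(next_page_url(page_url, "/page/2/"))   # http://quotes.toscrape.com/page/2/

# A relative href is resolved against the current page's path;
# plain concatenation with the site root would produce a broken URL here.
print(next_page_url(page_url, "../2/"))      # http://quotes.toscrape.com/page/2/

# An absolute href is returned unchanged.
print(next_page_url(page_url, "http://example.com/x"))  # http://example.com/x
```

Inside a spider, response.urljoin(href) performs the same resolution against the response's own URL, and response.follow(href, callback=self.parse) builds the request from a relative href directly.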

5.2 Write and enable the item pipeline

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'quoteDemo.pipelines.QuotedemoPipeline': 300,
}

Write the pipeline logic in pipelines.py:

import pandas as pd

from quoteDemo.items import QuoteItem

# dbcon is assumed to be a database connection (or SQLAlchemy engine) created elsewhere; the original does not show it

class QuotedemoPipeline:
    # called once when the spider starts
    def open_spider(self, spider):
        self.quoteList = []  # buffer for scraped items

    # called once when the spider closes
    def close_spider(self, spider):
        df = pd.DataFrame([dict(item) for item in self.quoteList])  # load the buffered items into a DataFrame
        df.to_sql("quote", dbcon, if_exists='append')  # append the DataFrame to the database table quote

    # called for every item the spider yields
    def process_item(self, item, spider):
        if isinstance(item, QuoteItem):
            self.quoteList.append(item)  # buffer the item returned by the spider
        return item  # pass the item on to any later pipeline
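The pipeline above depends on pandas and on a dbcon connection that the text does not define. As a sketch of the same open_spider / process_item / close_spider lifecycle using only the standard library, the following version stores quotes with sqlite3. The class name SqliteQuotePipeline, the file name quotes.db and the two-column table layout are assumptions made for illustration, not part of the original project.

```python
import sqlite3


class SqliteQuotePipeline:
    """Hypothetical pipeline: buffers items, writes them to SQLite on close."""

    def __init__(self, db_path="quotes.db"):
        self.db_path = db_path  # assumed file name, not from the original

    def open_spider(self, spider):
        # Runs once when the spider starts: open the database, reset the buffer.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quote (message TEXT, author TEXT)"
        )
        self.rows = []

    def process_item(self, item, spider):
        # Runs for every item the spider yields: buffer it and pass it on.
        self.rows.append((item.get("message"), item.get("author")))
        return item

    def close_spider(self, spider):
        # Runs once when the spider closes: write all buffered rows, clean up.
        self.conn.executemany("INSERT INTO quote VALUES (?, ?)", self.rows)
        self.conn.commit()
        self.conn.close()
```

Because the methods only call item.get(), the sketch accepts both plain dicts and scrapy.Item objects. It would be enabled through ITEM_PIPELINES in the same way as the pipeline above.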

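Before running the spider, it is a good idea to test with scrapy shell <url>.

To test:

Open a console and change to the project directory.

Enter scrapy shell <url>

For example: scrapy shell http://quotes.toscrape.com

To leave the shell, enter exit.

To run:

scrapy crawl spidername -o filename -t filetype  # run (crawl) the spider spidername and write the output to filename in the format filetype; with -o alone, Scrapy infers the format from the file extension, e.g. -o quotes.json

scrapy crawl spidername  # run the spider without an output file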

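6. Configure settings.py

DOWNLOAD_DELAY = 3  # wait 3 seconds between downloads from the same site

# The download delay setting will honor only one of:

CONCURRENT_REQUESTS_PER_DOMAIN = 2  # at most 2 concurrent requests per domain

CONCURRENT_REQUESTS_PER_IP = 3  # at most 3 concurrent requests per IP address; when non-zero, this is used instead of the per-domain limit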
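By default Scrapy does not wait exactly DOWNLOAD_DELAY seconds between requests: with RANDOMIZE_DOWNLOAD_DELAY enabled (its default), the wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only models that range to show what a delay of 3 means in practice; randomized_delay is a hypothetical helper, not Scrapy's own implementation.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, as in the settings above


def randomized_delay(base_delay, rng=random):
    """Model of a randomized wait: uniform between 0.5x and 1.5x the base."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [randomized_delay(DOWNLOAD_DELAY, rng) for _ in range(5)]

# Every sample falls between 1.5 and 4.5 seconds for a base delay of 3.
for s in samples:
    print(f"{s:.2f} s")
```

Setting RANDOMIZE_DOWNLOAD_DELAY = False in settings.py makes the wait a fixed DOWNLOAD_DELAY instead.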

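Complete Scrapy walkthrough; the steps below are best done with Spyder plus cmd

Create the project
1. Change to the working drive

For example: D:

2. Create the project

For example: scrapy startproject projectname

Create the spider
1. Change to the project directory

cd <project directory>

2. Create the spider

For example: scrapy genspider <spider name> <domain to crawl>

For example: scrapy genspider -t <template name> <spider name> <domain to crawl>

Edit the spider in its spider class; spiders normally live in the project's spiders folder

3.1 Set the allowed domain, normally the domain part of the URL being crawled

allowed_domains = ['vip.stock.finance.sina.com.cn']

3.2 Set the start URL

start_urls = ['<URL of the "历史分红 - 数据中心 - 新浪财经" page>']  # the Sina Finance historical dividends page

3.3 Test the page's tags, normally in the shell that Scrapy provides; you can write code and paste it in as you go

scrapy shell <url>  # the URL is normally the start URL

3.4 Write the code that parses the page

def parse(self, response):
    # page parsing code goes here
    # find the data to scrape and yield items
    # find the links to follow and yield requests
    pass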

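Edit the item in items.py

Copy the default field line once for each data field on the page, and name each field after the value it holds.

For example:

import scrapy

class SinaItem(scrapy.Item):
    # define the fields for your item here like:
    daima = scrapy.Field()            # stock code
    mingcheng = scrapy.Field()        # name
    shangshishijian = scrapy.Field()  # listing date
    leijiguxi = scrapy.Field()        # cumulative dividends
    nianjunguxi = scrapy.Field()      # average annual dividend
    fenhongcishu = scrapy.Field()     # number of dividend payouts
    rongzizonge = scrapy.Field()      # total financing amount
    rongzicishu = scrapy.Field()      # number of financing rounds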

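Edit the pipeline

5.1 Write the pipeline logic

import pandas as pd

from sina.items import SinaItem

# dbcon is assumed to be a database connection (or SQLAlchemy engine) created elsewhere; the original does not show it

class SinaPipeline:
    def open_spider(self, spider):
        self.Link = []  # buffer for scraped items

    def close_spider(self, spider):
        self._flush()  # write whatever is still buffered

    def process_item(self, item, spider):
        if not isinstance(item, SinaItem):
            return item
        self.Link.append(item)
        if len(self.Link) > 1024:
            self._flush()  # write in batches to bound memory use
        return item

    def _flush(self):
        if self.Link:
            df = pd.DataFrame([dict(item) for item in self.Link])
            df.to_sql('bonus', dbcon, if_exists='append')
            self.Link = []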
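The pipeline above writes to the database in batches once more than 1024 items have accumulated, rather than once per item. A dependency-free sketch of that batching pattern follows. BatchBuffer and sink are hypothetical names, and the batch size of 3 is chosen only to keep the demonstration short. The sketch also flushes the remainder at the end, which batching needs so that the last partial batch is not lost.

```python
class BatchBuffer:
    """Hypothetical helper: collects items and hands them to `sink` in batches."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink      # callable that receives a list of items
        self.items = []

    def add(self, item):
        # Buffer the item; write out as soon as a full batch is available.
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write only when there is something to write, then start a new batch.
        if self.items:
            self.sink(self.items)
            self.items = []


written = []  # stands in for the database: records the size of each write
buf = BatchBuffer(batch_size=3, sink=lambda batch: written.append(len(batch)))

for i in range(7):
    buf.add(i)   # what process_item would do for each item
buf.flush()      # what close_spider would do once at the end

print(written)   # [3, 3, 1]
```

Seven items with a batch size of 3 produce two full writes and one final write of the single leftover item.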

5.2 Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'sina.pipelines.SinaPipeline': 300,
}

Edit the configuration

# enable the download delay
DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

# lower the request rate
CONCURRENT_REQUESTS_PER_DOMAIN = 1

Run the crawl

scrapy crawl <spider name>


Added: 2022-01-01 13:51:09  Updated: 2022-01-01 13:51:54
