Study-note videos (bilibili)
- https://www.bilibili.com/video/BV1pt41137qK?p=10
- https://www.bilibili.com/video/BV1pt41137qK?p=11&spm_id_from=pageDriver
- https://www.bilibili.com/video/BV1pt41137qK?p=12&spm_id_from=pageDriver
- https://www.bilibili.com/video/BV1pt41137qK?p=13&spm_id_from=pageDriver
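Scale of web crawlers
- Requests library
  Scope: crawling individual web pages
  Characteristics: small scale, small data volume, crawl speed not critical
- Scrapy library
  Scope: crawling a website or a series of websites
  Characteristics: medium scale, larger data volume, crawl speed matters
- Custom development
  Scope: crawling the entire web
  Characteristics: large scale, used by search engines, crawl speed is critical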
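Problems caused by web crawlers
- [Legal risk] Data on a server has an owner; profiting from crawled data can create legal risk
- [Privacy leaks] A crawler may be able to get past simple access controls and obtain protected data, leaking personal information
- [Server harassment] A crawler's requests can burden a server much like an attack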
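How servers restrict crawlers
- Source checking: restrict by User-Agent
  Inspect the User-Agent field of the incoming HTTP request header and respond only to browsers or known friendly crawlers (similar to a whitelist)
- Public notice: the Robots protocol
  Tell all crawlers the site's crawling policy and ask them to comply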
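The User-Agent check described above can be sketched in a few lines. This is only an illustration, not how any particular site implements it: the marker list `ALLOWED_AGENT_MARKERS` and the helpers `is_allowed` and `respond` are hypothetical names, and real servers typically combine this check with other signals.

```python
# Hypothetical whitelist: substrings identifying browsers or crawlers the site welcomes.
ALLOWED_AGENT_MARKERS = ("mozilla", "baiduspider", "googlebot")


def is_allowed(headers):
    """Return True if the request's User-Agent contains a whitelisted marker."""
    user_agent = headers.get("User-Agent", "").lower()
    return any(marker in user_agent for marker in ALLOWED_AGENT_MARKERS)


def respond(headers):
    """Return an HTTP status code: 200 for whitelisted agents, 403 otherwise."""
    return 200 if is_allowed(headers) else 403


if __name__ == "__main__":
    print(respond({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}))
    print(respond({"User-Agent": "python-requests/2.31.0"}))
```

Because the header is supplied by the client, a crawler can simply send a browser-like User-Agent, so this check only filters out crawlers that identify themselves honestly.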
Robots protocol
- Robots Exclusion Standard
- Purpose: tells crawlers which content may be crawled and which may not
- Form: a robots.txt file in the site's root directory
Example, Baidu: https://www.baidu.com/robots.txt
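User-agent: * # applies to any crawler not named in another group
Disallow: / # no resource on the Baidu site may be accessed
User-agent: Baiduspider # rules for the Baiduspider crawler
Disallow: /baidu # paths beginning with /baidu may not be accessed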
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
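Rule groups like the ones listed above can be evaluated with Python's standard `urllib.robotparser` module. The sketch below parses a shortened, hand-written excerpt (`SAMPLE_ROBOTS_TXT`, an assumption for illustration, not the full Baidu file) so that it runs without network access; `build_parser` and the crawler name `MyCrawler` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt in the style of the listing above; not the complete file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Baiduspider
Disallow: /baidu
Disallow: /bh
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no HTTP request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Baiduspider has its own group: only the listed path prefixes are blocked.
print(parser.can_fetch("Baiduspider", "https://www.baidu.com/baidu"))       # False
print(parser.can_fetch("Baiduspider", "https://www.baidu.com/index.html"))  # True

# A crawler with no group of its own falls back to "User-agent: *".
print(parser.can_fetch("MyCrawler", "https://www.baidu.com/index.html"))    # False
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()`, which download the file before `can_fetch()` is called.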
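Using the Robots protocol
- Web crawlers: read robots.txt, automatically or manually, before crawling content
- Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk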
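One way a crawler might honor the protocol automatically is to locate robots.txt from the page URL and refuse to fetch anything the rules disallow. The sketch below is a minimal illustration under stated assumptions: `robots_url` and `fetch_if_allowed` are hypothetical helpers, and the actual download is passed in as a `fetch` callable so the example does not depend on any HTTP library.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt URL in the root directory of page_url's site."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_if_allowed(url, parser, user_agent, fetch):
    """Call fetch(url) only if the parsed rules permit user_agent to crawl it.

    Returns None when the rules disallow the URL.
    """
    if not parser.can_fetch(user_agent, url):
        return None
    return fetch(url)


if __name__ == "__main__":
    print(robots_url("https://www.baidu.com/s?wd=python"))

    # Hypothetical rules and a stand-in fetch function, for demonstration only.
    rules = RobotFileParser()
    rules.parse(["User-agent: *", "Disallow: /private"])
    fake_fetch = lambda url: "<html>stub</html>"

    print(fetch_if_allowed("https://example.com/public", rules, "MyCrawler", fake_fetch))
    print(fetch_if_allowed("https://example.com/private/a", rules, "MyCrawler", fake_fetch))
```

Nothing technical forces a crawler through a check like this, which is why the protocol is advisory: compliance is the crawler author's choice.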