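Web crawlers:
A script that imitates a browser to fetch web page content automatically.
The main tools are the browser's built-in packet capture (the Network panel in the developer tools), the requests module, the BeautifulSoup module, and the re module.
I. Disguise
1. Why disguise is needed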
import requests
url='http://www.baidu.com'
header={'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 Edg/94.0.992.38'}
response=requests.get(url)
response1=requests.get(url,headers=header)
print(len(response.content.decode()))
print(len(response1.content.decode()))
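The output shows that without the disguise the response is only 2287 characters long, while with the disguise it is 295758 characters long.
2. Request headers (headers)
headers is a dict; in most cases it only needs two entries, Cookie and User-Agent.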
import requests
url='https://github.com/Khazing'
header={'Cookie':'_octo=GH1.1.1409507418.1634466346; _device_id=657e29e120e5f4c50fd8f575dc1651eb; user_session=0cgOhLVHLt1AQzVFHGVYnBPb0yDinMqmr0PNWSNjZfSAY9ww; __Host-user_session_same_site=0cgOhLVHLt1AQzVFHGVYnBPb0yDinMqmr0PNWSNjZfSAY9ww; logged_in=yes; dotcom_user=Khazing; has_recent_activity=1; color_mode=%7B%22color_mode%22%3A%22auto%22%2C%22light_theme%22%3A%7B%22name%22%3A%22light%22%2C%22color_mode%22%3A%22light%22%7D%2C%22dark_theme%22%3A%7B%22name%22%3A%22dark%22%2C%22color_mode%22%3A%22dark%22%7D%7D; _gh_sess=RjgJAxoXBWueK4CHLzr7hmxOC%2BW0GSlqophwVtis534lsLS%2FN3PZ6eeBcmrcIstWJCxyKFCu51v3muGPySlsP%2BrCPuTi%2Bl%2BfbKVKWSA6UeyXWm3PnLnGo6hQz1GRf1MsZ5fGGb8%2BBRdQmM9NmBzq9dx0Y9PDwjO1j160tc9yrb2euaiP4B%2Bp%2BsuCo7X9MoId--bxhgtt5nehk5Qc%2FH--2qKYAo1TVPp0HgMBpQPuqw%3D%3D'
,'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 Edg/94.0.992.38'}
response1=requests.get(url,headers=header)
User-Agent
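Tells the server what kind of client sent the request, for example a PC browser or an Android browser.
You can write a script that returns a random User-Agent to disguise the headers: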
import random
user_agent = [
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
# iPhone 6
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
# Newer mobile UA
"Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
]
# Return a request header with a randomly chosen User-Agent
def get_headers():
    return {'User-Agent': random.choice(user_agent)}
cookies
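Cookies keep you logged in, so you can crawl information that is tied to a specific user.
Usage: send the cookie once to log in, then switch to a session, which remembers the cookies the server sets and sends them on later requests:
session=requests.session()
response=session.get(url)
response=session.post(url,data=data) #data is the form data dict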
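To see what the session does for you, the sketch below stores a cookie in a requests.Session and prepares a request without sending it, so no network access is needed. The cookie name sessionid, its value, and the example.com URL are placeholders, not values from any real site.

```python
import requests

# A Session keeps a cookie jar that is shared by every request made through it.
session = requests.Session()

# Placeholder cookie; in real use the server sets it in the login response.
session.cookies.set("sessionid", "abc123")

# prepare_request() builds the request as the session would send it,
# but does not touch the network.
request = requests.Request("GET", "http://example.com/profile")
prepared = session.prepare_request(request)

cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

The Cookie header was never written by hand here: the session added it, which is why a logged-in session can keep requesting user-specific pages.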
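3. Parameters (params)
params is a dict; its entries are the query string parameters shown for the request in the browser's packet capture.
Note that some of these parameters are randomly generated as an anti-crawler measure and cannot be constructed by hand.
data and params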
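The difference between the two is where requests puts them: params goes into the URL as the query string, data goes into the request body as form data. A small sketch, using placeholder URLs and made-up field names, that prepares both kinds of request without sending them:

```python
import requests

# params: requests appends these to the URL as the query string.
get_request = requests.Request(
    "GET",
    "http://example.com/search",
    params={"q": "python", "page": 2},
).prepare()
print(get_request.url)

# data: requests form-encodes these and places them in the request body.
post_request = requests.Request(
    "POST",
    "http://example.com/login",
    data={"user": "alice", "password": "secret"},
).prepare()
print(post_request.url)
print(post_request.body)
```

In everyday use the same dicts are passed straight to requests.get(url, params=...) and requests.post(url, data=...).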
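4. Proxies (proxies)
proxies exists so that the server does not see one IP sending requests nonstop, conclude it is a crawler, and ban it.
Proxy types and usage:
proxies is a dict that maps each protocol to the address of the proxy to use: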
import requests
url='http://youtube.com'
Proxy={
'http':'http://103.138.164.106:80',
'https':'https://61.133.87.228:55443'
}
response=requests.get(url,proxies=Proxy)
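5. Time interval
Sending requests back to back without a pause is very likely to get you flagged as a crawler, since no human could do that, so add a delay between requests as part of the disguise.
import time
n=2 #n is the interval in seconds
time.sleep(n)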
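A fixed interval is itself a pattern, so one common variation is to randomize the delay. The sketch below is one way to do that; the helper name polite_sleep and the delay bounds are made up for illustration, and the list of pages stands in for real requests.

```python
import random
import time


def polite_sleep(min_seconds, max_seconds):
    """Sleep for a random time in [min_seconds, max_seconds]; return the delay."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Stand-in for a list of pages to fetch; the real request would go in the loop.
pages = ["page-1", "page-2", "page-3"]
delays = []
for page in pages:
    # ... send the request for `page` here ...
    delays.append(polite_sleep(0.01, 0.05))  # tiny values so the demo runs fast

print(delays)
```

For a real crawl the bounds would be in the range of seconds rather than hundredths of a second.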
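II. Simulated login
The selenium module is a browser automation module. For crawling it offers two conveniences: easy access to dynamically loaded data and easy simulated login.
CAPTCHAs
Download the CAPTCHA image → call an online CAPTCHA recognition API → get the CAPTCHA text
III. Sending a request to get the page source
There are two methods, GET and POST.
GET parameters can usually be found in the browser's address bar, or in the query string of the captured request. GET is used for most requests.
POST parameters do not appear in the address bar; look at the form data of the captured request. Logging in uses POST.
Data types of the response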
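#assumes the random User-Agent script above is saved as UserAgent.py and imported
#response is now a requests.Response object
response = requests.get(url = url,headers=UserAgent.get_headers())
#response is now the page source as str
response = requests.get(url = url,headers=UserAgent.get_headers()).text
#response is now the page source as bytes
response = requests.get(url = url,headers=UserAgent.get_headers()).content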
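Things to watch out for
1. Dynamically loaded data
The page source and the content of the Elements panel in F12 may differ, because much of the data is loaded dynamically.
What the requests module fetches is the page source, which does not include dynamically loaded data, while the Elements panel shows the page after the browser has rendered it, which does.
The resources you want to crawl are often dynamically loaded, so always check whether they are before you start.
Example:
The Elements panel in F12 contains an mp4 resource,
but no mp4 can be found in the page source.
In that case, search the captured requests for "mp4".
Note, however, that the address in the JSON response is not necessarily the real one. Open the address from the JSON,
then open the address from F12, and compare the two.
What we can fetch is the address in the JSON, but the real address is the one in F12, so the JSON address has to be rewritten into the real one.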
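As a rough illustration of that last step, the sketch below parses a JSON body and rewrites the address it contains. Everything in it is hypothetical: the JSON layout, the field names systemTime and srcUrl, the content id, and the rule that the timestamp segment must be replaced with "cont-" plus the id. On a real site you would find the actual rule by comparing the JSON address with the F12 address.

```python
import json

# Hypothetical JSON body, standing in for what the captured request returns.
json_text = """
{
  "systemTime": "1690000000000",
  "videoInfo": {
    "srcUrl": "https://video.example.com/mp4/20240101/1690000000000-15-hd.mp4"
  }
}
"""

# Hypothetical content id, e.g. taken from the URL of the video's page.
content_id = "123"

data = json.loads(json_text)
json_url = data["videoInfo"]["srcUrl"]

# Assumed rule: the real address has "cont-<id>" where the JSON address
# has the timestamp. Replace that one segment and keep the rest.
real_url = json_url.replace(data["systemTime"], "cont-" + content_id)

print(json_url)
print(real_url)
```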
2. Garbled text (mojibake)
How to fix it
response = requests.get(url=url,headers=UserAgent.get_headers())
page_text = response.text.encode('iso-8859-1').decode('gbk')
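The line above works because requests falls back to ISO-8859-1 when the response headers declare a text content type without a charset, so a GBK page gets decoded with the wrong codec. The round trip can be reproduced without any network access; the sample string is arbitrary.

```python
original = "中文网页"

# What the server actually sends: the text encoded as GBK bytes.
raw_bytes = original.encode("gbk")

# What you see when those bytes are decoded with the wrong codec.
garbled = raw_bytes.decode("iso-8859-1")

# Re-encoding with the same wrong codec restores the bytes;
# decoding them as GBK then recovers the text.
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(garbled != original, repaired == original)
```

An alternative that avoids the round trip is to set response.encoding = 'gbk' before reading response.text.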
3. Pagination
(1) A for loop
(2) Locating the next-page link
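For the for-loop approach, a minimal sketch, assuming the site puts the page number in the URL. The URL pattern below is a placeholder; the real one comes from watching how the address changes when you click through the pages.

```python
# Hypothetical URL pattern: the page number is a query parameter.
base_url = "http://example.com/list?page={}"

# Pages 1 to 5; range() excludes its upper bound.
page_urls = [base_url.format(page) for page in range(1, 6)]

for page_url in page_urls:
    # ... request page_url and parse it here ...
    print(page_url)
```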
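IV. Locating resources
1. Regular expressions
Get the page source → build the regex → extract the list of resource URLs
Extracting the content inside tags
#re.sub(pattern, repl, string, count=n, flags=...): pattern is the regex, repl is the replacement, string is the text to process, count is the maximum number of replacements, flags controls how the regex is matched
content = re.sub(r'<(\S*?)[^>]*>.*?|<.*? />','',content_text)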
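The two steps can be tried on a small made-up page. This sketch uses a simpler tag-stripping pattern than the one above, and the HTML and URLs are invented for the demonstration.

```python
import re

# Made-up page source for the demonstration.
page_text = '''
<div class="list">
  <img src="http://example.com/img/a.jpg" alt="first">
  <img src="http://example.com/img/b.png" alt="second">
  <p>Hello <b>crawler</b></p>
</div>
'''

# The non-greedy group (.*?) captures only what sits between the quotes of src.
image_urls = re.findall(r'<img src="(.*?)"', page_text)
print(image_urls)

# Drop every tag, then collapse the leftover whitespace to get the visible text.
text_only = re.sub(r'<[^>]+>', '', page_text)
text_only = ' '.join(text_only.split())
print(text_only)
```

Regexes like these depend on the exact attribute order and quoting in the page, so they need adjusting for each site.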
2. BeautifulSoup
Instantiate the object → locate the elements → get the list of resource URLs
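The three steps might look like the sketch below, assuming the bs4 package is installed. The HTML, the class name list, and the URLs are made up.

```python
from bs4 import BeautifulSoup

# Made-up page source for the demonstration.
page_text = '''
<div class="list">
  <a href="/detail/1"><img src="http://example.com/img/a.jpg"></a>
  <a href="/detail/2"><img src="http://example.com/img/b.png"></a>
</div>
'''

# Instantiate the object; "html.parser" is the parser from the standard library.
soup = BeautifulSoup(page_text, "html.parser")

# Locate: first the container, then every <img> inside it.
container = soup.find("div", class_="list")
images = container.find_all("img")

# Collect the resource URLs from the src attribute.
image_urls = [img["src"] for img in images]
print(image_urls)
```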
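V. Requesting the URLs to download the resources
1. Single-threaded
The usual way: each task has to wait for the previous one to finish before it starts.
2. Multithreaded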
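One way to run downloads in parallel is a thread pool from the standard library; this is a sketch of that approach, not the only option. The download function here is a stand-in that sleeps instead of fetching, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download(url):
    """Stand-in for a real download: wait briefly, then return fake content."""
    time.sleep(0.05)  # simulates waiting on the network
    return "content of " + url


urls = ["http://example.com/img/{}.jpg".format(i) for i in range(8)]

start = time.perf_counter()
# Up to 4 downloads run at once; map() returns results in the order of urls.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download, urls))
elapsed = time.perf_counter() - start

print(len(results), round(elapsed, 2))
```

Because the threads spend most of their time waiting, the eight simulated downloads finish in roughly the time of two rather than eight.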
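VI. Storage
Note that data written to a file opened in binary mode ('wb') must be bytes.
.encode() encodes the data to bytes
.decode() decodes the data to str
Also pay attention to the file extension when saving: jpg/png for images, txt for articles, mp4 for videos.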
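A short sketch of the write path, using a temporary directory so nothing is left behind. The file name article.txt and the sample text are arbitrary.

```python
import os
import tempfile

text = "你好, crawler"

# str -> bytes before writing to a file opened in binary mode.
data = text.encode("utf-8")

# A temporary directory keeps the demo from leaving files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "article.txt")

    # "wb" = write binary; passing a str here would raise TypeError.
    with open(path, "wb") as f:
        f.write(data)

    # Read the bytes back and decode them to str again.
    with open(path, "rb") as f:
        restored = f.read().decode("utf-8")

print(restored == text)
```

For images and videos, response.content is already bytes and can be passed to f.write() directly.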
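Backstory: a few days ago I wanted to look for wallpapers on wallhaven, a site I had not visited in a long time, and found that the image previews would not load. From F12 it looked like the JS had been hijacked, and I needed a script to get them to show. Then, once I started browsing, saving images one at a time felt like too much work, so I put together a crawler. It really is convenient, and a lot of fun too.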