Crawler Notes (2): The requests Module
01. requests Basics
**requests module:** a third-party Python library for sending network requests. It is powerful, simple to use, and efficient.
Purpose: simulate a browser sending requests.
- Usage / coding workflow
    - Specify the URL
    - Send the request with the requests module
    - Extract the data from the response object
    - Persist the data to storage
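The four steps above can be sketched end to end. To stay self-contained, this toy example (assuming `requests` is installed) spins up a throwaway local HTTP server instead of hitting a real site; everything except the server setup mirrors the workflow exactly:

```python
import threading
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        # serve a tiny fixed page for the demo
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(b'<html>hello</html>')

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Hello)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Step 1: specify the URL
url = f'http://127.0.0.1:{server.server_port}/'
# Step 2: send the request with requests
response = requests.get(url)
# Step 3: extract the data from the response object
page_text = response.text
# Step 4: persist it
with open('demo.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

server.shutdown()
```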
02. Hands-on Coding
1. Requirement: scrape the Sogou homepage
```python
import requests

if __name__ == "__main__":
    # Step 1: specify the URL
    url = 'https://www.sogou.com/'
    # UA spoofing: pretend to be a regular browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    # Step 2: send the GET request
    response = requests.get(url=url, headers=headers)
    # Step 3: take the page source from the response object
    page_text = response.text
    # Step 4: persist it to disk
    with open('sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Done crawling')
```
2. requests Consolidation Exercises (hands-on practice)
2.1 UA Spoofing
Requirement: scrape the Sogou search-results page for a user-supplied query (a simple web collector)
```python
import requests

if __name__ == "__main__":
    url = 'https://www.sogou.com/web'
    word = input('Enter a search term: ')
    # query parameters; requests URL-encodes them and appends them to the URL
    param = {
        'query': word
    }
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    response = requests.get(url=url, params=param, headers=headers)
    page = response.text
    filename = word + '.html'
    with open(filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print('Done crawling')
```
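Under the hood, `params` is URL-encoded and appended to the URL as a query string. You can inspect the final URL without sending anything by preparing the request:

```python
import requests

# Build (but do not send) the GET request to see how params are encoded
req = requests.Request('GET', 'https://www.sogou.com/web', params={'query': 'python'})
prepared = req.prepare()
print(prepared.url)  # https://www.sogou.com/web?query=python
```

Spaces and non-ASCII characters in the parameter values are percent-encoded automatically, which is why passing a dict via `params` is safer than concatenating the query string by hand.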
2.2 POST Requests (Carrying Parameters) with a JSON Response
Requirement: call Baidu Translate's suggestion interface
- POST request carrying form data
- the response body is JSON
```python
import requests
import json

if __name__ == "__main__":
    post_url = 'https://fanyi.baidu.com/sug'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    word = input('Enter a word: ')
    data = {
        'kw': word
    }
    # POST the form data; requests encodes the dict for us
    response = requests.post(url=post_url, data=data, headers=headers)
    # the endpoint returns JSON, so deserialize it directly
    page = response.json()
    filename = word + '.json'
    with open(filename, 'w', encoding='utf-8') as fp:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(page, fp=fp, ensure_ascii=False)
    print('Done crawling')
```
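The `ensure_ascii=False` argument matters whenever the JSON contains non-ASCII text: with the default `ensure_ascii=True`, every such character is written as a `\uXXXX` escape, which is hard to read in the saved file.

```python
import json

record = {'kw': 'dog', 'translation': '狗'}
escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep characters as-is
print(escaped)   # {"kw": "dog", "translation": "\u72d7"}
print(readable)  # {"kw": "dog", "translation": "狗"}
```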
2.3 Requirement: scrape the Douban movie category chart
```python
import requests
import json

if __name__ == "__main__":
    url = 'https://movie.douban.com/j/chart/top_list'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    # query parameters for the ranking interface
    param = {
        'type': '4',              # genre id
        'interval_id': '100:90',
        'action': '',
        'start': '0',             # index of the first movie to fetch
        'limit': '20',            # number of movies per request
    }
    response = requests.get(url=url, params=param, headers=headers)
    movie_list = response.json()  # renamed so it no longer shadows the built-in `list`
    with open('xiju.json', 'w', encoding='utf-8') as fp:
        json.dump(movie_list, fp=fp, ensure_ascii=False)
    print('Done')
```
Requirement: query KFC's store-list interface for restaurants matching a keyword (POST request; the response is JSON)
```python
import requests

if __name__ == "__main__":
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    data = {
        'cname': '',
        'pid': '',
        'keyword': '郑州',   # city to search (Zhengzhou)
        'pageIndex': '1',
        'pageSize': '10',
    }
    response = requests.post(url=url, data=data, headers=headers)
    # the interface returns JSON text, so save it with a .json extension
    store_text = response.text
    with open('kfc.json', 'w', encoding='utf-8') as fp:
        fp.write(store_text)
    print('Done')
```
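`response.json()` is essentially `json.loads(response.text)`: it parses the response body and raises an error if the body is not valid JSON, which makes it a quick sanity check that an interface really returned JSON. A toy illustration (the field names and store name here are made-up stand-ins, not the real interface's schema):

```python
import json

# what response.text might look like for a JSON endpoint...
body = '{"Table1": [{"storeName": "Demo Store"}]}'
# ...and what response.json() amounts to: deserializing that text
data = json.loads(body)
print(data['Table1'][0]['storeName'])  # Demo Store
```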
2.4 A Combined, More Complex Example
Requirement: scrape the cosmetics production licence data published by China's National Medical Products Administration, at http://scxk.nmpa.gov.cn:81/xk/
- The data is loaded dynamically
- The company list on the home page is requested via ajax, and so is each company's detail page, e.g.
    - http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=af4832c505b749dea76e22a193f873c6
    - http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=e48046ec68d34d4692abbb6e06373866
- The two URLs differ only in their id value, so the plan is: collect every ID from the list endpoint, then request each detail record by ID
```python
import requests
import json

if __name__ == "__main__":
    # list endpoint: each POST returns one page of companies
    url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    id_list = []        # company IDs collected from the list endpoint
    all_data_list = []  # detail records for every company
    # Phase 1: walk the first pages of the list and collect IDs
    for page in range(1, 10):
        data = {
            'on': 'true',
            'page': page,
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        dic = requests.post(url=url, data=data, headers=headers).json()
        # the inner loop variable must not reuse the outer one
        for item in dic['list']:
            id_list.append(item['ID'])
    # Phase 2: request each company's detail record by ID
    post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
    for company_id in id_list:
        data = {
            'id': company_id,
        }
        detail = requests.post(url=post_url, data=data, headers=headers).json()
        all_data_list.append(detail)
    with open('cosmetics.json', 'w', encoding='utf-8') as fp:
        json.dump(all_data_list, fp=fp, ensure_ascii=False)
    print('Done')
```
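The loops above send many separate POSTs, each opening a fresh connection. A `requests.Session` reuses connections between requests to the same host and lets shared headers be set once. This sketch only configures a session without sending anything:

```python
import requests

session = requests.Session()
# headers set on the session are sent with every subsequent request
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; demo)'})
print(session.headers['User-Agent'])
# calls made via session.post(...) / session.get(...) would now
# share the configured headers and reuse the underlying connection
session.close()
```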