[Network Protocols] Python Web Scraping: The requests HTTP Library

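1: Introduction

requests is a third-party HTTP library. Because the API exposed by urllib3 is awkward to use directly, requests wraps urllib3 in a friendlier interface, much like HttpClient in Java.
It supports the common request methods: GET, POST, PUT, DELETE, PATCH, OPTIONS, HEAD, and so on.
GitHub https://github.com/psf/requests
Official documentation https://requests.readthedocs.io/en/latest/
http://httpbin.org/ httpbin is a site for testing HTTP requests; you can send requests to it to experiment.
Install with pip: pip install requests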

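1.1 Prepared request: PreparedRequest

Parameter	Description
url	Request URL (every parameter other than url is optional).
params	Query string (dict, list, or tuple)
data	Request body (dict, list, tuple, bytes, or file object); a dict is sent as `"Content-Type": "application/x-www-form-urlencoded"`
json	JSON body, sent as `"Content-Type": "application/json"`
headers	Request headers (dict)
cookies	Cookies (`RequestsCookieJar`), commonly used to keep a user logged in
timeout	Timeout, in `seconds`
proxies	Proxies
verify	SSL certificate verification
cert	Client certificate
files	Files to upload
auth	Authentication mechanism, e.g. `Basic Auth`
allow_redirects	Whether to follow redirects
stream	Streamed request, mainly for working with streaming APIs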
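
The effect of these parameters can be inspected without sending anything over the network. The following is a minimal sketch, assuming requests is installed: `requests.Request(...).prepare()` builds the `PreparedRequest` that would be sent, so the influence of `params`, `data`, and `json` on the URL, body, and `Content-Type` is directly visible. The `page` parameter and the `demo-client` User-Agent are illustrative values only.

```python
import requests

# Build (but do not send) a form-encoded POST request.
form_req = requests.Request(
    'POST',
    'http://httpbin.org/post',
    params={'page': 1},                     # becomes the query string
    data={'key': 'value'},                  # becomes a form-encoded body
    headers={'User-Agent': 'demo-client'},  # illustrative value
)
form_prepared = form_req.prepare()

print(form_prepared.method)
print(form_prepared.url)
print(form_prepared.headers['Content-Type'])
print(form_prepared.body)

# The same request with json= instead of data= switches the Content-Type.
json_prepared = requests.Request(
    'POST', 'http://httpbin.org/post', json={'key': 'value'}
).prepare()
print(json_prepared.headers['Content-Type'])
```

A prepared request can later be sent with `Session.send()`, which is what `requests.get()` and friends do internally.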

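1.2 Response

Attribute	Description
url	Final URL of the response, i.e. the address after any redirects
status_code	Response status code
ok	True whenever status_code is less than 400
content	Response body as `bytes`, used for example when fetching an image
text	Response body as `str`
json()	Response body parsed from JSON (typically a `dict` or `list`)
raw	Raw underlying HTTP response (HTTPResponse)
headers	Response headers
cookies	Cookies (`RequestsCookieJar`)
encoding	Encoding used to decode text
is_redirect	Whether the response is a redirect
history	Redirect history
reason	Textual reason for the status code, e.g. OK, Not Found
request	The request object
request.headers	Headers of the request object
request.url	URL of the request object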

2: Basic Examples

2.1 get | post | put | delete | head | options

import requests
res = requests.request('get', 'https://api.github.com/events')
# Print the help documentation for the object
help(res)
# List all attributes and methods of the object
print(dir(res))
print(res.text)


res = requests.get('https://api.github.com/events')
res = requests.post('http://httpbin.org/post', data={'key': 'value'})
requests.put('http://httpbin.org/put', data = {'key':'value'})
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

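2.2 Garbled text

If the text comes out garbled, set encoding explicitly.
If the Content-Type response header declares a text type but carries no charset field, requests falls back to ISO-8859-1, and any Chinese characters in text will then be garbled.
apparent_encoding: guesses the encoding that should be used by analysing the response content.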

import requests

# If the request carries a browser User-Agent, the server may treat it as coming from a browser; this disguises the script as a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
res = requests.get('http://www.baidu.com', headers=headers)
# ISO-8859-1
print(res.encoding)
# utf-8
print(res.apparent_encoding)
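# '百度一下，你就知道': without setting encoding, the Chinese in res.text would be garbled
res.encoding = res.apparent_encoding
print(res.text)

# res.text decodes res.content with res.encoding (taken from the response headers);
# only when res.encoding is None does requests fall back to the detected encoding (apparent_encoding)
# Data always travels over the network as bytes, so res.text is equivalent to res.content.decode(res.encoding)
# You can also search the page source for charset and try that encoding; note that it is not always accurate
# bytes.decode(encoding) takes the name of a character encoding; the default is 'utf-8'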
content = res.content.decode('utf-8')
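
To see why the wrong codec produces garbage rather than an error, the effect can be reproduced offline. This is a small sketch using only the standard library; the sample string stands in for the bytes a server would send for a UTF-8 page.

```python
# The bytes a server would send for a UTF-8 page title.
original = '百度一下，你就知道'
raw_bytes = original.encode('utf-8')

# ISO-8859-1 maps every byte to a character, so decoding never fails;
# it just yields the wrong characters.
garbled = raw_bytes.decode('ISO-8859-1')
print(garbled)

# Decoding the same bytes with the right codec restores the text.
fixed = raw_bytes.decode('utf-8')
print(fixed)

# Already-garbled text can be repaired by reversing the wrong step first.
repaired = garbled.encode('ISO-8859-1').decode('utf-8')
print(repaired == original)
```

This is the same repair that setting `res.encoding` performs: the bytes in `res.content` never change, only the codec used to turn them into `res.text`.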

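3: GET

import requests

# Parameters can also be appended directly to the URL, e.g. https://www.baidu.com/s?wd=python
# If no response arrives within timeout seconds, requests raises a timeout error
# (ConnectTimeout while connecting, ReadTimeout while waiting for data)
# Keys whose value is None (key3 here) are left out of the query string
payload = {'key1': 'value1', 'key2': 'value2', 'key3': None}
res = requests.get("http://httpbin.org/get", params=payload, timeout=3)
res.encoding = res.apparent_encoding


print(res.url)
print(res.status_code, requests.codes.ok)
print(res.headers)
print(res.headers['Content-Type'])
print(res.request.headers)
print(res.text)
print(res.content)
print(res.json())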
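
How `params` ends up in the URL can be checked without a network call. A minimal sketch, assuming requests is installed; the list value for `key2` is an added illustration of how repeated keys are encoded.

```python
import requests

payload = {'key1': 'value1', 'key2': ['a', 'b'], 'key3': None}

# prepare() only builds the request; nothing is sent.
prepared = requests.Request(
    'GET', 'http://httpbin.org/get', params=payload
).prepare()

# key3 is dropped because its value is None; the list becomes a repeated key.
print(prepared.url)
```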

Request headers

Note: a crawler should almost always send a User-Agent header.

import requests

headers = {
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
	'Accept-Encoding': 'gzip, deflate',
	'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
	'Cache-Control': 'no-cache',
	'Connection': 'keep-alive',
	'Host': 'httpbin.org',
	'Pragma': 'no-cache',
	'Upgrade-Insecure-Requests': '1',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}

res = requests.get('http://httpbin.org/get', headers=headers)
print(res.text)

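We usually paste the Request Headers straight from the browser, but the pasted text has two problems: it is not indented, and the keys and values are not wrapped in single quotes. PyCharm's regular-expression replace can fix both. Select the text to change, switch the match mode to regular expression, enter the search pattern and the replacement, then click Replace all.

. matches any single character.
* matches the preceding element any number of times (zero or more).
? matches the preceding element zero or one time; placed after * it makes the match non-greedy.
() groups part of the expression into a single unit (a capture group).
$n refers to the text captured by the n-th group; numbering starts at 1.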
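
The same conversion can be done in Python instead of the editor. The sketch below uses only the standard library; `parse_raw_headers` is a hypothetical helper name, and the sample header lines are illustrative. Note that Python's `re` module writes group references as `\1`, where PyCharm uses `$1`.

```python
import re


def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines, HTTP/2 pseudo-headers (':authority') and junk.
        if not line or line.startswith(':') or ':' not in line:
            continue
        # Split on the first colon only: values such as URLs contain colons.
        name, value = line.split(':', 1)
        headers[name.strip()] = value.strip()
    return headers


raw = """
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cache-Control: no-cache
Host: httpbin.org
Referer: http://httpbin.org/
"""
headers = parse_raw_headers(raw)
print(headers)

# Regex equivalent of the editor replace: quote the key and the value.
PATTERN = r'^(.*?):\s*(.*)$'
quoted = re.sub(PATTERN, r"'\1': '\2',", 'Host: httpbin.org', flags=re.MULTILINE)
print(quoted)
```

Parsing into a dict at run time avoids editing the pasted text at all; the regex form is useful when the headers should live in the source file as a literal.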

Requesting an image

import requests
from io import BytesIO
from PIL import Image

res = requests.get('https://profile-avatar.csdnimg.cn/8b6439a0bcb34188953be722a564c8cc_vbirdbest.jpg')
img = Image.open(BytesIO(res.content))
print(img)

Cookie

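A Cookie can be placed in the request headers.
A Cookie can also be passed through the cookies parameter of a request.
Cookies are commonly used to persist a user's login state.

# Option 1: send the Cookie as a request header
headers = {
    'Cookie': 'key1=value1;key2=value2'
}
res = requests.get('http://httpbin.org/cookies', headers=headers)
# Option 2: pass the cookies through the cookies parameter
res = requests.get('http://httpbin.org/cookies', cookies=dict(key1='value1', key2='value2'))
print(res.text)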
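
When the Cookie value is copied from the browser as a single string, it has to be split before it can be passed through the cookies parameter. A minimal standard-library sketch; `cookie_header_to_dict` is a hypothetical helper name.

```python
def cookie_header_to_dict(raw: str) -> dict:
    """Split a 'k1=v1; k2=v2' Cookie header value into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if '=' not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only: cookie values may contain '='.
        name, value = pair.split('=', 1)
        cookies[name] = value
    return cookies


cookie_dict = cookie_header_to_dict('key1=value1;key2=value2; token=abc==')
print(cookie_dict)
```

The resulting dict can be passed as `cookies=cookie_dict` in place of the hand-written `dict(...)` above.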

import requests

cookies = requests.cookies.RequestsCookieJar()
cookies.set('cookie1', 'value1', domain='httpbin.org', path='/cookies')
cookies.set('cookie2', 'value2', domain='httpbin.org', path='/cookies')

session = requests.Session()
res = session.get('http://httpbin.org/cookies', cookies=cookies)
print(res.text)

res = requests.get('http://www.baidu.com')
print(res.text)
# cookiejar -> dict (the domain and path are not included)
cookiedict = requests.utils.dict_from_cookiejar(res.cookies)
# dict -> cookiejar
cookiejar = requests.utils.cookiejar_from_dict(cookiedict)

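Redirects

res = requests.get('http://github.com', allow_redirects=True)
# With allow_redirects=True (the default for GET) the redirect is followed,
# so this is the status code of the final response; the intermediate 301/302 responses are in res.history
print(res.status_code)
print(res.history)
print(res.text)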

Hooks

# Called with each response; here it prints the URL of the response
def print_url(r, *args, **kwargs):
    print(r.url)
hooks = dict(response=print_url)

res = requests.get('http://httpbin.org', hooks=hooks)
print(res)

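Certificates

Some sites trigger a "Your connection is not private" warning when visited. Many certificates are not issued by a trusted certificate authority but are self-signed. Verification of such self-signed certificates can be switched off (requests then only emits a warning); other sites cannot be bypassed this way and require you to supply a certificate.

# Skip certificate verification
r = requests.get("https://sam.huat.edu.cn:8443/selfservice/", verify = False)
print(r.text)

# Supply a client certificate and key
requests.get('https://kennethreitz.org', cert=('/path/client.cert', '/path/client.key'))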

import requests

session = requests.Session()
session.auth = ('user', 'pass')
# Headers set on the session apply to every request
session.headers.update({'token': '123456789'})
# Headers passed here apply to this request only
res = session.get('http://httpbin.org/headers', headers={'sign': 'xxx'})
print(res.text)

Proxies

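By direction

Forward proxy: acts on behalf of the client, which knows the address of the target server.
Reverse proxy: acts on behalf of the server; the client does not know the real server's address, e.g. nginx.

By protocol

HTTP proxy.
HTTPS proxy.
SOCKS tunnel proxy: a proxy set up at the socket layer.

By transparency

Transparent proxy (Transparent Proxy): although a transparent proxy appears to "hide" your IP address, the server can still find out who you are.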
```
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
```
Anonymous proxy (Anonymous Proxy): the server can tell that you are using a proxy but not who you are.
```
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
```
High-anonymity proxy (High Anonymous Proxy): the server cannot tell whether you are using a proxy at all, which makes this the best choice.
```
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
```
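
The three listings above can be turned into a simple check. The sketch below is one possible reading, not a standard algorithm: `classify_proxy` is a hypothetical helper, it assumes the request really went through a proxy, and it treats a missing variable as "not determined". The IP addresses are documentation-range placeholders.

```python
def classify_proxy(server_vars: dict, real_ip: str) -> str:
    """Classify a proxy from the variables the target server observed."""
    via = server_vars.get('HTTP_VIA')
    forwarded = server_vars.get('HTTP_X_FORWARDED_FOR')
    if not via and not forwarded:
        return 'high-anonymity'  # no trace of a proxy at all
    if forwarded == real_ip:
        return 'transparent'     # the real client IP leaks through
    return 'anonymous'           # proxy is visible, client is not


REAL_IP = '198.51.100.7'    # placeholder client address
PROXY_IP = '203.0.113.10'   # placeholder proxy address

transparent = {'REMOTE_ADDR': PROXY_IP, 'HTTP_VIA': PROXY_IP,
               'HTTP_X_FORWARDED_FOR': REAL_IP}
anonymous = {'REMOTE_ADDR': PROXY_IP, 'HTTP_VIA': PROXY_IP,
             'HTTP_X_FORWARDED_FOR': PROXY_IP}
high_anonymity = {'REMOTE_ADDR': PROXY_IP}

for observed in (transparent, anonymous, high_anonymity):
    print(classify_proxy(observed, REAL_IP))
```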

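Free proxy sites: a proxy that can relay http can generally relay https as well.

# HTTP Basic Auth: a username and password can be given in the proxy URL, e.g. http://user:pass@123.171.1.78:8089
# Free proxies are frequently unavailable
# The key is the scheme of the target URL, so an https:// request needs an 'https' entry
proxies = {
    'http': 'http://223.96.90.216:8085',
    'https': 'http://223.96.90.216:8085'
}
res = requests.get('https://www.baidu.com/', proxies=proxies)
print(res.text)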

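4: POST

Where request values usually come from:

Fixed values: the parameter value never changes.
Input values: usually supplied as user input.
Preset values hidden in static files: must be extracted from the HTML, e.g. with a regular expression.
Preset values obtained from another request: e.g. fetching a token.
Values generated on the client: e.g. a signature (sign), which may be built by concatenating a timestamp (ts), a salt, and other parameters into one string.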
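
A client-generated signature of the kind described in the last item might look like the sketch below. The scheme is entirely hypothetical (MD5 over the sorted parameters, then the timestamp, then the salt); every real site defines its own, which has to be recovered from its JavaScript. `make_sign`, the parameter names, and the salt value are assumptions for illustration.

```python
import hashlib
import time


def make_sign(params: dict, ts: int, salt: str) -> str:
    """Hypothetical scheme: md5 over sorted 'k=v' pairs, then ts, then salt."""
    joined = '&'.join(f'{k}={params[k]}' for k in sorted(params))
    raw = f'{joined}&ts={ts}&salt={salt}'
    return hashlib.md5(raw.encode('utf-8')).hexdigest()


params = {'q': 'python', 'page': 1}
ts = int(time.time() * 1000)   # millisecond timestamp
sign = make_sign(params, ts, salt='demo-salt')

# The signed payload that would be sent as data= or params=.
payload = {**params, 'ts': ts, 'sign': sign}
print(payload)
```

Because the timestamp is part of the input, the signature changes on every request even when the other parameters stay the same.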

When scraping, send the request more than once (e.g. twice) and compare the captures to see which parameters change and which stay the same.
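
That comparison is easy to automate once both payloads are available as dicts. A minimal standard-library sketch; `diff_payloads` is a hypothetical helper and the captured values are made up.

```python
def diff_payloads(first: dict, second: dict):
    """Split parameter names into those that stayed fixed and those that changed."""
    keys = set(first) | set(second)
    fixed = sorted(k for k in keys if first.get(k) == second.get(k))
    changed = sorted(k for k in keys if first.get(k) != second.get(k))
    return fixed, changed


# Two captured requests (made-up values).
req1 = {'q': 'python', 'client': 'web', 'ts': '1667190897000', 'sign': 'a3f1'}
req2 = {'q': 'python', 'client': 'web', 'ts': '1667190901000', 'sign': '9c0e'}

fixed, changed = diff_payloads(req1, req2)
print('fixed:', fixed)
print('changed:', changed)
```

The parameters reported as changed are the ones whose source (another request, a hidden field, or client-side code) needs to be tracked down.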

# data can be a dict or a sequence of tuples
# data is sent as "Content-Type": "application/x-www-form-urlencoded"
res = requests.post('http://httpbin.org/post', data={'key': 'value'})

payload = (('key1', 'value1'), ('key1', 'value2'))
res = requests.post('http://httpbin.org/post', data=payload)


# The json parameter is sent as "Content-Type": "application/json"
res = requests.post('http://httpbin.org/post', json={'key': 'value'})
print(res.text)

# data also accepts a JSON string: a dict is form-encoded,
# while json.dumps(payload) is sent as the raw body
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
requests.post(url, data=payload)
requests.post(url, data=json.dumps(payload))

Files

url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
# Explicitly set the filename, content type, and extra headers
files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}
r = requests.post(url, files=files)
print(r.text)

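Multiple files

<input type="file" name="images" multiple="true" required="true"/>

# Repeat the form field name once per file
multiple_files = [
        ('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),
        ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]
r = requests.post('http://httpbin.org/post', files=multiple_files)
print(r.text)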

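5: Session

Sending several requests in a row through a Session performs better, because the underlying TCP connection is reused.
A Session keeps a login session alive: each request carries the cookies from the previous response. In practice you call the login endpoint once to obtain the returned cookies (such as JSESSIONID); later requests then carry those cookies automatically, so the endpoints behind the login work. Whichever way a request is sent, it ultimately goes through session.get() -> request() -> prepare_request(req) -> cookie handling.
A Session adds four request headers by default (User-Agent, Accept-Encoding, Accept, Connection) and places them ahead of any headers you supply, so the order that is sent can differ from the order you wrote.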

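Note: some endpoints are strict about the order of the request headers (perhaps as an anti-scraping measure) and return wrong results if the order is off. The browser's developer tools display headers sorted alphabetically by default, so when copying a request always click [View source] to get the headers in the order they were actually sent. Make this a habit.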

session = requests.Session()

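# Clear the session's default headers
session.headers.clear()
# Set your own headers in the desired order
session.headers.update(headers)
# Set Basic Auth: http://httpbin.org/basic-auth
# session.auth = ('user', 'passwd')
res = session.get('https://www.baidu.com/')
# For reference, the default session headers (sent when they are not cleared):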
# {
#     'User-Agent': 'python-requests/2.28.1',
#     'Accept-Encoding': 'gzip, deflate',
#     'Accept': '*/*',
#     'Connection': 'keep-alive'
# }
print(res.request.headers)
session.close()

# The session is closed when the with block exits
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
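
The header order a Session would use can be checked offline with `prepare_request()`, which merges session-level and request-level settings without sending anything. A minimal sketch, assuming requests is installed; the header values are illustrative, and lower layers of the HTTP stack may still add headers of their own on the wire.

```python
from collections import OrderedDict

import requests

ordered_headers = OrderedDict([
    ('Host', 'httpbin.org'),
    ('User-Agent', 'demo-client'),  # illustrative value
    ('Accept', '*/*'),
])

with requests.Session() as session:
    print(list(session.headers))    # default headers added by Session
    session.headers.clear()
    session.headers.update(ordered_headers)
    # prepare_request() merges session and request settings without sending.
    prepared = session.prepare_request(
        requests.Request('GET', 'http://httpbin.org/headers')
    )
    sent_order = list(prepared.headers)

print(sent_order)
```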

6: Logging in to GitHub

Note: log out of your own account first. If the account is already logged in on the web page, the login endpoint never responds.

import re
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

session = requests.Session()
session.headers.update(headers)
# 1. Fetch the token (note: log out of your account first, otherwise the request hangs without a response)
content_login = session.get('https://github.com/login').content.decode()
# <input type="hidden" name="authenticity_token" value="MPoH_UoMCducxg_iNdl7b2r9cGxGwsunkrYBN-XCUJC_dzW78zc2U-OLQ5d2I5pPv92FErXziG_yT-uRdg_QQA" />
authenticity_token = re.findall('name="authenticity_token" value="(.*?)" />', content_login)[0]
print(authenticity_token)

# 2. Log in
data = {
    'commit': 'Sign in',
    'authenticity_token': authenticity_token,
    'login': 'your-username',
    'password': 'your-password',
    'webauthn-support': 'supported'
}
session.post('https://github.com/session', data=data)

# 3. Fetch the profile page to get the user information
profile = session.get('https://github.com/your-username').content
with open('github.com', 'wb') as f:
    f.write(profile)
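
The token extraction in step 1 can be tried offline against the hidden input shown in the login code. The sketch below uses only the standard library; `extract_token` is a hypothetical helper that returns None instead of raising IndexError when the field is missing, which is what happens if the page layout changes.

```python
import re

# Sample built from the hidden input shown in the login code.
sample_html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="authenticity_token" '
    'value="MPoH_UoMCducxg_iNdl7b2r9cGxGwsunkrYBN-XCUJC_dzW78zc2U-OLQ5d2I5pPv92FErXziG_yT-uRdg_QQA" />'
    '</form>'
)


def extract_token(html: str):
    """Return the authenticity_token value, or None when the field is missing."""
    match = re.search(r'name="authenticity_token"\s+value="(.*?)"', html)
    return match.group(1) if match else None


token = extract_token(sample_html)
print(token)
print(extract_token('<html>no form here</html>'))
```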

Added: 2022-10-31 12:34:57 Updated: 2022-10-31 12:38:46
