Study notes: Bilibili Python web crawler video tutorial, complete series (62 parts) | From beginner to mastery in 6 hours
0. Tutorial video URL
- https://www.bilibili.com/video/BV1pt41137qK?p=15
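1. Example code
In most cases, if the Headers field is left unchanged, the request is rejected with a non-200 status code. This is not a browser "robots protocol" at work: robots.txt is only advisory, and the rejection comes from the site's own server-side check, which sees that requests identifies itself with a User-Agent such as python-requests/2.25.1. We can therefore set the User-Agent header to imitate a browser when fetching an Amazon product page.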
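import requests

if __name__ == '__main__':
    # Imitate a browser by overriding the User-Agent header
    agent = {'User-Agent': 'Mozilla/5.0'}
    kv = {'headers': agent}
    url = 'https://www.amazon.cn/dp/B09C8VKG4Y/?_encoding=UTF8&pd_rd_w=zToxc&pf_rd_p=b2c3fdd4-a66d-4966-afad-3e4771df6879&pf_rd_r=QSAV56W094T9YZ88BXWV&pd_rd_r=811e4077-5d96-4f9f-94e5-a8b31b9a3970&pd_rd_wg=V3NAZ&ref_=pd_gw_unk'
    try:
        # The request itself sits inside try so that connection errors are caught too
        r = requests.request('GET', url, **kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.status_code, r.text[0:2000])
    except requests.HTTPError:
        # Crawl failed, e.g. status_code 503: Amazon refuses 'User-Agent': 'python-requests/2.25.1'
        print("Crawl failed: status_code", r.status_code, r.request.headers)
    except requests.RequestException as e:
        # No response at all (DNS failure, refused connection, ...)
        print("Crawl failed:", e)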
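To see why the header override matters, the following minimal sketch shows which User-Agent requests would send, with and without the custom header, without performing any network request. The helper name effective_user_agent and the shortened URL are illustrative assumptions, not part of the tutorial code.

```python
import requests


def effective_user_agent(url, headers=None):
    """Return the User-Agent requests would send for a GET to url.

    Hypothetical helper for illustration: the request is only prepared,
    never sent, so no network traffic occurs.
    """
    session = requests.Session()
    # prepare_request merges the session defaults with the per-request headers;
    # per-request headers take precedence over the defaults.
    prepared = session.prepare_request(requests.Request("GET", url, headers=headers))
    return prepared.headers["User-Agent"]


if __name__ == "__main__":
    url = "https://www.amazon.cn/"  # shortened URL, assumed for the example
    print(effective_user_agent(url))                                 # default python-requests/<version>
    print(effective_user_agent(url, {"User-Agent": "Mozilla/5.0"}))  # overridden value
```

The default value begins with python-requests/, which is the string the server sees when the header is not overridden.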
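2. Run output
The status_code is now 200 instead of the earlier 503. For this crawler, any non-200 status is treated as a failed fetch. Note, though, that the body printed below mentions opfcaptcha.amazon.cn and carries a notice about automated access, so it appears to be Amazon's robot-check page rather than the product page; a 200 status alone does not guarantee the expected content.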
C:\Users\珞落\AppData\Local\Programs\Python\Python39\python.exe D:/PythonProject/main.py
200 <!DOCTYPE html>
<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title dir="ltr">Amazon.cn</title>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">
<script>
if (true === true) {
var ue_t0 = (+ new Date()),
ue_csm = window,
ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
ue_furl = "fls-cn.amazon.cn",
ue_mid = "AAHKV2X7AFYLW",
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
ue_sn = "opfcaptcha.amazon.cn",
ue_id = 'A8AX1VD8FQY8GRKVJ9C8';
}
</script>
</head>
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
<!--
Correios.DoNotSend
-->
<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">
Process finished with exit code 0
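Because a 200 status can still carry a robot-check page, a crawler may want to inspect the body as well. The sketch below is a rough heuristic only: the marker strings are taken from the output above, while the function name looks_like_robot_check and the marker list are assumptions made for illustration and may not match other Amazon responses.

```python
# Marker strings observed in the output above; treating them as indicators
# of a robot-check page is an assumption, not a documented Amazon contract.
ROBOT_CHECK_MARKERS = (
    "opfcaptcha",
    "To discuss automated access to Amazon data",
)


def looks_like_robot_check(html):
    """Return True if the HTML body contains any known robot-check marker."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)


if __name__ == "__main__":
    blocked = '<script>ue_sn = "opfcaptcha.amazon.cn";</script>'
    normal = "<html><head><title>Some product</title></head></html>"
    print(looks_like_robot_check(blocked))
    print(looks_like_robot_check(normal))
```

In the example code from section 1, such a check could be applied to r.text after raise_for_status() to decide whether the fetch really succeeded.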