Crawler Study Notes (2)

Learning to write crawlers

Note: these notes were written in Jupyter.

Web front-end basics

Jupyter can run HTML and JavaScript directly: just put %%html or %%javascript on the first line of the cell.

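%%html
<html>
    <head>
        <title>Python Crawler Development and Project Practice</title>
        <meta charset='UTF-8'>
    </head>
    <body>
        Document setup markup<br>
        <p>This is a paragraph</p>
    </body>
</html>

Python Crawler Development and Project Practice Document setup markup

This is a paragraph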

%%html
<html>
    <head>
        <script type='text/javascript'>
            alert('Hello,world!');
            var str1='hi';
            var str2 = 'you';
            str1 +=str2
            alert(str1)
        </script>
    </head>
    <body>
        Python crawler
    </body>
</html>

Python crawler str1

JavaScript can also be run directly:

%%javascript
alert('hello li')
var str1='hi';
var str2 = 'you';
str1 +=str2
alert(str1)
var person = {name:'li',age:17};
alert(person.name)

<IPython.core.display.Javascript object>

XPath nodes

%%html
<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <country>China</country>
    </student>
</classroom>

1001 marry China
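
To see how XPath expressions pick nodes out of a document like the one above, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath (lxml offers the full language). The variable names are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <country>China</country>
    </student>
</classroom>"""

root = ET.fromstring(XML_DOC)  # root is the <classroom> element

# Child path: every <name> element directly under a <student>
names = [node.text for node in root.findall("./student/name")]

# Descendant search with an attribute predicate
english_name = root.find(".//name[@lang='en']")

# Read an attribute, and the text of another child element
lang = english_name.get("lang")
country = root.findtext("./student/country")

print(names, lang, country)
```

The same path expressions should carry over to lxml's xpath() method when a real page needs predicates or axes that ElementTree does not support.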

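CSS (Cascading Style Sheets)

A CSS rule consists of a selector and one or more declarations.
There are three common ways to apply styles:

Inline style: set the style attribute directly on an element, for example

<body style='background-color:green;margin:0;padding:0;'></body>

Embedded style sheet: the rules are written between <style type='text/css'> and </style>.
External style sheet: the CSS lives in a separate file that is linked with <link rel='stylesheet' type='text/css' href='style.css'>.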

JavaScript

There are two ways to include it:

Inline code: <script type='text/javascript'>alert('hello')</script>
External file: <script src='temp/test1.js'></script>, usually placed inside <head></head>.

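The HTTP standard

Common status codes: 200, the request succeeded; 301, the resource has moved permanently to another URL; 404, the requested resource does not exist; 500, internal server error.
Headers: the most frequently used one is User-Agent, which sites often check as an anti-crawler measure.
GET versus POST: GET passes data in the URL, so the parameters show up in the address bar, and the amount of data is limited in practice by the URL length that browsers and servers accept (the HTTP standard itself sets no fixed limit). POST passes data in the request body, which has no such length limit and keeps the parameters out of the URL; the data is still sent in plain text unless HTTPS is used.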
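
The GET/POST difference can be made concrete without sending anything. The sketch below only builds two urllib request objects; the endpoint and the form fields are made up for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration
base_url = "http://example.org/search"
params = {"q": "python", "page": "2"}
query = urllib.parse.urlencode(params)

# GET: the parameters are appended to the URL after "?"
get_req = urllib.request.Request(base_url + "?" + query)

# POST: the same parameters travel in the request body as bytes
post_req = urllib.request.Request(base_url, data=query.encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

urllib chooses the method from whether data is present, which is why no explicit method argument is needed here.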

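Overview of Python crawlers

Types of crawler:

General-purpose crawlers, such as those behind the Baidu and Google search engines.
Focused crawlers, which automatically download only the pages relevant to a given topic.
Incremental crawlers, which re-download a page only when it has changed.
Deep-web crawlers, which target pages that cannot be reached through ordinary links, for example pages available only after login.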
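
As a rough illustration of the incremental idea, the sketch below compares a content hash against the one recorded on the previous visit. The function name and the in-memory store are hypothetical, and this variant only detects a change after the page has been fetched; avoiding the download itself would rely on HTTP conditional requests (If-Modified-Since or ETag), which are not shown here.

```python
import hashlib


def needs_update(url, body, seen_hashes):
    """Return True if body differs from the version recorded for url.

    seen_hashes is a hypothetical in-memory store mapping URL -> SHA-256 digest.
    """
    digest = hashlib.sha256(body).hexdigest()
    if seen_hashes.get(url) == digest:
        return False           # unchanged: nothing to store or re-parse
    seen_hashes[url] = digest  # new or changed: remember this version
    return True


seen = {}
first = needs_update("http://example.org/", b"<html>v1</html>", seen)
second = needs_update("http://example.org/", b"<html>v1</html>", seen)
third = needs_update("http://example.org/", b"<html>v2</html>", seen)
print(first, second, third)
```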

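Implementing HTTP requests in Python

import urllib.request  # "import urllib" alone does not reliably load the request submodule

response = urllib.request.urlopen('http://www.zhihu.com')
html = response.read()
print(html[0:20])

b'<!doctype html>\n<htm'

The request above uses GET. Below is a POST request.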

# encoding: utf-8
import urllib.parse
import urllib.request

url = 'http://www.zhihu.com/login'
postdata = {'username': 'qiye',
            'password': 'qiye_pass'}
# URL-encode the form fields, then convert the string to bytes for the request body
data = urllib.parse.urlencode(postdata).encode('utf-8')
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
html = response.read()
print(html[0:20])

b'<!DOCTYPE html>\n<htm'

The code from the book fails with:

AttributeError: module 'urllib' has no attribute 'urlencode'

Fix:
urllib was split into submodules in Python 3, so

urllib.urlencode()

becomes

urllib.parse.urlencode()

That leads to another error:

TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.

Fix:
encode the data as UTF-8 bytes

data = urllib.parse.urlencode(postdata).encode('utf-8')
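
The TypeError is easier to understand once the two steps are looked at separately. A small sketch, reusing the form fields from the example above:

```python
import urllib.parse

postdata = {'username': 'qiye', 'password': 'qiye_pass'}

encoded = urllib.parse.urlencode(postdata)  # a str such as 'a=1&b=2'
data = encoded.encode('utf-8')              # bytes, which is what POST data must be

print(type(encoded).__name__, encoded)
print(type(data).__name__, data)
```

urlencode() only builds the query string; the extra .encode('utf-8') is what turns it into a request body that urllib accepts.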

Then yet another error:

HTTPError: HTTP Error 403: Forbidden

Fix:
the URL entered originally was not specific enough.

url = 'http://www.zhihu.com/login'
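
Errors such as the 403 above can be caught instead of stopping the notebook. The following is a sketch under my own assumptions: fetch_status and fake_forbidden are hypothetical helpers, and the fake opener stands in for urlopen so that the example runs without a network connection.

```python
import io
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status, reason) for url, catching HTTP and network errors.

    opener is injectable so the function can be exercised offline.
    """
    try:
        with opener(url) as response:
            return response.status, "OK"
    except urllib.error.HTTPError as err:  # the server answered with 4xx/5xx
        return err.code, err.reason
    except urllib.error.URLError as err:   # DNS failure, refused connection, ...
        return None, str(err.reason)


def fake_forbidden(url):
    # Hypothetical stand-in for urlopen that simulates a 403 response
    raise urllib.error.HTTPError(url, 403, "Forbidden", {}, io.BytesIO())


result = fetch_status("http://example.org/login", opener=fake_forbidden)
print(result)
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the more general handler would swallow the status code.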

Handling request headers

import urllib.parse
import urllib.request

url = 'https://www.cnblogs.com/login'
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
referer = 'https://www.cnblogs.com'
postdata = {'username': '小李同学314', 'password': '***'}
# Put user_agent and referer into the request headers (the header name is User-Agent, with a hyphen)
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.parse.urlencode(postdata).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
html = response.read()
print(html[0:50])

b'<!DOCTYPE html>\n<html lang="zh-cn">\n<head>\n    <me'

The add_header() method can be used instead:

import urllib.parse
import urllib.request

url = 'https://www.cnblogs.com/login'
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
referer = 'https://www.cnblogs.com'
postdata = {'username': '小李同学314', 'password': '***'}
data = urllib.parse.urlencode(postdata).encode('utf-8')
# Add the headers one at a time with add_header()
req = urllib.request.Request(url)
req.add_header('User-Agent', user_agent)
req.add_header('Referer', referer)
req.data = data
response = urllib.request.urlopen(req)
html = response.read()
print(html[0:10])

b'<!DOCTYPE '

Both examples above probably still have a problem: a wrong password returns the same result.
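
One way to check what will actually be sent is to inspect the Request object before calling urlopen(). This sketch builds a request and lists its headers without contacting the server; the User-Agent string is just a sample value.

```python
import urllib.request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Header passed to the constructor
req = urllib.request.Request('https://www.cnblogs.com/login',
                             headers={'User-Agent': user_agent})
# Header added afterwards
req.add_header('Referer', 'https://www.cnblogs.com')

# urllib stores header names capitalized, e.g. 'User-agent' and 'Referer'
for name, value in sorted(req.header_items()):
    print(name + ': ' + value)
```

A misspelled name such as User_Agent would simply be sent as a separate, unknown header, so the server would still see urllib's default User-Agent.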

Introducing the requests library

import requests
r = requests.get('https://www.baidu.com')
print(r.content[0:20])

b'<!DOCTYPE html>\r\n<!-'

import requests
postdata = {'key':'value'}
r = requests.post('https://www.baidu.com/login',data=postdata)
print(r.content)

b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>404 Not Found</title>\n</head><body>\n<h1>Not Found</h1>\n<p>The requested URL /login was not found on this server.</p>\n</body></html>\n'

Responses and encoding

import requests
r = requests.get('https://www.baidu.com')
print(r.encoding)
r.encoding = 'utf-8'
print(r.text[0:20])

ISO-8859-1
<!DOCTYPE html>
<!-

import chardet
import requests
r = requests.get('https://www.baidu.com')
print(chardet.detect(r.content))
# detect() returns a dict; only its 'encoding' entry is a valid value for r.encoding
r.encoding = chardet.detect(r.content)['encoding']
print(r.text[0:20])

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
<!DOCTYPE html>
<!-
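
Why the encoding matters: r.text is just r.content decoded with r.encoding, and requests falls back to ISO-8859-1 when a text response declares no charset. The sketch below imitates that with plain bytes; the sample string is my own.

```python
# Hypothetical page content, encoded the way a UTF-8 server would send it
raw = '爬虫学习'.encode('utf-8')

wrong = raw.decode('iso-8859-1')  # roughly what r.text gives with the wrong guess
right = raw.decode('utf-8')       # roughly what r.text gives after r.encoding = 'utf-8'

print(repr(wrong))
print(right)
```

ASCII-only text such as '<!DOCTYPE html>' decodes identically under both encodings, which is why the outputs above look fine either way; the difference only shows up once the page contains non-ASCII characters.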

Besides reading the whole response at once, there is a streaming mode that reads the response as a raw byte stream.

import requests
r = requests.get('https://www.baidu.com',stream=True)
print(r.raw.read(10))

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
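
The bytes printed above start with 1f 8b 08, which is the gzip signature, so r.raw appears to hand back the body still compressed. A small standard-library sketch of what that signature looks like (the body is a made-up sample):

```python
import gzip

body = b'<!DOCTYPE html>'  # hypothetical response body
compressed = gzip.compress(body)

# Every gzip stream starts with the magic bytes 1f 8b, followed by 08 (deflate)
print(compressed[:3])
print(gzip.decompress(compressed))
```

With requests itself, r.raw.read(10, decode_content=True) should return decompressed bytes instead, since the underlying urllib3 response accepts a decode_content argument.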

Handling request headers

import requests
user_agent= 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent':user_agent}
r = requests.get('https://www.baidu.com',headers=headers)
print(r.content[0:20])

b'<!DOCTYPE html><!--S'

Handling status codes and response headers

import requests
r = requests.get('http://www.baidu.com')
if r.status_code == requests.codes.ok:
    print(r.status_code)
    # Preferred: .get() returns None for a missing field, whereas headers['content-type'] raises KeyError
    print(r.headers.get('content-type'))
    print(r.headers)
else:
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses

200
text/html
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Wed, 12 Jan 2022 13:00:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:12 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

Handling cookies

import requests
user_agent = 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent':user_agent}
r = requests.get('https://www.baidu.com',headers=headers)
for cookie in r.cookies.keys():
    print(cookie+':'+r.cookies.get(cookie))

BAIDUID:29C35B46ABEAF0273D2DC8EB99F1EE42:FG=1
BIDUPSID:29C35B46ABEAF02741391EDC08B8E0B8
H_PS_PSSID:35106_35627_35489_34584_35491_35698_35688_35541_35316_26350_35613_22159
PSTM:1641993903
BDSVRTM:13
BD_HOME:1
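
Cookies like these arrive in Set-Cookie response headers. As a rough illustration of what gets parsed out of one, the sketch below feeds the Set-Cookie value from the header dump earlier in these notes to the standard library's http.cookies module; requests does its own parsing, so this is only an analogy.

```python
from http.cookies import SimpleCookie

# Set-Cookie value copied from the response headers printed earlier in these notes
set_cookie = 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'

jar = SimpleCookie()
jar.load(set_cookie)

for name, morsel in jar.items():
    # value is the cookie itself; domain, path and max-age are its attributes
    print(name + ':' + morsel.value, morsel['domain'], morsel['path'], morsel['max-age'])
```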

A Session handles cookies automatically, so they carry over when moving from one page to the next:

import requests
loginUrl = 'https://www.baidu.com'
s = requests.Session()
r = s.get(loginUrl,allow_redirects=True)
datas = {'name':'qiye','passwd':'qiye'}
r = s.post(loginUrl,data=datas,allow_redirects=True)
print(r.text[0:10])

Using a proxy

import requests
# Keys are URL schemes without a trailing colon
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080'
}
requests.get('http://example.org', proxies=proxies)
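
The proxies dictionary is keyed by URL scheme. The sketch below is a simplified picture of that lookup, not requests' actual implementation (which also honours host-specific keys and environment variables); pick_proxy is a hypothetical helper and the addresses are placeholders.

```python
from urllib.parse import urlsplit

# Placeholder proxy addresses, keyed by scheme
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}


def pick_proxy(url, proxies):
    """Return the proxy registered for url's scheme, or None (hypothetical helper)."""
    return proxies.get(urlsplit(url).scheme)


print(pick_proxy('http://example.org', proxies))
print(pick_proxy('https://example.org', proxies))
# A key with a trailing colon never matches, because the scheme is 'http', not 'http:'
print(pick_proxy('http://example.org', {'http:': 'http://10.10.1.10:3128'}))
```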

Added: 2022-01-14 01:55:36 Updated: 2022-01-14 01:57:36
