[Python Knowledge Base] Python and Web Scraping, Part 01: A Brief Introduction


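PS: My thesis is finished, or maybe not quite! I plan to look into web scraping next, though I don't know how long I'll keep at it. Lastly, please let the thesis pass!!! Pass, pass, pass!!!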

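Preface: Web scraping requires looking past some of the interface layers that normally hide what is going on, such as the browser layer and the network connection layer.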

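1. Simulating network communication between A and B
Here A runs a web server and B is a client machine that wants to connect to it.
B: B's computer sends out a stream of bits (10101010...) made up of a header and a body. The header contains the MAC address of B's local router as the immediate destination and A's IP address as the final destination. The body contains B's request to A's server application.
B's router: The local router receives the bits and interprets them as a packet sent from B's MAC address and destined for A's IP address. The router stamps its own IP address on the packet as the source address and sends it out over the internet.
In transit: B's packet passes through intermediary servers and reaches A's server, which receives it at A's IP address. A's server reads the destination port in the packet header and hands the packet to the web server application (for web applications the destination port is usually port 80).
Response: The web server application receives the data from the server processor: a GET request for the file index.html. It locates the file, bundles it into a new packet, and sends it back to B, where the local router delivers it to B's computer.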
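
The request that B's machine sends in the last step can be sketched in a few lines. This is only a rough illustration: build_get_request and fetch_raw are hypothetical helper names, and the MAC addressing and routing described above are handled by the operating system and the routers, so they do not appear in the code.

```python
import socket


def build_get_request(host: str, path: str = "/index.html") -> bytes:
    """Build the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, file, protocol version
        f"Host: {host}",          # required header in HTTP/1.1
        "Connection: close",      # ask the server to close the socket when done
        "",                       # blank line that ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host: str, path: str = "/index.html", port: int = 80) -> bytes:
    """Send the request over TCP and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # empty bytes: the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Only builds and prints the request text; nothing is sent over the network here.
print(build_get_request("pythonscraping.com", "/pages/page1.html").decode("ascii"))
```

Calling fetch_raw("pythonscraping.com", "/pages/page1.html") would actually open a connection to port 80 and return the response headers and body as bytes.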

2. A simple example

from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

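PS: The code can be found on Gitee by searching for python-scraping. There will probably be quite a few results, but readers who know the book will likely recognize which one it is. If you don't know the book, never mind!
The result of the scrape is shown below.
The \n sequences in the output are newline characters.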

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'

PS: That doesn't look like English! (It is Lorem ipsum, the standard pseudo-Latin placeholder text.)
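
The b'...' prefix in the output means html.read() returned bytes rather than a string, which is why the newlines show up as literal \n. A small sketch of converting it to readable text follows. The bytes are a shortened stand-in for the real response, and UTF-8 is an assumed encoding.

```python
# A shortened stand-in for the bytes returned by html.read() in the example above.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# decode() turns the bytes into a str; UTF-8 is assumed here.
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)
print(text)  # the \n sequences are now printed as real line breaks
```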
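The program prints the complete HTML of the target address: the source of the HTML file page1.html, located in the <web application root>/pages directory on the server at the domain http://pythonscraping.com.
The Python program requests this single HTML file directly. The import line looks in Python's request module (part of the urllib library) and imports only the urlopen function.
Note: urllib is part of Python's standard library. It provides functions for requesting data across the web, handling cookies, and changing metadata such as request headers and the user agent. urlopen opens and reads a remote object retrieved over the network.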
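
As a small illustration of changing request metadata, the sketch below builds a request with a custom User-Agent header using urllib.request.Request. The header value is made up for the example, and the request is only constructed, not sent.

```python
from urllib.request import Request

url = 'http://pythonscraping.com/pages/page1.html'

# The User-Agent value is a made-up example, not a value from the original post.
req = Request(url, headers={'User-Agent': 'my-scraper/0.1'})

# Request normalises header names with str.capitalize(), hence 'User-agent'.
print(req.get_method())               # GET, since no data is attached
print(req.get_header('User-agent'))   # my-scraper/0.1

# Passing req to urlopen(req) would send the request with this header.
```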

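PS: When I ran the following code, html = urlopen('https://github.com/search?q=REMitchell/python-scraping/blob/master/Chapter01_BeginningToScrape.ipynb') print(html.read()), the page turned out to be hard to fetch. The connection attempt timed out, giving: URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>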

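3. BeautifulSoup
Tip: The book may suggest installing some modules separately. In practice, installing Anaconda3 probably covers both Jupyter and these libraries, since it appears to ship with them preinstalled. Look in the site-packages folder under the Lib directory of the installation, and you should find the modules you need.
3.1. A worked example
Run the following code: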

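from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it with Python's built-in HTML parser
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)             # the first <h1> tag in the document
print(bs.html.body.div)  # the <div> nested under <html> and <body>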

The output is:

<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

Continuing in the same session, run this code:

bs = BeautifulSoup(html.read(),'lxml')
print(bs.h1)
bs = BeautifulSoup(html.read(),'html5lib')
print(bs.h1)

The result is:

None
None

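PS: You might assume the two modules are not installed. They are: running pip install lxml reports that the package is already installed. The likely cause is that the response body can only be read once. The first example already consumed it with html.read(), so the later calls to html.read() return empty bytes, and parsing an empty document gives None for bs.h1.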
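
The read-once behavior can be reproduced without a network connection. In this sketch io.BytesIO stands in for the HTTP response (an assumption made for illustration); both are file-like objects whose read() consumes the stream.

```python
import io

# io.BytesIO stands in for the object returned by urlopen; the HTML is a
# shortened version of the page used in this post.
response = io.BytesIO(b'<html><body><h1>An Interesting Title</h1></body></html>')

first = response.read()    # the whole body
second = response.read()   # nothing left to read

print(len(first) > 0)   # True
print(second)           # b''

# The fix: read once, keep the bytes, and hand the same bytes to each parser,
# e.g. BeautifulSoup(first, 'lxml') and BeautifulSoup(first, 'html5lib').
```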
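3.2. Summary
The BeautifulSoup library formats and organizes messy web content by locating HTML tags. A BeautifulSoup object is normally created with two arguments: the HTML text, and the parser to use. Available parsers include html.parser, lxml, and html5lib. html.parser ships with Python, while lxml and html5lib are separate packages.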
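
A minimal sketch of the two-argument constructor, assuming bs4 is installed. The HTML string is a shortened version of the page used in this post, inlined so the example needs no network access.

```python
from bs4 import BeautifulSoup

# Inline HTML so the sketch runs offline.
html_text = '<html><body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>'

# First argument: the HTML text. Second argument: the parser name.
bs = BeautifulSoup(html_text, 'html.parser')

print(bs.h1)               # <h1>An Interesting Title</h1>
print(bs.h1.get_text())    # An Interesting Title
print(bs.div.get_text())   # Lorem ipsum
print(bs.h2)               # None, because the document has no <h2>
```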
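3.3. Basic exception handling
Case 1: the page does not exist on the server (or the server hit an error while retrieving it). Handle it by catching HTTPError.
Case 2: the server does not exist or cannot be reached. Handle it by catching URLError.
Sample code: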

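from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://github.com/search?q=REMitchell/python-scraping/blob/master/Chapter01_BeginningToScrape.ipynb')
except HTTPError as e:
    # The server responded, but with an error status such as 404 or 500.
    # HTTPError is a subclass of URLError, so it must be caught first.
    print(e)
except URLError as e:
    # No response at all: unknown host, refused connection, timeout, ...
    print('The server could not be found!')
else:
    print(html.read())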

PS: When I ran this code, the page was actually fetched successfully! Suddenly I feel so sad!!!
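
URLError covers more than a missing server, as the earlier timeout (WinError 10060) shows, so printing e.reason can be more informative than a fixed message. The following is a sketch only: fetch and its opener parameter are hypothetical names, and opener exists purely so a stand-in can replace urlopen when trying the function offline.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10, opener=urlopen):
    """Return the page bytes, or None if the request fails."""
    try:
        response = opener(url, timeout=timeout)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        print(f'HTTP error {e.code}: {e.reason}')
    except URLError as e:
        # e.reason holds the underlying cause (DNS failure, timeout, ...).
        print(f'Could not reach the server: {e.reason}')
    else:
        return response.read()
    return None
```

A call such as fetch('http://www.pythonscraping.com/pages/page1.html') returns the page bytes on success. The timeout argument of urlopen gives up after the stated number of seconds instead of waiting for the operating system default.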
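Even when a page is retrieved successfully, its content may not be what you expect. If a tag you ask for is missing, BeautifulSoup returns None, and looking up anything further on that None object raises AttributeError, so both cases need checking.
Code: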

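from urllib.request import urlopen
from bs4 import BeautifulSoup

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
    # Raises AttributeError if bs.html is None (no <html> tag found)
    badContent = bs.html.h1
except AttributeError as e:
    print('Tag was not found')
else:
    # bs.html exists but has no <h1>: the lookup returns None
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)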

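Result: <h1>An Interesting Title</h1>
If the lookup is changed to badContent = bs.html.h1.ss, the output is: Tag was not found. Here bs.html.h1 exists, so no exception is raised. The missing ss tag comes back as None and is caught by the None check in the else branch.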
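
The two failure modes can be seen side by side on a small inline document. This is a sketch assuming bs4 is installed; the HTML is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser')

# Case 1: the parent tag exists but the child does not -> None, no exception.
print(bs.h1.ss)   # None

# Case 2: the parent tag itself is missing, so the second lookup runs on None.
try:
    bs.h2.ss
except AttributeError as e:
    print('AttributeError:', e)
```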

Note: the book also gives a general-purpose reference function:

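from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.html.h1
    except HTTPError as e:
        # The server answered with an error status
        print(e)
    except URLError as e:
        # The server could not be reached
        print('The server could not be found!')
    except AttributeError as e:
        # bs.html was None, so .h1 could not be looked up
        print('Tag was not found')
    else:
        if title is None:
            print('Title was not found')
        else:
            print(title)

# Example call with the page used throughout this post
getTitle('http://www.pythonscraping.com/pages/page1.html')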