[Python Knowledge Base] Crawler 120 Examples, No. 17: Collecting Witty Sentences with an Object-Oriented Python Spider


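After collecting these 7,000+ sentences, I found that plenty of them are jokes with a sharp twist.
e.g. If I bring an umbrella, the day turns out sunny; if I don't, it rains.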

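Analyzing the target site

The target site for this example is 学句子网 (xuejuzi.cn), at http://www.xuejuzi.cn/gaoxiao/. The first step is to collect the detail-page links from the list page (highlighted with a red box in the original screenshot).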

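[Image: Crawler 120 Examples, last article of stage one: collecting witty sentences with an object-oriented Python spider]
The list pages are paginated as follows; only the first page needs special handling.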

http://www.xuejuzi.cn/gaoxiao
http://www.xuejuzi.cn/gaoxiao/2.html
http://www.xuejuzi.cn/gaoxiao/3.html
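
The pagination rule above can be sketched as a small helper. This is an illustration only: `build_page_urls`, `BASE_URL`, and `PAGE_TEMPLATE` are names made up for this sketch, and it assumes that page 1 is the bare directory URL and every later page is `{num}.html`.

```python
BASE_URL = "http://www.xuejuzi.cn/gaoxiao/"
PAGE_TEMPLATE = "http://www.xuejuzi.cn/gaoxiao/{num}.html"


def build_page_urls(total_pages):
    """Return the list-page URLs for pages 1..total_pages."""
    # Page 1 has no number in its URL, so it is added as-is.
    urls = [BASE_URL]
    # Pages 2 and up follow the numbered template.
    urls.extend(PAGE_TEMPLATE.format(num=i) for i in range(2, total_pages + 1))
    return urls


for url in build_page_urls(3):
    print(url)
```

Starting the range at 2 avoids requesting a `1.html` page, which the URL list above does not show.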

Because the page contains a 末页 (last page) link, the total page count can be read from the page itself.
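
Turning that link into a number is the fragile step. The sketch below assumes the href ends in something like `35.html`, possibly with a path in front of it; the exact form on the live site is not confirmed here. `page_number_from_href` is a hypothetical helper name.

```python
import re


def page_number_from_href(href, default=1):
    """Extract the page number from a pager href such as '35.html'.

    Returns `default` when the href does not end in '<digits>.html'.
    """
    match = re.search(r"(\d+)\.html$", href)
    if match is None:
        return default
    return int(match.group(1))


# Assumed href shapes; the real site may use only one of them.
print(page_number_from_href("35.html"))
print(page_number_from_href("/gaoxiao/35.html"))
print(page_number_from_href("/gaoxiao/"))
```

Falling back to a default of 1 means a missing or unexpected pager is treated as a single-page listing and the crawl still runs.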

Extracting data from a detail page is also straightforward: the target text sits in p tags.
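
As a rough illustration of that extraction, the snippet below runs an XPath query against a made-up HTML fragment. The markup in `SAMPLE_DETAIL` is an assumption modeled on a `div` with class `content`; the real detail page may be structured differently.

```python
from lxml import etree

# Hypothetical detail-page markup, for illustration only.
SAMPLE_DETAIL = """
<html><body>
<div class="content">
  <p>First sentence.</p>
  <p>Second sentence.</p>
</div>
<div class="footer"><p>Not a sentence.</p></div>
</body></html>
"""


def extract_sentences(page_source):
    """Return the text of every p directly inside div.content."""
    html = etree.HTML(page_source)
    return html.xpath("//div[@class='content']/p/text()")


print(extract_sentences(SAMPLE_DETAIL))
```

Scoping the query to the content container keeps p tags elsewhere on the page, such as the footer in this sample, out of the result.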

Full code

The complete code for this example follows; the important parts are explained in comments.

import requests
from lxml import etree
import random


class Spider16:
    def __init__(self):

        self.wait_urls = ["http://www.xuejuzi.cn/gaoxiao/"]
        self.url_template = "http://www.xuejuzi.cn/gaoxiao/{num}.html"
        self.details = []

    def get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
            "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
            "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
            "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
            "Sosospider+(+http://help.soso.com/webspider.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

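    # Build the list of pages to crawl
    def create_urls(self):
        headers = self.get_headers()
        page_url = self.wait_urls[0]
        res = requests.get(url=page_url, headers=headers, timeout=5)
        html = etree.HTML(res.text)
        # Read the total page count from the last link in the pager
        last_page = html.xpath("//div[@class='page']/a[last()]/@href")
        if len(last_page) > 0:
            # Use the file name only, in case the href carries a path prefix
            last_page = int(last_page[0].split("/")[-1].split(".")[0])
        else:
            # No pager found: treat the listing as a single page
            last_page = 1

        # Page 1 is already in wait_urls, so start from page 2
        for i in range(2, last_page + 1):
            self.wait_urls.append(self.url_template.format(num=i))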

    def get_html(self):
        for url in self.wait_urls:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            if res:
                html = etree.HTML(res.text)
                detail_link = html.xpath("//dl/dd[1]/a/@href")
                self.details.extend(detail_link)

    def get_detail(self):
        for url in self.details:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            res.encoding = "gb2312"
            if res:
                html = etree.HTML(res.text)
                sentences = html.xpath("//div[@class='content']/p/text()")
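                # Join the sentences, one per line
                long_str = "\n".join(sentences)

                with open("sentences.txt","a+",encoding="utf-8") as f:
                    # The trailing newline keeps consecutive pages from running together
                    f.write(long_str + "\n")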

    def run(self):
        self.create_urls()
        self.get_html()
        self.get_detail()

if __name__ == '__main__':
    s = Spider16()
    s.run()
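
The text nodes returned by an XPath `text()` query can carry stray whitespace or be blank. A cleaning step before writing to the file is one way to handle that. It is not part of the original code; `clean_sentences` and `to_block` are hypothetical helpers sketched here.

```python
def clean_sentences(raw_sentences):
    """Strip surrounding whitespace and drop blank entries."""
    cleaned = []
    for text in raw_sentences:
        text = text.strip()
        if text:
            cleaned.append(text)
    return cleaned


def to_block(raw_sentences):
    """Join cleaned sentences one per line, ending with a newline."""
    lines = clean_sentences(raw_sentences)
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


print(to_block(["  first sentence \r\n", "   ", "second sentence"]))
```

Writing `to_block(sentences)` instead of the raw joined string would keep blank lines out of sentences.txt and ensure each page's output ends on its own line.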

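Some of the collected sentences really are funny:

1. Time really is precious: one second later and someone else would have grabbed the toilet.
2. I want to leave my future mother-in-law a bad review: delivery is too slow.
3. Falling in love with you hurt me to death.
4. I've quit smoking; any more and I really would be riding the clouds.
5. I've realized that all these years I've just been a pair of underpants: whatever fart comes along, I have to take it.
6. Happy birthday to me! May my future wife find me soon, so we can register our marriage and have kids.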

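Time to bookmark

Code download: https://codechina.csdn.net/hihell/python120 (a Star would be appreciated).

Data collected in this example: https://download.csdn.net/download/hihell/21048666

Since you're already here, why not leave a comment, a like, or a bookmark?

Today is day 196 of 200 of continuous writing.
Feel free to follow, like, comment, and bookmark.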



Added: 2021-08-14 13:59:04 Updated: 2021-08-14 13:59:31
