[Python Knowledge Base] Ready-to-Use Python Web Scraper Code Blocks



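Be sure to read the comments!

These are beginner-level scraper code blocks: put the target URL into the parameter variables, adjust the other parameters as needed, and run the code directly to get results.

The blocks can be combined freely. If the text needs formatting, call the changeTxt function once per replacement until every replacement has been made.

The comments are short and easy to follow. Take whichever blocks you need, and remember to add the necessary import statements and install the required packages.

Note: because the pattern passed to re.compile is not convenient to expose as a function parameter, readers need to write the regular expression in the image-download block themselves.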
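
One possible way around the re.compile limitation mentioned in the note is to build the pattern string from a parameter with re.escape. The following is only a sketch: the helper name build_img_pattern and the sample HTML are hypothetical, and the pattern assumes that the class attribute appears before src inside the tag, which may not hold for the page you are scraping.

```python
import re

def build_img_pattern(class_name):
    """Compile a regex that captures the src of <img> tags with the given class.

    Hypothetical helper: assumes class comes before src inside the tag.
    """
    escaped = re.escape(class_name)  # neutralise regex metacharacters in the class name
    return re.compile(r'<img[^>]*class="' + escaped + r'"[^>]*?src="(.*?)"')

# Made-up sample markup, modelled on the tag used later in getFinurl
sample_html = (
    '<img alt="" class="attachment-thumbnail size-thumbnail" height="180" '
    'loading="lazy" src="https://example.com/a.jpg" width="285"/>'
    '<img class="other" src="https://example.com/b.jpg"/>'
)

links = build_img_pattern("attachment-thumbnail size-thumbnail").findall(sample_html)
print(links)
```

Only the first tag carries the requested class, so only its src is captured.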

import urllib.request
from bs4 import BeautifulSoup
import re

# Fetch the full HTML content of the given page
def getPointResult(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.39 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.39'
    }
    req = urllib.request.Request(url=url, headers=headers, method="GET")
    with urllib.request.urlopen(req, timeout=20) as response:
        html = response.read().decode("UTF-8")  # assumes the page is UTF-8 encoded
    return html
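
getPointResult always decodes the page as UTF-8, which fails on pages served in another encoding such as GBK. A minimal sketch of choosing the encoding from the Content-Type header follows. The helper name decode_body is hypothetical, and the bytes are built locally so the example runs without a network connection. With a real urllib response, the same lookup is available as response.headers.get_content_charset().

```python
from email.message import Message

def decode_body(raw, content_type, default="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    # errors="replace" keeps going when a few bytes do not fit the charset
    return raw.decode(charset, errors="replace")

gbk_text = decode_body("中文".encode("gbk"), "text/html; charset=GBK")
utf8_text = decode_body("中文".encode("utf-8"), "text/html")
print(gbk_text, utf8_text)
```

A charset name that Python does not recognise would still raise LookupError, so a real scraper may want to catch that and fall back to the default.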

# Create the files that store the page content (uses getPointResult)
def text_create(name1, name2, url):
    desktop_path = "resFolder/"  # folder for the new files; it must already exist
    full_path1 = desktop_path + name1 + '.html'  # the extension can be changed if needed
    full_path2 = desktop_path + name2 + '.html'
    with open(full_path1, 'w', encoding="UTF-8") as file1:
        file1.write(getPointResult(url))  # raw page content
    # name2 is created empty; changeTxt later writes the formatted text into it
    with open(full_path2, 'w', encoding="UTF-8"):
        pass

# Format the fetched page text by replacing orign_rep with after_rep.
# name1 is the original file and name2 the formatted file; call this repeatedly when several replacements are needed
def changeTxt(name1, name2, filepath, orign_rep, after_rep):
    with open(filepath + "/" + name1 + ".html", "r", encoding="UTF-8") as infile, \
         open(filepath + "/" + name2 + ".html", "w", encoding="UTF-8") as outfile:
        for line in infile:  # read line by line to keep memory use low for large files
            outfile.write(line.replace(orign_rep, after_rep))  # first argument is the old text, second the new
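
When several replacements are needed, calling changeTxt once per replacement means rereading the file each time. As an alternative sketch, all replacements can be applied in a single pass. The function name replace_many and the sample content are hypothetical, and the demo writes to a temporary folder rather than resFolder.

```python
import os
import tempfile

def replace_many(src_path, dst_path, replacements):
    """Copy src_path to dst_path, applying each (old, new) pair in order."""
    with open(src_path, "r", encoding="UTF-8") as infile, \
         open(dst_path, "w", encoding="UTF-8") as outfile:
        for line in infile:  # still line by line, as in changeTxt
            for old, new in replacements:
                line = line.replace(old, new)
            outfile.write(line)

# Demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "res.html")
    dst = os.path.join(tmp, "res_clean.html")
    with open(src, "w", encoding="UTF-8") as f:
        f.write("<p>a&nbsp;b</p>\n<br/>c\n")
    replace_many(src, dst, [("&nbsp;", " "), ("<br/>", "")])
    with open(dst, "r", encoding="UTF-8") as f:
        result = f.read()

print(result)
```

The pairs are applied in list order, so a later replacement sees the output of the earlier ones.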
#==============================================================================================

# file_fullpath: relative path of the saved file, e.g. resFolder/res.html
# tag_type: the tag to look for, e.g. "img"; other: the value passed on as class_
def urlImgout(file_fullpath, tag_type, other):
    with open(file_fullpath, "r", encoding="UTF-8") as file:
        html = file.read()
    bs = BeautifulSoup(html, "html.parser")
    bs1 = bs.find_all(tag_type, class_=other)
    return bs1
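
Since find_all already returns tag objects, the image address can also be read straight from each tag's src attribute instead of matching the tag text with a regular expression. This is a sketch of that alternative, not the approach used in this article; the function name img_sources and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
sample_html = """
<div>
  <img class="attachment-thumbnail size-thumbnail" src="https://example.com/a.jpg"/>
  <img class="attachment-thumbnail size-thumbnail" src="https://example.com/b.jpg"/>
  <img class="logo" src="https://example.com/logo.png"/>
</div>
"""

def img_sources(html, class_name):
    """Return the src of every <img> tag that carries the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("img", class_=class_name)
    # tag.get returns None when the attribute is missing, so skip those tags
    return [tag.get("src") for tag in tags if tag.get("src")]

srcs = img_sources(sample_html, "attachment-thumbnail")
print(srcs)
```

This does not depend on the order or exact set of attributes in the tag, which a hand-written pattern does.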

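# Reduce the tags returned by urlImgout to a list of image URLs.
# The pattern below is site-specific: rewrite it to match the <img> markup of your target page.
def getFinurl(file_fullpath, tag_type, other):
    findLink = re.compile(r'<img alt="" class="attachment-thumbnail size-thumbnail" height="180" loading="lazy" src="(.*?)" width="285"/>')
    links = []
    for item in urlImgout(file_fullpath, tag_type, other):
        links.extend(re.findall(findLink, str(item)))
    return links

# Save the images. filepath is the target folder relative to this .py file, e.g. resFolder/img;
# the images are saved with increasing numbers as file names.
# file_fullpath, tag_type and other are passed on to getFinurl.
def savePics(filepath, set_timeout, file_fullpath, tag_type, other):
    import requests
    links = getFinurl(file_fullpath, tag_type, other)  # build the URL list once
    for i, link in enumerate(links):
        r = requests.get(link, timeout=set_timeout)
        file_name = str(i) + '.jpg'
        with open(filepath + '/' + file_name, 'wb') as f:
            f.write(r.content)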
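
savePics names every file with a .jpg extension, even when the URL points to a PNG or GIF. A small sketch of deriving the extension from the URL instead is shown below. The helper name file_name_for and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

def file_name_for(index, url, default_ext=".jpg"):
    """Build a numbered file name that keeps the extension found in the URL path."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return str(index) + (ext if ext else default_ext)

# Made-up URLs for illustration
urls = [
    "https://example.com/img/a.PNG?size=large",
    "https://example.com/img/b",
]
names = [file_name_for(i, u) for i, u in enumerate(urls)]
print(names)
```

The extension in a URL is only a hint about the real file type, so this remains a best-effort naming scheme.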


Added: 2021-11-22 12:17:56 Updated: 2021-11-22 12:19:39
