
[Python Knowledge Base] Web Data Collection with Python

Advanced requests usage

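"""
example01 - Advanced requests usage -> Session

Author: Lj~Asus
Date: 2021/8/23
"""
import requests

# A Session reuses the same settings, cookies and connections across requests
session = requests.Session()
# Skip TLS certificate verification (convenient for testing, unsafe in production)
session.verify = False
# Headers set here are sent with every request made through this session
session.headers.update({
    'User-Agent': '...'
})
resp = session.get('URL to fetch')
print(resp.status_code)
print(resp.text)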

Getting past crawler honeypots with Selenium

The single most important line for getting past Selenium detection:
browser.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    }
)

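"""
example03 - Getting past crawler honeypots with Selenium

Author: Lj~Asus
Date: 2021/8/23
"""
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 3 style; Selenium 4 expects webdriver.Chrome(service=Service('resources/chromedriver.exe'))
browser = webdriver.Chrome('resources/chromedriver.exe')

# Turn off the automation switch. This is a ChromeOptions setting, so it has to be
# applied to an options object that is passed in when the browser is created:
# options = webdriver.ChromeOptions()
# options.add_experimental_option('excludeSwitches', ['enable-automation'])

# The single most important line for getting past Selenium detection:
# it hides navigator.webdriver from every page before the page's own scripts run
browser.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    }
)

browser.get('')

browser.implicitly_wait(10)

# find_element(By.CSS_SELECTOR, ...) works in both Selenium 3 and Selenium 4
anchor = browser.find_element(By.CSS_SELECTOR, '')
# The is_displayed method of a WebElement tells whether the element is visible
# Note: invisible links should generally not be visited, because they are very likely honeypot links meant to lure crawlers
print(anchor.is_displayed())
print(anchor.size)
print(anchor.location)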
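
The visibility check above can be applied to every link on a page before following any of them. The following is a minimal sketch, not a definitive filter: `visible_links` is a hypothetical helper name, and the zero-area check is an added assumption about what counts as suspicious. It only relies on `is_displayed()`, `size` and `get_attribute()`, which Selenium's WebElement provides.

```python
def visible_links(anchors):
    """Return the hrefs of anchors that a human visitor could see and click."""
    hrefs = []
    for anchor in anchors:
        if not anchor.is_displayed():
            continue  # hidden by CSS: very likely a honeypot link
        size = anchor.size  # WebElement.size is a dict with 'width' and 'height'
        if size['width'] == 0 or size['height'] == 0:
            continue  # zero-area link (assumed suspicious in this sketch)
        href = anchor.get_attribute('href')
        if href:
            hrefs.append(href)
    return hrefs
```

With a live browser, the input would typically come from something like `browser.find_elements(By.TAG_NAME, 'a')`.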

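Optical character recognition (OCR)

Note: installing easyocr also pulls in other libraries, about 1.7 GB in total, so install it when you have a good network connection.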

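"""
example04 - Optical character recognition

Author: Lj~Asus
Date: 2021/8/23
"""
import warnings

import easyocr

# Suppress warnings
warnings.filterwarnings('ignore')

# Simplified Chinese: ch_sim, Traditional Chinese: ch_tra, English and digits: en
reader = easyocr.Reader(['en'], gpu=False)
# detail=0 returns only the recognized text, without bounding boxes or confidence scores
print(reader.readtext('path to the image to recognize', detail=0))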

Cropping a region out of an image

PIL (Python Imaging Library) -> pillow

Then use the crop() function

"""
example05 - Cropping a region out of an image

Author: Lj~Asus
Date: 2021/8/23
"""
from PIL import Image as img
from PIL.Image import Image

image = img.open('resources/idcard.jpg')  # type: Image
print(image.size)
# Crop; the box is (left, upper, right, lower) in pixels
# 500, 316
head = image.crop((320, 50, 460, 235))
# Display
head.show()
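
A crop box that reaches past the image edge is easy to produce by accident, so it can help to clamp the box first. The sketch below is pure Python and does not call Pillow; `clamp_box` is a hypothetical helper, and the 500x316 size is taken from the comment in the listing above on the assumption that it is the image size.

```python
def clamp_box(box, image_size):
    """Clip a (left, upper, right, lower) box to the bounds of an image."""
    left, upper, right, lower = box
    width, height = image_size
    left, right = max(0, left), min(width, right)
    upper, lower = max(0, upper), min(height, lower)
    if left >= right or upper >= lower:
        raise ValueError(f'box {box} lies outside an image of size {image_size}')
    return left, upper, right, lower


# The box from the listing fits inside a 500x316 image, so it comes back unchanged.
left, upper, right, lower = clamp_box((320, 50, 460, 235), (500, 316))
print((left, upper, right, lower))
# The cropped region is (right - left) x (lower - upper) pixels.
print(right - left, lower - upper)
```

The returned tuple can be passed straight to `image.crop(...)`.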

Ways to speed up crawling

Concurrent programming

Multithreading

Thread(target=…, args=(…, …)) -> start()

Subclass Thread and override run() -> create an object of the custom class -> start()

ThreadPoolExecutor() -> submit(fn, …) / map(fn, […])

"""
example08 - The first way to write multithreaded code

Author: Lj~Asus
Date: 2021/8/24
"""
import time
from threading import Thread


def output(content):
    while True:
        # stdout is buffered; flush=True writes the text immediately instead of waiting for the buffer to fill
        print(content, end='', flush=True)
        time.sleep(0.1)

# output('Ping')
Thread(target=output, args=('Ping', )).start()
Thread(target=output, args=('Pong', )).start()
output('Hello')

"""
example10 - The second way to write multithreaded code: a custom thread class

Author: Hao
Date: 2021/8/24
"""
import time
from threading import Thread


class OutputThread(Thread):
    """Custom thread class"""

    def __init__(self, content):
        super().__init__()
        self.content = content

    def run(self):
        # run() is what executes in the new thread once start() is called
        while True:
            print(self.content, end='', flush=True)
            time.sleep(0.1)


OutputThread('Ping').start()
OutputThread('Pong').start()

"""
example11 - The third way to write multithreaded code: a thread pool

Author: Lj~Asus
Date: 2021/8/24
"""
import time
from concurrent.futures import ThreadPoolExecutor


def output(content):
    while True:
        # stdout is buffered; flush=True writes the text immediately instead of waiting for the buffer to fill
        print(content, end='', flush=True)
        time.sleep(0.1)

with ThreadPoolExecutor(max_workers=16) as pool:
    pool.submit(output, 'Ping')
    pool.submit(output, 'Pong')
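
The listings above loop forever, which makes it hard to see what a pool returns. Here is a minimal terminating sketch of the map(fn, […]) form: `download` and `crawl` are hypothetical names, and `time.sleep` stands in for a real network request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download(page):
    """Stand-in for an I/O-bound request; time.sleep simulates network latency."""
    time.sleep(0.1)
    return page, len(f'content of page {page}')


def crawl(pages, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, regardless of which task finishes first
        return list(pool.map(download, pages))


if __name__ == '__main__':
    start = time.perf_counter()
    for page, size in crawl(range(8)):
        print(page, size)
    # The eight sleeps overlap, so this should take far less than 8 x 0.1 s
    print(f'elapsed: {time.perf_counter() - start:.2f}s')
```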

Multiprocessing

Process(target=…, args=(…, …)) -> start()

Subclass Process and override run() -> create an object of the custom class -> start()

ProcessPoolExecutor() -> submit(fn, …) / map(fn, […])
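
The first form, Process(target=…, args=(…, …)) followed by start(), might look like the sketch below. The prime-counting task, the chunk boundaries and the use of a Queue to collect results are illustrative assumptions, not something the original notes prescribe.

```python
from multiprocessing import Process, Queue


def count_primes(start, stop):
    """Count primes in [start, stop) by trial division (CPU-bound work)."""
    total = 0
    for num in range(max(start, 2), stop):
        if all(num % i for i in range(2, int(num ** 0.5) + 1)):
            total += 1
    return total


def worker(start, stop, results):
    """Runs in a child process and sends its partial count back through the queue."""
    results.put(count_primes(start, stop))


def main():
    results = Queue()
    # One chunk per process; the boundaries are arbitrary
    chunks = [(1, 25_000), (25_000, 50_000), (50_000, 75_000), (75_000, 100_000)]
    procs = [Process(target=worker, args=(lo, hi, results)) for lo, hi in chunks]
    for proc in procs:
        proc.start()
    # Read all results before joining so no child blocks while writing to the queue
    total = sum(results.get() for _ in procs)
    for proc in procs:
        proc.join()
    print(f'primes below 100000: {total}')


if __name__ == '__main__':
    main()
```

The `if __name__ == '__main__'` guard matters here: on platforms that start child processes by re-importing the module, it keeps the children from launching processes of their own.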

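Asynchronous programming (asynchronous I/O) -> cooperative concurrency, which produces the effect of concurrency by raising CPU utilization
I/O-bound tasks -> most of the work is input and output, and very little CPU computation is needed
CPU-bound tasks -> most of the work is CPU computation, and I/O interrupts rarely happen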
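
For an I/O-bound task, cooperative concurrency can be sketched with the standard asyncio module. In this sketch `fetch` and `crawl` are hypothetical names and `asyncio.sleep` stands in for waiting on a network response; a real crawler would need an asynchronous HTTP client.

```python
import asyncio
import time


async def fetch(page, delay=0.1):
    """Stand-in for a network request: asyncio.sleep simulates waiting on I/O."""
    await asyncio.sleep(delay)
    return f'page-{page}'


async def crawl(pages):
    # gather() runs all coroutines on a single thread; whenever one is waiting
    # at an await, the event loop switches to another that is ready to continue
    return await asyncio.gather(*(fetch(page) for page in pages))


if __name__ == '__main__':
    start = time.perf_counter()
    print(asyncio.run(crawl(range(10))))
    # The ten waits overlap, so this should be close to one delay, not ten
    print(f'elapsed: {time.perf_counter() - start:.2f}s')
```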

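Distributed crawlers

Key point: a Redis database (a key-value store) is usually deployed to hold the pages waiting to be crawled and the pages already crawled, and possibly some of the collected data as well. This lets multiple machines running the crawler coordinate with one another and work toward a common goal.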
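
The coordination described above comes down to two shared structures: a set of pages already seen and a queue of pages still to crawl. The sketch below shows that logic with an in-memory stand-in so it runs without a server. `InMemoryStore`, `schedule`, `next_url` and the key names are hypothetical; the stand-in imitates only the Redis commands SADD, LPUSH and RPOP, for which the redis-py client has methods of the same names.

```python
from collections import deque


class InMemoryStore:
    """Hypothetical stand-in for the three Redis commands this sketch needs."""

    def __init__(self):
        self._sets = {}
        self._lists = {}

    def sadd(self, key, value):
        """Like SADD: return 1 if the value is new to the set, otherwise 0."""
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def lpush(self, key, value):
        """Like LPUSH: add the value at the head of the list."""
        self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        """Like RPOP: remove and return the tail of the list, or None if empty."""
        items = self._lists.get(key)
        return items.pop() if items else None


def schedule(store, url):
    """Queue a URL only if no worker has seen it before."""
    if store.sadd('crawler:seen', url):
        store.lpush('crawler:todo', url)
        return True
    return False


def next_url(store):
    """Take the oldest pending URL, or None when nothing is left."""
    return store.rpop('crawler:todo')
```

Because every worker talks to the same store, a page scheduled by one machine is skipped by the others.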

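Using multiple processes and process pools

Because of the GIL in CPython, multithreading cannot take advantage of multiple CPU cores; for CPU-bound tasks, consider multiprocessing instead.
Run in a terminal:

Run the code below using the thread pool
python example08.py

Run the code below using the process pool (you can check in Task Manager how many cores your machine has)
python example08.py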

"""
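example08 - Using multiple processes and process pools
Because of the GIL, multithreading cannot take advantage of multiple CPU cores; for CPU-bound tasks, consider multiprocessing

time python example08.py -> run the code and measure the elapsed time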

Author: Hao
Date: 2021/8/23
"""
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor

# Check whether each number in the list is prime (a CPU-bound task)
PRIMES = [
    1116281,
    1297337,
    104395303,
    472882027,
    533000389,
    817504243,
    982451653,
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419,
    1099726899285421
] * 5


def is_prime(num):
    """Check whether num is prime"""
    for i in range(2, int(num ** 0.5) + 1):
        if num % i == 0:
            return False
    return num > 1


def main():
    """Main function"""
    # # Run with multiple threads
    # with ThreadPoolExecutor(max_workers=4) as pool:
    #     for number, result in zip(PRIMES, pool.map(is_prime, PRIMES)):
    #         print(f'{number} is prime: {result}')
    # Run with multiple processes (check how many cores your machine has when choosing max_workers)
    with ProcessPoolExecutor(max_workers=4) as pool:
        for number, result in zip(PRIMES, pool.map(is_prime, PRIMES)):
            print(f'{number} is prime: {result}')


if __name__ == '__main__':
    main()

Generators

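"""
example12 - Generators

Author: Lj~Asus
Date: 2021/8/24
"""

# Literal syntax for creating a generator (a generator expression)
nums = (num for num in range(1, 10))

# Take a value from the generator with the next function
print(next(nums))

# The loop continues from where next() left off
for num in nums:
    print(num, end=' ')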

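"""
example13 - Generators

Once yield appears in a function, it is no longer an ordinary function; it is a generator function.
Calling it does not return a value; it returns a generator object.

Author: Lj~Asus
Date: 2021/8/24
"""


def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    # return a
        yield a

gen_obj = fib(20)
print(next(gen_obj))
print(next(gen_obj))

# The loop picks up the remaining values after the two taken above
for i in gen_obj:
    print(i, end=' ')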
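
Because a generator only computes a value when asked for one, it can even describe an endless sequence. The sketch below illustrates this with a hypothetical `fib_forever` variant of the function above and `itertools.islice` from the standard library.

```python
from itertools import islice


def fib_forever():
    """Yield Fibonacci numbers without end; each value is computed only on demand."""
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a


# islice stops after 10 values, so the infinite loop never runs ahead of the consumer
first_ten = list(islice(fib_forever(), 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

# Generators chain into lazy pipelines: keep the even values, then take five of them
evens = (n for n in fib_forever() if n % 2 == 0)
first_evens = list(islice(evens, 5))
print(first_evens)  # [2, 8, 34, 144, 610]
```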

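Using a crawler framework

Framework: it packages the commonly used features and boilerplate code of project development, so you can focus on the core problem instead of rewriting the same boilerplate and re-implementing features that have already been implemented countless times.

Scrapy -> command-line tool -> create a crawler project

Install Scrapy (note: remember to do this in a command prompt window): pip install scrapy
Create a Scrapy project: scrapy startproject demo
Create a spider: scrapy genspider douban movie.douban.com

After the project has been created, drag it into PyCharm, where the generated project files will appear:

    - Edit the configuration file (find the relevant settings in `settings.py`):
        - USER_AGENT
        - DOWNLOAD_DELAY
        - CONCURRENT_REQUESTS
    - Run a spider: scrapy crawl douban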
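
For orientation, the three settings might look like this in `settings.py`. The values are illustrative assumptions only, not recommendations from the original notes; choose them according to the target site.

```python
# settings.py (fragment); every value below is an illustrative assumption

# Sent as the User-Agent header with each request
USER_AGENT = 'Mozilla/5.0 (compatible; demo-crawler)'

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 3

# Upper bound on the requests Scrapy keeps in flight at once (Scrapy's default is 16)
CONCURRENT_REQUESTS = 8
```

A longer delay and a lower concurrency limit make the crawl slower but put less load on the site.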

Added: 2021-08-25 12:09:51  Updated: 2021-08-25 12:10:38