[Python Knowledge Base] [Python Core] Concurrent Programming with Futures


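In any language, concurrent programming is a common and important technique.

Used correctly and sensibly, concurrency can bring a substantial performance gain to a program. What follows is a look at understanding and applying concurrent programming in Python through Futures.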

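I. Concurrency vs. Parallelism

When learning concurrent programming, you often hear the terms concurrency and parallelism together. Because they are so often used side by side, many people assume they mean the same thing. They do not.

1.1 A common misconception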

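First, a misconception to clear up: in Python, concurrency does not mean that multiple operations (threads, tasks) run at the same instant. On the contrary, only one operation is allowed to run at any given moment (for threads in standard CPython builds, the global interpreter lock enforces this); threads or tasks simply switch back and forth until they finish.
The original figure (not reproduced here) showed two different switching patterns, thread and task, which correspond to the two forms of concurrency in Python: threading and asyncio.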

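1.2 threading and asyncio

threading

With threading, the operating system knows everything about each thread, so it decides on its own when to switch between threads.
Advantage:
The code is easy to write, because the programmer does not have to handle switching at all.
Drawback:
A thread switch can happen in the middle of a single statement (for example x += 1), which makes race conditions easy to introduce.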
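
One common way to guard against the x += 1 problem is a lock. The sketch below is only an illustration under assumed names (counter, counter_lock, add_many and run are not from the article): several threads increment a shared counter, and the lock keeps a thread switch from landing between the read and the write.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        # counter += 1 is a read, an add and a write; holding the lock
        # means no other thread can interleave between those steps.
        with counter_lock:
            counter += 1

def run(num_threads=4, times=10_000):
    """Start the threads, wait for them all, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

if __name__ == '__main__':
    print('Final count: {}'.format(run()))
```

With the lock in place the final count always equals num_threads * times. The cost is that the increments themselves run one at a time.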

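asyncio

With asyncio, the main program can switch tasks only after the running task signals that it is ready to be switched out, which avoids the kind of race condition just described.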
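
As a rough sketch of what "signalling that a task can be switched" looks like in code, the example below uses await as the switch point. The names worker and run_workers are made up for illustration. Code between two awaits runs without interruption from other tasks on the same event loop.

```python
import asyncio

async def worker(name, steps, log):
    for step in range(steps):
        # There is no await inside this append, so no other task on this
        # event loop can run while it executes.
        log.append((name, step))
        # The await is the explicit point where this task allows a switch.
        await asyncio.sleep(0)

async def run_workers():
    log = []
    # Run two workers on the same event loop and wait for both.
    await asyncio.gather(worker('A', 3, log), worker('B', 3, log))
    return log

if __name__ == '__main__':
    print(asyncio.run(run_workers()))
```

The shared list needs no lock here, because a task gives up control only where it awaits.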

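1.3 Understanding parallelism

Parallelism, by contrast, really does mean things happening at the same moment.
Python's multiprocessing is what this refers to. A simple way to think about it: on a machine with a 6-core processor, you can have Python start 6 processes that run at the same time to speed up the program. (The original schematic figure is not reproduced here.)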

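1.4 Concurrency versus parallelism

Concurrency is usually applied where I/O operations are frequent.
For example, when downloading many files from a website, the time spent on I/O can be far longer than the time the CPU spends processing.
Parallelism is applied more in CPU-heavy scenarios.
For example, the parallel computation in MapReduce generally uses multiple machines and multiple processors to speed things up.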

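II. Concurrent Programming with Futures

2.1 Comparing single-threaded and multi-threaded performance

Next, a concrete example shows Futures in concurrent programming from the code's point of view, and compares its performance with a single-threaded version.

Suppose the task is to download the content of several web pages and print how many bytes were read from each. A single-threaded implementation looks like this (exception handling is omitted to keep the code short and on topic):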

import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))
    
def download_all(sites):
    for site in sites:
        download_one(site)

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))
    
if __name__ == '__main__':
    main()

# Output
Read 129886 from https://en.wikipedia.org/wiki/Portal:Arts
Read 184343 from https://en.wikipedia.org/wiki/Portal:History
Read 224118 from https://en.wikipedia.org/wiki/Portal:Society
Read 107637 from https://en.wikipedia.org/wiki/Portal:Biography
Read 151021 from https://en.wikipedia.org/wiki/Portal:Mathematics
Read 157811 from https://en.wikipedia.org/wiki/Portal:Technology
Read 167923 from https://en.wikipedia.org/wiki/Portal:Geography
Read 93347 from https://en.wikipedia.org/wiki/Portal:Science
Read 321352 from https://en.wikipedia.org/wiki/Computer_science
Read 391905 from https://en.wikipedia.org/wiki/Python_(programming_language)
Read 321417 from https://en.wikipedia.org/wiki/Java_(programming_language)
Read 468461 from https://en.wikipedia.org/wiki/PHP
Read 180298 from https://en.wikipedia.org/wiki/Node.js
Read 56765 from https://en.wikipedia.org/wiki/The_C_Programming_Language
Read 324039 from https://en.wikipedia.org/wiki/Go_(programming_language)
Download 15 sites in 2.464231112999869 seconds

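This approach is the most direct and the simplest:

Iterate over the list of sites
Download the current site
Wait for that download to finish, then do the same for the next site, until the list is exhausted

The total time is about 2.5 s.
The single-threaded version is simple and clear, but obviously inefficient, because the program spends the vast majority of its time waiting on I/O. Each download cannot start until the previous one has finished. In a real production environment the number of sites to download is counted in the tens of thousands at least, and it is easy to see that this approach will not work.

Now look at the multi-threaded version: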

import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))


def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_one, sites)

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()

# Output
Read 151021 from https://en.wikipedia.org/wiki/Portal:Mathematics
Read 129886 from https://en.wikipedia.org/wiki/Portal:Arts
Read 107637 from https://en.wikipedia.org/wiki/Portal:Biography
Read 224118 from https://en.wikipedia.org/wiki/Portal:Society
Read 184343 from https://en.wikipedia.org/wiki/Portal:History
Read 167923 from https://en.wikipedia.org/wiki/Portal:Geography
Read 157811 from https://en.wikipedia.org/wiki/Portal:Technology
Read 91533 from https://en.wikipedia.org/wiki/Portal:Science
Read 321352 from https://en.wikipedia.org/wiki/Computer_science
Read 391905 from https://en.wikipedia.org/wiki/Python_(programming_language)
Read 180298 from https://en.wikipedia.org/wiki/Node.js
Read 56765 from https://en.wikipedia.org/wiki/The_C_Programming_Language
Read 468461 from https://en.wikipedia.org/wiki/PHP
Read 321417 from https://en.wikipedia.org/wiki/Java_(programming_language)
Read 324039 from https://en.wikipedia.org/wiki/Go_(programming_language)
Download 15 sites in 0.19936635800002023 seconds

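The difference is obvious: the total time is about 0.2 s, more than 10 times faster.

Looking at the code, the main difference between the multi-threaded and single-threaded versions is here: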

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
     executor.map(download_one, sites)

This creates a thread pool with 5 threads available for use.
executor.map() is similar to Python's built-in map() function: it calls download_one() on every element of sites, concurrently.
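
One detail worth seeing in code: executor.map() returns an iterator, and iterating it yields the return values in input order, whatever order the calls finish in. Any exception raised inside a worker is re-raised only when its result is retrieved, so a caller that never iterates the return value never sees those errors. A minimal sketch, with slow_square and squares as made-up names:

```python
import concurrent.futures
import time

def slow_square(n):
    # Larger inputs sleep less, so calls finish in a different order
    # from the one in which they were submitted.
    time.sleep(0.01 * max(0, 3 - n))
    return n * n

def squares(numbers):
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        # list() drains the iterator: results come back in input order,
        # and an exception from any call would be raised here.
        return list(executor.map(slow_square, numbers))

if __name__ == '__main__':
    print(squares([0, 1, 2]))
```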

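Incidentally, the requests.get() call used in download_one() is thread-safe (each call creates its own session), so it can be used safely from multiple threads without causing a race condition.

Also, although you choose the number of threads yourself, more threads is not always better.
Creating, maintaining and destroying threads has a cost of its own, so a very large pool may actually slow things down. It usually takes some testing against the real workload to find the best thread count.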
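
A simple way to run such a test is to time the same workload with several pool sizes. The sketch below is only an illustration: fake_download and time_pool are hypothetical helpers, and time.sleep() stands in for the network wait, so the numbers say nothing about a real site.

```python
import concurrent.futures
import time

def fake_download(delay):
    time.sleep(delay)  # stands in for waiting on the network
    return delay

def time_pool(max_workers, num_tasks=8, delay=0.05):
    """Return the seconds needed to run num_tasks fake downloads."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(fake_download, [delay] * num_tasks))
    return time.perf_counter() - start

if __name__ == '__main__':
    for workers in (1, 2, 4, 8):
        print('{} workers: {:.3f} seconds'.format(workers, time_pool(workers)))
```

For a real measurement, replace fake_download with the actual request and repeat each run a few times, since network timings vary.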

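Of course, parallelism can also be used to speed up the program. Only the following change in download_all() is needed:

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
=>
with concurrent.futures.ProcessPoolExecutor() as executor:

In the changed line, ProcessPoolExecutor() creates a process pool, so the program runs in parallel across multiple processes.
The max_workers argument is usually omitted, because the pool then defaults to the number of processors on the machine.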

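As mentioned, parallelism is generally used in CPU-heavy scenarios. I/O-heavy operations spend most of their time waiting, so using multiple processes instead of multiple threads does not improve efficiency. On the contrary, because the number of CPUs is limited, the multi-process version often runs slower than the multi-threaded one.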
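
To make the CPU-heavy case concrete, here is a small sketch in which only the executor class changes between the thread and process versions. The prime-counting workload and the names count_primes and count_all are assumptions chosen for illustration, not part of the original example.

```python
import concurrent.futures

def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def count_all(limits, executor_cls=concurrent.futures.ProcessPoolExecutor):
    # The executor class is the only thing that differs between the
    # multi-threaded and multi-process versions.
    with executor_cls() as executor:
        return list(executor.map(count_primes, limits))

if __name__ == '__main__':
    # The __main__ guard matters for process pools: worker processes may
    # import this module again when they start.
    print(count_all([10, 20, 30]))
```

The function handed to a process pool has to be defined at module level, as count_primes is here, so that it can be sent to the worker processes.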

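III. What Are Futures

3.1 What Futures do

Python's Future classes live in concurrent.futures and asyncio; in both cases they represent a deferred operation.
A Future wraps an operation that is waiting to run and places it in a queue. The state of the operation can be queried at any time, and its result, or its exception if it raised one, can be retrieved once the operation completes.

As a user you normally do not need to think about how to create Futures; that is handled for you under the hood. What you actually do is schedule their execution.

For example, with the Executor class in concurrent.futures, calling executor.submit(func) schedules func() for execution and returns the newly created future instance, so that you can query it later.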
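
A minimal sketch of that flow, with add and submit_and_wait as made-up names: submit() hands back a Future right away, and the value is fetched from it afterwards.

```python
import concurrent.futures

def add(a, b):
    return a + b

def submit_and_wait():
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        # submit() schedules add(2, 3) and returns a Future immediately;
        # extra positional arguments are passed through to the function.
        future = executor.submit(add, 2, 3)
        # result() waits for the call to finish and returns its value.
        return future, future.result()

if __name__ == '__main__':
    future, value = submit_and_wait()
    print('{} returned {}'.format(type(future).__name__, value))
```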

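3.2 Some commonly used functions

done()

The done() method of a Future reports whether the corresponding operation has completed: True means it has, False means it has not.

Note: done() is non-blocking and returns immediately.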
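
The following sketch shows done() before and after a task finishes. The threading.Event is an assumption added for the demonstration: it holds the task open so that the first done() call cannot race with the task finishing. check_done is a made-up name.

```python
import concurrent.futures
import threading

def check_done():
    release = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        # The task blocks until the event is set, so it cannot finish early.
        future = executor.submit(release.wait)
        before = future.done()  # returns at once; the task is still waiting
        release.set()           # let the task finish
        future.result()         # block until it has actually finished
        after = future.done()
    return before, after

if __name__ == '__main__':
    print(check_done())
```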

add_done_callback(fn)

Correspondingly, add_done_callback(fn) registers the function fn to be notified and called once the Future completes.
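
As a sketch, the callback receives the finished future as its only argument and can read the result from it. The names double, on_done and collect_with_callbacks are illustrative only.

```python
import concurrent.futures

def double(n):
    return n * 2

def collect_with_callbacks(numbers):
    results = []

    def on_done(future):
        # Called with the completed future once its task has finished.
        results.append(future.result())

    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        for n in numbers:
            executor.submit(double, n).add_done_callback(on_done)
    # Leaving the with block waits for every task, so all callbacks have
    # run by now. Completion order is not fixed, hence the sort.
    return sorted(results)

if __name__ == '__main__':
    print(collect_with_callbacks([1, 2, 3]))
```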

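result()

Another important method is result(): once the future completes, it returns the corresponding result or raises the corresponding exception. If the future has not completed yet, the call blocks until it does.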
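
The sketch below shows both outcomes: a normal return value, and an exception raised inside the worker that surfaces at the result() call. parse_positive and try_parse are hypothetical helpers, and the 5-second timeout is an arbitrary choice.

```python
import concurrent.futures

def parse_positive(text):
    value = int(text)
    if value <= 0:
        raise ValueError('not positive: {}'.format(text))
    return value

def try_parse(text):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(parse_positive, text)
        try:
            # result() blocks until the call finishes; timeout is in seconds.
            return future.result(timeout=5)
        except ValueError as exc:
            # An exception raised in the worker thread is re-raised here.
            return 'failed: {}'.format(exc)

if __name__ == '__main__':
    print(try_parse('7'))
    print(try_parse('-1'))
```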

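as_completed(fs)

as_completed(fs) takes an iterable of futures fs and returns an iterator that yields each future as it completes.

The download example from section 2.1 can also be written in the following form: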

import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))

def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        to_do = []
        for site in sites:
            future = executor.submit(download_one, site)
            to_do.append(future)
            
        for future in concurrent.futures.as_completed(to_do):
            future.result()

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()

# Output
Read 129886 from https://en.wikipedia.org/wiki/Portal:Arts
Read 107634 from https://en.wikipedia.org/wiki/Portal:Biography
Read 224118 from https://en.wikipedia.org/wiki/Portal:Society
Read 158984 from https://en.wikipedia.org/wiki/Portal:Mathematics
Read 184343 from https://en.wikipedia.org/wiki/Portal:History
Read 157949 from https://en.wikipedia.org/wiki/Portal:Technology
Read 167923 from https://en.wikipedia.org/wiki/Portal:Geography
Read 94228 from https://en.wikipedia.org/wiki/Portal:Science
Read 391905 from https://en.wikipedia.org/wiki/Python_(programming_language)
Read 321352 from https://en.wikipedia.org/wiki/Computer_science
Read 180298 from https://en.wikipedia.org/wiki/Node.js
Read 321417 from https://en.wikipedia.org/wiki/Java_(programming_language)
Read 468421 from https://en.wikipedia.org/wiki/PHP
Read 56765 from https://en.wikipedia.org/wiki/The_C_Programming_Language
Read 324039 from https://en.wikipedia.org/wiki/Go_(programming_language)
Download 15 sites in 0.21698231499976828 seconds
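
The listings in this article leave out exception handling. One possible way to add it to the as_completed() form is to remember which input each future belongs to and to catch errors per future. The sketch below uses a hypothetical fetch_length() in place of a real download so that it runs without network access; the URLs are placeholders.

```python
import concurrent.futures

def fetch_length(url):
    # Hypothetical stand-in for a real download: it fails for URLs that
    # contain 'bad' and otherwise returns a number derived from the URL.
    if 'bad' in url:
        raise ConnectionError('cannot reach {}'.format(url))
    return len(url)

def fetch_all(urls):
    lengths, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Map each future back to the URL it was created for.
        future_to_url = {executor.submit(fetch_length, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                lengths[url] = future.result()
            except ConnectionError as exc:
                # One failed download does not stop the others.
                errors[url] = str(exc)
    return lengths, errors

if __name__ == '__main__':
    ok, failed = fetch_all(['https://a.example', 'https://bad.example'])
    print(ok)
    print(failed)
```

With requests, the except clause would catch requests.RequestException instead of ConnectionError.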