if args.local_rank not in [-1, 0]:
    torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

... (loads the model and the vocabulary)

if args.local_rank == 0:
    torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
Understanding:
When training a model on multiple GPUs, all four processes (0, 1, 2, 3) run in synchronized parallel, but operations such as reading data and preprocessing do not need to be done in parallel; usually only the main process (local_rank = 0) performs them.
At the first if statement, every other process (local_rank != 0) blocks, while the main process goes on to do the work in between. When the main process reaches the second if statement, it enters the barrier as well. Once every process has reached torch.distributed.barrier(), all of them are released, and the non-main processes can then load the model and vocabulary from the now-populated local cache.
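Below is a minimal, self-contained sketch of this two-barrier pattern, assuming the script is launched with torchrun (e.g. torchrun --nproc_per_node=4 barrier_demo.py) and using the gloo backend so it runs even without GPUs; load_or_download is a hypothetical placeholder for the expensive download/preprocessing step, not part of the original example.

import os
import time

import torch.distributed as dist


def load_or_download(rank: int) -> str:
    # Hypothetical stand-in for downloading the model & vocab.
    time.sleep(2)
    return f"model ready on rank {rank}"


def main():
    dist.init_process_group(backend="gloo")  # use "nccl" for real GPU training
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun

    if local_rank != 0:
        # Non-main processes wait here until rank 0 enters a barrier below.
        dist.barrier()

    # Rank 0 reaches this line immediately and does the one-time work;
    # the other ranks only get here after rank 0 releases them,
    # by which time the files would already be in the local cache.
    result = load_or_download(local_rank)

    if local_rank == 0:
        # Rank 0 enters the barrier; now all processes are released together.
        dist.barrier()

    print(f"rank {local_rank}: {result}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

When run, rank 0 prints after roughly two seconds, while the other ranks print about two seconds later, because they only start their (simulated) loading after rank 0 has released the barrier.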
References:
GitHub - jia-zhuang/pytorch-multi-gpu-training: Notes on single-machine multi-GPU training methods and principles in PyTorch. https://github.com/jia-zhuang/pytorch-multi-gpu-training
How does torch.distributed.barrier() work - Stack Overflow. https://stackoverflow.com/questions/59760328/how-does-torch-distributed-barrier-work
Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation