Official documentation
dgl distributed training user guide
interacting processes
DGL distributed training mainly involves three kinds of processes: servers, samplers, and trainers.
- Server processes run on each machine that stores a graph partition (this includes the graph structure and node/edge features). These servers work together to serve the graph data to trainers. Note that one machine may run multiple server processes simultaneously to parallelize computation as well as network communication. In other words, the servers handle not only the data but also the communication.
- Sampler processes interact with the servers and sample nodes and edges to generate mini-batches for training.
- Trainers contain multiple classes to interact with servers: DistGraph to access the partitioned graph data, DistEmbedding and DistTensor to access the node/edge features/embeddings, and DistDataLoader to interact with samplers to get mini-batches. Note these distributed APIs: DistGraph, DistTensor, DistEmbedding, and DistDataLoader.
API
initialize
This API builds connections with the DGL servers and creates the sampler processes.
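A minimal sketch of the typical trainer-side setup, assuming the job was started with DGL's launch tooling (which provides ip_config.txt and the process-group environment variables); 'ip_config.txt' and 'graph_name' are placeholders:

```python
import dgl
import torch as th

# Connect to the DGL servers listed in the IP config file and set up the
# sampler processes configured at launch time.
dgl.distributed.initialize('ip_config.txt')

# Trainers also need a PyTorch process group for gradient synchronization;
# the required environment variables are set by the launch script.
th.distributed.init_process_group(backend='gloo')

# After initialization, the partitioned graph can be accessed by name.
g = dgl.distributed.DistGraph('graph_name')
```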
DistGraph
Each machine is responsible for one and only one partition. It loads the partition data (the graph structure and the node data and edge data in the partition) and makes it accessible to all trainers in the cluster. Note that DistGraph has both a standalone mode and a distributed mode. The standalone mode is for development and testing, and it is worth trying DistGraph in standalone mode first. In distributed mode, DistGraph connects with the servers in the cluster of machines and accesses them through the network, which means the graph data is still transferred between machines by the servers.
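A minimal sketch of creating a DistGraph in both modes, assuming dgl.distributed.initialize() has already been called as in the sketch above; the graph name 'graph_name', the partition config path, and the 'feat' field are placeholders:

```python
import dgl

# Standalone mode (single machine, for development/testing): load the
# partition data directly from the local partition config file.
g = dgl.distributed.DistGraph('graph_name',
                              part_config='data/graph_name.json')

# Distributed mode (inside a job started by the launch script): connect
# to the servers over the network instead of loading data locally.
# g = dgl.distributed.DistGraph('graph_name')

# Node/edge data are exposed as distributed tensors and fetched over the
# network on access.
print(g.num_nodes())
feat = g.ndata['feat']
```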
DistTensor
Currently, DGL does not provide protection for concurrent writes from multiple trainers when a machine runs multiple servers. This may result in data corruption. One way to avoid concurrent writes to the same row of data is to run one server process on a machine.
How can we run only one server process per machine?
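On that question: in the examples shipped with DGL, the number of server processes per machine is set when launching the job (see the launch.py section below, where --num_servers 1 is used), which is one way to keep a single server per machine. Separately, a minimal sketch of creating and using a DistTensor, assuming a DistGraph g already exists; the name 'h', the feature width, and the node IDs are placeholders:

```python
import dgl
import torch as th

# A distributed tensor with one 16-dim row per node; it is sharded across
# the machines of the cluster and identified by its name.
h = dgl.distributed.DistTensor((g.num_nodes(), 16), th.float32, name='h')

# Reads and writes address rows by node ID and go over the network when
# the rows live on a remote partition.
nids = th.tensor([0, 1, 2])
h[nids] = th.zeros((3, 16))
rows = h[nids]
```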
DistEmbedding
Internally, distributed embeddings are built on top of distributed tensors and thus have very similar behaviors to distributed tensors. For example, when embeddings are created, they are sharded and stored across all machines in the cluster. An embedding can be uniquely identified by a name.
If the embeddings are shared by all the machines, how is the communication implemented? Is DistTensor also shared in the same way?
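A minimal sketch of creating and training a DistEmbedding, assuming a DistGraph g already exists; the embedding name, dimension, and learning rate are placeholders, and depending on the DGL version the sparse optimizer lives under dgl.distributed.optim or directly under dgl.distributed:

```python
import dgl
import torch as th

def init_emb(shape, dtype):
    # Each shard is initialized on the machine that stores it.
    return th.rand(shape, dtype=dtype)

# One learnable 16-dim embedding per node, sharded across the cluster and
# identified by a name, just like a DistTensor.
emb = dgl.distributed.DistEmbedding(g.num_nodes(), 16,
                                    name='node_emb', init_func=init_emb)

# Distributed embeddings must be updated with a DGL sparse optimizer.
optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.05)

# In the training loop: look up the rows of the current mini-batch, compute
# the loss on them, then call loss.backward() and optimizer.step().
feats = emb(th.tensor([0, 1, 2]))
```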
Distributed sampling
There are two API levels; whichever level is used, sampling is performed on the DistGraph.
low-level
You need to write your own code that defines how to sample, using dgl.distributed.sample_neighbors() (the distributed counterpart of dgl.sampling.sample_neighbors()). For the lower-level sampling API, DGL provides sample_neighbors() for distributed neighborhood sampling on DistGraph.
So DistGraph represents the whole (partitioned) graph, while distributed sampling draws mini-batches from it.
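A minimal sketch of low-level distributed sampling for one mini-batch, assuming a DistGraph g; the seed IDs and fanout are placeholders:

```python
import dgl
import torch as th

# Seed nodes for one mini-batch.
seeds = th.tensor([0, 1, 2, 3])

# Sample up to 10 in-neighbors per seed; the sampling is executed on the
# servers that own the corresponding partitions of g.
frontier = dgl.distributed.sample_neighbors(g, seeds, fanout=10)

# The result is an ordinary local graph of sampled edges, which can be
# converted into a block for mini-batch message passing.
block = dgl.to_block(frontier, seeds)
```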
high-level
The standard loaders NodeDataLoader and EdgeDataLoader.
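A minimal sketch of the high-level path, assuming a DistGraph g and a tensor of training seed node IDs train_nids; in newer DGL versions the distributed loader is called DistNodeDataLoader, so the class name here may need adjusting:

```python
import dgl

# Sample 10 neighbors for the first hop and 25 for the second.
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])

# Node-classification data loader over the distributed graph.
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler,
    batch_size=1024, shuffle=True, drop_last=False)

for input_nodes, output_nodes, blocks in dataloader:
    # blocks are ordinary message-flow graphs ready for a GNN forward pass.
    pass
```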
Heterogeneous graphs
Description of heterogeneous graphs: the DGL docs give an example adjacency matrix of a heterogeneous graph showing the homogeneous ID assignment. There, the graph has two types of nodes (T0 and T1) and four types of edges (R0, R1, R2, R3). There are a total of 400 nodes in the graph and each type has 200 nodes. Nodes of T0 have IDs in [0, 200), while nodes of T1 have IDs in [200, 400). In this example, if we use a tuple to identify the nodes, nodes of T0 are identified as (T0, type-wise ID), where the type-wise ID falls in [0, 200); nodes of T1 are identified as (T1, type-wise ID), where the type-wise ID also falls in [0, 200).
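To make the ID assignment concrete, here is a small sketch of the mapping between homogeneous IDs and (type, type-wise ID) pairs for exactly this example (two node types with 200 nodes each); in real code DGL provides this mapping (e.g. via the partition book), so the helpers below are only for illustration:

```python
# T0 occupies homogeneous IDs [0, 200), T1 occupies [200, 400).
type_offsets = {'T0': 0, 'T1': 200}

def to_homogeneous_id(ntype, type_wise_id):
    # e.g. ('T1', 5) -> 205
    return type_offsets[ntype] + type_wise_id

def to_type_wise_id(homogeneous_id):
    # e.g. 205 -> ('T1', 5)
    if homogeneous_id < 200:
        return 'T0', homogeneous_id
    return 'T1', homogeneous_id - 200

assert to_homogeneous_id('T1', 5) == 205
assert to_type_wise_id(205) == ('T1', 5)
```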
DGL distributed script files
copy_files.py
Copies the partitioned graph and the training scripts to the specified machines (via ip_config); this requires passwordless SSH access between all the machines.
launch.py
See my blog post for details. The default port is 30050; if the port is already in use, kill the processes occupying it.
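A typical invocation, loosely following the DGL distributed examples; the workspace path, partition config, and the training command are placeholders. Note that --num_servers 1 keeps a single server process per machine, which is relevant to the DistTensor concurrency note above:

```
python3 tools/launch.py \
  --workspace ~/workspace/ \
  --num_trainers 1 \
  --num_samplers 0 \
  --num_servers 1 \
  --part_config data/graph_name.json \
  --ip_config ip_config.txt \
  "python3 train_dist.py --graph_name graph_name --ip_config ip_config.txt"
```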