Official documentation
dgl distributed training user guide
interacting processes
DGL distributed training mainly involves three kinds of processes: servers, samplers, and trainers.
- Server processes run on each machine that stores a graph partition (this includes the graph structure and node/edge features). These servers work together to serve the graph data to trainers. Note that one machine may run multiple server processes simultaneously to parallelize computation as well as network communication. In other words, the servers handle not only the data but also the communication.
- Sampler processes interact with the servers and sample nodes and edges to generate mini-batches for training.
- Trainers contain multiple classes to interact with servers: DistGraph to access the partitioned graph data, DistEmbedding and DistTensor to access the node/edge features/embeddings, and DistDataLoader to interact with samplers to get mini-batches. Note these distributed APIs: DistGraph, DistTensor, DistEmbedding, and DistDataLoader.
API
initialize
This API builds connections with the DGL servers and creates the sampler processes.
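A minimal sketch of the typical trainer-side setup, assuming the job was started with DGL's launch tooling (which provides ip_config.txt and the process-group environment variables); 'ip_config.txt' and 'graph_name' are placeholders:

```python
import dgl
import torch as th

# Connect to the DGL servers listed in the IP config file and set up the
# sampler processes configured at launch time.
dgl.distributed.initialize('ip_config.txt')

# Trainers also need a PyTorch process group for gradient synchronization;
# the required environment variables are set by the launch script.
th.distributed.init_process_group(backend='gloo')

# After initialization, the partitioned graph can be accessed by name.
g = dgl.distributed.DistGraph('graph_name')
```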
DistGraph
Each machine is responsible for one and only one partition. It loads the partition data (the graph structure and the node data and edge data in the partition) and makes it accessible to all trainers in the cluster. Note that DistGraph has both a standalone mode and a distributed mode. The standalone mode is for development and testing, and it is worth trying DistGraph in standalone mode first. In distributed mode, DistGraph connects with the servers in the cluster of machines and accesses them through the network, which means the graph data is still transferred between machines by the servers.
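A minimal sketch of creating a DistGraph in both modes, assuming dgl.distributed.initialize() has already been called as in the sketch above; the graph name 'graph_name', the partition config path, and the 'feat' field are placeholders:

```python
import dgl

# Standalone mode (single machine, for development/testing): load the
# partition data directly from the local partition config file.
g = dgl.distributed.DistGraph('graph_name',
                              part_config='data/graph_name.json')

# Distributed mode (inside a job started by the launch script): connect
# to the servers over the network instead of loading data locally.
# g = dgl.distributed.DistGraph('graph_name')

# Node/edge data are exposed as distributed tensors and fetched over the
# network on access.
print(g.num_nodes())
feat = g.ndata['feat']
```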
DistTensor
Currently, DGL does not provide protection for concurrent writes from multiple trainers when a machine runs multiple servers. This may result in data corruption. One way to avoid concurrent writes to the same row of data is to run one server process on a machine.
How can we run only one server process per machine?
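On that question: in the examples shipped with DGL, the number of server processes per machine is set when launching the job (see the launch.py section below, where --num_servers 1 is used), which is one way to keep a single server per machine. Separately, a minimal sketch of creating and using a DistTensor, assuming a DistGraph g already exists; the name 'h', the feature width, and the node IDs are placeholders:

```python
import dgl
import torch as th

# A distributed tensor with one 16-dim row per node; it is sharded across
# the machines of the cluster and identified by its name.
h = dgl.distributed.DistTensor((g.num_nodes(), 16), th.float32, name='h')

# Reads and writes address rows by node ID and go over the network when
# the rows live on a remote partition.
nids = th.tensor([0, 1, 2])
h[nids] = th.zeros((3, 16))
rows = h[nids]
```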
DistEmbedding
Internally, distributed embeddings are built on top of distributed tensors and thus have very similar behaviors to distributed tensors. For example, when embeddings are created, they are sharded and stored across all machines in the cluster. An embedding can be uniquely identified by a name.
If the embeddings are shared by all the machines, how is the communication implemented? Is DistTensor also shared in the same way?
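A minimal sketch of creating and training a DistEmbedding, assuming a DistGraph g already exists; the embedding name, dimension, and learning rate are placeholders, and depending on the DGL version the sparse optimizer lives under dgl.distributed.optim or directly under dgl.distributed:

```python
import dgl
import torch as th

def init_emb(shape, dtype):
    # Each shard is initialized on the machine that stores it.
    return th.rand(shape, dtype=dtype)

# One learnable 16-dim embedding per node, sharded across the cluster and
# identified by a name, just like a DistTensor.
emb = dgl.distributed.DistEmbedding(g.num_nodes(), 16,
                                    name='node_emb', init_func=init_emb)

# Distributed embeddings must be updated with a DGL sparse optimizer.
optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.05)

# In the training loop: look up the rows of the current mini-batch, compute
# the loss on them, then call loss.backward() and optimizer.step().
feats = emb(th.tensor([0, 1, 2]))
```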
Distributed sampling
There are two API levels; whichever level is used, sampling is performed on the DistGraph.
low-level
You need to write your own code that defines how to sample, using dgl.distributed.sample_neighbors() (the distributed counterpart of dgl.sampling.sample_neighbors()). For the lower-level sampling API, DGL provides sample_neighbors() for distributed neighborhood sampling on DistGraph.
So DistGraph represents the whole (partitioned) graph, while distributed sampling draws mini-batches from it.
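A minimal sketch of low-level distributed sampling for one mini-batch, assuming a DistGraph g; the seed IDs and fanout are placeholders:

```python
import dgl
import torch as th

# Seed nodes for one mini-batch.
seeds = th.tensor([0, 1, 2, 3])

# Sample up to 10 in-neighbors per seed; the sampling is executed on the
# servers that own the corresponding partitions of g.
frontier = dgl.distributed.sample_neighbors(g, seeds, fanout=10)

# The result is an ordinary local graph of sampled edges, which can be
# converted into a block for mini-batch message passing.
block = dgl.to_block(frontier, seeds)
```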
high-level
The standard loaders NodeDataLoader and EdgeDataLoader.
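A minimal sketch of the high-level path, assuming a DistGraph g and a tensor of training seed node IDs train_nids; in newer DGL versions the distributed loader is called DistNodeDataLoader, so the class name here may need adjusting:

```python
import dgl

# Sample 10 neighbors for the first hop and 25 for the second.
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])

# Node-classification data loader over the distributed graph.
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler,
    batch_size=1024, shuffle=True, drop_last=False)

for input_nodes, output_nodes, blocks in dataloader:
    # blocks are ordinary message-flow graphs ready for a GNN forward pass.
    pass
```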
Heterogeneous graphs
Description of heterogeneous graphs: the DGL docs give an example adjacency matrix of a heterogeneous graph showing the homogeneous ID assignment. There, the graph has two types of nodes (T0 and T1) and four types of edges (R0, R1, R2, R3). There are a total of 400 nodes in the graph and each type has 200 nodes. Nodes of T0 have IDs in [0, 200), while nodes of T1 have IDs in [200, 400). In this example, if we use a tuple to identify the nodes, nodes of T0 are identified as (T0, type-wise ID), where the type-wise ID falls in [0, 200); nodes of T1 are identified as (T1, type-wise ID), where the type-wise ID also falls in [0, 200).
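To make the ID assignment concrete, here is a small sketch of the mapping between homogeneous IDs and (type, type-wise ID) pairs for exactly this example (two node types with 200 nodes each); in real code DGL provides this mapping (e.g. via the partition book), so the helpers below are only for illustration:

```python
# T0 occupies homogeneous IDs [0, 200), T1 occupies [200, 400).
type_offsets = {'T0': 0, 'T1': 200}

def to_homogeneous_id(ntype, type_wise_id):
    # e.g. ('T1', 5) -> 205
    return type_offsets[ntype] + type_wise_id

def to_type_wise_id(homogeneous_id):
    # e.g. 205 -> ('T1', 5)
    if homogeneous_id < 200:
        return 'T0', homogeneous_id
    return 'T1', homogeneous_id - 200

assert to_homogeneous_id('T1', 5) == 205
assert to_type_wise_id(205) == ('T1', 5)
```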
DGL distributed script files
copy_files.py
Copies the partitioned graph and the training scripts to the specified machines (via ip_config); this requires passwordless SSH access between all the machines.
launch.py
See my blog post for details. The default port is 30050; if the port is already in use, kill the processes occupying it.
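A typical invocation, loosely following the DGL distributed examples; the workspace path, partition config, and the training command are placeholders. Note that --num_servers 1 keeps a single server process per machine, which is relevant to the DistTensor concurrency note above:

```
python3 tools/launch.py \
  --workspace ~/workspace/ \
  --num_trainers 1 \
  --num_samplers 0 \
  --num_servers 1 \
  --part_config data/graph_name.json \
  --ip_config ip_config.txt \
  "python3 train_dist.py --graph_name graph_name --ip_config ip_config.txt"
```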