SAR’s communication routines

SAR uses only two types of collective communication calls: all_to_all and all_reduce. This choice improves scalability by avoiding all point-to-point communication. SAR supports four backends: ccl, nccl, mpi, and gloo. Note that the gloo backend may be less efficient than the others because it does not support the all_to_all routine; SAR must fall back to its own implementation, which issues multiple asynchronous sends (torch.distributed.isend) between workers. NVIDIA's nccl is already included in the PyTorch distribution and is the natural choice when training on GPUs.
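For illustration, the following is a minimal, hypothetical sketch (not SAR's actual code) of how an all_to_all exchange can be built from asynchronous point-to-point calls; the function name and the per-peer tensor lists are assumptions made for the example:

import torch
import torch.distributed as dist

def all_to_all_via_p2p(recv_tensors, send_tensors):
    # Illustrative sketch: exchange one tensor with every other worker using
    # point-to-point calls, similar in spirit to SAR's gloo fallback.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    send_requests = []
    for peer in range(world_size):
        if peer == rank:
            recv_tensors[peer].copy_(send_tensors[peer])  # local slice: plain copy
        else:
            send_requests.append(dist.isend(send_tensors[peer], dst=peer))
    for peer in range(world_size):
        if peer != rank:
            dist.recv(recv_tensors[peer], src=peer)  # blocking receive from each peer
    for req in send_requests:
        req.wait()  # wait for all asynchronous sends to complete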

The ccl backend uses Intel's oneCCL library. You can install the PyTorch bindings for oneCCL from Intel's torch-ccl project. ccl is the preferred backend when training on CPUs.
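When using the ccl backend, the oneCCL bindings module typically has to be imported before the process group is created so that the ccl backend is registered with torch.distributed. A minimal sketch, assuming the module name used by recent releases of the bindings (adjust it if your version differs):

import torch
import oneccl_bindings_for_pytorch  # importing registers the 'ccl' backend with torch.distributed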

You can train on CPUs and still use the nccl backend, or you can train on GPUs and use the ccl backend. However, you will incur extra overhead to move tensors back and forth between the CPU and GPU in order to provide the right tensors to the communication backend.

In an environment with a networked file system, initializing torch.distributed is quite easy:

import torch
import sar

# nccl communicates GPU tensors; the other backends communicate CPU tensors.
if backend_name == 'nccl':
    comm_device = torch.device('cuda')
else:
    comm_device = torch.device('cpu')

# Read the master's IP address from the shared file system and initialize
# SAR's communication layer (and torch.distributed, if not already initialized).
master_ip_address = sar.nfs_ip_init(rank, path_to_ip_file)
sar.initialize_comms(rank, world_size, master_ip_address, backend_name, comm_device)

sar.initialize_comms() tries to initialize the torch.distributed process group, but only if it has not already been initialized. Users can also initialize the process group themselves before calling sar.initialize_comms(). sar.nfs_ip_init() communicates the master's IP address to the workers through the file system. In the absence of a networked file system, you should develop your own mechanism to communicate the master's IP address.
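For example, a minimal sketch of initializing the process group yourself with torch.distributed's standard environment-variable rendezvous before handing over to SAR; the MASTER_ADDR/MASTER_PORT values are placeholders, and rank, world_size, backend_name, and comm_device are the same variables as in the snippet above:

import os
import torch.distributed as dist
import sar

os.environ['MASTER_ADDR'] = '127.0.0.1'  # placeholder: address of the rank-0 worker
os.environ['MASTER_PORT'] = '29500'      # placeholder: any free port on the master

dist.init_process_group(backend=backend_name, rank=rank, world_size=world_size)

# SAR detects the existing process group and does not initialize it again.
sar.initialize_comms(rank, world_size, os.environ['MASTER_ADDR'],
                     backend_name, comm_device)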

You can specify the name of the socket that will be used for communication through the SAR_SOCKET_NAME environment variable (if it is not specified, the first available socket will be selected).
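For example, the variable can be set before initializing communication; eth0 below is just a placeholder for whichever network interface/socket you want SAR to use:

import os

os.environ['SAR_SOCKET_NAME'] = 'eth0'  # placeholder interface/socket name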

Relevant methods

initialize_comms(_rank, _world_size, ...[, ...])
    Initialize PyTorch's communication library

rank()
    Get the rank of the current host

world_size()
    Get the world size of the current distributed setup

sync_params(model)
    Synchronize the model parameters across all workers

gather_grads(model)
    Sum the parameter gradients from all workers

nfs_ip_init(_rank, ip_file)
    Communicate the IP address of the master machine/worker (rank 0) to the other machines/workers through the file system
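As an illustration, sync_params() and gather_grads() typically appear in a training loop as sketched below; model, optimizer, loss_fn, features, labels, and num_epochs are assumed to be defined, and this is a sketch rather than SAR's reference training script:

import sar

sar.sync_params(model)  # start all workers from identical parameters

for epoch in range(num_epochs):
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, labels)
    loss.backward()
    sar.gather_grads(model)  # sum parameter gradients across all workers
    optimizer.step()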