17. Distributed Training: Master/slave

Distributed operation uses a ZeroMQ-based star topology that handles up to 100 nodes, depending on the task. Optionally, the transferred data can be compressed. The current state is reported to the Web Status Server.

The system is fault tolerant: the death of a node does not stop the master. New nodes can be added dynamically, at any time, with any backends.

The snapshotting feature allows recovering from any disaster and restarting the training in a different mode and with a different backend.


The master process does not do any calculations; it just serves the other actors. It stores the current workflow state, including all units’ data. Slave processes maintain two channels of communication with the master: plain TCP (commands, discovery, etc.) and ZeroMQ (data). Initially, a new slave connects to a TCP socket on the master, registers itself and starts sending job requests. The master receives job requests, generates jobs (serialized data from each unit in the workflow) and sends them to the corresponding slaves. It is worth noting that the workflows which exist on the master and on the slaves are identical; they just operate in different modes.
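The job round trip above can be sketched in a few lines. This is a toy model, not the real implementation: the `Unit` class and the `generate_job`/`apply_job` helpers are hypothetical, and the method names follow the `generate_data_for`/`apply_data_from` interface mentioned in the next section.

```python
import pickle


class Unit:
    """Minimal stand-in for a workflow unit; real units hold e.g. weight arrays."""

    def __init__(self, data):
        self.data = data

    def generate_data_for_slave(self):
        # Master side: select the state a slave needs to run one job.
        return self.data

    def apply_data_from_master(self, data):
        # Slave side: load the received state into the identical local unit.
        self.data = data


def generate_job(workflow):
    """Master side: serialize the data of every unit in the workflow."""
    return pickle.dumps([u.generate_data_for_slave() for u in workflow])


def apply_job(workflow, blob):
    """Slave side: apply the received job to the identical workflow, unit by unit."""
    for unit, data in zip(workflow, pickle.loads(blob)):
        unit.apply_data_from_master(data)
```

Because the master and slave workflows are identical, pairing units positionally (as `zip` does here) is enough to route each unit's data to its twin.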

The master runs the graphics server (see Graphics and plotting), so that any number of clients can draw plots of what’s going on. Besides, it sends periodic status information to the web status server via HTTP and listens for commands on the same raw TCP socket which is used for talking to slaves. A special JSON-based communication protocol is used.
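A JSON control protocol of this kind can be sketched as newline-delimited messages; the framing and the field names (`command` and the keyword payload) are assumptions for illustration, not the actual Veles wire format.

```python
import json


def encode_message(command, **payload):
    """Frame a control message as one JSON line suitable for a raw TCP socket."""
    message = dict(payload, command=command)
    return (json.dumps(message) + "\n").encode("utf-8")


def decode_message(line):
    """Parse one received JSON line back into a message dict."""
    return json.loads(line.decode("utf-8"))
```

Newline framing lets the receiver split the TCP byte stream into messages with a simple buffered readline, which is why it is a common choice for JSON-over-TCP protocols.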


17.1. Distributed units

TODO: write about apply_data_from/generate_data_for.

17.2. Distributed training. Command line examples

Run workflow in distributed environment:

# on master node
python3 -m veles -l <workflow> <config>
# on slave node
python3 -m veles -m <master host name or IP>:5000 <workflow> <config>


5000 is the port number; use any port you like.

Run workflow in distributed environment (known nodes):

# on master node
python3 -m veles -l -n <slave 1>/cuda:0,<slave 2>/ocl:0:1 <workflow> <config>


It is OK to use a different backend on each slave node. “ocl:0:1” sets the OpenCL device to use; “cuda:0” sets the CUDA device to use. The syntax can be much more complicated; for example, cuda:0-3x2 launches 8 instances overall: two instances on each of the devices 0, 1, 2 and 3.
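The device-spec expansion can be illustrated with a small parser. This is a hypothetical re-implementation covering only the forms shown above (`cuda:0`, `ocl:0:1`, `cuda:0-3x2`); the real Veles parser may differ in details.

```python
def expand_backend_spec(spec):
    """Expand a backend spec like 'cuda:0-3x2' into one entry per instance."""
    backend, _, devices = spec.partition(":")
    # An optional 'xN' suffix repeats each device N times.
    if "x" in devices:
        devices, _, mult = devices.rpartition("x")
        count = int(mult)
    else:
        count = 1
    # A 'lo-hi' range expands to every device id in between, inclusive.
    first, _, last = devices.partition("-")
    if last:
        ids = [str(i) for i in range(int(first), int(last) + 1)]
    else:
        ids = [devices]  # keep e.g. '0:1' (OpenCL platform:device) as one id
    return ["%s:%s" % (backend, d) for d in ids for _ in range(count)]
```

For example, `expand_backend_spec("cuda:0-3x2")` yields eight entries, two per device, matching the "8 instances overall" behaviour described above.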
