When integrating Horovod with TensorFlow, we need to change the model script as described in horovod/docs/tensorflow.rst at master · horovod/horovod · GitHub:
- Pin each GPU to a single process.
- Scale the learning rate by the number of workers.
- Wrap the optimizer in hvd.DistributedOptimizer.
- Broadcast the initial variable states from rank 0 to all other processes.
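For reference, the steps above can be sketched in TF2/Keras roughly as follows (a hedged sketch based on the Horovod doc, not the MLPerf code; `build_model` and `dataset` are illustrative placeholders):

```python
def scale_lr(base_lr, num_workers):
    # Linear learning-rate scaling heuristic from the Horovod doc.
    return base_lr * num_workers

def train(build_model, dataset, base_lr=1e-3):
    # `build_model` and `dataset` are illustrative placeholders.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each GPU to a single process.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Scale the learning rate by the number of workers.
    opt = tf.keras.optimizers.SGD(learning_rate=scale_lr(base_lr, hvd.size()))

    # Wrap the optimizer in hvd.DistributedOptimizer.
    opt = hvd.DistributedOptimizer(opt)

    model = build_model()
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

    # Broadcast initial variable states from rank 0 to all other processes.
    model.fit(dataset,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])
```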
But after checking the code of MLPerf BERT in TensorFlow (training_results_v2.1/Intel-HabanaLabs/benchmarks/bert/implementations/TensorFlow at main · mlcommons/training_results_v2.1 · GitHub), I found that steps 3 and 5 are missing (step 4 is not needed when gradient accumulation is used).
May I know the reason? Or does habana-horovod do something under the hood?
The Horovod doc you refer to is based on the original Horovod doc from here.
The scaling of the learning rate might be a heuristic, and some trainings might converge with the original learning rate. In this particular case we use LAMB, which is designed for large-batch training. The original Horovod advice of scaling the learning rate might be more applicable to simpler optimizers like SGD.
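A back-of-envelope sketch of why LAMB can be less sensitive to learning-rate rescaling than SGD (my illustration, not anything from the MLPerf code): LAMB's layer-wise trust ratio makes the applied step norm depend on the weight norm rather than on the raw update norm.

```python
def sgd_step_norm(lr, update_norm):
    # Plain SGD: the step size scales directly with the update magnitude,
    # which is why the linear LR-scaling heuristic matters there.
    return lr * update_norm

def lamb_step_norm(lr, weight_norm, update_norm):
    # LAMB rescales each layer's update by trust = ||w|| / ||u||, so the
    # applied step norm is lr * ||w|| regardless of the update magnitude.
    trust = weight_norm / update_norm
    return lr * trust * update_norm
```

So if data parallelism changes the gradient magnitude, SGD's step changes with it while LAMB's layer-wise step does not, which is consistent with the point that LAMB may converge without the rescaling.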
Regarding the broadcasting of variables: I think we initialize from a checkpoint, hence we might be skipping the broadcast.
Thanks for the reply.
For the learning rate scaling, we can see from NVIDIA's BERT that it is still applied for LAMB: DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py at master · NVIDIA/DeepLearningExamples · GitHub
For the broadcast part, we can see the comments from: horovod/examples/tensorflow2/tensorflow2_synthetic_benchmark.py at master · horovod/horovod · GitHub
> Horovod: broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
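The quoted comment suggests a pattern like the following sketch (my illustration, assuming the hvd.callbacks.BroadcastGlobalVariablesCallback API from horovod.tensorflow.keras): only rank 0 reads and writes the checkpoint, and the broadcast propagates the restored weights to every worker, so the broadcast is still needed even when restoring from a checkpoint.

```python
def should_write_checkpoint(rank):
    # Typical Horovod pattern: only rank 0 touches the checkpoint files;
    # the broadcast below keeps the other ranks consistent with it.
    return rank == 0

def make_callbacks():
    import horovod.tensorflow.keras as hvd
    # Broadcast is needed even when weights are restored from a checkpoint,
    # per the Horovod example comment quoted above.
    return [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```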