When integrating Horovod with TensorFlow, we need to change the model script as described in horovod/docs/tensorflow.rst at master · horovod/horovod · GitHub:
- Pin each GPU to a single process.
- Scale the learning rate by the number of workers.
- Wrap the optimizer in hvd.DistributedOptimizer.
- Broadcast the initial variable states from rank 0 to all other processes.
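For reference, the steps above can be sketched in TF2/Keras roughly as follows (a hedged sketch based on the Horovod doc, not the MLPerf code; `build_model` and `dataset` are illustrative placeholders):

```python
def scale_lr(base_lr, num_workers):
    # Linear learning-rate scaling heuristic from the Horovod doc.
    return base_lr * num_workers

def train(build_model, dataset, base_lr=1e-3):
    # `build_model` and `dataset` are illustrative placeholders.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each GPU to a single process.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Scale the learning rate by the number of workers.
    opt = tf.keras.optimizers.SGD(learning_rate=scale_lr(base_lr, hvd.size()))

    # Wrap the optimizer in hvd.DistributedOptimizer.
    opt = hvd.DistributedOptimizer(opt)

    model = build_model()
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

    # Broadcast initial variable states from rank 0 to all other processes.
    model.fit(dataset,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])
```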
But after checking the code of MLPerf BERT in TensorFlow (training_results_v2.1/Intel-HabanaLabs/benchmarks/bert/implementations/TensorFlow at main · mlcommons/training_results_v2.1 · GitHub), I found that steps 3 and 5 are missing (step 4 is not needed when gradient accumulation is used).
May I know the reason? Or does habana-horovod do something under the hood?
The Horovod doc you refer to is based on the original Horovod doc from here.
The scaling of the learning rate might be a heuristic, and some trainings might converge with the original learning rate. In this particular case we use LAMB, which is designed for large-batch training. The original Horovod advice of scaling the learning rate might be more applicable to simpler optimizers like SGD.
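A back-of-envelope sketch of why LAMB can be less sensitive to learning-rate rescaling than SGD (my illustration, not anything from the MLPerf code): LAMB's layer-wise trust ratio makes the applied step norm depend on the weight norm rather than on the raw update norm.

```python
def sgd_step_norm(lr, update_norm):
    # Plain SGD: the step size scales directly with the update magnitude,
    # which is why the linear LR-scaling heuristic matters there.
    return lr * update_norm

def lamb_step_norm(lr, weight_norm, update_norm):
    # LAMB rescales each layer's update by trust = ||w|| / ||u||, so the
    # applied step norm is lr * ||w|| regardless of the update magnitude.
    trust = weight_norm / update_norm
    return lr * trust * update_norm
```

So if data parallelism changes the gradient magnitude, SGD's step changes with it while LAMB's layer-wise step does not, which is consistent with the point that LAMB may converge without the rescaling.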
Regarding the broadcasting of variables: I think we initialize from a checkpoint, hence we might be skipping the broadcast.
Thanks for the reply.
For the learning rate scaling, we can see from NVIDIA's BERT that it is still applied for LAMB: DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py at master · NVIDIA/DeepLearningExamples · GitHub
For the broadcast part, we can see the comments from: horovod/examples/tensorflow2/tensorflow2_synthetic_benchmark.py at master · horovod/horovod · GitHub
> Horovod: broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
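The quoted comment suggests a pattern like the following sketch (my illustration, assuming the hvd.callbacks.BroadcastGlobalVariablesCallback API from horovod.tensorflow.keras): only rank 0 reads and writes the checkpoint, and the broadcast propagates the restored weights to every worker, so the broadcast is still needed even when restoring from a checkpoint.

```python
def should_write_checkpoint(rank):
    # Typical Horovod pattern: only rank 0 touches the checkpoint files;
    # the broadcast below keeps the other ranks consistent with it.
    return rank == 0

def make_callbacks():
    import horovod.tensorflow.keras as hvd
    # Broadcast is needed even when weights are restored from a checkpoint,
    # per the Horovod example comment quoted above.
    return [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```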