Habana Gaudi HPUs training time improvement

I would like to improve training time. Training is taking longer than expected compared to other vendors' hardware.

Slow training.

Yes, it can be reproduced on the following AMI:
Deep Learning AMI Habana TensorFlow 2.9.1 SynapseAI 1.5.0 (Ubuntu 20.04) 20220714

• git clone https://github.com/HabanaAI/Snapshot_For_Debug (snapshot scripts for gathering information about the model and the Habana training session for Habana analysis and debug)
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents

Hi @Purvang:

Please note that distributed Horovod code needs some changes, such as the ones shown in example_hvd.py in the HabanaAI/Model-References repository on GitHub (master branch).

I see that your file has most of these changes, but not all. For example, the optimizer must be wrapped (optimizer = hvd.DistributedOptimizer(optimizer)), as shown in that example.

Without this change, mpirun just launches 8 separate uncoordinated training processes that do not sync together at the optimizer step.
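To illustrate why the missing wrapper matters, here is a minimal pure-Python sketch (no Horovod or TensorFlow involved). It assumes a toy one-parameter model, and the gradient averaging is only a stand-in for the allreduce that hvd.DistributedOptimizer performs; the names make_shard, local_gradient and train are hypothetical helpers, not part of any library.

```python
import random


def make_shard(seed, n=32, true_w=3.0):
    """Build one worker's private data shard for the toy model y = true_w * x."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, 0.1)) for x in xs]


def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train(num_workers=8, steps=50, lr=0.1, sync=True):
    """Run plain SGD on every worker, optionally averaging gradients each step."""
    shards = [make_shard(seed) for seed in range(num_workers)]
    weights = [0.0] * num_workers  # identical start, like a broadcast from rank 0
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        if sync:
            # Stand-in for the allreduce: every worker applies the mean gradient.
            mean_grad = sum(grads) / num_workers
            grads = [mean_grad] * num_workers
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


synced = train(sync=True)
unsynced = train(sync=False)
print("spread across workers, synced:  ", max(synced) - min(synced))
print("spread across workers, unsynced:", max(unsynced) - min(unsynced))
```

With averaging, all eight workers hold the same weight after every step, so they behave as one training job. Without it, each worker drifts toward the solution of its own shard, which is the "8 separate uncoordinated training processes" situation described above.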

Once you have made this correction, could you please provide us with single-card and 8x multi-card logs for at least 10 epochs of training?
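For comparing the two sets of logs, a rough sketch of the arithmetic is below. The epoch times are made-up placeholders, not measurements, and skipping the first epoch as warm-up is an assumption on my part; it also assumes both runs process the same full dataset per epoch. The function names are hypothetical.

```python
from statistics import mean


def steady_state_epoch_time(epoch_seconds, warmup_epochs=1):
    """Average epoch time after dropping the first warm-up epoch(s)."""
    return mean(epoch_seconds[warmup_epochs:])


def scaling_report(single_card_epochs, multi_card_epochs, num_cards=8):
    """Return (speedup, efficiency) of the multi-card run vs. the single-card run."""
    t_single = steady_state_epoch_time(single_card_epochs)
    t_multi = steady_state_epoch_time(multi_card_epochs)
    speedup = t_single / t_multi
    efficiency = speedup / num_cards  # 1.0 would be perfect linear scaling
    return speedup, efficiency


# Placeholder per-epoch times in seconds (hypothetical, for illustration only).
single_card = [150.0, 120.0, 120.0, 120.0]
eight_cards = [60.0, 20.0, 20.0, 20.0]

speedup, efficiency = scaling_report(single_card, eight_cards)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

A speedup close to 1x on 8 cards would suggest the processes are still not cooperating on one job.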

Note: I couldn’t get your code to run as-is (I got errors that learning_rate, weights, utils, initial_epoch, etc. were not defined), so I had to make some minor modifications.