Habana Gaudi HPUs: training time improvement

When posting a technical issue, please describe it in as much detail as possible. You can include things like:
• What was the expected behavior:
Faster training. Training currently takes longer than expected compared to other vendors' hardware.

• What is the observed result:
Training is slow.

• Is the issue consistently reproducible? How long does it take to reproduce:
Yes, it can be reproduced using the files at:
https://drive.google.com/drive/folders/1Vq5f9lk_jiRlbkcwpqN_ecS3GC4gKO3W?usp=sharing
• If you are using AWS DL1 instance, please report the AMI name that you are using
Deep Learning AMI Habana TensorFlow 2.9.1 SynapseAI 1.5.0 (Ubuntu 20.04) 20220714

What is the minimal script/command to reproduce the issue:
Please include any error message or stack trace observed:
Please run the Snapshot for Debug tool and post to the issue
• git clone https://github.com/HabanaAI/Snapshot_For_Debug (snapshot scripts for gathering information about the model and Habana training session for Habana analysis and debug)
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents

Thanks for posting. We will take a look at your issue and get back to you.

Hi @Purvang,

Please note that distributed Horovod code needs some changes, such as the ones shown in example_hvd.py in the HabanaAI/Model-References repository on GitHub (master branch).

I see that your file has most of these changes, but not all. For example, the optimizer must be wrapped (optimizer = hvd.DistributedOptimizer(optimizer)), as shown in that example.

Without this change, mpirun just launches 8 separate uncoordinated training processes that do not sync together at the optimizer step.
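To make the effect concrete, here is a toy sketch in plain Python. It does not use Horovod or TensorFlow; the function names, the one-parameter "model" and the random data are all made up for illustration. It only mimics what happens when each worker applies its own gradient versus when gradients are averaged across workers before the update, which is the role hvd.DistributedOptimizer plays.

```python
import random


def local_gradient(weight, sample):
    # Gradient of the toy loss 0.5 * (weight - sample) ** 2 w.r.t. weight.
    return weight - sample


def train(num_workers, steps, lr, synchronize, seed=0):
    """Simulate data-parallel SGD on a single scalar weight per worker.

    Hypothetical stand-in for 8 processes launched by mpirun; every
    worker starts from the same weight but sees different samples.
    """
    rng = random.Random(seed)
    weights = [0.0] * num_workers
    for _ in range(steps):
        # Each worker computes a gradient on its own sample.
        grads = [local_gradient(w, rng.gauss(1.0, 0.5)) for w in weights]
        if synchronize:
            # Averaging step: every worker receives the same mean gradient.
            mean_grad = sum(grads) / num_workers
            grads = [mean_grad] * num_workers
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


uncoordinated = train(num_workers=8, steps=100, lr=0.1, synchronize=False)
coordinated = train(num_workers=8, steps=100, lr=0.1, synchronize=True)

# Without averaging, the workers end up with different models.
print("distinct weights without sync:", len(set(uncoordinated)))
# With averaging, all workers hold exactly the same weight.
print("distinct weights with sync:   ", len(set(coordinated)))
```

In the unsynchronized case each process is effectively training its own separate model, so running 8 of them does not make any single model train faster.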

Once you have made this correction, could you please provide us with single-card and 8x multi-card logs for at least 10 epochs of training?
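If it helps when comparing the two sets of logs, the per-epoch times can be reduced to a speedup and a scaling efficiency. The sketch below is only a suggestion: the function name is hypothetical, the numbers in the usage lines are placeholders rather than measurements, and it assumes an epoch covers the same total number of samples in both runs. Skipping the first epoch is also an assumption, made because one-time startup costs often inflate it.

```python
from statistics import mean


def scaling_report(single_card_epochs, multi_card_epochs, num_cards, warmup=1):
    """Return (speedup, efficiency) from per-epoch times in seconds.

    The first `warmup` epochs of each run are ignored. Assumes both
    runs process the same total number of samples per epoch.
    """
    single = single_card_epochs[warmup:]
    multi = multi_card_epochs[warmup:]
    if not single or not multi:
        raise ValueError("need at least one epoch after warm-up in each run")
    if num_cards <= 0:
        raise ValueError("num_cards must be positive")

    speedup = mean(single) / mean(multi)
    # 1.0 would mean perfect linear scaling across the cards.
    efficiency = speedup / num_cards
    return speedup, efficiency


# Placeholder epoch times (seconds), not real measurements.
speedup, efficiency = scaling_report([100, 80, 80], [50, 20, 20], num_cards=8)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```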

Note: I couldn’t get your code to run as-is (I got errors that learning_rate, weights, utils, initial_epoch, etc. are not defined), so I had to make some minor modifications.

Thanks