Habana Gaudi HPUs training time improvement

Purvang · September 21, 2022, 4:53pm

When posting a technical issue, please describe it in as much detail as possible. You can include things like:
• What was the expected behavior:
Faster training. Training currently takes longer than expected compared to other vendors.

• What is the observed result:
Slow training.

• Is the issue consistently reproducible? How long does it take to reproduce:
Yes, it can be reproduced with the files at
https://drive.google.com/drive/folders/1Vq5f9lk_jiRlbkcwpqN_ecS3GC4gKO3W?usp=sharing
• If you are using AWS DL1 instance, please report the AMI name that you are using
Deep Learning AMI Habana TensorFlow 2.9.1 SynapseAI 1.5.0 (Ubuntu 20.04) 20220714

What is the minimal script/command to reproduce the issue:
Please include any error message or stack trace observed:
Please run the Snapshot for Debug tool and post the result to the issue
• git clone the HabanaAI/Snapshot_For_Debug repository from GitHub (snapshot scripts for gathering information about the model and Habana training session for Habana analysis and debug)
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents

Sayantan_S · September 22, 2022, 5:58pm

Thanks for posting, we will take a look at your issue and get back to you

Sayantan_S · September 30, 2022, 7:17pm

Hi @Purvang :

Please note that distributed Horovod code needs some changes, such as the ones shown in example_hvd.py in the HabanaAI/Model-References repository on GitHub (master branch).

I see that your file has most of the changes, but not all. For example, the optimizer must be wrapped (optimizer = hvd.DistributedOptimizer(optimizer)), as shown in that example.

Without this change, mpirun just launches 8 separate uncoordinated training processes that do not sync together at the optimizer step.
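
To illustrate why the missing wrapper matters, here is a toy, single-process sketch in plain Python. It is not Horovod and not the poster's model: the linear model, the data shards, and the function names are all hypothetical, and the averaging line only stands in for the cross-process gradient averaging that a distributed optimizer wrapper is expected to perform. It compares 8 simulated workers that average gradients at each step with 8 that do not.

```python
# Toy illustration only: plain Python, no Horovod or TensorFlow involved.
# Each simulated "worker" fits y = w * x on its own data shard with SGD.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train(shards, steps=50, lr=0.05, average_gradients=True):
    """Run SGD on every worker and return each worker's final weight."""
    weights = [0.0] * len(shards)  # every worker starts from the same weight
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        if average_gradients:
            # Stand-in for the averaging step: all workers apply the mean gradient.
            mean_grad = sum(grads) / len(grads)
            grads = [mean_grad] * len(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical shards: worker k sees data generated with slope 2.0 + 0.1 * k.
shards = [[(x, (2.0 + 0.1 * k) * x) for x in (1.0, 2.0, 3.0)] for k in range(8)]

synced = train(shards, average_gradients=True)
unsynced = train(shards, average_gradients=False)

print("with averaging:   ", [round(w, 3) for w in synced])
print("without averaging:", [round(w, 3) for w in unsynced])
```

With averaging, all 8 workers hold the same weight after every step, so they behave as one training run over the combined data. Without it, each worker drifts toward the solution for its own shard, which is 8 independent small runs rather than one faster run.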

Once you have made this correction, could you please provide us with single-card and 8x multi-card logs for at least 10 epochs of training?

Note: I couldn't get your code to run as-is (I got errors that learning_rate, weights, utils, initial_epoch, etc. are not defined), so I had to make some minor modifications.

Thanks
