I am trying to train the YOLOX algorithm on Gaudi2. I am getting the above error at this operation:
grid = torch.stack((xv, yv), 2).view(1, 1, hsize, wsize, 2).type(dtype)
How can I solve it?
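For anyone trying to reproduce this, the line can be isolated in a small standalone script. This is only a sketch under assumptions: `make_grid` is a hypothetical helper name, `xv` and `yv` are assumed to come from `torch.meshgrid`, and `dtype` is assumed to be a `torch.dtype`, so `.to(...)` is used here in place of `.type(dtype)`. It runs on CPU by default; passing a different `device` is left to the reader.

```python
import torch


def make_grid(hsize, wsize, dtype=torch.float32, device="cpu"):
    """Build a (1, 1, hsize, wsize, 2) grid of (x, y) cell coordinates.

    Hypothetical helper for isolating the failing line; not the YOLOX code.
    """
    # indexing="ij" returns yv (row index) and xv (column index),
    # each of shape (hsize, wsize).
    yv, xv = torch.meshgrid(
        torch.arange(hsize), torch.arange(wsize), indexing="ij"
    )
    # Stack x and y along a new last dimension, then add two leading dims.
    grid = torch.stack((xv, yv), 2).view(1, 1, hsize, wsize, 2)
    # Assumption: dtype is a torch.dtype, so use .to() rather than .type().
    return grid.to(device=device, dtype=dtype)


grid = make_grid(4, 6)
print(grid.shape, grid.dtype)
```

If this runs on CPU but the same call fails on the target device, the value of `dtype` at the failing line is worth printing and including in the report.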
When posting a technical issue, please describe it as fully as possible. You can include things like:
• What was the expected behavior:
• What is the observed result:
• Is the issue consistently reproducible? how long does it take to reproduce:
• If you are using an AWS DL1 instance, please report the AMI name that you are using:
• What is the minimal script/command to reproduce the issue:
• Please include any error message or stack trace observed:
• Please run the Snapshot for Debug tool and post the result to the issue:
• git clone https://github.com/HabanaAI/Snapshot_For_Debug
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents
No. This error occurs when I train with Horovod instead of the PyTorch distributed module. Integrating Horovod into original_yolox and running it on an A100 works fine, but I get the mentioned error when I run on Gaudi2. The post you mentioned uses the yolox_gaudi2 code.