I am trying to run some DL workloads with Horovod (one with TensorFlow and one with PyTorch). Installing Horovod with pip builds it with Gloo, MPI, and NCCL support, and it uses MPI for collective operations. How can I use HCCL for collective ops with Horovod?
When posting a technical issue, please describe it in as much detail as possible. You can include things like:
• What was the expected behavior:
• What is the observed result:
• Is the issue consistently reproducible? How long does it take to reproduce:
• If you are using AWS DL1 instance, please report the AMI name that you are using
What is the minimal script/command to reproduce the issue:
Please include any error message or stack trace observed:
Please run the Snapshot for Debug tool and post to the issue
• git clone https://github.com/HabanaAI/Snapshot_For_Debug (snapshot scripts for gathering information about the model and Habana training session for Habana analysis and debug)
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents