Hi, I’m trying to run multi-node ResNet50 training with Horovod, but it fails with the error below even though the ports are up on both nodes. Single-node training with mpirun works and does not produce the hl: 7 error.
Error:
[1,8]:Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 7) 18:53:23 [HCL triggered error]
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer – that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: smc-gaudi1
Local PID: 2331
Peer host: 192.168.10.162
root@smc-gaudi2:~# /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
hl0
3 ports up (8, 22, 23)
hl1
3 ports up (8, 22, 23)
hl2
3 ports up (8, 22, 23)
hl3
3 ports up (8, 22, 23)
hl4
3 ports up (8, 22, 23)
hl5
3 ports up (8, 22, 23)
hl6
3 ports up (8, 22, 23)
hl7
3 ports up (8, 22, 23)
root@smc-gaudi2:~# ip a s | grep 192
inet 192.168.10.162/24 brd 192.168.10.255 scope global ens102f1np1
root@smc-gaudi2:~#
root@smc-gaudi1:~# ip a s | grep -i 192
inet 192.168.10.152/24 brd 192.168.10.255 scope global ens102f1np1
root@smc-gaudi1:~# /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
hl0
3 ports up (8, 22, 23)
hl1
3 ports up (8, 22, 23)
hl2
3 ports up (8, 22, 23)
hl3
3 ports up (8, 22, 23)
hl4
3 ports up (8, 22, 23)
hl5
3 ports up (8, 22, 23)
hl6
3 ports up (8, 22, 23)
hl7
3 ports up (8, 22, 23)
root@smc-gaudi1:~#
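To compare the port status of the two nodes without reading it by eye, something like the following could be used. It is only a minimal sketch: the function name `parse_port_status` is hypothetical, and it assumes the `--status` output has exactly the two-line-per-interface format pasted above.

```python
import re

def parse_port_status(text):
    """Map each interface name (e.g. 'hl0') to the list of port numbers reported up.

    Assumes the format shown above: an interface line such as 'hl0'
    followed by a line such as '3 ports up (8, 22, 23)'.
    """
    status = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if re.fullmatch(r"hl\d+", line):
            # Start of a new interface entry
            current = line
            status[current] = []
            continue
        match = re.fullmatch(r"(\d+) ports? up \(([\d,\s]*)\)", line)
        if match and current is not None:
            # Collect the port numbers listed inside the parentheses
            status[current] = [int(p) for p in match.group(2).split(",") if p.strip()]
    return status

sample = """hl0
3 ports up (8, 22, 23)
hl1
3 ports up (8, 22, 23)
"""

print(parse_port_status(sample))
```

Running this on the output saved from each node and comparing the two dictionaries would show whether any interface differs between smc-gaudi1 and smc-gaudi2.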
The command I use:
mpirun \
--allow-run-as-root --mca plm_rsh_args -p4022 \
--bind-to core \
--map-by socket:PE=10 -np 16 \
--mca btl_tcp_if_include ens102f1np1 \
--tag-output --merge-stderr-to-stdout --prefix $MPI_ROOT \
-H 192.168.10.162:8,192.168.10.152:8 \
-x GC_KERNEL_PATH -x HABANA_LOGS \
-x PYTHONPATH -x HCCL_SOCKET_IFNAME=ens102f1np1 \
$PYTHON resnet_ctl_imagenet_main.py \
-dt bf16 \
-dlit bf16 \
-bs 256 \
-te 40 \
-ebe 40 \
--use_horovod \
--data_dir /root/datasets/FILL_ME_IN/tf_records/ \
--optimizer LARS \
--base_learning_rate 13 \
--warmup_epochs 7 \
--momentum 0.9 \
--lars_decay_epochs 41 \
--lr_schedule polynomial \
--label_smoothing 0.1 \
--weight_decay 0.0001 \
--single_l2_loss_op \
--enable_tensorboard
Model reference used:
Please let me know if you need any additional information.
Thanks, Rajesh