Multi-node training with Horovod failing with Synapse error, but ports are online

Hi, I’m trying to run multi-node training with Horovod for ResNet-50, and it fails with the error below even though the ports are online. Single-node training launched with mpirun works fine and does not hit the hl: 7 error.

Error:
[1,8]:Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 7) 18:53:23 [HCL triggered error]

An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.

Local host: smc-gaudi1
Local PID: 2331
Peer host: 192.168.10.162
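
Since the message blames the peer rank rather than the local one, I first confirmed plain TCP reachability between the two hosts on the interface and SSH port used by the launcher. A minimal sanity check, run from smc-gaudi1, using the IPs and the -p4022 SSH port from the mpirun command further down:

# Basic IP reachability over ens102f1np1
ping -c 3 192.168.10.162
# SSH port passed via --mca plm_rsh_args -p4022
nc -zv 192.168.10.162 4022
# Passwordless SSH on that port is required for mpirun to spawn remote ranks
ssh -p 4022 root@192.168.10.162 hostname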

root@smc-gaudi2:~# /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
hl0
3 ports up (8, 22, 23)
hl1
3 ports up (8, 22, 23)
hl2
3 ports up (8, 22, 23)
hl3
3 ports up (8, 22, 23)
hl4
3 ports up (8, 22, 23)
hl5
3 ports up (8, 22, 23)
hl6
3 ports up (8, 22, 23)
hl7
3 ports up (8, 22, 23)
root@smc-gaudi2:~# ip a s | grep 192
inet 192.168.10.162/24 brd 192.168.10.255 scope global ens102f1np1
root@smc-gaudi2:~#

root@smc-gaudi1:~# ip a s | grep -i 192
inet 192.168.10.152/24 brd 192.168.10.255 scope global ens102f1np1
root@smc-gaudi1:~# /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
hl0
3 ports up (8, 22, 23)
hl1
3 ports up (8, 22, 23)
hl2
3 ports up (8, 22, 23)
hl3
3 ports up (8, 22, 23)
hl4
3 ports up (8, 22, 23)
hl5
3 ports up (8, 22, 23)
hl6
3 ports up (8, 22, 23)
hl7
3 ports up (8, 22, 23)
root@smc-gaudi1:~#
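
If I read the HLS-Gaudi2 port layout correctly, the three ports up per card (8, 22, 23) are the external scale-out ports, so the status output itself may be fine. Since hl: 7 is reported as an HCL-triggered device error, it seems worth pulling the kernel log and driver version on both hosts around the failure time; a short sketch using only standard dmesg and hl-smi:

# Look for habanalabs driver errors around the time of the failure
dmesg -T | grep -i habana | tail -n 50
# Confirm the driver version matches on both nodes
hl-smi | grep -i "driver version"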

The command I use:
mpirun \
  --allow-run-as-root --mca plm_rsh_args -p4022 \
  --bind-to core \
  --map-by socket:PE=10 -np 16 \
  --mca btl_tcp_if_include ens102f1np1 \
  --tag-output --merge-stderr-to-stdout --prefix $MPI_ROOT \
  -H 192.168.10.162:8,192.168.10.152:8 \
  -x GC_KERNEL_PATH -x HABANA_LOGS \
  -x PYTHONPATH -x HCCL_SOCKET_IFNAME=ens102f1np1 \
  $PYTHON resnet_ctl_imagenet_main.py \
  -dt bf16 \
  -dlit bf16 \
  -bs 256 \
  -te 40 \
  -ebe 40 \
  --use_horovod \
  --data_dir /root/datasets/FILL_ME_IN/tf_records/ \
  --optimizer LARS \
  --base_learning_rate 13 \
  --warmup_epochs 7 \
  --momentum 0.9 \
  --lars_decay_epochs 41 \
  --lr_schedule polynomial \
  --label_smoothing 0.1 \
  --weight_decay 0.0001 \
  --single_l2_loss_op \
  --enable_tensorboard
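
To isolate whether the MPI/SSH transport or the HCCL/Synapse layer is failing, a stripped-down two-node smoke test that reuses the same launcher flags but runs no Habana code can help. A minimal sketch with the exact hosts and flags from the command above:

# Two-node MPI smoke test with no Habana workload involved
mpirun --allow-run-as-root \
  --mca plm_rsh_args -p4022 \
  --mca btl_tcp_if_include ens102f1np1 \
  -H 192.168.10.162:8,192.168.10.152:8 -np 16 \
  --tag-output \
  hostname

If this prints eight hostnames per node, the MPI layer is healthy and the problem is on the HCCL/Synapse side; in that case Habana's hccl_demo tool would be the usual next step to exercise the scale-out ports directly, independent of TensorFlow/Horovod.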

Model reference used:

Please let me know if you need any additional information.

Thanks, Rajesh

Can you please post your Docker image link/version and your driver version (hl-smi | grep -i "driver version")?