Multi-node non-MLPerf ResNet50 training with Horovod

Hi, I'm trying to train a ResNet50 model on 2 nodes with 16 Gaudi2 devices using https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

But I'm getting the messages below, and then training hangs forever:

[1,8]:Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
[1,8]:W0531 20:45:05.513363 140451780773696 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
[1,8]:Instructions for updating:
[1,8]:Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
[1,12]:I0531 20:45:05.624749 139692319606592 controller.py:263] Train at step 0 of 12480
[1,12]:I0531 20:45:05.624903 139692319606592 controller.py:267] Entering training loop with 200 steps, at step 0 of 12480
[1,12]:WARNING:tensorflow:From /root/MLPERF/Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/orbit/utils.py:144: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
[1,12]:Instructions for updating:
[1,12]:rename to distribute_datasets_from_function
[1,12]:W0531 20:45:05.625305 139692319606592 deprecation.py:350] From /root/MLPERF/Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/orbit/utils.py:144: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
[1,12]:Instructions for updating:
[1,12]:rename to distribute_datasets_from_function
[1,12]:I0531 20:45:05.637351 139692319606592 imagenet_preprocessing.py:324] HVD sharding the dataset: input_pipeline_id=12 num_input_pipelines=16
[1,12]:I0531 20:45:05.641161 139692319606592 imagenet_preprocessing.py:330] Sharding the dataset: input_pipeline_id=0 num_input_pipelines=1
[1,12]:WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
[1,12]:Instructions for updating:
[1,12]:Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
[1,12]:W0531 20:45:05.684433 139692319606592 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
[1,12]:Instructions for updating:
[1,12]:Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
[1,3]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,1]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,5]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,4]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,0]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,6]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,7]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
[1,2]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::433(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection
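
Based on the Port(22) DOWN lines, HCL seems to think a Gaudi NIC port is down. A minimal way to check this on both nodes, assuming the manage_network_ifs.sh helper from the Habana setup tools is available (its location varies by install):

# Run on BOTH nodes; every internal Gaudi port should report "up"
./manage_network_ifs.sh --status
# If any port reports "down", try bringing the ports up and re-check
./manage_network_ifs.sh --up
# Sanity-check that all 8 devices are visible on each node
hl-smi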

I'm training with Horovod using the command below. After running it, I can see the model loaded into HPU memory on all devices on both servers, and then I get the messages above.

mpirun \
  --allow-run-as-root --mca plm_rsh_args -p4022 \
  --bind-to core \
  --map-by socket:PE=10 -np 16 \
  --mca btl_tcp_if_include 172.27.21.0/24 \   <= also tried the interface name ens99f1 here
  --tag-output --merge-stderr-to-stdout --prefix $MPI_ROOT \
  -H 172.27.21.156:8,172.27.21.151:8 \
  -x GC_KERNEL_PATH -x HABANA_LOGS \
  -x PYTHONPATH \
  -x RDMAV_FORK_SAFE=1 -x FI_EFA_USE_DEVICE_RDMA=1 \
  $PYTHON resnet_ctl_imagenet_main.py \
  -dt bf16 \
  -dlit bf16 \
  -bs 256 \
  -te 40 \
  -ebe 40 \
  --use_horovod \
  --data_dir /root/datasets/FILL_ME_IN/tf_records/ \
  --optimizer LARS \
  --base_learning_rate 13 \
  --warmup_epochs 7 \
  --momentum 0.9 \
  --lars_decay_epochs 41 \
  --lr_schedule polynomial \
  --label_smoothing 0.1 \
  --weight_decay 0.0001 \
  --single_l2_loss_op \
  --enable_tensorboard
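
To rule out basic host-to-host connectivity on the subnet passed to btl_tcp_if_include, here is a quick sanity check (a sketch using the interface and addresses from my setup; run from the .156 node):

ip -4 addr show ens99f1              # confirm this node's address on 172.27.21.0/24
ping -c 3 -I ens99f1 172.27.21.151   # reachability to the second node
ssh -p 4022 172.27.21.151 hostname   # the non-default ssh port passed via plm_rsh_args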

If I use the same command with only 8 HPUs (-np 8 and -H 172.27.21.156:0,172.27.21.151:8), it runs without issue; it only fails with 16 HPUs.
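
To separate the MPI launch itself from HCL, the same transport options can run a trivial 16-rank job across both nodes; a sketch:

mpirun --allow-run-as-root --mca plm_rsh_args -p4022 \
  --mca btl_tcp_if_include 172.27.21.0/24 \
  -np 16 -H 172.27.21.156:8,172.27.21.151:8 \
  hostname   # should print each node's hostname 8 times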

Please let me know if any additional steps are needed. I followed the documentation closely and made sure PYTHON, PYTHONPATH, and MPI_ROOT are set as required in the above command.

Please let me know if you need any additional information.

Thanks, Rajesh

Are you using AWS?

If so, here are some more instructions that may help.
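
As a first check, assuming EFA-based scale-out (the command above already sets FI_EFA_USE_DEVICE_RDMA=1), verify that libfabric can see the EFA devices on both instances:

fi_info -p efa   # should list the efa provider's endpoints on each instance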