HCCL fails to connect across two nodes with a simple script

Hi,

I have two Gaudi2D nodes that can reach each other (ping and TCP connections are established), are physically connected, and run the same Habana driver version.

The problem is that a minimal program fails on a simple all_reduce. Here is the script to reproduce, taken from the official documentation.

import os
import torch

# importing habana_frameworks.torch.core loads the HPU backend for PyTorch
import habana_frameworks.torch.core as htcore

torch.manual_seed(0)

device = torch.device('hpu')


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = '192.168.100.231'  # obtained by hostname -I
    os.environ['MASTER_PORT'] = '12340'

    # importing this module registers the 'hccl' backend with torch.distributed
    import habana_frameworks.torch.distributed.hccl

    torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)


def cleanup():
    torch.distributed.destroy_process_group()

def allReduce(rank):
    _tensor = torch.ones(8).to(device)
    torch.distributed.all_reduce(_tensor)
    _tensor_cpu = _tensor.cpu()
    # Optionally, print the result for debugging
    if rank == 0:
        print(_tensor_cpu)


def run_allreduce(rank, world_size):
    setup(rank, world_size)
    print("setup")

    for i in range(100):
        allReduce(rank)

    cleanup()

def main():
    # Use Habana's initialize_distributed_hpu() to collect world size, rank and local rank
    from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

    world_size, rank, local_rank = initialize_distributed_hpu()

    run_allreduce(rank, world_size)


if __name__ == '__main__':
    main()

Before running the script, I set these environment variables:

export WORLD_SIZE=2
export RANK=0 # export RANK=1 on the other node

# the interface that carries the master IP address 192.168.100.231
export HCCL_SOCKET_IFNAME=ens108np0
export NCCL_SOCKET_IFNAME=ens108np0
export GLOO_SOCKET_IFNAME=ens108np0
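
For completeness, a small sanity check like the following (not part of the official example, just something I can run on each node) confirms that these variables are actually visible to the Python process before the script starts:

import os

# Print the variables the script relies on and fail early if any is missing on this node.
REQUIRED = ["WORLD_SIZE", "RANK", "HCCL_SOCKET_IFNAME"]

for name in REQUIRED:
    value = os.environ.get(name)
    print(f"{name} = {value!r}")
    if value is None:
        raise RuntimeError(f"{name} is not set on this node")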

When I run python hccl_example.py on both nodes, the master node prints the following error log:

/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
setup
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM       : 2113384620 KB
------------------------------------------------------------------------------
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gen2_arch_common/host_scheduler.cpp::274(processScaleoutWaitForCompCommand): The condition [ status == true ] failed. waitForCompletion returned with an error

The other node:

/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
setup...
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gen2_arch_common/host_scheduler.cpp::274(processScaleoutWaitForCompCommand): The condition [ status == true ] failed. waitForCompletion returned with an error

It seems to crash during the all_reduce, when the result tensor is copied back to the CPU, but I do not know how to debug this.
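
To narrow this down, one thing I can try (a sketch, not from the official docs) is the same rendezvous and all_reduce over the gloo backend with CPU tensors; if that works across the two nodes, the host network and the rendezvous are fine and the problem is isolated to the Gaudi scale-out path:

import os
import torch
import torch.distributed as dist

def gloo_sanity_check(rank, world_size):
    # Same rendezvous settings as the HCCL script, but using the gloo
    # backend with CPU tensors, so no Gaudi scale-out ports are involved.
    os.environ['MASTER_ADDR'] = '192.168.100.231'
    os.environ['MASTER_PORT'] = '12341'  # different port to avoid clashing with the HCCL run
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

    tensor = torch.ones(8)
    dist.all_reduce(tensor)
    print(f"rank {rank}: {tensor}")

    dist.destroy_process_group()

if __name__ == '__main__':
    gloo_sanity_check(int(os.environ['RANK']), int(os.environ['WORLD_SIZE']))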

I also checked running both ranks on a single node (the same script twice with different ranks), and the all_reduce works correctly there.

Could you please give me some suggestions on what I’m missing?

Update: this was resolved by cleaning up the environment. It may have been caused by some cards not being fully released after a process was killed.
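
For anyone hitting the same error, the cleanup amounted to making sure no stale processes were still holding the cards. Something along these lines is what I mean (a rough sketch; it assumes pkill and hl-smi are available on the node):

import subprocess

def cleanup_stale_processes(script_name="hccl_example.py"):
    # Kill any leftover copies of the script that may still hold the HPUs.
    subprocess.run(["pkill", "-f", script_name], check=False)
    # hl-smi shows the card status; after the cleanup it should no longer
    # list processes using the devices.
    subprocess.run(["hl-smi"], check=False)

if __name__ == "__main__":
    cleanup_stale_processes()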