Hi,
I have two Gaudi2 nodes that are connected (ping and TCP connections are established), physically linked, and running the same Habana driver version.
The problem is that a minimal program doing a simple all_reduce fails. Here is the script that reproduces it, taken from the official documentation.
import os
import torch
import habana_frameworks.torch.core as htcore
import platform

torch.manual_seed(0)

# load hpu backend for PyTorch
device = torch.device('hpu')

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = '192.168.100.231'  # obtained by hostname -I
    os.environ['MASTER_PORT'] = '12340'
    import habana_frameworks.torch.distributed.hccl
    torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)

def cleanup():
    torch.distributed.destroy_process_group()

def allReduce(rank):
    _tensor = torch.ones(8).to(device)
    torch.distributed.all_reduce(_tensor)
    _tensor_cpu = _tensor.cpu()
    # Optionally, print the result for debugging
    if rank == 0:
        print(_tensor_cpu)

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    print("setup")
    for i in range(100):
        allReduce(rank)
    cleanup()

def main():
    # Run Habana's initialize_distributed_hpu() to collect the world size and rank
    from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu
    world_size, rank, local_rank = initialize_distributed_hpu()
    run_allreduce(rank, world_size)

if __name__ == '__main__':
    main()
Before running the script, I set these environment variables:
export WORLD_SIZE=2
export RANK=0 # export RANK=1 on the other node
# ens108np0 is the interface bound to the master IP address 192.168.100.231
export HCCL_SOCKET_IFNAME=ens108np0
export NCCL_SOCKET_IFNAME=ens108np0
export GLOO_SOCKET_IFNAME=ens108np0
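To confirm that these variables are actually picked up, I can first run a small sanity check on each node (this is just a sketch of my own, not from the docs; the file name sanity_env.py is mine):

# sanity_env.py -- print what initialize_distributed_hpu() sees on this node
# (assumes WORLD_SIZE, RANK, and HCCL_SOCKET_IFNAME are exported as above;
#  I am not sure what local_rank is when LOCAL_RANK is not exported)
import os
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

world_size, rank, local_rank = initialize_distributed_hpu()
print(f"world_size={world_size} rank={rank} local_rank={local_rank}")
print(f"HCCL_SOCKET_IFNAME={os.environ.get('HCCL_SOCKET_IFNAME')}")

Both nodes report the expected world size and ranks, so the rendezvous variables seem to be read correctly.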
When I run python hccl_example.py on both nodes, the master node shows this error log:
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
return isinstance(object, types.FunctionType)
setup
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM : 2113384620 KB
------------------------------------------------------------------------------
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gen2_arch_common/host_scheduler.cpp::274(processScaleoutWaitForCompCommand): The condition [ status == true ] failed. waitForCompletion returned with an error
The other node:
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
return isinstance(object, types.FunctionType)
setup...
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gen2_arch_common/host_scheduler.cpp::274(processScaleoutWaitForCompCommand): The condition [ status == true ] failed. waitForCompletion returned with an error
It seems to crash in the all_reduce, when the result tensor is copied back to the CPU, but I do not know how to debug this further.
I also checked running on a single node (the same script launched twice with different ranks), and the all_reduce works correctly there.
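Since the single-node run works, one thing I tried to isolate the problem (my own sketch, not from the Habana docs; gloo_check.py is a name I made up) is running the same two-node rendezvous with the gloo backend on CPU tensors over the same interface:

# gloo_check.py -- CPU-only all_reduce across the two nodes via the gloo backend
# Reuses the exported RANK/WORLD_SIZE and GLOO_SOCKET_IFNAME=ens108np0;
# the port 12355 is arbitrary, just different from 12340.
import os
import torch
import torch.distributed as dist

def main():
    os.environ.setdefault('MASTER_ADDR', '192.168.100.231')
    os.environ.setdefault('MASTER_PORT', '12355')
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)
    t = torch.ones(8)
    dist.all_reduce(t)  # with two ranks I expect a tensor of all 2.0
    print(f"rank {rank}: {t}")
    dist.destroy_process_group()

if __name__ == '__main__':
    main()

If this gloo run succeeds on both nodes while the HCCL run still fails, I assume the host-side TCP setup is fine and the failure is somewhere in the Gaudi scale-out path, but I am not sure how to check that.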
Could you please give me some suggestions on what I’m missing?