I am getting the following error while running the HCCL demo inside a container. Could someone please help?
[gaudi-2-26:07741] *** An error occurred in MPI_Bcast
[gaudi-2-26:07741] *** reported by process [3562864641,0]
[gaudi-2-26:07741] *** on communicator MPI_COMM_WORLD
[gaudi-2-26:07741] *** MPI_ERR_OTHER: known error not in list
[gaudi-2-26:07741] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gaudi-2-26:07741] *** and potentially your MPI job)
while running
python3 run_hccl_demo.py --size 32m --test all_reduce --loop 1000 -mpi -np 8 -clean --mca btl_tcp_if_include eth0
inside the container
vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
This issue is seen only inside the container; the HCCL benchmark runs fine directly on the host. It also affects containers on only four hosts. Containers on other hosts run the same HCCL command without any issues. I even rebuilt one affected host from scratch by redeploying the OS on it. Earlier, I was running v1.14 and was able to run the same command without any issues.
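Since the command passes `--mca btl_tcp_if_include eth0`, one thing that can be compared between a working and a failing container is whether an interface named `eth0` is visible there at all. The following is only a minimal sketch for that comparison, under the unconfirmed assumption that interface naming might differ between containers; `interface_present` is a hypothetical helper name and is not part of HCCL or Open MPI.

```python
import socket


def interface_present(name, available=None):
    """Return True if `name` is among the visible network interfaces."""
    if available is None:
        # socket.if_nameindex() returns (index, name) pairs on Unix systems.
        available = [ifname for _, ifname in socket.if_nameindex()]
    return name in available


if __name__ == "__main__":
    names = [ifname for _, ifname in socket.if_nameindex()]
    print("visible interfaces:", names)
    # "eth0" is the interface passed to btl_tcp_if_include in the command above.
    print("eth0 present:", interface_present("eth0", names))
```

Running this in a container where the benchmark works and in one where it fails would show whether the two see the same interface names. It does not by itself explain the MPI_Bcast error.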
root@gaudi-2-26:~# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.16.2-rc-fw-50.1.2.0       |
| Driver Version:                                     1.16.2-f195ec4          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A   30C   N/A    96W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:9a:00.0     N/A |                   0  |
| N/A   32C   N/A    77W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:4d:00.0     N/A |                   0  |
| N/A   32C   N/A    95W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:9b:00.0     N/A |                   0  |
| N/A   34C   N/A    86W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   32C   N/A    74W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   35C   N/A    90W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   29C   N/A    86W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   32C   N/A    89W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
root@gaudi-2-26:~#
Any help, please?