Model.to(device) failed: "RuntimeError: synStatus=8 [Device not found] Device acquire failed."

  1. I use this docker image on a VM:
vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest
  2. I requested two nodes, each with 1 HPU,
    and I created a Ray cluster from these 2 nodes with the following commands:
ray start --head # on one node
ray start --address=x.x.x.x:port # on the other node
  3. I want to fine-tune Llama-2-7b on this Ray cluster in DDP mode across the 2 nodes.
    After loading the model from the pretrained model path, I tried:
model = model.to(dtype=torch.bfloat16, device="hpu")

but the call failed:

  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
    result = self.original_to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1161, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
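
For context, the flow boils down to something like the sketch below (a minimal illustration, not my actual fine-tuning script; MODEL_PATH and the bare @ray.remote task are placeholder assumptions):

# Minimal sketch of the failing step (illustrative, not the real fine-tuning
# script). MODEL_PATH and the bare @ray.remote decorator are placeholders;
# assumes transformers and habana_frameworks are available inside the docker
# image on every node.
import ray

MODEL_PATH = "/path/to/llama-2-7b"  # placeholder for the pretrained model path

@ray.remote
def load_to_hpu():
    import torch
    import habana_frameworks.torch.core  # registers the "hpu" device on the worker
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
    # This is the call that fails with "synStatus=8 [Device not found]"
    # when the task lands on the second node.
    model = model.to(dtype=torch.bfloat16, device="hpu")
    return "ok"

ray.init(address="auto")              # connect to the running Ray cluster
print(ray.get(load_to_hpu.remote()))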

Does anyone know how to fix this problem?

Can you run the code without Ray?

That is, are you able to run model = model.to(dtype=torch.bfloat16, device="hpu") in a non-distributed, single-node setting?

Yes, there is no error when fine-tuning on one or multiple HPUs with a single-node Ray cluster.
The failure only occurs when fine-tuning on a Ray cluster with 2 or more nodes.

Hi, can you please verify the following:

  1. Whether there is network connectivity between the 2 Docker containers on the 2 nodes
  2. Whether passwordless communication between the containers works
  3. Whether the dataset is accessible from both containers at the same path
  4. Whether all Gaudi interfaces are up on both nodes: /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
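
In addition, a quick way to confirm that each Ray node can actually see its HPU is to run a small probe task on every node. This is a diagnostic sketch only; the habana_frameworks.torch.hpu is_available()/device_count() calls and the SPREAD scheduling strategy are assumptions about the installed Ray and SynapseAI versions:

# Probe every node in the running Ray cluster and report whether an HPU is
# visible there (diagnostic sketch; adjust the imports/APIs to your versions).
import ray

@ray.remote(num_cpus=1)
def hpu_visible():
    import socket
    import habana_frameworks.torch.hpu as hthpu  # assumed to expose is_available()/device_count()
    return socket.gethostname(), hthpu.is_available(), hthpu.device_count()

ray.init(address="auto")
# One probe per node; "SPREAD" asks Ray to place the tasks on different nodes.
probes = [
    hpu_visible.options(scheduling_strategy="SPREAD").remote()
    for _ in range(len(ray.nodes()))
]
for host, available, count in ray.get(probes):
    print(f"{host}: is_available={available}, device_count={count}")

If the probe on the second node reports that no HPU is visible, that points at the container or device setup on that node rather than at the training code itself.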