- I am using this Docker image on the VMs:
vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest
- I provisioned two nodes, each with 1 HPU, and created a Ray cluster from them with the following commands:
ray start --head # on one node
ray start --address=x.x.x.x:port # on the other node
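In case it is relevant: a sketch of the same startup with the HPU advertised as an explicit Ray custom resource, so tasks can later request it via `resources={"HPU": 1}`. The resource name `"HPU"` here is my assumption; older Ray versions do not auto-detect Gaudi devices.

```shell
# Sketch: declare the HPU as a custom resource on each node.
# The resource name "HPU" is an assumption, not auto-detected.
ray start --head --resources='{"HPU": 1}'                   # on one node
ray start --address=x.x.x.x:port --resources='{"HPU": 1}'   # on the other node
ray status   # verify both nodes and their resources joined the cluster
```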
- I want to fine-tune Llama-2-7b on this Ray cluster in DDP mode across those 2 nodes.
After loading the model from the pretrained model path, I tried:
model = model.to(dtype=torch.bfloat16, device="hpu")
but it failed with:
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
result = self.original_to(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in to
return self._apply(convert)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1161, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
return super().__torch_function__(func, types, new_args, kwargs)
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
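For debugging, here is a minimal sketch of a visibility check I can run inside each Ray worker before calling `.to("hpu")`, to see whether the process can acquire a device at all. It assumes the Habana PyTorch bridge (`habana_frameworks`) from the Gaudi container; the import is guarded so the check degrades gracefully elsewhere.

```python
import os

def hpu_available() -> bool:
    """Return True if this process can see an HPU device (sketch)."""
    try:
        # Habana PyTorch bridge; assumed present in the Gaudi container.
        import habana_frameworks.torch.hpu as hthpu
    except ImportError:
        return False
    return hthpu.is_available()

# A container commonly loses devices when this variable hides them;
# print it for debugging when acquisition fails.
print("HABANA_VISIBLE_DEVICES =", os.environ.get("HABANA_VISIBLE_DEVICES", "<unset>"))
print("HPU visible:", hpu_available())
```

Running this in each worker should show whether the `Device acquire failed` error comes from the container not seeing the device, rather than from the model conversion itself.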
Does anyone know how to fix this problem?