I’m using an AWS Gaudi 1 instance with the following setup:
AMI name: Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928
Docker image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
Docker command: docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
I am trying to run the hello_world example given here: https://github.com/HabanaAI/Model-References/blob/master/PyTorch/examples/computer_vision/hello_world/example.py#L12
However, running it fails with the following error, raised when the model is moved to the device:
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4542/4542 [00:00<00:00, 27098902.94it/s]
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw
Traceback (most recent call last):
  File "ex.py", line 155, in <module>
    main()
  File "ex.py", line 131, in main
    net.to(device)
  File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
    result = self.original_to(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1146, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
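One thing I noticed while writing this up: the AMI name mentions SynapseAI 1.6.0, while the Docker image path contains 1.11.0. I have not confirmed that this has anything to do with the error. The small sketch below only pulls the two version strings out of the names above and compares them. The regular expressions, and the reading of the image path segment as the SynapseAI release, are my own assumptions and not something taken from the Habana documentation.

```python
import re

# Strings copied verbatim from the setup described above.
AMI_NAME = "Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928"
DOCKER_IMAGE = (
    "vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/"
    "habanalabs/pytorch-installer-2.0.1:latest"
)


def synapse_version_from_ami(name):
    """Return the version that follows 'SynapseAI' in the AMI name, or None."""
    match = re.search(r"SynapseAI (\d+\.\d+\.\d+)", name)
    return match.group(1) if match else None


def synapse_version_from_image(image):
    """Return the version-like path segment after 'gaudi-docker/', or None.

    Assumption: this path segment is the SynapseAI release of the image.
    """
    match = re.search(r"gaudi-docker/(\d+\.\d+\.\d+)/", image)
    return match.group(1) if match else None


host_version = synapse_version_from_ami(AMI_NAME)
container_version = synapse_version_from_image(DOCKER_IMAGE)

print("host (AMI):", host_version)
print("container (image):", container_version)
print("versions match:", host_version == container_version)
```

For my setup this reports that the two versions differ, so I would also like to know whether the host and container versions are expected to match.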
Please advise.