Huggingface Text Embedding Interface failure

HL-SMI Version: hl-1.16.2-rc-fw-50.1.2.0
Driver Version: 1.16.2-f195ec4

Traceback (most recent call last):
  File "/usr/local/bin/python-text-embeddings-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 716, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/src/backends/python/server/text_embeddings_server/cli.py", line 50, in serve
    server.serve(model_path, dtype, uds_path)
  File "/usr/src/backends/python/server/text_embeddings_server/server.py", line 79, in serve
    asyncio.run(serve_inner(model_path, dtype))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)

  File "/usr/src/backends/python/server/text_embeddings_server/server.py", line 48, in serve_inner
    model = get_model(model_path, dtype)
  File "/usr/src/backends/python/server/text_embeddings_server/models/__init__.py", line 66, in get_model
    return DefaultModel(model_path, device, dtype)
  File "/usr/src/backends/python/server/text_embeddings_server/models/default_model.py", line 23, in __init__
    model = AutoModel.from_pretrained(model_path).to(dtype).to(device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
    result = self.original_to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1155, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1153, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

The error ends with "Device acquire failed", which seems to suggest that the cards were busy, so the acquire failed.

If you are on a multi-card machine and someone is using some of the cards while others are free, you can set the environment variable HABANA_VISIBLE_MODULES=6,7 (or whichever module IDs you choose) to restrict your process to the free cards.
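For example, a minimal sketch of setting this from Python before the server starts (the module IDs 6 and 7 are only an illustration; pick whichever cards are free on your machine):

```python
import os

# Restrict this process to specific Gaudi modules. This must happen before
# any Habana libraries initialize and try to acquire devices.
# "6,7" here is an arbitrary example, not a recommendation.
os.environ["HABANA_VISIBLE_MODULES"] = "6,7"
print(os.environ["HABANA_VISIBLE_MODULES"])
```

When launching inside Docker, the same variable can be passed with `-e HABANA_VISIBLE_MODULES=6,7` instead.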

Yes, I had made sure that no cards were being used. I also ran hl-smi inside the container, checked that the /dev/ac* device files were present, rebooted the host, and rebuilt the container.

If you really insist, I will make a video to that effect.

Then let's try running a simple model like the MNIST example: Model-References/PyTorch/examples/computer_vision/hello_world at master · HabanaAI/Model-References (github.com)

Also, can you confirm that the driver version (which you can see using hl-smi) and the Docker image version are the same? I have sometimes seen "device acquire failed" errors when the driver and Docker versions do not match.

HL-SMI Version: hl-1.16.2-rc-fw-50.1.2.0
Driver Version: 1.16.2-f195ec4

on both the Docker container and the host system itself.
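A quick sketch of that comparison: extract the "Driver Version" line from the hl-smi output on the host and inside the container and check they agree. The regex below is an assumption based on the output format shown above.

```python
import re

def parse_driver_version(hl_smi_output: str) -> str:
    """Extract the value after 'Driver Version:' from hl-smi output."""
    match = re.search(r"Driver Version:\s*(\S+)", hl_smi_output)
    if match is None:
        raise ValueError("no 'Driver Version' line found in hl-smi output")
    return match.group(1)

# Example strings taken from the output pasted above; in practice you would
# capture `hl-smi` output on the host and inside the container.
host_output = "Driver Version: 1.16.2-f195ec4"
container_output = "Driver Version: 1.16.2-f195ec4"

host_ver = parse_driver_version(host_output)
container_ver = parse_driver_version(container_output)
print(host_ver == container_ver)  # prints True when the versions match
```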

---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375244 KB

Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.296875
Train Epoch: 1 [640/60000.0 (1%)] Loss: 1.421875
Train Epoch: 1 [1280/60000.0 (2%)] Loss: 0.789062
Train Epoch: 1 [1920/60000.0 (3%)] Loss: 0.515625

It does work correctly.

I have seen that the repository has been updated since I posted this ticket, and it is now working and stable after the migration from v1.2.2 to v1.4.0; see e.g. Merge pull request #20 from kaixuanliu/rebase_1.4 · huggingface/tei-gaudi@3db1796 · GitHub
