Wrong error message when out of memory

Alexander · January 30, 2023, 4:38am

When there is not enough memory on the HPUs for my current batch size, I receive a “Graph compile failed” error message instead of an OOM error. Reducing the batch size allows the model to train. I’ve spent a couple of days trying to figure out what is wrong with my model, double-checking all the guidelines and the reference models’ code, because the message says the problem is in “torch.autograd.backward”.

Full error message:
Traceback (most recent call last):
  File "train.py", line 788, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "train.py", line 780, in main
    train(config)
  File "train.py", line 666, in train
    train_metrics = train_one_epoch(
  File "train.py", line 457, in train_one_epoch
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn compile encountered : Graph compile failed. 26 compile time 488716541353 ns
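
For anyone hitting the same message, here is a minimal sketch of the check described above: rerun one training step with a halved batch size each time the failure looks memory-related. It is not a Habana API. `train_step`, `find_fitting_batch_size`, and `MEMORY_HINTS` are hypothetical names. Treating "Graph compile failed" as a memory symptom is an assumption based only on this report, and the same text may have unrelated causes.

```python
# Substrings treated as "probably out of memory". Assumption: this thread only
# shows "Graph compile failed" appearing when memory was short; the same text
# can also come from unrelated compile problems.
MEMORY_HINTS = ("Graph compile failed", "out of memory")


def looks_like_memory_failure(err):
    """Return True if the exception text contains one of the assumed hints."""
    message = str(err).lower()
    return any(hint.lower() in message for hint in MEMORY_HINTS)


def find_fitting_batch_size(train_step, batch_size, min_batch_size=1):
    """Halve batch_size until train_step(batch_size) runs without a suspected
    memory failure, then return the batch size that worked.

    train_step is a hypothetical callable that runs one forward/backward pass
    at the given batch size and raises RuntimeError on failure.
    """
    while batch_size >= min_batch_size:
        try:
            train_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Anything that does not match the hints is re-raised unchanged,
            # so real model bugs are not hidden by the retry loop.
            if not looks_like_memory_failure(err):
                raise
            batch_size //= 2
    raise RuntimeError("No batch size at or above min_batch_size ran successfully")
```

If the step succeeds at a smaller batch size with no other change, memory is a likely cause. If it fails at every size, the problem is probably elsewhere in the model.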

Sayantan_S · January 30, 2023, 4:52am

Thanks for posting the issue.

Is it possible to share the model/code, or some small sample which reproduces the error?
