When there is not enough memory on the HPUs for my current batch size, I get a "Graph compile failed" error instead of an OOM error. Reducing the batch size allows the model to train. I've spent a couple of days trying to figure out what is wrong with my model, double-checking all the guidelines and the reference model code, because the error message points at "torch.autograd.backward".
Full error message:
```
Traceback (most recent call last):
  File "train.py", line 788, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "train.py", line 780, in main
    train(config)
  File "train.py", line 666, in train
    train_metrics = train_one_epoch(
  File "train.py", line 457, in train_one_epoch
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn compile encountered : Graph compile failed. 26 compile time 488716541353 ns
```
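For context, here is a minimal sketch of the kind of training step where the failure surfaces for me. The model, sizes, and function names are illustrative, not my actual code; on Gaudi the device would be `"hpu"` via the `habana_frameworks.torch` bridge (the sketch falls back to CPU when that package is absent, so it runs anywhere). My understanding, which may be wrong, is that in lazy mode the accumulated graph is compiled around `loss.backward()` / `mark_step()`, which would explain why a memory problem shows up there as "Graph compile failed" rather than as an OOM:

```python
import torch

try:
    # Habana PyTorch bridge; only available on HPU machines.
    import habana_frameworks.torch.core as htcore
    device = torch.device("hpu")
except ImportError:
    htcore = None
    device = torch.device("cpu")

# Illustrative stand-ins for my real model/optimizer.
model = torch.nn.Linear(512, 512).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(batch_size):
    x = torch.randn(batch_size, 512, device=device)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()            # <- this is where "Graph compile failed" is raised on HPU
    opt.step()
    if htcore is not None:
        htcore.mark_step()     # lazy mode: triggers graph compilation/execution
    return loss.item()

# Halving the batch size until this succeeds is my current workaround.
train_step(batch_size=32)
```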