RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread

We encounter

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
[Rank:0] FATAL ERROR :: MODULE:PT_EAGER HabanaLaunchOpPT Run returned exception....
Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
[Rank:0] Habana exception raised from compile at graph.cpp:599
[Rank:0] Habana exception raised from LaunchRecipe at graph_exec.cpp:558

when running our GNN training code.

  • I use the docker image vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest and then pip install torch_geometric
  • Specifically, torch-geometric==2.6.1
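
For reference, the environment above can be reproduced roughly as follows (the `--runtime=habana` flag and environment variable are the usual Gaudi container invocation, not taken from the original post; adjust to your setup):

```shell
# Pull and start the Gaudi PyTorch container (flags follow the usual Gaudi
# docker invocation; adjust to your setup)
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

# Inside the container, install PyG at the version that reproduces the issue
pip install torch-geometric==2.6.1
```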

See the full error messages here and the code here.

Let me also show a workaround and some debugging information.

Workaround

We are able to adapt the code by:

  1. Removing model = torch.compile(model, backend="hpu_backend"), and

  2. Moving the evaluation part to CPU (while keeping the training part on HPU).
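
A minimal sketch of this workaround (the linear layer and random inputs are illustrative stand-ins for the original GNN and data loaders; the guarded habana_frameworks import lets the snippet fall back to CPU when no Gaudi stack is present):

```python
import torch
import torch.nn as nn

# Pick HPU for training if the Gaudi stack is available, else fall back to CPU
try:
    import habana_frameworks.torch.core as htcore  # noqa: F401
    train_device = torch.device("hpu")
except ImportError:
    train_device = torch.device("cpu")

eval_device = torch.device("cpu")  # evaluation stays on CPU per the workaround

model = nn.Linear(8, 2)  # stand-in for the GNN; note: no torch.compile call

# --- training part on HPU ---
model.to(train_device).train()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(4, 8, device=train_device)
loss = model(x).sum()
loss.backward()
opt.step()

# --- evaluation part moved to CPU ---
model.to(eval_device).eval()
with torch.no_grad():
    out = model(torch.randn(4, 8, device=eval_device))
```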

Debugging Information

The same errors appear even when we only include the training.

The code works well on CPU.

After removing model = torch.compile(model, backend="hpu_backend"), we encounter a different error: RuntimeError: synStatus=1 [Invalid argument] Node reshape failed.

However, with torch.compile removed, the code does run if we include only the training part.

Analysis: After removing model = torch.compile(model, backend="hpu_backend"), the error seems to appear when we call model.forward() after model.eval(), while the same call succeeds in training mode, i.e., after model.train().
Our workaround above also supports this analysis.
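
The failure pattern described in the analysis can be condensed into a hypothetical minimal repro (the layer and inputs are placeholders; on a CPU-only machine both forwards succeed, which matches the observation that the code works on CPU):

```python
import torch
import torch.nn as nn

try:
    import habana_frameworks.torch.core as htcore  # noqa: F401
    device = torch.device("hpu")
except ImportError:
    device = torch.device("cpu")  # both calls below pass on CPU

model = nn.Linear(8, 2).to(device)
x = torch.randn(4, 8, device=device)

model.train()
out_train = model(x)  # fine in training mode

model.eval()
with torch.no_grad():
    out_eval = model(x)  # on HPU this eval-mode forward is where the error surfaced
```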

I’m getting the same error:

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
[Rank:0] FATAL ERROR :: MODULE:PT_EAGER HabanaLaunchOpPT Run returned exception....
Graph compile failed. synStatus=synStatus 26 [Generic failure].
[Rank:0] Habana exception raised from compile at graph.cpp:599
[Rank:0] Habana exception raised from LaunchRecipe at graph_exec.cpp:558

In my case, the following combinations were attempted:

  1. PT_HPU_LAZY_MODE=0, torch.compile, and model.eval: fails with the above error
  2. torch.compile and model.eval: hpu_backend is not available
  3. PT_HPU_LAZY_MODE=0 and model.eval: works fine
  4. model.eval: works fine
  5. PT_HPU_LAZY_MODE=0, torch.compile, and model.train: fails with the above error (graph_exec.cpp:558)
  6. torch.compile and model.train: invalid backend
  7. PT_HPU_LAZY_MODE=0 and model.train: works fine
  8. model.train: works fine
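
For reference, a sketch of how the torch.compile cases above are typically wired up (assumption: PT_HPU_LAZY_MODE is read when the Habana bridge loads, so it must be set before the import, which would explain cases 2 and 6 reporting the backend as unavailable; the CPU fallback uses PyTorch's "eager" debug backend so the snippet runs off-Gaudi):

```python
import os
import torch
import torch.nn as nn

# PT_HPU_LAZY_MODE=0 enables eager/compile mode; it must be exported before
# the Habana bridge is imported, since "hpu_backend" is only registered then.
os.environ.setdefault("PT_HPU_LAZY_MODE", "0")

try:
    import habana_frameworks.torch.core as htcore  # noqa: F401
    device, backend = "hpu", "hpu_backend"
except ImportError:
    device, backend = "cpu", "eager"  # CPU fallback: built-in debug backend

model = torch.compile(nn.Linear(8, 2).to(device), backend=backend)

model.eval()
with torch.no_grad():
    out = model(torch.randn(4, 8, device=device))
```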

So I think this is basically an issue with torch.compile on this specific model.

Here’s the config:

============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1