RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread

We encounter

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
[Rank:0] FATAL ERROR :: MODULE:PT_EAGER HabanaLaunchOpPT Run returned exception....
Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
[Rank:0] Habana exception raised from compile at graph.cpp:599
[Rank:0] Habana exception raised from LaunchRecipe at graph_exec.cpp:558

when running GNN training code.

  • I use the docker image vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest and then pip install torch_geometric
  • Specifically, torch-geometric==2.6.1

See the error messages here and see the code here.

Let me also show a workaround and some debugging information.

Workaround

We were able to get the code running by

  1. Removing model = torch.compile(model, backend="hpu_backend") and

  2. Moving the evaluation part to CPU (while keeping the training part on HPU).
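The second step can be sketched roughly as below. This is only a sketch under assumptions: eval_on_cpu is a hypothetical helper name (not part of PyTorch or the Habana bridge), and it only assumes the model exposes the usual torch.nn.Module methods to(), eval() and train(). The model is not wrapped in torch.compile here, matching step 1.

```python
from contextlib import contextmanager


@contextmanager
def eval_on_cpu(model, train_device="hpu"):
    """Temporarily move `model` to CPU in eval mode, then restore it.

    Hypothetical helper: `model` is expected to behave like a
    torch.nn.Module (to(), eval(), train()). It is deliberately NOT
    wrapped in torch.compile.
    """
    model.to("cpu")   # evaluation runs on CPU to avoid the HPU eval failure
    model.eval()
    try:
        yield model
    finally:
        # Restore the training setup even if evaluation raised.
        model.to(train_device)
        model.train()


# Intended usage inside a training loop (names are placeholders):
#
#   for epoch in range(num_epochs):
#       train_one_epoch(model, loader)          # model stays on HPU
#       with eval_on_cpu(model), torch.no_grad():
#           out = model(val_data.to("cpu"))     # inputs must be on CPU too
```

Moving the model back and forth costs a host/device copy per evaluation, so it may be worth evaluating only every few epochs.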

Debugging Information

The same errors appear even when we run only the training part.

The code works well on CPU.

After removing model = torch.compile(model, backend="hpu_backend"), we encounter a different error: RuntimeError: synStatus=1 [Invalid argument] Node reshape failed.

However, it is possible to run the code with only training after removing model = torch.compile(model, backend="hpu_backend").

Analysis: After removing model = torch.compile(model, backend="hpu_backend"), the error seems to appear only when we call model.forward() after model.eval(); the forward pass works when the model is in training mode, i.e., after model.train().
Our workaround above is consistent with this analysis.

I’m getting the same error:

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
[Rank:0] FATAL ERROR :: MODULE:PT_EAGER HabanaLaunchOpPT Run returned exception....
Graph compile failed. synStatus=synStatus 26 [Generic failure].
[Rank:0] Habana exception raised from compile at graph.cpp:599
[Rank:0] Habana exception raised from LaunchRecipe at graph_exec.cpp:558

In my case, I tried the following combinations:

  1. Attempt to run the model with PT_HPU_LAZY_MODE=0, torch.compile, and model.eval: fails with the above error.
  2. Attempt to run the model with torch.compile and model.eval: hpu_backend is not available.
  3. Attempt to run the model with PT_HPU_LAZY_MODE=0 and model.eval: works fine.
  4. Attempt to run the model with model.eval: works fine.
  5. Attempt to run the model with PT_HPU_LAZY_MODE=0, torch.compile, and model.train: fails with the above error (graph_exec.cpp:558).
  6. Attempt to run the model with torch.compile and model.train: Invalid backend.
  7. Attempt to run the model with PT_HPU_LAZY_MODE=0 and model.train: works fine.
  8. Attempt to run the model with model.train: works fine.

So I think this is basically an issue with torch.compile on this specific model.
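For anyone who wants to repeat this matrix, the eight combinations can be enumerated programmatically. The snippet below is only a sketch: build_cases and describe are hypothetical helper names, and it only generates the combinations (in the same order as the list above); it does not launch the model.

```python
import itertools


def build_cases():
    """Return the eight combinations in the order of the list above."""
    cases = []
    for mode, use_compile, lazy_off in itertools.product(
        ("eval", "train"), (True, False), (True, False)
    ):
        cases.append({
            "mode": mode,                # model.eval() or model.train()
            "use_compile": use_compile,  # wrap the model in torch.compile or not
            # Environment override for the run; empty means bridge defaults.
            "env": {"PT_HPU_LAZY_MODE": "0"} if lazy_off else {},
        })
    return cases


def describe(case):
    """Render a case in the same wording as the numbered list."""
    parts = [f"{key}={value}" for key, value in case["env"].items()]
    if case["use_compile"]:
        parts.append("torch.compile")
    parts.append(f"model.{case['mode']}")
    return ", ".join(parts)


if __name__ == "__main__":
    for number, case in enumerate(build_cases(), start=1):
        print(f"{number}. {describe(case)}")
```

Each case's env dictionary could then be merged into os.environ when starting the training script in a subprocess; how mode and use_compile are passed to that script depends on the script itself and is left open here.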

Here’s the config:

============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1