RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread

rachtsingh · December 3, 2024, 9:26am

I’m getting the same error:

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
[Rank:0] FATAL ERROR :: MODULE:PT_EAGER HabanaLaunchOpPT Run returned exception....
Graph compile failed. synStatus=synStatus 26 [Generic failure].
[Rank:0] Habana exception raised from compile at graph.cpp:599
[Rank:0] Habana exception raised from LaunchRecipe at graph_exec.cpp:558

In my case, I tried the following combinations:

PT_HPU_LAZY_MODE=0, torch.compile, and model.eval: fails with the above error
torch.compile and model.eval: hpu_backend is not available
PT_HPU_LAZY_MODE=0 and model.eval: works fine
model.eval only: works fine
PT_HPU_LAZY_MODE=0, torch.compile, and model.train: fails with the above error (graph_exec.cpp:558)
torch.compile and model.train: Invalid backend
PT_HPU_LAZY_MODE=0 and model.train: works fine
model.train only: works fine
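
For anyone who wants to re-run the same eight combinations, here is a rough sketch of a driver that enumerates them. It launches each case in its own process, since the environment variable has to be in place before the process starts. The script name repro.py and its --mode / --compile flags are hypothetical placeholders for whatever script loads and runs your model, and nothing here is specific to this model.

```python
import itertools
import os
import subprocess
import sys


def build_matrix(base_env=None):
    """Enumerate the 8 cases: (eval|train) x (compile|no compile) x (lazy mode 0|unset)."""
    base_env = dict(os.environ if base_env is None else base_env)
    cases = []
    for mode, use_compile, lazy_mode in itertools.product(
        ("eval", "train"), (True, False), ("0", None)
    ):
        env = dict(base_env)
        # None means "leave PT_HPU_LAZY_MODE unset" for this case.
        env.pop("PT_HPU_LAZY_MODE", None)
        if lazy_mode is not None:
            env["PT_HPU_LAZY_MODE"] = lazy_mode
        cases.append(
            {"mode": mode, "compile": use_compile, "lazy_mode": lazy_mode, "env": env}
        )
    return cases


def describe(case):
    """Short label in the same wording as the list above."""
    parts = []
    if case["lazy_mode"] is not None:
        parts.append(f"PT_HPU_LAZY_MODE={case['lazy_mode']}")
    if case["compile"]:
        parts.append("torch.compile")
    parts.append(f"model.{case['mode']}")
    return ", ".join(parts)


def run_case(case, script="repro.py"):
    """Run one case in a fresh process.

    Assumption: `script` is a hypothetical repro script that accepts
    --mode {eval,train} and an optional --compile flag.
    """
    cmd = [sys.executable, script, "--mode", case["mode"]]
    if case["compile"]:
        cmd.append("--compile")
    result = subprocess.run(cmd, env=case["env"], capture_output=True, text=True)
    return result.returncode, result.stderr


# List the cases without running anything.
for case in build_matrix():
    print(describe(case))
```

Calling run_case(case) for each entry and recording the return code and stderr gives a pass/fail table that can be compared against the list above.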

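So I think this is basically an issue with torch.compile on this specific model: every run without torch.compile works, with or without PT_HPU_LAZY_MODE=0, in both eval and train.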

Here’s the config:

============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
