RuntimeError: Input sizes must be equal when doing loss.backward() during the training of a GNN

vezenbu · November 3, 2024, 6:17am

We encounter RuntimeError: Input sizes must be equal when calling loss.backward() while training a GNN.
The code is adapted from a PyG tutorial.

I use the Docker image vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest and then run pip install torch_geometric.
The installed version is torch-geometric==2.6.1.

See the error messages here and see the code here.

vezenbu · November 12, 2024, 6:14pm

More updates: Let me also show a workaround and some debugging information.

Workaround

We are able to get the code running by:

- Removing model = torch.compile(model, backend="hpu_backend"), and
- Moving the evaluation part to CPU (while keeping the training part on HPU).
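For anyone who wants to try the same workaround, here is a minimal sketch of the pattern, not the original script. It assumes a hypothetical stand-in model (TinyModel, a plain two-layer MLP rather than a PyG GNN) and random data. The helper names pick_train_device, train_step and evaluate_on_cpu are made up for illustration. The mark_step() calls assume lazy mode, and the sketch falls back to CPU when the Habana bridge is not installed.

```python
import torch
import torch.nn.functional as F


def pick_train_device():
    """Return (device, htcore); fall back to CPU if the Habana bridge is missing."""
    try:
        import habana_frameworks.torch.core as htcore  # registers the "hpu" device
        return torch.device("hpu"), htcore
    except ImportError:
        return torch.device("cpu"), None


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for the GNN; replace with the real PyG model."""

    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_dim, hidden)
        self.lin2 = torch.nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.lin2(F.relu(self.lin1(x)))


def train_step(model, optimizer, x, y, htcore=None):
    """One eager training step on the training device (no torch.compile)."""
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    if htcore is not None:
        htcore.mark_step()  # assumption: lazy mode, flush the backward graph
    optimizer.step()
    if htcore is not None:
        htcore.mark_step()  # assumption: lazy mode, flush the optimizer update
    return loss.item()


@torch.no_grad()
def evaluate_on_cpu(train_model, eval_model, x, y):
    """Copy the current weights into a CPU-resident model and evaluate there."""
    eval_model.load_state_dict(train_model.state_dict())
    eval_model.eval()
    pred = eval_model(x.cpu()).argmax(dim=-1)
    return (pred == y.cpu()).float().mean().item()


torch.manual_seed(0)
device, htcore = pick_train_device()

# Random placeholder data: 32 "nodes", 8 features, 3 classes.
x = torch.randn(32, 8)
y = torch.randint(0, 3, (32,))

model = TinyModel(8, 16, 3).to(device)  # note: not wrapped in torch.compile
eval_model = TinyModel(8, 16, 3)        # stays on CPU for evaluation
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

x_dev, y_dev = x.to(device), y.to(device)
losses = [train_step(model, optimizer, x_dev, y_dev, htcore) for _ in range(5)]
accuracy = evaluate_on_cpu(model, eval_model, x, y)
print(f"train device: {device}, last loss: {losses[-1]:.4f}, CPU accuracy: {accuracy:.2f}")
```

Keeping a separate CPU copy of the model and refreshing it with load_state_dict means no evaluation ops are ever dispatched to the HPU, which is the point of the workaround. The extra weight copy per evaluation is the cost.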

Debugging Information

The same errors appear even when we only include the training.

The code works well on CPU.

After removing model = torch.compile(model, backend="hpu_backend"), the script with both training and evaluation fails with a different error: RuntimeError: synStatus=1 [Invalid argument] Node reshape failed.

However, with model = torch.compile(model, backend="hpu_backend") removed, the training-only version of the code does run.

vezenbu · November 12, 2024, 6:14pm

Some updates:

The code works well on CPU. See the code here.

After removing model = torch.compile(model, backend="hpu_backend"), we encounter:

[Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...

synStatus=1 [Invalid argument] Node reshape failed.

See the error messages here.

Sayantan_S · March 20, 2025, 4:57pm

Hi @vezenbu,

Unfortunately, GNNs are not supported at the moment.

We will update you when GNN support is available.
Thank you for your interest.
