RuntimeError: Input sizes must be equal when doing loss.backward() during the training of a GNN

We encounter RuntimeError: Input sizes must be equal when calling loss.backward() while training a GNN on HPU.
The code is adapted from a PyG tutorial.
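For context, here is a minimal sketch of the failing pattern. The two-layer GCN on Cora (via torch_geometric.nn.GCN and Planetoid) is a stand-in for our actual model and data, which are linked below; the error is raised at loss.backward() once the model is compiled with the HPU backend.

```python
import torch
import torch.nn.functional as F
import habana_frameworks.torch.core as htcore  # noqa: F401  (registers the "hpu" device)
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCN

# Stand-in dataset and model; the real code is linked in this post.
dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

device = torch.device("hpu")
model = GCN(dataset.num_features, 16, num_layers=2,
            out_channels=dataset.num_classes).to(device)
model = torch.compile(model, backend="hpu_backend")
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()  # <- RuntimeError: Input sizes must be equal
    optimizer.step()
```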

  • We use the docker image vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest and then pip install torch_geometric.
  • Specifically, torch-geometric==2.6.1 (a version-check snippet is shown below the list).
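For reference, the versions can be confirmed inside the container with the snippet below; the habana_frameworks import is an assumption about what the Gaudi image provides.

```python
import torch
import torch_geometric
import habana_frameworks.torch.hpu as hthpu  # assumed to ship with the Gaudi docker image

print("torch:", torch.__version__)                      # 2.4.0 in this image
print("torch_geometric:", torch_geometric.__version__)  # 2.6.1
print("HPU available:", hthpu.is_available())
print("HPU devices:", hthpu.device_count())
```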

See the error messages here and the code here.

More updates: below we also include a workaround and some debugging information.

Workaround

We are able to get the code running (a sketch of the adapted version follows the list) by

  1. Removing model = torch.compile(model, backend="hpu_backend") and
  2. Moving the evaluation part to CPU (while keeping the training part on HPU).
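A minimal sketch of that adapted version, using the same stand-in GCN/Cora setup as above (the actual code is linked in this post): training stays on HPU without torch.compile, and evaluation runs on a CPU copy of the model.

```python
import copy
import torch
import torch.nn.functional as F
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCN

dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]                          # untouched CPU copy, used for evaluation
data_hpu = copy.deepcopy(data).to("hpu")   # separate copy for training on HPU

model = GCN(dataset.num_features, 16, num_layers=2,
            out_channels=dataset.num_classes).to("hpu")
# Note: no model = torch.compile(model, backend="hpu_backend") here.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

def train() -> float:
    model.train()
    optimizer.zero_grad()
    out = model(data_hpu.x, data_hpu.edge_index)
    loss = F.cross_entropy(out[data_hpu.train_mask], data_hpu.y[data_hpu.train_mask])
    loss.backward()
    optimizer.step()
    htcore.mark_step()                     # flush the lazy-mode graph on HPU
    return float(loss)

@torch.no_grad()
def test() -> list[float]:
    cpu_model = copy.deepcopy(model).to("cpu")  # evaluate on a CPU copy of the weights
    cpu_model.eval()
    pred = cpu_model(data.x, data.edge_index).argmax(dim=-1)
    return [int((pred[m] == data.y[m]).sum()) / int(m.sum())
            for m in (data.train_mask, data.val_mask, data.test_mask)]

for epoch in range(1, 101):
    loss = train()
    if epoch % 10 == 0:
        train_acc, val_acc, test_acc = test()
        print(f"epoch {epoch:03d}  loss {loss:.4f}  val {val_acc:.4f}  test {test_acc:.4f}")
```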

Debugging Information

The same error appears even when we run only the training part.

The code runs correctly on CPU (see the code here).

After removing model = torch.compile(model, backend="hpu_backend"), we instead encounter another error, RuntimeError: synStatus=1 [Invalid argument] Node reshape failed:

[Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...

synStatus=1 [Invalid argument] Node reshape failed.

See the error messages here.

However, after removing model = torch.compile(model, backend="hpu_backend"), it is still possible to run the code when we include only the training part, which suggests the reshape failure originates in the evaluation part.