Training of torch.nn.Embedding failed: loss not decreasing

vezenbu · December 3, 2024, 9:27am

When training GNNs with Node2Vec on Gaudi, we found that the loss does not decrease, while the same code works well on CPUs.

After some debugging, we traced the problem to torch.nn.Embedding.

In a toy example, the loss does not decrease when torch.nn.Embedding is trained on Gaudi, but it decreases as expected on CPUs.

See the test code here.
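
For readers without access to the linked code, a toy check of this kind might look roughly like the sketch below. It is not the linked test code: the function name `train_embedding`, the table size, the learning rate, and the MSE objective are all illustrative assumptions. It fits a small torch.nn.Embedding table to fixed random targets and returns the loss per step, so the same function can be run on CPU and on HPU and the two loss curves compared.

```python
import torch


def train_embedding(device="cpu", steps=200, seed=0):
    """Fit a small embedding table to fixed targets; return the loss per step.

    Hypothetical helper for illustration only, not the code from the post.
    """
    torch.manual_seed(seed)
    # Create targets and weights on CPU first, then move them, so every
    # device starts from identical values.
    target = torch.randn(16, 8).to(device)
    emb = torch.nn.Embedding(16, 8).to(device)
    idx = torch.arange(16, device=device)
    opt = torch.optim.SGD(emb.parameters(), lr=1.0)

    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(emb(idx), target)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses


cpu_losses = train_embedding("cpu")
print("first loss:", cpu_losses[0], "last loss:", cpu_losses[-1])

# Assumption: on a Gaudi machine with the Intel Gaudi PyTorch bridge
# installed, import habana_frameworks.torch.core before running
# train_embedding("hpu") and compare the result with cpu_losses.
```

On CPU the last loss should be far below the first one. If the HPU run returns a flat curve from the same starting values, that isolates the problem to the device path rather than to the model or the data.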

sunson · December 11, 2024, 7:33am

Hi,

Could you try using torch.nn.Parameter instead and check whether the loss decreases properly?
Thanks
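
A rough sketch of what this swap could look like, assuming a plain lookup table is enough for the test: the embedding layer is replaced by a torch.nn.Parameter tensor that is indexed directly. The name `train_parameter_table`, the sizes, and the MSE objective are illustrative assumptions, not taken from the original test code.

```python
import torch


def train_parameter_table(device="cpu", steps=200, seed=0):
    """Same toy fit, with a raw nn.Parameter table instead of nn.Embedding.

    Hypothetical helper for illustration only.
    """
    torch.manual_seed(seed)
    # Generate on CPU and move afterwards so all devices share initial values.
    target = torch.randn(16, 8).to(device)
    table = torch.nn.Parameter(torch.randn(16, 8).to(device))
    idx = torch.arange(16, device=device)
    opt = torch.optim.SGD([table], lr=1.0)

    losses = []
    for _ in range(steps):
        opt.zero_grad()
        rows = table[idx]  # plain tensor indexing replaces the embedding lookup
        loss = torch.nn.functional.mse_loss(rows, target)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses


param_losses = train_parameter_table("cpu")
print("first loss:", param_losses[0], "last loss:", param_losses[-1])
```

If this variant trains on HPU while the torch.nn.Embedding version does not, the embedding backward pass is the likely culprit; if both stall, the issue probably sits in the indexing or optimizer update path instead.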

vezenbu · January 2, 2025, 3:28am

@sunson Thank you for your suggestion. We tried it and observed the same behavior: the loss on HPU still does not decrease properly. See the updated test code here.
