Let me also show a workaround and some debugging information.
Workaround
We were able to get the code running by:
1. Removing model = torch.compile(model, backend="hpu_backend"), and
2. Moving the evaluation part to CPU (while keeping the training part on HPU).
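For reference, the two steps above can be sketched roughly as follows. This is a minimal illustration and not our actual script: the helper names pick_train_device and train_then_eval_on_cpu, the MSE loss, and the SGD optimizer are placeholders chosen for the example. The sketch assumes the Habana bridge is imported as habana_frameworks.torch.core and falls back to CPU when it is not installed.

```python
import torch
import torch.nn as nn


def pick_train_device():
    """Return (device, htcore). Falls back to CPU if the Habana bridge is missing."""
    try:
        # Importing the bridge registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core as htcore
        return torch.device("hpu"), htcore
    except ImportError:
        return torch.device("cpu"), None


def train_then_eval_on_cpu(model, train_batches, eval_batches, lr=0.01):
    """Train on HPU (if available) without torch.compile, then evaluate on CPU."""
    device, htcore = pick_train_device()
    model.to(device)
    # Step 1 of the workaround: no torch.compile(model, backend="hpu_backend") here.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # placeholder loss, assumed for illustration

    model.train()
    for x, y in train_batches:
        optimizer.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        if htcore is not None:
            htcore.mark_step()  # flush the accumulated graph in lazy mode
        optimizer.step()
        if htcore is not None:
            htcore.mark_step()

    # Step 2 of the workaround: move the model to CPU before switching to eval mode.
    model.to("cpu")
    model.eval()
    losses = []
    with torch.no_grad():
        for x, y in eval_batches:
            losses.append(loss_fn(model(x), y).item())
    return sum(losses) / len(losses)
```

The key point is the order at the end: the model is moved to CPU first, so the forward pass in eval mode never runs on HPU.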
Debugging Information
The same errors appear even when the script runs only the training part.
The code works well on CPU.
After removing model = torch.compile(model, backend="hpu_backend"), we encounter a different error: RuntimeError: synStatus=1 [Invalid argument] Node reshape failed.
However, it is possible to run the code with only training after removing model = torch.compile(model, backend="hpu_backend").
Analysis: After removing model = torch.compile(model, backend="hpu_backend"), the error seems to appear only when we call model.forward() after model.eval(). The forward pass works while the model is in training mode, i.e., after model.train().
Our workaround above also supports this analysis.
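To check this analysis in isolation, one could run the same forward pass in both modes and record which one fails. The helper below is a hypothetical sketch, not part of our script: the name forward_in_both_modes is made up, and it assumes the model returns a single tensor and that the model and the input x are already on the same device.

```python
import torch
import torch.nn as nn


def forward_in_both_modes(model, x):
    """Run model(x) in train and eval mode; map each mode to None or the error text."""
    results = {}
    for mode in ("train", "eval"):
        model.train(mode == "train")  # model.train(False) is equivalent to model.eval()
        try:
            out = model(x)
            # Copying to CPU forces execution, so deferred errors surface here.
            out.detach().cpu()
            results[mode] = None
        except RuntimeError as err:
            results[mode] = str(err)
    return results
```

If the analysis is right, running this on HPU with our model should give None for "train" and the "Node reshape failed" message for "eval".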