Gaudi1 HPU doesn't support long?

I was trying to train YOLOv7 on a custom dataset on Gaudi1 with SynapseAI 1.14.
I found an error when computing the loss.

gain = torch.ones(7, device=targets.device).long()  # normalized to gridspace gain
gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

p[i].shape is something like:

torch.Size([12, 3, 80, 80, 7])

and the resulting gain is:

tensor([ 1,  1, 80,  0, 80,  0,  1], device='hpu:0')

which should be:

tensor([ 1,  1, 80,  80, 80, 80,  1], device='hpu:0')

To reproduce:

import torch
import habana_frameworks.torch.core as htcore
gain = torch.ones(7, device="hpu", dtype=torch.int64)
a = torch.tensor([[[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]],
                  [[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]]], dtype=torch.int64)  # a.shape == (2, 3, 4)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int64)[[2, 1, 2, 1]]
print(gain)

If I change int64 to int32, the output is correct.
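
For reference, a minimal sketch of that int32 variant (same repro as above, only the dtype changed; the expected values follow from a.shape == (2, 3, 4)):

import torch
import habana_frameworks.torch.core as htcore

# same repro, but with int32 instead of int64
gain = torch.ones(7, device="hpu", dtype=torch.int32)
a = torch.tensor([[[10, 22, 32, 234]] * 3] * 2, dtype=torch.int32)  # shape (2, 3, 4)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int32)[[2, 1, 2, 1]]
print(gain)  # expected: tensor([1, 1, 4, 3, 4, 3, 1], device='hpu:0', dtype=torch.int32)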

Hi,

Can you try adding the device for the "a.shape" tensor as well:

torch.tensor(a.shape, dtype=torch.int64, device="hpu")

import torch
import habana_frameworks.torch.core as htcore
gain = torch.ones(7, device="hpu", dtype=torch.int64)
a = torch.tensor([[[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]],
                  [[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]]], dtype=torch.int64)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int64, device="hpu")[[2, 1, 2, 1]]
print(gain)
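
With the device specified, the shape tensor is created on the HPU and the fancy indexing gives the expected values (a.shape is (2, 3, 4), so [[2, 1, 2, 1]] selects 4, 3, 4, 3):

tensor([ 1,  1,  4,  3,  4,  3,  1], device='hpu:0')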

Thanks

Hi Sayantan_S,
Thanks for the reply.
Yes, I tried adding the device to the tensor, got the expected result, and was able to start training.

However, the loss increases drastically every iteration, and an error occurred, though training keeps running.

Did I miss something?

Is it possible to share a minimal script that repros the issue?

If not, can you see if you get the same error when you run it with int32 instead of int64?

Sorry for the late reply.

I made some changes from CUDA to HPU and removed TensorBoard, etc.

I tried changing int64 to int32; training started, but I got loss: nan where it should have been a very large number.
After that I went back to int64 and wrapped the parameters that contain tensors in torch.tensor; the loss looks normal at the beginning, but it turns to nan after 20 epochs.

I tried to validate the models, but they can't detect any objects.
So I think my problem hasn't been solved yet.

I'll take a look at your issue. In the meantime, if your use case allows it, YOLOX has been optimized and tested for Habana: Model-References/PyTorch/computer_vision/detection/yolox at a84f22b621af0d8e0502bd7997252bfeb513dda3 · HabanaAI/Model-References · GitHub

Can you point to where in your repo you have the int64 tensor from your original question, so it's easier for me to look into?

Yes, I have tested YOLOX, and the most significant distinction between them lies in how the loss is computed.
YOLOX calculates the loss directly from the output during training (inference), whereas YOLOv7 computes it based on its prediction results. My goal is to evaluate the performance differences between these algorithms when deployed on HABANA.
This is the reason I am experimenting with training YOLOv7, and also v8 and v9.

It's in yolov7/utils/loss.py, in class ComputeLoss, method build_targets, line 532:

gain[2:6] = torch.tensor(p[i].shape, device=torch.device("hpu"), dtype=torch.int64)[[3, 2, 3, 2]]  # xyxy gain
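
As a self-contained illustration of that change (p_i here is a dummy stand-in for p[i] with the shape from my earlier post):

import torch
import habana_frameworks.torch.core as htcore

p_i = torch.zeros(12, 3, 80, 80, 7, device="hpu")       # stands in for p[i]
gain = torch.ones(7, device="hpu", dtype=torch.int64)   # normalized to gridspace gain
gain[2:6] = torch.tensor(p_i.shape, device=torch.device("hpu"), dtype=torch.int64)[[3, 2, 3, 2]]  # xyxy gain
print(gain)  # should now be tensor([1, 1, 80, 80, 80, 80, 1], device='hpu:0')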

I've also changed some parameters to fit the Habana HPU setup.
source code:

Good news:
I am able to train YOLOv8 on Gaudi1.
The v8 package supports training v5, v6, v8, and v9.
I plan to test the remaining versions soon.

@LilRay, regarding the long usage from your original post: if you have issues with int64, you can try setting PT_ENABLE_INT64_SUPPORT=1. With the flag, I validated that the original long indexing issue you posted does not show up.

Our documentation will be updated to explain its usage.
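
For example, assuming the variable has to be in the environment before the Habana modules are imported, the repro from the original post could be run like this (a sketch, not an official snippet):

import os
os.environ["PT_ENABLE_INT64_SUPPORT"] = "1"  # set before importing the Habana bridge (assumption)

import torch
import habana_frameworks.torch.core as htcore

gain = torch.ones(7, device="hpu", dtype=torch.int64)
a = torch.zeros(2, 3, 4, dtype=torch.int64)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int64)[[2, 1, 2, 1]]
print(gain)  # with the flag enabled, this should print tensor([1, 1, 4, 3, 4, 3, 1], device='hpu:0')

Setting it on the command line when launching training (e.g. PT_ENABLE_INT64_SUPPORT=1 python train.py ...) should have the same effect.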

@Sayantan_S, Many thanks, I’ll try this env variable later.