Gaudi1 HPU doesn't support long?

I was trying to train YOLOv7 on a custom dataset on Gaudi1 with SynapseAI 1.14.
I found an error when computing the loss.

gain = torch.ones(7, device=targets.device).long()  # normalized to gridspace gain
gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

p[i].shape is something like:

torch.Size([12, 3, 80, 80, 7])

and the resulting gain is:

tensor([ 1,  1, 80,  0, 80,  0,  1], device='hpu:0')

which should be:

tensor([ 1,  1, 80,  80, 80, 80,  1], device='hpu:0')

To reproduce:

import torch
import habana_frameworks.torch.core as htcore
gain = torch.ones(7, device="hpu", dtype=torch.int64)
a = torch.tensor([[[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]],
                  [[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]]], dtype=torch.int64)  # a.shape == (2, 3, 4)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int64)[[2, 1, 2, 1]]
print(gain)

If I change int64 to int32, the output is correct.
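
For reference, a minimal sketch of that int32 variant (same repro as above, only the dtype changed; the expected values follow from a.shape == (2, 3, 4)):

import torch
import habana_frameworks.torch.core as htcore

# same repro, but with int32 instead of int64
gain = torch.ones(7, device="hpu", dtype=torch.int32)
a = torch.tensor([[[10, 22, 32, 234]] * 3] * 2, dtype=torch.int32)  # shape (2, 3, 4)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int32)[[2, 1, 2, 1]]
print(gain)  # expected: tensor([1, 1, 4, 3, 4, 3, 1], device='hpu:0', dtype=torch.int32)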

Hi,

Can you try adding the device for the "a.shape" tensor as well:

torch.tensor(a.shape, dtype=torch.int64, device="hpu")

import torch
import habana_frameworks.torch.core as htcore
gain = torch.ones(7, device="hpu", dtype=torch.int64)
a = torch.tensor([[[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]],
                  [[10, 22, 32, 234], [10, 22, 32, 234], [10, 22, 32, 234]]], dtype=torch.int64)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int64, device="hpu")[[2, 1, 2, 1]]
print(gain)
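
With the device specified, the shape tensor is created on the HPU and the fancy indexing gives the expected values (a.shape is (2, 3, 4), so [[2, 1, 2, 1]] selects 4, 3, 4, 3):

tensor([ 1,  1,  4,  3,  4,  3,  1], device='hpu:0')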

Thanks

Hi Sayantan_S,
Thanks for the reply.
Yes, I tried adding the device to the tensor, got the expected result, and was able to start training.

However, the loss increases drastically every iteration, and an error occurred, though training keeps running.

Did I miss something?

Is it possible to share a minimal script that repros the issue?

If not, can you see if you get the same error when you run it with int32 instead of int64?

Sorry for the late reply.

I made some changes from CUDA to HPU and removed TensorBoard, etc.

I tried changing int64 to int32; training started, but I got loss: nan where it should have been a very large number.
After that I went back to int64 and wrapped the parameters that contain tensors in torch.tensor; the loss looks normal at the beginning, but it turns to nan after 20 epochs.

I tried to validate the models, but they can't detect any objects.
So I think my problem hasn't been solved yet.

I'll take a look at your issue. In the meantime, if your use case allows it, YOLOX has been optimized and tested for Habana: Model-References/PyTorch/computer_vision/detection/yolox at a84f22b621af0d8e0502bd7997252bfeb513dda3 · HabanaAI/Model-References · GitHub

Can you point to where in your repo you have the int64 tensor from your original question, so it's easier for me to look into?

Yes, I have tested YOLOX, and the most significant distinction between them lies in how the loss is computed.
YOLOX calculates the loss directly from the output during training (inference), whereas YOLOv7 computes it based on its prediction results. My goal is to evaluate the performance differences between these algorithms when deployed on HABANA.
This is the reason I am experimenting with training YOLOv7, and also v8 and v9.

It's in yolov7/utils/loss.py, in class ComputeLoss, method build_targets, line 532:

gain[2:6] = torch.tensor(p[i].shape, device=torch.device("hpu"), dtype=torch.int64)[[3, 2, 3, 2]]  # xyxy gain
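
As a self-contained illustration of that change (p_i here is a dummy stand-in for p[i] with the shape from my earlier post):

import torch
import habana_frameworks.torch.core as htcore

p_i = torch.zeros(12, 3, 80, 80, 7, device="hpu")       # stands in for p[i]
gain = torch.ones(7, device="hpu", dtype=torch.int64)   # normalized to gridspace gain
gain[2:6] = torch.tensor(p_i.shape, device=torch.device("hpu"), dtype=torch.int64)[[3, 2, 3, 2]]  # xyxy gain
print(gain)  # should now be tensor([1, 1, 80, 80, 80, 80, 1], device='hpu:0')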

I've also changed some parameters to fit the Habana HPU setup.
source code:

Good news:
I am able to train YOLOv8 on Gaudi1.
The v8 package supports training v5, v6, v8, and v9.
I plan to test the remaining versions soon.

@LilRay, regarding the long usage from your original post: if you have issues with int64, you can try setting PT_ENABLE_INT64_SUPPORT=1. With the flag, I validated that the original long indexing issue you posted does not show up.

Our documentation will be updated to explain its usage.
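
For example, assuming the variable has to be in the environment before the Habana modules are imported, the repro from the original post could be run like this (a sketch, not an official snippet):

import os
os.environ["PT_ENABLE_INT64_SUPPORT"] = "1"  # set before importing the Habana bridge (assumption)

import torch
import habana_frameworks.torch.core as htcore

gain = torch.ones(7, device="hpu", dtype=torch.int64)
a = torch.zeros(2, 3, 4, dtype=torch.int64)
gain[2:6] = torch.tensor(a.shape, dtype=torch.int64)[[2, 1, 2, 1]]
print(gain)  # with the flag enabled, this should print tensor([1, 1, 4, 3, 4, 3, 1], device='hpu:0')

Setting it on the command line when launching training (e.g. PT_ENABLE_INT64_SUPPORT=1 python train.py ...) should have the same effect.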

@Sayantan_S, Many thanks, I’ll try this env variable later.