Gaudi Torch Cummax

Running torch.cummax on HPU fails. The snippet below works fine on CPU, but raises a RuntimeError when run on HPU.

torch.cummax(torch.range(10,0,-1)[None,:].repeat(5,1).to('hpu'), dim=1)
*** RuntimeError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

On CPU I get the expected results.

torch.cummax(torch.range(10,0,-1)[None,:].repeat(5,1).to('cpu'), dim=1)
torch.return_types.cummax(
values=tensor([[10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
        [10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
        [10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
        [10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
        [10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.]]),
indices=tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]))
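As a side note, torch.range is deprecated in recent PyTorch releases; an equivalent CPU repro with torch.arange would look like this (a sketch, not part of the original report):

```python
import torch

# Equivalent repro using torch.arange instead of the deprecated torch.range;
# arange(10, -1, -1) produces the same 11 values 10, 9, ..., 0.
x = torch.arange(10, -1, -1, dtype=torch.float32)[None, :].repeat(5, 1)
result = torch.cummax(x, dim=1)
# Every running maximum is 10.0 and first occurs at index 0,
# matching the CPU output shown above.
print(result.values)
print(result.indices)
```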

I’m running this on an AWS DL1 instance with the Habana Gaudi AMI. I’ve also manually installed torch_hpu following the instructions in the HabanaAI/Setup_and_Install GitHub repository.

Hi @SohrabAndaz

Thank you for reporting the issue. We’ll update this thread once it is fixed in a future release. In the meantime, you can fall back to the CPU for the cummax call, along these lines:

x = ... # x is computed on HPU
y = torch.cummax(x.to('cpu'), dim=1) # cummax runs on CPU; y is a (values, indices) namedtuple
values = y.values.to('hpu') # move each result tensor back to HPU individually
indices = y.indices.to('hpu') # so that further computations happen on HPU again
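A self-contained sketch of this fallback (the tensor here lives on CPU so the snippet runs anywhere; on a Gaudi machine, uncomment the final moves back to HPU):

```python
import torch

# A CPU tensor stands in for an HPU tensor so the sketch is runnable anywhere.
x = torch.tensor([[1., 3., 2.], [2., 1., 3.]])

# cummax returns a (values, indices) namedtuple, so each tensor
# must be moved back to the accelerator individually.
values, indices = torch.cummax(x.to('cpu'), dim=1)
# values = values.to('hpu')   # uncomment on a Gaudi machine
# indices = indices.to('hpu')
```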

I can do that… In general, how long should I expect to wait for this fix? Weeks, or months?

-Sohrab Andaz

In case the CPU fallback is too slow, here is another possible workaround that implements cummax using max:

import torch
import habana_frameworks.torch.core as htcore

x = torch.tensor([[1, 3, 2], [2, 1, 3]])
y = torch.cummax(x, 1)  # reference result, computed on CPU

x = x.to('hpu')

# Tile each row n times (n = x.shape[1]) so that row i of each tile
# can be masked down to its first i+1 elements.
rsp = [x.shape[0], x.shape[1], x.shape[1]]
x_tiled = torch.tile(x, [1, x.shape[1]]).reshape(rsp)

# Lower-triangular mask; note that masking with zeros assumes the
# inputs are non-negative. Created on HPU to match x_tiled's device.
tril = torch.tril(torch.ones([x.shape[1], x.shape[1]])).to('hpu')
tril_tiled = torch.tile(tril, [x.shape[0], 1]).reshape(rsp)
mul = x_tiled * tril_tiled
result_replacement = torch.max(mul, 2)  # max over each masked row == cummax

# Move the HPU results to CPU before comparing with the CPU reference.
assert torch.all(result_replacement.values.to('cpu') == y.values)
assert torch.all(result_replacement.indices.to('cpu') == y.indices)
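The same trick can be wrapped in a small, CPU-testable helper (the function name is my own, and the zero-masking again assumes non-negative inputs):

```python
import torch

def cummax_via_max(x):
    # Emulate torch.cummax(x, dim=1) with tiling plus a lower-triangular
    # mask; masking with zeros assumes the inputs are non-negative.
    n = x.shape[1]
    rsp = [x.shape[0], n, n]
    x_tiled = torch.tile(x, [1, n]).reshape(rsp)
    tril = torch.tril(torch.ones(n, n, dtype=x.dtype))
    mul = x_tiled * torch.tile(tril, [x.shape[0], 1]).reshape(rsp)
    return torch.max(mul, 2)  # (values, indices), matching cummax

x = torch.tensor([[1., 3., 2.], [2., 1., 3.]])
ref = torch.cummax(x, 1)
out = cummax_via_max(x)
assert torch.equal(out.values, ref.values)
assert torch.equal(out.indices, ref.indices)
```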

You can get an idea of our past release cadence from the announcements here.