Error related to complex torch indexing on HPU only

I’m getting a strange error when performing a complex index assignment:

RuntimeError: expand(HPUBFloat16Type{[2, 82, 4096]}, size=[265897904, 4096]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)implicit = 0
> /fooformers/fooformers.py(81)add_memories()
     79         barf_if_nans(new_memories)
     80         self.memory = self.memory.clone()
---> 81         self.memory[helper, slots] = new_memories.detach()
     82         self.lru[helper, slots] = torch.maximum(
     83             self.lru.mean(dim=1),

(That 265897904 number isn’t consistent between reruns.)
Sizes:

ipdb> p self.memory.size()
torch.Size([2, 1024, 4096])
ipdb> p helper.size()
torch.Size([2, 82])
ipdb> p slots.size()
torch.Size([2, 82])
ipdb> p new_memories.size()
torch.Size([2, 82, 4096])

The indices are all well within bounds:

ipdb> p torch.min(helper)
tensor(0, device='hpu:0')
ipdb> p torch.max(helper)
tensor(1, device='hpu:0')
ipdb> p torch.min(slots)
tensor(0, device='hpu:0')
ipdb> p torch.max(slots)
tensor(81, device='hpu:0')

This operation is valid according to the rules torch specifies, and if I send the tensors to the CPU, the operation completes as expected:

ipdb> memory_cpu = self.memory.to("cpu")
ipdb> helper_cpu = helper.to("cpu")
ipdb> slots_cpu = slots.to("cpu")
ipdb> new_memories_cpu = new_memories.to("cpu")
ipdb> memory_cpu[helper_cpu, slots_cpu] = new_memories_cpu
# (no error)
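
The shape rule behind that claim can be checked on CPU alone. When two index tensors of the same shape address the first two dimensions, the selected region has shape index_shape + trailing_dims, which has to match (or broadcast from) the assigned value. This is a minimal sketch: the index values are placeholders chosen for illustration, not the real helper and slots contents.

```python
import torch

# Same shapes as in the debugger session above; values are illustrative only.
memory = torch.zeros(2, 1024, 4096)
helper = torch.arange(2).unsqueeze(1).expand(2, 82)   # assumed: helper[b, s] == b
slots = torch.arange(82).unsqueeze(0).expand(2, 82)   # assumed: slots[b, s] == s
new_memories = torch.ones(2, 82, 4096)

# Indexing dims 0 and 1 with two [2, 82] tensors selects a [2, 82, 4096] region.
selected_shape = memory[helper, slots].shape
assert selected_shape == new_memories.shape

# The assignment therefore needs no broadcasting at all.
memory[helper, slots] = new_memories
```

Since the shapes already agree, the expand to a 2-D size reported in the error appears to come from the HPU backend rather than from the indexing rules.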

When I try to do the operation on the HPU in ipdb, I get the following backtrace:

ipdb> self.memory[helper, slots] = new_memories
*** RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details
expand(HPUBFloat16Type{[2, 82, 4096]}, size=[265897904, 4096]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)implicit = 0
Exception raised from AllocateAndAddSynapseNode at /npu-stack/pytorch-integration/habana_kernels/tensor_shape_kernels.cpp:691 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f4e4396166c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f4e439169f0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: BroadcastOperator::AllocateAndAddSynapseNode(synapse_helpers::graph&, std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::vector<habana::OutputMetaData, std::allocator<habana::OutputMetaData> > const&) + 0x96d (0x7f4ee3f45aed in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #3: habana::IndexPutOperator::AllocateAndAddSynapseNodeNonBoolIndices(synapse_helpers::graph&, std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::vector<habana::OutputMetaData, std::allocator<habana::OutputMetaData> > const&) + 0xb86 (0x7f4ee3d40356 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: habana::IndexPutOperator::AllocateAndAddSynapseNode(synapse_helpers::graph&, std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::vector<habana::OutputMetaData, std::allocator<habana::OutputMetaData> > const&) + 0x17b (0x7f4ee3d4885b in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #5: habana::HabanaLaunchOpPT::BuildSynapseGraph(std::shared_ptr<synapse_helpers::graph>&, bool) + 0x1c6e (0x7f4ede5ac07e in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so)
frame #6: habana::HabanaLaunchOpPT::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::optional<std::vector<at::Tensor, std::allocator<at::Tensor> > >, bool) + 0x83b (0x7f4ede5badab in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so)
frame #7: <unknown function> + 0xda0f37 (0x7f4ee3fabf37 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #8: habana_lazy::exec::HlExec::Launch(std::vector<c10::IValue, std::allocator<c10::IValue> >&, c10::hpu::HPUStream const&, bool) + 0x82c (0x7f4ee3faef3c in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #9: LaunchSyncTensorsGraph(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&) + 0x4b7 (0x7f4ee3f845e7 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #10: std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<habana_helpers::ThreadPoolBase<habana_helpers::BlockingQueue>::enqueue<void (&)(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&), LaunchTensorsInfo, LaunchEagerInfo, LaunchStreamInfo>(void (&)(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&), LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&)::{lambda()#1}, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&) + 0x40 (0x7f4ee3f8b820 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #11: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 0x2d (0x7f4ee3cbffad in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #12: <unknown function> + 0x114df (0x7f4ef4c524df in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #13: std::__future_base::_Task_state<habana_helpers::ThreadPoolBase<habana_helpers::BlockingQueue>::enqueue<void (&)(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&), LaunchTensorsInfo, LaunchEagerInfo, LaunchStreamInfo>(void (&)(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&), LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&)::{lambda()#1}, std::allocator<int>, void ()>::_M_run() + 0x10a (0x7f4ee3f8baea in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #14: habana_helpers::ThreadPoolBase<habana_helpers::BlockingQueue>::executePendingTask(std::packaged_task<void ()>&&) + 0x38 (0x7f4ee3cd0218 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #15: habana_helpers::ThreadPoolBase<habana_helpers::BlockingQueue>::main_loop() + 0x124 (0x7f4ee3cd0eb4 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #16: <unknown function> + 0xd6df4 (0x7f4ef4972df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #17: <unknown function> + 0x8609 (0x7f4ef4c49609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #18: clone + 0x43 (0x7f4ef4d83133 in /lib/x86_64-linux-gnu/libc.so.6)

(There isn’t any useful additional information in $HABANA_LOGS/)

This is on a freshly-provisioned Intel Developer Cloud Gaudi 2 instance that is running SynapseAI 1.13.0:

+ ---------------------------------------------------------------------- +
| Version:            1.13.0                                             |
| Synapse:            6599d95d6                                          |
| HCL:                decb342d                                           |
| MME:                1556117                                            |
| SCAL:               f750a52                                            |
| Description:        HabanaLabs Runtime and GraphCompiler               |
| Time:               2024-01-03 07:40:54.874212                         |
+ ---------------------------------------------------------------------- +

Is this an issue in the Habana stack, or am I doing something wrong?

I distilled the issue into a short script:

import torch
import habana_frameworks.torch.core as htcore

DIM_0, DIM_1, DIM_2 = 2, 3, 4
SUB_DIM_1 = 2
OFFSET = 1000

def test(device):
    dest = torch.arange(DIM_0 * DIM_1 * DIM_2, device=device).reshape(DIM_0, DIM_1, DIM_2)
    helper = torch.arange(DIM_0, device=device).expand(SUB_DIM_1, DIM_0).transpose(0, 1)
    slots = torch.arange(SUB_DIM_1, device=device).expand(DIM_0, SUB_DIM_1)
    data = torch.arange(OFFSET, OFFSET + DIM_0 * SUB_DIM_1 * DIM_2, device=device).reshape(DIM_0, SUB_DIM_1, DIM_2)

    pre_dest = dest.clone()
    dest[helper, slots] = data

    return pre_dest, dest

print("Testing CPU:")
cpu_res = test("cpu")
print(cpu_res[0])
print(cpu_res[1])

print("\n\n\n\n\n")

# Crashes with SynapseAI 1.13.0 on a fresh IDC Gaudi 2 instance

print("Testing HPU:")
hpu_res = test("hpu")
print(hpu_res[0])
print(hpu_res[1])

Thanks for posting the issue and a minimal repro script. We’ll take a look and get back to you.

This seems to be a bug on the HPU.

Posting shapes and values for easy reference.

Shapes:
# data: 2x2x4
# helper, slots: 2x2
# dest: 2x3x4

'''
dest
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
helper
tensor([[0, 0],
        [1, 1]])
slots
tensor([[0, 1],
        [0, 1]])
data
tensor([[[1000, 1001, 1002, 1003],
         [1004, 1005, 1006, 1007]],

        [[1008, 1009, 1010, 1011],
         [1012, 1013, 1014, 1015]]])

after line dest[helper, slots] = data:
dest
tensor([[[1000, 1001, 1002, 1003],
         [1004, 1005, 1006, 1007],
         [   8,    9,   10,   11]],

        [[1008, 1009, 1010, 1011],
         [1012, 1013, 1014, 1015],
         [  20,   21,   22,   23]]])
'''

For the minimal repro, here is one possible workaround:

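# Scatter along dim 0: every element of data[i] is written into dest[i] at the same (j, k).
# This matches dest[helper, slots] = data only because slots is [0, 1] in the repro.
idx = torch.tensor([[[0,0,0,0], [0,0,0,0]], [[1,1,1,1], [1,1,1,1]]], device=device)
dest = dest.scatter(0, idx, data)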

Scatter (rather than advanced indexing) seems to work on HPU.
The only issue is converting the helper and slots tensors into an appropriate index tensor for the scatter.
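
One way that conversion could look, as a minimal sketch rather than a verified HPU fix: if helper[b, s] == b (true in the repro, and an assumption for any other workload), the assignment only moves rows within each batch entry, so slots can be repeated across the last dimension and used as a scatter index along dim 1. The function name put_rows_with_scatter is made up for illustration, and the check below runs on CPU only.

```python
import torch

def put_rows_with_scatter(dest, slots, data):
    """Out-of-place equivalent of dest[helper, slots] = data.

    Assumption: helper[b, s] == b, as in the minimal repro.
    Shapes: dest [B, N, D], slots [B, S], data [B, S, D].
    """
    # Repeat each slot number across the last dim so the index matches data's shape.
    index = slots.unsqueeze(-1).expand(-1, -1, dest.size(-1))
    # Along dim 1: out[b, index[b, s, d], d] = data[b, s, d]
    return dest.scatter(1, index, data)

# CPU check against the tensors from the repro script.
dest = torch.arange(24).reshape(2, 3, 4)
helper = torch.arange(2).expand(2, 2).transpose(0, 1)
slots = torch.arange(2).expand(2, 2)
data = torch.arange(1000, 1016).reshape(2, 2, 4)

expected = dest.clone()
expected[helper, slots] = data
result = put_rows_with_scatter(dest, slots, data)
```

If helper can point at a different batch entry than its own row, this shortcut no longer applies and the index has to encode both dimensions.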

Another workaround is to do the dest[helper, slots] = data operation on the CPU:

data = data.to('cpu')
helper = helper.to('cpu')
slots = slots.to('cpu')
dest = dest.to('cpu')
dest[helper, slots] = data
data = data.to('hpu')
helper = helper.to('hpu')
slots = slots.to('hpu')
dest = dest.to('hpu')

Though this will likely be slow, so I’d recommend a scatter-based workaround for better speed.

Let us know if the workaround is feasible.

Thanks! I haven’t had a chance to try it yet (hopefully in the coming days); instances are very expensive . . .

This code is in the hot path of the training loop, so moving to CPU is not viable. I’m worried that the scatter will also be slower: in the original formulation, we are effectively scattering rows, so (in theory) the kernel can copy a whole row of 4096 elements at once. With the scatter operation built in to torch, we’re forced to specify indices for each element, so even if we fill up that last dim with arange, the kernel can’t optimize over that fairly large last dim.
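
A row-oriented alternative that might be worth trying, sketched under assumptions: flatten the first two dimensions of dest, turn each (helper, slots) pair into a single row number, and copy whole rows with index_copy_. The function name put_rows_with_index_copy is hypothetical, and whether index_copy_ is supported or fast on HPU has not been verified here; the check below runs on CPU only.

```python
import torch

def put_rows_with_index_copy(dest, helper, slots, data):
    """Out-of-place equivalent of dest[helper, slots] = data, one whole row per index.

    Shapes: dest [B, N, D], helper and slots [B, S], data [B, S, D].
    Like advanced indexing, duplicate (helper, slots) pairs give an unspecified winner.
    """
    num_slots, width = dest.size(1), dest.size(2)
    # Row number of (b, n) in the flattened [B * N, D] view is b * N + n.
    flat_rows = (helper * num_slots + slots).reshape(-1)
    # Copy first so the caller's tensor is left untouched.
    flat_dest = dest.reshape(-1, width).clone()
    flat_dest.index_copy_(0, flat_rows, data.reshape(-1, width))
    return flat_dest.view_as(dest)

# CPU check with the repro's sizes and a non-trivial slot pattern.
dest = torch.arange(24).reshape(2, 3, 4)
helper = torch.arange(2).expand(2, 2).transpose(0, 1)
slots = torch.tensor([[2, 0], [1, 2]])
data = torch.arange(1000, 1016).reshape(2, 2, 4)

expected = dest.clone()
expected[helper, slots] = data
result = put_rows_with_index_copy(dest, helper, slots, data)
```

The index here has one entry per row rather than one per element, so it is 4096 times smaller than the scatter index for the original sizes.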

Do you have any idea when a fix could be available? (Weeks, months?) Thanks again.

Fixes can ship only when a new version is released, and releases follow their own cadence. Once this is fixed in a future release, we’ll post here. In the meantime, you will unfortunately have to work around it with something like scatter.