Out-of-order addressing

Brief: we are writing a TPC kernel and want to use a vector as subscripts into a tensor, i.e., gather values from the tensor out of order.

We want to port this function (a device subfunction in CUDA) to a TPC kernel. The function snippet is shown below:

float3 xyz_unit = apply_contraction(
    xyz, roi_min, roi_max, type);
int idx = grid_idx_at(xyz_unit, grid_res);
return grid_value[idx];
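For context, the operation this snippet performs is a gather: each thread computes one index and does one lookup. A minimal host-side sketch in plain C (function and variable names here are illustrative, not from the CUDA source) makes the access pattern explicit:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar gather: out[i] = grid_value[idx[i]].
 * In CUDA each thread resolves a single idx independently; on TPC
 * the whole idx vector must be resolved at once, which is where
 * non-contiguous (out-of-order) addressing becomes the problem. */
static void gather_f32(float *out, const float *grid_value,
                       const long long *idx, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = grid_value[idx[i]];
}
```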

This runs in a single CUDA thread, but we want to run it in parallel in a TPC kernel. There, idx becomes a vector of int64 values, and an int64 vector cannot be used directly as a subscript into an array or tensor.
My question is: is there any function or mechanism that makes it possible to gather a vector of values using an int64 vector of address offsets that are NOT contiguous?

We can't do out-of-order addressing in a single kernel, but it works if the subscripts are stored to a tensor and then loaded back in a succeeding kernel.
Basically, in the first kernel you save the indices to a tensor. Then, in the second kernel, use gen_addr to create addresses from the index coordinates and load the tensor using the coordinates you just created, for example:

// Build a global address into the index tensor, then load the stored index.
__global__ int* idx_coord_ptr_1 = (__global__ int*) gen_addr(idx_coord_1, indices_tensor);
in_coord_1[1] = s_i32_ld_g(idx_coord_ptr_1);
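To illustrate how the two kernels fit together, here is a host-side sketch in plain C of the same pattern, with ordinary loads and stores standing in for the TPC intrinsics (all function and tensor names below are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* "Kernel" 1: compute the indices and store them to an
 * intermediate tensor, instead of using them as subscripts. */
static void kernel1_store_indices(int *indices_tensor,
                                  const int *computed_idx, size_t n) {
    for (size_t i = 0; i < n; ++i)
        indices_tensor[i] = computed_idx[i];
}

/* "Kernel" 2: load each stored index back (gen_addr + s_i32_ld_g
 * in TPC-C) and use it as a coordinate into the value tensor. */
static void kernel2_gather(float *out, const float *grid_value,
                           const int *indices_tensor, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int idx = indices_tensor[i];   /* s_i32_ld_g(gen_addr(...)) */
        out[i] = grid_value[idx];      /* load via the loaded coordinate */
    }
}
```

The key design point is that the second kernel only ever sees the indices as data loaded from a tensor, so the out-of-order access is expressed as per-element address generation rather than vector subscripting.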

You can get some details in sparse_lengths_sum_bf16_2D_f32_embed.c in the HabanaAI/Habana_Custom_Kernel repository on GitHub.