- We use the example matrix-matrix multiplication TPC kernel (`matrix_mul_fwd_f32.c` from the HabanaAI/Habana_Custom_Kernel repository on GitHub). We want to measure its run time and compute its TFLOPS with the code below. The measured run time is very small and does not change with the input size, so we suspect the op call returns immediately without waiting for the kernel to finish.
```python
import time

import torch
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore


def test_custom_mm_op_function(in0, in1):
    out0 = torch.ops.custom_op.custom_mm(in0, in1)
    return out0


# Warm-up call so graph compilation is excluded from the measurement.
# in0_dense, in1 and num_exps are defined earlier in our script.
out2 = test_custom_mm_op_function(in0_dense, in1)
htcore.mark_step()
ht.hpu.synchronize()

start_event = ht.hpu.Event(enable_timing=True)
end_event = ht.hpu.Event(enable_timing=True)

start_event.record()
for _ in range(num_exps):
    out2 = test_custom_mm_op_function(in0_dense, in1)
    htcore.mark_step()
ht.hpu.synchronize()
end_event.record()
end_event.synchronize()

tpc_mm_time = start_event.elapsed_time(end_event)  # milliseconds
tpc_mm_time = tpc_mm_time / 1000.0                 # seconds
```
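Once `tpc_mm_time` holds the elapsed time in seconds, the achieved throughput follows from the matrix shapes. A minimal sketch; the values of `M`, `K`, `N`, `num_exps`, and `tpc_mm_time` below are placeholders, not our actual measurements:

```python
# Achieved TFLOPS for num_exps repetitions of a dense (M x K) @ (K x N) multiply.
# All values below are placeholders; substitute the real test sizes and timing.
M, K, N = 1024, 1024, 1024
num_exps = 100
tpc_mm_time = 0.35  # seconds, as measured with the HPU events above

flops_per_mm = 2.0 * M * K * N          # one multiply + one add per MAC
total_flops = flops_per_mm * num_exps   # the loop runs the op num_exps times
achieved_tflops = total_flops / tpc_mm_time / 1e12
print(f"{achieved_tflops:.3f} TFLOPS achieved")
```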
- What is the ideal TPC TFLOPS?
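The peak (ideal) number can be estimated as engines x MACs per cycle per engine x 2 FLOPs per MAC x clock frequency. The figures below are placeholders to show the arithmetic, not published Gaudi specifications; the real per-engine f32 MAC rate and clock would need to come from the hardware documentation:

```python
# Peak-TFLOPS estimate: engines * MACs/cycle/engine * 2 * clock_Hz.
# All values are placeholders; substitute the device's actual specs.
num_tpc_engines = 8     # placeholder: number of TPC engines on the device
macs_per_cycle = 64     # placeholder: f32 MACs per TPC per cycle
clock_hz = 1.5e9        # placeholder: TPC clock frequency
peak_tflops = num_tpc_engines * macs_per_cycle * 2 * clock_hz / 1e12
print(f"{peak_tflops:.2f} TFLOPS peak (with placeholder specs)")
```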