How to measure the TPC run time?

  1. We are using the example TPC kernel (the matrix-matrix multiplier in `matrix_mul_fwd_f32.c` from the HabanaAI/Habana_Custom_Kernel repository on GitHub). We want to measure its run time and calculate its TFLOPS. The following is our code. We find that the measured run time is very small and does not change with the input size, so we suspect the function call returns immediately without waiting for the kernel to finish.
    import time
    import torch
    import habana_frameworks.torch as ht
    import habana_frameworks.torch.core as htcore

    def test_custom_mm_op_function(in0, in1):
        out0 = torch.ops.custom_op.custom_mm(in0, in1)
        return out0

    # Warm-up call so graph compilation is not included in the measurement
    out2 = test_custom_mm_op_function(in0_dense, in1)
    htcore.mark_step()
    ht.hpu.synchronize()

    start_event = ht.hpu.Event(enable_timing=True)
    end_event = ht.hpu.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_exps):
        out2 = test_custom_mm_op_function(in0_dense, in1)
        htcore.mark_step()
    ht.hpu.synchronize()
    end_event.record()
    end_event.synchronize()
    tpc_mm_time = start_event.elapsed_time(end_event)  # milliseconds
    tpc_mm_time = tpc_mm_time / 1000.0                 # seconds

  2. What is the ideal TPC TFLOPS?
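For the TFLOPS calculation mentioned in question 1, a minimal sketch follows; the matrix shapes, iteration count, and measured time below are placeholder assumptions, not values from our run:

```python
# Hypothetical sketch: converting a measured time into achieved TFLOPS for an
# (M x K) @ (K x N) matrix multiply. All numbers here are placeholders;
# substitute your actual shapes, num_exps, and the tpc_mm_time measured above.
M, K, N = 1024, 1024, 1024   # matrix dimensions
num_exps = 100               # iterations timed in the loop
tpc_mm_time = 0.5            # total measured time for all iterations, seconds

flops_per_call = 2 * M * K * N           # one multiply and one add per MAC
total_flops = flops_per_call * num_exps
tflops = total_flops / tpc_mm_time / 1e12
print(f"Achieved: {tflops:.3f} TFLOPS")  # 0.429 for these placeholder numbers
```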

The easiest way to measure the TPC run time is to use the Habana profiler. Set the environment variable HABANA_PROFILE=1, and when you run your script a .hltv file will be generated. Then load that .hltv file at https://hltv.habana-labs.com/ and zoom in to see the TPC run time. Thanks.
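For reference, a minimal invocation; the script name is a placeholder for your own benchmark:

```shell
# Enable the Habana profiler for one run ("my_benchmark.py" is a placeholder).
HABANA_PROFILE=1 python my_benchmark.py
# The run produces a .hltv trace file in the working directory,
# which can then be loaded at https://hltv.habana-labs.com/
```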