I am currently analyzing a simple GEMM operation using torch.mm with the Intel Gaudi Software, as explained in the "Analysis" page of the Gaudi Documentation 1.17.0.
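For context, here is a minimal sketch of the kind of test described here. It is not the original script: the matrix sizes, the helper names `pick_device` and `run_gemm`, and the CPU fallback are illustrative assumptions. It assumes the Habana PyTorch bridge (`habana_frameworks`) is installed when run on Gaudi, and it leaves out the profiler setup, which is done as described in the documentation.

```python
import torch


def pick_device() -> torch.device:
    """Return the Gaudi device if the Habana bridge is importable, else CPU."""
    try:
        # Importing the bridge registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core  # noqa: F401
        return torch.device("hpu")
    except ImportError:
        # Assumption for illustration: fall back to CPU so the sketch still runs.
        return torch.device("cpu")


def run_gemm(m: int, k: int, n: int, device: torch.device) -> torch.Tensor:
    """Compute an (m x k) @ (k x n) product with torch.mm on the given device."""
    a = torch.randn(m, k, device=device)
    b = torch.randn(k, n, device=device)
    c = torch.mm(a, b)
    # Copying the result back to the host forces the GEMM to actually execute.
    return c.cpu()


if __name__ == "__main__":
    device = pick_device()
    # Hypothetical sizes; the post only says the GEMM is "sufficiently large".
    out = run_gemm(2048, 2048, 2048, device)
    print(device, tuple(out.shape))
```

Increasing `m`, `k`, and `n` is the knob that corresponds to running with a larger GEMM.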
After running my script on Gaudi 2, the profiling output shows 4 rows related to the MME: [D0] MME, [D1] MME, [D2] MME, and [D3] MME. When the GEMM is sufficiently large, all 4 rows show kernel executions named GEMM3_bundle_0/op_x. Specifically, all 4 rows show GEMM3_bundle_0/op_0 (with different slice indices) and, concurrently, GEMM3_bundle_0/op_1.
According to the white paper, Gaudi 2 has 2 MME units that can be used concurrently. However, 8 threads (or slices?) appear to be executing concurrently: 4 rows, each running op_0 and op_1 at the same time. Could you clarify what the [Dx] MME rows in the profiling tools represent, and what a slice means in GEMM execution?
Thanks.