About MME profiling results with Intel Gaudi Software

I am currently analyzing a simple GEMM operation using torch.mm with Intel Gaudi Software, as explained in (Analysis — Gaudi Documentation 1.17.0 documentation).
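For reference, my script boils down to something like the following minimal sketch (I enable the profiler separately, e.g. via the HABANA_PROFILE environment variable):

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

# Same sizes as the profiled run shown below
M = K = N = 16384

a = torch.randn(M, K, dtype=torch.bfloat16, device="hpu")
b = torch.randn(K, N, dtype=torch.bfloat16, device="hpu")

c = torch.mm(a, b)  # compiled by the Graph Compiler into MME GEMM kernels
htcore.mark_step()  # in lazy mode, flush the accumulated graph for execution
print(c.shape)
```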

After I run my script on Gaudi 2, there are 4 rows related to MME: [D0] MME, [D1] MME, [D2] MME, and [D3] MME. When the GEMM is sufficiently large, all 4 rows show kernel executions named GEMM3_bundle_0/op_x. In detail, all 4 rows show executions of GEMM3_bundle_0/op_0 (with different slice indices) and, concurrently, executions of GEMM3_bundle_0/op_1.

As explained in the white paper, Gaudi 2 has 2 MME units that can be used concurrently. However, in the profile, 8 threads (or slices?) are executing concurrently. Can you clarify what the [Dx] MME rows in the profiling tool mean, and also the notion of a slice in GEMM execution?

Thanks.

Here is a profiling result for a GEMM execution (M, K, N = 16384, 16384, 16384):

There are 2 logical MMEs, which actually consist of 4 physical MME dies. They can be configured in various modes (using FP8 as the example below):

Symmetric: H x W = 256 x 256

2xWide: H x W = 128 x 512

2xHigh: H x W = 512 x 128
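As a quick illustration of what these geometries mean for the 16384 x 16384 output from the question (the tile math below is my own back-of-envelope sketch, not the Graph Compiler's actual tiling algorithm): each geometry yields the same number of output tiles, just with different aspect ratios.

```python
# Illustration only: how the MME geometry shapes the output tiling.
M = N = 16384

geometries = {           # H x W output tile per MME pass (FP8 numbers above)
    "Symmetric": (256, 256),
    "2xWide":    (128, 512),
    "2xHigh":    (512, 128),
}

for name, (h, w) in geometries.items():
    rows, cols = M // h, N // w
    print(f"{name:9s}: {rows} x {cols} = {rows * cols} tiles of {h} x {w}")
```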

Those four rows [D0, D1, D2, D3] should correspond to those 4 physical MME dies. This level of detail is usually NOT needed for model performance profiling.

As for slices [op_0, op_1, …], these are used to pipeline data fetching (HBM → on-die SRAM) with computation. The on-die SRAM is much faster than HBM (e.g. >10 TB/s vs 2.45 TB/s) but has limited capacity (48 MB vs 96 GB). On Gaudi 2, we (the Graph Compiler) usually use the DMA engine to prefetch data from HBM to SRAM while the MME engine runs the GEMM on previously prefetched data in SRAM. The DMA engine (for data prefetch) and the MME engine can run concurrently, pipelining data loading and computation.
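A schematic of that double-buffering pattern (all names here are hypothetical stand-ins for illustration, not a real Gaudi API; in hardware, the prefetch and compute in each iteration overlap in time, and the ordering below only shows the dependency structure):

```python
NUM_SLICES = 8          # hypothetical slice count for one large GEMM

def dma_prefetch(slice_idx, buf):
    # Stand-in for the DMA engine copying slice data from HBM into SRAM.
    buf["slice"] = slice_idx

def mme_gemm(buf):
    # Stand-in for the MME engines computing on already-prefetched data.
    print(f"MME computes slice {buf['slice']}")

bufs = [{}, {}]                  # two SRAM buffers, used ping-pong style
dma_prefetch(0, bufs[0])         # warm-up: fetch slice 0 before computing
for i in range(NUM_SLICES):
    if i + 1 < NUM_SLICES:
        dma_prefetch(i + 1, bufs[(i + 1) % 2])  # DMA fetches the next slice...
    mme_gemm(bufs[i % 2])                       # ...while MME computes this one
```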

Due to the limited size of SRAM, for a large GEMM we (the Graph Compiler) have to split the GEMM into smaller ones so that the data for each smaller GEMM fits into SRAM. That's why you see those "slices" in the profile. You can observe this "pipeline" behavior by checking both the MME and DMA rows in the profile.
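To see why slicing is unavoidable at your size, here is the rough arithmetic (my own estimate, assuming BF16 inputs; the compiler's actual heuristics are more involved):

```python
M = K = N = 16384
elem = 2                        # bytes per BF16 element (assumed dtype)
sram = 48 * 2**20               # 48 MB on-die SRAM

a_bytes = M * K * elem          # 512 MB
b_bytes = K * N * elem          # 512 MB
print(f"inputs: {(a_bytes + b_bytes) / 2**20:.0f} MB vs SRAM: {sram / 2**20:.0f} MB")

# Even a single input matrix is ~10.7x larger than all of SRAM, so the
# GEMM must be streamed through SRAM in slices, with the DMA engine
# prefetching the next slice while the MME computes on the current one.
print(f"A alone / SRAM = {a_bytes / sram:.1f}x")
```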