About MME profiling results with Intel Gaudi Software

I am currently analyzing a simple GEMM operation using torch.mm with Intel Gaudi Software, as explained in (Analysis — Gaudi Documentation 1.17.0 documentation).
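For reference, my script boils down to something like the following minimal sketch (I enable the profiler separately, e.g. via the HABANA_PROFILE environment variable):

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

# Same sizes as the profiled run shown below
M = K = N = 16384

a = torch.randn(M, K, dtype=torch.bfloat16, device="hpu")
b = torch.randn(K, N, dtype=torch.bfloat16, device="hpu")

c = torch.mm(a, b)  # compiled by the Graph Compiler into MME GEMM kernels
htcore.mark_step()  # in lazy mode, flush the accumulated graph for execution
print(c.shape)
```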

After I run my script on Gaudi 2, there are 4 rows related to MME: [D0] MME, [D1] MME, [D2] MME, and [D3] MME. When the GEMM is sufficiently large, all 4 rows show kernel executions named GEMM3_bundle_0/op_x. In detail, all 4 rows show executions of GEMM3_bundle_0/op_0 (with different slice indices) and, concurrently, executions of GEMM3_bundle_0/op_1.

As explained in the white paper, Gaudi 2 has 2 MME units that can be used concurrently. However, in the profile, 8 threads (or slices?) are executing concurrently. Can you clarify what the [Dx] MME rows in the profiling tool mean, and also the notion of a slice in GEMM execution?

Thanks.

Here is a profiling result for a GEMM execution (M, K, N = 16384, 16384, 16384):

There are 2 logical MMEs, which actually consist of 4 physical MME dies. They can be configured in various modes (using FP8 as the example below):

Symmetric: H x W = 256 x 256

2xWide: H x W = 128 x 512

2xHigh: H x W = 512 x 128
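As a quick illustration of what these geometries mean for the 16384 x 16384 output from the question (the tile math below is my own back-of-envelope sketch, not the Graph Compiler's actual tiling algorithm): each geometry yields the same number of output tiles, just with different aspect ratios.

```python
# Illustration only: how the MME geometry shapes the output tiling.
M = N = 16384

geometries = {           # H x W output tile per MME pass (FP8 numbers above)
    "Symmetric": (256, 256),
    "2xWide":    (128, 512),
    "2xHigh":    (512, 128),
}

for name, (h, w) in geometries.items():
    rows, cols = M // h, N // w
    print(f"{name:9s}: {rows} x {cols} = {rows * cols} tiles of {h} x {w}")
```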

Those four rows [D0, D1, D2, D3] should correspond to those 4 physical MME dies. This level of detail is usually NOT needed for model performance profiling.

As for slices [op_0, op_1, …], these are used to pipeline data fetching (HBM → on-die SRAM) with computation. The on-die SRAM is much faster than HBM (e.g. >10 TB/s vs 2.45 TB/s) but has limited capacity (48 MB vs 96 GB). On Gaudi 2, we (the Graph Compiler) usually use the DMA engine to prefetch data from HBM to SRAM while the MME engine runs the GEMM on previously prefetched data in SRAM. The DMA engine (for data prefetch) and the MME engine can run concurrently, pipelining data loading and computation.
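A schematic of that double-buffering pattern (all names here are hypothetical stand-ins for illustration, not a real Gaudi API; in hardware, the prefetch and compute in each iteration overlap in time, and the ordering below only shows the dependency structure):

```python
NUM_SLICES = 8          # hypothetical slice count for one large GEMM

def dma_prefetch(slice_idx, buf):
    # Stand-in for the DMA engine copying slice data from HBM into SRAM.
    buf["slice"] = slice_idx

def mme_gemm(buf):
    # Stand-in for the MME engines computing on already-prefetched data.
    print(f"MME computes slice {buf['slice']}")

bufs = [{}, {}]                  # two SRAM buffers, used ping-pong style
dma_prefetch(0, bufs[0])         # warm-up: fetch slice 0 before computing
for i in range(NUM_SLICES):
    if i + 1 < NUM_SLICES:
        dma_prefetch(i + 1, bufs[(i + 1) % 2])  # DMA fetches the next slice...
    mme_gemm(bufs[i % 2])                       # ...while MME computes this one
```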

Due to the limited size of SRAM, for a large GEMM we (the Graph Compiler) have to split the GEMM into smaller ones so that the data for each smaller GEMM fits into SRAM. That's why you see those "slices" in the profile. You can observe this "pipeline" behavior by checking both the MME and DMA rows in the profile.
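To see why slicing is unavoidable at your size, here is the rough arithmetic (my own estimate, assuming BF16 inputs; the compiler's actual heuristics are more involved):

```python
M = K = N = 16384
elem = 2                        # bytes per BF16 element (assumed dtype)
sram = 48 * 2**20               # 48 MB on-die SRAM

a_bytes = M * K * elem          # 512 MB
b_bytes = K * N * elem          # 512 MB
print(f"inputs: {(a_bytes + b_bytes) / 2**20:.0f} MB vs SRAM: {sram / 2**20:.0f} MB")

# Even a single input matrix is ~10.7x larger than all of SRAM, so the
# GEMM must be streamed through SRAM in slices, with the DMA engine
# prefetching the next slice while the MME computes on the current one.
print(f"A alone / SRAM = {a_bytes / sram:.1f}x")
```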