Questions regarding the Habana Gaudi architecture

Hello, I want to understand the architecture of Habana Gaudi, and I have read the documentation (Welcome to Habana® Gaudi® v1.7 Documentation — Gaudi Documentation). I have some questions:

Question 1: I read in the documentation that the Habana Gaudi TPC contains Load, SPU, VPU, and Store units. How many VPUs and SPUs does each TPC have?

Question 2: How many units/PEs are in the GEMM engine (MME)? Does the MME have an architecture similar to a GPU tensor core? I can find little information about the MME design.

Question 3: Does TPC-LLVM offload GEMM operations (such as convolutions and fully connected layers) onto the GEMM engine during compilation, or is there a separate compiler for the MME? The name “TPC-LLVM” makes me think it targets only the TPC rather than the MME. Can someone clarify?

Question 4: Do computations on the GEMM engine and the TPC run in parallel (overlapped)?

Question 3: Does TPC-LLVM offload GEMM operations (such as convolutions and fully connected layers) onto the GEMM engine during compilation, or is there a separate compiler for the MME? The name “TPC-LLVM” makes me think it targets only the TPC rather than the MME. Can someone clarify?
(Answer): TPC-LLVM is solely for TPC applications; it can compile all kinds of ops, including conv2d and non-linear ops. Most matrix multiplications, however, are done in the MME, through our Synapse API and graph compiler.

Question 4: Do computations on the GEMM engine and the TPC run in parallel (overlapped)?
(Answer): Yes. The TPC and MME are independent hardware cores; they operate independently and compute in parallel.
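
To make the division of labor concrete, here is a minimal PyTorch sketch, assuming a Gaudi host with the habana_frameworks PyTorch bridge installed (the bridge is what registers the “hpu” device). The graph compiler would place the dense matmul on the MME and the non-linear activation on a TPC, and since the engines are independent, such work can overlap.

```python
import torch
import habana_frameworks.torch.core  # Habana PyTorch bridge (assumed installed)

device = torch.device("hpu")

x = torch.randn(256, 512, device=device)
w = torch.randn(512, 1024, device=device)

# The Synapse graph compiler routes the dense matmul to the MME and
# the non-linear ReLU to a TPC; because the two engines are
# independent, their work can be overlapped across graph branches.
y = torch.relu(torch.mm(x, w))
```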

Thank you so much for the information!

Hello! I am trying to understand the architecture of the GEMM engine.

Question 1: For each execution unit core, what is the size of an instruction?

Question 2: How can users configure the GEMM engine? I see in this document (Gaudi Architecture — Gaudi Documentation) that the GEMM engine is configurable. Could you please provide more details about how this configurability works?

Question 3: Is the graph compiler code for the GEMM engine available? I mean, is there a GEMM engine compiler API that I can use directly to compile code?

Thank you!

Question 2: How can users configure the GEMM engine? I see in this document (Gaudi Architecture — Gaudi Documentation) that the GEMM engine is configurable. Could you please provide more details about how this configurability works?
(Answer): It is configurable in the sense that the Synapse API can configure the MME to perform all kinds of computation. It is not configurable by the user; users can’t interact with the MME directly.
Question 3: Is the graph compiler code for the GEMM engine available? I mean, is there a GEMM engine compiler API that I can use directly to compile code?
(Answer): Only the Synapse API (graph compiler) can interact with the MME, not the user. This is different from the TPC, which is like a DSP processor: users can write kernel code for it directly in the TPC-C language.

Thank you so much for the helpful information!

Question 1: Could you please provide me with the MME clock speed?

Question 2: Does “configurable MME” mean that the graph compiler can decide which operations run on the MME and which on the TPC?

Question 3: How is sparse matrix multiplication (SpMM) optimized on the MME?

Question 4: I observed that FP16 is not supported for GEMM operations. Could you please tell me the architectural difference between NVIDIA GPU Tensor Cores and the MME EU cores?

Thank you!

Question 2: Does “configurable MME” mean that the graph compiler can decide which operations run on the MME and which on the TPC?
(Answer): That is correct. Most matrix multiplications will go to the MME, and non-linear ops will go to the TPC.
Question 3: How is sparse matrix multiplication (SpMM) optimized on the MME?
(Answer): Our MME doesn’t support SpMM. A TPC kernel could do the job instead.
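
Since the MME has no native SpMM support, one framework-level workaround is to densify the sparse operand so the multiplication becomes a plain dense matmul that the graph compiler can place on the MME (any sparsity benefit is lost, of course). A minimal sketch, assuming the habana_frameworks PyTorch bridge:

```python
import torch
import habana_frameworks.torch.core  # Habana PyTorch bridge (assumed installed)

device = torch.device("hpu")

# A small sparse matrix in COO format, built on the CPU for clarity.
indices = torch.tensor([[0, 1, 2], [2, 0, 1]])
values = torch.tensor([1.0, 2.0, 3.0])
a_sparse = torch.sparse_coo_tensor(indices, values, (3, 3))

b = torch.randn(3, 4, device=device)

# Densify the sparse operand so the multiplication becomes an
# ordinary dense matmul, which Synapse can route to the MME.
a_dense = a_sparse.to_dense().to(device)
c = torch.mm(a_dense, b)
```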

Thanks.

Hello, we have some questions regarding programming on Habana Gaudi1.

  1. Is there any low-level API (in C or C++) for us to call the MME? Or do we have to call a PyTorch API like torch.mm to run operations on the MME?
  2. Can the TPC directly interact with the MME (for example, directly call the MME from a TPC kernel)?
  3. If there is no low-level API for calling the MME, how can we map our desired operations onto it? For example, on NVIDIA GPUs there are specific APIs for calling the tensor cores; we would like to try similar APIs.
  4. We found source code that programs the MME (hl-thunk/gaudi_mme_conv.c at 77a59c35d284d2f987c7266e7db7f6d6bd08568b · HabanaAI/hl-thunk · GitHub), but we do not understand its logic. How do we set up the MME if we want to write a function that runs on it?

Thank you!

  1. Is there any low-level API (in C or C++) for us to call the MME? Or do we have to call a PyTorch API like torch.mm to run operations on the MME?
    (Answer): We don’t expose a low-level API for calling the MME. The way to use the MME is through a framework call such as torch.mm (see the sketch after this list).
  2. Can the TPC directly interact with the MME (for example, directly call the MME from a TPC kernel)?
    (Answer): No, the TPC can’t call the MME directly. The graph compiler in Synapse redirects each operation to either the MME or the TPC.
  3. If there is no low-level API for calling the MME, how can we map our desired operations onto it? For example, on NVIDIA GPUs there are specific APIs for calling the tensor cores; we would like to try similar APIs.
    (Answer): The Gaudi architecture is different from NVIDIA GPUs: users can’t interact with the MME directly; MME operations are controlled by Synapse.
  4. We found source code that programs the MME, but we do not understand its logic. How do we set up the MME if we want to write a function that runs on it?
    (Answer): All the exposed APIs are documented here: APIs — Gaudi Documentation. That piece of code is for hl-thunk test purposes; we don’t recommend using it as a template.
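
As a concrete illustration of the framework path, here is a minimal PyTorch sketch, assuming a Gaudi host with the habana_frameworks PyTorch bridge installed. Note that no MME handle appears anywhere in user code; torch.mm is an ordinary framework op, and the Synapse graph compiler is what maps it onto the MME.

```python
import torch
import habana_frameworks.torch.core  # Habana PyTorch bridge (assumed installed)

device = torch.device("hpu")

a = torch.randn(128, 256, device=device)
b = torch.randn(256, 64, device=device)

# There is no MME-specific API at this level: torch.mm is a plain
# framework call, and Synapse decides that this dense matmul runs
# on the MME rather than on the TPCs.
c = torch.mm(a, b)
print(c.shape)  # torch.Size([128, 64])
```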

Thanks

Hi, I was wondering if you could share similar details about the MMEs on Gaudi2. How many MACs can they perform per cycle?

Here are some specs for the Gaudi2 MME: it supports BF16, FP32, TF32, FP16, and FP8.
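
For reference, a minimal sketch of exercising one of those MME data types from PyTorch, assuming Gaudi2 hardware and the habana_frameworks bridge (BF16 is shown; the same pattern applies to the other supported types):

```python
import torch
import habana_frameworks.torch.core  # Habana PyTorch bridge (assumed installed)

device = torch.device("hpu")

# Request one of the Gaudi2 MME data types explicitly (BF16 here).
a = torch.randn(64, 64, device=device, dtype=torch.bfloat16)
b = torch.randn(64, 64, device=device, dtype=torch.bfloat16)

c = torch.mm(a, b)
print(c.dtype)  # torch.bfloat16
```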