Questions regarding the architecture about Habana Gaudi

Hello, I want to understand the architecture of Habana Gaudi, and I have read the documents (Welcome to Habana® Gaudi® v1.7 Documentation — Gaudi Documentation). I have some questions:

Question 1: I read in the documentation that the Habana Gaudi TPC contains Load, SPU, VPU and Store. For each TPC, how many VPUs and SPUs?

Question 2: How many units/PEs are in the GEMM engine (MME)? Does the MME have an architecture similar to GPU tensor cores? I find there is not much information about the MME's design.

Question 3: Does TPC-LLVM offload GEMM operations (like convolutions and fully-connected layers) onto the GEMM engine during compilation, or is there a separate compiler for the MME? The name "TPC-LLVM" suggests it is only for the TPC rather than the MME. Can someone clarify?

Question 4: Will the computations run in GEMM and TPC in parallel (overlapped)?

Question 1: I read in the documentation that the Habana Gaudi TPC contains Load, SPU, VPU and Store. For each TPC, how many VPUs and SPUs?
(Answer): In Gaudi, each TPC has one VPU, which can operate in one of three modes. (A VPE, Vector Processing Element, applies the same instruction at the same time (SIMD) to different values.)
(1) 64 VPEs of 32-bit element width (int32/float32)
(2) 128 VPEs of 16-bit element width (bf16/fp16/int16)
(3) 256 VPEs of 8-bit element width (fp8/int8)
Each TPC also has one SPU. The SPU is equivalent in arithmetic capability to a single VPU lane (a VPE). There are minor differences in instruction support, but the overall capability is similar.
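The three modes above all describe the same fixed-width vector datapath. As a quick sanity check (assuming a 2048-bit vector register, implied by 64 lanes x 32 bits), the lane count for each mode is just the vector width divided by the element width:

```python
# Sketch, not vendor code: derive the VPE lane counts for each mode
# from an assumed 2048-bit vector width (64 lanes x 32 bits).
VECTOR_WIDTH_BITS = 64 * 32  # 2048, implied by the 32-bit mode above

def vpe_lanes(element_bits):
    """Number of VPEs (SIMD lanes) for a given element width."""
    return VECTOR_WIDTH_BITS // element_bits

for bits, types in [(32, "int32/float32"), (16, "bf16/fp16/int16"), (8, "fp8/int8")]:
    print(f"{bits:2d}-bit ({types}): {vpe_lanes(bits)} VPEs")
```

This reproduces the 64/128/256 figures listed above.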

Question 2: How many units/PEs are in the GEMM engine (MME)? Does the MME have an architecture similar to GPU tensor cores? I find there is not much information about the MME's design.
(Answer): The Gaudi MME has a different architecture than a GPU. The MME has 4 EU (Execution Unit) cores and can produce 4096 fp32 MACs/cycle or 16,384 bf16 MACs/cycle.
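As a hedged back-of-envelope calculation (counting one MAC as 2 FLOPs, and using the 1.95 GHz Gaudi1 MME clock quoted elsewhere in this thread), those MAC rates translate to peak throughput as follows:

```python
# Back-of-envelope only: peak MME throughput from the MACs/cycle figures above.
# Assumptions: 1 MAC = 2 FLOPs; Gaudi1 MME clock of 1.95 GHz.
CLOCK_HZ = 1.95e9

def peak_tflops(macs_per_cycle):
    """Peak throughput in TFLOPS at the assumed clock."""
    return macs_per_cycle * 2 * CLOCK_HZ / 1e12

print(f"fp32: {peak_tflops(4096):.1f} TFLOPS")   # ~16.0
print(f"bf16: {peak_tflops(16384):.1f} TFLOPS")  # ~63.9
```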

Question 3: Does TPC-LLVM offload GEMM operations (like convolutions and fully-connected layers) onto the GEMM engine during compilation, or is there a separate compiler for the MME? The name "TPC-LLVM" suggests it is only for the TPC rather than the MME. Can someone clarify?
(Answer): TPC-LLVM is solely for TPC applications; it can compile all kinds of ops, including conv2d and non-linear ops. But most matrix multiplications are done on the MME, through our Synapse API and its graph compiler.

Question 4: Will the computations run in GEMM and TPC in parallel (overlapped)?
(Answer): Yes. The TPC and MME are independent hardware cores; they operate independently and can compute in parallel.


Thank you so much for the information!

Hello! I am trying to understand the architecture of the GEMM engine.

Question 1: For each execution unit core, what is the size of an instruction?

Question 2: How can users configure the GEMM engine? I see in this document (Gaudi Architecture — Gaudi Documentation) that the GEMM is configurable. Could you please provide more details about how this configurability works?

Question 3: Is the graph compiler code for GEMM engine available? I mean, is there a GEMM engine compiler API that I can directly use to compile code?

Thank you!

Question 1: For each execution unit core, what is the size of an instruction?
(Answer): The GEMM engine (MME) is different from the TPC. The MME is purpose-built for matrix multiplication; each EU's geometry is 64x64 for 16-bit data and 32x32 for 32-bit data.
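A quick consistency check: multiplying these per-EU geometries by the 4 EUs mentioned earlier in the thread reproduces the MACs/cycle figures quoted above:

```python
# Sanity check, using only numbers quoted in this thread:
# 4 EUs, each 64x64 at 16-bit and 32x32 at 32-bit.
NUM_EUS = 4

bf16_macs = NUM_EUS * 64 * 64   # per-EU 64x64 geometry at 16-bit
fp32_macs = NUM_EUS * 32 * 32   # per-EU 32x32 geometry at 32-bit
print(bf16_macs, fp32_macs)  # 16384 4096
```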
Question 2: How can users configure the GEMM engine? I see in this document (Gaudi Architecture — Gaudi Documentation) that the GEMM is configurable. Could you please provide more details about how this configurability works?
(Answer): "Configurable" means that the Synapse API can configure the MME to perform all kinds of computations. It is not user-facing: users can't interact with the MME directly.
Question 3: Is the graph compiler code for GEMM engine available? I mean, is there a GEMM engine compiler API that I can directly use to compile code?
(Answer): Only the Synapse API (graph compiler) can interact with the MME, not the user. This is different from the TPC, which is like a DSP processor: users can write kernel code for it directly in the TPC-C language.

Thank you so much for the helpful information!

Question 1: Could you please provide me with the MME clock speed?

Question 2: Does "configurable MME" mean that the graph compiler can configure which operations run on the MME and which on the TPC?

Question 3: How is the sparse matrix multiplication (SpMM) optimized on the MME?

Question 4: I observed that fp16 is not supported for the GEMM operation. Could you please tell me the architectural difference between NVIDIA GPU Tensor cores and the MME EU cores?

Thank you!

Question 1: Could you please provide me with the MME clock speed?
(Answer): For Gaudi1, the MME clock is 1.95 GHz.
Question 2: Does "configurable MME" mean that the graph compiler can configure which operations run on the MME and which on the TPC?
(Answer): That is correct. Most matrix multiplications go to the MME, and non-linear ops go to the TPC.
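Purely as an illustration (this is not Synapse code, and the op names here are made up), the partitioning described above amounts to a dispatch rule like:

```python
# Illustrative sketch only, NOT the Synapse graph compiler: routing
# matmul-like ops to the MME and non-linear/elementwise ops to the TPCs.
MME_OPS = {"matmul", "conv2d", "fully_connected"}  # hypothetical op names

def assign_engine(op_name):
    """Pick an engine for an op, in the spirit of the answer above."""
    return "MME" if op_name in MME_OPS else "TPC"

graph = ["conv2d", "relu", "matmul", "softmax"]
print([(op, assign_engine(op)) for op in graph])
```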
Question 3: How is the sparse matrix multiplication (SpMM) optimized on the MME?
(Answer): Our MME doesn't support SpMM. A TPC kernel could do the job instead.
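Since the MME has no native SpMM path, a custom TPC kernel would have to implement the sparse loop itself. A minimal CSR sparse-times-dense sketch of that loop, written here in plain Python rather than TPC-C:

```python
# Sketch of the loop structure a custom SpMM kernel would implement.
# A is sparse in CSR form (indptr, indices, data); B ("dense") is a
# list of rows; returns the dense product A @ B.
def csr_spmm(indptr, indices, data, dense, n_cols):
    n_rows = len(indptr) - 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # iterate only over the nonzeros of row i
        for k in range(indptr[i], indptr[i + 1]):
            j, a = indices[k], data[k]
            for c in range(n_cols):
                out[i][c] += a * dense[j][c]
    return out

# A = [[1, 0], [0, 2]] in CSR form, B = [[3, 4], [5, 6]]
print(csr_spmm([0, 1, 2], [0, 1], [1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]], 2))
```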

Question 4: I observed that fp16 is not supported for the GEMM operation. Could you please tell me the architectural difference between NVIDIA GPU Tensor cores and the MME EU cores?
(Answer): There are fundamental differences in both architecture and programming model. For example, the Habana MME EU geometry is substantially larger than a tensor core's. On GPUs, the compute core invokes the tensor cores directly, while on Habana the MME is invoked by a command buffer from the host.
Thanks.


Hello, we have some questions regarding programming on Habana Gaudi1.

  1. Is there any low-level API (in C or C++) for us to call the MME? Or do we have to call a PyTorch API like torch.mm to enable operations on the MME?
  2. Can the TPC directly interact with the MME (for example, directly call the MME from the TPC)?
  3. If there is no low-level API for calling the MME, how can we map desired operations onto the MME? For example, on Nvidia GPUs there are specific APIs for calling the tensor cores. We would like to try similar APIs.
  4. We found source code that programs the MME (hl-thunk/gaudi_mme_conv.c at 77a59c35d284d2f987c7266e7db7f6d6bd08568b · HabanaAI/hl-thunk · GitHub), but we do not understand its logic. How should we set up the MME if we want to write a function that runs on it?

Thank you!

  1. Is there any low-level API (in C or C++) for us to call the MME? Or do we have to call a PyTorch API like torch.mm to enable operations on the MME?
    (Answer): We don't expose a low-level API to call the MME. The way to use the MME is through a framework, e.g. torch.mm.
  2. Can the TPC directly interact with the MME (for example, directly call the MME from the TPC)?
    (Answer): No, the TPC can't call the MME directly. The graph compiler in Synapse directs operations to either the MME or the TPC.
  3. If there is no low-level API for calling the MME, how can we map desired operations onto the MME? For example, on Nvidia GPUs there are specific APIs for calling the tensor cores. We would like to try similar APIs.
    (Answer): The Gaudi architecture is different from Nvidia GPUs: users can't interact with the MME directly, and MME-related operations are controlled by Synapse.
  4. We found source code that programs the MME, but we do not understand its logic. How should we set up the MME if we want to write a function that runs on it?
    (Answer): All the exposed APIs are documented here: APIs — Gaudi Documentation.
    That code is for hl-thunk test purposes; we do not recommend using it as a template.

Thanks