Slurm GRES plugin supporting Generic Resource allocation?

Describe the issue; be as descriptive as possible. You can include things like:
• What was the expected behavior:

Allow for Generic Resource allocation through Slurm of Habana NPUs.

I.e.

#SBATCH --gres=npu:{hl200|hl202|hl205|hl225h}:X # Where X is number of NPUs
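For reference, a minimal sketch of what the corresponding Slurm configuration might look like. This is only an illustration of the request, not an existing Habana plugin: the GRES name "npu", the node name, and the device file paths are all assumptions.

```conf
# slurm.conf (sketch): declare the GRES type and advertise it on the node
GresTypes=npu
NodeName=h001 Gres=npu:hl205:8

# gres.conf on the compute node (sketch): device file paths are assumed;
# recent habanalabs driver releases expose accelerators under /dev/accel/
Name=npu Type=hl205 File=/dev/accel/accel[0-7]
```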

• What is the observed result:

Currently, there is no support for Slurm Generic Resource for Habana Gaudi NPUs.
It would be nice to know if anyone at Intel Habana Labs is working on this.
We are willing to beta test. We are currently running Slurm 21.08.8.

• Is the issue consistently reproducible?

NA

• If you are using AWS DL1 instance, please report the AMI name that you are using

NA

What are the details of the environment?

  • Docker or not docker

Both

  • Build from source or binary distribution

NA

  • OS version: uname -a
    root@h001:~# uname -ar
    Linux h001 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
    root@h001:~# lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 22.04.3 LTS
    Release: 22.04
    Codename: jammy

  • Software versions: (dpkg -l | grep habanalabs)

root@h001:~# dpkg -l |grep -i habanalabs
ii habanalabs-container-runtime 1.12.0-480 amd64 Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii habanalabs-dkms 1.12.0-480 all habanalabs driver in DKMS format.
ii habanalabs-firmware 1.12.0-480 amd64 Firmware package for Habana Labs processing accelerators
ii habanalabs-firmware-tools 1.12.0-480 amd64 Habanalabs firmware tools package
ii habanalabs-graph 1.12.0-480 amd64 habanalabs graph compiler
ii habanalabs-qual 1.12.0-480 amd64 This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii habanalabs-rdma-core 1.12.0-480 all Habana Labs rdma-core components.
ii habanalabs-thunk 1.12.0-480 all habanalabs thunk

  • Python versions used: python --version

Default is
root@h001:~# /usr/bin/python3 -V
Python 3.10.12

But I install all of them in order to allow for any virtual env.

root@h001:~# which python python2 python3 python3.8 python3.9 python3.10 python3.11 python3.12
/usr/bin/python
/usr/bin/python3
/usr/bin/python3.8
/usr/bin/python3.9
/usr/bin/python3.10
/usr/bin/python3.11
/usr/bin/python3.12

  • Please attach the dmesg dump, dmesg.log: dmesg > dmesg.log

NA

If Bare Metal, please share the current Habana release version and Firmware version by running this command: sudo hl-smi -q

[7] AIP (accel7) 0000:33:00.0
	Product Name			: HL-205
	Model Number			: F08GL0AI2007A
	Serial Number			: AL110XxxxXXX
	Module status			: Operational
	Module ID			: 1
	PCB Assembly Version		: V1A
	PCB Version			: R0F
	HL Revision			: 4
	AIP UUID			: 00P1-HL2000A1-14-P64X00-12-04-07
	AIP Status			: Production Level
	Firmware [FIT] Version		: Linux OAM[1] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
	Firmware [SPI] Version		: BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
	Firmware [UBOOT] Version	: U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
	Firmware [OS] Version		: N/A
	CPLD Version			: 0x00000018

–ToreL

I am not sure if we support slurm/slurm-21.08.8

We have a Slurm solution now:

  1. Check out slurm/habana-test on the master branch of htang2012/slurm.
  2. Follow the instructions under slurm/habana-test/:
  • $ make build-debian-package
  • $ make install-debian-package
  • For a single-card test: $ make test-single-card
  • For 8-card tests: $ make test-8-cards
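The steps above can be sketched as a single shell sequence. Note the clone URL is an assumption based on the repository name given in this reply, and the targets require the build prerequisites described in the repo's instructions:

```shell
# Sketch of the steps above; the clone URL is assumed from the
# repository name "htang2012/slurm" mentioned in this reply.
git clone https://github.com/htang2012/slurm.git
cd slurm/habana-test

# Build and install the Debian package
make build-debian-package
sudo make install-debian-package

# Run the provided tests
make test-single-card   # single-card test
make test-8-cards       # 8-card test
```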

Both Slurm (see the Slurm Workload Manager Quick Start Administrator Guide) and Habana currently support RPM builds compatible with RHEL 8/9. You can find detailed setup instructions here: Habana RPM Build Instructions.
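For the RPM route, the Slurm Quick Start Administrator Guide documents building RPMs directly from a release tarball with `rpmbuild -ta`. A sketch follows; the version number is only an example, not a recommendation:

```shell
# Build Slurm RPMs from a release tarball, per the Slurm Quick Start
# Administrator Guide. The version shown here is only an example.
VERSION=23.02.7
wget https://download.schedmd.com/slurm/slurm-${VERSION}.tar.bz2
rpmbuild -ta slurm-${VERSION}.tar.bz2
# Resulting RPMs are written under ~/rpmbuild/RPMS/ and can then be
# installed on the RHEL 8/9 nodes.
```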