Describe the issue; be as descriptive as possible, you can include things like:
• What was the expected behavior:
Slurm should be able to allocate Habana NPUs as Generic Resources (GRES).
For example:
#SBATCH --gres=npu:{hl200|hl202|hl205|hl225h}:X # Where X is number of NPUs
• What is the observed result:
Currently, Slurm has no Generic Resource (GRES) support for Habana Gaudi NPUs.
It would be nice to know whether anyone at Intel Habana Labs is working on this.
We are willing to beta test. We are currently running Slurm 21.08.8.
• Is the issue consistently reproducible?
NA
• If you are using AWS DL1 instance, please report the AMI name that you are using
NA
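To make the requested syntax concrete, here is a minimal sketch of how a job-submission wrapper might build the `--gres` option proposed above. The helper name `npu_gres_option` and the `KNOWN_NPU_TYPES` tuple are hypothetical. The type names are simply the ones listed in the request, not an existing Slurm or Habana interface.

```python
# Hypothetical helper: builds the --gres option string proposed in this issue.
# The type names come from the request above; Slurm does not ship an "npu" GRES plugin.
KNOWN_NPU_TYPES = ("hl200", "hl202", "hl205", "hl225h")


def npu_gres_option(npu_type: str, count: int) -> str:
    """Return an sbatch option of the form --gres=npu:<type>:<count>."""
    if npu_type not in KNOWN_NPU_TYPES:
        raise ValueError(f"unknown NPU type: {npu_type!r}")
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"--gres=npu:{npu_type}:{count}"


if __name__ == "__main__":
    # Example: request all eight HL-205 cards on a node.
    print("#SBATCH " + npu_gres_option("hl205", 8))
```

Validating the type name up front would catch typos before the job reaches the scheduler, where an unknown GRES type would otherwise be rejected at submission time.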
What are the details of the environment?
- Docker or not docker
Both
- Build from source or binary distribution
NA
- OS version: uname -a
root@h001:~# uname -ar
Linux h001 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@h001:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
- Software versions: (dpkg -l | grep habanalabs)
root@h001:~# dpkg -l |grep -i habanalabs
ii habanalabs-container-runtime 1.12.0-480 amd64 Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii habanalabs-dkms 1.12.0-480 all habanalabs driver in DKMS format.
ii habanalabs-firmware 1.12.0-480 amd64 Firmware package for Habana Labs processing accelerators
ii habanalabs-firmware-tools 1.12.0-480 amd64 Habanalabs firmware tools package
ii habanalabs-graph 1.12.0-480 amd64 habanalabs graph compiler
ii habanalabs-qual 1.12.0-480 amd64 This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii habanalabs-rdma-core 1.12.0-480 all Habana Labs rdma-core components.
ii habanalabs-thunk 1.12.0-480 all habanalabs thunk
- Python versions used: python --version
The default is:
root@h001:~# /usr/bin/python3 -V
Python 3.10.12
However, I install all of the following versions so that any of them can be used in a virtual env:
root@h001:~# which python python2 python3 python3.8 python3.9 python3.10 python3.11 python3.12
/usr/bin/python
/usr/bin/python3
/usr/bin/python3.8
/usr/bin/python3.9
/usr/bin/python3.10
/usr/bin/python3.11
/usr/bin/python3.12
- Please attach the dmesg dump, dmesg.log: dmesg > dmesg.log
NA
If Bare Metal, please share the current Habana release version and Firmware version by running this command: sudo hl-smi -q
[7] AIP (accel7) 0000:33:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL110XxxxXXX
Module status : Operational
Module ID : 1
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64X00-12-04-07
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[1] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
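As a possible stopgap until proper support exists, the `Product Name` lines in the `hl-smi -q` output above could be counted to generate a count-only `gres.conf` entry. This is only a sketch under assumptions: the function name `gres_conf_lines` is hypothetical, the `npu` GRES name and lowercase type naming follow the request in this issue, and a count-only GRES gives Slurm bookkeeping but no device binding or isolation. As far as I know, `slurm.conf` would also need `GresTypes=npu` and a matching `Gres=` entry on the node line.

```python
import re
from collections import Counter

# Matches lines such as "Product Name : HL-205" in `hl-smi -q` output.
PRODUCT_RE = re.compile(r"^\s*Product Name\s*:\s*(\S+)\s*$", re.MULTILINE)


def gres_conf_lines(hl_smi_text: str, node_name: str) -> list[str]:
    """Build count-only gres.conf lines from `hl-smi -q` text (hypothetical helper)."""
    # "HL-205" -> "hl205", matching the type names proposed in this issue.
    counts = Counter(
        name.replace("-", "").lower() for name in PRODUCT_RE.findall(hl_smi_text)
    )
    return [
        f"NodeName={node_name} Name=npu Type={npu_type} Count={count}"
        for npu_type, count in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = "Product Name : HL-205\nProduct Name : HL-205\n"
    for line in gres_conf_lines(sample, "h001"):
        print(line)
```

On a real node the text would come from running `sudo hl-smi -q` and passing its output to the function.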
–ToreL