Problems installing habanalabs-dkms_1.13.0-463_all.deb, among others, on Ubuntu 22.04.3 LTS w/5.15.0-89-generic

Describe the issue; be as descriptive as possible, you can include things like:
• What was the expected behavior:

Upgrading to the latest 1.13.0-463 drivers should work without trouble.

• What is the observed result:

Issue 1: Same as seen in "Problems installing habanalabs-dkms_1.12.0-480_all.deb on ubuntu 22.04.3LTS w/5.15.0-86-generic".

I removed /var/lib/dkms/habanalabs-dkms/1.12.0-480, and then I could install habanalabs-dkms_1.13.0-463_all.deb.

However, some packages refuse to install completely. This seems to be linked to missing internal deb files (possibly an error in the deb build spec files).

• Is the issue consistently reproducible?

Yes. I have removed everything completely and installed it again, and I get the same issue.

root@h001:~# dpkg -l |grep -i habana
ii habanalabs-container-runtime 1.13.0-463 amd64 Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii habanalabs-dkms 1.13.0-463 all habanalabs driver in DKMS format.
ii habanalabs-firmware 1.13.0-463 amd64 Firmware package for Habana Labs processing accelerators
iU habanalabs-firmware-tools 1.13.0-463 amd64 Habanalabs firmware tools package
iU habanalabs-graph 1.13.0-463 amd64 habanalabs graph compiler
iU habanalabs-qual 1.13.0-463 amd64 This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
iF habanalabs-rdma-core 1.13.0-463 all Habana Labs rdma-core components.
iU habanalabs-thunk 1.13.0-463 all habanalabs thunk
ii habanatools 1.13.0-463 amd64 Habana Labs tools package
root@h001:~#
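For anyone comparing against their own system, here is a small sketch that picks out the half-installed packages from a listing like the one above. In the first column, `ii` means fully installed, `iU` means unpacked but not configured, and `iF` means configuration failed. The function name `not_fully_installed` is made up for this example.

```shell
#!/bin/sh
# Print habana packages whose dpkg status is anything other than "ii"
# (fully installed). Reads `dpkg -l`-style lines on stdin.
# The function name is hypothetical, chosen for this sketch.
not_fully_installed() {
    awk 'tolower($2) ~ /habana/ && $1 != "ii" { print $1, $2 }'
}

# Typical use on a live system:
#   dpkg -l | not_fully_installed
```

Run against the listing above, this should print the five packages marked `iU` or `iF`.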

• If you are using AWS DL1 instance, please report the AMI name that you are using

What are the details of the environment?

  • Docker or not docker

Bare metal install.

  • Build from source or binary distribution

From your repo.

  • OS version: uname -a

root@h001:~# uname -a
Linux h001 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Software versions: (dpkg -l | grep habanalabs)

Ref. above.

  • Python versions used: python --version

root@h001:~# python --version
Python 3.10.12

  • Please attach the dmesg dump, dmesg.log: dmesg > dmesg.log

NA.

If Bare Metal, please share the current Habana release version and Firmware version by running this command: sudo hl-smi -q

The problem seems to be with the habanalabs-rdma-core package. Could it be an incompatibility with the installed Mellanox OFED 5.8 LTS (used for the rest of our infrastructure)?

root@h001:~# apt install habanalabs-rdma-core
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
habanalabs-rdma-core is already the newest version (1.13.0-463).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
5 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] Y
Setting up habanalabs-rdma-core (1.13.0-463) …
/opt/habanalabs/rdma-core/src /
/opt/habanalabs/rdma-core/src/build/lib /opt/habanalabs/rdma-core/src /
ln: failed to create symbolic link ‘libibverbs.so.1’: File exists
dpkg: error processing package habanalabs-rdma-core (--configure):
installed habanalabs-rdma-core package post-installation script subprocess returned error exit status 1
dpkg: dependency problems prevent configuration of habanalabs-thunk:
habanalabs-thunk depends on habanalabs-rdma-core; however:
Package habanalabs-rdma-core is not configured yet.
Package habanalabs-rdma-core which provides habanalabs-rdma-core is not configured yet.

dpkg: error processing package habanalabs-thunk (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of habanalabs-firmware-tools:
habanalabs-firmware-tools depends on habanalabs-thunk (>= 1.13.0-463) | habanalabs-thunk-internal (>= 1.13.0-463); however:
Package habanalabs-thunk is not configured yet.
Package habanalabs-thunk-internal is not installed.

dpkg: error processing package habanalabs-firmware-tools (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of habanalabs-graph:
habanalabs-graph depends on habanalabs-thunk (>= 1.13.0-463) | habanalabs-thunk-internal (>= 1.13.0-463); however:
Package habanalabs-thunk is not configured yet.
Package habanalabs-thunk-internal is not installed.

dpkg: error processing package habanalabs-graph (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of habanalabs-qual:
habanalabs-qual depends on habanalabs-graph (>= 1.13.0) | habanalabs-graph-internal (>= 1.13.0); however:
Package habanalabs-graph is not configured yet.
Package habanalabs-graph-internal is not installed.

dpkg: error processing package habanalabs-qual (--configure):
dependency problems - leaving unconfigured
Errors were encountered while processing:
habanalabs-rdma-core
habanalabs-thunk
habanalabs-firmware-tools
habanalabs-graph
habanalabs-qual
E: Sub-process /usr/bin/dpkg returned an error code (1)
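The `ln` message in the log is what a plain `ln -s` prints when the link name already exists. The sketch below reproduces that in a scratch directory and shows that `ln -sf` succeeds on a re-run. This is an illustration only: the thread does not show the package's post-installation script, so whether it calls `ln -s` exactly like this is an assumption, and the target file name `libibverbs.so.1.14` is made up.

```shell
#!/bin/sh
# Reproduce "ln: failed to create symbolic link ...: File exists" in a
# scratch directory; nothing under /opt is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical file name standing in for the real library.
touch libibverbs.so.1.14
ln -s libibverbs.so.1.14 libibverbs.so.1    # first run succeeds

# A second plain `ln -s` fails because the link already exists.
if ln -s libibverbs.so.1.14 libibverbs.so.1 2>/dev/null; then
    second_run=ok
else
    second_run=failed
fi

# With -f the existing link is replaced, so a re-run succeeds.
if ln -sf libibverbs.so.1.14 libibverbs.so.1; then
    forced_run=ok
else
    forced_run=failed
fi

echo "plain ln -s on re-run: $second_run"
echo "ln -sf on re-run:      $forced_run"
cd / && rm -rf "$demo_dir"
```

On the affected machine, `ls -l /opt/habanalabs/rdma-core/src/build/lib/libibverbs.so.1` should show what is already sitting at that name, assuming the link is created in the directory printed just before the error.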

FYI: the new habanalabs-dkms 1.13.0-463 fixed an issue we had with soft lockups when running a llama2 langchain Docker container.

Thanks for that.

root@h001:~# hl-smi -q

================ HL-SMI LOG ================

Timestamp : Mon Nov 27 21:24:59 CET 2023
Driver Version : 1.13.0-ee32e42
HL-SMI Version : hl-1.13.0-rc-fw-47.0.0.0 (Nov 17 2023 - 13:42:59)

Attached AIPs : 8

[0] AIP (accel0) 0000:19:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11019894
Module status : Operational
Module ID : 3
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64X78-02-07-10
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[3] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0x19
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:19:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 42 C
[ 2] On Board W : 34 C
[ 3] On Board NE : 36 C
[ 4] On Board E : 35 C
[ 5] On Chip SE : 30 C
[ 6] On Chip SW : 30 C
[ 7] On Chip NE : 31 C
[ 8] On Chip NW : 32 C
[ 9] HBM TS1-SE : 30 C
[10] HBM TS2-SW : 30 C
[11] HBM TS3-NE : 29 C
[12] HBM TS4-NW : 30 C
[13] On Board N : 32 C
[14] HL2000-TD3 : 33 C
Power Readings
Power Management : auto
Power Draw : 95 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:0b:e1
[ 2] MAC : b0:fd:0b:d3:0b:e2
[ 3] MAC : b0:fd:0b:d3:0b:e3
[ 4] MAC : b0:fd:0b:d3:0b:e4
[ 5] MAC : b0:fd:0b:d3:0b:e5
[ 6] MAC : b0:fd:0b:d3:0b:e6
[ 7] MAC : b0:fd:0b:d3:0b:e7
[ 8] MAC : b0:fd:0b:d3:0b:e8
[ 9] MAC : b0:fd:0b:d3:0b:e9
[10] MAC : b0:fd:0b:d3:0b:ea
[11] MAC : b0:fd:0b:d3:0b:eb
[12] MAC : b0:fd:0b:d3:0b:ec
[13] MAC : b0:fd:0b:d3:0b:ed
[14] MAC : b0:fd:0b:d3:0b:ee
[15] MAC : b0:fd:0b:d3:0b:ef
[16] MAC : b0:fd:0b:d3:0b:f0
[17] MAC : b0:fd:0b:d3:0b:f1
[18] MAC : b0:fd:0b:d3:0b:f2
[19] MAC : b0:fd:0b:d3:0b:f3
[20] MAC : b0:fd:0b:d3:0b:f4
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[1] AIP (accel1) 0000:b4:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11020417
Module status : Operational
Module ID : 6
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64X00-16-07-11
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[6] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0xb4
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:b4:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 34 C
[ 2] On Board W : 30 C
[ 3] On Board NE : 28 C
[ 4] On Board E : 28 C
[ 5] On Chip SE : 27 C
[ 6] On Chip SW : 29 C
[ 7] On Chip NE : 28 C
[ 8] On Chip NW : 29 C
[ 9] HBM TS1-SE : 24 C
[10] HBM TS2-SW : 27 C
[11] HBM TS3-NE : 29 C
[12] HBM TS4-NW : 26 C
[13] On Board N : 27 C
[14] HL2000-TD3 : 29 C
Power Readings
Power Management : auto
Power Draw : 102 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:34:bd
[ 2] MAC : b0:fd:0b:d3:34:be
[ 3] MAC : b0:fd:0b:d3:34:bf
[ 4] MAC : b0:fd:0b:d3:34:c0
[ 5] MAC : b0:fd:0b:d3:34:c1
[ 6] MAC : b0:fd:0b:d3:34:c2
[ 7] MAC : b0:fd:0b:d3:34:c3
[ 8] MAC : b0:fd:0b:d3:34:c4
[ 9] MAC : b0:fd:0b:d3:34:c5
[10] MAC : b0:fd:0b:d3:34:c6
[11] MAC : b0:fd:0b:d3:34:c7
[12] MAC : b0:fd:0b:d3:34:c8
[13] MAC : b0:fd:0b:d3:34:c9
[14] MAC : b0:fd:0b:d3:34:ca
[15] MAC : b0:fd:0b:d3:34:cb
[16] MAC : b0:fd:0b:d3:34:cc
[17] MAC : b0:fd:0b:d3:34:cd
[18] MAC : b0:fd:0b:d3:34:ce
[19] MAC : b0:fd:0b:d3:34:cf
[20] MAC : b0:fd:0b:d3:34:d0
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[2] AIP (accel2) 0000:b3:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11020339
Module status : Operational
Module ID : 7
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64W98-10-03-11
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[7] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0xb3
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:b3:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 38 C
[ 2] On Board W : 32 C
[ 3] On Board NE : 35 C
[ 4] On Board E : 33 C
[ 5] On Chip SE : 28 C
[ 6] On Chip SW : 29 C
[ 7] On Chip NE : 32 C
[ 8] On Chip NW : 31 C
[ 9] HBM TS1-SE : 30 C
[10] HBM TS2-SW : 30 C
[11] HBM TS3-NE : 28 C
[12] HBM TS4-NW : 31 C
[13] On Board N : 30 C
[14] HL2000-TD3 : 32 C
Power Readings
Power Management : auto
Power Draw : 103 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:2e:a5
[ 2] MAC : b0:fd:0b:d3:2e:a6
[ 3] MAC : b0:fd:0b:d3:2e:a7
[ 4] MAC : b0:fd:0b:d3:2e:a8
[ 5] MAC : b0:fd:0b:d3:2e:a9
[ 6] MAC : b0:fd:0b:d3:2e:aa
[ 7] MAC : b0:fd:0b:d3:2e:ab
[ 8] MAC : b0:fd:0b:d3:2e:ac
[ 9] MAC : b0:fd:0b:d3:2e:ad
[10] MAC : b0:fd:0b:d3:2e:ae
[11] MAC : b0:fd:0b:d3:2e:af
[12] MAC : b0:fd:0b:d3:2e:b0
[13] MAC : b0:fd:0b:d3:2e:b1
[14] MAC : b0:fd:0b:d3:2e:b2
[15] MAC : b0:fd:0b:d3:2e:b3
[16] MAC : b0:fd:0b:d3:2e:b4
[17] MAC : b0:fd:0b:d3:2e:b5
[18] MAC : b0:fd:0b:d3:2e:b6
[19] MAC : b0:fd:0b:d3:2e:b7
[20] MAC : b0:fd:0b:d3:2e:b8
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[3] AIP (accel3) 0000:33:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11020427
Module status : Operational
Module ID : 1
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64X00-12-04-07
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[1] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0x33
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:33:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 37 C
[ 2] On Board W : 31 C
[ 3] On Board NE : 28 C
[ 4] On Board E : 28 C
[ 5] On Chip SE : 27 C
[ 6] On Chip SW : 30 C
[ 7] On Chip NE : 30 C
[ 8] On Chip NW : 30 C
[ 9] HBM TS1-SE : 25 C
[10] HBM TS2-SW : 28 C
[11] HBM TS3-NE : 28 C
[12] HBM TS4-NW : 28 C
[13] On Board N : 28 C
[14] HL2000-TD3 : 30 C
Power Readings
Power Management : auto
Power Draw : 100 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:35:85
[ 2] MAC : b0:fd:0b:d3:35:86
[ 3] MAC : b0:fd:0b:d3:35:87
[ 4] MAC : b0:fd:0b:d3:35:88
[ 5] MAC : b0:fd:0b:d3:35:89
[ 6] MAC : b0:fd:0b:d3:35:8a
[ 7] MAC : b0:fd:0b:d3:35:8b
[ 8] MAC : b0:fd:0b:d3:35:8c
[ 9] MAC : b0:fd:0b:d3:35:8d
[10] MAC : b0:fd:0b:d3:35:8e
[11] MAC : b0:fd:0b:d3:35:8f
[12] MAC : b0:fd:0b:d3:35:90
[13] MAC : b0:fd:0b:d3:35:91
[14] MAC : b0:fd:0b:d3:35:92
[15] MAC : b0:fd:0b:d3:35:93
[16] MAC : b0:fd:0b:d3:35:94
[17] MAC : b0:fd:0b:d3:35:95
[18] MAC : b0:fd:0b:d3:35:96
[19] MAC : b0:fd:0b:d3:35:97
[20] MAC : b0:fd:0b:d3:35:98
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[4] AIP (accel4) 0000:1a:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11020199
Module status : Operational
Module ID : 2
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64W99-02-08-05
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[2] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0x1a
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:1a:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 36 C
[ 2] On Board W : 32 C
[ 3] On Board NE : 30 C
[ 4] On Board E : 30 C
[ 5] On Chip SE : 28 C
[ 6] On Chip SW : 30 C
[ 7] On Chip NE : 32 C
[ 8] On Chip NW : 30 C
[ 9] HBM TS1-SE : 25 C
[10] HBM TS2-SW : 26 C
[11] HBM TS3-NE : 30 C
[12] HBM TS4-NW : 25 C
[13] On Board N : 29 C
[14] HL2000-TD3 : 30 C
Power Readings
Power Management : auto
Power Draw : 105 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:23:b5
[ 2] MAC : b0:fd:0b:d3:23:b6
[ 3] MAC : b0:fd:0b:d3:23:b7
[ 4] MAC : b0:fd:0b:d3:23:b8
[ 5] MAC : b0:fd:0b:d3:23:b9
[ 6] MAC : b0:fd:0b:d3:23:ba
[ 7] MAC : b0:fd:0b:d3:23:bb
[ 8] MAC : b0:fd:0b:d3:23:bc
[ 9] MAC : b0:fd:0b:d3:23:bd
[10] MAC : b0:fd:0b:d3:23:be
[11] MAC : b0:fd:0b:d3:23:bf
[12] MAC : b0:fd:0b:d3:23:c0
[13] MAC : b0:fd:0b:d3:23:c1
[14] MAC : b0:fd:0b:d3:23:c2
[15] MAC : b0:fd:0b:d3:23:c3
[16] MAC : b0:fd:0b:d3:23:c4
[17] MAC : b0:fd:0b:d3:23:c5
[18] MAC : b0:fd:0b:d3:23:c6
[19] MAC : b0:fd:0b:d3:23:c7
[20] MAC : b0:fd:0b:d3:23:c8
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[5] AIP (accel5) 0000:cd:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11020017
Module status : Operational
Module ID : 4
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64X00-08-05-09
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[4] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0xcd
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:cd:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 40 C
[ 2] On Board W : 34 C
[ 3] On Board NE : 36 C
[ 4] On Board E : 35 C
[ 5] On Chip SE : 30 C
[ 6] On Chip SW : 34 C
[ 7] On Chip NE : 36 C
[ 8] On Chip NW : 33 C
[ 9] HBM TS1-SE : 29 C
[10] HBM TS2-SW : 29 C
[11] HBM TS3-NE : 31 C
[12] HBM TS4-NW : 29 C
[13] On Board N : 32 C
[14] HL2000-TD3 : 33 C
Power Readings
Power Management : auto
Power Draw : 103 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:15:7d
[ 2] MAC : b0:fd:0b:d3:15:7e
[ 3] MAC : b0:fd:0b:d3:15:7f
[ 4] MAC : b0:fd:0b:d3:15:80
[ 5] MAC : b0:fd:0b:d3:15:81
[ 6] MAC : b0:fd:0b:d3:15:82
[ 7] MAC : b0:fd:0b:d3:15:83
[ 8] MAC : b0:fd:0b:d3:15:84
[ 9] MAC : b0:fd:0b:d3:15:85
[10] MAC : b0:fd:0b:d3:15:86
[11] MAC : b0:fd:0b:d3:15:87
[12] MAC : b0:fd:0b:d3:15:88
[13] MAC : b0:fd:0b:d3:15:89
[14] MAC : b0:fd:0b:d3:15:8a
[15] MAC : b0:fd:0b:d3:15:8b
[16] MAC : b0:fd:0b:d3:15:8c
[17] MAC : b0:fd:0b:d3:15:8d
[18] MAC : b0:fd:0b:d3:15:8e
[19] MAC : b0:fd:0b:d3:15:8f
[20] MAC : b0:fd:0b:d3:15:90
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[6] AIP (accel6) 0000:cc:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11020019
Module status : Operational
Module ID : 5
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64W98-05-02-04
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[5] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0xcc
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:cc:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 36 C
[ 2] On Board W : 32 C
[ 3] On Board NE : 30 C
[ 4] On Board E : 30 C
[ 5] On Chip SE : 26 C
[ 6] On Chip SW : 29 C
[ 7] On Chip NE : 29 C
[ 8] On Chip NW : 29 C
[ 9] HBM TS1-SE : 28 C
[10] HBM TS2-SW : 26 C
[11] HBM TS3-NE : 27 C
[12] HBM TS4-NW : 27 C
[13] On Board N : 29 C
[14] HL2000-TD3 : 30 C
Power Readings
Power Management : auto
Power Draw : 103 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:15:a5
[ 2] MAC : b0:fd:0b:d3:15:a6
[ 3] MAC : b0:fd:0b:d3:15:a7
[ 4] MAC : b0:fd:0b:d3:15:a8
[ 5] MAC : b0:fd:0b:d3:15:a9
[ 6] MAC : b0:fd:0b:d3:15:aa
[ 7] MAC : b0:fd:0b:d3:15:ab
[ 8] MAC : b0:fd:0b:d3:15:ac
[ 9] MAC : b0:fd:0b:d3:15:ad
[10] MAC : b0:fd:0b:d3:15:ae
[11] MAC : b0:fd:0b:d3:15:af
[12] MAC : b0:fd:0b:d3:15:b0
[13] MAC : b0:fd:0b:d3:15:b1
[14] MAC : b0:fd:0b:d3:15:b2
[15] MAC : b0:fd:0b:d3:15:b3
[16] MAC : b0:fd:0b:d3:15:b4
[17] MAC : b0:fd:0b:d3:15:b5
[18] MAC : b0:fd:0b:d3:15:b6
[19] MAC : b0:fd:0b:d3:15:b7
[20] MAC : b0:fd:0b:d3:15:b8
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

[7] AIP (accel7) 0000:34:00.0
Product Name : HL-205
Model Number : F08GL0AI2007A
Serial Number : AL11019974
Module status : Operational
Module ID : 0
PCB Assembly Version : V1A
PCB Version : R0F
HL Revision : 4
AIP UUID : 00P1-HL2000A1-14-P64W99-13-05-03
AIP Status : Production Level
Firmware [FIT] Version : Linux OAM[0] gaudi 5.10.18-hl-gaudi-1.2.3-fw-32.6.6-sec-4 #1 SMP PREEMPT Wed Jan 4 20:45:04 IST 2023 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 81608d8d,Preboot version hl-gaudi-1.1.0-fw-32.3.5-sec-4 (Oct 05 2021 - 15:13:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.2.3-fw-32.6.6-sec-4 (Jan 04 2023 - 20:43:58 +0200) build#: 6981
Firmware [OS] Version : N/A
CPLD Version : 0x00000018
PCI
Bus : 0x34
Device : 0x00
Domain : 0x0000
Rev : 01
Device Id : 0x1da31000
Bus Id : 0000:34:00.0
Sub System Id : 0x1da31000
AIP Link Info
Link Speed
Max : 16GT/s
Current : 16GT/s
Link Width
Max : x16
Current : x16
Fan Speed : N/A
Clocks Throttle Reasons
HW Slowdown : Not Active
Thermal Slowdown : Not Active
Power Slowdown : Not Active
Memory Type : HBM
Memory Usage
Total : 32768 MB
Used : 512 MB
Free : 32256 MB
Temperature
[ 1] Core Power supply : 40 C
[ 2] On Board W : 33 C
[ 3] On Board NE : 36 C
[ 4] On Board E : 35 C
[ 5] On Chip SE : 30 C
[ 6] On Chip SW : 33 C
[ 7] On Chip NE : 35 C
[ 8] On Chip NW : 34 C
[ 9] HBM TS1-SE : 31 C
[10] HBM TS2-SW : 30 C
[11] HBM TS3-NE : 32 C
[12] HBM TS4-NW : 31 C
[13] On Board N : 32 C
[14] HL2000-TD3 : 34 C
Power Readings
Power Management : auto
Power Draw : 104 W
Power Max : 350 W
Power Limit : 350 W
Clocks
[1] soc : 1800 MHz
Clocks Max
[1] soc : 1800 MHz
Clocks Limit : 1800 MHz
Network Information
[ 1] MAC : b0:fd:0b:d3:12:21
[ 2] MAC : b0:fd:0b:d3:12:22
[ 3] MAC : b0:fd:0b:d3:12:23
[ 4] MAC : b0:fd:0b:d3:12:24
[ 5] MAC : b0:fd:0b:d3:12:25
[ 6] MAC : b0:fd:0b:d3:12:26
[ 7] MAC : b0:fd:0b:d3:12:27
[ 8] MAC : b0:fd:0b:d3:12:28
[ 9] MAC : b0:fd:0b:d3:12:29
[10] MAC : b0:fd:0b:d3:12:2a
[11] MAC : b0:fd:0b:d3:12:2b
[12] MAC : b0:fd:0b:d3:12:2c
[13] MAC : b0:fd:0b:d3:12:2d
[14] MAC : b0:fd:0b:d3:12:2e
[15] MAC : b0:fd:0b:d3:12:2f
[16] MAC : b0:fd:0b:d3:12:30
[17] MAC : b0:fd:0b:d3:12:31
[18] MAC : b0:fd:0b:d3:12:32
[19] MAC : b0:fd:0b:d3:12:33
[20] MAC : b0:fd:0b:d3:12:34
Replaced Rows
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No

Please try the following:
apt list --installed | grep habana
sudo apt purge <habana packages from the list>
sudo apt update --fix-missing
reboot
After that, please follow these steps:
sudo apt list | grep habana | grep 1.13
Install all packages from the output list above:
sudo apt install <packages from the list above>

In summary;

  1. When a new habanalabs-dkms version is available, you need to remove the previous version's directory under /var/lib/dkms/habanalabs-dkms (for example /var/lib/dkms/habanalabs-dkms/1.13.0-463). This was seen in previous releases and in this release. It should be fixable by actually removing the directory during upgrade/install of habanalabs-dkms.

  2. Your habanalabs-rdma-core package does not seem to cope with MOFED 5.8 LTS (5.8-0.3.7.0) having installed libibverbs.so. This keeps the rest of the packages from installing properly.
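To make point 1 easier to check before an upgrade, here is a rough sketch that only lists version directories left behind in the DKMS tree and deletes nothing. The function name `list_leftover_dkms_dirs` is made up; the tree path and version are the ones from this thread.

```shell
#!/bin/sh
# List leftover DKMS version directories for a module.
# $1: DKMS tree for the module (in this thread:
#     /var/lib/dkms/habanalabs-dkms)
# $2: version about to be installed, e.g. 1.13.0-463
# Prints one directory per line; deletes nothing.
# The function name is hypothetical, chosen for this sketch.
list_leftover_dkms_dirs() {
    tree=$1
    new_version=$2
    [ -d "$tree" ] || return 0
    for dir in "$tree"/*/; do
        [ -d "$dir" ] || continue            # glob matched nothing
        [ -L "${dir%/}" ] && continue        # skip symlinks in the tree
        version=$(basename "$dir")
        [ "$version" = "$new_version" ] && continue
        printf '%s\n' "${dir%/}"
    done
}

# Typical use:
#   list_leftover_dkms_dirs /var/lib/dkms/habanalabs-dkms 1.13.0-463
```

Note that on this system the 1.13.0-463 directory itself also had to be removed by hand before a purge, so the version filter is only a starting point.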

Install after reboot:

root@h001:~# sudo apt list | grep habana | grep 1.13

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

habanalabs-container-runtime/jammy 1.13.0-463 amd64
habanalabs-dkms/jammy 1.13.0-463 all
habanalabs-firmware-tools/jammy 1.13.0-463 amd64
habanalabs-firmware/jammy 1.13.0-463 amd64
habanalabs-graph/jammy 1.13.0-463 amd64
habanalabs-qual/jammy 1.13.0-463 amd64
habanalabs-rdma-core/jammy 1.13.0-463 all
habanalabs-thunk/jammy 1.13.0-463 all
habanatools/jammy 1.13.0-463 amd64
root@h001:~# sudo apt install habanalabs-container-runtime habanalabs-dkms habanalabs-firmware-tools habanalabs-firmware habanalabs-graph habanalabs-qual habanalabs-rdma-core habanalabs-thunk habanatools
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
The following NEW packages will be installed:
habanalabs-container-runtime habanalabs-dkms habanalabs-firmware habanalabs-firmware-tools habanalabs-graph habanalabs-qual
habanalabs-rdma-core habanalabs-thunk habanatools
0 upgraded, 9 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/297 MB of archives.
After this operation, 0 B of additional disk space will be used.
Selecting previously unselected package habanalabs-container-runtime.
(Reading database … 374338 files and directories currently installed.)
Preparing to unpack …/0-habanalabs-container-runtime_1.13.0-463_amd64.deb …
Unpacking habanalabs-container-runtime (1.13.0-463) …
Selecting previously unselected package habanalabs-dkms.
Preparing to unpack …/1-habanalabs-dkms_1.13.0-463_all.deb …
Unpacking habanalabs-dkms (1.13.0-463) …
Selecting previously unselected package habanalabs-firmware.
Preparing to unpack …/2-habanalabs-firmware_1.13.0-463_amd64.deb …
Unpacking habanalabs-firmware (1.13.0-463) …
Selecting previously unselected package habanalabs-rdma-core.
Preparing to unpack …/3-habanalabs-rdma-core_1.13.0-463_all.deb …
Unpacking habanalabs-rdma-core (1.13.0-463) …
Selecting previously unselected package habanalabs-thunk.
Preparing to unpack …/4-habanalabs-thunk_1.13.0-463_all.deb …
Unpacking habanalabs-thunk (1.13.0-463) …
Selecting previously unselected package habanalabs-firmware-tools.
Preparing to unpack …/5-habanalabs-firmware-tools_1.13.0-463_amd64.deb …
Unpacking habanalabs-firmware-tools (1.13.0-463) …
Selecting previously unselected package habanalabs-graph.
Preparing to unpack …/6-habanalabs-graph_1.13.0-463_amd64.deb …
Unpacking habanalabs-graph (1.13.0-463) …
Selecting previously unselected package habanalabs-qual.
Preparing to unpack …/7-habanalabs-qual_1.13.0-463_amd64.deb …
Unpacking habanalabs-qual (1.13.0-463) …
Selecting previously unselected package habanatools.
Preparing to unpack …/8-habanatools_1.13.0-463_amd64.deb …
Unpacking habanatools (1.13.0-463) …
Setting up habanalabs-rdma-core (1.13.0-463) …
/opt/habanalabs/rdma-core/src /
/opt/habanalabs/rdma-core/src/build/lib /opt/habanalabs/rdma-core/src /
ln: failed to create symbolic link ‘libibverbs.so.1’: File exists
dpkg: error processing package habanalabs-rdma-core (--configure):
installed habanalabs-rdma-core package post-installation script subprocess returned error exit status 1
Setting up habanalabs-dkms (1.13.0-463) …
Adding Module to DKMS build system habanalabs-dkms 1.13.0-463
Doing initial module build habanalabs-dkms 1.13.0-463
Secure Boot not enabled on this system.
Installing initial module habanalabs-dkms 1.13.0-463

The driver does not load automatically. To load the driver, do the following:
rmmod habanalabs; rmmod habanalabs_cn; rmmod habanalabs_en
modprobe habanalabs_en && modprobe habanalabs_cn && modprobe habanalabs
or reboot

Habanalabs Driver package was installed successfully.

Setting up habanalabs-firmware (1.13.0-463) …
Firmware package for Habana Labs processing accelerators installation finished.
Setting up habanatools (1.13.0-463) …
Setting up habanalabs-container-runtime (1.13.0-463) …
dpkg: dependency problems prevent configuration of habanalabs-thunk:
habanalabs-thunk depends on habanalabs-rdma-core; however:
Package habanalabs-rdma-core is not configured yet.
Package habanalabs-rdma-core which provides habanalabs-rdma-core is not configured yet.

dpkg: error processing package habanalabs-thunk (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of habanalabs-firmware-tools:
habanalabs-firmware-tools depends on habanalabs-thunk (>= 1.13.0-463) | habanalabs-thunk-internal (>= 1.13.0-463); however:
Package habanalabs-thunk is not configured yet.
Package habanalabs-thunk-internal is not installed.

dpkg: error processing package habanalabs-firmware-tools (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of habanalabs-graph:
habanalabs-graph depends on habanalabs-thunk (>= 1.13.0-463) | habanalabs-thunk-internal (>= 1.13.0-463); however:
Package habanalabs-thunk is not configured yet.
Package habanalabs-thunk-internal is not installed.

dpkg: error processing package habanalabs-graph (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of habanalabs-qual:
habanalabs-qual depends on habanalabs-graph (>= 1.13.0) | habanalabs-graph-internal (>= 1.13.0); however:
Package habanalabs-graph is not configured yet.
Package habanalabs-graph-internal is not installed.

dpkg: error processing package habanalabs-qual (--configure):
dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.35-0ubuntu3.4) …
/sbin/ldconfig.real: /opt/DIS/lib64/libsisci.so.3 is not a symbolic link

Errors were encountered while processing:
habanalabs-rdma-core
habanalabs-thunk
habanalabs-firmware-tools
habanalabs-graph
habanalabs-qual
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)
root@h001:~#

Incompatibility with installed Mellanox OFED 5.8 LTS?

/opt/habanalabs/rdma-core/src/build/lib /opt/habanalabs/rdma-core/src /
ln: failed to create symbolic link ‘libibverbs.so.1’: File exists

It still fails unless I remove the directory as shown below.

root@h001:/etc# rm -rf /var/lib/dkms/habanalabs-dkms/1.13.0-463
root@h001:/etc# apt purge habanalabs-container-runtime habanalabs-dkms habanalabs-firmware habanatools
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
The following packages were automatically installed and are no longer required:
libao-common libao4 libboost-dev libboost-filesystem-dev libboost-filesystem1.74-dev libboost-system-dev libboost-system1.74-dev libboost-system1.74.0 libboost1.74-dev libc6-i386 libcmark-gfm-extensions0.29.0.gfm.3 libcmark-gfm0.29.0.gfm.3
libcurl4-openssl-dev libgsm1 libid3tag0 libllvm14 libmad0 libmp3lame0 libncurses5 libomp-14-dev libomp-dev libomp5 libomp5-14 libopencore-amrnb0 libopencore-amrwb0 libpython3-dev libpython3.10-dev libsox-dev libsox-fmt-all libsox-fmt-alsa libsox-fmt-ao
libsox-fmt-base libsox-fmt-mp3 libsox-fmt-oss libsox-fmt-pulse libsox3 libsystemd-dev libtwolame0 libudev-dev libwavpack1 pandoc pandoc-data python3-dev python3-pip python3-wheel python3.10-dev valgrind
Use ‘apt autoremove’ to remove them.
The following packages will be REMOVED:
habanalabs-container-runtime* habanalabs-dkms* habanalabs-firmware* habanatools*
0 upgraded, 0 newly installed, 4 to remove and 0 not upgraded.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] Y
(Reading database … 375158 files and directories currently installed.)
Removing habanalabs-dkms (1.13.0-463) …

The driver does not unload automatically. To unload the driver, do the following:
rmmod habanalabs; rmmod habanalabs_cn; rmmod habanalabs_ib; rmmod habanalabs_en
or reboot

Habanalabs Driver package was uninstalled successfully.

Removing habanalabs-firmware (1.13.0-463) …
Removing habanatools (1.13.0-463) …
Processing triggers for libc-bin (2.35-0ubuntu3.4) …
/sbin/ldconfig.real: /opt/DIS/lib64/libsisci.so.3 is not a symbolic link

(Reading database … 374342 files and directories currently installed.)
Purging configuration files for habanalabs-dkms (1.13.0-463) …
Purging configuration files for habanatools (1.13.0-463) …
Purging configuration files for habanalabs-container-runtime (1.13.0-463) …
root@h001:/etc# apt list --installed|grep habana

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Then rebooting…

@Sayantan_S any news?

The packages that are not completely installed are needed by some users. The new 1.13 DKMS driver fixes the soft lockups we saw, but the incomplete install causes other issues. Any suggestions on how to fix this?

–ToreL

Here is a response from an expert on installations:

Seems like you had a successful install, but something broke afterwards. We are not able to repro this. Here are some steps that have helped in all situations:

  1. sudo rmmod habanalabs; sudo rmmod habanalabs_cn; sudo rmmod habanalabs_ib; sudo rmmod habanalabs_en
  2. sudo apt purge 'habana*'
  3. sudo rm -rf /opt/habanalabs/rdma-core/src
  4. sudo apt autoremove --fix-broken #(resolve all conflicts)
  5. sudo apt install habanalabs-thunk habanalabs-graph
  6. sudo apt install habanalabs-firmware-tools habanalabs-qual habanalabs-rdma-core habanalabs-firmware habanalabs-container-runtime
  7. sudo apt install habanalabs-dkms
  8. sudo modprobe habanalabs_en && sudo modprobe habanalabs_ib && sudo modprobe habanalabs_cn && sudo modprobe habanalabs #(or reboot)
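The eight steps above can be wrapped in a small script for repeat use. This is a sketch, not an official installer: the `run` helper, the `RUN` variable, and the `reinstall_habana` name are made up for illustration. By default it only prints each command so the sequence can be reviewed first.

```shell
#!/bin/sh
# Dry-run wrapper around the eight steps above. By default each command is
# only printed; set RUN=1 to execute them for real.
# `run`, `RUN` and `reinstall_habana` are hypothetical names for this sketch.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

reinstall_habana() {
    # Step 1: unload the modules; one that is not loaded is not an error.
    for mod in habanalabs habanalabs_cn habanalabs_ib habanalabs_en; do
        run sudo rmmod "$mod" || true
    done
    # Steps 2-4: purge, clear the rdma-core build tree, resolve conflicts.
    run sudo apt purge 'habana*'
    run sudo rm -rf /opt/habanalabs/rdma-core/src
    run sudo apt autoremove --fix-broken
    # Steps 5-7: reinstall in the order given above.
    run sudo apt install habanalabs-thunk habanalabs-graph
    run sudo apt install habanalabs-firmware-tools habanalabs-qual \
        habanalabs-rdma-core habanalabs-firmware habanalabs-container-runtime
    run sudo apt install habanalabs-dkms
    # Step 8: load the modules again (or reboot instead).
    for mod in habanalabs_en habanalabs_ib habanalabs_cn habanalabs; do
        run sudo modprobe "$mod" || return 1
    done
}
```

Calling `reinstall_habana` as-is prints the plan; exporting `RUN=1` first makes it execute. The pattern passed to `apt purge` is quoted so the shell does not expand it against files in the current directory.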

That seems to resolve it. I’m not sure which step fixed it. Note that I had to add

$ sudo rm -rf /var/lib/dkms/habanalabs-dkms/1.13.0-463

to remove the habanalabs-dkms package properly. If you want a complete log, send me a direct email.

Thanks!