Seeing idle load of approx. 12 on Gaudi system with linux-hwe 5.4.0-124-generic

Our Habana Gaudi system (Supermicro) sees an idle load avg of ~12.

I’ve seen a similar issue with the Xilinx XRT drivers for the U250 and U280. It was resolved after I reported it.

I believe something similar is at play with the Habana kernel driver. Is it using polling rather than interrupts?
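One way to probe the polling-versus-interrupts question from userspace is to watch whether the driver's interrupt counters move while the machine sits idle. The sketch below is only an illustration: it sums the per-CPU counts of every `/proc/interrupts` row whose label contains a given string. The label `habanalabs` is an assumption (the name the driver registers its handlers under is not shown in this thread), and `parse_interrupts` and `interrupt_deltas` are hypothetical helper names.

```python
import time


def parse_interrupts(text, name):
    """Sum per-CPU counts for each /proc/interrupts row mentioning `name`."""
    totals = {}
    for line in text.splitlines()[1:]:  # first line is the CPU header
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        if name not in line:
            continue
        count = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the chip/handler description
            count += int(field)
        totals[fields[0].rstrip(":")] = count
    return totals


def interrupt_deltas(name="habanalabs", seconds=5.0, path="/proc/interrupts"):
    """Return {irq: increase} for matching rows over `seconds` of wall time."""
    # `name` is an assumed handler label; adjust it to what the file shows.
    with open(path) as handle:
        before = parse_interrupts(handle.read(), name)
    time.sleep(seconds)
    with open(path) as handle:
        after = parse_interrupts(handle.read(), name)
    return {irq: after[irq] - before.get(irq, 0) for irq in after}
```

Calling `interrupt_deltas()` on the idle host and getting all zeros (or no matching rows) would be consistent with polling, but it is weak evidence on its own, since an idle device may simply have nothing to signal.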


Kernel: Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-124-generic x86_64)
ii  habanalabs-container-runtime           1.6.0-439                                       amd64        Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii  habanalabs-dkms                        1.6.0-439                                       all          habanalabs driver in DKMS format.
ii  habanalabs-firmware                    1.6.0-439                                       amd64        Firmware package for Habana Labs processing accelerators
ii  habanalabs-firmware-odm                1.1.0-614                                       amd64        Firmware ODM package for Habana Labs processing accelerators
ii  habanalabs-firmware-tools              1.6.0-439                                       amd64        Habanalabs firmware tools package
ii  habanalabs-graph                       1.6.0-439                                       amd64        habanalabs graph compiler
ii  habanalabs-qual                        1.6.0-439                                       amd64        This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii  habanalabs-thunk                       1.6.0-439                                       all          habanalabs thunk
ii  habanatools                            1.6.0-439                                       amd64        Habana Labs tools package
[    5.686464] Kernel command line: BOOT_IMAGE=images/default-habana-image/vmlinuz initrd=images/default-habana-image/initrd console=tty0 console=ttyS0,115200n8 rd.blacklist=nouveau ip=10.128.0.33:10.128.0.2:10.128.1.254:255.255.254.0 BOOTIF=01-b8-ce-f6-ad-84-8c
[  120.721964] habanalabs_en: loading driver, version: 1.6.0-3c06a7c
[  121.670217] habanalabs: loading driver, version: 1.6.0-3c06a7c
[  121.670421] habanalabs 0000:34:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.670522] habanalabs 0000:34:00.0: enabling device (0140 -> 0142)
[  121.670546] habanalabs 0000:34:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.673718] habanalabs 0000:1a:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.673808] habanalabs 0000:1a:00.0: enabling device (0140 -> 0142)
[  121.673828] habanalabs 0000:1a:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.673954] habanalabs 0000:33:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.674032] habanalabs 0000:33:00.0: enabling device (0140 -> 0142)
[  121.674053] habanalabs 0000:33:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.674367] habanalabs 0000:19:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.674452] habanalabs 0000:19:00.0: enabling device (0140 -> 0142)
[  121.674465] habanalabs 0000:19:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.677694] habanalabs 0000:b3:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.677794] habanalabs 0000:b3:00.0: enabling device (0140 -> 0142)
[  121.677819] habanalabs 0000:b3:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.682745] habanalabs 0000:b4:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.682833] habanalabs 0000:b4:00.0: enabling device (0140 -> 0142)
[  121.682851] habanalabs 0000:b4:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.682933] habanalabs 0000:cd:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.683015] habanalabs 0000:cd:00.0: enabling device (0140 -> 0142)
[  121.683038] habanalabs 0000:cd:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.683331] habanalabs 0000:cc:00.0: habanalabs device found [1da3:1000] (rev 1)
[  121.683413] habanalabs 0000:cc:00.0: enabling device (0140 -> 0142)
[  121.683425] habanalabs 0000:cc:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  121.782654] habanalabs hl3: Loading firmware to device, may take some time...
[  121.782666] habanalabs hl0: Loading firmware to device, may take some time...
[  121.783004] habanalabs hl2: Loading firmware to device, may take some time...
[  121.783086] habanalabs hl1: Loading firmware to device, may take some time...
[  121.810415] habanalabs hl7: Loading firmware to device, may take some time...
[  121.823809] habanalabs hl6: Loading firmware to device, may take some time...
[  121.823819] habanalabs hl5: Loading firmware to device, may take some time...
[  121.823828] habanalabs hl4: Loading firmware to device, may take some time...
[  121.849780] habanalabs hl0: BTL version 81608d8d
[  121.849782] habanalabs hl0: preboot version 32.3.5-sec-4
[  121.870246] habanalabs hl3: BTL version 81608d8d
[  121.870248] habanalabs hl3: preboot version 32.3.5-sec-4
[  121.881785] habanalabs hl7: BTL version 81608d8d
[  121.881787] habanalabs hl7: preboot version 32.3.5-sec-4
[  121.891063] habanalabs hl2: BTL version 81608d8d
[  121.891065] habanalabs hl2: preboot version 32.3.5-sec-4
[  121.909783] habanalabs hl5: BTL version 81608d8d
[  121.909785] habanalabs hl5: preboot version 32.3.5-sec-4
[  121.911549] habanalabs hl1: BTL version 81608d8d
[  121.911551] habanalabs hl1: preboot version 32.3.5-sec-4
[  121.930248] habanalabs hl6: BTL version 81608d8d
[  121.930250] habanalabs hl6: preboot version 32.3.5-sec-4
[  121.951108] habanalabs hl4: BTL version 81608d8d
[  121.951110] habanalabs hl4: preboot version 32.3.5-sec-4
[  129.903679] habanalabs hl1: boot-fit version 32.6.3-sec-4
[  129.904363] habanalabs hl3: boot-fit version 32.6.3-sec-4
[  129.905045] habanalabs hl0: boot-fit version 32.6.3-sec-4
[  129.905732] habanalabs hl2: boot-fit version 32.6.3-sec-4
[  129.937375] habanalabs hl6: boot-fit version 32.6.3-sec-4
[  129.938319] habanalabs hl5: boot-fit version 32.6.3-sec-4
[  129.939261] habanalabs hl7: boot-fit version 32.6.3-sec-4
[  129.939946] habanalabs hl4: boot-fit version 32.6.3-sec-4
[  131.080082] habanalabs hl1: Successfully loaded firmware to device
[  131.080951] habanalabs hl3: Successfully loaded firmware to device
[  131.081790] habanalabs hl0: Successfully loaded firmware to device
[  131.082615] habanalabs hl2: Successfully loaded firmware to device
[  131.109653] habanalabs hl5: Successfully loaded firmware to device
[  131.110521] habanalabs hl4: Successfully loaded firmware to device
[  131.117644] habanalabs hl7: Successfully loaded firmware to device
[  131.118487] habanalabs hl6: Successfully loaded firmware to device
[  133.669654] habanalabs hl3: Linux version 32.6.3-sec-4
[  133.686671] habanalabs hl0: Linux version 32.6.3-sec-4
[  133.703671] habanalabs hl6: Linux version 32.6.3-sec-4
[  133.705658] habanalabs hl5: Linux version 32.6.3-sec-4
[  133.709182] habanalabs hl1: Linux version 32.6.3-sec-4
[  133.716062] habanalabs hl2: Linux version 32.6.3-sec-4
[  133.722654] habanalabs hl4: Linux version 32.6.3-sec-4
[  133.733648] habanalabs hl3: Found GAUDI device with 32GB DRAM
[  133.738658] habanalabs hl7: Linux version 32.6.3-sec-4
[  133.742187] habanalabs hl0: Found GAUDI device with 32GB DRAM
[  133.758654] habanalabs hl1: Found GAUDI device with 32GB DRAM
[  133.759191] habanalabs hl5: Found GAUDI device with 32GB DRAM
[  133.762667] habanalabs hl6: Found GAUDI device with 32GB DRAM
[  133.773649] habanalabs hl4: Found GAUDI device with 32GB DRAM
[  133.780678] habanalabs hl2: Found GAUDI device with 32GB DRAM
[  133.799666] habanalabs hl7: Found GAUDI device with 32GB DRAM
[  134.858468] habanalabs 0000:34:00.0 enp52s0d1: renamed from eth0
[  134.871814] habanalabs hl0: hwmon3: add sensors information
[  134.871815] habanalabs hl0: Successfully added device to habanalabs driver
[  134.902462] habanalabs 0000:34:00.0 enp52s0d8: renamed from eth1
[  134.941984] habanalabs 0000:34:00.0 enp52s0d9: renamed from eth2
[  134.974676] habanalabs 0000:cd:00.0 enp205s0d1: renamed from eth0
[  134.982797] habanalabs hl1: hwmon4: add sensors information
[  134.982798] habanalabs hl1: Successfully added device to habanalabs driver
[  134.991859] habanalabs hl3: hwmon5: add sensors information
[  134.991860] habanalabs hl3: Successfully added device to habanalabs driver
[  134.993757] habanalabs hl6: hwmon6: add sensors information
[  134.993759] habanalabs hl6: Successfully added device to habanalabs driver
[  134.998297] habanalabs 0000:1a:00.0 ens2d1: renamed from eth4
[  135.038029] habanalabs 0000:19:00.0 ens1d1: renamed from eth3
[  135.061860] habanalabs 0000:cd:00.0 enp205s0d8: renamed from eth5
[  135.093850] habanalabs 0000:19:00.0 ens1d8: renamed from eth2
[  135.134469] habanalabs 0000:1a:00.0 ens2d9: renamed from eth10
[  135.143561] habanalabs hl5: hwmon7: add sensors information
[  135.143562] habanalabs hl5: Successfully added device to habanalabs driver
[  135.147801] habanalabs hl2: hwmon8: add sensors information
[  135.147803] habanalabs hl2: Successfully added device to habanalabs driver
[  135.165983] habanalabs 0000:33:00.0 enp51s0d8: renamed from eth11
[  135.193971] habanalabs 0000:b4:00.0 enp180s0d1: renamed from eth1
[  135.206746] habanalabs hl4: hwmon9: add sensors information
[  135.206747] habanalabs hl4: Successfully added device to habanalabs driver
[  135.225925] habanalabs 0000:cd:00.0 enp205s0d9: renamed from eth8
[  135.273777] habanalabs 0000:1a:00.0 ens2d8: renamed from eth7
[  135.301850] habanalabs 0000:19:00.0 ens1d9: renamed from eth9
[  135.333766] habanalabs 0000:33:00.0 enp51s0d1: renamed from eth0
[  135.366012] habanalabs 0000:33:00.0 enp51s0d9: renamed from eth2
[  135.397989] habanalabs 0000:b4:00.0 enp180s0d9: renamed from eth3
[  135.408708] habanalabs hl7: hwmon10: add sensors information
[  135.408709] habanalabs hl7: Successfully added device to habanalabs driver
[  135.433801] habanalabs 0000:cc:00.0 enp204s0d1: renamed from eth13
[  135.469783] habanalabs 0000:b4:00.0 enp180s0d8: renamed from eth12
[  135.501891] habanalabs 0000:cc:00.0 enp204s0d9: renamed from eth0
[  135.537728] habanalabs 0000:cc:00.0 enp204s0d8: renamed from eth1
[  135.573969] habanalabs 0000:b3:00.0 enp179s0d1: renamed from eth6
[  135.613747] habanalabs 0000:b3:00.0 enp179s0d9: renamed from eth5
[  135.657835] habanalabs 0000:b3:00.0 enp179s0d8: renamed from eth4
[  137.065649] habanalabs hl0: link up, port 3
[  137.069649] habanalabs hl4: link up, port 3
[  137.129651] habanalabs hl2: link up, port 3
[  137.129656] habanalabs hl2: link up, port 6
[  137.161654] habanalabs hl5: link up, port 5
[  137.193648] habanalabs hl5: link up, port 6
[  137.225632] habanalabs hl5: link up, port 3
[  137.257647] habanalabs hl2: link up, port 5
[  137.353633] habanalabs hl6: link up, port 5
[  137.353636] habanalabs hl4: link up, port 5
[  137.385644] habanalabs hl7: link up, port 5
[  137.421642] habanalabs hl1: link up, port 4
[  137.421647] habanalabs hl3: link up, port 5
[  137.453642] habanalabs hl3: link up, port 4
[  137.485642] habanalabs hl1: link up, port 5
[  137.485644] habanalabs hl5: link up, port 4
[  137.485649] habanalabs hl0: link up, port 5
[  137.485651] habanalabs hl4: link up, port 6
[  137.517643] habanalabs hl4: link up, port 4
[  137.549640] habanalabs hl0: link up, port 6
[  137.613641] habanalabs hl0: link up, port 2
[  137.645641] habanalabs hl1: link up, port 3
[  137.645644] habanalabs hl7: link up, port 3
[  137.673639] habanalabs hl1: link up, port 0
[  137.677643] habanalabs hl5: link up, port 0
[  137.705640] habanalabs hl6: link up, port 3
[  137.801643] habanalabs hl1: link up, port 2
[  137.805630] habanalabs hl5: link up, port 7
[  137.805633] habanalabs hl6: link up, port 0
[  137.833642] habanalabs hl7: link up, port 6
[  137.833656] habanalabs hl3: link up, port 3
[  137.833661] habanalabs hl3: link up, port 6
[  137.865652] habanalabs hl0: link up, port 7
[  137.897634] habanalabs hl1: link up, port 6
[  137.929647] habanalabs hl3: link up, port 7
[  137.929652] habanalabs hl2: link up, port 4
[  137.961640] habanalabs hl1: link up, port 7
[  137.961644] habanalabs hl4: link up, port 2
[  137.961649] habanalabs hl6: link up, port 6
[  137.993685] habanalabs hl2: link up, port 7
[  137.993688] habanalabs hl2: link up, port 0
[  138.029639] habanalabs hl3: link up, port 2
[  138.121637] habanalabs hl0: link up, port 0
[  138.125642] habanalabs hl7: link up, port 7
[  138.217641] habanalabs hl7: link up, port 4
[  138.253642] habanalabs hl2: link up, port 2
[  138.281636] habanalabs hl4: link up, port 0
[  138.345642] habanalabs hl7: link up, port 2
[  138.377672] habanalabs hl3: link up, port 0
[  138.537671] habanalabs hl6: link up, port 2
[  138.569672] habanalabs hl7: link up, port 0
[  138.601671] habanalabs hl6: link up, port 4
[  138.601674] habanalabs hl5: link up, port 2
[  138.857650] habanalabs hl0: link up, port 4
[  138.921647] habanalabs hl6: link up, port 7
[  138.921654] habanalabs hl4: link up, port 7

Brgds,
Tor

Thanks for posting. We are taking a look at your query and will get back to you.

We use polling in the networking driver, but it should not consume that much CPU. We suspect that some network traffic is reaching the interfaces, and we recommend using ethtool to check whether the packet counters on the Gaudi network interfaces are increasing.
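The check suggested above can be sketched as two snapshots of the packet counters taken a few seconds apart. Note the swap: instead of calling ethtool, this sketch reads the generic per-interface counters under `/sys/class/net/<iface>/statistics`, which may not include every driver-specific counter that `ethtool -S <iface>` reports. `read_counters` and `moving_interfaces` are hypothetical helper names, not part of any Habana tooling.

```python
import os

COUNTERS = ("rx_packets", "tx_packets")


def read_counters(root="/sys/class/net"):
    """Return {interface: {counter: value}} from the per-netdev statistics."""
    snapshot = {}
    for iface in sorted(os.listdir(root)):
        stats_dir = os.path.join(root, iface, "statistics")
        if not os.path.isdir(stats_dir):
            continue  # skip entries that are not network devices
        values = {}
        for counter in COUNTERS:
            with open(os.path.join(stats_dir, counter)) as handle:
                values[counter] = int(handle.read())
        snapshot[iface] = values
    return snapshot


def moving_interfaces(before, after):
    """Return {interface: {counter: delta}} for counters that increased."""
    moving = {}
    for iface, values in after.items():
        previous = before.get(iface, {})
        deltas = {
            name: value - previous.get(name, 0)
            for name, value in values.items()
            if value - previous.get(name, 0) > 0
        }
        if deltas:
            moving[iface] = deltas
    return moving
```

Taking `before = read_counters()`, waiting a few seconds, then calling `moving_interfaces(before, read_counters())` lists only the interfaces that saw packets in between. An empty result for the Gaudi ports would suggest traffic is not the cause.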

No, there doesn’t seem to be any traffic on the interfaces. I’m still seeing an idle load average above 11.

Suggestions?

root@h001:~# w -u
 23:01:53 up 27 min,  1 user,  load average: 14.53, 14.13, 11.84
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    10.128.0.1       22:53    1.00s  0.05s  0.01s w -u

I’ve redacted the MACs of the primary NIC (CX6 VPI 100 Gbps Ethernet) and of the HDR IB interface (RDMA for BeeGFS and MPI) in the output below.

Upgraded the kernel to 5.4.0-195-generic (linux-hwe) and installed the latest Habana packages.

root@h001:~# dpkg -l |grep -i habana
ii  habanalabs-container-runtime           1.6.1-92                                        amd64        Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii  habanalabs-dkms                        1.6.1-92                                        all          habanalabs driver in DKMS format.
ii  habanalabs-firmware                    1.6.1-92                                        amd64        Firmware package for Habana Labs processing accelerators
ii  habanalabs-firmware-odm                1.1.0-614                                       amd64        Firmware ODM package for Habana Labs processing accelerators
ii  habanalabs-firmware-tools              1.6.1-92                                        amd64        Habanalabs firmware tools package
ii  habanalabs-graph                       1.6.1-92                                        amd64        habanalabs graph compiler
ii  habanalabs-qual                        1.6.1-92                                        amd64        This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii  habanalabs-thunk                       1.6.1-92                                        all          habanalabs thunk
ii  habanatools                            1.6.1-92                                        amd64        Habana Labs tools package

root@h001:~# ifconfig -a
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:1a:8c:52:a3  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker_gwbridge: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:e3:22:a3:43  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp101s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.128.0.33  netmask 255.255.254.0  broadcast 10.128.1.255
        ether XX:YY:ZZ:XX:YY:ZZ  txqueuelen 1000  (Ethernet)
        RX packets 3222889  bytes 608830529 (608.8 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2977987  bytes 225996890 (225.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp101s0f1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether XX:YY:ZZ:XX:YY:ZY  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp152s0f0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b8:ce:f6:d6:26:82  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp179s0d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:2e:a6  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp179s0d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:2e:ad  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp179s0d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:2e:ae  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp180s0d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:34:be  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp180s0d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:34:c5  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp180s0d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:34:c6  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp204s0d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:15:a6  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp204s0d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:15:ad  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp204s0d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:15:ae  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp205s0d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:15:7e  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp205s0d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:15:85  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp205s0d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:15:86  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp51s0d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:35:86  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp51s0d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:35:8d  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp51s0d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:35:8e  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp52s0d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:12:22  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp52s0d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:12:29  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp52s0d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:12:2a  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens1d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:0b:e2  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens1d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:0b:e9  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens1d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:0b:ea  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens2d1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:23:b6  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens2d8: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:23:bd  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens2d9: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:fd:0b:d3:23:be  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enxb03af2b6059f: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether b0:3a:f2:b6:05:9f  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.128.2.33  netmask 255.255.254.0  broadcast 10.128.3.255
        unspec 20-XX-XX-XX-XX-XX-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  (UNSPEC)
        RX packets 29  bytes 3849 (3.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 50  bytes 4716 (4.7 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1250  bytes 70859 (70.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1250  bytes 70859 (70.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Still have an idle load of around 12.

Kernel 5.4.0-131-generic, Ubuntu 18.04.6 LTS. Habana software upgraded to:

root@h001:~# dpkg -l |grep -i habanalabs
ii  habanalabs-container-runtime           1.7.0-665                                       amd64        Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii  habanalabs-dkms                        1.7.0-665                                       all          habanalabs driver in DKMS format.
ii  habanalabs-firmware                    1.7.0-665                                       amd64        Firmware package for Habana Labs processing accelerators
ii  habanalabs-firmware-odm                1.1.0-614                                       amd64        Firmware ODM package for Habana Labs processing accelerators
ii  habanalabs-firmware-tools              1.7.0-665                                       amd64        Habanalabs firmware tools package
ii  habanalabs-graph                       1.7.0-665                                       amd64        habanalabs graph compiler
ii  habanalabs-qual                        1.7.0-665                                       amd64        This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii  habanalabs-thunk                       1.7.0-665                                       all          habanalabs thunk

@torel
Just so we are on the same page, could you please clarify what you mean by “idle load of around 12”, i.e. what you are running to get this number?

Hi Sayantan_S,

Sorry for the late response. It’s seen in top, htop and similar tools.

root@h001:~# top -b | head
top - 10:55:15 up 1 day, 17:21,  1 user,  load average: 12.62, 12.62, 12.29
Tasks: 2580 total,   1 running, 754 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 21133976+total, 20713639+free, 39297040 used,  2736580 buff/cache
KiB Swap: 33554428 total, 33554428 free,        0 used. 20665077+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 24002 root      20   0   34956   6076   2932 R  16.7  0.0   0:00.05 top
     1 root      20   0   79372  10440   6764 S   0.0  0.0   1:09.82 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:06.54 kthreadd

But what I notice is that sar reports 100% idle.

root@h001:~# sar -u 1 10000
Linux 5.4.0-137-generic (h001) 	02/06/2023 	_x86_64_	(144 CPU)

10:57:21 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:57:22 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
10:57:23 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
10:57:24 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
10:57:25 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
10:57:26 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
10:57:27 AM     all      0.00      0.00      0.01      0.00      0.00     99.99
^C
Average:        all      0.00      0.00      0.00      0.00      0.00    100.00
root@h001:~# 
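A load average of about 12 alongside 100% idle CPU is not contradictory on Linux: the load average counts tasks in uninterruptible sleep (state `D`) as well as runnable ones (state `R`), and `D`-state tasks use no CPU time. A minimal sketch for listing such tasks follows, assuming the usual `/proc/<pid>/task/<tid>/stat` layout. `parse_stat` and `tasks_in_states` are hypothetical helper names, and whether the tasks it finds belong to the habanalabs driver is something the output would have to show.

```python
import glob


def parse_stat(stat_line):
    """Return (comm, state) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'.
    """
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return comm, state


def tasks_in_states(states=("R", "D"), proc_root="/proc"):
    """List (tid, comm, state) for every thread currently in one of `states`."""
    found = []
    for path in glob.glob(f"{proc_root}/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as handle:
                comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # thread exited, or the line was not parseable
        if state in states:
            tid = int(path.split("/")[-2])
            found.append((tid, comm, state))
    return sorted(found)
```

Running `tasks_in_states(states=("D",))` a few times on the idle host and seeing roughly a dozen entries with the same names each time would point at the threads that hold the load average up.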

I’ve seen the same issue with other drivers. Xilinx had the same issue with their U250/U280 drivers, but they fixed it.

Currently using

root@h001:~# dpkg -l |grep -i habana
ii  habanalabs-container-runtime           1.8.0-690                                       amd64        Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii  habanalabs-dkms                        1.8.0-690                                       all          habanalabs driver in DKMS format.
ii  habanalabs-firmware                    1.8.0-690                                       amd64        Firmware package for Habana Labs processing accelerators
ii  habanalabs-firmware-odm                1.1.0-614                                       amd64        Firmware ODM package for Habana Labs processing accelerators
ii  habanalabs-firmware-tools              1.8.0-690                                       amd64        Habanalabs firmware tools package
ii  habanalabs-graph                       1.8.0-690                                       amd64        habanalabs graph compiler
ii  habanalabs-horovod                     1.3.0-499                                       all          TF/Horovod package for Habana Labs processing accelerators
ii  habanalabs-qual                        1.8.0-690                                       amd64        This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii  habanalabs-sw-tools                    1.6.1-92                                        amd64        Internal SW Tools package for Habana Labs processing accelerators
ii  habanalabs-thunk                       1.8.0-690                                       all          habanalabs thunk
ii  habanatools                            1.8.0-690                                       amd64        Habana Labs tools package

Sayantan_S,

Any news on this "idle load" issue? Is it a driver issue, perhaps SMP locking?

Xilinx had the same issue with their U250/U280 drivers but fixed it.

Current driver versions are shown below.

root@h001:~# dmesg| grep -i habana
[    0.000000] Command line: BOOT_IMAGE=images/default-habana-image/vmlinuz initrd=images/default-habana-image/initrd console=tty0 console=ttyS1,115200n8r rd.blacklist=nouveau ip=10.10.10.33:10.10.10.1:10.10.11.254:255.255.254.0 BOOTIF=01-22-33-44-55-66-77
[    5.624787] Kernel command line: BOOT_IMAGE=images/default-habana-image/vmlinuz initrd=images/default-habana-image/initrd console=tty0 console=ttyS1,115200n8r rd.blacklist=nouveau ip=10.10.10.33:10.10.10.1:10.10.11.254:255.255.254.0 BOOTIF=01-22-33-44-55-66-77
[  178.329319] habanalabs_en: loading driver, version: 1.8.0-a9c2c49
[  179.541191] habanalabs: loading driver, version: 1.8.0-a9c2c49
[  179.543029] habanalabs 0000:19:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.543122] habanalabs 0000:19:00.0: enabling device (0140 -> 0142)
[  179.543145] habanalabs 0000:19:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.544982] habanalabs 0000:1a:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.545064] habanalabs 0000:1a:00.0: enabling device (0140 -> 0142)
[  179.545078] habanalabs 0000:1a:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.545137] habanalabs 0000:33:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.545220] habanalabs 0000:33:00.0: enabling device (0140 -> 0142)
[  179.545240] habanalabs 0000:33:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.545629] habanalabs 0000:34:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.545662] habanalabs 0000:b3:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.545715] habanalabs 0000:34:00.0: enabling device (0140 -> 0142)
[  179.545728] habanalabs 0000:34:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.545768] habanalabs 0000:b3:00.0: enabling device (0140 -> 0142)
[  179.545795] habanalabs 0000:b3:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.545889] habanalabs 0000:b4:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.545981] habanalabs 0000:b4:00.0: enabling device (0140 -> 0142)
[  179.545993] habanalabs 0000:b4:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.546262] habanalabs 0000:cd:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.546345] habanalabs 0000:cd:00.0: enabling device (0140 -> 0142)
[  179.546365] habanalabs 0000:cd:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.549666] habanalabs 0000:cc:00.0: habanalabs device found [1da3:1000] (rev 1)
[  179.549750] habanalabs 0000:cc:00.0: enabling device (0140 -> 0142)
[  179.549767] habanalabs 0000:cc:00.0: PCI INT A: no GSI - using ISA IRQ 11
[  179.653424] habanalabs hl0: Loading firmware to device, may take some time...
[  179.653756] habanalabs hl5: Loading firmware to device, may take some time...
[  179.653765] habanalabs hl4: Loading firmware to device, may take some time...
[  179.654927] habanalabs hl1: Loading firmware to device, may take some time...
[  179.654935] habanalabs hl3: Loading firmware to device, may take some time...
[  179.655002] habanalabs hl2: Loading firmware to device, may take some time...
[  179.655011] habanalabs hl7: Loading firmware to device, may take some time...
[  179.655093] habanalabs hl6: Loading firmware to device, may take some time...
[  179.718415] habanalabs hl0: BTL version 81608d8d
[  179.718417] habanalabs hl0: preboot version 32.3.5-sec-4
[  179.718429] habanalabs hl5: BTL version 81608d8d
[  179.718431] habanalabs hl5: preboot version 32.3.5-sec-4
[  179.739112] habanalabs hl4: BTL version 81608d8d
[  179.739114] habanalabs hl4: preboot version 32.3.5-sec-4
[  179.739475] habanalabs hl2: BTL version 81608d8d
[  179.739476] habanalabs hl2: preboot version 32.3.5-sec-4
[  179.759793] habanalabs hl6: BTL version 81608d8d
[  179.759795] habanalabs hl6: preboot version 32.3.5-sec-4
[  179.760185] habanalabs hl1: BTL version 81608d8d
[  179.760187] habanalabs hl1: preboot version 32.3.5-sec-4
[  179.780876] habanalabs hl7: BTL version 81608d8d
[  179.780877] habanalabs hl7: preboot version 32.3.5-sec-4
[  179.781235] habanalabs hl3: BTL version 81608d8d
[  179.781236] habanalabs hl3: preboot version 32.3.5-sec-4
[  187.754052] habanalabs hl0: boot-fit version 32.6.6-sec-4
[  187.754771] habanalabs hl2: boot-fit version 32.6.6-sec-4
[  187.755487] habanalabs hl3: boot-fit version 32.6.6-sec-4
[  187.756201] habanalabs hl1: boot-fit version 32.6.6-sec-4
[  187.762551] habanalabs hl4: boot-fit version 32.6.6-sec-4
[  187.763589] habanalabs hl5: boot-fit version 32.6.6-sec-4
[  187.764510] habanalabs hl7: boot-fit version 32.6.6-sec-4
[  187.765228] habanalabs hl6: boot-fit version 32.6.6-sec-4
[  188.947402] habanalabs hl0: Successfully loaded firmware to device
[  188.948271] habanalabs hl2: Successfully loaded firmware to device
[  188.949100] habanalabs hl1: Successfully loaded firmware to device
[  188.957388] habanalabs hl3: Successfully loaded firmware to device
[  188.957425] habanalabs hl5: Successfully loaded firmware to device
[  188.958448] habanalabs hl7: Successfully loaded firmware to device
[  188.963258] habanalabs hl6: Successfully loaded firmware to device
[  188.964106] habanalabs hl4: Successfully loaded firmware to device
[  191.514136] habanalabs hl0: Linux version 32.6.6-sec-4
[  191.517108] habanalabs hl4: Linux version 32.6.6-sec-4
[  191.528109] habanalabs hl1: Linux version 32.6.6-sec-4
[  191.557196] habanalabs hl7: Linux version 32.6.6-sec-4
[  191.559218] habanalabs hl5: Linux version 32.6.6-sec-4
[  191.578604] habanalabs hl6: Linux version 32.6.6-sec-4
[  191.582610] habanalabs hl3: Linux version 32.6.6-sec-4
[  191.588640] habanalabs hl2: Linux version 32.6.6-sec-4
[  191.617051] habanalabs hl0: Found GAUDI device with 32GB DRAM
[  191.617566] habanalabs hl4: Found GAUDI device with 32GB DRAM
[  191.627085] habanalabs hl1: Found GAUDI device with 32GB DRAM
[  191.646638] habanalabs hl7: Found GAUDI device with 32GB DRAM
[  191.668995] habanalabs hl5: Found GAUDI device with 32GB DRAM
[  191.677040] habanalabs hl6: Found GAUDI device with 32GB DRAM
[  191.700023] habanalabs hl2: Found GAUDI device with 32GB DRAM
[  191.710032] habanalabs hl3: Found GAUDI device with 32GB DRAM
[  192.725298] habanalabs 0000:b3:00.0 enp179s0d1: renamed from eth0
[  192.740759] habanalabs hl4: hwmon3: add sensors information
[  192.740761] habanalabs hl4: Successfully added device to habanalabs driver
[  192.748014] habanalabs 0000:b3:00.0 enp179s0d9: renamed from eth2
[  192.767598] habanalabs 0000:b3:00.0 enp179s0d8: renamed from eth1
[  192.808410] habanalabs 0000:19:00.0 ens1d1: renamed from eth0
[  192.821888] habanalabs hl0: hwmon4: add sensors information
[  192.821890] habanalabs hl0: Successfully added device to habanalabs driver
[  192.828966] habanalabs hl1: hwmon5: add sensors information
[  192.828968] habanalabs hl1: Successfully added device to habanalabs driver
[  192.836169] habanalabs 0000:1a:00.0 ens2d1: renamed from eth3
[  192.871632] habanalabs 0000:19:00.0 ens1d8: renamed from eth1
[  192.924042] habanalabs 0000:1a:00.0 ens2d9: renamed from eth5
[  192.941778] habanalabs hl5: hwmon6: add sensors information
[  192.941779] habanalabs hl5: Successfully added device to habanalabs driver
[  192.947733] habanalabs 0000:b4:00.0 enp180s0d1: renamed from eth0
[  192.961744] habanalabs hl7: hwmon7: add sensors information
[  192.961745] habanalabs hl7: Successfully added device to habanalabs driver
[  192.975703] habanalabs 0000:cd:00.0 enp205s0d1: renamed from eth7
[  193.011717] habanalabs 0000:cc:00.0 enp204s0d8: renamed from eth9
[  193.035684] habanalabs 0000:b4:00.0 enp180s0d8: renamed from eth8
[  193.071747] habanalabs 0000:cc:00.0 enp204s0d1: renamed from eth6
[  193.091448] habanalabs 0000:1a:00.0 ens2d8: renamed from eth2
[  193.123435] habanalabs 0000:19:00.0 ens1d9: renamed from eth4
[  193.163816] habanalabs 0000:cd:00.0 enp205s0d8: renamed from eth3
[  193.183684] habanalabs hl6: hwmon8: add sensors information
[  193.183686] habanalabs hl6: Successfully added device to habanalabs driver
[  193.186671] habanalabs hl2: hwmon9: add sensors information
[  193.186672] habanalabs hl2: Successfully added device to habanalabs driver
[  193.195500] habanalabs 0000:b4:00.0 enp180s0d9: renamed from eth11
[  193.227450] habanalabs 0000:cc:00.0 enp204s0d9: renamed from eth5
[  193.263646] habanalabs 0000:34:00.0 enp52s0d1: renamed from eth0
[  193.280624] habanalabs hl3: hwmon10: add sensors information
[  193.280625] habanalabs hl3: Successfully added device to habanalabs driver
[  193.303445] habanalabs 0000:34:00.0 enp52s0d8: renamed from eth7
[  193.335770] habanalabs 0000:34:00.0 enp52s0d9: renamed from eth3
[  193.375423] habanalabs 0000:cd:00.0 enp205s0d9: renamed from eth4
[  193.419609] habanalabs 0000:33:00.0 enp51s0d8: renamed from eth10
[  193.455269] habanalabs 0000:33:00.0 enp51s0d1: renamed from eth1
[  193.487340] habanalabs 0000:33:00.0 enp51s0d9: renamed from eth2

Hi,

From your previous post, I see this:

habanalabs-firmware-odm                1.1.0-614   
habanalabs-sw-tools                    1.6.1-92
habanalabs-horovod                     1.3.0-499 

while other parts of the stack are at 1.8.

Can you please upgrade everything to 1.8? Please refer to this page.
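
To spot this kind of version mismatch quickly, the package list can be checked mechanically. The following is a minimal sketch, not an official Habana tool: the function name `find_version_outliers` and the "compare major.minor against the most common release" rule are assumptions made for illustration. The sample lines are taken from the `dpkg -l` output posted above.

```python
import re
from collections import Counter

# Sample lines copied from the dpkg -l output earlier in this thread.
SAMPLE = """\
ii  habanalabs-qual      1.8.0-690  amd64  This package contains Habanalabs qualification package.
ii  habanalabs-sw-tools  1.6.1-92   amd64  Internal SW Tools package for Habana Labs processing accelerators
ii  habanalabs-thunk     1.8.0-690  all    habanalabs thunk
ii  habanatools          1.8.0-690  amd64  Habana Labs tools package
"""


def release(version):
    """Reduce a version such as '1.8.0-690' to its major.minor part ('1.8')."""
    match = re.match(r"(\d+\.\d+)", version)
    return match.group(1) if match else version


def find_version_outliers(dpkg_output, prefix="habana"):
    """Return (majority_release, {package: version}) for packages off the majority.

    Hypothetical helper: 'majority' is simply the most common major.minor
    release among installed packages whose name starts with `prefix`.
    """
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages start with status "ii", followed by name and version.
        if len(fields) >= 3 and fields[0] == "ii" and fields[1].startswith(prefix):
            versions[fields[1]] = fields[2]
    if not versions:
        return None, {}
    majority = Counter(release(v) for v in versions.values()).most_common(1)[0][0]
    outliers = {p: v for p, v in versions.items() if release(v) != majority}
    return majority, outliers


if __name__ == "__main__":
    majority, outliers = find_version_outliers(SAMPLE)
    print("majority release:", majority)
    for package, version in sorted(outliers.items()):
        print("mismatch:", package, version)
```

On a live system the input could come from `subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout` instead of the pasted sample.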

This is still an issue under Ubuntu 22.04.3 LTS with kernel 5.15.0-86-generic.

It is related to the Habana drivers. On our Habana Gaudi system the idle load average is around 8-12 according to top and w -u.

Interestingly, vmstat (procps) and mpstat (sysstat) show different (correct) figures. I'm not sure I understand why they differ, but as I said, I've seen this before with kernel drivers.

Maybe a semaphore or mutex release is missing somewhere in the drivers?
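
One way to narrow this down: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable tasks, whereas vmstat and mpstat report CPU time. Threads that sit in state D without using CPU would therefore raise the load average while the CPUs look idle. Whether the Habana driver does this is an assumption here, not something confirmed in this thread. The sketch below lists threads currently in state D by reading `/proc`; the function names are hypothetical and it assumes the standard `/proc/<pid>/task/<tid>/stat` layout.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def task_state(stat_text):
    """Extract (comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = stat_text.rpartition(")")
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, tid, comm)] for every thread currently in state D."""
    found = []
    try:
        pids = [name for name in os.listdir(proc_root) if name.isdigit()]
    except OSError:
        return found
    for pid in pids:
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as handle:
                    comm, state = task_state(handle.read())
            except (OSError, IndexError):
                continue  # thread exited or the line was not parseable
            if state == "D":
                found.append((int(pid), int(tid), comm))
    return found


if __name__ == "__main__":
    with open("/proc/loadavg") as handle:
        print("load average:", parse_loadavg(handle.read()))
    for pid, tid, comm in uninterruptible_tasks():
        print("state D:", pid, tid, comm)
```

A single run is only a snapshot, so it is worth running a few times on the idle system. If the same thread names appear in state D on every run, and their count roughly matches the idle load average, those threads are the likely source of the load.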

root@h001:~# uname -r
5.15.0-86-generic
root@h001:~# dpkg -l |grep habanalab
ii habanalabs-container-runtime 1.12.0-480 amd64 Habana Labs container runtime. Provides a modified version of runc allowing users to run GPU enabled containers.
ii habanalabs-dkms 1.12.0-480 all habanalabs driver in DKMS format.
ii habanalabs-firmware 1.12.0-480 amd64 Firmware package for Habana Labs processing accelerators
ii habanalabs-firmware-tools 1.12.0-480 amd64 Habanalabs firmware tools package
ii habanalabs-graph 1.12.0-480 amd64 habanalabs graph compiler
ii habanalabs-qual 1.12.0-480 amd64 This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii habanalabs-rdma-core 1.12.0-480 all Habana Labs rdma-core components.
ii habanalabs-thunk 1.12.0-480 all habanalabs thunk