Trivial Tensorflow setup crashes on AWS DL1 Ubuntu 20.04 LTS

When running a simple tensorflow script:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])

I get the following error:

2021-10-28 08:27:11.109007: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-28 08:27:11.243634: F /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/habana_device.cpp:121] Device acquire failed. Err: 26
Aborted (core dumped)

I followed this guide to setup an Ubuntu 20.04 AWS DL1 instance: 3. Installation Guide — Gaudi Documentation 1.0.1 documentation
AMI: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210430 / ami-09e67e426f25ce0d7

I’m running without docker with the binary distribution.

Linux kernel: Linux ip-172-30-4-24 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Software:
ii habanalabs-aeon 1.1.0-614 amd64 This package contains Habanalabs AEON package.
ii habanalabs-dkms 1.1.0-614 all habanalabs driver in DKMS format.
ii habanalabs-firmware-tools 1.1.0-614 amd64 Habanalabs firmware tools package
ii habanalabs-graph 1.1.0-614 amd64 habanalabs graph compiler
ii habanalabs-qual 1.1.0-614 amd64 This package contains Habanalabs qualification package. It designed to assist server vendors to qualify their Goya based server on the production line.
ii habanalabs-thunk 1.1.0-614 all habanalabs thunk

Python 3.8.10

dmsg.log: [ 0.000000] Linux version 5.11.0-1020-aws (buildd@lcy01-amd64-010) (gcc (Ubun - Pastebin.com
hl-smi output: ================ HL-SMI LOG ================Timestamp : Thu Oct 28 08:53: - Pastebin.com

1 Like

Hi @curufinwe, thanks for posting your question, we’re reviewing your log files now

1 Like

Hi @curufinwe, we suspect that the habanalabs-firmware 1.1.0-614 package is missing, please run 1) dpkg -l | grep habanalabs-firmware 2) dpkg -l | grep habanalabs-dkms commands to confirm this.

1 Like

The missing habanalabs-firmware package was indeed the root cause of the problem. After installing that package the tensorflow example could run. Thank you!

4 Likes

That’s great, good luck.