Hi Habana team!
I have been trying to run some of the models from Model-References, but with no success so far. The example I am reporting here uses T5, but I have hit the same issue when running other models.
The only model I have been able to run so far is the simple MNIST example (Model-References/example.py at master · HabanaAI/Model-References · GitHub). Apart from that, I couldn’t run any other model, for either training or inference.
I have tried three different AMIs (two Ubuntu and one Amazon Linux 2):
- Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20220105
- Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20211208
- Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Amazon Linux 2) 20211025
In all experiments, I started from a fresh instance, and the only commands I ran were the following:
export PYTHON=/usr/bin/python3.7
git clone https://github.com/HabanaAI/Model-References
cd Model-References/TensorFlow/nlp/T5-base/
pip3 install -r requirements.txt
$PYTHON prepare_data.py ./data/huggingface
$PYTHON train.py --dtype bf16 --data_dir ./data/huggingface --model_dir ./model
The dataset is downloaded and prepared successfully, but the training doesn’t work.
Below are the last output lines after running the script, where you can see that it was aborted.
The file /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/hpu_resource_gather_op.h
is part of an internal Habana package, so I don’t know exactly what is wrong. I have also seen this error message when running custom models.
$PYTHON train.py --dtype bf16 --data_dir ./data/huggingface --model_dir ./model
.....
2022-01-17 07:10:38.270952: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2022-01-17 07:10:39.066711: F /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/hpu_resource_gather_op.h:40] This impl should not appear in execution. It should have been replaced by pattern matcher into ReadVariable + GatherV2
Aborted
I also tried to run inference using the base checkpoint, but I still get exactly the same error:
2022-01-17 08:07:52.156436: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2022-01-17 08:07:54.039282: F /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/hpu_resource_gather_op.h:40] This impl should not appear in execution. It should have been replaced by pattern matcher into ReadVariable + GatherV2
Aborted
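To make the two failures easier to compare, here is a minimal Python sketch that pulls the fatal (`F`) lines out of the output and checks whether they come from the same source location. It is my own helper, not part of Model-References or SynapseAI; the names `LOG_LINE`, `fatal_entries`, `TRAIN_LOG` and `INFER_LOG` are made up for illustration, and the regular expression only assumes the log prefix format visible in the lines above.

```python
import re

# Assumed prefix format, taken from the log lines in this post:
#   "<date> <time>: <severity> <file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) (?P<file>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)


def fatal_entries(log_text):
    """Return a list of (file, line, message) for every fatal ('F') log line."""
    entries = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and match.group("severity") == "F":
            entries.append(
                (match.group("file"), int(match.group("line")), match.group("message"))
            )
    return entries


# The fatal lines from the training run and the inference run, copied from above.
TRAIN_LOG = (
    "2022-01-17 07:10:39.066711: F /home/jenkins/workspace/cdsoftwarebuilder/"
    "create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/"
    "hpu_resource_gather_op.h:40] This impl should not appear in execution. "
    "It should have been replaced by pattern matcher into ReadVariable + GatherV2\n"
    "Aborted"
)
INFER_LOG = (
    "2022-01-17 08:07:54.039282: F /home/jenkins/workspace/cdsoftwarebuilder/"
    "create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/"
    "hpu_resource_gather_op.h:40] This impl should not appear in execution. "
    "It should have been replaced by pattern matcher into ReadVariable + GatherV2\n"
    "Aborted"
)

# Collect the distinct (file, line) pairs that produced a fatal message.
failure_sites = {
    (path, line) for path, line, _ in fatal_entries(TRAIN_LOG + "\n" + INFER_LOG)
}
for path, line in sorted(failure_sites):
    print(f"fatal at {path}:{line}")
```

Run on the two excerpts, this reports a single failure site, which is why I believe training and inference are hitting the same problem.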
I have included the habana_logs and outputs of the debugger tool.
What do you suggest I do? I want to start using Habana on my custom models as soon as possible, but without being able to run even the reference models it is hard to make progress.
Some of my questions are:
- Are the available AMIs correct and ready to use?
- Is it necessary to install any other packages on the AMIs? From what I understood, they should already have everything installed.
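If it helps with the package question, this is a small Python sketch I could run on each AMI to list the installed packages whose names mention Habana or TensorFlow, so the three images can be compared side by side. The helper names and the two keywords are my own assumptions (I do not know the exact package names the AMIs ship), and nothing here is an official Habana tool.

```python
def installed_package_versions():
    """Return {name: version} for every distribution visible to this interpreter."""
    try:
        from importlib import metadata  # available from Python 3.8
    except ImportError:
        # Fallback for Python 3.7, which the commands above use.
        import pkg_resources
        return {dist.project_name: dist.version for dist in pkg_resources.working_set}
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            versions[name] = dist.version
    return versions


def filter_packages(versions, keywords=("habana", "tensorflow")):
    """Keep packages whose name contains any keyword (case-insensitive).

    The default keywords are an assumption, not a list of known package names.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return {
        name: version
        for name, version in sorted(versions.items())
        if any(keyword in name.lower() for keyword in lowered)
    }


if __name__ == "__main__":
    for name, version in filter_packages(installed_package_versions()).items():
        print(f"{name}=={version}")
```

I can post the output from each AMI if that would be useful for checking whether something is missing or mismatched.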