T5 model reference training/inference not working on Tensorflow AMI

gustavozomer · January 18, 2022, 3:47pm

Hi Habana team!

I have been trying to run some of the models from the Model-References, but with no success so far. In the example I am reporting here I used T5, but I have encountered the same issue when running other models.

The only model I was able to run so far was the simple MNIST model (Model-References/example.py at master · HabanaAI/Model-References · GitHub). Apart from that, I couldn’t use any other model, for either training or inference.

I have tried three different AMIs (two Ubuntu and one Amazon Linux):

Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20220105
Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20211208
Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Amazon Linux 2) 20211025

In all experiments, I started from a fresh instance and the only commands I ran were the following:

export PYTHON=/usr/bin/python3.7
git clone https://github.com/HabanaAI/Model-References
cd Model-References/TensorFlow/nlp/T5-base/
pip3 install -r requirements.txt
$PYTHON prepare_data.py ./data/huggingface
$PYTHON train.py --dtype bf16 --data_dir ./data/huggingface --model_dir ./model

The dataset is downloaded and prepared successfully, but the training doesn’t work.

The last lines of output after running the script are shown below; you can see that it was aborted.

The file /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/hpu_resource_gather_op.h belongs to an internal Habana package, so I don’t know exactly what is wrong. I have also seen this error message when running custom models.

$PYTHON train.py --dtype bf16 --data_dir ./data/huggingface --model_dir ./model

.....
2022-01-17 07:10:38.270952: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2022-01-17 07:10:39.066711: F /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/hpu_resource_gather_op.h:40] This impl should not appear in execution. It should have been replaced by pattern matcher into ReadVariable + GatherV2
Aborted

I also tried to run inference using the base checkpoint but I still get the exact same error:

2022-01-17 08:07:52.156436: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2022-01-17 08:07:54.039282: F /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/kernels/hpu_resource_gather_op.h:40] This impl should not appear in execution. It should have been replaced by pattern matcher into ReadVariable + GatherV2
Aborted

I have included the habana_logs and outputs of the debugger tool.

What do you suggest I could do? I want to start using Habana as soon as possible on my custom models, but without being able to run even the reference models it is hard to make progress.

Some of my questions are:

Are the AMIs available correct and ready to use?
Is it necessary to install any other packages on the AMIs? From what I understood, they should already have everything installed.

Greg_S · January 18, 2022, 5:54pm

Hi @gustavozomer,

To be able to run models from our Model-References on the AWS-supplied DLAMI, I would recommend that you make sure the Model-References branch matches the SynapseAI version in the DLAMI. In this case, the version being used is 0.15.4, so you can clone that specific branch from our repo:

git clone -b 0.15.4 https://github.com/HabanaAI/Model-References
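As a rough sketch of the same idea, the version can be read from the DLAMI name instead of being typed by hand. The helper name `synapse_version_from_ami` is hypothetical, and the sketch assumes the AMI name always contains the token "SynapseAI" followed by the version, and that the repo has a branch named after that version (only 0.15.4 is confirmed in this thread). It prints the clone command rather than running it:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any Habana tooling): extract the
# SynapseAI version from a DLAMI name such as the ones listed above.
synapse_version_from_ami() {
  # Print the version token that follows "SynapseAI"; print nothing if absent.
  printf '%s\n' "$1" | sed -n 's/.*SynapseAI \([0-9][0-9.]*\).*/\1/p'
}

ami_name="Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20220105"
version="$(synapse_version_from_ami "$ami_name")"

# Assumption: a Model-References branch with this exact name exists.
# The command is only echoed, so the sketch has no side effects.
echo "git clone -b $version https://github.com/HabanaAI/Model-References"
```

If the name does not contain a SynapseAI version, the helper prints nothing, so it is worth checking that `$version` is non-empty before cloning.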

To answer your two questions, and provide some help…

Are the AMIs correct and ready to use? – Yes, but these DLAMIs from AWS are based on Habana’s SynapseAI 0.15.4 release, from October 2021. When selecting the DLAMI, it is best to use the latest one from AWS; the date is appended to the end of the name:
Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20220105

But this AMI is still based on Habana’s 0.15.4 release, so cloning the associated 0.15.4 branch from our model references is the best option.
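For picking the most recent DLAMI out of several candidates, a small sketch along these lines may help. It assumes every name ends in a YYYYMMDD date stamp, as the three AMIs mentioned in this thread do; the function name `latest_ami` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read AMI names on stdin (one per line) and print
# the one whose trailing YYYYMMDD stamp is the most recent.
latest_ami() {
  # Prefix each line with its last field, sort numerically on that prefix,
  # keep the last (largest) line, then strip the prefix again.
  awk '{ print $NF "\t" $0 }' | sort -k1,1n | tail -n 1 | cut -f2-
}

latest="$(latest_ami <<'EOF'
Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20220105
Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Ubuntu 18.04) 20211208
Deep Learning AMI Habana TensorFlow 2.5.0 SynapseAI 0.15.4 (Amazon Linux 2) 20211025
EOF
)"

echo "$latest"
```

With the three names from this thread, this selects the Ubuntu 18.04 image dated 20220105.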

Do I need to install any other packages? – No. The DLAMIs from AWS include the full stack and the TF framework.

To run the latest Habana SynapseAI software stack (1.2.0), you can use a Base AMI from the AWS Marketplace and a TF Docker image from the ECR registry. SynapseAI version 1.2.0 is available on the AWS Marketplace: AWS Marketplace: Search Results. These are “Base” AMIs that include the SynapseAI driver and software, so you have to select a Docker image for the framework; the associated 1.2.0 framework images are here: ECR Public Gallery

You can refer to our documentation for Getting Started with AWS EC2 for more information.

gustavozomer · February 22, 2022, 2:48pm

Thanks Greg! Sorry for the late reply, but I was able to make it run with the Docker Images, which are very easy to set up. Thanks once again for your help =)
