I am using the only available AWS AMI on a dl1.24xlarge instance.
AMI Description:
“Deep Learning AMI Habana TensorFlow 2.8.0 SynapseAI 1.3.0 (Ubuntu 20.04) 20220303 - ami-0d0acb47faa127dac
Built with Habana SynapseAI, HPU Driver, Docker and TensorFlow Frameworks. For fully managed experience, check: Machine Learning – Amazon Web Services”
I am running this docker image:
docker run -it --mount type=bind,source=/home/ubuntu/files-2-copy,target=/mnt/sdc/dev/2bone-marrow --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host  vault.habana.ai/gaudi-docker/1.2.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.7.0
Note: I had to use the 1.2.0 image to match the habanalabs driver version on the AMI.
I ran the HPU training with image augmentation (dl1.24xlarge has 8 cards).
TensorFlow version: 2.7.0
I am using mpi4py and ipyparallel to run training across 8 cards.
I am also using HPUStrategy().
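For reference, each worker initializes the HPU and creates the strategy roughly like this (a minimal sketch of my setup; the habana_frameworks imports come from the TensorFlow package shipped in the Gaudi container):

# Per-worker HPU setup (sketch). Each of the 8 ipyparallel engines runs one of these.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()       # register the Gaudi HPU device with TensorFlow
strategy = HPUStrategy()   # one HPU per worker process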
Data details:
The dataset has 171364 image files.
Split up across train/val/test:
Training images count: 108836
Validating images count: 27210
Testing images count: 34012
Model:
# Assumed imports (NETWORK, NUM_CLASSES and input_shape are defined elsewhere in the script):
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout

def build_model(OPTIMIZER, LOSS, METRICS):
    inputs = layers.Input(shape=input_shape)
    # TEMP
    # x = img_augmentation(inputs)
    # x = data_augmentation(inputs)
    x = inputs  # No data augmentation - as we get errors with bad input data
    baseModel = NETWORK(include_top=False, input_tensor=x, weights="imagenet", pooling='avg')
    baseModel.trainable = False

    x = BatchNormalization(axis=-1, name="Batch-Normalization-1")(baseModel.output)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.L1L2(l1=1e-5, l2=1e-4))(x)
    x = BatchNormalization(axis=-1, name="Batch-Normalization-2")(x)
    x = Dropout(.2, name="Dropout-1")(x)
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization(axis=-1, name="Batch-Normalization-3")(x)
    outputs = Dense(NUM_CLASSES, activation="softmax", name="Classifier")(x)

    model = tf.keras.Model(inputs=baseModel.input, outputs=outputs, name="Deep-BoneMarrow")
    model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)

    return model
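The model is then built under the strategy scope. A rough usage sketch (the optimizer, loss, and metrics values here are placeholders, not the exact ones from my script):

# Hypothetical call - the actual OPTIMIZER/LOSS/METRICS used in my script may differ.
with strategy.scope():
    model = build_model(
        OPTIMIZER=tf.keras.optimizers.Adam(learning_rate=1e-3),
        LOSS="categorical_crossentropy",
        METRICS=["accuracy"],
    )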
I am getting errors related to image augmentation. This is the image augmentation I was trying:
img_augmentation = Sequential([
    preprocessing.RandomFlip("horizontal"),
    preprocessing.RandomContrast(factor=0.20)
], name="Augmentation")
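For comparison, the same transformations expressed as plain TensorFlow image ops in the tf.data input pipeline would look roughly like this (a sketch only; train_ds is a placeholder for my actual dataset, and I have not verified this path on Gaudi):

# Sketch: same flip/contrast augmentation applied in the input pipeline instead of as model layers.
# train_ds is assumed to yield (image, label) batches of float32 images.
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)  # roughly RandomContrast(factor=0.20)
    return image, label

train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)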
The error occurs at the end of the very first epoch and relates to the failure to run the augmentation (RandomFlip).
Here is the log from one of the engines (it was the same on all the other engines):
------------------------------------------
[stderr:3] 2022-03-06 09:50:12.932992: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1932,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.134134: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1934,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.334756: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1936,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.536614: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1938,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.541865: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/habana_device/runtime/cluster_builder_halo.cpp:1561] Unable to compile dynamic cluster: habana_cluster_969_1467
------------------------------------------
After these messages the first epoch ends (I am not sure whether it completed successfully), but further training
slows down drastically - see epoch 2's numbers below (ETA of about 14 minutes).
Please note that if I comment out image augmentation, the first epoch takes 134s and each subsequent epoch takes
roughly 1 minute, and training completes over all the image files.
In this error case, I am not sure whether the second epoch is now running only on the CPU (free -m shows higher memory consumption than before):
------------------------------------------
[stdout:2] Epoch 1/2
107/107 [==============================] - 124s 826ms/step - loss: 1.6326 - accuracy: 0.5543 - val_loss: 2.1770 - val_accuracy: 0.6230 - lr: 0.0010
Epoch 2/2
 21/107 [====>.........................] - ETA: 14:18 - loss: 1.4071 - accuracy: 0.6486
%px: 0%
0/8 [05:40<?, ?tasks/s]
------------------------------------------
Free memory while training was running on the HPUs:
------------------------------------------
              total        used        free      shared  buff/cache   available
Mem:         765930       60833      666622       28674       38474      672111
Swap:             0           0           0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
Free memory after the image augmentation failures, while model.fit kept running:
------------------------------------------
              total        used        free      shared  buff/cache   available
Mem:         765930      121901      145718      487426      498311      151785
Swap:             0           0           0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
hl-smi and free -m output:
------------------------------------------
ubuntu@ip-172-31-93-139:~/files-2-copy$ hl-smi;free -m
+-----------------------------------------------------------------------------+
| HL-SMI Version:                               hl-1.2.0-fw-32.5.0.0          |
| Driver Version:                                      1.2.0-124dd38          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-205              N/A  | 0000:10:1d.0     N/A |                   0  |
| N/A   40C   N/A   112W / 350W |  32260Mib / 32768Mib |     6%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-205              N/A  | 0000:10:1e.0     N/A |                   0  |
| N/A   40C   N/A    99W / 350W |  32260Mib / 32768Mib |     1%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-205              N/A  | 0000:90:1d.0     N/A |                   0  |
| N/A   40C   N/A    98W / 350W |  32260Mib / 32768Mib |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-205              N/A  | 0000:20:1d.0     N/A |                   0  |
| N/A   40C   N/A   104W / 350W |  32260Mib / 32768Mib |     3%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-205              N/A  | 0000:90:1e.0     N/A |                   0  |
| N/A   38C   N/A   109W / 350W |  32260Mib / 32768Mib |     5%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-205              N/A  | 0000:20:1e.0     N/A |                   0  |
| N/A   38C   N/A   105W / 350W |  32260Mib / 32768Mib |     3%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-205              N/A  | 0000:a0:1d.0     N/A |                   0  |
| N/A   40C   N/A    99W / 350W |  32260Mib / 32768Mib |     1%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-205              N/A  | 0000:a0:1e.0     N/A |                   0  |
| N/A   41C   N/A    99W / 350W |  32260Mib / 32768Mib |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
              total        used        free      shared  buff/cache   available
Mem:         765930      117986      149636      487426      498307      155700
Swap:             0           0           0
ubuntu@ip-172-31-93-139:~/files-2-copy$ 
------------------------------------------
I can share more logs/data as required.
Being able to run training with the image augmentation steps would likely improve the model accuracy.