I am using the only available AWS AMI on a dl1.24xlarge instance.
AMI Description:
“Deep Learning AMI Habana TensorFlow 2.8.0 SynapseAI 1.3.0 (Ubuntu 20.04) 20220303 - ami-0d0acb47faa127dac
Built with Habana SynapseAI, HPU Driver, Docker and TensorFlow Frameworks. For fully managed experience, check: Machine Learning – Amazon Web Services”
I am running this docker image:
docker run -it --mount type=bind,source=/home/ubuntu/files-2-copy,target=/mnt/sdc/dev/2bone-marrow --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host  vault.habana.ai/gaudi-docker/1.2.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.7.0
Note: I had to use the 1.2.0 image to match the habanalabs driver version on the AMI.
I ran the HPU training with image augmentation (dl1.24xlarge has 8 cards).
TensorFlow version: 2.7.0
I am using mpi4py and ipyparallel to run training across 8 cards.
I am also using HPUStrategy().
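For reference, each worker initializes the HPU and creates the strategy roughly like this (a minimal sketch of my setup; the habana_frameworks imports come from the TensorFlow package shipped in the Gaudi container):

# Per-worker HPU setup (sketch). Each of the 8 ipyparallel engines runs one of these.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()       # register the Gaudi HPU device with TensorFlow
strategy = HPUStrategy()   # one HPU per worker process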
Data details:
The dataset has 171364 image files.
Split up across train/val/test:
Training images count: 108836
Validating images count: 27210
Testing images count: 34012
Model:
# Assumed imports (NETWORK, NUM_CLASSES and input_shape are defined elsewhere in the script):
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout

def build_model(OPTIMIZER, LOSS, METRICS):
    inputs = layers.Input(shape=input_shape)
    # TEMP
    # x = img_augmentation(inputs)
    # x = data_augmentation(inputs)
    x = inputs  # No data augmentation - as we get errors with bad input data
    baseModel = NETWORK(include_top=False, input_tensor=x, weights="imagenet", pooling='avg')
    baseModel.trainable = False

    x = BatchNormalization(axis=-1, name="Batch-Normalization-1")(baseModel.output)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.L1L2(l1=1e-5, l2=1e-4))(x)
    x = BatchNormalization(axis=-1, name="Batch-Normalization-2")(x)
    x = Dropout(.2, name="Dropout-1")(x)
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization(axis=-1, name="Batch-Normalization-3")(x)
    outputs = Dense(NUM_CLASSES, activation="softmax", name="Classifier")(x)

    model = tf.keras.Model(inputs=baseModel.input, outputs=outputs, name="Deep-BoneMarrow")
    model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)

    return model
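The model is then built under the strategy scope. A rough usage sketch (the optimizer, loss, and metrics values here are placeholders, not the exact ones from my script):

# Hypothetical call - the actual OPTIMIZER/LOSS/METRICS used in my script may differ.
with strategy.scope():
    model = build_model(
        OPTIMIZER=tf.keras.optimizers.Adam(learning_rate=1e-3),
        LOSS="categorical_crossentropy",
        METRICS=["accuracy"],
    )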
I am getting errors related to image augmentation. This is the image augmentation I was trying:
img_augmentation = Sequential([
    preprocessing.RandomFlip("horizontal"),
    preprocessing.RandomContrast(factor=0.20)
], name="Augmentation")
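For comparison, the same transformations expressed as plain TensorFlow image ops in the tf.data input pipeline would look roughly like this (a sketch only; train_ds is a placeholder for my actual dataset, and I have not verified this path on Gaudi):

# Sketch: same flip/contrast augmentation applied in the input pipeline instead of as model layers.
# train_ds is assumed to yield (image, label) batches of float32 images.
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)  # roughly RandomContrast(factor=0.20)
    return image, label

train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)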
The error occurs at the end of the very first epoch and relates to the failure to run the augmentation (RandomFlip).
Here is the log from one of the engines (it was the same on all the other engines):
------------------------------------------
[stderr:3] 2022-03-06 09:50:12.932992: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1932,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.134134: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1934,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.334756: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1936,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.536614: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1938,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.541865: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/habana_device/runtime/cluster_builder_halo.cpp:1561] Unable to compile dynamic cluster: habana_cluster_969_1467
------------------------------------------
After these messages the first epoch ends (I am not sure whether it completed successfully), but further training
slows down drastically - see epoch 2's numbers below (ETA of about 14 minutes).
Please note that if I comment out image augmentation, the first epoch takes 134s and each subsequent epoch takes
roughly 1 minute, and training completes over all the image files.
In this error case, I am not sure whether the second epoch is now running only on the CPU (free -m shows higher memory consumption than before):
------------------------------------------
[stdout:2] Epoch 1/2
107/107 [==============================] - 124s 826ms/step - loss: 1.6326 - accuracy: 0.5543 - val_loss: 2.1770 - val_accuracy: 0.6230 - lr: 0.0010
Epoch 2/2
 21/107 [====>.........................] - ETA: 14:18 - loss: 1.4071 - accuracy: 0.6486
%px: 0%
0/8 [05:40<?, ?tasks/s]
------------------------------------------
Free memory while training was running on the HPUs:
------------------------------------------
              total        used        free      shared  buff/cache   available
Mem:         765930       60833      666622       28674       38474      672111
Swap:             0           0           0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
Free memory after the image augmentation failures, while model.fit kept running:
------------------------------------------
              total        used        free      shared  buff/cache   available
Mem:         765930      121901      145718      487426      498311      151785
Swap:             0           0           0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
hl-smi and free -m output:
------------------------------------------
ubuntu@ip-172-31-93-139:~/files-2-copy$ hl-smi;free -m
+-----------------------------------------------------------------------------+
| HL-SMI Version:                               hl-1.2.0-fw-32.5.0.0          |
| Driver Version:                                      1.2.0-124dd38          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-205              N/A  | 0000:10:1d.0     N/A |                   0  |
| N/A   40C   N/A   112W / 350W |  32260Mib / 32768Mib |     6%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-205              N/A  | 0000:10:1e.0     N/A |                   0  |
| N/A   40C   N/A    99W / 350W |  32260Mib / 32768Mib |     1%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-205              N/A  | 0000:90:1d.0     N/A |                   0  |
| N/A   40C   N/A    98W / 350W |  32260Mib / 32768Mib |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-205              N/A  | 0000:20:1d.0     N/A |                   0  |
| N/A   40C   N/A   104W / 350W |  32260Mib / 32768Mib |     3%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-205              N/A  | 0000:90:1e.0     N/A |                   0  |
| N/A   38C   N/A   109W / 350W |  32260Mib / 32768Mib |     5%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-205              N/A  | 0000:20:1e.0     N/A |                   0  |
| N/A   38C   N/A   105W / 350W |  32260Mib / 32768Mib |     3%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-205              N/A  | 0000:a0:1d.0     N/A |                   0  |
| N/A   40C   N/A    99W / 350W |  32260Mib / 32768Mib |     1%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-205              N/A  | 0000:a0:1e.0     N/A |                   0  |
| N/A   41C   N/A    99W / 350W |  32260Mib / 32768Mib |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
              total        used        free      shared  buff/cache   available
Mem:         765930      117986      149636      487426      498307      155700
Swap:             0           0           0
ubuntu@ip-172-31-93-139:~/files-2-copy$ 
------------------------------------------
I can share more logs/data as required.
Being able to run training with the image augmentation steps would likely improve the model accuracy.