I am using the only available Habana deep learning AMI on a dl1.24xlarge AWS instance.
AMI Description:
“Deep Learning AMI Habana TensorFlow 2.8.0 SynapseAI 1.3.0 (Ubuntu 20.04) 20220303 - ami-0d0acb47faa127dac
Built with Habana SynapseAI, HPU Driver, Docker and TensorFlow Frameworks. For fully managed experience, check: Machine Learning – Amazon Web Services”
I am running this docker image:
docker run -it --mount type=bind,source=/home/ubuntu/files-2-copy,target=/mnt/sdc/dev/2bone-marrow --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host vault.habana.ai/gaudi-docker/1.2.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.7.0
Note: I had to use the 1.2.0 image to match the habanalabs driver version on the AMI.
I ran the HPU training with image augmentation (the dl1.24xlarge has 8 Gaudi cards).
TensorFlow version: 2.7.0
I am using mpi4py and ipyparallel to run training across the 8 cards, and I am also using HPUStrategy().
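For reference, each engine sets up the HPU and the strategy roughly as in the sketch below. This is a minimal sketch, not my exact script; it assumes habana_frameworks is importable inside the container and that each ipyparallel engine drives one card.

from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()      # register the Gaudi/HPU device with TensorFlow
strategy = HPUStrategy()  # Habana's tf.distribute strategy

with strategy.scope():
    # build_model(...) from the "Model:" section below is created here
    pass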
Data details:
The dataset has 171364 image files.
Split up across train/val/test:
Training images count: 108836
Validating images count: 27210
Testing images count: 34012
Model:
# Imports shown for reference; NETWORK, input_shape and NUM_CLASSES are defined
# elsewhere in my code (not shown here).
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout

def build_model(OPTIMIZER, LOSS, METRICS):
    inputs = layers.Input(shape=input_shape)
    #TEMP
    #x = img_augmentation(inputs)
    #x = data_augmentation(inputs)
    x = inputs  # No data augmentation, as we get errors with bad input data
    baseModel = NETWORK(include_top=False, input_tensor=x, weights="imagenet", pooling='avg')
    baseModel.trainable = False
    x = BatchNormalization(axis=-1, name="Batch-Normalization-1")(baseModel.output)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.L1L2(l1=1e-5, l2=1e-4))(x)
    x = BatchNormalization(axis=-1, name="Batch-Normalization-2")(x)
    x = Dropout(0.2, name="Dropout-1")(x)
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization(axis=-1, name="Batch-Normalization-3")(x)
    outputs = Dense(NUM_CLASSES, activation="softmax", name="Classifier")(x)
    model = tf.keras.Model(inputs=baseModel.input, outputs=outputs, name="Deep-BoneMarrow")
    model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
    return model
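For completeness, the model is built under the strategy scope and trained roughly as follows. The optimizer, loss, metrics, and epoch count here are illustrative placeholders rather than my exact settings, and train_ds / val_ds stand in for the per-worker tf.data datasets (hypothetical names).

import tensorflow as tf

OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=1e-3)  # placeholder values
LOSS = "categorical_crossentropy"
METRICS = ["accuracy"]

with strategy.scope():                        # strategy = HPUStrategy() from above
    model = build_model(OPTIMIZER, LOSS, METRICS)

model.fit(train_ds, validation_data=val_ds, epochs=2)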
I am getting errors related to image augmentation. This is the augmentation block I was trying:
from tensorflow.keras import Sequential
from tensorflow.keras.layers.experimental import preprocessing

img_augmentation = Sequential([
    preprocessing.RandomFlip("horizontal"),
    preprocessing.RandomContrast(factor=0.20)
], name="Augmentation")
The error occurs at the end of the very first epoch and is related to the failed augmentation (RandomFlip):
Here is the log from one of the engines (the other engines show the same):
------------------------------------------
[stderr:3] 2022-03-06 09:50:12.932992: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1932,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.134134: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1934,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.334756: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1936,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.536614: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/synapse_helpers/graph.cpp:171] Node: Deep-BoneMarrow/Augmentation/random_flip/stateless_random_flip_left_right/ReverseV2/reverse_f32_n1938,type: reverse_f32 add failed. Err: 1
2022-03-06 09:50:13.541865: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module—bpt-d/tensorflow-training/habana_device/runtime/cluster_builder_halo.cpp:1561] Unable to compile dynamic cluster: habana_cluster_969_1467
------------------------------------------
After these messages, the first epoch ends (I am not sure whether it actually succeeded), but further training slows down drastically; see the epoch 2 numbers below, which show an ETA of about 14 minutes.
Note that if I comment out the image augmentation, the first epoch takes 134s, each subsequent epoch takes roughly 1 minute, and training completes over all the image files.
In the error case, I suspect the second epoch is now running on just the CPU, since free -m shows higher memory consumption than before:
------------------------------------------
[stdout:2] Epoch 1/2
107/107 [==============================] - 124s 826ms/step - loss: 1.6326 - accuracy: 0.5543 - val_loss: 2.1770 - val_accuracy: 0.6230 - lr: 0.0010
Epoch 2/2
21/107 [====>.........................] - ETA: 14:18 - loss: 1.4071 - accuracy: 0.6486
%px: 0%
0/8 [05:40<?, ?tasks/s]
------------------------------------------
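One way to confirm whether ops are falling back to the CPU would be to enable TensorFlow's device placement logging before building the model; a minimal sketch:

import tensorflow as tf

# Print the device (HPU vs CPU) each op is placed on; enable before model build/fit.
tf.debugging.set_log_device_placement(True)

# After load_habana_module(), the HPU should show up among the logical devices.
print(tf.config.list_logical_devices())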
Free memory while training was running on the HPUs:
------------------------------------------
total used free shared buff/cache available
Mem: 765930 60833 666622 28674 38474 672111
Swap: 0 0 0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
Free memory after the image augmentation failures, while model.fit kept running:
------------------------------------------
total used free shared buff/cache available
Mem: 765930 121901 145718 487426 498311 151785
Swap: 0 0 0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
hl-smi and free -m output:
------------------------------------------
ubuntu@ip-172-31-93-139:~/files-2-copy$ hl-smi;free -m
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.2.0-fw-32.5.0.0 |
| Driver Version: 1.2.0-124dd38 |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-205 N/A | 0000:10:1d.0 N/A | 0 |
| N/A 40C N/A 112W / 350W | 32260Mib / 32768Mib | 6% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-205 N/A | 0000:10:1e.0 N/A | 0 |
| N/A 40C N/A 99W / 350W | 32260Mib / 32768Mib | 1% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-205 N/A | 0000:90:1d.0 N/A | 0 |
| N/A 40C N/A 98W / 350W | 32260Mib / 32768Mib | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-205 N/A | 0000:20:1d.0 N/A | 0 |
| N/A 40C N/A 104W / 350W | 32260Mib / 32768Mib | 3% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-205 N/A | 0000:90:1e.0 N/A | 0 |
| N/A 38C N/A 109W / 350W | 32260Mib / 32768Mib | 5% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-205 N/A | 0000:20:1e.0 N/A | 0 |
| N/A 38C N/A 105W / 350W | 32260Mib / 32768Mib | 3% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-205 N/A | 0000:a0:1d.0 N/A | 0 |
| N/A 40C N/A 99W / 350W | 32260Mib / 32768Mib | 1% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-205 N/A | 0000:a0:1e.0 N/A | 0 |
| N/A 41C N/A 99W / 350W | 32260Mib / 32768Mib | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
total used free shared buff/cache available
Mem: 765930 117986 149636 487426 498307 155700
Swap: 0 0 0
ubuntu@ip-172-31-93-139:~/files-2-copy$
------------------------------------------
I can share more logs/data as needed.
Being able to run training with the image augmentation steps would improve the model accuracy.
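One alternative I could try is applying the same augmentation in the tf.data input pipeline instead of inside the model, so it runs as part of CPU-side preprocessing. A minimal sketch, assuming train_ds yields (image, label) batches (hypothetical name) and img_augmentation is the Sequential block shown above:

import tensorflow as tf

def augment(image, label):
    # Apply RandomFlip/RandomContrast as a preprocessing step in the data pipeline.
    return img_augmentation(image, training=True), label

train_ds_aug = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)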