unet2d training crash for 8 gaudis

jonathanadler · March 9, 2023, 9:54pm

Hi,

I followed the instructions at Model-References/PyTorch/computer_vision/segmentation/Unet (1.8.0, HabanaAI/Model-References on GitHub) for Unet2D training with 8 Gaudis.

I ran this command:

$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log \
--fold 0 --hpus 8 --gpus 0 --data /data/01_2d --seed 123 --num_workers 8 \
--affinity disabled --norm instance --dim 2 --optimizer fusedadamw --exec_mode train \
--learning_rate 0.001 --hmp --hmp-bf16 ./config/ops_bf16_unet.txt \
--hmp-fp32 ./config/ops_fp32_unet.txt --deep_supervision --batch_size 64 \
--val_batch_size 64 --min_epochs 30 --max_epochs 10000 --train_batches 0 --test_batches 0

The training crashes. See the screenshot.

Sayantan_S · March 13, 2023, 6:03pm

Thanks for posting. We’ll take a look and get back to you.

Sayantan_S · March 17, 2023, 5:44pm

Hi @jonathanadler

I am able to run both 1x and 8x bf16 on 1.8 on Gaudi1. Are you using Gaudi1 or Gaudi2?

Can you please check your software stack and be sure that everything is 1.8?

Model-References should be 1.8
Docker should be 1.8
Everything listed by this command should be 1.8: dpkg -l | grep -i habana
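
If it helps, the package check above can be scripted. This is only a rough sketch, assuming the usual `dpkg -l` column layout (status, name, version, architecture, description); the function names `parse_dpkg_list`, `find_mismatched` and `installed_habana_packages` are made up for illustration and are not part of any Habana tooling.

```python
import subprocess


def parse_dpkg_list(text, keyword="habana"):
    """Return {package: version} for `dpkg -l` rows that mention keyword.

    Mirrors `dpkg -l | grep -i habana`: the match is case-insensitive
    and applies to the whole row, not just the package name.
    """
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Package rows start with a short status code such as "ii";
        # the header and separator rows of `dpkg -l` do not.
        if len(fields) < 3 or not fields[0].isalpha() or len(fields[0]) > 3:
            continue
        if keyword.lower() in line.lower():
            packages[fields[1]] = fields[2]
    return packages


def find_mismatched(packages, expected="1.8"):
    """Return the packages whose version is not in the expected release."""
    prefix = expected + "."
    return {
        name: version
        for name, version in packages.items()
        if not version.startswith(prefix)
    }


def installed_habana_packages():
    """Run `dpkg -l` on this machine and parse its output (Debian/Ubuntu only)."""
    result = subprocess.run(
        ["dpkg", "-l"], capture_output=True, text=True, check=True
    )
    return parse_dpkg_list(result.stdout)
```

On the training host, `find_mismatched(installed_habana_packages())` should come back empty if every Habana package is on 1.8; anything it returns is a package from a different release. It does not check the Model-References checkout or the Docker image, so those still need to be verified separately.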
