unet2d training crash for 8 gaudis


I followed the instructions here Model-References/PyTorch/computer_vision/segmentation/Unet at 1.8.0 · HabanaAI/Model-References · GitHub for Unet2D training with 8 Gaudis

Running this command

$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log
–fold 0 --hpus 8 --gpus 0 --data /data/01_2d --seed 123 --num_workers 8
–affinity disabled --norm instance --dim 2 --optimizer fusedadamw --exec_mode train
–learning_rate 0.001 --hmp --hmp-bf16 ./config/ops_bf16_unet.txt
–hmp-fp32 ./config/ops_fp32_unet.txt --deep_supervision --batch_size 64
–val_batch_size 64 --min_epochs 30 --max_epochs 10000 --train_batches 0 --test_batches 0

The training crashes. see screenshot

Thanks for posting. We’ll take a look and get back to you.

Hi @jonathanadler

I am able to run both 1x and 8x bf16 on 1.8 on Gaudi1. Are you using Gaudi1 or Gaudi2?

Can you please check your software stack and be sure that everything is 1.8?

  1. model refernces should be 1.8
  2. docker should be 1.8
  3. everything here should be 1.8: dpkg -l |grep -i habana