Hello,
I am trying to adapt this repo to Habana Gaudi (dl1.24xlarge on AWS).
I am using the Habana Deep Learning Base AMI (Ubuntu 20.04) from the marketplace, with SynapseAI 1.11.0 and PyTorch 2.0.1.
I read through most of the relevant docs, and with the GPU Migration toolkit (`import habana_frameworks.torch.gpu_migration`) plus `htcore.mark_step()` calls, it wasn't too hard to get a workable version running. I am attaching the adapted code in case anyone would like to reproduce.
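For context, the adaptation boils down to roughly the pattern below. This is a minimal sketch with a placeholder model and training loop, not the attached code itself:

```python
import torch
import habana_frameworks.torch.gpu_migration  # noqa: F401  maps cuda.* calls to HPU
import habana_frameworks.torch.core as htcore

device = torch.device("cuda")  # gpu_migration transparently redirects this to the HPU

# Placeholder model/optimizer; the real script builds an r50 backbone plus the margin loss.
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for step in range(10):
    batch = torch.randn(128, 512, device=device)
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    htcore.mark_step()   # flush the lazy-mode graph after backward
    optimizer.step()
    htcore.mark_step()   # and again after the optimizer step
```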
My first issue is speed. With all 8 chips on the dl1.24xlarge I only get about 80 imgs/s, whereas I was expecting something closer to 10,000 imgs/s. The code uses data parallelism for the backbone plus a manually constructed, model-parallel cross-entropy loss over the last layer. Could that be relevant? (A rough sketch of what I mean by the model-parallel loss is below.)
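By the model-parallel loss I mean the usual class-sharded construction, roughly like the following. This is only an illustration of the pattern, with simplified gradient handling through the collectives; the class name and details are mine, not the exact attached code:

```python
import torch
import torch.distributed as dist


class ShardedCrossEntropy(torch.nn.Module):
    """Each rank owns num_classes // world_size rows of the final FC layer;
    softmax statistics are combined across ranks with collectives."""

    def __init__(self, embedding_size, num_classes, rank, world_size):
        super().__init__()
        self.rank = rank
        self.world_size = world_size
        self.num_local = num_classes // world_size
        self.class_start = rank * self.num_local
        self.weight = torch.nn.Parameter(
            torch.normal(0, 0.01, (self.num_local, embedding_size)))

    def forward(self, embeddings, labels):
        # Every rank scores the *global* batch against its local class shard.
        emb_list = [torch.zeros_like(embeddings) for _ in range(self.world_size)]
        lab_list = [torch.zeros_like(labels) for _ in range(self.world_size)]
        dist.all_gather(emb_list, embeddings)
        dist.all_gather(lab_list, labels)
        emb_list[self.rank] = embeddings      # keep the local slice in the autograd graph
        all_emb = torch.cat(emb_list)         # (global_batch, embedding_size)
        all_lab = torch.cat(lab_list)         # (global_batch,)

        logits = torch.nn.functional.linear(all_emb, self.weight)  # (global_batch, num_local)

        # Numerically stable softmax whose normaliser spans every shard. Collectives
        # run on detached copies and are re-attached so the local gradient path stays
        # intact (a real implementation would use a custom autograd.Function).
        with torch.no_grad():
            global_max = logits.max(dim=1, keepdim=True).values
            dist.all_reduce(global_max, op=dist.ReduceOp.MAX)
        logits = logits - global_max
        local_denom = logits.exp().sum(dim=1)
        global_denom = local_denom.detach().clone()
        dist.all_reduce(global_denom, op=dist.ReduceOp.SUM)
        denom = local_denom + (global_denom - local_denom.detach())

        # The target logit lives on exactly one shard; the other shards contribute zero.
        local_lab = all_lab - self.class_start
        in_shard = (local_lab >= 0) & (local_lab < self.num_local)
        picked = logits.gather(1, local_lab.clamp(0, self.num_local - 1).unsqueeze(1)).squeeze(1)
        local_target = torch.where(in_shard, picked, torch.zeros_like(picked))
        global_target = local_target.detach().clone()
        dist.all_reduce(global_target, op=dist.ReduceOp.SUM)
        target = local_target + (global_target - local_target.detach())

        # Cross entropy per sample: log(sum of exps over all classes) - target logit.
        return (denom.log() - target).mean()
```

I am wondering whether the all_gather/all_reduce pattern plus the label-dependent indexing interacts badly with lazy-mode graph compilation (e.g. frequent recompilation), which could explain the low throughput.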
Moreover, since this morning the same code on the same instance has been dying with Killed / Segmentation fault, and I have not found a way to dig deeper.
The code uses DDP. It does not matter whether I launch it with torchrun or run `python train_v2.py` directly, in which case the script falls back to `world_size` = 1 and `rank` = 0 (roughly the setup sketched below).
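For completeness, the distributed setup is the standard env-var pattern, roughly as below (the `hccl` import path follows Habana's docs; details may differ slightly from the attached code):

```python
import os

import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # noqa: F401  registers the hccl backend

# torchrun sets these; a plain `python train_v2.py ...` run falls back to one process.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Defaults so the single-process fallback can still initialise the process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "12355")

dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

# Multi-card launch: torchrun --nproc_per_node=8 train_v2.py configs/ms1mv2_r50.py
```

The output below is from the plain single-process invocation: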
python train_v2.py configs/ms1mv2_r50.py
/home/ubuntu/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/gpu_migration/__init__.py:46: UserWarning: apex not installed, gpu_migration will not swap api for this package.
warnings.warn(
Training: 2023-08-15 21:43:58,777-rank_id: 0
Training: 2023-08-15 21:44:02,809-rec file. N of classes: 85742
Training: 2023-08-15 21:44:03,243-Total N of face images: 5822653
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM : 784288608 KB
------------------------------------------------------------------------------
/home/ubuntu/habanalabs-venv/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:1915: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
warnings.warn(
Training: 2023-08-15 21:44:07,511-: margin_list [1.0, 0.5, 0.0]
Training: 2023-08-15 21:44:07,511-: network r50
Training: 2023-08-15 21:44:07,511-: resume False
Training: 2023-08-15 21:44:07,511-: save_all_states False
Training: 2023-08-15 21:44:07,511-: output work_dirs/ms1mv2_r50
Training: 2023-08-15 21:44:07,511-: embedding_size 512
Training: 2023-08-15 21:44:07,511-: sample_rate 1.0
Training: 2023-08-15 21:44:07,511-: interclass_filtering_threshold 0
Training: 2023-08-15 21:44:07,511-: fp16 False
Training: 2023-08-15 21:44:07,512-: batch_size 128
Training: 2023-08-15 21:44:07,512-: optimizer sgd
Training: 2023-08-15 21:44:07,512-: lr 0.1
Training: 2023-08-15 21:44:07,512-: momentum 0.9
Training: 2023-08-15 21:44:07,512-: weight_decay 0.0005
Training: 2023-08-15 21:44:07,512-: verbose 2000
Training: 2023-08-15 21:44:07,512-: frequent 10
Training: 2023-08-15 21:44:07,512-: dali False
Training: 2023-08-15 21:44:07,512-: dali_aug False
Training: 2023-08-15 21:44:07,512-: gradient_acc 1
Training: 2023-08-15 21:44:07,512-: seed 2048
Training: 2023-08-15 21:44:07,512-: num_workers 2
Training: 2023-08-15 21:44:07,512-: wandb_key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Training: 2023-08-15 21:44:07,512-: suffix_run_name None
Training: 2023-08-15 21:44:07,512-: using_wandb False
Training: 2023-08-15 21:44:07,512-: wandb_entity entity
Training: 2023-08-15 21:44:07,512-: wandb_project project
Training: 2023-08-15 21:44:07,512-: wandb_log_all True
Training: 2023-08-15 21:44:07,512-: save_artifacts False
Training: 2023-08-15 21:44:07,512-: wandb_resume False
Training: 2023-08-15 21:44:07,512-: rec /nvme1/data/emore
Training: 2023-08-15 21:44:07,512-: num_classes 85742
Training: 2023-08-15 21:44:07,512-: num_image 5822653
Training: 2023-08-15 21:44:07,512-: num_epoch 20
Training: 2023-08-15 21:44:07,512-: warmup_epoch 0
Training: 2023-08-15 21:44:07,512-: val_targets []
Training: 2023-08-15 21:44:07,512-: total_batch_size 128
Training: 2023-08-15 21:44:07,512-: warmup_step 0
Training: 2023-08-15 21:44:07,512-: total_step 909780
Internal Error: Received signal - Segmentation fault
Killed
How do I debug this? Any pointers on getting a useful backtrace out of the Killed / Segmentation fault would be appreciated.
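The only extra things I can think of are enabling `faulthandler` at the top of train_v2.py (sketch below, a hypothetical addition on my side) and re-running under gdb (`gdb -ex run -ex bt --args python train_v2.py configs/ms1mv2_r50.py`) to get a native backtrace. Is that the right approach here, or is there a Habana-side log I should enable instead?

```python
# Hypothetical addition at the very top of train_v2.py: dump the Python stack
# of every thread when the process receives a fatal signal.
import faulthandler
faulthandler.enable(all_threads=True)
```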