Trainer killed/Segfault

Hello,

I am trying to adapt this repo to run on Gaudi (dl1.24xlarge on AWS).

I am using the Habana Deep Learning Base AMI (Ubuntu 20.04) from the marketplace, with SynapseAI 1.11.0 and PyTorch 2.0.1.

I read through most of the relevant docs, and with the GPU migration import plugin and htcore.mark_step() it wasn’t too hard to get a workable version running. I attach the adapted code here, in case anyone would like to reproduce.
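
In rough strokes, the lazy-mode adaptation looks like this; a minimal sketch with a stand-in model and data, assuming the standard habana_frameworks imports (the attached code has the real backbone, loss and dataloader):

import torch
import torch.nn as nn
import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.gpu_migration  # noqa: F401  (redirects cuda.* calls to HPU)

device = torch.device("hpu")
model = nn.Linear(512, 10).to(device)               # stand-in for the real backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for _ in range(10):                                  # stand-in for the real dataloader
    imgs = torch.randn(128, 512, device=device)
    labels = torch.randint(0, 10, (128,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(imgs), labels)
    loss.backward()
    htcore.mark_step()                               # flush the lazy graph after backward
    optimizer.step()
    htcore.mark_step()                               # and again after the optimizer update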

My first issue is speed. I was only getting about 80 imgs/s with all 8 chips on the dl1.24xlarge, whereas I was expecting something on the order of 10,000 imgs/s. The code uses data parallelism, plus a manually constructed model-parallel cross-entropy loss for the last layer. Could this be relevant?

Moreover, since this morning the same code on the same instance has been giving me Killed/Segmentation fault, with no obvious way to dig deeper.

The code uses DDP. It doesn’t matter whether I launch it with torchrun or run `python train_v2.py` directly, which sets `world_size` to 1 and `rank` to 0.
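
For reference, the multi-card launch is along the lines of standard torchrun usage:

$ torchrun --nproc_per_node=8 train_v2.py configs/ms1mv2_r50.py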

python train_v2.py configs/ms1mv2_r50.py
/home/ubuntu/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/gpu_migration/__init__.py:46: UserWarning: apex not installed, gpu_migration will not swap api for this package.
  warnings.warn(
Training: 2023-08-15 21:43:58,777-rank_id: 0
Training: 2023-08-15 21:44:02,809-rec file. N of classes: 85742
Training: 2023-08-15 21:44:03,243-Total N of face images: 5822653
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM       : 784288608 KB
------------------------------------------------------------------------------
/home/ubuntu/habanalabs-venv/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:1915: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
  warnings.warn(
Training: 2023-08-15 21:44:07,511-: margin_list              [1.0, 0.5, 0.0]
Training: 2023-08-15 21:44:07,511-: network                  r50
Training: 2023-08-15 21:44:07,511-: resume                   False
Training: 2023-08-15 21:44:07,511-: save_all_states          False
Training: 2023-08-15 21:44:07,511-: output                   work_dirs/ms1mv2_r50
Training: 2023-08-15 21:44:07,511-: embedding_size           512
Training: 2023-08-15 21:44:07,511-: sample_rate              1.0
Training: 2023-08-15 21:44:07,511-: interclass_filtering_threshold0
Training: 2023-08-15 21:44:07,511-: fp16                     False
Training: 2023-08-15 21:44:07,512-: batch_size               128
Training: 2023-08-15 21:44:07,512-: optimizer                sgd
Training: 2023-08-15 21:44:07,512-: lr                       0.1
Training: 2023-08-15 21:44:07,512-: momentum                 0.9
Training: 2023-08-15 21:44:07,512-: weight_decay             0.0005
Training: 2023-08-15 21:44:07,512-: verbose                  2000
Training: 2023-08-15 21:44:07,512-: frequent                 10
Training: 2023-08-15 21:44:07,512-: dali                     False
Training: 2023-08-15 21:44:07,512-: dali_aug                 False
Training: 2023-08-15 21:44:07,512-: gradient_acc             1
Training: 2023-08-15 21:44:07,512-: seed                     2048
Training: 2023-08-15 21:44:07,512-: num_workers              2
Training: 2023-08-15 21:44:07,512-: wandb_key                XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Training: 2023-08-15 21:44:07,512-: suffix_run_name          None
Training: 2023-08-15 21:44:07,512-: using_wandb              False
Training: 2023-08-15 21:44:07,512-: wandb_entity             entity
Training: 2023-08-15 21:44:07,512-: wandb_project            project
Training: 2023-08-15 21:44:07,512-: wandb_log_all            True
Training: 2023-08-15 21:44:07,512-: save_artifacts           False
Training: 2023-08-15 21:44:07,512-: wandb_resume             False
Training: 2023-08-15 21:44:07,512-: rec                      /nvme1/data/emore
Training: 2023-08-15 21:44:07,512-: num_classes              85742
Training: 2023-08-15 21:44:07,512-: num_image                5822653
Training: 2023-08-15 21:44:07,512-: num_epoch                20
Training: 2023-08-15 21:44:07,512-: warmup_epoch             0
Training: 2023-08-15 21:44:07,512-: val_targets              []
Training: 2023-08-15 21:44:07,512-: total_batch_size         128
Training: 2023-08-15 21:44:07,512-: warmup_step              0
Training: 2023-08-15 21:44:07,512-: total_step               909780
Internal Error: Received signal - Segmentation fault
Killed

How do I debug this?

Regarding the slow speed:

Usually, if the model has dynamic ops, it can be slow because of repeated graph compilation.

I took a brief look at the repo, and it seems the backbones (resnet/vit) are probably static, but the losses might have dynamic shapes. For example, see this:

def forward(self, logits: torch.Tensor, labels: torch.Tensor):
    index = torch.where(labels != -1)[0]  # index shape is unknown even if we know labels shape
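
A quick way to see the data dependence (with made-up label values):

labels = torch.tensor([3, -1, 7, -1])
print(torch.where(labels != -1)[0].shape)  # torch.Size([2]) -- depends on the values, not the input shape

Every time the number of valid labels changes, the downstream ops see a new shape and may trigger a recompile.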

You can check how to handle the dynamicity as follows:

  1. Use PT_HPU_METRICS_FILE=/root/metricslog.json PT_HPU_METRICS_DUMP_TRIGGERS=process_exit,metric_change to check whether dynamicity is present.
  2. If dynamicity is present, try PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=1. This enables Synapse dynamic shape support.
  3. If step 2 fails, determine whether the inputs are dynamic or the model ops are dynamic. You might need to rewrite some parts (like the loss). Some examples here, and a generic sketch follows below.
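
As a rough illustration of the kind of rewrite (not the actual loss in this repo), a data-dependent index like the one above can often be replaced by a mask with a static shape:

import torch
import torch.nn.functional as F

def masked_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    valid = (labels != -1).float()        # static-shape mask instead of torch.where(...)[0]
    safe_labels = labels.clamp(min=0)     # map -1 to a dummy class so cross_entropy stays valid
    per_sample = F.cross_entropy(logits, safe_labels, reduction="none")
    # invalid rows contribute zero instead of being sliced out
    return (per_sample * valid).sum() / valid.sum().clamp(min=1.0)

The tensor shapes then stay fixed from step to step, so the compiled graph can be reused.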

Regarding the segfault, can you try turning on the logs:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0

and also check dmesg after the run finishes.
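
For example, something like this should surface the driver messages:

$ sudo dmesg -T | grep habanalabs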

Correction: the second dmesg output block (further down) was an accidental duplicate paste. It should have been this:

[11400.169233] habanalabs hl0: Received H/W interrupt 211 ["TPC5_DEC"]
[11400.173312] habanalabs hl0: TPC5_AXI_SLV_DEC_Error interrupt cause: tpc_hbw_rresp_err
[11404.191663] habanalabs hl0: Device CPU packet timeout (status = 0xffffffff)
[11404.195891] habanalabs hl0: failed to send NIC status, port 4
[11404.199576] habanalabs hl0: failed to send XPCS91 pkt, port 5
[11404.203293] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 5, err -5
[11404.209255] habanalabs hl0: failed to send XPCS91 pkt, port 6
[11404.212972] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 6, err -5
[11404.218879] habanalabs hl0: failed to send XPCS91 pkt, port 7
[11404.222567] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 7, err -5
[11404.228451] habanalabs hl0: failed to send NIC status, port 8
[11404.232147] habanalabs hl0: failed to unmask RAZWI IRQ 656
[11404.235795] habanalabs hl0: failed to send NIC status, port 9
[11404.239501] habanalabs hl0: failed to unmask RAZWI IRQ 648
[11404.239580] habanalabs hl0: Failed to get remote fault cnt for port 6, error -5
[11404.243158] habanalabs hl0: failed to unmask RAZWI IRQ 650
[11404.243324] habanalabs hl0: failed to unmask RAZWI IRQ 211
[11404.243343] habanalabs hl0: Received H/W interrupt 203 ["TPC1_DEC"]
[11404.243354] habanalabs hl0: failed to unmask RAZWI IRQ 203
[11404.243356] habanalabs hl0: Received H/W interrupt 207 ["TPC3_DEC"]
[11404.243365] habanalabs hl0: failed to unmask RAZWI IRQ 207
[11404.243367] habanalabs hl0: Received H/W interrupt 215 ["TPC7_DEC"]
[11404.243377] habanalabs hl0: failed to unmask RAZWI IRQ 215
[11404.243378] habanalabs hl0: Received H/W interrupt 662 ["RAZWI_OR_ADC_SW"]
[11404.243387] habanalabs hl0: Going to reset device
[11404.273745] habanalabs hl0: Card 0 Port 6: link down
[11404.309745] habanalabs hl0: Card 0 Port 4: link down
[11404.309747] habanalabs hl0: Card 0 Port 5: link down
[11404.309748] habanalabs hl0: Card 0 Port 3: link down
[11404.309756] habanalabs hl0: Card 0 Port 7: link down
[11404.309757] habanalabs hl0: Card 0 Port 2: link down
[11404.309764] habanalabs hl0: Card 0 Port 0: link down
[11404.398353] habanalabs hl0: Killing CS 1.1228
[11404.398364] habanalabs hl0: Killing CS 1.1229
[11404.398367] habanalabs hl0: Killing CS 1.1230
[11404.398369] habanalabs hl0: Killing CS 1.1231
[11404.398373] habanalabs hl0: Killing CS 1.1232
[11404.398375] habanalabs hl0: Killing CS 1.1233
[11404.398377] habanalabs hl0: Killing CS 1.1234
[11404.398379] habanalabs hl0: Killing CS 1.1235
[11404.398382] habanalabs hl0: wait_for_fence error :-5 for CS seq 1231
[11404.398384] habanalabs hl0: Killing CS 1.1236
[11404.398386] habanalabs hl0: multi-CS completion context 0 still waiting when calling force completion
[11404.398392] habanalabs hl0: CS 1236 has been aborted while user process is waiting for it
[11405.421728] habanalabs hl0: Killing user process pid=304147
[11410.509932] habanalabs hl0: Driver version: 1.11.0-e6eb0fd
[11410.510025] habanalabs hl0: Loading secured firmware to device, may take some time...
[11410.582630] habanalabs hl0: preboot full version: 'Preboot version hl-gaudi-0.14.10-fw-32.0.13-sec-4 (Aug 13 2021 - 17:47:26)'
[11410.582634] habanalabs hl0: BTL version 9f7a1057
[11419.486024] habanalabs hl0: boot-fit version 32.6.6-sec-4
[11420.855816] habanalabs hl0: Successfully loaded firmware to device
[11423.464884] habanalabs hl0: Linux version 32.6.6-sec-4
[11423.907981] habanalabs hl0: Successfully finished resetting the 0000:10:1d.0 device

Very similar to the first block though.

I got the segfault again. Here is what dmesg shows.

One instance:

[11206.625762] habanalabs hl0: Received H/W interrupt 203 ["TPC1_DEC"]
[11210.626142] habanalabs hl0: Device CPU packet timeout (status = 0xffffffff)
[11210.630504] habanalabs hl0: failed to send NIC status, port 0
[11210.634377] habanalabs hl0: failed to send NIC status, port 1
[11210.638245] habanalabs hl0: failed to unmask RAZWI IRQ 649
[11210.641942] habanalabs hl0: failed to send XPCS91 pkt, port 2
[11210.645743] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 2, err -5
[11210.651759] habanalabs hl0: failed to send XPCS91 pkt, port 3
[11210.655607] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 3, err -5
[11210.661625] habanalabs hl0: failed to send XPCS91 pkt, port 4
[11210.665445] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 4, err -5
[11210.671515] habanalabs hl0: failed to send XPCS91 pkt, port 5
[11210.675278] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 5, err -5
[11210.681273] habanalabs hl0: failed to send XPCS91 pkt, port 6
[11210.685099] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 6, err -5
[11210.779972] habanalabs hl0: failed to send XPCS91 pkt, port 7
[11210.783834] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 7, err -5
[11210.789915] habanalabs hl0: failed to send NIC status, port 8
[11210.793665] habanalabs hl0: failed to send NIC status, port 9
[11210.793742] habanalabs hl0: failed to unmask RAZWI IRQ 203
[11210.793818] habanalabs hl0: Failed to get remote fault cnt for port 4, error -5
[11210.793841] habanalabs hl0: Failed to get remote fault cnt for port 2, error -5
[11210.793863] habanalabs hl0: Failed to get remote fault cnt for port 0, error -5
[11210.793949] habanalabs hl0: Failed to get remote fault cnt for port 6, error -5
[11210.793972] habanalabs hl0: Failed to get remote fault cnt for port 7, error -5
[11210.793977] habanalabs hl0: Failed to get remote fault cnt for port 3, error -5
[11210.793985] habanalabs hl0: Failed to get remote fault cnt for port 5, error -5
[11210.794019] habanalabs hl0: Device heartbeat failed! PCI link is healthy
[11210.794021] habanalabs hl0: Heartbeat reset is disabled
[11210.818242] habanalabs hl0: Card 0 Port 7: link down
[11210.818254] habanalabs hl0: Card 0 Port 5: link down
[11210.818271] habanalabs hl0: Card 0 Port 0: link down
[11210.818282] habanalabs hl0: Card 0 Port 3: link down
[11210.818292] habanalabs hl0: Card 0 Port 2: link down
[11210.818302] habanalabs hl0: Card 0 Port 4: link down
[11210.818722] habanalabs hl0: Received H/W interrupt 207 ["TPC3_DEC"]
[11210.824576] habanalabs hl0: Card 0 Port 6: link down
[11210.830446] habanalabs hl0: Received H/W interrupt 211 ["TPC5_DEC"]
[11210.857780] habanalabs hl0: Received H/W interrupt 215 ["TPC7_DEC"]
[11210.861710] habanalabs hl0: Received H/W interrupt 662 ["RAZWI_OR_ADC_SW"]
[11210.865916] habanalabs hl0: Going to reset device
[11210.959308] hl_cs_rollback_all: 834 callbacks suppressed
[11210.959311] habanalabs hl0: Killing CS 1.1229
[11210.959320] habanalabs hl0: Killing CS 1.1230
[11210.959324] habanalabs hl0: Killing CS 1.1231
[11210.959326] habanalabs hl0: Killing CS 1.1232
[11210.959329] habanalabs hl0: Killing CS 1.1233
[11210.959332] habanalabs hl0: Killing CS 1.1234
[11210.959332] habanalabs hl0: CS 1229 has been aborted while user process is waiting for it
[11210.959334] habanalabs hl0: Killing CS 1.1235
[11210.959336] habanalabs hl0: Killing CS 1.1236
[11210.959342] habanalabs hl0: wait_for_fence error :-5 for CS seq 1232
[11210.979082] habanalabs hl0: Killing CS 1.1237
[11210.979099] habanalabs hl0: CS 1237 has been aborted while user process is waiting for it
[11212.014261] habanalabs hl0: Killing user process pid=299286
[11217.102447] habanalabs hl0: Driver version: 1.11.0-e6eb0fd
[11217.102561] habanalabs hl0: Loading secured firmware to device, may take some time...
[11217.175150] habanalabs hl0: preboot full version: 'Preboot version hl-gaudi-0.14.10-fw-32.0.13-sec-4 (Aug 13 2021 - 17:47:26)'
[11217.175154] habanalabs hl0: BTL version 9f7a1057
[11226.074443] habanalabs hl0: boot-fit version 32.6.6-sec-4
[11227.462199] habanalabs hl0: Successfully loaded firmware to device
[11230.088248] habanalabs hl0: Linux version 32.6.6-sec-4
[11230.529824] habanalabs hl0: Successfully finished resetting the 0000:10:1d.0 device

Another:

[11206.625762] habanalabs hl0: Received H/W interrupt 203 ["TPC1_DEC"]
[11210.626142] habanalabs hl0: Device CPU packet timeout (status = 0xffffffff)
[11210.630504] habanalabs hl0: failed to send NIC status, port 0
[11210.634377] habanalabs hl0: failed to send NIC status, port 1
[11210.638245] habanalabs hl0: failed to unmask RAZWI IRQ 649
[11210.641942] habanalabs hl0: failed to send XPCS91 pkt, port 2
[11210.645743] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 2, err -5
[11210.651759] habanalabs hl0: failed to send XPCS91 pkt, port 3
[11210.655607] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 3, err -5
[11210.661625] habanalabs hl0: failed to send XPCS91 pkt, port 4
[11210.665445] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 4, err -5
[11210.671515] habanalabs hl0: failed to send XPCS91 pkt, port 5
[11210.675278] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 5, err -5
[11210.681273] habanalabs hl0: failed to send XPCS91 pkt, port 6
[11210.685099] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 6, err -5
[11210.779972] habanalabs hl0: failed to send XPCS91 pkt, port 7
[11210.783834] habanalabs hl0: failed to fetch XPCS91 registers from FW, port 7, err -5
[11210.789915] habanalabs hl0: failed to send NIC status, port 8
[11210.793665] habanalabs hl0: failed to send NIC status, port 9
[11210.793742] habanalabs hl0: failed to unmask RAZWI IRQ 203
[11210.793818] habanalabs hl0: Failed to get remote fault cnt for port 4, error -5
[11210.793841] habanalabs hl0: Failed to get remote fault cnt for port 2, error -5
[11210.793863] habanalabs hl0: Failed to get remote fault cnt for port 0, error -5
[11210.793949] habanalabs hl0: Failed to get remote fault cnt for port 6, error -5
[11210.793972] habanalabs hl0: Failed to get remote fault cnt for port 7, error -5
[11210.793977] habanalabs hl0: Failed to get remote fault cnt for port 3, error -5
[11210.793985] habanalabs hl0: Failed to get remote fault cnt for port 5, error -5
[11210.794019] habanalabs hl0: Device heartbeat failed! PCI link is healthy
[11210.794021] habanalabs hl0: Heartbeat reset is disabled
[11210.818242] habanalabs hl0: Card 0 Port 7: link down
[11210.818254] habanalabs hl0: Card 0 Port 5: link down
[11210.818271] habanalabs hl0: Card 0 Port 0: link down
[11210.818282] habanalabs hl0: Card 0 Port 3: link down
[11210.818292] habanalabs hl0: Card 0 Port 2: link down
[11210.818302] habanalabs hl0: Card 0 Port 4: link down
[11210.818722] habanalabs hl0: Received H/W interrupt 207 ["TPC3_DEC"]
[11210.824576] habanalabs hl0: Card 0 Port 6: link down
[11210.830446] habanalabs hl0: Received H/W interrupt 211 ["TPC5_DEC"]
[11210.857780] habanalabs hl0: Received H/W interrupt 215 ["TPC7_DEC"]
[11210.861710] habanalabs hl0: Received H/W interrupt 662 ["RAZWI_OR_ADC_SW"]
[11210.865916] habanalabs hl0: Going to reset device
[11210.959308] hl_cs_rollback_all: 834 callbacks suppressed
[11210.959311] habanalabs hl0: Killing CS 1.1229
[11210.959320] habanalabs hl0: Killing CS 1.1230
[11210.959324] habanalabs hl0: Killing CS 1.1231
[11210.959326] habanalabs hl0: Killing CS 1.1232
[11210.959329] habanalabs hl0: Killing CS 1.1233
[11210.959332] habanalabs hl0: Killing CS 1.1234
[11210.959332] habanalabs hl0: CS 1229 has been aborted while user process is waiting for it
[11210.959334] habanalabs hl0: Killing CS 1.1235
[11210.959336] habanalabs hl0: Killing CS 1.1236
[11210.959342] habanalabs hl0: wait_for_fence error :-5 for CS seq 1232
[11210.979082] habanalabs hl0: Killing CS 1.1237
[11210.979099] habanalabs hl0: CS 1237 has been aborted while user process is waiting for it
[11212.014261] habanalabs hl0: Killing user process pid=299286
[11217.102447] habanalabs hl0: Driver version: 1.11.0-e6eb0fd
[11217.102561] habanalabs hl0: Loading secured firmware to device, may take some time...
[11217.175150] habanalabs hl0: preboot full version: 'Preboot version hl-gaudi-0.14.10-fw-32.0.13-sec-4 (Aug 13 2021 - 17:47:26)'
[11217.175154] habanalabs hl0: BTL version 9f7a1057
[11226.074443] habanalabs hl0: boot-fit version 32.6.6-sec-4
[11227.462199] habanalabs hl0: Successfully loaded firmware to device
[11230.088248] habanalabs hl0: Linux version 32.6.6-sec-4
[11230.529824] habanalabs hl0: Successfully finished resetting the 0000:10:1d.0 device

The curious thing is that when I added a print statement of a slice of a tensor in the backward pass of an all-gather autograd function, the segfault went away:

print(f"grad: {grads[0][:2, :2]}")

I'm not sure how to interpret this. If it means anything, it must be a side effect.
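
For reference, the all-gather function follows the usual pattern for this kind of autograd wrapper; this is a generic sketch (not a verbatim copy of the attached code) with the debug print in backward(). My guess is the print forces the gradient tensor to be computed and copied to the host at that point, which changes the execution timing in lazy mode:

import torch
import torch.distributed as dist

class AllGather(torch.autograd.Function):
    # all_gather the features in forward, sum-reduce the incoming grads in backward

    @staticmethod
    def forward(ctx, tensor):
        ctx.world_size = dist.get_world_size()
        ctx.rank = dist.get_rank()
        out = [torch.empty_like(tensor) for _ in range(ctx.world_size)]
        dist.all_gather(out, tensor)
        return torch.cat(out, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)                 # every rank used every chunk of the gathered tensor
        grads = grad.chunk(ctx.world_size, dim=0)
        print(f"grad: {grads[0][:2, :2]}")    # the debug print that made the segfault disappear
        return grads[ctx.rank].contiguous()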

Thank you.

I was able to resolve the slowness issue. After swapping PReLU for LeakyReLU, throughput went up dramatically and now matches V100s. I'm assuming support for PReLU is limited? PReLU also gave me issues when I tried to use torch's autocast, throwing a "could not promote…" error. Again, with LeakyReLU, speed is high and device usage is around 70%.
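
In case anyone wants to try the same swap without editing the backbone definition, a recursive module replacement along these lines should work (a sketch; note the learned PReLU slopes are simply discarded, which only makes sense when training from scratch):

import torch.nn as nn

def replace_prelu_with_leakyrelu(module: nn.Module) -> None:
    # swap every nn.PReLU for nn.LeakyReLU, recursing into submodules
    for name, child in module.named_children():
        if isinstance(child, nn.PReLU):
            setattr(module, name, nn.LeakyReLU(negative_slope=0.01, inplace=True))
        else:
            replace_prelu_with_leakyrelu(child)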

I’ll look into the dynamic shape issue to see if it could further boost speed/device usage. Yes, the backbone model is static.

Regarding the segfault, I haven't been able to reproduce it just now, although it was clearly happening before. I'll report back when I see it again.

Checking the op list here:

I see relu, leaky relu, rrelu, elu, and gelu supported, but prelu is not in the list.

One way to check whether ops are falling back to the CPU is using this log. If something shows up as falling back to the CPU, that might cause slowness.