Problem training Llama-3-70B with DeepSpeed on Gaudi 2

We are trying to train Llama-3-70B with DeepSpeed on the Habana stack, on a Gaudi 2 machine. The training starts and runs for 1-2 hours, but at some point it aborts with this error:

Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 5) 11:11:38 [No progress error]

The command that we are using is something like this:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml --num_processes=8 train_sft.py model.model_name_or_path=meta-llama/Meta-Llama-3-70B-Instruct 
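For completeness, the run with the extra Synapse console logging mentioned further below was launched roughly like this; only the two environment variables differ from the command above:

ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 \
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  --num_processes=8 \
  train_sft.py model.model_name_or_path=meta-llama/Meta-Llama-3-70B-Instruct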

The DeepSpeed ZeRO-3 config is the same one we use on AWS A100 machines (we didn't find any documentation indicating that changes are needed for Gaudi):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
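For reference, our reading is that Accelerate should translate the YAML above into a DeepSpeed config roughly equivalent to the following JSON; this is an approximation of the field mapping, not a dump from the actual run:

{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "none" },
    "offload_param": { "device": "none" },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}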

After running the same command with more verbose logging enabled (ENABLE_CONSOLE=true LOG_LEVEL_ALL=4), we got:

[11:11:38.907114][SCAL][error][tid:20B435] tdr: tdr timeout - compute_completion_queue0 ctr_value 0x61854a expected 0x618604 armed true prevCtr 0x61854a since armed 600660448us timeoutUs 600000000
[11:11:38.907162][SCAL][error][tid:20B435] tdr: tdr timeout - network_completion_queue_external_10 ctr_value 0xe6d095 expected 0xe6d572 armed true prevCtr 0xe6d095 since armed 600660615us timeoutUs 600000000
[11:11:38.907171][SCAL][error][tid:20B435] tdr: tdr timeout - network_completion_queue_internal_10 ctr_value 0xe6d095 expected 0xe6d572 armed true prevCtr 0xe6d095 since armed 600660623us timeoutUs 600000000
[11:11:38.907184][SCAL][error][tid:20B435] tdr: tdr timeout - pdma_tx_commands_completion_queue0 ctr_value 0x617d24 expected 0x617dce armed true prevCtr 0x617d24 since armed 600660636us timeoutUs 600000000
[11:11:38.907195][SCAL][error][tid:20B435] tdr: no progress in 600000000 us on cg-noTdr: network_completion_queue_external_10, network_completion_queue_internal_10 cg-Tdr: compute_completion_queue0, pdma_tx_commands_completion_queue0
[11:11:38.907221][SYN_API       ][critical][tid:20B435] DFA detected, see separate file for details dfa_log.txt
[11:11:38.907239][SYN_API       ][error][tid:20B435] DFA detected, see separate file for details dfa_log.txt
...
[11:00:53.703211][HCL_API     ][info ][tid:20D016] hcclAllGather_impl: rank=4/8, oam=4, (sendbuff=0x1001600742c31080, recvbuff=0x1001600734b31080, sendcount=29491200, datatype=bf16, uniqId=100.83.87.239:43019, stream_handle=0x1e0f000000000005) - collective#=0x345dc
[11:00:54.009178][HCL_API     ][info ][tid:20D016] hcclAllGather_impl: rank=4/8, oam=4, (sendbuff=0x10016013902ff080, recvbuff=0x10016013820bf080, sendcount=29655040, datatype=bf16, uniqId=100.83.87.239:43019, stream_handle=0x1e0f000000000005) - collective#=0x345dd
[11:00:54.536477][HCL_API     ][info ][tid:20D016] hcclReduceScatter_impl: rank=4/8, oam=4, (sendbuff=0x1001600f6b6b7880, recvbuff=0x1001600f8932b880, recvcount=62449664, datatype=bf16, reduceOp=sum, uniq...
[11:11:44.619514][HCL       ][critical][tid:20B3B2] SignalsManager::DFA ArchStream 1 is stuck on Long So value 15126677 (0xe6d095)
[11:11:44.619550][HCL       ][critical][tid:20B3B2]     SignalsManager::DFA expecting resource 11[EXTERNAL_CG_SO] (DCORE1_SYNC_MNGR_OBJS SOB_OBJ_469) to reach value CMAX (0x4000) by incrementing 43 signals. current value: 0x3ffa (missing 6 signals)
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 5) 11:11:38 [No progress error]
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 2) 11:11:39 [No progress error]
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 6) 11:11:39 [No progress error]
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 3) 11:11:39 [No progress error]
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 4) 11:11:39 [No progress error]
[11:12:02.297202][PT_BRIDGE       ][error][tid:20ADC7] Crash. signal : 15 Terminated. Severity: low
... (the same "Crash. signal : 15 Terminated. Severity: low" message is logged, heavily interleaved, by roughly 30 more PT_BRIDGE threads)
[11:12:02.301719][SYN_API       ][critical][tid:20A7FD] DFA detected, see separate file for details dfa_log.txt
[11:12:02.301749][SYN_API       ][error][tid:20A7FD] DFA detected, see separate file for details dfa_log.txt
[2024-07-17 11:12:32,296] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 2140157 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-07-17 11:12:37,939] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 2140151) of binary: /home/sdp/habanalabs-venv/bin/python3.10
Traceback (most recent call last):
  File "/home/sdp/habanalabs-venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/sdp/habanalabs-venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/sdp/habanalabs-venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/home/sdp/habanalabs-venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/sdp/habanalabs-venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/sdp/habanalabs-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sdp/habanalabs-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train_sft.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-17_11:12:02
  host      : poc-machine
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 2140152)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2140152
[2]:
  time      : 2024-07-17_11:12:02
  host      : poc-machine
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 2140153)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2140153

Any idea what we should try or what to look for?

Just to be sure, which Docker image and driver version (output of hl-smi) are you using?
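A quick way to capture that on the Gaudi host / inside the container is roughly the following (the pip package names are what the Habana PyTorch Docker images usually ship, so adjust if yours differ):

hl-smi                        # driver and firmware versions are shown in the table header
pip list | grep -i habana     # SynapseAI / habana-torch-plugin versions inside the container
cat /etc/os-release           # base OS of the running image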