Synapse detected a device critical error that requires a restart. [Compute or dma timeout]

Hi, I’m running inference on an LLM with PyTorch, and I hit an error when PT_HPU_LAZY_MODE=1 is set. The process reports: “Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 0) 00:21:44 [Compute or dma timeout]”
However, everything works fine when I use eager mode.

Here are the relevant logs, which I think might help.

synapse_runtime.log

[00:20:57.662273][SYN_API       ][info ][tid:3A7E5] + ---------------------------------------------------------------------- +
[00:20:57.662297][SYN_API       ][info ][tid:3A7E5] | Version:            1.17.0                                             |
[00:20:57.662299][SYN_API       ][info ][tid:3A7E5] | Synapse:            db1a431                                            |
[00:20:57.662304][SYN_API       ][info ][tid:3A7E5] | HCL:                1.17.0-a6d0341                                     |
[00:20:57.662305][SYN_API       ][info ][tid:3A7E5] | MME:                f1ec30d                                            |
[00:20:57.662306][SYN_API       ][info ][tid:3A7E5] | SCAL:               b05f1cf                                            |
[00:20:57.662308][SYN_API       ][info ][tid:3A7E5] | Description:        HabanaLabs Runtime and GraphCompiler               |
[00:20:57.662342][SYN_API       ][info ][tid:3A7E5] | Time:               2024-11-07 00:20:57.662308                         |

[00:20:57.662345][SYN_API       ][info ][tid:3A7E5] + ---------------------------------------------------------------------- +
[00:20:58.541908][SYN_API       ][info ][tid:3A7E5] synInitialize, status 0[synSuccess]
[00:21:00.682921][SYN_API       ][info ][tid:3A7E5] synDeviceAcquireByDeviceType, status 0[synSuccess]
[00:21:44.300630][SYN_STREAM    ][error][tid:3A888] ------------------------------------------------------------
[00:21:44.300748][SYN_STREAM    ][error][tid:3A888] | Engines timeout reached
[00:21:44.300753][SYN_STREAM    ][error][tid:3A888] | on stream:		compute_completion_queue0
[00:21:44.300757][SYN_STREAM    ][error][tid:3A888] | completed:		0x8 out of 0xa commands
[00:21:44.300769][SYN_STREAM    ][error][tid:3A888] | time waited:	30031292us
[00:21:44.300772][SYN_STREAM    ][error][tid:3A888] | timeout:		30000000us, timeout is enabled.
[00:21:44.300774][SYN_STREAM    ][error][tid:3A888] ------------------------------------------------------------
[00:21:44.300799][SYN_API       ][critical][tid:3A888] DFA detected, see separate file for details dfa_log.txt
[00:21:44.300806][SYN_API       ][error][tid:3A888] DFA detected, see separate file for details dfa_log.txt
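
To make the numbers in the SYN_STREAM block easier to read, here is a rough Python sketch that pulls them out of the three relevant lines. The function name `parse_timeout` and the regular expressions are my own assumptions based only on the lines pasted above, not on any documented log format.

```python
import re

# Excerpt of the SYN_STREAM lines from synapse_runtime.log above.
LOG = """\
[00:21:44.300757][SYN_STREAM    ][error][tid:3A888] | completed:\t\t0x8 out of 0xa commands
[00:21:44.300769][SYN_STREAM    ][error][tid:3A888] | time waited:\t30031292us
[00:21:44.300772][SYN_STREAM    ][error][tid:3A888] | timeout:\t\t30000000us, timeout is enabled.
"""


def parse_timeout(log: str) -> dict:
    """Extract command counts and wait times (hypothetical helper)."""
    done, total = re.search(
        r"completed:\s+(0x[0-9a-fA-F]+) out of (0x[0-9a-fA-F]+)", log
    ).groups()
    waited_us = int(re.search(r"time waited:\s+(\d+)us", log).group(1))
    limit_us = int(re.search(r"timeout:\s+(\d+)us", log).group(1))
    return {
        "completed": int(done, 16),   # hex counters in the log
        "total": int(total, 16),
        "waited_s": waited_us / 1e6,  # microseconds to seconds
        "timeout_s": limit_us / 1e6,
    }


info = parse_timeout(LOG)
print(info)
```

For this log it gives 8 of 10 commands completed, with about 30.03 s waited against a 30 s timeout, so the stream stalled with two commands still outstanding.
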

dfa_log.txt

[00:21:44.300954][SYN_DEV_FAIL][info ][tid:3A888] tid 239752 #DFA begin 1730910104300826757
[00:21:44.301090][SYN_DEV_FAIL][error][tid:3A888] 


======================================== DFA triggered on the following errors =========================================

Timeout detected on compute/pdma streams: compute_completion_queue0
Check 'Engines Status' for more information on active engines
Check 'Oldest work in each stream' for streams status

========================================================================================================================



[00:21:44.301103][SYN_DEV_FAIL][trace][tid:3A888] ===================================================== Failure Info =====================================================
[00:21:44.301106][SYN_DEV_FAIL][trace][tid:3A888] --- Errors detected ---
[00:21:44.301115][SYN_DEV_FAIL][error][tid:3A888] #DFA reason: tdrFailed            (code 0x1)
[00:21:44.301118][SYN_DEV_FAIL][error][tid:3A888] 

[00:21:44.301149][SYN_DEV_FAIL][info ][tid:3A888] + ---------------------------------------------------------------------- +
[00:21:44.301151][SYN_DEV_FAIL][info ][tid:3A888] | Version:            1.17.0                                             |
[00:21:44.301154][SYN_DEV_FAIL][info ][tid:3A888] | Synapse:            db1a431                                            |
[00:21:44.301156][SYN_DEV_FAIL][info ][tid:3A888] | HCL:                1.17.0-a6d0341                                     |
[00:21:44.301158][SYN_DEV_FAIL][info ][tid:3A888] | MME:                f1ec30d                                            |
[00:21:44.301160][SYN_DEV_FAIL][info ][tid:3A888] | SCAL:               b05f1cf                                            |
[00:21:44.301162][SYN_DEV_FAIL][info ][tid:3A888] | Description:        Habana Labs Device failure analysis                |
[00:21:44.301181][SYN_DEV_FAIL][info ][tid:3A888] | Time:               2024-11-07 00:21:44.301163                         |
[00:21:44.301183][SYN_DEV_FAIL][info ][tid:3A888] + ---------------------------------------------------------------------- +
[00:21:44.301193][SYN_DEV_FAIL][info ][tid:3A888] Failure occurred on device:
[00:21:44.301199][SYN_DEV_FAIL][info ][tid:3A888]      #device name:     hl0
[00:21:44.301201][SYN_DEV_FAIL][info ][tid:3A888]      Moudle index:       6
[00:21:44.301207][SYN_DEV_FAIL][info ][tid:3A888]      fd compute/control: 20/19
[00:21:44.301209][SYN_DEV_FAIL][info ][tid:3A888]      #global rank Id:    ---
[00:21:44.301217][SYN_DEV_FAIL][info ][tid:3A888] #device type GAUDI2
[00:21:44.301219][SYN_DEV_FAIL][info ][tid:3A888] #is simulator  No
[00:21:44.301222][SYN_DEV_FAIL][info ][tid:3A888] #acquire time: 2024-11-07 00:20:58.568116
[00:21:44.301225][SYN_DEV_FAIL][info ][tid:3A888] pci addr:      0000:00:06.0
[00:21:44.301227][SYN_DEV_FAIL][info ][tid:3A888] Server name:   vmInstance0hgiwjnz, ModuleId 6
[00:21:44.301229][SYN_DEV_FAIL][trace][tid:3A888] ==================================================== Engines Status ====================================================
[00:21:44.301711][SYN_DEV_FAIL][trace][tid:3A888] actualSize of engine dump 4035

CORE  EDMA  is_idle  QM_GLBL_STS0  DMA_CORE_STS0  DMA_CORE_STS1
----  ----  -------  ------------  -------------  -------------
 0     0     N        0x1eff        0xa41          0x0
 0     1     N        0x1eff        0xa41          0x0
 1     0     N        0x1eff        0xa41          0x0
 1     1     N        0x1eff        0xa41          0x0
 2     0     N        0x1eff        0xa41          0x0
 2     1     Y        0x3fff        0x6            0x0
 3     0     Y        0x3fff        0x6            0x0
 3     1     Y        0x3fff        0x0            0x0
 
PDMA  is_idle  QM_GLBL_STS0  DMA_CORE_STS0  DMA_CORE_STS1
----  -------  ------------  -------------  -------------
 0     Y        0x3fff        0xa            0x0
 1     Y        0x3fff        0x59c0648      0x0
 
NIC  is_idle  QM_GLBL_STS0  QM_CGM_STS
---  -------  ------------  ----------
 0    Y        0x3fff        0xf11       
 1    Y        0x3fff        0xf11       
 2    Y        0x3fff        0xf11       
 3    Y        0x3fff        0xf11       
 4    Y        0x3fff        0xf11       
 5    Y        0x3fff        0xf11       
 6    Y        0x3fff        0xf11       
 7    Y        0x3fff        0xf11       
 9    Y        0x3fff        0xf11       
 10   Y        0x3fff        0xf11       
 11   Y        0x3fff        0xf11       
 12   Y        0x3fff        0xf11       
 13   Y        0x3fff        0xf11       
 14   Y        0x3fff        0xf11       
 15   Y        0x3fff        0xf11       
 16   Y        0x3fff        0xf11       
 17   Y        0x3fff        0xf11       
 18   Y        0x3fff        0xf11       
 19   Y        0x3fff        0xf11       
 20   Y        0x3fff        0xf11       
 21   Y        0x3fff        0xf11       
 
MME  Stub  is_idle  QM_GLBL_STS0  MME_ARCH_STATUS
---  ----  -------  ------------  ---------------
 0    N     N        0x1eff        0xc27ffe00
 1    N     Y        0x3fff        0xc27ffe00
 2    N     N        0x1eff        0xc27ffe00
 3    N     Y        0x3fff        0xc27ffe00
 
CORE  TPC  is_idle  QM_GLBL_STS0  QM_CGM_STS  STATUS
----  ---  -------  ------------  ----------  ------
 0     0    N        0x1eff        0xa00       0x80
 0     1    N        0x1eff        0xa00       0x80
 0     2    N        0x1eff        0xa00       0x80
 0     3    N        0x1eff        0xa00       0x80
 0     4    N        0x1eff        0xb00       0xee
 0     5    N        0x1eff        0xb00       0xee
 1     0    N        0x1eff        0xb00       0xee
 1     1    N        0x1eff        0xb00       0xee
 1     2    N        0x1eff        0xb00       0xee
 1     3    N        0x1eff        0xb00       0xee
 1     4    N        0x1eff        0xb00       0xee
 1     5    N        0x1eff        0xb00       0xee
 2     0    N        0x1eff        0xb00       0xee
 2     1    N        0x1eff        0xb00       0xee
 2     2    N        0x1eff        0xb00       0xee
 2     3    N        0x1eff        0xb00       0xee
 2     4    N        0x1eff        0xb00       0xee
 2     5    N        0x1eff        0xb00       0xee
 3     0    N        0x1eff        0xb00       0xee
 3     1    N        0x1eff        0xb00       0xee
 3     2    N        0x1eff        0xa00       0x80
 3     3    N        0x1eff        0xa00       0x80
 3     4    N        0x1eff        0xa00       0x80
 3     5    N        0x1eff        0xa00       0x80
 
CORE  DEC  is_idle  VSI_CMD_SWREG15
----  ---  -------  ---------------
 0     0    Y        0x400000
 0     1    Y        0x400000
 1     0    Y        0x400000
 1     1    Y        0x400000
 2     0    Y        0x400000
 2     1    Y        0x400000
 3     0    Y        0x400000
 3     1    Y        0x400000
 
PCIe DEC  is_idle  VSI_CMD_SWREG15
--------  -------  ---------------
 0         Y        0x400000
 
CORE  ROT  is_idle  QM_GLBL_STS0  QM_GLBL_STS1  QM_CGM_STS
----  ---  -------  ------------  ------------  ----------
 0     0    N        0x1fff        0x0           0xb00
 1     0    N        0x1fff        0x0           0xb00
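
Since the DFA header says to check 'Engines Status' for active engines, here is a small Python sketch that counts the rows with is_idle = N in one of the tables. It assumes column names are separated by two or more spaces and that cell values contain no spaces, which is inferred from this dump only; `count_busy` is a hypothetical helper name.

```python
import re

# Excerpt of the MME table from dfa_log.txt above.
TABLE = """\
MME  Stub  is_idle  QM_GLBL_STS0  MME_ARCH_STATUS
---  ----  -------  ------------  ---------------
 0    N     N        0x1eff        0xc27ffe00
 1    N     Y        0x3fff        0xc27ffe00
 2    N     N        0x1eff        0xc27ffe00
 3    N     Y        0x3fff        0xc27ffe00
"""


def count_busy(table: str) -> tuple[int, int]:
    """Return (busy, total) engine counts for one status table."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    # Header names may contain a single space (e.g. "PCIe DEC"),
    # so split on runs of two or more spaces.
    header = re.split(r"\s{2,}", lines[0].strip())
    idle_col = header.index("is_idle")
    rows = [ln.split() for ln in lines[2:]]  # skip header and dashes
    busy = sum(1 for row in rows if row[idle_col] == "N")
    return busy, len(rows)


busy, total = count_busy(TABLE)
print(f"{busy} of {total} MME engines were still busy")
```

Applied to the tables above, the same count shows all 24 TPCs, 5 of 8 EDMA engines, 2 of 4 MMEs and both rotators still busy at the time of the timeout, while the PDMA, NIC and decoder engines were idle.
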