Hi, I’m using PyTorch for inference with an LLM, and I’m encountering an error when lazy mode is enabled (PT_HPU_LAZY_MODE=1).
The error message is: “Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 0) 00:21:44 [Compute or dma timeout]”
However, everything works fine when I use eager mode.
Here are the relevant logs, which I think might help.
synapse_runtime.log
[00:20:57.662273][SYN_API ][info ][tid:3A7E5] + ---------------------------------------------------------------------- +
[00:20:57.662297][SYN_API ][info ][tid:3A7E5] | Version: 1.17.0 |
[00:20:57.662299][SYN_API ][info ][tid:3A7E5] | Synapse: db1a431 |
[00:20:57.662304][SYN_API ][info ][tid:3A7E5] | HCL: 1.17.0-a6d0341 |
[00:20:57.662305][SYN_API ][info ][tid:3A7E5] | MME: f1ec30d |
[00:20:57.662306][SYN_API ][info ][tid:3A7E5] | SCAL: b05f1cf |
[00:20:57.662308][SYN_API ][info ][tid:3A7E5] | Description: HabanaLabs Runtime and GraphCompiler |
[00:20:57.662342][SYN_API ][info ][tid:3A7E5] | Time: 2024-11-07 00:20:57.662308 |
[00:20:57.662345][SYN_API ][info ][tid:3A7E5] + ---------------------------------------------------------------------- +
[00:20:58.541908][SYN_API ][info ][tid:3A7E5] synInitialize, status 0[synSuccess]
[00:21:00.682921][SYN_API ][info ][tid:3A7E5] synDeviceAcquireByDeviceType, status 0[synSuccess]
[00:21:44.300630][SYN_STREAM ][error][tid:3A888] ------------------------------------------------------------
[00:21:44.300748][SYN_STREAM ][error][tid:3A888] | Engines timeout reached
[00:21:44.300753][SYN_STREAM ][error][tid:3A888] | on stream: compute_completion_queue0
[00:21:44.300757][SYN_STREAM ][error][tid:3A888] | completed: 0x8 out of 0xa commands
[00:21:44.300769][SYN_STREAM ][error][tid:3A888] | time waited: 30031292us
[00:21:44.300772][SYN_STREAM ][error][tid:3A888] | timeout: 30000000us, timeout is enabled.
[00:21:44.300774][SYN_STREAM ][error][tid:3A888] ------------------------------------------------------------
[00:21:44.300799][SYN_API ][critical][tid:3A888] DFA detected, see separate file for details dfa_log.txt
[00:21:44.300806][SYN_API ][error][tid:3A888] DFA detected, see separate file for details dfa_log.txt
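To make the timeout block above easier to read, here is a small sketch that pulls the key numbers out of it. The function name `parse_timeout_block` is my own and is not part of any Habana tooling; the regular expressions only assume the line format shown in this particular log.

```python
import re

# Excerpt of the SYN_STREAM timeout block, copied from the log above.
LOG = """\
[00:21:44.300748][SYN_STREAM ][error][tid:3A888] | Engines timeout reached
[00:21:44.300753][SYN_STREAM ][error][tid:3A888] | on stream: compute_completion_queue0
[00:21:44.300757][SYN_STREAM ][error][tid:3A888] | completed: 0x8 out of 0xa commands
[00:21:44.300769][SYN_STREAM ][error][tid:3A888] | time waited: 30031292us
[00:21:44.300772][SYN_STREAM ][error][tid:3A888] | timeout: 30000000us, timeout is enabled.
"""


def parse_timeout_block(text):
    """Extract stream name, command counts, and timings (hypothetical helper)."""
    stream = re.search(r"on stream:\s*(\S+)", text)
    done = re.search(r"completed:\s*(0x[0-9a-fA-F]+) out of (0x[0-9a-fA-F]+)", text)
    waited = re.search(r"time waited:\s*(\d+)us", text)
    limit = re.search(r"timeout:\s*(\d+)us", text)
    if not (stream and done and waited and limit):
        raise ValueError("text does not look like a timeout block")

    completed = int(done.group(1), 16)   # counts are logged in hex
    submitted = int(done.group(2), 16)
    return {
        "stream": stream.group(1),
        "completed": completed,
        "submitted": submitted,
        "pending": submitted - completed,
        "waited_us": int(waited.group(1)),
        "timeout_us": int(limit.group(1)),
    }


if __name__ == "__main__":
    info = parse_timeout_block(LOG)
    for key, value in info.items():
        print(f"{key}: {value}")
```

For this log it reports that 8 of 10 commands completed on compute_completion_queue0, leaving 2 pending when the 30-second timeout expired.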
dfa_log.txt
[00:21:44.300954][SYN_DEV_FAIL][info ][tid:3A888] tid 239752 #DFA begin 1730910104300826757
[00:21:44.301090][SYN_DEV_FAIL][error][tid:3A888]
======================================== DFA triggered on the following errors =========================================
Timeout detected on compute/pdma streams: compute_completion_queue0
Check 'Engines Status' for more information on active engines
Check 'Oldest work in each stream' for streams status
========================================================================================================================
[00:21:44.301103][SYN_DEV_FAIL][trace][tid:3A888] ===================================================== Failure Info =====================================================
[00:21:44.301106][SYN_DEV_FAIL][trace][tid:3A888] --- Errors detected ---
[00:21:44.301115][SYN_DEV_FAIL][error][tid:3A888] #DFA reason: tdrFailed (code 0x1)
[00:21:44.301118][SYN_DEV_FAIL][error][tid:3A888]
[00:21:44.301149][SYN_DEV_FAIL][info ][tid:3A888] + ---------------------------------------------------------------------- +
[00:21:44.301151][SYN_DEV_FAIL][info ][tid:3A888] | Version: 1.17.0 |
[00:21:44.301154][SYN_DEV_FAIL][info ][tid:3A888] | Synapse: db1a431 |
[00:21:44.301156][SYN_DEV_FAIL][info ][tid:3A888] | HCL: 1.17.0-a6d0341 |
[00:21:44.301158][SYN_DEV_FAIL][info ][tid:3A888] | MME: f1ec30d |
[00:21:44.301160][SYN_DEV_FAIL][info ][tid:3A888] | SCAL: b05f1cf |
[00:21:44.301162][SYN_DEV_FAIL][info ][tid:3A888] | Description: Habana Labs Device failure analysis |
[00:21:44.301181][SYN_DEV_FAIL][info ][tid:3A888] | Time: 2024-11-07 00:21:44.301163 |
[00:21:44.301183][SYN_DEV_FAIL][info ][tid:3A888] + ---------------------------------------------------------------------- +
[00:21:44.301193][SYN_DEV_FAIL][info ][tid:3A888] Failure occurred on device:
[00:21:44.301199][SYN_DEV_FAIL][info ][tid:3A888] #device name: hl0
[00:21:44.301201][SYN_DEV_FAIL][info ][tid:3A888] Moudle index: 6
[00:21:44.301207][SYN_DEV_FAIL][info ][tid:3A888] fd compute/control: 20/19
[00:21:44.301209][SYN_DEV_FAIL][info ][tid:3A888] #global rank Id: ---
[00:21:44.301217][SYN_DEV_FAIL][info ][tid:3A888] #device type GAUDI2
[00:21:44.301219][SYN_DEV_FAIL][info ][tid:3A888] #is simulator No
[00:21:44.301222][SYN_DEV_FAIL][info ][tid:3A888] #acquire time: 2024-11-07 00:20:58.568116
[00:21:44.301225][SYN_DEV_FAIL][info ][tid:3A888] pci addr: 0000:00:06.0
[00:21:44.301227][SYN_DEV_FAIL][info ][tid:3A888] Server name: vmInstance0hgiwjnz, ModuleId 6
[00:21:44.301229][SYN_DEV_FAIL][trace][tid:3A888] ==================================================== Engines Status ====================================================
[00:21:44.301711][SYN_DEV_FAIL][trace][tid:3A888] actualSize of engine dump 4035
CORE EDMA is_idle QM_GLBL_STS0 DMA_CORE_STS0 DMA_CORE_STS1
---- ---- ------- ------------ ------------- -------------
0 0 N 0x1eff 0xa41 0x0
0 1 N 0x1eff 0xa41 0x0
1 0 N 0x1eff 0xa41 0x0
1 1 N 0x1eff 0xa41 0x0
2 0 N 0x1eff 0xa41 0x0
2 1 Y 0x3fff 0x6 0x0
3 0 Y 0x3fff 0x6 0x0
3 1 Y 0x3fff 0x0 0x0
PDMA is_idle QM_GLBL_STS0 DMA_CORE_STS0 DMA_CORE_STS1
---- ------- ------------ ------------- -------------
0 Y 0x3fff 0xa 0x0
1 Y 0x3fff 0x59c0648 0x0
NIC is_idle QM_GLBL_STS0 QM_CGM_STS
--- ------- ------------ ----------
0 Y 0x3fff 0xf11
1 Y 0x3fff 0xf11
2 Y 0x3fff 0xf11
3 Y 0x3fff 0xf11
4 Y 0x3fff 0xf11
5 Y 0x3fff 0xf11
6 Y 0x3fff 0xf11
7 Y 0x3fff 0xf11
9 Y 0x3fff 0xf11
10 Y 0x3fff 0xf11
11 Y 0x3fff 0xf11
12 Y 0x3fff 0xf11
13 Y 0x3fff 0xf11
14 Y 0x3fff 0xf11
15 Y 0x3fff 0xf11
16 Y 0x3fff 0xf11
17 Y 0x3fff 0xf11
18 Y 0x3fff 0xf11
19 Y 0x3fff 0xf11
20 Y 0x3fff 0xf11
21 Y 0x3fff 0xf11
MME Stub is_idle QM_GLBL_STS0 MME_ARCH_STATUS
--- ---- ------- ------------ ---------------
0 N N 0x1eff 0xc27ffe00
1 N Y 0x3fff 0xc27ffe00
2 N N 0x1eff 0xc27ffe00
3 N Y 0x3fff 0xc27ffe00
CORE TPC is_idle QM_GLBL_STS0 QM_CGM_STS STATUS
---- --- ------- ------------ ---------- ------
0 0 N 0x1eff 0xa00 0x80
0 1 N 0x1eff 0xa00 0x80
0 2 N 0x1eff 0xa00 0x80
0 3 N 0x1eff 0xa00 0x80
0 4 N 0x1eff 0xb00 0xee
0 5 N 0x1eff 0xb00 0xee
1 0 N 0x1eff 0xb00 0xee
1 1 N 0x1eff 0xb00 0xee
1 2 N 0x1eff 0xb00 0xee
1 3 N 0x1eff 0xb00 0xee
1 4 N 0x1eff 0xb00 0xee
1 5 N 0x1eff 0xb00 0xee
2 0 N 0x1eff 0xb00 0xee
2 1 N 0x1eff 0xb00 0xee
2 2 N 0x1eff 0xb00 0xee
2 3 N 0x1eff 0xb00 0xee
2 4 N 0x1eff 0xb00 0xee
2 5 N 0x1eff 0xb00 0xee
3 0 N 0x1eff 0xb00 0xee
3 1 N 0x1eff 0xb00 0xee
3 2 N 0x1eff 0xa00 0x80
3 3 N 0x1eff 0xa00 0x80
3 4 N 0x1eff 0xa00 0x80
3 5 N 0x1eff 0xa00 0x80
CORE DEC is_idle VSI_CMD_SWREG15
---- --- ------- ---------------
0 0 Y 0x400000
0 1 Y 0x400000
1 0 Y 0x400000
1 1 Y 0x400000
2 0 Y 0x400000
2 1 Y 0x400000
3 0 Y 0x400000
3 1 Y 0x400000
PCIe DEC is_idle VSI_CMD_SWREG15
-------- ------- ---------------
0 Y 0x400000
CORE ROT is_idle QM_GLBL_STS0 QM_GLBL_STS1 QM_CGM_STS
---- --- ------- ------------ ------------ ----------
0 0 N 0x1fff 0x0 0xb00
1 0 N 0x1fff 0x0 0xb00