Nishida
December 15, 2023, 8:36pm
1
I tried executing example.py in Model-Reference on container with kubernetes. But I got the below error:
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 6) 07:07:15 [please check log files for dfa cause]
The device_fail_analysis_log is below:
[07:07:14.693662][SYN_DEV_FAIL][error][tid:F] hl thunk API failed, errCode memcopyIoctlFailed
[07:07:15.194082][SYN_DEV_FAIL][info ][tid:F] tid 15 #DFA begin 1701760035194077101
[07:07:15.194704][SYN_DEV_FAIL][error][tid:F]
The dmesg is below:
[Tue Dec 5 07:06:55 2023] habanalabs hl6: Failed (9216) to pin host memory with user ptr 0x7ffe0fe00000, size 0x80000000, npages 524288
environment:
hl-smi version:hl-1.11.0-fw-45.1.1.1
Driver version:1.11.0-e6eb0fd
yamlfile:
#### Note - set HOME variable after interactive access
apiVersion: v1
kind: Pod
metadata:
name: resnet-test
spec:
securityContext:
restartPolicy: Never
volumes:
- name: ceph
hostPath:
path: /voyager/ceph/users/username
type: Directory
- name: scratch
emptyDir: {}
containers:
- name: gaudi-container
image: nsdjousys/resnet-test:latest
env:
- name: HOME
value: /scratch
- name: HABANA_VISIBLE_DEVICES
value: all
command: ["/bin/sh", "-ec", "sleep 14400"]
resources:
requests:
cpu: 64
memory: 400Gi
habana.ai/gaudi: 8
hugepages-2Mi: 36Mi
limits:
cpu: 64
memory: 450Gi
habana.ai/gaudi: 8
hugepages-2Mi: 36Mi
volumeMounts:
- name: ceph
mountPath: /voyager/ceph/users/username
- name: scratch
mountPath: /scratch
execution command:
cd path/to/Model-References/TensorFlow/examples/hello_world
python3 example.py
Can you please run with: ENABLE_CONSOLE=true LOG_LEVEL_ALL=4
to collect the logs
Nishida
December 21, 2023, 9:53pm
3
the result of execution with LOG_LEVEL_ALL=4 ENABLE_CONSOLE=true is below:
$ python3 example.py
2023-12-19 07:52:03.667422: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-19 07:52:03.708135: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] + ---------------------------------------------------------------------- +
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | Version: 1.11.0 |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | Synapse: 2402698 |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | HCL: ab04ac9 |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | MME: 57aebd3 |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | SCAL: 832fb38 |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | Description: HabanaLabs Runtime and GraphCompiler |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] | Time: 2023-12-19 07:52:05.648777 |
[2023-12-19 07:52:05.648] [SYN_API] [info] [tid:125] + ---------------------------------------------------------------------- +
2023-12-19 07:52:06.826336: I /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/repos/tensorflow-training/habana_device/habana_device_binding_iface.cpp:62] Found TensorFlow library with SHA256: 6313068adb101b8805221c224fdfe66b3a10bc9d044139aa1c7ea305b8b9f0b4
[2023-12-19 07:52:07.530] [HCL] [info] [tid:125] Version: 1.11.0-ab04ac9
[2023-12-19 07:52:08.966] [SYN_API] [critical] [tid:125] DFA detected, see separate file for details device_fail_analysis_log.txt
[2023-12-19 07:52:08.966] [SYN_API] [error] [tid:125] DFA detected, see separate file for details device_fail_analysis_log.txt
[2023-12-19 07:52:08.967] [DFA_API_INFO] [info] [tid:125] tid 293 #DFA begin 1702972328966532898
[2023-12-19 07:52:09.468] [DFA_API_INFO] [debug] [tid:125] =================================== STREAM SYNC HISTORY (chronological order) START ====================================
[2023-12-19 07:52:09.468] [DFA_API_INFO] [debug] [tid:125] type : stream : physical : physical : target (wait) : function/line
[2023-12-19 07:52:09.468] [DFA_API_INFO] [debug] [tid:125] : ID : StreamOffset : QueuesId : :
[2023-12-19 07:52:09.468] [DFA_API_INFO] [debug] [tid:125] ------------------------------------------------------------------------------------------------------------------------
[07:52:05.623739][SYN_API ][debug][tid:125] synInitialize: SYN_API_CALL
[07:52:05.624183][SYN_API ][debug][tid:125] shim layer initialized successfully
[07:52:05.648712][SYN_API ][info ][tid:125] + ---------------------------------------------------------------------- +
[07:52:05.648733][SYN_API ][info ][tid:125] | Version: 1.11.0 |
[07:52:05.648744][SYN_API ][info ][tid:125] | Synapse: 2402698 |
[07:52:05.648752][SYN_API ][info ][tid:125] | HCL: ab04ac9 |
[07:52:05.648759][SYN_API ][info ][tid:125] | MME: 57aebd3 |
[07:52:05.648767][SYN_API ][info ][tid:125] | SCAL: 832fb38 |
[07:52:05.648776][SYN_API ][info ][tid:125] | Description: HabanaLabs Runtime and GraphCompiler |
[07:52:05.648853][SYN_API ][info ][tid:125] | Time: 2023-12-19 07:52:05.648839 |
[07:52:05.648862][SYN_API ][info ][tid:125] + ---------------------------------------------------------------------- +
[07:52:06.712313][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:06.712326][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:06.712329][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:06.712333][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:06.712344][SYN_API ][debug][tid:125] synDeviceGetCount: SYN_API_CALL
[07:52:06.713017][SYN_API ][debug][tid:125] synDestroy: SYN_API_CALL
[07:52:06.716599][SYN_API ][debug][tid:125] dfa: setting DfaPhase::NONE
[07:52:06.720834][SYN_API ][info ][tid:125] Synapse API's enter/exit counters:
Function: synConfigurationSet , Entry counter: 4, Exit counter: 4
Function: synDeviceGetCount , Entry counter: 1, Exit counter: 1
[07:52:07.062282][SYN_API ][debug][tid:125] synInitialize: SYN_API_CALL
[07:52:07.062335][SYN_API ][debug][tid:125] shim layer initialized successfully
[07:52:07.088469][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:07.088478][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:07.088479][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:07.088483][SYN_API ][debug][tid:125] synConfigurationSet: SYN_API_CALL
[07:52:07.088496][SYN_API ][debug][tid:125] synDeviceGetCountByDeviceType: SYN_API_CALL deviceType 2[synDeviceGaudi]
[07:52:07.089299][SYN_API ][debug][tid:125] synDeviceAcquireByDeviceType: SYN_API_CALL deviceType 2[synDeviceGaudi]
[07:52:07.514636][SYN_PROGRESS][trace][tid:125] 4[INTERNAL_STREAM_TYPE_DMA_DOWN_SYNAPSE] : 4 : 1 : 1 : 2 : memCpyAsync/800
[07:52:07.526315][SYN_PROGRESS][trace][tid:125] 4[INTERNAL_STREAM_TYPE_DMA_DOWN_SYNAPSE] : 4 : 1 : 1 : 7 : memCpyAsync/800
[07:52:07.530246][HCL_API ][info ][tid:125] hcclInitDevice:
[07:52:07.530380][SYN_API ][debug][tid:125] synDeviceGetInfo: SYN_API_CALL deviceId 0
[07:52:07.530547][HCL_API ][info ][tid:125] initDevice - CPU pinning Done
[07:52:07.530566][HCL_API ][info ][tid:125] HclApi::initDevice HclApi::createHcl device type = 2[synDeviceGaudi]
[07:52:07.992682][SYN_API ][debug][tid:125] synDeviceMalloc: SYN_API_CALL deviceId 0 size 0x1e000000 requestedAddress 0x0 flags 0x0
[07:52:08.005359][HCL_API ][info ][tid:125] HCL_Init_Device - Done
[07:52:08.022187][SYN_API ][debug][tid:125] synStreamCreateGeneric: SYN_API_CALL deviceId 0 flags 0x0
[07:52:08.022248][SYN_API ][debug][tid:125] synStreamCreateGeneric: SYN_API_CALL deviceId 0 flags 0x0
[07:52:08.022269][SYN_API ][debug][tid:125] synStreamCreateGeneric: SYN_API_CALL deviceId 0 flags 0x0
[07:52:08.022350][SYN_API ][debug][tid:125] synStreamCreateGeneric: SYN_API_CALL deviceId 0 flags 0x0
[07:52:08.022356][SYN_API ][debug][tid:125] synDeviceGetMemoryInfo: SYN_API_CALL deviceId 0
[07:52:08.022415][SYN_API ][debug][tid:125] synDeviceMalloc: SYN_API_CALL deviceId 0 size 0x782a3e000 requestedAddress 0x0 flags 0x0
[07:52:08.457146][SYN_API ][debug][tid:125] synDeviceGetAttribute: SYN_API_CALL querySize 1 deviceId 0
[07:52:08.457155][SYN_API ][debug][tid:125] synDeviceGetMemoryInfo: SYN_API_CALL deviceId 0
[07:52:08.460408][SYN_API ][debug][tid:125] synHostMap: SYN_API_CALL deviceId 0 size 0x80000000 buffer 0x7ffe0fe00000
[07:52:08.966517][SYN_API ][critical][tid:125] DFA detected, see separate file for details device_fail_analysis_log.txt
[07:52:08.966531][SYN_API ][error][tid:125] DFA detected, see separate file for details device_fail_analysis_log.txt
[07:52:08.967260][SYN_API ][info ][tid:125] dfa: sending DfaPhase::STARTED
[07:52:08.967262][SYN_API ][debug][tid:125] setting dfaPhase to 1
[07:52:08.967283][HCL_API ][info ][tid:125] hcclDfaUpdateState: update Dfa phase to 1[STARTED]
[07:52:08.967285][HCL_API ][info ][tid:125] Dfa state change. Old 0[NONE] new 1[STARTED]
[07:52:09.469232][DFA_API_INFO][debug][tid:125] =============================================== STREAM SYNC HISTORY END ================================================
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 6) 07:52:08 [please check log files for dfa cause]
Killed
And details of the device_fail_analysis_log is below:
[07:52:08.466238][SYN_DEV_FAIL][error][tid:125] hl thunk API failed, errCode memcopyIoctlFailed
[07:52:08.966538][SYN_DEV_FAIL][info ][tid:125] tid 293 #DFA begin 1702972328966532898
[07:52:08.967192][SYN_DEV_FAIL][error][tid:125]
======================================== DFA triggered on the following errors =========================================
An hl-thunk call failed
Most likely device was reset. Please check dmesg.
Instead, you can check a copy of the dmesg: dmesg_copy.txt
========================================================================================================================
[07:52:08.967216][SYN_DEV_FAIL][trace][tid:125] ===================================================== Failure Info =====================================================
[07:52:08.967224][SYN_DEV_FAIL][trace][tid:125] --- Errors detected ---
[07:52:08.967248][SYN_DEV_FAIL][error][tid:125] #DFA reason: memcopyIoctlFailed (code 0x200)
[07:52:08.967254][SYN_DEV_FAIL][error][tid:125]
[07:52:08.967294][SYN_DEV_FAIL][info ][tid:125] + ---------------------------------------------------------------------- +
[07:52:08.967302][SYN_DEV_FAIL][info ][tid:125] | Version: 1.11.0 |
[07:52:08.967310][SYN_DEV_FAIL][info ][tid:125] | Synapse: 2402698 |
[07:52:08.967317][SYN_DEV_FAIL][info ][tid:125] | HCL: ab04ac9 |
[07:52:08.967323][SYN_DEV_FAIL][info ][tid:125] | MME: 57aebd3 |
[07:52:08.967329][SYN_DEV_FAIL][info ][tid:125] | SCAL: 832fb38 |
[07:52:08.967336][SYN_DEV_FAIL][info ][tid:125] | Description: Habana Labs Device failure analysis |
[07:52:08.967384][SYN_DEV_FAIL][info ][tid:125] | Time: 2023-12-19 07:52:08.967340 |
[07:52:08.967391][SYN_DEV_FAIL][info ][tid:125] + ---------------------------------------------------------------------- +
[07:52:08.967398][SYN_DEV_FAIL][info ][tid:125] Failure occurred on device:
[07:52:08.967403][SYN_DEV_FAIL][info ][tid:125] #device name: hl6
[07:52:08.967407][SYN_DEV_FAIL][info ][tid:125] Moudle index: 0
[07:52:08.967413][SYN_DEV_FAIL][info ][tid:125] fd compute/control: 16/15
[07:52:08.967418][SYN_DEV_FAIL][info ][tid:125] #global rank Id: ---
[07:52:08.967424][SYN_DEV_FAIL][info ][tid:125] #device type GAUDI
[07:52:08.967429][SYN_DEV_FAIL][info ][tid:125] #is simulator No
[07:52:08.967447][SYN_DEV_FAIL][info ][tid:125] #acquire time: 2023-12-19 07:52:07.110819
[07:52:08.967452][SYN_DEV_FAIL][info ][tid:125] pci addr: 0000:34:00.0
[07:52:08.967461][SYN_DEV_FAIL][trace][tid:125] ==================================================== Engines Status ====================================================
[07:52:08.967719][SYN_DEV_FAIL][trace][tid:125] actualSize of engine dump 1720
This is the capcity of a Gaudi1 Node
hugepages-2Mi: 96000Mi
but currently it is set to 36Mi
You can try setting it to 95000Mi, by changing your yaml line: hugepages-2Mi: 36Mi
Thanks Leon for helping with the resolution
1 Like