I ran the MLPerf 3.0 ResNet benchmark from Habana’s Model-References on Gaudi with Kubernetes and tried profiling the application. I got a result, but the profiling seems to have failed because the Gaudi trace has many blanks. How can I solve this?
Can you please give some details/steps on how you did the profiling?
First, I am using SDSC Voyager.
I ran ResNet on it, and the run succeeded.
I made a YAML file for profiling that executes the command below.
- ./launch_keras_resnet_hvd.sh
- --config
- /voyager/ceph/users/nishida/resnet/batch_256_prof.cfg
- --cpu-pin
- none
- --jpeg-data-dir
- /voyager/ceph/users/nishida/datasets/resnet
- --log_dir
- /scratch
- --habana-profiler
- "1"
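For readers who want to see the flat command line that this YAML argument list amounts to, here is a minimal Python sketch. The argument values are copied from the list above; the helper name `as_shell_command` is only illustrative, and the sketch assumes the container passes the items to the script unchanged, in order.

```python
import shlex

# Arguments copied from the YAML list above, in the same order.
ARGS = [
    "./launch_keras_resnet_hvd.sh",
    "--config", "/voyager/ceph/users/nishida/resnet/batch_256_prof.cfg",
    "--cpu-pin", "none",
    "--jpeg-data-dir", "/voyager/ceph/users/nishida/datasets/resnet",
    "--log_dir", "/scratch",
    "--habana-profiler", "1",
]


def as_shell_command(args):
    """Join an argument list into one shell-quoted command string."""
    return shlex.join(args)


if __name__ == "__main__":
    print(as_shell_command(ARGS))
```

Note that the quoted "1" in the YAML is passed as the string 1, so the script sees `--habana-profiler 1`.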
launch_keras_resnet_hvd.sh is the same as this.
In batch_256_prof.cfg, I changed the train epochs to 2, and I added this line to run.sh:
hl-prof-config -o /voyager/ceph/users/nishida/resnet -s n1-profile
I found character device files in /dev/accel, and I think these correspond to the 8 Gaudis. I thought the correct character device files were hl0-8 and hl_controlD0-8 in /dev. Is it correct to have both accel0-8 and hl0-8?
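To make the device-file question easier to answer, a small script like the following could be run inside the pod to list which nodes actually exist. This is only a sketch: the name patterns (accelN under /dev/accel, hlN and hl_controlDN under /dev) are taken from the post, not from driver documentation, and `list_device_nodes` is a hypothetical helper name.

```python
import re
from pathlib import Path


def list_device_nodes(dev_root="/dev"):
    """Return device node names grouped by the naming patterns from the post."""
    root = Path(dev_root)
    # Assumed layout: accelN under <dev>/accel, hlN and hl_controlDN under <dev>.
    patterns = {
        "accel": (root / "accel", re.compile(r"^accel\d+$")),
        "hl": (root, re.compile(r"^hl\d+$")),
        "hl_controlD": (root, re.compile(r"^hl_controlD\d+$")),
    }
    found = {}
    for label, (directory, pattern) in patterns.items():
        names = []
        if directory.is_dir():
            names = [p.name for p in directory.iterdir() if pattern.match(p.name)]
        # Sort by the trailing index so that accel10 comes after accel9.
        found[label] = sorted(names, key=lambda n: int(re.search(r"\d+$", n).group()))
    return found


if __name__ == "__main__":
    for label, names in list_device_nodes().items():
        print(f"{label}: {len(names)} node(s) {names}")
```

Posting the output along with the driver version would show whether both naming schemes are present on the node.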
Can you try this?
I cannot run it because of the error below:
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 6) 06:48:26 [please check log files for dfa cause]
The error seems similar to the one mentioned here: Synapse detected a device critical error - #3 by Nishida
Can you see if the hugepages fix works for this issue as well?
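Before trying that, it may help to check how huge pages are currently configured on the node. The sketch below only reads the standard Linux /proc/meminfo fields; it does not apply the fix from the linked thread, and it makes no claim about what values Gaudi needs. The function name `parse_hugepages` is illustrative.

```python
def parse_hugepages(meminfo_text):
    """Extract the huge page fields from the text of /proc/meminfo."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        # Lines look like "HugePages_Total:       0" or "Hugepagesize:    2048 kB".
        if key.startswith("HugePages_") or key == "Hugepagesize":
            info[key] = int(rest.split()[0])
    return info


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        for key, value in parse_hugepages(f.read()).items():
            print(f"{key}: {value}")
```

Hugepagesize is reported in kB; the HugePages_* fields are page counts.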