Profiling fails on Gaudi

I ran the MLPerf 3.0 ResNet benchmark from Habana's Model-References on Gaudi with Kubernetes and tried to profile the application. I got a result, but the profiling seems to have failed because the Gaudi trace has many blanks. How can I solve this?

Can you please give some details/steps on how you did the profiling?

First, I am using SDSC Voyager.
I ran ResNet on it, and the run succeeded.
I then wrote a YAML file for profiling and ran the command below.

                      - ./launch_keras_resnet_hvd.sh
                      - --config
                      - /voyager/ceph/users/nishida/resnet/batch_256_prof.cfg
                      - --cpu-pin
                      - none
                      - --jpeg-data-dir
                      - /voyager/ceph/users/nishida/datasets/resnet
                      - --log_dir
                      - /scratch
                      - --habana-profiler
                      - "1"

launch_keras_resnet_hvd.sh is the same as this.
In batch_256_prof.cfg, I changed the number of training epochs to 2, and I added this line to run.sh:

hl-prof-config -o /voyager/ceph/users/nishida/resnet -s n1-profile

I found character device files in /dev/accel, and I think these correspond to the 8 Gaudis. However, I think the correct character device files are hl0-8 and hl_controlD0-8 in /dev. Is it correct to have both accel0-8 and hl0-8?
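
For anyone who wants to compare what is present on their own node, here is a minimal Python sketch that lists device nodes by naming scheme. The glob patterns are assumptions taken only from the names mentioned in this post (/dev/accel/accel*, /dev/hl*, /dev/hl_controlD*). The helper name list_device_nodes is hypothetical, and the script only reports what exists; it does not say which layout is the correct one for a given driver version.

```python
import glob
import os


def list_device_nodes(dev_root="/dev"):
    """Group device nodes under dev_root by the naming schemes named in this thread.

    The patterns are assumptions based on the post, not an official list.
    """
    patterns = {
        "accel": os.path.join(dev_root, "accel", "accel[0-9]*"),
        "hl": os.path.join(dev_root, "hl[0-9]*"),
        "hl_controlD": os.path.join(dev_root, "hl_controlD[0-9]*"),
    }
    # Sort so the output is stable between runs.
    return {name: sorted(glob.glob(pattern)) for name, pattern in patterns.items()}


if __name__ == "__main__":
    for name, nodes in list_device_nodes().items():
        print(f"{name}: {len(nodes)} node(s) {nodes}")
```

Running this inside the pod and on the host shows whether the container sees the same set of nodes as the host does.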

Can you try this?

https://docs.habana.ai/en/latest/Profiling/Profiling_with_TensorFlow.html?highlight=profile%20with%20tensorflow

I cannot run it because of the error below:

Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 6) 06:48:26 [please check log files for dfa cause]

The error seems similar to the one mentioned here: Synapse detected a device critical error - #3 by Nishida

Can you see if the hugepages fix works for this issue as well?
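
Before trying that, it may help to see what the node currently reports for hugepages. The sketch below only reads the standard Linux /proc/meminfo fields (HugePages_Total, HugePages_Free, Hugepagesize, and so on). It does not apply any fix, and the values required for Gaudi are not stated in this thread, so check them against the linked post. The helper name parse_hugepages is hypothetical.

```python
import os


def parse_hugepages(meminfo_text):
    """Extract the hugepage-related fields from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            # The first token is the number; Hugepagesize is followed by "kB".
            fields[key] = int(rest.split()[0])
    return fields


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as f:
            print(parse_hugepages(f.read()))
    else:
        print(f"{path} not found (not a Linux system?)")
```

If HugePages_Total is 0 inside the pod, that is at least consistent with a hugepages problem, though it does not prove that is the cause of the error.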