What should peak utilization of a core be?

I am training a variety of models on an AWS dl1 instance. I noticed that the utilization reported by hl-smi peaks in the mid-40% range. I am using a single core and run hl-smi with the -l 1 flag to watch utilization as it changes; it never exceeds 50%. Training speed is about the same as what I got on a V100. The same thing happens with simple loops that involve no data transfer, so it doesn’t seem that something else is holding back performance. This is all with the latest software installed.

Should I be seeing near 100% utilization in hl-smi for the core in use, meaning I am getting only half the performance I could? Or is what I’m seeing expected?


Could you mention which model you are training, and whether you are running on 1 card or 8 cards?

For performance analysis and optimization, I’d suggest looking at the profiling tools, such as Synapse profiling. The profiler output is a more reliable measure of how the device is performing than the hl-smi number.
You can also look at some common performance tips and tricks here
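
Alongside the profiler, a direct throughput measurement is a quick sanity check that does not depend on the hl-smi number. The sketch below is framework-agnostic and only illustrative: `measure_throughput` and `step_fn` are hypothetical names, and it assumes `step_fn` runs one training step and returns only after the device has actually finished the work (with asynchronous or lazy execution, the timing is meaningless otherwise).

```python
import time


def measure_throughput(step_fn, batch_size, warmup=5, iters=20):
    """Return the average number of samples processed per second.

    step_fn is a hypothetical zero-argument callable that runs one training
    step and blocks until the device has completed it (an assumption).
    """
    # Warm-up steps are not timed, so one-off costs such as graph
    # compilation or cache fills do not skew the result.
    for _ in range(warmup):
        step_fn()

    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start

    return batch_size * iters / elapsed
```

Running the same helper with the same model and batch size on both devices gives a like-for-like comparison with the V100 figure.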

This is a custom model, basically a stack of 1D convolutions (kernel size 1 or 5) with skip connections. It is running on a single core.
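
For a model like this, a rough FLOP count gives another way to judge how much of the device is being used: multiply the per-sample cost by the measured samples per second and compare the result with the device's documented peak. The sketch below is only an estimate under stated assumptions: the layer sizes are made up for illustration, the convolutions are taken to be stride 1 with "same" padding and no groups, the skip-connection additions are ignored as negligible, and a training step is assumed to cost roughly 3x the forward pass (a common rule of thumb, not a measured value).

```python
def conv1d_macs(c_in, c_out, kernel_size, out_length):
    """Multiply-accumulates for one 1D convolution on one sample (no groups, no bias)."""
    return c_in * c_out * kernel_size * out_length


def stack_flops(layers, length):
    """Forward FLOPs per sample for a stack of stride-1, 'same'-padded conv layers.

    layers is a list of (c_in, c_out, kernel_size) tuples; one MAC counts as 2 FLOPs.
    """
    macs = sum(conv1d_macs(c_in, c_out, k, length) for c_in, c_out, k in layers)
    return 2 * macs


def achieved_tflops(flops_per_sample, samples_per_sec, train_factor=3.0):
    """Estimated training TFLOP/s; train_factor=3.0 is a rule-of-thumb assumption."""
    return flops_per_sample * samples_per_sec * train_factor / 1e12


# Hypothetical layer sizes: a 1-wide conv, a 5-wide conv, and another 1-wide conv.
layers = [(64, 64, 1), (64, 64, 5), (64, 64, 1)]
flops = stack_flops(layers, length=1024)
```

Small channel counts and short kernels like these do little arithmetic per byte moved, so a low achieved figure here would not by itself indicate a misconfiguration.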