Training of PyTorch Efficientnet seems extremely slow

I am trying to train/fine-tune an efficientnet_b0 model on my own data. It is the official torchvision model for PyTorch 1.11, not a third-party implementation. I am using the pretrained version and Synapse 1.5, as it is one of the official AWS Habana AMIs with PyTorch 1.11 preinstalled. I wrote my own training script, but I basically looked at the MNIST and ResNet training scripts from the Habana GitHub repository and adapted mine accordingly.

Currently I am running on a single device just to get my feet wet. I am using FP32 training; the batch size is 90 and the tensor format is [3,448,192].

I am running in lazy mode. A single iteration seems to take over 20 seconds to execute, which is significantly slower than on, for example, an NVIDIA V100.

Any idea why each iteration is running so extremely slowly? I checked that it is not the I/O, as I replaced the image data loading with torch.randn.
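For readers who want to reproduce this kind of check, here is a minimal sketch of timing training steps on synthetic torch.randn batches, so that data loading is out of the picture. The helper name time_iterations, the tiny stand-in model, the 10-class labels and the small batch size are all assumptions made for illustration; they are not the poster's script. On Gaudi in lazy mode, Habana's example scripts additionally call htcore.mark_step() after the backward pass and the optimizer step; that is left out here so the sketch runs on stock PyTorch.

```python
import time

import torch
from torch import nn


def time_iterations(model, device="cpu", batch_size=4, steps=3):
    """Return wall-clock seconds per training step on synthetic data."""
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    timings = []
    for _ in range(steps):
        # Random tensors replace the image pipeline, so I/O cannot be the bottleneck.
        images = torch.randn(batch_size, 3, 448, 192, device=device)
        labels = torch.randint(0, 10, (batch_size,), device=device)
        start = time.perf_counter()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        loss.item()  # copies the loss to the host, so the step must finish first
        timings.append(time.perf_counter() - start)
    return timings


# Tiny stand-in network (an assumption); swap in the real model to compare devices.
tiny_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=4),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
timings = time_iterations(tiny_model, device="cpu")
print([f"{t:.4f}s" for t in timings])
```

Running the same function with different device strings gives a like-for-like comparison. The first step is usually slower than the rest because of one-time setup, so it is worth looking at the later timings.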


Is it possible to share a stripped-down version of the code (maybe with random training data and random initialization)?

Also, just to make sure, you can run hl-smi and check HPU usage to confirm that the model is running on the HPU and not on the CPU.

Also if you are ok with TensorFlow instead of PyTorch, here is an enabled efficientnet


Thank you for your answer!

Yes, I should be able to make a stripped-down version that does not need any data. I will try to send one later.

Yes, I checked that it is running on the HPU (via hl-smi). I also get a .log0 file for the device.

Funnily enough, I also ran it with an explicit CPU device, just to check, as you say, whether there might be an issue. To my surprise, on that AWS instance it ran ~20x faster on the CPU than on the HPU.

I also tried hl-profile-conf, but I couldn’t get it to record any .JSON file, or I was too stupid to find where it was saved.

I am afraid our whole pipeline is PyTorch-based, so unfortunately a TensorFlow version does not help.

Hello again,

Here is the stripped down training script as promised

Stripped down training script

You simply have to run it with python --usecuda 0 if you want to use it with Gaudi. Hope it helps to narrow down the issue!

By the way, I am seeing the same with the torchvision ConvNeXt models and, to a lesser extreme, with MobileNetV3, which also has very slow iterations.

However, a few other network topologies seem to be running okay with the same training script.

So I wonder if it is a specific op in those networks, maybe an activation function or a specific layout, that it struggles with.

Thank you for this detailed analysis. We’ll take a look at it and get back to you.

Here are a couple of workarounds you can try:

  1. Perform bernoulli_ of StochasticDepth on CPU
    Around this line here,
        # Sample the noise on the CPU when the input lives on the HPU
        dev = 'cpu' if 'hpu' in input.device.type else input.device
        #noise = torch.empty(size, dtype=input.dtype, device=input.device)
        noise = torch.empty(size, dtype=input.dtype, device=dev)
        noise = noise.bernoulli_(survival_rate)
        if 'hpu' in input.device.type:
            # Presumably: move the sampled mask back to the input's device
            noise = noise.to(input.device)
  2. Disable inplace Dropout
    Around here
    Replace nn.Dropout(p=dropout, inplace=True), with nn.Dropout(p=dropout),
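The two workarounds can also be sketched as small helpers, which may be handy for trying them without editing the installed torchvision sources. This is only an illustration under stated assumptions: disable_inplace_dropout and stochastic_depth_cpu_noise are hypothetical names, not torchvision or Habana APIs. The second function is a standalone row-mode variant that would still have to be wired into the model in place of the library's version.

```python
import torch
from torch import nn


def disable_inplace_dropout(model: nn.Module) -> int:
    """Switch every in-place nn.Dropout in `model` to out-of-place; return the count."""
    changed = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout) and module.inplace:
            module.inplace = False
            changed += 1
    return changed


def stochastic_depth_cpu_noise(x: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Row-mode stochastic depth that samples its Bernoulli mask on the CPU."""
    if not training or p == 0.0:
        return x
    survival_rate = 1.0 - p
    size = [x.shape[0]] + [1] * (x.ndim - 1)  # one keep/drop decision per sample
    noise = torch.empty(size, dtype=x.dtype, device="cpu").bernoulli_(survival_rate)
    if survival_rate > 0.0:
        noise.div_(survival_rate)  # rescale so the expected activation is unchanged
    return x * noise.to(x.device)  # move the mask to wherever the input lives


# Toy module standing in for a torchvision classifier head (an assumption).
toy = nn.Sequential(nn.Dropout(p=0.2, inplace=True), nn.Linear(8, 2))
num_changed = disable_inplace_dropout(toy)
print(f"switched {num_changed} dropout layer(s) to out-of-place")
```

Calling disable_inplace_dropout on a loaded model before moving it to the device leaves the weights untouched; only the inplace flag of each dropout layer changes.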

Please let me know if you see speedups with these 2 changes.


Thank you very much. Yes, with these changes it has improved tremendously. This feels more like the performance I would expect from an accelerator :grinning:

So does this mean in-place dropout is not something the Gaudi architecture likes? I think it is used in quite a few architectures. At least it is used in torchvision’s version of MobileNetV3; when I remove it there, I also see a tremendous performance boost.

Thanks for verifying. We’ll take a look at the workarounds (bernoulli and in-place dropout) and will update here if we find the reason for the slowness or if it goes away in later releases.