Training of PyTorch Efficientnet seems extremely slow

BautaJD · August 18, 2022, 4:50pm

I am trying to train/fine-tune an efficientnet_b0 model on my own data. It is the official PyTorch 1.11 torchvision model, so not a third-party model, and I am starting from the pretrained version. I am using Synapse 1.5, as it is one of the official AWS Habana AMIs with PyTorch 1.11 preinstalled. I wrote my own training script, but basically looked at the MNIST and ResNet training scripts from the Habana GitHub repository and adapted mine accordingly…

Currently I am running on a single device just to get my feet wet. I am using FP32 training, the batch size is 90, and the tensor format is [3,448,192].

Running in lazy mode, a single iteration seems to take over 20 seconds to execute. That’s significantly slower than it takes on, for example, an NVIDIA V100.

Any idea why each iteration is running so extremely slowly? I checked that it is not the I/O, as I replaced the image data loading with just torch.randn.
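When comparing per-iteration times between devices, it can help to leave the first few iterations out of the measurement, since lazily compiled backends often spend one-time compilation cost there. Below is a minimal sketch of such a timing harness. The names `time_iterations` and `step_fn` are hypothetical and not from the original script. `step_fn` is assumed to run one full training step and to block until the device has finished (for example by reading the loss value on the host).

```python
import statistics
import time


def time_iterations(step_fn, n_iters=10, warmup=3):
    """Return (mean, stdev) wall-clock seconds per call of step_fn.

    The first `warmup` calls are executed but not timed, so one-time
    costs such as graph compilation do not skew the result.
    """
    for _ in range(warmup):
        step_fn()

    durations = []
    for _ in range(n_iters):
        start = time.perf_counter()
        step_fn()  # assumed to block until the step has really finished
        durations.append(time.perf_counter() - start)

    mean = statistics.mean(durations)
    # stdev needs at least two samples
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, stdev
```

If the steady-state time after warmup is still in the range of many seconds per iteration, the slowdown is unlikely to be a pure startup effect.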

Sayantan_S · August 19, 2022, 5:26am

Hi,

Is it possible to share a stripped-down version of the code (maybe with random data training and random initialization)?

Also, just to make sure, you can run hl-smi and check HPU usage to confirm that the model is running on the HPU and not on the CPU.

Also if you are ok with TensorFlow instead of PyTorch, here is an enabled efficientnet

Thanks

BautaJD · August 19, 2022, 4:27pm

Thank you for your answer!

Yes, I should be able to make a stripped-down version that does not need any data. I will try to send one later.

Yes, I checked that it is running on the HPU (via hl-smi). I also get a .log0 file for the device.

Funnily enough, I also ran it explicitly on the CPU device, just to check whether there might be an issue, as you say. To my surprise, it ran ~20x faster on the CPU of that AWS instance than on the HPU.

I also tried hl-profile-conf, but I couldn’t get it to record any .JSON, or I was too stupid to find where it was saved.

I am afraid our whole pipeline is PyTorch-based, so unfortunately a TensorFlow version does not help.

BautaJD · August 19, 2022, 4:27pm

Hello again,

Here is the stripped down training script as promised

Stripped down training script

You simply have to run it with python train_stripped.py --usecuda 0 if you want to use it with Gaudi. I hope it helps to narrow down the issue!

BautaJD · August 22, 2022, 6:56pm

By the way, I am seeing the same with torchvision ConvNeXt models and, though not to the same extreme, with MobileNetV3, which also has very slow iterations.

However, a few other network topologies seem to run okay with the same training script.

So I wonder if it is a specific op in those networks, maybe an activation function or a specific layout, that it struggles with.

Sayantan_S · August 23, 2022, 4:38am

Thank you for this detailed analysis. We’ll take a look at it and get back to you.

Sayantan_S · August 23, 2022, 5:09am

Here are a couple of workarounds you can try:

Perform bernoulli_ of StochasticDepth on CPU
Around this line here,

# Generate the noise on the CPU when the input lives on an HPU
dev = 'cpu' if 'hpu' in input.device.type else input.device
#noise = torch.empty(size, dtype=input.dtype, device=input.device)
noise = torch.empty(size, dtype=input.dtype, device=dev)
noise = noise.bernoulli_(survival_rate)
if 'hpu' in input.device.type:
    # Move the noise back to the device of the input
    noise = noise.to(input.device)

Disable inplace Dropout
Around here
Replace nn.Dropout(p=dropout, inplace=True), with nn.Dropout(p=dropout),
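As an alternative to editing the torchvision source, the same effect can probably be had by flipping the flag on an already constructed model. This is only a sketch and not part of the workaround as originally suggested. The helper name `disable_inplace_dropout` is hypothetical. It relies on `nn.Dropout` reading its `inplace` attribute at forward time, and it only touches `nn.Dropout`, not other dropout variants.

```python
import torch.nn as nn


def disable_inplace_dropout(model: nn.Module) -> int:
    """Set inplace=False on every nn.Dropout in `model`.

    Returns the number of modules that were changed.
    """
    changed = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout) and module.inplace:
            module.inplace = False
            changed += 1
    return changed


# Usage on a toy model (stand-in for the real network):
toy = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.2, inplace=True))
num_changed = disable_inplace_dropout(toy)
```

Calling it once right after building the model, before moving it to the device, avoids having to patch the installed package.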

Please let me know if you see speedups with these 2 changes.

Thanks
Sayantan

BautaJD · August 23, 2022, 4:28pm

Thank you very much. Yes, these changes have improved it tremendously. This feels more like the performance I would expect from an accelerator.

So does this mean inplace dropout is not something the Gaudi architecture likes? I think it is used in quite a few architectures. At least it is used in torchvision’s version of mobilenetv3; when I remove it there, I also see a tremendous performance boost.

Sayantan_S · August 23, 2022, 4:30pm

Thanks for verifying. We’ll take a look at the workarounds (bernoulli and inplace dropout) and will update here if we find the reason for the slowness or if it goes away in later releases.
