Error with convolution layers

ShaoyenT · May 23, 2022, 5:07pm

I am getting this error when running training using convolution layers in PyTorch:

RuntimeError: Function ToCopyBackward0 returned an invalid gradient at index 0 - got [1, 1, 2048, 768] but expected shape compatible with [768, 2048, 1, 1]

I have permuted the model using permute_params.

After debugging in permute_params, I see that the last conv layer is permuted from conv.weight: torch.Size([768, 2048, 1, 1]) to conv.weight: torch.Size([1, 1, 2048, 768]).

I suspect it’s the manipulation of param.data that is messing up the gradient shapes.
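
As a quick check of the shapes involved, here is a minimal sketch in plain PyTorch on CPU. It does not load any Habana modules, and `weight` is a hypothetical stand-in for conv.weight. It only shows that the two permutations used in permute_params are inverses of each other and produce the two shapes named in the error message; it does not reproduce the error itself.

```python
import torch

# Hypothetical stand-in for conv.weight in KCRS layout:
# (out_channels, in_channels, kernel_h, kernel_w).
weight = torch.empty(768, 2048, 1, 1)

# Same dimension orders as in permute_params.
rsck = weight.permute((2, 3, 1, 0))      # KCRS -> RSCK
restored = rsck.permute((3, 2, 0, 1))    # RSCK -> KCRS

print(tuple(rsck.shape))      # (1, 1, 2048, 768)
print(tuple(restored.shape))  # (768, 2048, 1, 1)
```

The first shape matches the "got" side of the error and the second matches the "expected" side, which is consistent with the gradient being compared against a parameter shape recorded before the permutation.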

Any help is appreciated.

Greg_S · May 23, 2022, 5:13pm

Hi @ShaoyenT, thanks for posting this question. What model are you running? Can you please confirm your setup (SynapseAI SW version) and whether you are running on-prem or on an AWS DL1 instance? If you are running on a DL1 instance, can you please confirm the DLAMI or base AMI + Docker image that you are using?

We can support you much faster if we have a log file. You can post the log file to external storage and share the link with us.

ShaoyenT · May 23, 2022, 10:55pm

Hi @Greg_S, I am using an AWS DL1 instance with the AMI "Deep Learning AMI Habana PyTorch 1.10.1 SynapseAI 1.3.0 (Ubuntu 20.04) 20220304".

The model is a custom ResNet-based model, but I can replicate this error with this code:

import os

from habana_frameworks.torch.utils.library_loader import load_habana_module
import habana_frameworks.torch.core.hccl
import habana_frameworks.torch.core as htcore

import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def permute_params(model, to_filters_last, lazy_mode):
    with torch.no_grad():
        for name, param in model.named_parameters():
            if(param.ndim == 4):
                if to_filters_last:
                    param.data = param.data.permute((2, 3, 1, 0))  # permute KCRS to RSCK
                else:
                    param.data = param.data.permute((3, 2, 0, 1))  # permute RSCK to KCRS
    if lazy_mode:
        htcore.mark_step()

class Net(torch.nn.Module):
    def __init__(self,):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(3, 5, 2, stride=2),
            torch.nn.MaxPool2d(8,4),
        )
        self.linear = torch.nn.Linear(1445,3)
        
    def forward(self, inp):
        x = self.conv(inp)

        x = x.view(x.size(0),-1)
        x = self.linear(x)
    
        return x

os.environ['LOCAL_RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['ID'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '34245'
batch_size = 4

load_habana_module()
torch.distributed.init_process_group(backend="hccl", rank=0, world_size=1)

x = torch.rand((batch_size,3,150,150))
target = torch.tensor([1] * batch_size, dtype=torch.long)

model = Net()
model.to('hpu')
model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)
permute_params(model, True, True)
y = model(x.to('hpu'))

loss = F.cross_entropy(y, target.to('hpu'))
loss.backward()
htcore.mark_step()

which gives the output:

synapse_logger INFO. pid=5799 at /home/jenkins/workspace/cdsoftwarebuilder/create-pytorch---bpt-d/repos/pytorch-integration/pytorch_helpers/synapse_logger/synapse_logger.cpp:340 Done command: restart
Loading Habana modules from /home/ubuntu/.local/lib/python3.8/site-packages/habana_frameworks/torch/lib
Traceback (most recent call last):
  File "test_conv.py", line 65, in <module>
    htcore.mark_step()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function ConvolutionOverrideableBackward0 returned an invalid gradient at index 1 - got [2, 2, 3, 5] but expected shape compatible with [5, 3, 2, 2]

Sayantan_S · May 26, 2022, 12:22am

Hi @ShaoyenT

Could you please post your running command (you seem to be trying out multi-card?).

I copied the contents of your code snippet, commented out the lines relevant for multicard (os.environ setting, torch.distributed.init_process_group and model = DDP(model,…)) and was able to run it fine on single card. Can you also please try the same experiment and see if you can run it on single card?

Commented-out lines for the single-card run:

#os.environ['LOCAL_RANK'] = '0'
#os.environ['WORLD_SIZE'] = '1'
#os.environ['ID'] = '7'
#os.environ['MASTER_ADDR'] = 'localhost'
#os.environ['MASTER_PORT'] = '34245'

#torch.distributed.init_process_group(backend="hccl", rank=0, world_size=1)

#model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)

ShaoyenT · May 26, 2022, 5:53pm

Hi @Sayantan_S

Yes, I can run this code on a single card by commenting out those lines. However, I wish to do distributed training using convolution layers. Is this the correct way of using DDP on Gaudi?

I am running this code using python3 -m torch.distributed.run test.py (and also with python3 test.py on a single card).
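
Since the script is started both ways, here is a minimal sketch of reading the distributed settings from the environment instead of hard-coding them. It relies on the general PyTorch behavior that torch.distributed.run sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each worker. The helper name `read_dist_env` and its fallback values are assumptions for illustration, not Habana guidance.

```python
import os

def read_dist_env(environ=os.environ):
    """Hypothetical helper: use launcher-provided values, else single-process defaults."""
    return {
        "rank": int(environ.get("RANK", "0")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
        "world_size": int(environ.get("WORLD_SIZE", "1")),
        "master_addr": environ.get("MASTER_ADDR", "localhost"),
        # Fallback port taken from the reproduction script above.
        "master_port": environ.get("MASTER_PORT", "34245"),
    }

# With an empty environment (plain python3 test.py), the defaults apply.
cfg = read_dist_env({})
print(cfg["rank"], cfg["world_size"])
```

The returned rank and world_size could then be passed to torch.distributed.init_process_group in place of the literal 0 and 1.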

Sayantan_S · June 2, 2022, 3:48am

Thanks for your reply and for running the experiment. We are looking into it and will post an update here.

Sayantan_S · June 8, 2022, 10:13pm

Hi,

Can you please move the permute_params call before the DDP call and try again?

That is:

model = Net()
model.to('hpu')
permute_params(model, True, True)
model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)

instead of:

model = Net()
model.to('hpu')
model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)
permute_params(model, True, True)

It works for me if I make this change.

ShaoyenT · July 13, 2022, 7:41am

My code is working now with this change.
Thanks a lot!
