Error with convolution layers

I am getting this error when running training using convolution layers in PyTorch:

RuntimeError: Function ToCopyBackward0 returned an invalid gradient at index 0 - got [1, 1, 2048, 768] but expected shape compatible with [768, 2048, 1, 1]

I have permuted the model using permute_params.

After debugging in permute_params, I see that the last conv layer is permuted from conv.weight: torch.Size([768, 2048, 1, 1]) to conv.weight: torch.Size([1, 1, 2048, 768]).
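
For reference, the layout change can be sketched in plain Python without any device. This is only an illustration: permute_shape, KCRS_TO_RSCK and RSCK_TO_KCRS are hypothetical names, and the axis orders are read off the layout names (K=0, C=1, R=2, S=3), following the rule that axis i of the result takes its size from axis dims[i] of the input, as in tensor.permute.

```python
def permute_shape(shape, dims):
    """Return the shape obtained by reordering axes the way tensor.permute(dims) does."""
    return tuple(shape[d] for d in dims)


# Axis orders implied by the layout names (K=0, C=1, R=2, S=3 in KCRS).
KCRS_TO_RSCK = (2, 3, 1, 0)
RSCK_TO_KCRS = (3, 2, 0, 1)

kcrs = (768, 2048, 1, 1)
rsck = permute_shape(kcrs, KCRS_TO_RSCK)
print(rsck)                               # (1, 1, 2048, 768)
print(permute_shape(rsck, RSCK_TO_KCRS))  # (768, 2048, 1, 1)
```

The second call shows that the two axis orders are inverses of each other, so the round trip restores the original KCRS shape.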

I suspect it is this manipulation of the weights that is messing up the gradient shapes.

Any help is appreciated.

Hi @ShaoyenT, thanks for posting this question. What model are you running? Can you please confirm your setup (SynapseAI SW version) and whether you are running on-prem or on an AWS DL1 instance? If you are running on a DL1 instance, can you please confirm the DLAMI or base AMI + Docker image that you are using?

We can support you much faster if we have a log file. You can post the log file to external storage and share the link with us.

Hi @Greg_S, I am using an AWS DL1 instance with the AMI "Deep Learning AMI Habana PyTorch 1.10.1 SynapseAI 1.3.0 (Ubuntu 20.04) 20220304".

The model is a custom ResNet-based model, but I can replicate this error with the following code:

import os

from habana_frameworks.torch.utils.library_loader import load_habana_module
import habana_frameworks.torch.core.hccl
import habana_frameworks.torch.core as htcore

load_habana_module()  # prints the "Loading Habana modules from ..." line seen in the output

import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

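def permute_params(model, to_filters_last, lazy_mode):
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.ndim == 4:  # only 4-D (convolution) weights are permuted
                if to_filters_last:
                    param.data = param.data.permute((2, 3, 1, 0))  # permute KCRS to RSCK
                else:
                    param.data = param.data.permute((3, 2, 0, 1))  # permute RSCK to KCRS
    if lazy_mode:
        htcore.mark_step()  # flush the pending lazy-mode graph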

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(3, 5, 2, stride=2),
            # ... (any further layers of this Sequential are missing from the post)
        )
        self.linear = torch.nn.Linear(1445, 3)

    def forward(self, inp):
        x = self.conv(inp)
        x = x.view(x.size(0), -1)  # flatten to (batch, features)
        x = self.linear(x)
        return x

os.environ['LOCAL_RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['ID'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '34245'
batch_size = 4

torch.distributed.init_process_group(backend="hccl", rank=0, world_size=1)

x = torch.rand((batch_size,3,150,150))
target = torch.tensor([1] * batch_size, dtype=torch.long)

model = Net().to('hpu')
model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)
permute_params(model, True, True)
y = model(x.to('hpu'))

loss = F.cross_entropy(y, target.to('hpu'))
loss.backward()  # the traceback below is raised here

This gives the following output:

synapse_logger INFO. pid=5799 at /home/jenkins/workspace/cdsoftwarebuilder/create-pytorch---bpt-d/repos/pytorch-integration/pytorch_helpers/synapse_logger/synapse_logger.cpp:340 Done command: restart
Loading Habana modules from /home/ubuntu/.local/lib/python3.8/site-packages/habana_frameworks/torch/lib
Traceback (most recent call last):
  File "", line 65, in <module>
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/", line 154, in backward
RuntimeError: Function ConvolutionOverrideableBackward0 returned an invalid gradient at index 1 - got [2, 2, 3, 5] but expected shape compatible with [5, 3, 2, 2]
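
As a quick check that the two shapes in this message differ only by an axis reordering, the following plain-Python sketch lists every axis order that maps the expected shape onto the one that was received. matching_permutations is a hypothetical helper written for this post and is not part of PyTorch or the Habana stack.

```python
from itertools import permutations


def matching_permutations(expected, got):
    """List every axis order dims for which permuting `expected` by dims yields `got`."""
    return [dims for dims in permutations(range(len(expected)))
            if tuple(expected[d] for d in dims) == tuple(got)]


# Shapes taken from the RuntimeError above.
print(matching_permutations((5, 3, 2, 2), (2, 2, 3, 5)))
# [(2, 3, 1, 0), (3, 2, 1, 0)]
```

Two orders match because R and S are both 2 here. The first one, (2, 3, 1, 0), is the KCRS to RSCK order, which suggests the gradient arrives in the permuted layout while autograd still expects the original one.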

Hi @ShaoyenT

Could you please post the command you are running? (You seem to be trying out multi-card.)

I copied the contents of your code snippet, commented out the lines relevant to multi-card (the os.environ settings, torch.distributed.init_process_group and model = DDP(model,…)), and was able to run it fine on a single card. Can you please try the same experiment and see if you can run it on a single card?

Commented-out lines for the single-card run:

#os.environ['LOCAL_RANK'] = '0'
#os.environ['WORLD_SIZE'] = '1'
#os.environ['ID'] = '7'
#os.environ['MASTER_ADDR'] = 'localhost'
#os.environ['MASTER_PORT'] = '34245'

#torch.distributed.init_process_group(backend="hccl", rank=0, world_size=1)

#model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)

Hi @Sayantan_S

Yes, I can run this code on a single card by commenting out those lines. However, I want to do distributed training with convolution layers. Is this the correct way of using DDP on Gaudi?

I am running this code using python3 -m (also with python3 on a single-card)

Thanks for your reply and experiment. We are looking into it and will update here.


Can you please move the permute_params call before the DDP call and try again?

That is:

model = Net().to('hpu')
permute_params(model, True, True)
model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)

instead of:

model = Net().to('hpu')
model = DDP(model, broadcast_buffers=False, gradient_as_bucket_view=False)
permute_params(model, True, True)

It works for me if I make this change.
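
If it helps to keep the ordering from regressing, one option is to route model construction through a small helper that always permutes before wrapping. This is only a sketch: permute_then_wrap is a hypothetical name, and the two callables stand in for permute_params and DDP from this thread.

```python
from typing import Callable, TypeVar

M = TypeVar("M")


def permute_then_wrap(model: M,
                      permute: Callable[[M], None],
                      wrap: Callable[[M], M]) -> M:
    """Apply the in-place parameter permutation first, then wrap the model."""
    permute(model)      # e.g. lambda m: permute_params(m, True, True)
    return wrap(model)  # e.g. lambda m: DDP(m, broadcast_buffers=False, gradient_as_bucket_view=False)
```

With the names from this thread, the call would look like model = permute_then_wrap(Net().to('hpu'), lambda m: permute_params(m, True, True), lambda m: DDP(m, broadcast_buffers=False, gradient_as_bucket_view=False)).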
