A question about how to use "wrap_in_hpu_graph"

Hi, I created a simple piece of code to understand the usage of the function "wrap_in_hpu_graph"; the code is below:

```
import torch
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore

class A(torch.nn.Module):
    def forward(self,x):
        b=x[:,torch.tensor((1,2,1,2,1,2,2,1,0,0,0,0)),torch.tensor((0,1,2,3,0,1,1,1,1,1,1,1))]
        return b

def foo_1():
    sa=A()
    sa=ht.hpu.wrap_in_hpu_graph(sa)
    sa(torch.arange(30).reshape(2,3,5).to('hpu'))

def foo_2():
    sa=A()
    sa=ht.hpu.wrap_in_hpu_graph(sa)
    sa(torch.arange(30).reshape(2,3,5))
```

When I run foo_1(), it reports the error message "RuntimeError: cpu fallback is not supported during hpu graph capturing".
When I run foo_2(), there is no error message.
So, the question is:

  1. If I use ht.hpu.wrap_in_hpu_graph() to optimize the model, but the input tensor of the model is not moved with .to('hpu'), i.e. the input tensor stays on device='cpu', is this usage allowed?

Posting the code in triple backticks so that indents are visible

```
class A(torch.nn.Module):
    def forward(self,x):
        b=x[:,torch.tensor((1,2,1,2,1,2,2,1,0,0,0,0)),torch.tensor((0,1,2,3,0,1,1,1,1,1,1,1))]
        return b

def foo_1():
    sa=A()
    sa=ht.hpu.wrap_in_hpu_graph(sa)
    sa(torch.arange(30).reshape(2,3,5).to('hpu'))

def foo_2():
    sa=A()
    sa=ht.hpu.wrap_in_hpu_graph(sa)
    sa(torch.arange(30).reshape(2,3,5))

foo_1()
foo_2()
```

@taoshaoyu ,

In general, the input tensor should be moved to the device. Without moving the input tensor to HPU, the operation happens on CPU.
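For reference, here is a minimal sketch of the pattern I would expect to work. I also move the module itself and the index tensors to HPU; the module move only matters once the module has parameters, and moving the indices is just my own precaution against the CPU fallback seen during graph capture on older releases:

```
import torch
import habana_frameworks.torch as ht

class A(torch.nn.Module):
    def forward(self, x):
        # same advanced-indexing op as in the question, with the index
        # tensors placed on HPU to avoid a CPU fallback during capture
        idx0 = torch.tensor((1,2,1,2,1,2,2,1,0,0,0,0)).to('hpu')
        idx1 = torch.tensor((0,1,2,3,0,1,1,1,1,1,1,1)).to('hpu')
        return x[:, idx0, idx1]

sa = A().to('hpu')                    # move the module to the device
sa = ht.hpu.wrap_in_hpu_graph(sa)     # capture/replay as an HPU graph

x = torch.arange(30).reshape(2,3,5).to('hpu')   # the input must be on HPU as well
out = sa(x)
print(out.cpu())
```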

I can only reproduce the issue on release 1.8, but I do not see the issue on 1.9.

Are you using 1.8? If so, can you please move to 1.9 and check whether it works for you?

Thanks

A further note:

The op you are doing in the model itself is an indexing op.

Indexing ops might be dynamic. Here is how you detect dynamic shapes: set GRAPH_VISUALIZATION=1, run multiple steps with the same input shape (we don't want to check for recompiles caused by different input shapes, since wrap_in_hpu_graph can handle input dynamicity), and check whether .graph_dumps keeps growing.
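As a rough illustration of that check, here is a sketch. The exact location and layout of the .graph_dumps folder may differ between releases, so treat the path below as an assumption and adjust it as needed:

```
import glob
import torch
import habana_frameworks.torch as ht

# Launch with the env var set, e.g.:  GRAPH_VISUALIZATION=1 python check_dynamic.py

class A(torch.nn.Module):   # same model as in the question
    def forward(self, x):
        return x[:, torch.tensor((1,2,1,2,1,2,2,1,0,0,0,0)),
                    torch.tensor((0,1,2,3,0,1,1,1,1,1,1,1))]

def num_graph_dumps():
    # count the dump files; adjust the glob if your release writes them elsewhere
    return len(glob.glob('.graph_dumps/*'))

model = A().to('hpu')
x = torch.arange(30).reshape(2,3,5).to('hpu')   # keep the input shape fixed

for step in range(5):
    model(x)
    ht.hpu.synchronize()
    print(f'step {step}: {num_graph_dumps()} graph dumps')

# If the count keeps growing even though the input shape never changes,
# the model itself is recompiling, i.e. it has dynamic ops, and HPU graphs
# are not a good fit.
```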

If your model is dynamic, you should not use HPU graphs. So I would first suggest checking whether there is dynamicity of ops in the model, and only then consider wrapping it in an HPU graph. Note that wrap_in_hpu_graph is able to deal with input dynamicity, so if that is the only kind of dynamicity you have (i.e. no dynamic ops or dynamic control flow), you can wrap in an HPU graph. You can read about removing dynamic ops here, specifically these examples.
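To make the distinction concrete, here is a toy (CPU-only) illustration of the difference between input dynamicity and op dynamicity; the example is mine, not from the links above:

```
import torch

# Input dynamicity: the same ops run on inputs of different shapes.
# wrap_in_hpu_graph can handle this by caching a captured graph per shape.
x1 = torch.randn(4, 8)
x2 = torch.randn(6, 8)

# Op dynamicity: the output shape depends on the *data*, not just the input
# shape, so every step can produce a different shape even with fixed inputs.
def dynamic_op(x):
    return x[x > 0]          # boolean-mask indexing: data-dependent size

print(dynamic_op(x1).shape)  # size varies with the random data
print(dynamic_op(x2).shape)
```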

Also note that you can use HPU graphs for training as well, as detailed here, here and here. For training, the equivalent of wrap_in_hpu_graph is ModuleCacher.
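For completeness, a very rough sketch of the training-side usage. The ModuleCacher call signature below (max_graphs, model=, inplace=) is from memory and is an assumption, so please verify it against the documentation linked above:

```
import torch
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore

model = torch.nn.Linear(16, 4).to('hpu')
# cache the module's captured graphs for training (signature assumed)
model = ht.hpu.ModuleCacher(max_graphs=10)(model=model, inplace=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for _ in range(3):
    x = torch.randn(8, 16).to('hpu')
    y = torch.randn(8, 4).to('hpu')
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)
    loss.backward()
    htcore.mark_step()       # flush the accumulated graph in lazy mode
    optimizer.step()
    htcore.mark_step()

print(loss.cpu())
```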