Tensors take a long time to move from HPU to CPU

Hi,

I am working on a TTS model. During inference, the tensors are moved from HPU to CPU.
This transfer takes much longer than the rest of the inference combined.

Attaching the screenshots.

You can check some TTS examples here:

Also, if input/output is taking too long, I suggest using HPU graphs.

For inference:

import habana_frameworks.torch as ht
model = ht.hpu.wrap_in_hpu_graph(model)

For training:

import habana_frameworks.torch as ht
ht.hpu.ModuleCacher(max_graphs=10)(model=model, inplace=True)
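
Before changing the model, it may help to check whether the copy itself is slow or whether it is only waiting for device work that was queued earlier. This is one possible explanation, not something confirmed in this thread. Below is a minimal timing sketch in plain Python. The helper name `timed` and its `sync` parameter are hypothetical, made up for illustration. `sync` is assumed to be any zero-argument function that blocks until the device is idle.

```python
import time


def timed(fn, sync=None):
    """Run fn() and return (result, elapsed_seconds).

    `sync` is a hypothetical hook: an optional zero-argument callable
    that blocks until the device has finished all queued work. It is
    called once before the clock starts and once before it stops.
    """
    if sync is not None:
        sync()  # drain pending work so it is not billed to fn
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work launched by fn to finish
    elapsed = time.perf_counter() - start
    return result, elapsed
```

On an HPU one would pass the device synchronize function as `sync` (I believe this is `ht.hpu.synchronize`, but please verify against your installed version), then time the forward pass and the `.cpu()` call separately. If the forward pass now accounts for most of the time, the copy was only absorbing the wait for queued compute.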

I encountered something similar, and uninstalling mpi4py fixed it for me. Maybe you can try that.

Btw, I don't have a minimal reproducible example yet. I'll probably open an issue for that once I have a proper one.