VITS training got RuntimeError: MKL FFT doesn't support tensors of type: BFloat16

I’m trying to train VITS, a text-to-speech model, on HPU. I succeeded in pretraining it on a CUDA A100, and next I wanted to try it on HPU. With some static-shape fixes I succeeded in running inference on HPU. However, when I tried the training process, it failed with the following error.

/usr/local/lib/python3.10/dist-packages/torch/functional.py:660: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at /npu-stack/pytorch-fork/aten/src/ATen/native/SpectralOps.cpp:874.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/vits/train_ms.py", line 306, in <module>
    main()
  File "/vits/train_ms.py", line 130, in main
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/vits/train_ms.py", line 168, in train_and_evaluate
    y_hat_mel = mel_spectrogram_torch(
  File "/vits/mel_processing.py", line 108, in mel_spectrogram_torch
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
  File "/usr/local/lib/python3.10/dist-packages/torch/functional.py", line 660, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: MKL FFT doesn't support tensors of type: BFloat16

The source code is here: vits/mel_processing.py at main · Spycsh/vits · GitHub, triggered by train_ms.py.

Before training I just added the following lines to train_ms.py, put all the relevant inputs on the “hpu” device, and set the backend to ‘hccl’. (These are my local changes and not committed yet, so you may not see them in the same file on GitHub.)

import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.gpu_migration
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu
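
The rest of my local changes look roughly like this (an illustrative sketch, not the exact diff):

import torch
import torch.distributed as dist
import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.gpu_migration
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

device = torch.device("hpu")

# distributed init uses the hccl backend instead of nccl;
# world size / rank come from the usual env vars (WORLD_SIZE, RANK, ...)
world_size, rank, local_rank = initialize_distributed_hpu()
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

# ... and inside the training loop, models and batches are moved with .to(device)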

The PyTorch Support Matrix (Gaudi Documentation 1.17.0) seems to say FFT is not supported yet; I don’t know whether that is related.

Could you give me a hint about where I’m going wrong, how I can disable BF16, or some other way to use an STFT that is supported on HPU?

"MKL FFT " in the error seems to suggest that its falling back to CPU, possibly because its not supported yet on hpu (maybe because this functions deals wtigh complex numbers, which is not supported in hpu).

You can move the FFT op to CPU explicitly. This might avoid the crash:

# say x is a tensor on HPU
x = x.cpu().to(torch.float32)          # MKL FFT on CPU does not accept bfloat16
window = window.cpu().to(torch.float32)
x_stft = torch.stft(x, fft_size, hop_size, win_length, window, ....)
# move the results back to the HPU device
x = x.to(device)
x_stft = x_stft.to(device)
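
If you prefer to keep the call sites in mel_processing.py untouched, the round trip could also be wrapped in a small helper; a minimal sketch (stft_cpu_fallback is a hypothetical name, not something that exists in the repo):

import torch

def stft_cpu_fallback(y, n_fft, device, **stft_kwargs):
    # Run the STFT on CPU in float32 (MKL's FFT rejects bfloat16), then move
    # the result back to the original device. .cpu()/.to() are tracked by
    # autograd, so gradients still flow through the round trip.
    window = stft_kwargs.get("window")
    if window is not None:
        stft_kwargs["window"] = window.cpu().to(torch.float32)
    spec = torch.stft(y.cpu().to(torch.float32), n_fft, **stft_kwargs)
    return spec.to(device)

# e.g. inside mel_spectrogram_torch:
# spec = stft_cpu_fallback(y, n_fft, y.device, hop_length=hop_size,
#                          win_length=win_size,
#                          window=hann_window[wnsize_dtype_device],
#                          return_complex=False)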

In case you want to turn off bf16, you can check how autocast is used in the page below and try turning it off (pass it enabled=False); a minimal sketch follows the link:
https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/index.html?highlight=autocast
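
For example, a standalone sketch of what enabled=False does (not taken from train_ms.py; with gpu_migration you would instead pass enabled=False to whatever autocast context the script already uses):

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

x = torch.randn(4, 8).to("hpu")
layer = torch.nn.Linear(8, 8).to("hpu")

# enabled=False keeps this region in float32 even if bf16 autocast is used elsewhere
with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=False):
    out = layer(x)

print(out.dtype)  # expect torch.float32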


Thanks @Sayantan_S, I fell back manually to CPU float32 and that passed. Now I get another error:

  File "/vits/train_ms.py", line 307, in <module>
    main()
  File "/vits/train_ms.py", line 129, in main
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/vits/train_ms.py", line 188, in train_and_evaluate
    scaler.scale(loss_disc_all).backward()
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 532, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].

I thought the GPU migration module should handle the migration of torch.cuda.amp.GradScaler to HPU, as the doc lists. So I looked into the logs by setting the debug level to 4 (error only) and checking the detailed logs. I found something that might be useful.

vim graph_compiler.log

[02:21:22.954855][GC                    ][error][tid:11DFB4] Failed to load tpc kernel for node: gradient/module/0/5/reduce_sum_fwd_f32/5521_complex/reduce_sum_fwd_f32_0, GUID: reduce_sum_fwd_f32. Got error: GLUE_INCOMPATIBLE_OUTPUT_SIZE
[02:21:22.954942][PASS_MANAGER          ][error][tid:11DFB4] Graph optimization failed pass: loadTpcKernels
vim pytorch.log

[02:21:22.955134][PT_BRIDGE       ][error][tid:11DFB4] /npu-stack/pytorch-integration/backend/synapse_helpers/graph.cpp: 536Graph compile failed. synStatus=synStatus 26 [Generic failure]. compile
[02:21:22.971260][PT_BRIDGE       ][error][tid:11DFB4] backtrace (up to 30)
[02:21:22.971283][PT_BRIDGE       ][error][tid:11DFB4] /usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::shared_ptr<hl_logger::Logger> const&, int)+0x5c) [0x7fd445744d0c]
[02:21:22.971291][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(void hl_logger::v1_7_inline_fmt_compile::logStacktrace<HlLogger::LoggerType>(HlLogger::LoggerType, int)+0x61) [0x7fd447a90bb1]
[02:21:22.971299][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(synapse_helpers::graph::compile()+0x1927) [0x7fd4468355c7]
[02:21:22.971303][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(habana::HabanaLaunchOpPT::CompileSynapseGraph()+0xd9) [0x7fd446ff2599]
[02:21:22.971307][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(habana::HabanaLaunchOpPT::CompileSynapseGraphAndPatchTable()+0x1fb) [0x7fd446fdeeeb]
[02:21:22.971321][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(habana::HabanaLaunchOpPT::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::optional<std::vector<at::Tensor, std::allocator<at::Tensor> > >, std::optional<std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > >, bool, habana::HabanaLaunchOpPipeline::PipelineCallBase&)+0x1883) [0x7fd446fb8273]
[02:21:22.971326][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(+0xf16e65) [0x7fd448460e65]
[02:21:22.971332][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::exec::HlExec::Launch(std::vector<c10::IValue, std::allocator<c10::IValue> >&, c10::hpu::HPUStream const&, bool)+0x932) [0x7fd4484636d2]
[02:21:22.971337][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(LaunchSyncTensorsGraph(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&)+0x607) [0x7fd44843d277]
[02:21:22.971344][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::SyncTensorsGraphInternal(std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >*, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, bool, bool)+0x2019) [0x7fd4484402d9]
[02:21:22.971349][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::SyncTensorsGraph(std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >*, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, bool, bool)+0x43d) [0x7fd44844186d]
[02:21:22.971358][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::SyncLiveTensorsGraph(c10::Device const*, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, bool, bool, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, std::set<long, std::less<long>, std::allocator<long> >)+0x3a8) [0x7fd448442248]
[02:21:22.971367][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::StepMarker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, bool, bool, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, std::set<long, std::less<long>, std::allocator<long> >)+0x95b) [0x7fd44844326b]
[02:21:22.971379][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensorViews::StepMarkerAllReduce(std::vector<at::Tensor, std::allocator<at::Tensor> > const&)+0x69b) [0x7fd4484c7bfb]
[02:21:22.971390][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHCCL::collective(std::vector<at::Tensor, std::allocator<at::Tensor> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, std::function<hcclResult_t (at::Tensor&, at::Tensor&, void const*, void*, void*&, InternalStreamHandle*)>, bool)+0x1100) [0x7fd43eb11ec0]
[02:21:22.971395][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHcclBase::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)+0x42a) [0x7fd43eae252a]
[02:21:22.971403][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ops::allreduce_hpu_(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long)+0x138) [0x7fd43eb289b8]
[02:21:22.971423][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)+0x19e) [0x7fd43eb37c6e]
[02:21:22.971427][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4a366a6) [0x7fd4f0da66a6]
[02:21:22.971429][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x53ce525) [0x7fd4f173e525]
[02:21:22.971432][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x53dd3ec) [0x7fd4f174d3ec]
[02:21:22.971434][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x543054c) [0x7fd4f17a054c]
[02:21:22.971438][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::run_allreduce_hook(c10d::GradBucket&)+0x49) [0x7fd4f17b0bf9]
[02:21:22.971444][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::run_comm_hook(c10d::GradBucket&)+0x55) [0x7fd4f17b0cd5]
[02:21:22.971447][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::all_reduce_bucket(c10d::Reducer::Bucket&)+0x4d7) [0x7fd4f17b3fe7]
[02:21:22.971451][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::mark_bucket_ready(unsigned long)+0x6b) [0x7fd4f17b427b]
[02:21:22.971454][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::mark_variable_ready(unsigned long)+0x1b1) [0x7fd4f17bb311]
[02:21:22.971457][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::autograd_hook(unsigned long)+0x16c) [0x7fd4f17bb6bc]
[02:21:22.971459][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x544b993) [0x7fd4f17bb993]
[02:21:22.971462][PT_BRIDGE       ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x54509cf) [0x7fd4f17c09cf]
vim synapse_runtime.log

[02:20:44.079748][SYN_API       ][info ][tid:11D8F6] + ---------------------------------------------------------------------- +
[02:20:44.079768][SYN_API       ][info ][tid:11D8F6] | Version:            1.16.1                                             |
[02:20:44.079774][SYN_API       ][info ][tid:11D8F6] | Synapse:            323df60                                            |
[02:20:44.079778][SYN_API       ][info ][tid:11D8F6] | HCL:                97cbe6a                                            |
[02:20:44.079781][SYN_API       ][info ][tid:11D8F6] | MME:                b7ec966                                            |
[02:20:44.079786][SYN_API       ][info ][tid:11D8F6] | SCAL:               0ecf6e1                                            |
[02:20:44.079790][SYN_API       ][info ][tid:11D8F6] | Description:        HabanaLabs Runtime and GraphCompiler               |
[02:20:44.079842][SYN_API       ][info ][tid:11D8F6] | Time:               2024-08-13 02:20:44.079792                         |
[02:20:44.079856][SYN_API       ][info ][tid:11D8F6] + ---------------------------------------------------------------------- +
[02:21:22.954967][SYN_RECIPE    ][error][tid:11DFB4] compileGraph: Can not compile graph
[02:21:22.954985][SYN_RECIPE    ][error][tid:11DFB4] addRecipeHandleAndCompileGraph: Can not compile
[02:21:22.955006][SYN_API       ][error][tid:11DFB4] compileGraph: Failed to add recipe handle into Recipe-Singleton status 26[synFail]

Could you give me some hints on why the graph compilation fails and how I can go further on this?

Can you share the code/command line to repro the issue?

Hi @Sayantan_S, here are the minimal steps to reproduce the original bug.

docker run -itd -p 8091:80  --runtime=habana -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host vault.habana.ai/gaudi-docker/1.16.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

# enter the container and clone the repo
docker exec -it <container_id> bash

git clone https://github.com/Spycsh/vits.git

# prepare env
cd vits
pip install -r requirements_hpu.txt
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace

export RANK=0
export WORLD_SIZE=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8888

python train_ms_hpu.py -c configs/intel_base.json -m intel_base

The dataset is also uploaded, so you do not need to prepare it. I previously trained on the same dataset successfully on an A100 by running the command python train_ms.py -c configs/intel_base.json -m intel_base.

Thanks for further suggestions!