Thanks @Sayantan_S, I manually fell back to CPU Float32 and that passed. Now I'm hitting another error:
File "/vits/train_ms.py", line 307, in <module>
main()
File "/vits/train_ms.py", line 129, in main
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/vits/train_ms.py", line 188, in train_and_evaluate
scaler.scale(loss_disc_all).backward()
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 532, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].
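For context, the failing call sits in the usual VITS AMP step. Below is a minimal sketch of the pattern on my side (a toy model stands in for the real nets, so take the details as illustrative rather than as my exact train_ms.py):

import habana_frameworks.torch.gpu_migration  # noqa: F401 - per the docs, maps torch.cuda.* (incl. GradScaler) to HPU
import habana_frameworks.torch.core as htcore
import torch
from torch.cuda.amp import GradScaler

device = torch.device("hpu")
net_d = torch.nn.Linear(16, 1).to(device)             # toy stand-in for the discriminator
optim_d = torch.optim.AdamW(net_d.parameters(), lr=2e-4)
scaler = GradScaler(enabled=True)                      # fp16_run=True in my config

x = torch.randn(8, 16, device=device)
loss_disc_all = net_d(x).pow(2).mean()

optim_d.zero_grad()
scaler.scale(loss_disc_all).backward()                 # <- the call where my run fails
scaler.step(optim_d)
scaler.update()
htcore.mark_step()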
I thought the GPU migration module would take care of migrating torch.cuda.amp.GradScaler to HPU, as the docs list it. So I set the log level to 4 (errors only) and went through the detailed logs, where I found something that might be useful.
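For reference, this is roughly how I raise the log level; these are plain environment variables, so exporting them in the shell before launching should be equivalent (4 = errors only, as I understand the Habana logging levels):

import os
# Must run before habana_frameworks / the Synapse libraries are loaded.
os.environ["LOG_LEVEL_ALL"] = "4"       # errors only
os.environ["ENABLE_CONSOLE"] = "true"   # also mirror the logs to stdout
# The per-component files (graph_compiler.log, pytorch.log, synapse_runtime.log) then show up under $HABANA_LOGS.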
vim graph_compiler.log
[02:21:22.954855][GC ][error][tid:11DFB4] Failed to load tpc kernel for node: gradient/module/0/5/reduce_sum_fwd_f32/5521_complex/reduce_sum_fwd_f32_0, GUID: reduce_sum_fwd_f32. Got error: GLUE_INCOMPATIBLE_OUTPUT_SIZE
[02:21:22.954942][PASS_MANAGER ][error][tid:11DFB4] Graph optimization failed pass: loadTpcKernels
vim pytorch.log
[02:21:22.955134][PT_BRIDGE ][error][tid:11DFB4] /npu-stack/pytorch-integration/backend/synapse_helpers/graph.cpp: 536Graph compile failed. synStatus=synStatus 26 [Generic failure]. compile
[02:21:22.971260][PT_BRIDGE ][error][tid:11DFB4] backtrace (up to 30)
[02:21:22.971283][PT_BRIDGE ][error][tid:11DFB4] /usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::shared_ptr<hl_logger::Logger> const&, int)+0x5c) [0x7fd445744d0c]
[02:21:22.971291][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(void hl_logger::v1_7_inline_fmt_compile::logStacktrace<HlLogger::LoggerType>(HlLogger::LoggerType, int)+0x61) [0x7fd447a90bb1]
[02:21:22.971299][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(synapse_helpers::graph::compile()+0x1927) [0x7fd4468355c7]
[02:21:22.971303][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(habana::HabanaLaunchOpPT::CompileSynapseGraph()+0xd9) [0x7fd446ff2599]
[02:21:22.971307][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(habana::HabanaLaunchOpPT::CompileSynapseGraphAndPatchTable()+0x1fb) [0x7fd446fdeeeb]
[02:21:22.971321][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(habana::HabanaLaunchOpPT::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::optional<std::vector<at::Tensor, std::allocator<at::Tensor> > >, std::optional<std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > >, bool, habana::HabanaLaunchOpPipeline::PipelineCallBase&)+0x1883) [0x7fd446fb8273]
[02:21:22.971326][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(+0xf16e65) [0x7fd448460e65]
[02:21:22.971332][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::exec::HlExec::Launch(std::vector<c10::IValue, std::allocator<c10::IValue> >&, c10::hpu::HPUStream const&, bool)+0x932) [0x7fd4484636d2]
[02:21:22.971337][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(LaunchSyncTensorsGraph(LaunchTensorsInfo&&, LaunchEagerInfo&&, LaunchStreamInfo&&)+0x607) [0x7fd44843d277]
[02:21:22.971344][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::SyncTensorsGraphInternal(std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >*, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, bool, bool)+0x2019) [0x7fd4484402d9]
[02:21:22.971349][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::SyncTensorsGraph(std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >*, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, bool, bool)+0x43d) [0x7fd44844186d]
[02:21:22.971358][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::SyncLiveTensorsGraph(c10::Device const*, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, bool, bool, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, std::set<long, std::less<long>, std::allocator<long> >)+0x3a8) [0x7fd448442248]
[02:21:22.971367][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensor::StepMarker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<habana_lazy::HbLazyFrontEndInfoToBackend>, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, bool, bool, std::vector<habana_lazy::HbLazyTensor, std::allocator<habana_lazy::HbLazyTensor> >, std::set<long, std::less<long>, std::allocator<long> >)+0x95b) [0x7fd44844326b]
[02:21:22.971379][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(habana_lazy::HbLazyTensorViews::StepMarkerAllReduce(std::vector<at::Tensor, std::allocator<at::Tensor> > const&)+0x69b) [0x7fd4484c7bfb]
[02:21:22.971390][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHCCL::collective(std::vector<at::Tensor, std::allocator<at::Tensor> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, std::function<hcclResult_t (at::Tensor&, at::Tensor&, void const*, void*, void*&, InternalStreamHandle*)>, bool)+0x1100) [0x7fd43eb11ec0]
[02:21:22.971395][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHcclBase::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)+0x42a) [0x7fd43eae252a]
[02:21:22.971403][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ops::allreduce_hpu_(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long)+0x138) [0x7fd43eb289b8]
[02:21:22.971423][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)+0x19e) [0x7fd43eb37c6e]
[02:21:22.971427][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4a366a6) [0x7fd4f0da66a6]
[02:21:22.971429][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x53ce525) [0x7fd4f173e525]
[02:21:22.971432][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x53dd3ec) [0x7fd4f174d3ec]
[02:21:22.971434][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x543054c) [0x7fd4f17a054c]
[02:21:22.971438][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::run_allreduce_hook(c10d::GradBucket&)+0x49) [0x7fd4f17b0bf9]
[02:21:22.971444][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::run_comm_hook(c10d::GradBucket&)+0x55) [0x7fd4f17b0cd5]
[02:21:22.971447][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::all_reduce_bucket(c10d::Reducer::Bucket&)+0x4d7) [0x7fd4f17b3fe7]
[02:21:22.971451][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::mark_bucket_ready(unsigned long)+0x6b) [0x7fd4f17b427b]
[02:21:22.971454][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::mark_variable_ready(unsigned long)+0x1b1) [0x7fd4f17bb311]
[02:21:22.971457][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(c10d::Reducer::autograd_hook(unsigned long)+0x16c) [0x7fd4f17bb6bc]
[02:21:22.971459][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x544b993) [0x7fd4f17bb993]
[02:21:22.971462][PT_BRIDGE ][error][tid:11DFB4] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x54509cf) [0x7fd4f17c09cf]
vim synapse_runtime.log
[02:20:44.079748][SYN_API ][info ][tid:11D8F6] + ---------------------------------------------------------------------- +
[02:20:44.079768][SYN_API ][info ][tid:11D8F6] | Version: 1.16.1 |
[02:20:44.079774][SYN_API ][info ][tid:11D8F6] | Synapse: 323df60 |
[02:20:44.079778][SYN_API ][info ][tid:11D8F6] | HCL: 97cbe6a |
[02:20:44.079781][SYN_API ][info ][tid:11D8F6] | MME: b7ec966 |
[02:20:44.079786][SYN_API ][info ][tid:11D8F6] | SCAL: 0ecf6e1 |
[02:20:44.079790][SYN_API ][info ][tid:11D8F6] | Description: HabanaLabs Runtime and GraphCompiler |
[02:20:44.079842][SYN_API ][info ][tid:11D8F6] | Time: 2024-08-13 02:20:44.079792 |
[02:20:44.079856][SYN_API ][info ][tid:11D8F6] + ---------------------------------------------------------------------- +
[02:21:22.954967][SYN_RECIPE ][error][tid:11DFB4] compileGraph: Can not compile graph
[02:21:22.954985][SYN_RECIPE ][error][tid:11DFB4] addRecipeHandleAndCompileGraph: Can not compile
[02:21:22.955006][SYN_API ][error][tid:11DFB4] compileGraph: Failed to add recipe handle into Recipe-Singleton status 26[synFail]
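As a next step I was thinking of reproducing the failing reduce_sum_fwd_f32 node in isolation, to check whether the kernel loads outside the full training graph. A rough sketch of what I have in mind (the shapes are made up, not the ones from my model):

import torch
import habana_frameworks.torch.core as htcore

x = torch.randn(8, 192, 400, dtype=torch.float32, device="hpu")
y = x.sum(dim=-1)   # should lower to a reduce_sum_fwd_f32 node in lazy mode
htcore.mark_step()  # force graph compilation/execution
print(y.shape, y.dtype)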
Could you give me some hints on why the graph compilation fails and how I can debug this further?