Gaudi2 slower compared to A100

I am trying to compare training time for YOLOX.
I am running on 8 Gaudi2 cards (gaudi_code) and 8 A100 GPUs (nvidia_code). After 100 epochs, Gaudi2 was approximately 160 minutes slower than the A100. However, the MLPerf ResNet results submitted by Habana show better performance than the A100. Why do I see this mismatch?

When posting a technical issue, please describe the issue; be as descriptive as possible, you can include things like:
• What was the expected behavior:
• What is the observed result:
• Is the issue consistently reproducible? how long does it take to reproduce:
• If you are using AWS DL1 instance, please report the AMI name that you are using
• What is the minimal script/command to reproduce the issue:
• Please include any error message or stack trace observed:
Please run the Snapshot for Debug tool and post to the issue
• git clone https://github.com/HabanaAI/Snapshot_For_Debug
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents

Also, does yolox_gaudi2 support multi-node training?

There is an open issue in the PyTorch dataloader: with a higher number of workers, the dataloader seems to have a race condition that might cause intermittent failures.

The original code base uses 4 workers, while the HPU code base uses 2 workers (to avoid the issue linked above).

For best performance on Gaudi2, 4 worker threads help. Since the training is limited by data-loading time, 2 workers are not enough to achieve full speed. However, we have set 2 workers for now, until the issue is resolved.
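
To illustrate why the worker count matters when training is bound by data loading, here is a toy model (not a measurement). It assumes loading overlaps perfectly with compute and scales linearly with the number of workers; the function name and the timing numbers are hypothetical and chosen only for illustration.

```python
def step_time(compute_s, load_s_per_batch, num_workers):
    """Toy steady-state time per training step.

    Assumes data loading fully overlaps with device compute and that
    loading time divides evenly across workers (both are simplifications).
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    effective_load_s = load_s_per_batch / num_workers
    # The slower of the two pipelines sets the pace of each step.
    return max(compute_s, effective_load_s)


# Hypothetical numbers, not taken from any real YOLOX run.
COMPUTE_S = 0.10          # device time per step
LOAD_S_PER_BATCH = 0.36   # time for one worker to prepare one batch

for workers in (2, 4):
    t = step_time(COMPUTE_S, LOAD_S_PER_BATCH, workers)
    limited_by = "data loading" if t > COMPUTE_S else "compute"
    print(f"{workers} workers: {t:.2f}s per step, limited by {limited_by}")
```

In this model the accelerator only reaches its own speed once the loaders keep up, so a faster device can still lose an end-to-end comparison if it is given fewer workers.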

Thanks
Sayantan

Thanks @Sayantan_S . It helped and with 2 workers, gaudi2 is faster than A100.

But I am facing an issue with one of my semantic segmentation workloads, which uses the TensorFlow framework: there the A100 is faster than Gaudi2. Both codes are identical, using Horovod and a TensorFlow dataset, except that I have added the following lines to run the code on a Gaudi/Gaudi2 machine.

from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

Please let me know if you have any recommendations, or if I am missing something.

Also, running YOLOX on Gaudi1 does not seem to work: I get the following error and then it hangs.

docker image : vault.habana.ai/gaudi-docker/1.9.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1

2023-06-06 00:24:57 | INFO     | yolox.core.trainer:237 - Loading dataset into memory...                                                                                                                              
2023-06-06 00:25:05 | INFO     | yolox.core.trainer:239 - Done                                                                                                                                                        
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::41(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 100, count: 101                                                                                                                                                             
dmesg: read kernel buffer failed: Operation not permitted                                                                                                                                                             
dmesg: read kernel buffer failed: Operation not permitted                                                                                                                                                             
dmesg: read kernel buffer failed: Operation not permitted                                                                                                                                                             
src/tcmalloc.cc:332] Attempt to free invalid pointer 0x2020202010676e18                                                                                                                                               
dmesg: read kernel buffer failed: Operation not permitted                                                                                                                                                             
backtrace (up to 30)                                                                                                                                                                                                  
/usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::ostream&)+0x50) [0x7f8063e38240]                                                                                                              
/usr/lib/habanalabs/libSynapse.so(+0x12676ec) [0x7f8065bf96ec]                                                                                                                                                        
/usr/lib/habanalabs/libSynapse.so(std::_Function_handler<void (int, char const*, bool), hl_logger::v1_2_fmt_compile::ModuleLoggerData<synapse::LogManager::LogType>::ModuleLoggerData(char const*)::{lambda(int, char 
const*, bool)#1}>::_M_invoke(std::_Any_data const&, int&&, char const*&&, bool&&)+0x1f) [0x7f8065bfa24f]
/usr/lib/habanalabs/libhl_logger.so(+0xfb47) [0x7f8063e37b47]
/usr/lib/habanalabs/libhl_logger.so(signalHandler(int, siginfo_t*, void*)+0x29) [0x7f8063e41a09]
===============================================================================
====================== USER CODE STACK TRACE START POINT ======================
===============================================================================
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f81c60bc090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f81c60bc00b]
/usr/lib/habanalabs/libTPCFuser.so(+0x17e7058) [0x7f80624c4058]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f81c60bc090]
/usr/lib/habanalabs/libSynapse.so(DfaBase::dumpEngStatus()+0x85) [0x7f80657255f5]
/usr/lib/habanalabs/libSynapse.so(DfaBase::checkFailure(DfaStatus)+0x255) [0x7f8065734b75]
/usr/lib/habanalabs/libSynapse.so(DfaBase::notifyHlthunkFailure(DfaErrorCode)+0x75) [0x7f80657242b5]
/usr/lib/habanalabs/libSynapse.so(synSingleton::notifyHlthunkFailure(DfaErrorCode)+0x42) [0x7f806591abb2]
/usr/lib/habanalabs/libSynapse.so(hclNotifyFailure(DfaErrorCode, unsigned long)+0x44) [0x7f806575de84]
/usr/lib/habanalabs/libhcl.so(waitForList(std::__cxx11::list<HCL_Request, std::allocator<HCL_Request> >&, unsigned short)+0x1140) [0x7f8064431d20]
/usr/lib/habanalabs/libhcl.so(HclDevice::sync(unsigned int, unsigned short)+0x49b) [0x7f806465fa7b]
/usr/lib/habanalabs/libhcl.so(HclDevice::onNewCommEnd(unsigned int, HclConfig&)+0x9a) [0x7f80646699da]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_communicator::initialize(hccl::internal_unique_id_t const*)+0x2569) [0x7f806425eed9]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_context::comm_init_rank(void**, int, hcclUniqueId&, int)+0x121) [0x7f8064277031]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_Original(void**, int, hcclUniqueId&, int)+0xcb) [0x7f80641ed98b]
/opt/habanalabs/habana_plugins/libhost_profiler.so(HcclSingletonHostProfiler::hcclCommInitRank(void**, int, hcclUniqueId&, int)+0x59) [0x7f8060c3ca59]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_impl(void**, int, hcclUniqueId, int)+0x37) [0x7f80641da447]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(std::_Function_handler<hcclResult_t (void**, int, hcclUniqueId, int), hcclResult_t (*)(void**, int, hcclUniqueId, int)
>::_M_invoke(std::_Any_data const&, void**&&, int&&, hcclUniqueId&&, int&&)+0x2e) [0x7f80692dcc7e]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(hcclCommInitRank+0x6a) [0x7f80692c926a]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHCCL::getComm(int)+0x4e4) [0x7f8068902e84]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHCCL::getCommList(std::vector<int, std::allocator<int> > const&)+0x112) [0x7f80689030f2]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(c10d::ProcessGroupHCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<a
t::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&)+0x74a) [0x7f806891665a]
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so(+0x4f91139) [0x7f81bd974139]
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so(+0x4f96c32) [0x7f81bd979c32]
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so(c10d::ops::allgather(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector
<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10d::AllgatherOpt
ions const&)+0x157) [0x7f81bd977467]
backtrace (up to 30)
/usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::ostream&)+0x50) [0x7fd5d40e1240]
/usr/lib/habanalabs/libSynapse.so(+0x12676ec) [0x7fd5d5ea26ec]
/usr/lib/habanalabs/libSynapse.so(std::_Function_handler<void (int, char const*, bool), hl_logger::v1_2_fmt_compile::ModuleLoggerData<synapse::LogManager::LogType>::ModuleLoggerData(char const*)::{lambda(int, char 
const*, bool)#1}>::_M_invoke(std::_Any_data const&, int&&, char const*&&, bool&&)+0x1f) [0x7fd5d5ea324f]
/usr/lib/habanalabs/libhl_logger.so(+0xfb47) [0x7fd5d40e0b47]
/usr/lib/habanalabs/libhl_logger.so(signalHandler(int, siginfo_t*, void*)+0x29) [0x7fd5d40eaa09]
===============================================================================
====================== USER CODE STACK TRACE START POINT ======================
===============================================================================
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7fd735b24090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7fd735b2400b]
/usr/lib/habanalabs/libTPCFuser.so(+0x17e7058) [0x7fd5d276d058]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7fd735b24090]
/lib/x86_64-linux-gnu/libc.so.6(__read+0x4c) [0x7fd735bef00c]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_underflow+0x17f) [0x7fd735b71b9f]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_uflow+0x36) [0x7fd735b72f86]
/lib/x86_64-linux-gnu/libc.so.6(_IO_getline_info+0xac) [0x7fd735b6486c]
/lib/x86_64-linux-gnu/libc.so.6(fgets+0x9a) [0x7fd735b636ca]
/usr/lib/habanalabs/libSynapse.so(exec(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x129) [0x7fd5d59ce9e9]
/usr/lib/habanalabs/libSynapse.so(DfaBase::logHlSmi()+0x12a) [0x7fd5d59cee3a]
/usr/lib/habanalabs/libSynapse.so(DfaBase::checkFailure(DfaStatus)+0x370) [0x7fd5d59ddc90]
/usr/lib/habanalabs/libSynapse.so(DfaBase::notifyHlthunkFailure(DfaErrorCode)+0x75) [0x7fd5d59cd2b5]
/usr/lib/habanalabs/libSynapse.so(synSingleton::notifyHlthunkFailure(DfaErrorCode)+0x42) [0x7fd5d5bc3bb2]
/usr/lib/habanalabs/libSynapse.so(hclNotifyFailure(DfaErrorCode, unsigned long)+0x44) [0x7fd5d5a06e84]
/usr/lib/habanalabs/libhcl.so(waitForList(std::__cxx11::list<HCL_Request, std::allocator<HCL_Request> >&, unsigned short)+0x1140) [0x7fd5d46dad20]
/usr/lib/habanalabs/libhcl.so(HclDevice::sync(unsigned int, unsigned short)+0x49b) [0x7fd5d4908a7b]
/usr/lib/habanalabs/libhcl.so(HclDevice::onNewCommEnd(unsigned int, HclConfig&)+0x9a) [0x7fd5d49129da]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_communicator::initialize(hccl::internal_unique_id_t const*)+0x2569) [0x7fd5d4507ed9]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_context::comm_init_rank(void**, int, hcclUniqueId&, int)+0x121) [0x7fd5d4520031]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_Original(void**, int, hcclUniqueId&, int)+0xcb) [0x7fd5d449698b]
/opt/habanalabs/habana_plugins/libhost_profiler.so(HcclSingletonHostProfiler::hcclCommInitRank(void**, int, hcclUniqueId&, int)+0x59) [0x7fd5d0ef0a59]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_impl(void**, int, hcclUniqueId, int)+0x37) [0x7fd5d4483447]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(std::_Function_handler<hcclResult_t (void**, int, hcclUniqueId, int), hcclResult_t (*)(void**, int, hcclUniqueId, int)
>::_M_invoke(std::_Any_data const&, void**&&, int&&, hcclUniqueId&&, int&&)+0x2e) [0x7fd5d8d44c7e]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(hcclCommInitRank+0x6a) [0x7fd5d8d3126a]
backtrace (up to 30)
/usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::ostream&)+0x50) [0x7f8af0aaf240]
/usr/lib/habanalabs/libSynapse.so(+0x12676ec) [0x7f8af28706ec]
/usr/lib/habanalabs/libSynapse.so(std::_Function_handler<void (int, char const*, bool), hl_logger::v1_2_fmt_compile::ModuleLoggerData<synapse::LogManager::LogType>::ModuleLoggerData(char const*)::{lambda(int, char 
const*, bool)#1}>::_M_invoke(std::_Any_data const&, int&&, char const*&&, bool&&)+0x1f) [0x7f8af287124f]
/usr/lib/habanalabs/libhl_logger.so(+0xfb47) [0x7f8af0aaeb47]
/usr/lib/habanalabs/libhl_logger.so(signalHandler(int, siginfo_t*, void*)+0x29) [0x7f8af0ab8a09]
===============================================================================
====================== USER CODE STACK TRACE START POINT ======================
===============================================================================
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f8c524f2090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f8c524f200b]
/usr/lib/habanalabs/libTPCFuser.so(+0x17e7058) [0x7f8aef13b058]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f8c524f2090]
/lib/x86_64-linux-gnu/libc.so.6(__read+0x4c) [0x7f8c525bd00c]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_underflow+0x17f) [0x7f8c5253fb9f]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_uflow+0x36) [0x7f8c52540f86]
/lib/x86_64-linux-gnu/libc.so.6(_IO_getline_info+0xac) [0x7f8c5253286c]
/lib/x86_64-linux-gnu/libc.so.6(fgets+0x9a) [0x7f8c525316ca]
/usr/lib/habanalabs/libSynapse.so(exec(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x129) [0x7f8af239c9e9]
/usr/lib/habanalabs/libSynapse.so(DfaBase::logHlSmi()+0x12a) [0x7f8af239ce3a]
/usr/lib/habanalabs/libSynapse.so(DfaBase::checkFailure(DfaStatus)+0x370) [0x7f8af23abc90]
/usr/lib/habanalabs/libSynapse.so(DfaBase::notifyHlthunkFailure(DfaErrorCode)+0x75) [0x7f8af239b2b5]
/usr/lib/habanalabs/libSynapse.so(synSingleton::notifyHlthunkFailure(DfaErrorCode)+0x42) [0x7f8af2591bb2]
/usr/lib/habanalabs/libSynapse.so(hclNotifyFailure(DfaErrorCode, unsigned long)+0x44) [0x7f8af23d4e84]
/usr/lib/habanalabs/libhcl.so(waitForList(std::__cxx11::list<HCL_Request, std::allocator<HCL_Request> >&, unsigned short)+0x1140) [0x7f8af10a8d20]
/usr/lib/habanalabs/libhcl.so(HclDevice::sync(unsigned int, unsigned short)+0x49b) [0x7f8af12d6a7b]
/usr/lib/habanalabs/libhcl.so(HclDevice::onNewCommEnd(unsigned int, HclConfig&)+0x9a) [0x7f8af12e09da]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_communicator::initialize(hccl::internal_unique_id_t const*)+0x2569) [0x7f8af0ed5ed9]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_context::comm_init_rank(void**, int, hcclUniqueId&, int)+0x121) [0x7f8af0eee031]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_Original(void**, int, hcclUniqueId&, int)+0xcb) [0x7f8af0e6498b]
/opt/habanalabs/habana_plugins/libhost_profiler.so(HcclSingletonHostProfiler::hcclCommInitRank(void**, int, hcclUniqueId&, int)+0x59) [0x7f8aed8bea59]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_impl(void**, int, hcclUniqueId, int)+0x37) [0x7f8af0e51447]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(std::_Function_handler<hcclResult_t (void**, int, hcclUniqueId, int), hcclResult_t (*)(void**, int, hcclUniqueId, int)
>::_M_invoke(std::_Any_data const&, void**&&, int&&, hcclUniqueId&&, int&&)+0x2e) [0x7f8af5712c7e]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(hcclCommInitRank+0x6a) [0x7f8af56ff26a]
backtrace (up to 30)
/usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::ostream&)+0x50) [0x7f8d6effe240]
/usr/lib/habanalabs/libSynapse.so(+0x12676ec) [0x7f8d70dbf6ec]
/usr/lib/habanalabs/libSynapse.so(std::_Function_handler<void (int, char const*, bool), hl_logger::v1_2_fmt_compile::ModuleLoggerData<synapse::LogManager::LogType>::ModuleLoggerData(char const*)::{lambda(int, char 
const*, bool)#1}>::_M_invoke(std::_Any_data const&, int&&, char const*&&, bool&&)+0x1f) [0x7f8d70dc024f]
/usr/lib/habanalabs/libhl_logger.so(+0xfb47) [0x7f8d6effdb47]
/usr/lib/habanalabs/libhl_logger.so(signalHandler(int, siginfo_t*, void*)+0x29) [0x7f8d6f007a09]
===============================================================================
====================== USER CODE STACK TRACE START POINT ======================
===============================================================================
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f8ed0a41090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f8ed0a4100b]
/usr/lib/habanalabs/libTPCFuser.so(+0x17e7058) [0x7f8d6d68a058]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f8ed0a41090]
/lib/x86_64-linux-gnu/libc.so.6(__read+0x4c) [0x7f8ed0b0c00c]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_underflow+0x17f) [0x7f8ed0a8eb9f]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_uflow+0x36) [0x7f8ed0a8ff86]
/usr/lib/habanalabs/libSynapse.so(exec(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x129) [0x7f8d708eb9e9]
/usr/lib/habanalabs/libSynapse.so(DfaBase::logHlSmi()+0x12a) [0x7f8d708ebe3a]
/usr/lib/habanalabs/libSynapse.so(DfaBase::checkFailure(DfaStatus)+0x370) [0x7f8d708fac90]
/usr/lib/habanalabs/libSynapse.so(DfaBase::notifyHlthunkFailure(DfaErrorCode)+0x75) [0x7f8d708ea2b5]
/usr/lib/habanalabs/libSynapse.so(synSingleton::notifyHlthunkFailure(DfaErrorCode)+0x42) [0x7f8d70ae0bb2]
/usr/lib/habanalabs/libSynapse.so(hclNotifyFailure(DfaErrorCode, unsigned long)+0x44) [0x7f8d70923e84]
/usr/lib/habanalabs/libhcl.so(waitForList(std::__cxx11::list<HCL_Request, std::allocator<HCL_Request> >&, unsigned short)+0x1140) [0x7f8d6f5f7d20]
/usr/lib/habanalabs/libhcl.so(HclDevice::sync(unsigned int, unsigned short)+0x49b) [0x7f8d6f825a7b]
/usr/lib/habanalabs/libhcl.so(HclDevice::onNewCommEnd(unsigned int, HclConfig&)+0x9a) [0x7f8d6f82f9da]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_communicator::initialize(hccl::internal_unique_id_t const*)+0x2569) [0x7f8d6f424ed9]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_context::comm_init_rank(void**, int, hcclUniqueId&, int)+0x121) [0x7f8d6f43d031]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_Original(void**, int, hcclUniqueId&, int)+0xcb) [0x7f8d6f3b398b]
/opt/habanalabs/habana_plugins/libhost_profiler.so(HcclSingletonHostProfiler::hcclCommInitRank(void**, int, hcclUniqueId&, int)+0x59) [0x7f8d6be0da59]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_impl(void**, int, hcclUniqueId, int)+0x37) [0x7f8d6f3a0447]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(std::_Function_handler<hcclResult_t (void**, int, hcclUniqueId, int), hcclResult_t (*)(void**, int, hcclUniqueId, int)
>::_M_invoke(std::_Any_data const&, void**&&, int&&, hcclUniqueId&&, int&&)+0x2e) [0x7f8d73c61c7e]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(hcclCommInitRank+0x6a) [0x7f8d73c4e26a]
backtrace (up to 30)
/usr/lib/habanalabs/libhl_logger.so(hl_logger::v1_0::logStackTrace(std::ostream&)+0x50) [0x7f1883987240]
/usr/lib/habanalabs/libSynapse.so(+0x12676ec) [0x7f18857486ec]
/usr/lib/habanalabs/libSynapse.so(std::_Function_handler<void (int, char const*, bool), hl_logger::v1_2_fmt_compile::ModuleLoggerData<synapse::LogManager::LogType>::ModuleLoggerData(char const*)::{lambda(int, char 
const*, bool)#1}>::_M_invoke(std::_Any_data const&, int&&, char const*&&, bool&&)+0x1f) [0x7f188574924f]
/usr/lib/habanalabs/libhl_logger.so(+0xfb47) [0x7f1883986b47]
/usr/lib/habanalabs/libhl_logger.so(signalHandler(int, siginfo_t*, void*)+0x29) [0x7f1883990a09]
===============================================================================
====================== USER CODE STACK TRACE START POINT ======================
===============================================================================
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f19e53ca090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f19e53ca00b]
/usr/lib/habanalabs/libTPCFuser.so(+0x17e7058) [0x7f1882013058]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f19e53ca090]
/lib/x86_64-linux-gnu/libc.so.6(__read+0x4c) [0x7f19e549500c]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_underflow+0x17f) [0x7f19e5417b9f]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_uflow+0x36) [0x7f19e5418f86]
/lib/x86_64-linux-gnu/libc.so.6(_IO_getline_info+0xac) [0x7f19e540a86c]
/lib/x86_64-linux-gnu/libc.so.6(fgets+0x9a) [0x7f19e54096ca]
/usr/lib/habanalabs/libSynapse.so(exec(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x129) [0x7f18852749e9]
/usr/lib/habanalabs/libSynapse.so(DfaBase::logHlSmi()+0x12a) [0x7f1885274e3a]
/usr/lib/habanalabs/libSynapse.so(DfaBase::checkFailure(DfaStatus)+0x370) [0x7f1885283c90]
/usr/lib/habanalabs/libSynapse.so(DfaBase::notifyHlthunkFailure(DfaErrorCode)+0x75) [0x7f18852732b5]
/usr/lib/habanalabs/libSynapse.so(synSingleton::notifyHlthunkFailure(DfaErrorCode)+0x42) [0x7f1885469bb2]
/usr/lib/habanalabs/libSynapse.so(hclNotifyFailure(DfaErrorCode, unsigned long)+0x44) [0x7f18852ace84]
/usr/lib/habanalabs/libhcl.so(waitForList(std::__cxx11::list<HCL_Request, std::allocator<HCL_Request> >&, unsigned short)+0x1140) [0x7f1883f80d20]
/usr/lib/habanalabs/libhcl.so(HclDevice::sync(unsigned int, unsigned short)+0x49b) [0x7f18841aea7b]
/usr/lib/habanalabs/libhcl.so(HclDevice::onNewCommEnd(unsigned int, HclConfig&)+0x9a) [0x7f18841b89da]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_communicator::initialize(hccl::internal_unique_id_t const*)+0x2569) [0x7f1883daded9]
/usr/lib/habanalabs/libhcl.so(hccl::hccl_context::comm_init_rank(void**, int, hcclUniqueId&, int)+0x121) [0x7f1883dc6031]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_Original(void**, int, hcclUniqueId&, int)+0xcb) [0x7f1883d3c98b]
/opt/habanalabs/habana_plugins/libhost_profiler.so(HcclSingletonHostProfiler::hcclCommInitRank(void**, int, hcclUniqueId&, int)+0x59) [0x7f1880796a59]
/usr/lib/habanalabs/libhcl.so(hcclCommInitRank_impl(void**, int, hcclUniqueId, int)+0x37) [0x7f1883d29447]
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_backend.so(std::_Function_handler<hcclResult_t (void**, int, hcclUniqueId, int), hcclResult_t (*)(void**, int, hcclUniqueId, int)
>::_M_invoke(std::_Any_data const&, void**&&, int&&, hcclUniqueId&&, int&&)+0x2e) [0x7f18885eac7e]

For YOLOX, I see the docker image is 1.9. Can you verify that your driver version matches your docker version by running hl-smi, or hl-smi | grep -i "driver version"?

Is this TensorFlow segmentation model from Model-References, or is it a private workload?

There are some optimizations that could be tried:
for example, if the model has dynamic shapes (which can be detected like this), it might be slower, so you could try to rewrite those parts of the model.

Also you can profile to check the reason for slowness.

Here are some other optimization ideas.

@Sayantan_S regarding YOLOX on Gaudi1, below is the hl-smi output. I tried cloning the 1.5.0 SynapseAI code, but it doesn't have the YOLOX code in it. Could you guide me? This is an old EC2 Gaudi1 instance; do you want me to spin up a new Gaudi1 instance?

git clone -b 1.5.0 https://github.com/HabanaAI/Model-References

The driver and docker versions are expected to be the same for it to work. If you have a 1.5 driver, use a docker from 1.5. However, 1.5 is pretty old (almost a year old), so it might make sense to upgrade to a newer release (1.10).
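
As a rough sketch of that check, the release number can be pulled out of the docker image path and compared with the driver version string. The helper names below are hypothetical, the driver string is assumed to start with X.Y.Z (the exact hl-smi output format is not shown in this thread), and comparing only major.minor is an assumption based on the "1.5 driver, 1.5 docker" wording above.

```python
import re


def image_release(docker_image):
    """Return (major, minor) from a path like .../gaudi-docker/1.9.0/..., or None."""
    m = re.search(r"gaudi-docker/(\d+)\.(\d+)\.(\d+)", docker_image)
    return (int(m.group(1)), int(m.group(2))) if m else None


def driver_release(driver_version):
    """Return (major, minor) from a driver string assumed to begin with X.Y.Z."""
    m = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", driver_version)
    return (int(m.group(1)), int(m.group(2))) if m else None


def releases_match(driver_version, docker_image):
    """True only if both releases parse and agree on major.minor."""
    drv = driver_release(driver_version)
    img = image_release(docker_image)
    return drv is not None and drv == img


IMAGE = ("vault.habana.ai/gaudi-docker/1.9.0/ubuntu20.04/"
         "habanalabs/pytorch-installer-1.13.1")

# Driver strings here are typed in by hand, not read from hl-smi.
for driver in ("1.9.0", "1.6.0"):
    status = "match" if releases_match(driver, IMAGE) else "MISMATCH"
    print(f"driver {driver} vs image {image_release(IMAGE)}: {status}")
```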

@Sayantan_S, I am using driver version 1.6.0 (no docker) with a git clone of the 1.6.0 SynapseAI version, and when running YOLOX I get the following error.

2023-06-07 00:07:17 | INFO     | yolox.core.trainer:375 - epoch: 1/1, iter: 170/180, iter_time: 0.507s, data_time: 0.001s, global_avg_time: 3.422s, total_loss: 11.5, iou_loss: 3.4, l1_loss: 1.5, conf_loss: 5.2, cls
_loss: 1.4, lr: 7.136e-04, size: 672, ETA: 0:00:34                                                                                                                                                                    
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::40(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 10, count: 11                                                                                                                                                               
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::40(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 10, count: 11                                                                                                                                                               
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/infra/hcl_event.cpp::40(waitForList): The condition [ count++ < GCFG_MAX_WAIT_ATTEMPTS.value() ] failed. waitForList: 
waitForHandle timed out. maxWaitAttempts: 10, count: 11                                                                                                                                                               
hcl_communicator.cpphcl_communicator.cpp::hcl_communicator.cpp::79::79 79 destroy destroy(...) destroy(...) HCL_CommDestroy returned with an error status=-1(...) HCL_CommDestroy returned with an error status=-1HCL_
CommDestroy returned with an error status=-1                                                                                                                                                                          
                                                                                                                                                                                                                      
                                                                                                                                                                                                                      
terminate called after throwing an instance of 'c10::Error'                                                                                                                                                           
  what():  FATAL ERROR :: MODULE:SYNHELPER synDeviceRelease failed with. Status: 26                                                                                                                                   
Habana exception raised from ~device_id at device.cpp:1045 (most recent call first):                                                                                                                                  
frame #0: synapse_helpers::device_id::~device_id() + 0x44a (0x7f4b38bd291a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)                                       
frame #1: synapse_helpers::device::~device() + 0x5a9 (0x7f4b38bd6cf9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)                                             
frame #2: std::_Sp_counted_ptr<synapse_helpers::device*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7f4b38bdd556 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_hel
pers.so)                                                                                                                                                                                                              
frame #3: synapse_helpers::HPURegistrarPerThreadTracker::~HPURegistrarPerThreadTracker() + 0x12d (0x7f4b393743cd in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)   
frame #4: __call_tls_dtors + 0x3f (0x7f4bf87f72bf in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                 
frame #5: <unknown function> + 0x46a0d (0x7f4bf87f6a0d in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                            
frame #6: on_exit + 0 (0x7f4bf87f6a60 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                             
frame #7: /usr/bin/python3() [0x67f9bb]                                                                                                                                                                               
frame #8: /usr/bin/python3() [0x67f9db]                                                                                                                                                                               
frame #9: PyErr_PrintEx + 0x16 (0x67fc06 in /usr/bin/python3)                                                                                                                                                         
frame #10: PyRun_SimpleStringFlags + 0x52 (0x67fc72 in /usr/bin/python3)                                                                                                                                              
frame #11: Py_RunMain + 0x2cc (0x6b7d3c in /usr/bin/python3)                                                                                                                                                          
frame #12: Py_BytesMain + 0x2d (0x6b800d in /usr/bin/python3)                                                                                                                                                         
frame #13: __libc_start_main + 0xf3 (0x7f4bf87d4083 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                               
frame #14: _start + 0x2e (0x5fb85e in /usr/bin/python3)                                                                                                                                                               
                                                                                                                                                                                                                      
Internal Error: Received signal - Aborted                                                                                                                                                                             
frame #0: dumpStack(int) + 0xc3 (0x7f4b37470683 in /usr/lib/habanalabs/libSynapse.so)                                                                                                                                 
frame #1: <unknown function> + 0x43090 (0x7f4bf87f3090 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                            
frame #2: gsignal + 0xcb (0x7f4bf87f300b in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                          
frame #3: abort + 0x12b (0x7f4bf87d2859 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                           
frame #4: <unknown function> + 0x9e911 (0x7f4bf8486911 in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                       
frame #5: <unknown function> + 0xaa38c (0x7f4bf849238c in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                       
frame #6: <unknown function> + 0xa9369 (0x7f4bf8491369 in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                       
frame #7: __gxx_personality_v0 + 0x2a1 (0x7f4bf8491d21 in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                       
frame #8: __libunwind_Unwind_RaiseException + 0x1db (0x7f4bf85cddfb in /lib/x86_64-linux-gnu/libunwind.so.8)                                                                                                          
frame #9: __cxa_throw + 0x3c (0x7f4bf849269c in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                                 
frame #10: Logger::habana_assert(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x5cb (0x7f4b391ace6e in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #11: synapse_helpers::device_id::~device_id() + 0x44a (0x7f4b38bd291a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)                                      
frame #12: synapse_helpers::device::~device() + 0x5a9 (0x7f4b38bd6cf9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)                                            
frame #13: std::_Sp_counted_ptr<synapse_helpers::device*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7f4b38bdd556 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #14: synapse_helpers::HPURegistrarPerThreadTracker::~HPURegistrarPerThreadTracker() + 0x12d (0x7f4b393743cd in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)  
frame #15: __call_tls_dtors + 0x3f (0x7f4bf87f72bf in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                
frame #16: <unknown function> + 0x46a0d (0x7f4bf87f6a0d in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                           
frame #17: on_exit + 0 (0x7f4bf87f6a60 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                            
frame #18: /usr/bin/python3() [0x67f9bb] 
frame #19: /usr/bin/python3() [0x67f9db]                                                                                                                                                                              
frame #20: PyErr_PrintEx + 0x16 (0x67fc06 in /usr/bin/python3)                                                                                                                                                        
frame #21: PyRun_SimpleStringFlags + 0x52 (0x67fc72 in /usr/bin/python3)                                                                                                                                              
frame #22: Py_RunMain + 0x2cc (0x6b7d3c in /usr/bin/python3)
frame #23: Py_BytesMain + 0x2d (0x6b800d in /usr/bin/python3)
frame #24: __libc_start_main + 0xf3 (0x7f4bf87d4083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: _start + 0x2e (0x5fb85e in /usr/bin/python3)

terminate called after throwing an instance of 'c10::Error'
  what():  FATAL ERROR :: MODULE:SYNHELPER synDeviceRelease failed with. Status: 26
Habana exception raised from ~device_id at device.cpp:1045 (most recent call first):
frame #0: synapse_helpers::device_id::~device_id() + 0x44a (0x7f69bafd191a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #1: synapse_helpers::device::~device() + 0x5a9 (0x7f69bafd5cf9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #2: std::_Sp_counted_ptr<synapse_helpers::device*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7f69bafdc556 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #3: synapse_helpers::HPURegistrarPerThreadTracker::~HPURegistrarPerThreadTracker() + 0x12d (0x7f69bb7733cd in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: __call_tls_dtors + 0x3f (0x7f6a7abf62bf in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x46a0d (0x7f6a7abf5a0d in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: on_exit + 0 (0x7f6a7abf5a60 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: /usr/bin/python3() [0x67f9bb]
frame #8: /usr/bin/python3() [0x67f9db]
frame #9: PyErr_PrintEx + 0x16 (0x67fc06 in /usr/bin/python3)
frame #10: PyRun_SimpleStringFlags + 0x52 (0x67fc72 in /usr/bin/python3)
frame #11: Py_RunMain + 0x2cc (0x6b7d3c in /usr/bin/python3)
frame #12: Py_BytesMain + 0x2d (0x6b800d in /usr/bin/python3)
frame #13: __libc_start_main + 0xf3 (0x7f6a7abd3083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #14: _start + 0x2e (0x5fb85e in /usr/bin/python3)

terminate called after throwing an instance of 'c10::Error'
  what():  FATAL ERROR :: MODULE:SYNHELPER synDeviceRelease failed with. Status: 26
Habana exception raised from ~device_id at device.cpp:1045 (most recent call first):
frame #0: synapse_helpers::device_id::~device_id() + 0x44a (0x7f1da14db91a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #1: synapse_helpers::device::~device() + 0x5a9 (0x7f1da14dfcf9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #2: std::_Sp_counted_ptr<synapse_helpers::device*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7f1da14e6556 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #3: synapse_helpers::HPURegistrarPerThreadTracker::~HPURegistrarPerThreadTracker() + 0x12d (0x7f1da1c7d3cd in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: __call_tls_dtors + 0x3f (0x7f1e611002bf in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x46a0d (0x7f1e610ffa0d in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: on_exit + 0 (0x7f1e610ffa60 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: /usr/bin/python3() [0x67f9bb]
frame #8: /usr/bin/python3() [0x67f9db]
frame #9: PyErr_PrintEx + 0x16 (0x67fc06 in /usr/bin/python3)
frame #10: PyRun_SimpleStringFlags + 0x52 (0x67fc72 in /usr/bin/python3)
frame #11: Py_RunMain + 0x2cc (0x6b7d3c in /usr/bin/python3)
frame #12: Py_BytesMain + 0x2d (0x6b800d in /usr/bin/python3)
frame #13: __libc_start_main + 0xf3 (0x7f1e610dd083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #14: _start + 0x2e (0x5fb85e in /usr/bin/python3)

Internal Error: Received signal - Aborted
frame #0: dumpStack(int) + 0xc3 (0x7f69b986f683 in /usr/lib/habanalabs/libSynapse.so)
frame #1: <unknown function> + 0x43090 (0x7f6a7abf2090 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: gsignal + 0xcb (0x7f6a7abf200b in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: abort + 0x12b (0x7f6a7abd1859 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x9e911 (0x7f6a7a885911 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0xaa38c (0x7f6a7a89138c in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0xa9369 (0x7f6a7a890369 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: __gxx_personality_v0 + 0x2a1 (0x7f6a7a890d21 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: __libunwind_Unwind_RaiseException + 0x1db (0x7f6a7a9ccdfb in /lib/x86_64-linux-gnu/libunwind.so.8)
frame #9: __cxa_throw + 0x3c (0x7f6a7a89169c in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #10: Logger::habana_assert(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x5cb (0x7f69bb5abe6e in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #11: synapse_helpers::device_id::~device_id() + 0x44a (0x7f69bafd191a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #12: synapse_helpers::device::~device() + 0x5a9 (0x7f69bafd5cf9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #13: std::_Sp_counted_ptr<synapse_helpers::device*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7f69bafdc556 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #14: synapse_helpers::HPURegistrarPerThreadTracker::~HPURegistrarPerThreadTracker() + 0x12d (0x7f69bb7733cd in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #15: __call_tls_dtors + 0x3f (0x7f6a7abf62bf in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x46a0d (0x7f6a7abf5a0d in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: on_exit + 0 (0x7f6a7abf5a60 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: /usr/bin/python3() [0x67f9bb]
frame #19: /usr/bin/python3() [0x67f9db]
frame #20: PyErr_PrintEx + 0x16 (0x67fc06 in /usr/bin/python3)
frame #21: PyRun_SimpleStringFlags + 0x52 (0x67fc72 in /usr/bin/python3)
frame #22: Py_RunMain + 0x2cc (0x6b7d3c in /usr/bin/python3)
frame #23: Py_BytesMain + 0x2d (0x6b800d in /usr/bin/python3)
frame #24: __libc_start_main + 0xf3 (0x7f6a7abd3083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: _start + 0x2e (0x5fb85e in /usr/bin/python3)

Internal Error: Received signal - Aborted
frame #0: dumpStack(int) + 0xc3 (0x7f1d9fd79683 in /usr/lib/habanalabs/libSynapse.so)
frame #1: <unknown function> + 0x43090 (0x7f1e610fc090 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: gsignal + 0xcb (0x7f1e610fc00b in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: abort + 0x12b (0x7f1e610db859 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x9e911 (0x7f1e60d8f911 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0xaa38c (0x7f1e60d9b38c in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0xa9369 (0x7f1e60d9a369 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: __gxx_personality_v0 + 0x2a1 (0x7f1e60d9ad21 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: __libunwind_Unwind_RaiseException + 0x1db (0x7f1e60ed6dfb in /lib/x86_64-linux-gnu/libunwind.so.8)
frame #9: __cxa_throw + 0x3c (0x7f1e60d9b69c in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #10: Logger::habana_assert(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x5cb (0x7f1da1ab5e6e in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #11: synapse_helpers::device_id::~device_id() + 0x44a (0x7f1da14db91a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #12: synapse_helpers::device::~device() + 0x5a9 (0x7f1da14dfcf9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #13: std::_Sp_counted_ptr<synapse_helpers::device*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7f1da14e6556 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libpytorch_synapse_helpers.so)
frame #14: synapse_helpers::HPURegistrarPerThreadTracker::~HPURegistrarPerThreadTracker() + 0x12d (0x7f1da1c7d3cd in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #15: __call_tls_dtors + 0x3f (0x7f1e611002bf in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x46a0d (0x7f1e610ffa0d in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: on_exit + 0 (0x7f1e610ffa60 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: /usr/bin/python3() [0x67f9bb]
frame #19: /usr/bin/python3() [0x67f9db]
frame #20: PyErr_PrintEx + 0x16 (0x67fc06 in /usr/bin/python3)
frame #21: PyRun_SimpleStringFlags + 0x52 (0x67fc72 in /usr/bin/python3)
frame #22: Py_RunMain + 0x2cc (0x6b7d3c in /usr/bin/python3)
frame #23: Py_BytesMain + 0x2d (0x6b800d in /usr/bin/python3)
frame #24: __libc_start_main + 0xf3 (0x7f1e610dd083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: _start + 0x2e (0x5fb85e in /usr/bin/python3)

/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-110) status(2) csid(6054) csHandle(16666)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16667) csid(6055) csHandle(16667)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16668) csid(6056) csHandle(16668)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16668) csid(6056) csHandle(16668)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16669) csid(6057) csHandle(16669)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16670) csid(6058) csHandle(16670)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16671) csid(6059) csHandle(16671)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16672) csid(6060) csHandle(16672)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16673) csid(6061) csHandle(16673)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16674) csid(6062) csHandle(16674)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16111) csid(5496) csHandle(16111)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(14796) csid(4180) csHandle(14796)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16675) csid(6063) csHandle(16675)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16676) csid(6064) csHandle(16676)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16677) csid(6065) csHandle(16677)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16678) csid(6066) csHandle(16678)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16112) csid(5497) csHandle(16112)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(14797) csid(4181) csHandle(14797)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16113) csid(5498) csHandle(16113)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(14798) csid(4182) csHandle(14798)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16679) csid(6067) csHandle(16679)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16114) csid(5499) csHandle(16114)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(14799) csid(4183) csHandle(14799)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16680) csid(6068) csHandle(16680)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16115) csid(5500) csHandle(16115)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(14800) csid(4184) csHandle(14800)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16681) csid(6069) csHandle(16681)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(16116) csid(5501) csHandle(16116)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-16) status(14801) csid(4185) csHandle(14801)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::152(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs
Traceback (most recent call last):
  File "tools/train.py", line 168, in <module>
    launch(
  File "/home/ubuntu/data/Model-References/PyTorch/computer_vision/detection/yolox/yolox/core/launch.py", line 90, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 149, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 7 terminated with exit code 255
ubuntu@ip-10-0-30-59:~/data/Model-References/PyTorch/computer_vision/detection/yolox$ /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 108 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
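
To make the run of `hlthunk_wait_for_cs() failed` messages above easier to read, here is a minimal Python sketch that pulls out each command submission id and its return code. It assumes, and the log does not confirm this, that `rc` is a negated Linux errno value, in which case -110 would be ETIMEDOUT and -16 would be EBUSY. The function name `summarize_cs_failures` and the shortened sample text are hypothetical, made up for this illustration.

```python
import re

# Linux errno names for the two return codes that appear in the log.
# Assumption: rc is a negated Linux errno; the log itself does not say so.
LINUX_ERRNO_NAMES = {16: "EBUSY", 110: "ETIMEDOUT"}

CS_FAILURE = re.compile(
    r"hlthunk_wait_for_cs\(\) failed with "
    r"rc\((-?\d+)\) status\((\d+)\) csid\((\d+)\)"
)


def summarize_cs_failures(log_text):
    """Return (csid, rc, errno_name) tuples in the order they appear."""
    # Terminal wrapping can split a message across lines, so rejoin first.
    joined = log_text.replace("\n", "")
    rows = []
    for rc, _status, csid in CS_FAILURE.findall(joined):
        rc = int(rc)
        name = LINUX_ERRNO_NAMES.get(-rc, "unknown")
        rows.append((int(csid), rc, name))
    return rows


# Shortened sample: the source path prefix is replaced by "..." here.
sample = (
    "...(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs\n"
    "() failed with rc(-110) status(2) csid(6054) csHandle(16666) \n"
    "...(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs\n"
    "() failed with rc(-16) status(16667) csid(6055) csHandle(16667) \n"
)

for row in summarize_cs_failures(sample):
    print(row)
```

If the errno assumption holds, the first failing submission (csid 6054) timed out and the ones after it came back busy, which would suggest one initial hang followed by cascading failures rather than many independent errors.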

Can you please try running the MNIST model (check out the 1.6 version of Model-References if you are using the 1.6 SW stack)?

This will help establish whether your SW stack is installed correctly.
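
As an even smaller first check before the MNIST run, a sketch like the following tries a single tensor add on the device. It assumes a release in which `import habana_frameworks.torch.core` is what registers the `hpu` device with PyTorch; other releases may load the backend differently, so treat the import line as an assumption. `hpu_smoke_test` is a hypothetical helper name, and this is not a replacement for the MNIST test, which also exercises the data loading and training loop.

```python
def hpu_smoke_test():
    """Try one small op on the HPU and return (ok, message)."""
    try:
        import torch
        # Assumption: importing this module registers the "hpu" device.
        import habana_frameworks.torch.core as htcore

        x = torch.ones(2, 2, device="hpu")
        y = x + x
        htcore.mark_step()  # trigger execution of queued ops in lazy mode
        total = y.to("cpu").sum().item()
    except Exception as exc:  # e.g. ImportError or RuntimeError
        return False, f"{type(exc).__name__}: {exc}"
    if total != 8.0:
        return False, f"unexpected result {total}"
    return True, "HPU add succeeded"


if __name__ == "__main__":
    print(hpu_smoke_test())
```

A failure at the import step would point at the installation itself, while a failure at the tensor op would point at device access or the driver. Note that an abort raised during process exit, like the `synDeviceRelease` failure in the log above, would not be caught by the `try` block.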