I'm running Habana's models and I don't see the same level of performance as what is published on GitHub and the Developer site.

Please refer to the Model Optimization Guide for a list of optimizations. Common methods include tuning the batch size, using mixed precision instead of float32, and implementing data pipelines that prefetch to the device. Also make sure ops are running on the HPU rather than falling back to the host CPU. The relevant section of the Optimization Guide has more details.
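As a rough illustration of batch size tuning, the sketch below times a training step at several batch sizes and reports samples per second. It uses only the Python standard library and is not taken from Habana's documentation. The names `measure_throughput`, `sweep_batch_sizes`, and `fake_step` are hypothetical, and `fake_step` is a stand-in that should be replaced with a real training step. On a real device the step should block until the device has finished its work, otherwise the timing only measures dispatch.

```python
import time
from typing import Callable, Dict, Iterable


def measure_throughput(step: Callable[[int], None], batch_size: int,
                       warmup: int = 2, iters: int = 5) -> float:
    """Return samples per second for `step` at the given batch size."""
    for _ in range(warmup):
        step(batch_size)              # warm-up iterations are not timed
    start = time.perf_counter()
    for _ in range(iters):
        step(batch_size)
    elapsed = time.perf_counter() - start
    return (batch_size * iters) / elapsed


def sweep_batch_sizes(step: Callable[[int], None],
                      batch_sizes: Iterable[int]) -> Dict[int, float]:
    """Measure throughput for each candidate batch size."""
    return {bs: measure_throughput(step, bs) for bs in batch_sizes}


def fake_step(batch_size: int) -> None:
    # Hypothetical stand-in for one training step: a fixed overhead
    # plus a small per-sample cost. Replace with your real step.
    time.sleep(0.002 + 0.0001 * batch_size)


if __name__ == "__main__":
    results = sweep_batch_sizes(fake_step, [8, 16, 32, 64])
    for bs, samples_per_sec in results.items():
        print(f"batch_size={bs:3d}  {samples_per_sec:8.1f} samples/s")
```

Comparing the reported samples per second across batch sizes gives a starting point. The largest batch size that fits in device memory is often, though not always, the fastest.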