Our company has been both serving and fine-tuning LLMs (Mistral, Llama 2, WizardCoder, etc.) on the 8x Gaudi2 machines in Intel Developer Cloud (IDC). Up to this point, we have been using a custom (and rather simple) FastAPI REST API server to serve LLM text completions. This works fine until the inference servers come under significant load. Even when we load balance across multiple replicas, we hit issues with concurrent requests that should really be batched.
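For context, our current serving path is essentially one generate call per request, roughly like the sketch below (simplified, with a placeholder model ID and plain Transformers loading; the HPU-specific setup and our fine-tuned checkpoints are omitted):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Placeholder model ID; in practice we load our own fine-tuned checkpoints.
MODEL_ID = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

class CompletionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/completions")
def complete(req: CompletionRequest):
    # Each request is tokenized and generated independently (batch size 1),
    # so concurrent requests simply queue up behind one another on the accelerator.
    inputs = tokenizer(req.prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```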
I know that TGI, Triton, vLLM, etc. support best-practice batching of incoming requests for LLMs hosted on NVIDIA GPUs. In particular, it seems that continuous batching (aka “iteration-level scheduling”) allows for drastically increased throughput without any modifications to the underlying LLM.
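To be clear about what I mean: even plain dynamic batching (coalesce requests for a few milliseconds, then run one batched generate call) would help, and is roughly what I would hack together myself; continuous batching goes further by admitting new requests at every decoding iteration instead of waiting for the whole batch to finish. A rough sketch of that dynamic-batching fallback (asyncio-based, placeholder names, not production code):

```python
import asyncio

MAX_BATCH_SIZE = 8   # assumed tuning knobs, not measured values
MAX_WAIT_MS = 10

request_queue: asyncio.Queue = asyncio.Queue()

async def submit(prompt: str) -> str:
    # Called from the FastAPI handler: enqueue the prompt and await its result.
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, future))
    return await future

async def batching_loop(generate_batch):
    # generate_batch(list[str]) -> list[str] is assumed to run one padded,
    # batched generate() call on the accelerator.
    while True:
        prompt, future = await request_queue.get()
        batch = [(prompt, future)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        # Coalesce whatever else arrives before the deadline, up to MAX_BATCH_SIZE.
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        prompts = [p for p, _ in batch]
        completions = await asyncio.to_thread(generate_batch, prompts)
        for (_, fut), text in zip(batch, completions):
            fut.set_result(text)
```

The obvious downside is that short and long generations get padded and scheduled together, which is exactly the inefficiency continuous batching is supposed to eliminate.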
What is the best current inference server implementation that supports Gaudi2, and do any of these support continuous batching or other throughput-enhancing schemes? I’ve looked at the Optimum Habana implementation of TGI (https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference), and I can get it to work fine. However, it currently seems to only support batches of size 1, which limits its utility.
Before I jump into some implementation of continuous batching myself, any suggestions that would save me that sadness?