Current best inference server implementation for Gaudi2

dwhitena · November 16, 2023, 6:35pm

Our company has been both serving some models and fine-tuning some LLM models (Mistral, Llama 2, WizardCoder, etc.) on the 8x Gaudi2 machines in Intel Developer Cloud (IDC). Up to this point, we have been using a customer FastAPI (and rather simple) REST API server to server LLM text completions. This works fine until there is significant load on the inference servers. Even if we load balance multiple replicas, we are hitting issues with concurrent requests and what should really be batched requests.

I know that TGI, Triton, vLLM, etc. support best practice batching of incoming requests to LLMs that are hosted on NVIDIA GPUs. In particular, it seems that continuous batching (aka “iteration-level scheduling”) allows for drastically increased throughput without making modifications to the underlying LLM.

What is the best current inference server implementation that supports Gaudi2 and do any of these support continuous batching or other throughput enhancing schemes? I’ve looked at the Optimum Habana implementation of TGI (https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference), and I can get that to work fine. However, it seems to only currently support batches of size 1, which limits it’s utility.

Before I jump into some implementation of continuous batching myself, any suggestions that would save me that sadness?

Sayantan_S · November 28, 2023, 5:27pm

Continuous batching is not supported yet.

LilRay · January 2, 2025, 3:28am

Hi, does any method currently support batch inference?

Sayantan_S · January 2, 2025, 3:32am

vllm and tgi is supported, both of which have continuous batching (and other serving optimizations that come with these frameworks).
Vllm:

https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/vLLM_Inference.html
https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html

TGI: GitHub - huggingface/tgi-gaudi: Large Language Model Text Generation Inference on Habana Gaudi

What do u mean by “batch inference”? Did you mean “continuous batching”

Topic		Replies	Views
SynapseAI 1.15.0 Release Announcements	0	354	April 17, 2024
Missing Results for LLaMA2 on Gaudi2 Inference	0	402	August 16, 2023
Llama inference result with infinite eot_id tokens Inference	4	198	December 3, 2024
Does Gaudi2 lib support Mixtral-8x7b? Inference models	1	156	March 29, 2024
What is --enforce-eager Inference	3	1116	July 30, 2024

Current best inference server implementation for Gaudi2

Related topics