What is --enforce-eager?

I am running vLLM inference on Gaudi 2 with --model meta-llama/Meta-Llama-3-8B-Instruct --dtype float16 --max-num-seqs 2048 --block-size 128. The model isn't loading without the --enforce-eager flag. What does --enforce-eager do?

“The model isn't loading”: can you describe that in more detail? For example, is it crashing, hanging, or producing bad results?

Although the name --enforce-eager suggests that it switches between lazy and eager mode, it actually controls whether HPU graphs are used. This is in line with the flag's original purpose, which was to turn CUDA graphs on or off, as mentioned here

You can see it in use here
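For reference, the same switch is available from the Python API as the enforce_eager keyword of vllm.LLM. Here is a minimal sketch that mirrors the settings from the question. The helper names build_engine_kwargs and load_model are made up for illustration, and I am assuming the Gaudi build accepts the same block size and sequence count through the Python API as it does on the command line.

```python
def build_engine_kwargs(use_graphs: bool) -> dict:
    """Hypothetical helper: settings from the question as vllm.LLM kwargs."""
    return {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "dtype": "float16",
        "max_num_seqs": 2048,
        "block_size": 128,  # value taken from the question; assumed valid on Gaudi
        # enforce_eager=True turns graph capture off
        # (HPU graphs on Gaudi, CUDA graphs on GPU).
        "enforce_eager": not use_graphs,
    }


def load_model(use_graphs: bool):
    """Hypothetical helper: build the engine; needs vLLM installed."""
    from vllm import LLM  # imported lazily so the helper above works without vLLM

    return LLM(**build_engine_kwargs(use_graphs))
```

Calling load_model(use_graphs=False) should correspond to passing --enforce-eager on the command line.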

Please check the second point of this section. HPU graphs can take more memory, so I suspect your model runs out of memory when HPU graphs are enabled, but is able to run when --enforce-eager disables them.
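If you want to test that suspicion, one option is to try loading with graphs first and fall back to eager mode only when that fails. This is just a sketch: load_with_fallback is a made-up helper, and it assumes the failed load raises a Python exception instead of killing the process, which may not hold for a device out-of-memory error. Memory held by the failed attempt may also not be released, so a retry in a fresh process is safer.

```python
from typing import Any, Callable, Tuple


def load_with_fallback(loader: Callable[..., Any], **engine_kwargs: Any) -> Tuple[Any, bool]:
    """Hypothetical helper: try graph mode, then retry with enforce_eager=True.

    Returns (engine, used_eager).
    """
    try:
        # First attempt: graphs enabled.
        return loader(**{**engine_kwargs, "enforce_eager": False}), False
    except Exception as err:
        # Assumption: the failure surfaces as an ordinary exception.
        print(f"Graph-mode load failed ({err!r}); retrying with enforce_eager=True")
        return loader(**{**engine_kwargs, "enforce_eager": True}), True
```

With vLLM installed, loader would be vllm.LLM and the keyword arguments would be your model, dtype, and so on. If the second element of the result is True, the model only loaded with graphs disabled, which would support the memory explanation.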

The meaning of the flag might change over time. For example, there was a recent change here, whose description includes a table showing the usage of enforce_eager in more detail: