Llama inference result with infinite eot_id tokens

Hi, I’ve run into an inference issue while using the Llama 3.1 model from Hugging Face.
When I generate answers with Meta-Llama-3.1-8B-Instruct, the model repeats the <|eot_id|> token at the end of the answer until it reaches the maximum length.
How can I solve this problem?

def generate(self, input_data: str, temperature):
    inputs = self.tokenizer(input_data, return_tensors="pt").to(self.device)
    with torch.no_grad():
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=4096,  # could also use self.max_new_tokens (e.g. 1024)
            return_dict_in_generate=True,
            output_scores=True,
            output_hidden_states=True,
            temperature=temperature,
            do_sample=True,
            stopping_criteria=None,
            pad_token_id=self.tokenizer.eos_token_id,
        )

    # Decode including special tokens; this is where the repeated <|eot_id|> shows up.
    full_answer = self.tokenizer.decode(
        outputs.sequences[0], skip_special_tokens=False)
    return full_answer

To run generation, I’d suggest looking here: optimum-habana/examples/text-generation at main · huggingface/optimum-habana · GitHub

Specifically, we pass in a generation_config that makes the run optimal for HPU:

You can put a breakpoint/pdb before this line and check out the generation_config when running a command like:

PT_ENABLE_INT64_SUPPORT=1 python run_generation.py --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --trim_logits --use_kv_cache --attn_softmax_bf16 --bf16 --bucket_internal  --bucket_size=128  --use_flash_attention --flash_attention_recompute --batch_size 16 --max_input_tokens 2048 --max_new_tokens 2048
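For reference, here is a minimal sketch of what inspecting a generation config before the generate() call can look like with plain transformers (it uses only standard GenerationConfig fields; the Gaudi-specific knobs from the command above are set inside the optimum-habana example script, and the temperature value here is just a placeholder):

import pdb

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Collect the generation settings in one explicit object instead of scattered kwargs.
gen_config = GenerationConfig(
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,  # placeholder value
    pad_token_id=tokenizer.eos_token_id,
)

# Drop into the debugger here to compare gen_config with the defaults the
# checkpoint ships with (model.generation_config) before generation starts.
pdb.set_trace()

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))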

Also, you may need a warmup to be more performant (but that shouldn't change accuracy).

The same issue still occurs even when I pass in the generation config.

Which generation config settings did you use?

Hi, can you try one of the options below? (A sketch of both is included after this list.)

  1. Set skip_special_tokens=True in self.tokenizer.decode; that will remove all <|eot_id|> tokens from the decoded output.
  2. Add ignore_eos=True and lazy_mode=True to self.model.generate; this also removes the <|eot_id|> tokens and performs better.
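A minimal sketch of both options applied to the generate method from the first post (option 2 assumes the model was loaded through optimum-habana on HPU, since ignore_eos and lazy_mode are Gaudi-specific generation kwargs rather than plain transformers arguments):

# Option 1: strip <|eot_id|> and other special tokens while decoding
# (works with plain transformers as well).
full_answer = self.tokenizer.decode(
    outputs.sequences[0], skip_special_tokens=True)

# Option 2: pass the Gaudi-specific flags to generate()
# (sketch; assumes self.model is an optimum-habana wrapped model running on HPU).
outputs = self.model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=temperature,
    do_sample=True,
    pad_token_id=self.tokenizer.eos_token_id,
    return_dict_in_generate=True,
    ignore_eos=True,   # Gaudi-only kwarg from optimum-habana
    lazy_mode=True,    # Gaudi-only kwarg from optimum-habana
)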