Hi, I've run into some inference issues while using the Llama 3.1 model from Hugging Face.
When I generate answers with Meta-Llama-3.1-8B-Instruct, the output repeats the <|eot_id|> token at the end of the answer until it reaches the max length.
How can we solve this problem?
def generate(self, input_data: str, temperature):
    inputs = self.tokenizer(input_data, return_tensors="pt").to(self.device)
    with torch.no_grad():
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=4096,  # self.max_new_tokens  # 1024
            return_dict_in_generate=True,
            output_scores=True,
            output_hidden_states=True,
            temperature=temperature,
            do_sample=True,
            stopping_criteria=None,
            pad_token_id=self.tokenizer.eos_token_id,
        )
    full_answer = self.tokenizer.decode(
        outputs.sequences[0], skip_special_tokens=False)
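(For context: generate() only stops early when it samples one of the eos_token_id values, so one thing to check is whether <|eot_id|> is actually registered as an end-of-sequence id for this model. A minimal diagnostic sketch, assuming a standard transformers install; the IDs it prints are not quoted from this thread:)

from transformers import AutoTokenizer, GenerationConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # model from the question
tokenizer = AutoTokenizer.from_pretrained(model_id)
gen_config = GenerationConfig.from_pretrained(model_id)  # reads generation_config.json only

# ID that the chat template appends at the end of each assistant turn
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

print("<|eot_id|> id:            ", eot_id)
print("tokenizer.eos_token_id:   ", tokenizer.eos_token_id)
print("generation_config eos ids:", gen_config.eos_token_id)
# if eot_id is not among the EOS ids that generate() uses,
# generation keeps running until max_new_tokens is reached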
To run generation, I'd suggest looking here: optimum-habana/examples/text-generation at main · huggingface/optimum-habana · GitHub
Specifically, we pass in a generation_config that makes the run optimal for HPU:
if args.input_embeds:
    if inputs_embeds is not None:
        input_data.update(inputs_embeds)
        input_data.update(input_tokens)
    else:
        args.input_embeds = False
        input_data.update(input_tokens)
else:
    input_data.update(input_tokens)

iteration_times = []
outputs = model.generate(
    **input_data,
    generation_config=generation_config,
    assistant_model=assistant_model,
    lazy_mode=use_lazy_mode,
    hpu_graphs=args.use_hpu_graphs,
    profiling_steps=args.profiling_steps,
    profiling_warmup_steps=args.profiling_warmup_steps,
    ignore_eos=args.ignore_eos,
    iteration_times=iteration_times,
    profiling_record_shapes=args.profiling_record_shapes,
)
You can put a breakpoint/pdb before this line and check out generation_config when running a command like:
PT_ENABLE_INT64_SUPPORT=1 python run_generation.py --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --trim_logits --use_kv_cache --attn_softmax_bf16 --bf16 --bucket_internal --bucket_size=128 --use_flash_attention --flash_attention_recompute --batch_size 16 --max_input_tokens 2048 --max_new_tokens 2048
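Concretely, placing the breakpoint could look something like this, just above the model.generate(...) call in run_generation.py (a minimal sketch; the pdb commands in the comments are just illustrations of what to inspect):

import pdb; pdb.set_trace()  # or simply: breakpoint()

# at the (Pdb) prompt, inspect the config that will be passed to generate(), e.g.:
#   p generation_config
#   p generation_config.eos_token_id
#   p generation_config.max_new_tokens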
Also, you may need warmup to be more performant (but that shouldn't change accuracy).
The same issue still occurs even after I pass in the generation config.
What generation config settings did you use?
sunson (December 3, 2024, 9:27am):
Hi, can you try one of the below?
1. Set skip_special_tokens=True in self.tokenizer.decode; that will remove all <|eot_id|> tokens from the decoded output (see the sketch after this list).
2. Add ignore_eos=True and lazy_mode=True to self.model.generate; this would also remove the <|eot_id|> tokens, and performance is better.
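For the first option, the change to the decode call from the original post would look like this (a minimal sketch of just that line):

full_answer = self.tokenizer.decode(
    outputs.sequences[0], skip_special_tokens=True)  # drops <|eot_id|> and other special tokens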