Hi, I've run into some inference issues while using the Llama 3.1 model from Hugging Face.
When I generate answers with Meta-Llama-3.1-8B-Instruct, the output repeats the <|eot_id|> token at the end of the answer until it reaches the max length.
How can we solve this problem?
def generate(self, input_data: str, temperature):
    inputs = self.tokenizer(input_data, return_tensors="pt").to(self.device)
    with torch.no_grad():
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=4096,  # self.max_new_tokens  # 1024
            return_dict_in_generate=True,
            output_scores=True,
            output_hidden_states=True,
            temperature=temperature,
            do_sample=True,
            stopping_criteria=None,
            pad_token_id=self.tokenizer.eos_token_id,
        )
    full_answer = self.tokenizer.decode(
        outputs.sequences[0], skip_special_tokens=False)
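(For context: generate() only stops early when it samples one of the eos_token_id values, so one thing to check is whether <|eot_id|> is actually registered as an end-of-sequence id for this model. A minimal diagnostic sketch, assuming a standard transformers install; the IDs it prints are not quoted from this thread:)

from transformers import AutoTokenizer, GenerationConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # model from the question
tokenizer = AutoTokenizer.from_pretrained(model_id)
gen_config = GenerationConfig.from_pretrained(model_id)  # reads generation_config.json only

# ID that the chat template appends at the end of each assistant turn
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

print("<|eot_id|> id:            ", eot_id)
print("tokenizer.eos_token_id:   ", tokenizer.eos_token_id)
print("generation_config eos ids:", gen_config.eos_token_id)
# if eot_id is not among the EOS ids that generate() uses,
# generation keeps running until max_new_tokens is reached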
To run generation, I'd suggest looking here: optimum-habana/examples/text-generation at main · huggingface/optimum-habana · GitHub
Specifically, we pass in a generation_config that makes the run optimal for HPU:
if args.input_embeds:
    if inputs_embeds is not None:
        input_data.update(inputs_embeds)
        input_data.update(input_tokens)
    else:
        args.input_embeds = False
        input_data.update(input_tokens)
else:
    input_data.update(input_tokens)

iteration_times = []
outputs = model.generate(
    **input_data,
    generation_config=generation_config,
    assistant_model=assistant_model,
    lazy_mode=use_lazy_mode,
    hpu_graphs=args.use_hpu_graphs,
    profiling_steps=args.profiling_steps,
    profiling_warmup_steps=args.profiling_warmup_steps,
    ignore_eos=args.ignore_eos,
    iteration_times=iteration_times,
    profiling_record_shapes=args.profiling_record_shapes,
)
You can put a breakpoint/pdb before this line and check out generation_config when running a command like:
PT_ENABLE_INT64_SUPPORT=1 python run_generation.py --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --trim_logits --use_kv_cache --attn_softmax_bf16 --bf16 --bucket_internal --bucket_size=128 --use_flash_attention --flash_attention_recompute --batch_size 16 --max_input_tokens 2048 --max_new_tokens 2048
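Concretely, placing the breakpoint could look something like this, just above the model.generate(...) call in run_generation.py (a minimal sketch; the pdb commands in the comments are just illustrations of what to inspect):

import pdb; pdb.set_trace()  # or simply: breakpoint()

# at the (Pdb) prompt, inspect the config that will be passed to generate(), e.g.:
#   p generation_config
#   p generation_config.eos_token_id
#   p generation_config.max_new_tokens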
Also, you may need warmup to be more performant (but that shouldn't change accuracy).
The same issue still occurs even after I pass in the generation config.
What generation config settings did you use?
sunson (December 3, 2024, 9:27am):
Hi, can you try one of the below?
1. Set skip_special_tokens=True in self.tokenizer.decode; that will remove all <|eot_id|> tokens from the decoded output (see the sketch after this list).
2. Add ignore_eos=True and lazy_mode=True to self.model.generate; this would also remove the <|eot_id|> tokens, and performance is better.
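For the first option, the change to the decode call from the original post would look like this (a minimal sketch of just that line):

full_answer = self.tokenizer.decode(
    outputs.sequences[0], skip_special_tokens=True)  # drops <|eot_id|> and other special tokens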