Problem training on a local dataset using Llama2 Fine-Tuning with Low-Rank Adaptation (LoRA) on the Intel® Gaudi®2 AI Accelerator

I followed the procedure described in this link: Llama2 Fine-Tuning with Low-Rank Adaptations (LoRA) on Intel® Gaudi®2 AI Accelerator - Intel Gaudi Developers (habana.ai)

How do I train a model with a local dataset? I tried the method below:

    python3 …/gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py \
        --model_name_or_path meta-llama/CodeLlama-7b-Instruct-hf \
        --deepspeed llama2_ds_zero2_config.json \
        --train_file /optimum-habana/examples/language-modeling/train-data.jsonl \
        --bf16 True \
        --output_dir ./lora_out \
        --num_train_epochs 2 \
        --max_seq_len 2048 \
        --per_device_train_batch_size 10 \
        --per_device_eval_batch_size 10 \
        --gradient_checkpointing \
        --evaluation_strategy epoch \
        --eval_delay 2 \
        --save_strategy no \
        --learning_rate 0.0018 \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --dataset_concatenation \
        --attn_softmax_bf16 True \
        --do_train \
        --do_eval \
        --use_habana \
        --use_lazy_mode \
        --pipelining_fwd_bwd \
        --throughput_warmup_steps 3 \
        --lora_rank 4 \
        --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
        --validation_split_percentage 4

After executing the above, I got this error:

    Traceback (most recent call last):
      File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 754, in <module>
        main()
      File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 529, in main
        raise ValueError("Unsupported dataset")
    ValueError: Unsupported dataset

Please suggest the best way to train a model using a local dataset on the Intel Gaudi 2 AI accelerator.

The command line you posted says:

    --train_file /optimum-habana/examples/language-modeling/train-data.jsonl

Is the extension "jsonl", or is that a typo for "json"?

I'll check and get back to you on the expected format of the dataset.

Yes, the extension is .jsonl. I have also tried other extensions, such as .csv, .json, and .parquet, but I get the same error. Please help me fix this issue; if required, we can have a working session to clear things up.
Thanks in advance.

The readme has an example:

    python3 run_lora_clm.py \
        --model_name_or_path huggyllama/llama-7b \
        --dataset_name tatsu-lab/alpaca \
        ...

tatsu-lab/alpaca is a sample dataset.
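
If you want to see the schema the script is written against, you can inspect that sample dataset yourself (a minimal sketch, assuming the Hugging Face `datasets` library is installed):

    from datasets import load_dataset

    # Download the sample dataset and inspect the columns the script expects.
    raw = load_dataset("tatsu-lab/alpaca")
    print(raw["train"].column_names)  # ['instruction', 'input', 'output', 'text']
    print(raw["train"][0])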

Please check the preprocessing section in run_lora_clm.py to see what transformations are applied to the data.

The dataset is first loaded here.

A sample from the dataset at this point:

(Pdb) raw_datasets["train"]
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 52002
})
(Pdb) raw_datasets["train"][0]
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
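
For comparison, a local JSONL file passed via --train_file goes through the same `datasets` machinery, so you can check what columns your file ends up with (a sketch; the exact loading call inside the script may differ):

    from datasets import load_dataset

    # Load a local JSONL file the way --train_file does under the hood.
    raw_datasets = load_dataset("json", data_files={"train": "train-data.jsonl"})
    print(raw_datasets["train"].column_names)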

At this point:

(Pdb) raw_datasets["train"]
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 49922
})
(Pdb) raw_datasets["train"][0]
{'instruction': 'Compare and contrast the use of web technologies between web 1.0 and web 2.0', 'input': '', 'output': 'The technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCompare and contrast the use of web technologies between web 1.0 and web 2.0\n\n### Response:\nThe technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.'}
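
Note that the row count dropped from 52002 to 49922, i.e. by exactly 2080 rows: that is the 4% requested by --validation_split_percentage 4 being carved out as a validation set. A minimal sketch of such a split with `datasets` (the script's exact mechanism may differ):

    from datasets import load_dataset

    # Carve out a 4% validation split, mirroring --validation_split_percentage 4.
    raw = load_dataset("tatsu-lab/alpaca")
    split = raw["train"].train_test_split(test_size=0.04, seed=42)
    print(len(split["train"]), len(split["test"]))  # roughly 49922 and 2080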

Finally, this gets called to create the prompts:

(Pdb) type(prompts)
<class 'dict'>
(Pdb) prompts.keys()
dict_keys(['source', 'target'])
(Pdb) prompts['source'][0]
'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCompare and contrast the use of web technologies between web 1.0 and web 2.0\n\n### Response:'
(Pdb) prompts['target'][0]
'The technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.'
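
The transformation from the Alpaca columns to these source/target pairs looks roughly like this (a simplified sketch of the script's prompt-creation step, using the template visible in the output above; the real function also handles a non-empty 'input' column):

    # Simplified sketch of the prompt-creation step (no-input case only).
    PROMPT_NO_INPUT = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    )

    def create_prompts(examples):
        prompts = {"source": [], "target": []}
        for instruction, output in zip(examples["instruction"], examples["output"]):
            prompts["source"].append(PROMPT_NO_INPUT.format(instruction=instruction))
            prompts["target"].append(output)
        return prompts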

Finally, after all this, we get:

(Pdb) raw_datasets['train']
Dataset({
    features: ['prompt_sources', 'prompt_targets'],
    num_rows: 49922
})
(Pdb) raw_datasets['train'][0]
{'prompt_sources': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCompare and contrast the use of web technologies between web 1.0 and web 2.0\n\n### Response:', 'prompt_targets': 'The technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.'}

So basically, you need a dataset with two keys: 'prompt_sources' and 'prompt_targets'.
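
A minimal local file in that shape could look like this (a hypothetical example; the file name and record contents are just placeholders):

    import json

    # Hypothetical records carrying the two keys the preprocessing ends up with.
    records = [
        {
            "prompt_sources": (
                "Below is an instruction that describes a task. "
                "Write a response that appropriately completes the request.\n\n"
                "### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
            ),
            "prompt_targets": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
        },
    ]
    with open("train-data.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

You can then point --train_file at that file as in your original command.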