The README has an example:
python3 run_lora_clm.py \
--model_name_or_path huggyllama/llama-7b \
--dataset_name tatsu-lab/alpaca \
...
tatsu-lab/alpaca is a sample dataset.
Check the preprocessing section of run_lora_clm.py to see what transformations are applied to the data.
First, the dataset is loaded here.
A sample from the dataset at this point:
(Pdb) raw_datasets["train"]
Dataset({
features: ['instruction', 'input', 'output', 'text'],
num_rows: 52002
})
(Pdb) raw_datasets["train"][0]
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
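At this point, the train split is down to 49922 rows (from 52002) and the first row is different. The 2080 missing rows are about 4% of the original, which would be consistent with a validation split being held out, though the output alone does not confirm that: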
(Pdb) raw_datasets["train"]
Dataset({
features: ['instruction', 'input', 'output', 'text'],
num_rows: 49922
})
(Pdb) raw_datasets["train"][0]
{'instruction': 'Compare and contrast the use of web technologies between web 1.0 and web 2.0', 'input': '', 'output': 'The technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCompare and contrast the use of web technologies between web 1.0 and web 2.0\n\n### Response:\nThe technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.'}
Next, this gets called, creating prompts:
(Pdb) type(prompts)
<class 'dict'>
(Pdb) prompts.keys()
dict_keys(['source', 'target'])
(Pdb) prompts['source'][0]
'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCompare and contrast the use of web technologies between web 1.0 and web 2.0\n\n### Response:'
(Pdb) prompts['target'][0]
'The technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.'
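To make the source/target split concrete, here is a minimal sketch of how such a prompts dict could be built from rows with 'instruction', 'input' and 'output' fields. This is not the script's actual code: build_prompts is a hypothetical name, the no-input template is copied from the 'source' string shown above, and the with-input template is an assumption based on the usual Alpaca-style wording.

```python
# Template copied from the 'source' string in the pdb output above.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

# Assumption: rows with a non-empty 'input' use the usual Alpaca-style
# template with an extra "### Input:" section (not shown in the output above).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def build_prompts(examples):
    """Return {'source': [...], 'target': [...]} from a list of row dicts."""
    prompts = {"source": [], "target": []}
    for example in examples:
        # An empty 'input' string selects the shorter template.
        template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
        prompts["source"].append(template.format_map(example))
        # The target is just the expected completion, with no template text.
        prompts["target"].append(example["output"])
    return prompts


rows = [
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet.",
    }
]
prompts = build_prompts(rows)
print(prompts["source"][0])
```

The point of the split is that 'source' holds everything up to and including "### Response:", while 'target' holds only the text the model is expected to generate.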
Finally, after all this, we get:
(Pdb) raw_datasets['train']
Dataset({
features: ['prompt_sources', 'prompt_targets'],
num_rows: 49922
})
(Pdb) raw_datasets['train'][0]
{'prompt_sources': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCompare and contrast the use of web technologies between web 1.0 and web 2.0\n\n### Response:', 'prompt_targets': 'The technology used in the original web (web 1.0) was limited to static webpages, and the primary focus was on the visual aspects of web design. Web 1.0 technologies lacked most of the interactivity and personalization that is commonplace today.\n\nThe use of web technologies in web 2.0 are much more complex, with the focus being on the development of dynamic webpages that can offer user interaction, personalization, and social interaction. Content is the main emphasis, and user content and interaction is encouraged. This encourages the development of communities and apps using extensive backend programming.'}
So basically, you need a dataset with two columns, 'prompt_sources' and 'prompt_targets'.
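As a rough sketch of that layout for your own data, the snippet below turns (source, target) string pairs into a two-column dict. to_prompt_columns is a hypothetical helper, and the example pair is made up; whether the script accepts a custom dataset in exactly this form is an assumption you should check against its preprocessing section.

```python
def to_prompt_columns(pairs):
    """Turn (source, target) string pairs into the two-column layout."""
    return {
        "prompt_sources": [source for source, _ in pairs],
        "prompt_targets": [target for _, target in pairs],
    }


# Made-up example pair, following the prompt format shown above.
pairs = [
    (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\nName a primary color.\n\n### Response:",
        "Red.",
    )
]
columns = to_prompt_columns(pairs)
print(list(columns.keys()))

# If the `datasets` library is installed, datasets.Dataset.from_dict(columns)
# wraps this dict as a Dataset with the same two features.
```

Each index i in 'prompt_sources' lines up with the same index in 'prompt_targets', which mirrors the single-row output shown above.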