Questions on creating instruction data

Thanks for the great work!

I have a few questions about creating the xP3 data. I followed the guide [here](https://github.com/bigscience-workshop/xmtf#create-xp3) to create instruction data for the `code` language subset.

1. The public processed data ([here](https://huggingface.co/datasets/bigscience/xP3all)) has **2707724** samples in the `code` split. However, following the GitHub guide above gives me considerably more than that (over 3M samples). Was any additional post-processing applied to get the final instruction data used for tuning?

2. Following the same GitHub guide, I noticed that no prompts were found for one dataset, State Changes. I got this warning when running the [creation code](https://github.com/bigscience-workshop/bigscience/blob/master/data/xp3/prepare_xp3_train.py):
``Tried instantiating `DatasetTemplates` for Fraser/python-state-changes, but no prompts found. Please ignore this warning if you are creating new prompts for this dataset.``

Is this dataset not assigned any prompts (similar to how HumanEval was treated)? Or is the version of PromptSource I installed, shown below, not the correct one?
``git clone -b tr13 https://github.com/Muennighoff/promptsource.git && cd promptsource && pip install -e .``
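
For anyone wanting to reproduce the count comparison in question 1, here is a minimal sketch. It assumes the creation script writes one sample per line to `.jsonl` files in a single output directory; the directory layout, the file pattern, and the helper names `count_samples` and `difference_from_reference` are assumptions for illustration, not something taken from the guide.

```python
from pathlib import Path


def count_samples(data_dir, pattern="*.jsonl"):
    """Return {file name: number of non-empty lines} for files matching pattern.

    Assumes one sample per line (JSON Lines); blank lines are skipped.
    """
    counts = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open(encoding="utf-8") as f:
            counts[path.name] = sum(1 for line in f if line.strip())
    return counts


def difference_from_reference(counts, reference_total):
    """Positive result: more local samples than the reference total."""
    return sum(counts.values()) - reference_total


if __name__ == "__main__":
    # 2707724 is the `code` split size reported for the public xP3all data.
    # "xp3_code_output" is a hypothetical output directory name.
    local_counts = count_samples("xp3_code_output")
    for name, n in local_counts.items():
        print(f"{name}: {n}")
    print("difference:", difference_from_reference(local_counts, 2707724))
```

Per-file counts make it easier to see whether the surplus comes from a single dataset or is spread across all of them.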
