Create dataset
Create dataset is the first step in the Finetune Wizard. Here you can upload your train, validation, and test data files to fine-tune an LLM. If you do not provide separate validation and test files, you can use a percentage of the train data file for them instead.
If you are new to AI/ML development, please read our blog post on Data Splitting for more information on how to split your data into train, validation, and test data files.
Uploading data files
To upload your data files, follow these steps:
- Click the "Choose File" button next to the data file type you want to upload: train, validation, or test.
- A dialog box will appear where you can select one of the previously uploaded files by clicking on it and then clicking the "Select" button to confirm.
- If you want to upload a new file, click on the "Upload File" button. This will open a dialog box where you can upload a new file from your local computer.
- Enable the checkbox in the dialog box for each of the following options you want to apply:
  - Check for Null values: checks the current file for null values.
  - Binary Classification: for binary classification; verifies that the dataset contains only two unique labels.
  - Classification: for multi-class classification; verifies that each label is a single token, as recommended by OpenAI.
- Click the "Upload" button to finish the process.
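The null-value and binary-classification checks above can be approximated locally before you upload a file. The following is only a rough sketch, not the wizard's actual implementation: it assumes each record is a dictionary with `prompt` and `completion` keys (the field names and both helper functions are hypothetical), and it does not attempt the single-token label check, which would require a tokenizer.

```python
def find_null_rows(rows, fields=("prompt", "completion")):
    """Return the indices of rows where any field is missing, None, or blank.

    The field names are an assumption about the data file layout.
    """
    bad = []
    for i, row in enumerate(rows):
        for field in fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                bad.append(i)
                break  # one bad field is enough to flag the row
    return bad


def is_binary_classification(rows, label_field="completion"):
    """Return True if the non-null labels take only two unique values."""
    labels = {
        str(row[label_field]).strip()
        for row in rows
        if row.get(label_field) is not None
    }
    return len(labels) == 2


rows = [
    {"prompt": "Great product", "completion": "positive"},
    {"prompt": "Terrible service", "completion": "negative"},
    {"prompt": "", "completion": "positive"},
]

print(find_null_rows(rows))            # indices of rows with empty fields
print(is_binary_classification(rows))  # whether only two labels occur
```

Running a check like this first makes it easier to locate the offending rows if the upload dialog reports a problem.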
Creating Validation and Test data files
If you do not have separate validation and test data files, you can create them using a percentage of the train data file. To do this, follow these steps:
- Upload your train data file as described above.
- In the validation and test sections, an extra input box appears where you can enter a percentage value between 0.1 and 50.0.
- Enter the percentage of the train data you want to use for the validation and/or test data files, or adjust the slider to set the value.
- Click the "Upload" button to finish the process.
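To illustrate what a percentage-based split does, here is a minimal sketch. It rests on several assumptions that the steps above do not confirm: that rows are shuffled before splitting, that held-out rows are removed from the train set, and that row counts are rounded to the nearest integer. The function name `split_dataset` and its parameters are hypothetical.

```python
import random


def split_dataset(rows, val_pct=0.0, test_pct=0.0, seed=0):
    """Carve validation and test subsets out of the train rows.

    Percentages follow the 0.1 to 50.0 range of the input box;
    0 means "do not create this subset".
    """
    for name, pct in (("val_pct", val_pct), ("test_pct", test_pct)):
        if pct != 0 and not 0.1 <= pct <= 50.0:
            raise ValueError(f"{name} must be between 0.1 and 50.0")
    if val_pct + test_pct >= 100:
        raise ValueError("no rows would be left for training")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumption: split is randomized

    n = len(shuffled)
    n_val = round(n * val_pct / 100)
    n_test = round(n * test_pct / 100)

    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


rows = list(range(100))
train, val, test = split_dataset(rows, val_pct=10, test_pct=20)
print(len(train), len(val), len(test))
```

A fixed seed keeps the split reproducible, which is useful when comparing fine-tuning runs on the same data.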
Prompt and Completion End Tokens
You can also enter the prompt and completion end tokens in their respective text input boxes.
- Prompt end token: a fixed separator that informs the model when the prompt ends and the completion begins. A simple separator that generally works well is \n\n###\n\n. This separator should not appear elsewhere in any prompt in your data file.
- Completion end token: a fixed stop sequence that informs the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion in your data file.
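As an illustration of how the two tokens are applied to a single example, consider the sketch below. It assumes records with `prompt` and `completion` fields and uses \n\n###\n\n and \n as the tokens; the helper `format_example` is hypothetical and only mirrors the rules stated above, so the wizard's own formatting may differ.

```python
PROMPT_END = "\n\n###\n\n"   # separator suggested in the text above
COMPLETION_END = "\n"        # one of the stop sequences mentioned above


def format_example(prompt, completion,
                   prompt_end=PROMPT_END, completion_end=COMPLETION_END):
    """Append the end tokens, rejecting text that already contains them."""
    if prompt_end in prompt:
        raise ValueError("prompt end token must not appear inside a prompt")
    if completion_end in completion:
        raise ValueError("completion end token must not appear inside a completion")
    return {
        "prompt": prompt + prompt_end,
        "completion": completion + completion_end,
    }


example = format_example("Great product", "positive")
print(repr(example["prompt"]))
print(repr(example["completion"]))
```

The two checks enforce the uniqueness requirement: if a token already occurs inside the text, the model cannot tell where the prompt or completion really ends.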
Dataset Creation Process
After you complete the steps above and click the Create Dataset button, the dataset creation process starts in the back end and produces the datasets in the format required for fine-tuning the LLM.