Create dataset

Create dataset is the first step in Finetune Wizard. Here, you can upload your train, validation, and test data files to finetune LLMs. You also have the option to specify percentages for validation and test data files. If percentages are provided instead of data files, EasyLLM will split the train data file into validation and test data files based on the specified percentages.

If you are new to AI/ML development, please read our blog post on Data Splitting for more information on how to split your data into train, validation, and test data files.

Uploading data files

To upload your data files, follow these steps:

Click on the "Choose File" button next to the data file type you want to upload - train, validation, or test.
A dialog box will appear where you can select one of the previously uploaded files by clicking on it and then clicking the "Select" button to confirm.
If you want to upload a new file, click on the "Upload File" button. This will open a dialog box where you can upload a new file from your local computer.
Enable the checkboxes in the dialog box for any of the following options that apply:
- Check for Null values - checks the current file for null values
- Binary Classification - for binary classification (ensures that the dataset contains only two unique labels)
- Classification - for multi-class classification (ensures that each label is a single token, as recommended by OpenAI)
- Chat - for chat finetuning (validates the dataset for Chat format)
- Tools - for tool calling finetuning (validates the dataset for Tool Calling format)
- Multimodal - for multimodal finetuning (validates the dataset for format and multimodal data)
Click the "Upload" button to finish the process.
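
To get a feel for what the null-value and binary classification checks look for, you can run a rough equivalent locally before uploading. The sketch below is only an illustration, not EasyLLM's actual validation code. It assumes a JSONL-style file with "prompt" and "completion" fields, and the helper names find_null_rows and is_binary are hypothetical.

```python
import json

def find_null_rows(rows):
    """Return the indices of rows in which any field is null (None)."""
    return [i for i, row in enumerate(rows)
            if any(value is None for value in row.values())]

def is_binary(rows, label_key="completion"):
    """Return True when the rows contain exactly two unique labels."""
    return len({row[label_key] for row in rows}) == 2

# Assumed sample data: one JSON object per line, as in a JSONL file.
sample_lines = [
    '{"prompt": "great movie", "completion": "positive"}',
    '{"prompt": "dull plot", "completion": "negative"}',
    '{"prompt": "no label yet", "completion": null}',
]
rows = [json.loads(line) for line in sample_lines]

null_rows = find_null_rows(rows)
clean_rows = [row for i, row in enumerate(rows) if i not in null_rows]

print("rows with null values:", null_rows)
print("binary after removing nulls:", is_binary(clean_rows))
```

Fixing or removing the rows reported here before uploading should make it less likely that the corresponding checks flag the file.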

Creating Validation and Test data files

If you do not have separate validation and test data files, you can create them using a percentage of the train data file. To do this, follow these steps:

Upload your train data file as described above.
In the validation and test sections, you will see an extra input box to enter a percentage value between 0.1 and 50.0.
You can also adjust the slider to set the percentage value.
Enter the percentage value you want to use for validation and/or test data files.
Click the "Upload" button to finish the process.
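
For intuition, a percentage-based split can be sketched as follows. This is a minimal illustration under stated assumptions, not EasyLLM's implementation: the function name split_by_percent is hypothetical, and the shuffling, fixed seed, and rounding behavior are choices made for this example only.

```python
import random

def split_by_percent(rows, val_pct=0.0, test_pct=0.0, seed=0):
    """Split rows into (train, validation, test) using percentages of the train data."""
    for pct in (val_pct, test_pct):
        # The wizard accepts 0.1 to 50.0; 0 here means "no split requested".
        if pct != 0 and not 0.1 <= pct <= 50.0:
            raise ValueError("percentage must be between 0.1 and 50.0")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumed: rows are shuffled before splitting

    n_val = round(len(shuffled) * val_pct / 100)
    n_test = round(len(shuffled) * test_pct / 100)

    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, validation, test

# Example: 200 rows with 10% held out for validation and 5% for test.
all_rows = list(range(200))
train, validation, test = split_by_percent(all_rows, val_pct=10.0, test_pct=5.0)
print(len(train), len(validation), len(test))
```

Whatever the exact mechanics, the rows set aside for validation and test come out of the train data file, so a larger percentage leaves fewer rows for training.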

Prompt and Completion End Tokens

You can also enter the prompt and completion end tokens in their respective text input boxes.

The prompt end token is a fixed separator that informs the model when the prompt ends and the completion begins. A simple separator that generally works well is \n\n###\n\n. This separator should not appear elsewhere in any prompt in your data file.
The completion end token is a fixed stop sequence that informs the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion in your data file.
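
As a rough illustration of how these two tokens end up in a training record, the sketch below appends them to a prompt and a completion and rejects text that already contains them. It assumes "prompt" and "completion" fields in the data file; the function name format_record is hypothetical, and EasyLLM may apply the tokens differently.

```python
import json

PROMPT_END = "\n\n###\n\n"   # separator suggested above
COMPLETION_END = "\n"        # one possible stop sequence

def format_record(prompt, completion):
    """Append the end tokens, refusing text that already contains them."""
    if PROMPT_END in prompt:
        raise ValueError("prompt end token must not appear inside the prompt")
    if COMPLETION_END in completion:
        raise ValueError("completion end token must not appear inside the completion")
    return {
        "prompt": prompt + PROMPT_END,
        "completion": completion + COMPLETION_END,
    }

record = format_record("Review: great movie", "positive")
print(json.dumps(record))
```

At inference time, the same prompt end token would be appended to each prompt, and the completion end token would be passed as the stop sequence.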

Note

These tokens are best suited to base LLMs. If you are not sure about the prompt and completion end tokens, you can leave them untouched.

Dataset Creation Process

Once you have completed the steps above and clicked the Create Dataset button, the dataset creation process is initiated in the back end. It creates the datasets in the format required for finetuning the LLMs.