How to prepare data

Preparing data is the first and most important step in finetuning an LLM. In this section, we explain how to prepare data and walk through some important tips and tricks for this task.

See the Prompt Engineering and Prompt Engineering vs Finetuning guides before continuing with this guide.

Preparing Data

To finetune a model, you'll need a set of training examples, each consisting of a single input ("prompt") and its associated output ("completion"). This differs from using GPT-3 base models like text-davinci-003 or any other LLM, where multiple examples can be provided in a single prompt.

For example, consider a few-shot prompt for building a review sentiment analysis solution with a GPT-3 base model like text-davinci-003:

Review: Food is very bad
Sentiment: Negative

Review: I liked the taste of every dish
Sentiment: Positive

Review: The service was slow and the staff was rude
Sentiment: Negative

Review: The ambiance was cozy and the music was soothing
Sentiment: Positive

Review: The portions were small and the prices were high
Sentiment: Negative

Review: The food was average, nothing special but not bad either
Sentiment: Neutral

Review: The restaurant was clean and well-maintained
Sentiment: Positive

Review: The menu had limited options, but the food was decent
Sentiment: Neutral

Review: The wait time for our food was too long, but the taste made up for it
Sentiment:
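
For reference, here is how such a few-shot prompt is sent to a base model in a single API call. This is a minimal sketch using the legacy openai Python package (the pre-1.0 Completion interface); the API key is a placeholder and the prompt is truncated to a few of the examples above:

```python
import openai  # legacy interface: pip install "openai<1"

openai.api_key = "YOUR_API_KEY"  # placeholder

# Truncated version of the few-shot prompt above; the model completes the
# final "Sentiment:" line based on the in-context examples.
few_shot_prompt = """Review: Food is very bad
Sentiment: Negative

Review: I liked the taste of every dish
Sentiment: Positive

Review: The wait time for our food was too long, but the taste made up for it
Sentiment:"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=few_shot_prompt,
    max_tokens=3,   # enough for "Positive" / "Negative" / "Neutral"
    temperature=0,  # deterministic output for classification
)
print(response["choices"][0]["text"].strip())
```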

The few-shot prompt above has limitations, which are explained in detail in our Prompt Engineering vs Finetuning guide. If you have decided to finetune an LLM rather than use a base model with prompt engineering, you need to prepare the data in a particular way.

To prepare the data for finetuning, organize it in a table using a spreadsheet application like Excel or Google Sheets. The table should have "prompt" and "completion" columns, and each training example from the few-shot prompt above becomes a separate row in this table.

The resulting table will look like this:

| prompt | completion |
| --- | --- |
| Food is very bad | Negative |
| I liked the taste of every dish | Positive |
| The service was slow and the staff was rude | Negative |
| The ambiance was cozy and the music was soothing | Positive |
| The portions were small and the prices were high | Negative |
| The food was average, nothing special but not bad either | Neutral |
| The restaurant was clean and well-maintained | Positive |
| The menu had limited options, but the food was decent | Neutral |

In this format, you can add as many examples as you want; providing more high-quality examples generally improves the performance of the finetuned model.
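
Once the table is complete, export it as a CSV file and convert it to JSONL, the format the OpenAI finetuning endpoint expects: one JSON object with "prompt" and "completion" keys per line. Below is a minimal sketch, assuming the CSV uses the "prompt" and "completion" headers shown above; the file names are placeholders:

```python
import csv
import json

# Convert the exported spreadsheet (reviews.csv is a placeholder name) into
# JSONL: one {"prompt": ..., "completion": ...} object per line.
with open("reviews.csv", newline="", encoding="utf-8") as src, \
        open("reviews.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {"prompt": row["prompt"], "completion": row["completion"]}
        dst.write(json.dumps(record) + "\n")
```

The legacy openai package also ships a CLI helper, `openai tools fine_tunes.prepare_data -f reviews.csv`, which performs the same conversion and suggests formatting fixes.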

General Best Practices

Here are some general best practices for finetuning models:

  • To improve a model through finetuning, gather many high-quality examples, ideally created or vetted by experts. Increasing the number of examples usually improves performance; as a rule of thumb, each doubling of the dataset size tends to yield a roughly linear gain in quality.

  • Starting with classifiers is recommended, as they are easier to work with. For classification tasks, ada is a good choice: it performs only slightly worse than more capable models once finetuned, while being faster and cheaper.

  • If finetuning on an existing dataset, manually review the data for offensive or inaccurate content if possible. Alternatively, randomly sample and review as much data as you can if the dataset is large.

  • Ensure the dataset used for finetuning has a similar structure and task type as what the model will be used for.

  • Remember to keep each prompt plus its completion within the allowed token limit (2048 tokens for GPT-3 models), including the separator; the sketch after this list shows one way to check this.

  • Separators that indicate the start of the completion, like 'Sentiment:', do not affect finetuning. You can leave them out, resulting in a prompt like: "Review: The wait time for our food was too long, but the taste made up for it".

  • If the original prompt contains only one input element, you can also omit the starting separator. For example,

    Review: The wait time for our food was too long, but the taste made up for it
    
    can become
    The wait time for our food was too long, but the taste made up for it
    

  • However, if the original prompt has multiple input elements, you should keep the separators to differentiate between the inputs. For example, the prompt

    Company name: XYZ Corporation
    Product: ABC Widget
    Target audience: Young professionals
    Goal: Increase brand awareness
    
    is used to create a social media post like the one below:
    πŸ“£ Exciting news! XYZ Corporation is thrilled to introduce the revolutionary ABC Widget!
    πŸš€πŸŒŸ Designed to meet the needs of young professionals, this cutting-edge gadget will transform your everyday life.
    πŸ’ΌπŸ’» Don't miss out on the opportunity to enhance productivity and stay ahead of the game.
    Get your hands on the ABC Widget today and experience a new level of innovation! 
    
    #XYZCorporation #ABCWidget #Innovation #ProductivityBoost
    
    Here, we have kept separators like 'Company name:' and 'Product:' to differentiate the inputs, since there are multiple of them, but we have omitted a trailing 'Social media post:' separator that would otherwise mark the start of the completion. The sketch after this list shows how this multi-input example is encoded as a training record.
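
To make the separator and token-limit guidance concrete, here is a minimal sketch that builds a training record for the multi-input example above and checks its combined token count with the tiktoken library. The completion text is truncated, the leading space in the completion follows the common GPT-3 finetuning convention, and the 2048-token limit is the GPT-3 figure quoted above:

```python
import json

import tiktoken  # GPT-3 tokenizer: pip install tiktoken

# Multi-input prompt: separators like "Company name:" differentiate the
# input fields; no trailing "Social media post:" separator is added.
prompt = (
    "Company name: XYZ Corporation\n"
    "Product: ABC Widget\n"
    "Target audience: Young professionals\n"
    "Goal: Increase brand awareness\n"
)
completion = (
    " Exciting news! XYZ Corporation is thrilled to introduce the "
    "revolutionary ABC Widget! ..."  # truncated for brevity
)

# Count prompt + completion tokens to stay within the 2048-token limit.
encoding = tiktoken.encoding_for_model("davinci")
total_tokens = len(encoding.encode(prompt)) + len(encoding.encode(completion))
print(f"Total tokens: {total_tokens}")
assert total_tokens <= 2048, "example exceeds the GPT-3 finetuning limit"

# Append the record as one JSONL line, ready for finetuning.
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```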

Remember to follow these guidelines to properly format your data for finetuning.