How to prepare data
Preparing data is the first, and one of the most important, tasks when finetuning an LLM. This section explains how to prepare your data and shares some practical tips and tricks.
Tip
If you are new to finetuning, we recommend reading the Prompt Design, Few-shot Learning, Finetuning, and Prompt Engineering vs Finetuning guides before continuing with this guide.
Preparing Data
To finetune an LLM, you'll need a set of training examples, each consisting of a single input ("prompt") and its associated output ("completion").
For example, consider a few-shot learning prompt for a review sentiment analysis solution:
Review: Food is very bad
Sentiment: Negative
Review: I liked the taste of every dish
Sentiment: Positive
Review: The service was slow and the staff was rude
Sentiment: Negative
Review: The ambiance was cozy and the music was soothing
Sentiment: Positive
Review: The portions were small and the prices were high
Sentiment: Negative
Review: The food was average, nothing special but not bad either
Sentiment: Neutral
Review: The restaurant was clean and well-maintained
Sentiment: Positive
Review: The menu had limited options, but the food was decent
Sentiment: Neutral
Review: The wait time for our food was too long, but the taste made up for it
Sentiment:
However, the few-shot prompt above has limitations, which are explained in detail in our Prompt Engineering vs Finetuning guide. If you have decided to finetune an LLM rather than use a base LLM with prompt engineering, you need to prepare the data in a different way.
To prepare the data for finetuning, organize it as a table using a spreadsheet application like Excel or Google Sheets. The table should have "prompt" and "completion" columns. Each training example from the few-shot prompt above becomes a separate row in this table.
The resulting table will look like this:
prompt | completion |
---|---|
Food is very bad | Negative |
I liked the taste of every dish | Positive |
The service was slow and the staff was rude | Negative |
The ambiance was cozy and the music was soothing | Positive |
The portions were small and the prices were high | Negative |
The food was average, nothing special but not bad either | Neutral |
The restaurant was clean and well-maintained | Positive |
The menu had limited options, but the food was decent | Neutral |
In this format, you can add as many examples as you want. Providing more examples generally improves the performance of the finetuned LLM.
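As a rough illustration, the conversion from the few-shot prompt to the table above could be scripted instead of done by hand. The sketch below assumes the examples alternate between `Review:` and `Sentiment:` lines and that a CSV file with `prompt` and `completion` columns is an acceptable export format; the function names `few_shot_to_rows` and `rows_to_csv` are hypothetical and not part of any particular finetuning tool.

```python
import csv
import io

# A shortened copy of the few-shot prompt from this guide.
FEW_SHOT_PROMPT = """\
Review: Food is very bad
Sentiment: Negative
Review: I liked the taste of every dish
Sentiment: Positive
Review: The wait time for our food was too long, but the taste made up for it
Sentiment:
"""


def few_shot_to_rows(text):
    """Turn alternating 'Review:' / 'Sentiment:' lines into prompt/completion rows."""
    rows = []
    prompt = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Review:"):
            prompt = line[len("Review:"):].strip()
        elif line.startswith("Sentiment:") and prompt is not None:
            completion = line[len("Sentiment:"):].strip()
            if completion:  # skip the final query, which has no label yet
                rows.append({"prompt": prompt, "completion": completion})
            prompt = None
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text with 'prompt' and 'completion' columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["prompt", "completion"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = few_shot_to_rows(FEW_SHOT_PROMPT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The unanswered query at the end of the few-shot prompt is dropped on purpose, since it has no completion to learn from. The resulting CSV text can be saved to a file and opened in a spreadsheet application for further editing.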
General Best Practices
Here are some general best practices for finetuning:
-
To improve an LLM through finetuning, gather many high-quality examples created by experts. Increasing the number of examples leads to better performance, with each doubling of the dataset size showing a roughly linear improvement.
-
If finetuning on an existing dataset, manually review the data for offensive or inaccurate content if possible. If the dataset is too large to review in full, randomly sample and review as much of it as you can.
-
Ensure the dataset used for finetuning has a similar structure and task type to what the LLM will be used for.
-
Separators used for indicating the start of completion, like 'Sentiment:', do not affect finetuning. You can leave them out, resulting in a prompt like: "Review: The wait time for our food was too long, but the taste made up for it".
-
If the original prompt contains only one input element, you can omit the starting separator. For example,
Review: The wait time for our food was too long, but the taste made up for it
can become
The wait time for our food was too long, but the taste made up for it
-
However, if the original prompt has multiple input elements, keep the separators to differentiate between the inputs. For example, a prompt for creating a social media post could look like this:
Company name: XYZ Corporation
Product: ABC Widget
Target audience: Young professionals
Goal: Increase brand awareness
with a completion like this:
📣 Exciting news! XYZ Corporation is thrilled to introduce the revolutionary ABC Widget! Designed to meet the needs of young professionals, this cutting-edge gadget will transform your everyday life. 💼💻 Don't miss out on the opportunity to enhance productivity and stay ahead of the game. Get your hands on the ABC Widget today and experience a new level of innovation! #XYZCorporation #ABCWidget #Innovation #ProductivityBoost
Here, we have used separators like 'Company name:' and 'Product:' to differentiate the inputs since there are multiple inputs. However, we have omitted the 'Social media post:' separator at the end, which would otherwise indicate the start of the completion.
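The separator guidelines above can be captured in a small helper. This is only a sketch: `format_prompt` is a hypothetical function, and joining multiple inputs with newlines is an assumption about the layout rather than a stated requirement.

```python
def format_prompt(fields):
    """Build a finetuning prompt from an ordered mapping of label -> value.

    One input: return the bare value, with no starting separator.
    Several inputs: keep 'Label: value' lines so the inputs stay distinguishable.
    No trailing completion separator is added in either case.
    """
    if not fields:
        raise ValueError("at least one input field is required")
    if len(fields) == 1:
        (value,) = fields.values()
        return value.strip()
    # Assumption: one labeled input per line.
    return "\n".join(f"{label}: {value.strip()}" for label, value in fields.items())


single = format_prompt(
    {"Review": "The wait time for our food was too long, but the taste made up for it"}
)
multi = format_prompt(
    {
        "Company name": "XYZ Corporation",
        "Product": "ABC Widget",
        "Target audience": "Young professionals",
        "Goal": "Increase brand awareness",
    }
)
print(single)
print(multi)
```

The single-input prompt comes out without the 'Review:' label, while the multi-input prompt keeps all four labels and ends without a 'Social media post:' separator.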
Remember to follow these guidelines to properly format your data for finetuning.
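For the review step mentioned in the best practices, a reproducible random sample can make manual checking of a large dataset more manageable. The following is a minimal sketch using only the Python standard library; `sample_for_review` and `find_incomplete_rows` are hypothetical helper names, and the automatic check only catches empty fields, so offensive or inaccurate content still needs a human reader.

```python
import random


def sample_for_review(rows, sample_size, seed=0):
    """Return a reproducible random subset of rows for manual review."""
    if sample_size >= len(rows):
        return list(rows)  # small dataset: review everything
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(rows, sample_size)


def find_incomplete_rows(rows):
    """Return the indices of rows with an empty prompt or completion."""
    return [
        index
        for index, row in enumerate(rows)
        if not row.get("prompt", "").strip() or not row.get("completion", "").strip()
    ]


dataset = [
    {"prompt": "Food is very bad", "completion": "Negative"},
    {"prompt": "I liked the taste of every dish", "completion": "Positive"},
    {"prompt": "The service was slow and the staff was rude", "completion": "Negative"},
    {"prompt": "The restaurant was clean and well-maintained", "completion": ""},
    {"prompt": "The portions were small and the prices were high", "completion": "Negative"},
]

to_review = sample_for_review(dataset, sample_size=3)
incomplete = find_incomplete_rows(dataset)
print(len(to_review), incomplete)
```

Rows flagged as incomplete can be fixed or removed before finetuning, and the sampled rows can be read one by one to judge overall data quality.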