How to prepare data
Preparing data is the first and most important step in finetuning an LLM. In this section, we explain how to prepare your data and share some important tips and tricks for this task.
See the Prompt Engineering and Prompt Engineering vs Finetuning guides before continuing with this guide.
Preparing Data
To finetune a model, you'll need a set of training examples, each consisting of a single input ("prompt") and its associated output ("completion"). This differs from using GPT-3 base models like text-davinci-003 or any other LLM, where multiple examples can be provided in a single prompt.
For example, let's consider a few-shot prompt for building a review sentiment analysis solution with a GPT-3 base model like text-davinci-003:
Review: Food is very bad
Sentiment: Negative
Review: I liked the taste of every dish
Sentiment: Positive
Review: The service was slow and the staff was rude
Sentiment: Negative
Review: The ambiance was cozy and the music was soothing
Sentiment: Positive
Review: The portions were small and the prices were high
Sentiment: Negative
Review: The food was average, nothing special but not bad either
Sentiment: Neutral
Review: The restaurant was clean and well-maintained
Sentiment: Positive
Review: The menu had limited options, but the food was decent
Sentiment: Neutral
Review: The wait time for our food was too long, but the taste made up for it
Sentiment:
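To make this concrete, here is a minimal sketch of how such a prompt could be sent to the completions endpoint. It assumes the legacy openai Python package (pre-1.0); the API key is a placeholder and the printed label is only an example.

```python
import openai  # legacy openai package (< 1.0) assumed

openai.api_key = "YOUR_API_KEY"  # placeholder, not a real key

# A shortened version of the few-shot prompt above.
few_shot_prompt = """Review: Food is very bad
Sentiment: Negative
Review: I liked the taste of every dish
Sentiment: Positive
Review: The wait time for our food was too long, but the taste made up for it
Sentiment:"""

# Ask the model to fill in the missing sentiment label.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=few_shot_prompt,
    max_tokens=3,
    temperature=0,
)
print(response["choices"][0]["text"].strip())  # e.g. "Positive"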
However, the few-shot prompt above has limitations, which are explained in detail in our Prompt Engineering vs Finetuning guide. If you have decided to finetune an LLM rather than use a base model with prompt engineering, you need to prepare the data in a particular way.
When preparing data for finetuning, organize it in a table format using a spreadsheet application like Excel or Google Sheets. The table should have "prompt" and "completion" columns, and each training example from the previous few-shot prompt becomes a separate row in this table.
The resulting table will look like this:
| prompt | completion |
| --- | --- |
| Food is very bad | Negative |
| I liked the taste of every dish | Positive |
| The service was slow and the staff was rude | Negative |
| The ambiance was cozy and the music was soothing | Positive |
| The portions were small and the prices were high | Negative |
| The food was average, nothing special but not bad either | Neutral |
| The restaurant was clean and well-maintained | Positive |
| The menu had limited options, but the food was decent | Neutral |
In this format, you can add as many examples as you want. Providing more examples generally improves the performance of the finetuned model.
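Most finetuning tooling does not consume spreadsheets directly; OpenAI's tooling, for example, expects a JSONL file with one prompt/completion pair per line. Below is a minimal sketch that converts a CSV export of the table above into JSONL. The file names are hypothetical.

```python
import csv
import json

# Hypothetical file names: a CSV export of the table above and the
# JSONL file that the finetuning tooling will consume.
with open("reviews.csv", newline="", encoding="utf-8") as src, \
        open("reviews.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):  # expects "prompt" and "completion" headers
        record = {"prompt": row["prompt"], "completion": row["completion"]}
        dst.write(json.dumps(record) + "\n")
```

The legacy OpenAI CLI also provided an `openai tools fine_tunes.prepare_data` helper that performs a similar conversion and suggests formatting fixes.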
General Best Practices
Here are some general best practices for finetuning models:
- To improve a model through finetuning, gather many high-quality examples created by experts. Increasing the number of examples leads to better performance, with each doubling of the dataset tending to yield a linear improvement.
- Starting with classifiers is recommended, as they are easier to work with. For classification tasks, ada is a good choice: after finetuning it performs only slightly worse than more advanced models while being faster and more cost-effective.
- If finetuning on an existing dataset, manually review the data for offensive or inaccurate content if possible. If the dataset is large, randomly sample and review as much data as you can (a sketch of such a spot-check appears after this list).
- Ensure the dataset used for finetuning has a similar structure and task type as what the model will be used for.
- Remember to keep each prompt and completion within the allowed token limit (2,048 tokens for GPT-3 models), including the separator (see the token-count sketch after this list).
- Separators used to indicate the start of the completion, like 'Sentiment:', do not affect finetuning. You can leave them out, resulting in a prompt like: "Review: The wait time for our food was too long, but the taste made up for it".
- If the prompt contains only one input element, you can also omit the starting separator. For example, the prompt

  Review: The wait time for our food was too long, but the taste made up for it

  can become

  The wait time for our food was too long, but the taste made up for it
- However, if the prompt has multiple input elements, you should keep the separators to differentiate between the inputs. For example, a prompt for creating a social media post might look like:

  Company name: XYZ Corporation
  Product: ABC Widget
  Target audience: Young professionals
  Goal: Increase brand awareness

  with a completion such as:

  Exciting news! XYZ Corporation is thrilled to introduce the revolutionary ABC Widget! Designed to meet the needs of young professionals, this cutting-edge gadget will transform your everyday life. Don't miss out on the opportunity to enhance productivity and stay ahead of the game. Get your hands on the ABC Widget today and experience a new level of innovation! #XYZCorporation #ABCWidget #Innovation #ProductivityBoost

  Here, we have used separators like 'Company name:' and 'Product:' to differentiate the inputs since there are multiple inputs. However, we have omitted a 'Social media post:' separator at the end that would indicate the start of the completion.
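Two of the practices above are straightforward to automate. First, a minimal sketch of the random spot-check, assuming the reviews.jsonl file from the conversion step earlier; the sample size is illustrative.

```python
import json
import random

# Spot-check a random sample of a large dataset for offensive or
# inaccurate content. The file name and sample size are illustrative.
with open("reviews.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

for example in random.sample(examples, k=min(25, len(examples))):
    print(example["prompt"], "->", example["completion"])
```

Second, a rough check against the 2,048-token limit, using the tiktoken library with the r50k_base encoding (the encoding used by the original GPT-3 base models); treat the counts as an approximation.

```python
import json
import tiktoken

# r50k_base corresponds to the original GPT-3 base models.
enc = tiktoken.get_encoding("r50k_base")

with open("reviews.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        total = len(enc.encode(example["prompt"])) + len(enc.encode(example["completion"]))
        if total > 2048:
            print(f"Example {i} exceeds the 2,048-token limit ({total} tokens)")
```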
Remember to follow these guidelines to properly format your data for finetuning.