How to prepare data

Preparing data is the first and most important step in finetuning an LLM. In this section, we explain how to prepare data and walk through some important tips and tricks for this task.

See the Prompt Engineering and Prompt Engineering vs Finetuning guides before continuing with this guide.

Preparing Data

To finetune a model, you'll need a set of training examples, each consisting of a single input ("prompt") and its associated output ("completion"). This differs from using GPT-3 base models like text-davinci-003 or any other LLM, where multiple examples can be provided in a single prompt.

For example, consider a few-shot prompt for building a review sentiment analysis solution with a GPT-3 base model like text-davinci-003:

Review: Food is very bad
Sentiment: Negative

Review: I liked the taste of every dish
Sentiment: Positive

Review: The service was slow and the staff was rude
Sentiment: Negative

Review: The ambiance was cozy and the music was soothing
Sentiment: Positive

Review: The portions were small and the prices were high
Sentiment: Negative

Review: The food was average, nothing special but not bad either
Sentiment: Neutral

Review: The restaurant was clean and well-maintained
Sentiment: Positive

Review: The menu had limited options, but the food was decent
Sentiment: Neutral

Review: The wait time for our food was too long, but the taste made up for it
Sentiment:
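
For reference, here is how such a few-shot prompt is sent to a base model in a single API call. This is a minimal sketch using the legacy openai Python package (the pre-1.0 Completion interface); the API key is a placeholder and the prompt is truncated to a few of the examples above:

```python
import openai  # legacy interface: pip install "openai<1"

openai.api_key = "YOUR_API_KEY"  # placeholder

# Truncated version of the few-shot prompt above; the model completes the
# final "Sentiment:" line based on the in-context examples.
few_shot_prompt = """Review: Food is very bad
Sentiment: Negative

Review: I liked the taste of every dish
Sentiment: Positive

Review: The wait time for our food was too long, but the taste made up for it
Sentiment:"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=few_shot_prompt,
    max_tokens=3,   # enough for "Positive" / "Negative" / "Neutral"
    temperature=0,  # deterministic output for classification
)
print(response["choices"][0]["text"].strip())
```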

The few-shot prompt above has limitations, which are explained in detail in our Prompt Engineering vs Finetuning guide. If you have decided to finetune an LLM rather than use a base model with prompt engineering, you need to prepare the data in a particular way.

To prepare the data for finetuning, organize it in a table using a spreadsheet application like Excel or Google Sheets. The table should have "prompt" and "completion" columns, and each training example from the few-shot prompt above becomes a separate row in this table.

The resulting table will look like this:

| prompt | completion |
| --- | --- |
| Food is very bad | Negative |
| I liked the taste of every dish | Positive |
| The service was slow and the staff was rude | Negative |
| The ambiance was cozy and the music was soothing | Positive |
| The portions were small and the prices were high | Negative |
| The food was average, nothing special but not bad either | Neutral |
| The restaurant was clean and well-maintained | Positive |
| The menu had limited options, but the food was decent | Neutral |

In this format, you can add as many examples as you want; providing more high-quality examples generally improves the performance of the finetuned model.
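
Once the table is complete, export it as a CSV file and convert it to JSONL, the format the OpenAI finetuning endpoint expects: one JSON object with "prompt" and "completion" keys per line. Below is a minimal sketch, assuming the CSV uses the "prompt" and "completion" headers shown above; the file names are placeholders:

```python
import csv
import json

# Convert the exported spreadsheet (reviews.csv is a placeholder name) into
# JSONL: one {"prompt": ..., "completion": ...} object per line.
with open("reviews.csv", newline="", encoding="utf-8") as src, \
        open("reviews.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {"prompt": row["prompt"], "completion": row["completion"]}
        dst.write(json.dumps(record) + "\n")
```

The legacy openai package also ships a CLI helper, `openai tools fine_tunes.prepare_data -f reviews.csv`, which performs the same conversion and suggests formatting fixes.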

General Best Practices

Here are some general best practices for finetuning models:

  • To improve a model through finetuning, gather many high-quality examples, ideally created or vetted by experts. Increasing the number of examples usually improves performance; as a rule of thumb, each doubling of the dataset size tends to yield a roughly linear gain in quality.

  • Starting with classifiers is recommended, as they are easier to work with. For classification tasks, ada is a good choice: it performs only slightly worse than more capable models once finetuned, while being faster and cheaper.

  • If finetuning on an existing dataset, manually review the data for offensive or inaccurate content if possible. Alternatively, randomly sample and review as much data as you can if the dataset is large.

  • Ensure the dataset used for finetuning has a similar structure and task type as what the model will be used for.

  • Remember to keep each prompt plus its completion within the allowed token limit (2048 tokens for GPT-3 models), including the separator; the sketch after this list shows one way to check this.

  • Separators that indicate the start of the completion, like 'Sentiment:', do not affect finetuning. You can leave them out, resulting in a prompt like: "Review: The wait time for our food was too long, but the taste made up for it".

  • If the original prompt contains only one input element, you can also omit the starting separator. For example,

    Review: The wait time for our food was too long, but the taste made up for it
    
    can become
    The wait time for our food was too long, but the taste made up for it
    

  • However, if the original prompt has multiple input elements, you should keep the separators to differentiate between the inputs. For example, the prompt

    Company name: XYZ Corporation
    Product: ABC Widget
    Target audience: Young professionals
    Goal: Increase brand awareness
    
    is used to create a social media post like the one below:
    πŸ“£ Exciting news! XYZ Corporation is thrilled to introduce the revolutionary ABC Widget!
    πŸš€πŸŒŸ Designed to meet the needs of young professionals, this cutting-edge gadget will transform your everyday life.
    πŸ’ΌπŸ’» Don't miss out on the opportunity to enhance productivity and stay ahead of the game.
    Get your hands on the ABC Widget today and experience a new level of innovation! 
    
    #XYZCorporation #ABCWidget #Innovation #ProductivityBoost
    
    Here, we have kept separators like 'Company name:' and 'Product:' to differentiate the inputs, since there are multiple of them, but we have omitted a trailing 'Social media post:' separator that would otherwise mark the start of the completion. The sketch after this list shows how this multi-input example is encoded as a training record.
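
To make the separator and token-limit guidance concrete, here is a minimal sketch that builds a training record for the multi-input example above and checks its combined token count with the tiktoken library. The completion text is truncated, the leading space in the completion follows the common GPT-3 finetuning convention, and the 2048-token limit is the GPT-3 figure quoted above:

```python
import json

import tiktoken  # GPT-3 tokenizer: pip install tiktoken

# Multi-input prompt: separators like "Company name:" differentiate the
# input fields; no trailing "Social media post:" separator is added.
prompt = (
    "Company name: XYZ Corporation\n"
    "Product: ABC Widget\n"
    "Target audience: Young professionals\n"
    "Goal: Increase brand awareness\n"
)
completion = (
    " Exciting news! XYZ Corporation is thrilled to introduce the "
    "revolutionary ABC Widget! ..."  # truncated for brevity
)

# Count prompt + completion tokens to stay within the 2048-token limit.
encoding = tiktoken.encoding_for_model("davinci")
total_tokens = len(encoding.encode(prompt)) + len(encoding.encode(completion))
print(f"Total tokens: {total_tokens}")
assert total_tokens <= 2048, "example exceeds the GPT-3 finetuning limit"

# Append the record as one JSONL line, ready for finetuning.
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```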

Remember to follow these guidelines to properly format your data for finetuning.