Evaluation Configuration

This document provides detailed information on how to configure the Evaluation options for analyzing the performance of different finetunes created for a specific version. The available configurations will vary based on the user's subscription plan.

Evaluation Types

We provide a list of evaluation types to choose from based on the finetune task. The configuration options displayed depend on the evaluation type you select.

  • Metric Name: Name of the metric. This metric will be used for hyperparameter tuning.
  • Metric Goal: Whether the metric should be minimized or maximized. This is not configurable for standard metrics, but it can be configured for custom metrics.

Some evaluation types have multiple metrics available. You can select the appropriate metric based on the task.

Tip

Evaluations are dependent on the quality of the evaluation datasets. Prepare high-quality evaluation datasets to get reliable evaluation results.

Binary Classification

For binary classification, the following metrics are available: f1_score, accuracy, precision, recall. You can select any of these metrics to be used for hyperparameter tuning.

  • Classification Positive Class: Name of the positive class. For example, if you are creating a finetune for sentiment prediction with two sentiments, positive and negative, you need to set this value as positive. If you are using 1 for positive and 0 for negative, you need to set this value as 1.

  • Classification Betas: Enter a list of F betas, for example: [1, 2, 3, 4]. With a beta of 1 (i.e., the F-1 score), precision and recall are given equal weight. A larger beta puts more weight on recall and less on precision, while a smaller beta puts more weight on precision and less on recall (see the sketch after this list).

    You can choose between adding a list of values or providing a range using the radio button.

    • If you choose the list: Enter the values in the text box and click the + button to add the values to the list. Added values will have an X button to remove that value.
    • If you choose the range: Enter the starting and ending values of the range in the two text boxes.
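
The effect of beta can be illustrated with scikit-learn's fbeta_score. This is an illustration only, not part of the platform; the platform computes these metrics internally on your evaluation datasets.

# Illustration only: how beta shifts the weight between precision and recall.
from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]

for beta in [0.5, 1, 2]:
    score = fbeta_score(y_true, y_pred, beta=beta, pos_label=1)
    print(f"F-{beta}: {score:.3f}")  # beta < 1 favors precision, beta > 1 favors recall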

MultiClass Classification

MultiClass Classification is used to evaluate the performance of a finetuned LLM on a multi-class classification task such as sentiment analysis or topic classification. Available metrics are: accuracy, precision_weighted, recall_weighted, f1_weighted, precision_macro, recall_macro, f1_macro, precision_micro, recall_micro, f1_micro.

  • Classification Number of Classes: Represents the number of classes (labels) in the given classification dataset. For example, if you are creating a finetune for sentiment prediction with sentiments positive, neutral, and negative, you need to set this value as 3. Enter the total number of classes in the input box.

  • Classification Betas: Same as in the Binary Classification section.
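
If you are unsure which averaging variant to pick, the difference between macro, micro, and weighted averaging can be illustrated with scikit-learn. This is an illustration only; the platform computes these metrics internally.

# Illustration only: macro, micro, and weighted F1 on a small 3-class example.
from sklearn.metrics import f1_score

y_true = ["positive", "neutral", "negative", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "neutral", "negative"]

for average in ["macro", "micro", "weighted"]:
    print(average, round(f1_score(y_true, y_pred, average=average), 3))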

Text Similarity

Text similarity is used to evaluate the similarity between two texts based on text embeddings of the generated text and the ground truth text. It is useful for tasks like generative AI and chatbots, where the text generated by the finetuned LLM should be similar to the ground truth text.
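
A common way to compare two embeddings is cosine similarity, as in the conceptual sketch below. The embedding vectors are placeholders and the scoring may not match the platform's exact implementation; the embedding model used by the platform is not specified in this document.

# Conceptual sketch only: cosine similarity between two text embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

generated_embedding = np.array([0.12, 0.80, 0.35])     # embedding of the generated text
ground_truth_embedding = np.array([0.10, 0.78, 0.40])  # embedding of the ground truth text
print(cosine_similarity(generated_embedding, ground_truth_embedding))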

Exact Match

Exact match is used to calculate the percentage of generated outputs that exactly match the ground truth text. It is useful for tasks like Question Answering, PII detection, etc.
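
The sketch below shows the idea: count the outputs that match exactly and report the percentage. The whitespace stripping here is illustrative only; the platform's exact normalization rules are not specified in this document.

# Conceptual sketch only: percentage of generated outputs that exactly match the ground truth.
generated = ["john@example.com", "Paris", "42"]
ground_truth = ["john@example.com", "paris", "42"]

matches = sum(g.strip() == t.strip() for g, t in zip(generated, ground_truth))
print(f"Exact match: {100 * matches / len(ground_truth):.1f}%")  # 66.7% ("Paris" != "paris")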

LLM as a Judge

LLM as a Judge is a method where we ask a strong LLM to evaluate the quality of the generated text. We provide the judge with the ground truth text and the generated text and ask it to rate the quality of the generated text.

  • Task: A description of the task that the judge uses when evaluating the quality of the generated text (see the example after this list).
  • Judge LLM: The name of the LLM to be used as the judge.
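
For example, a Task description for a code generation finetune might look like the following. This is only an illustration; write a description that matches your own task and evaluation criteria.

Example

Evaluate whether the generated SQL query answers the user's question and is
equivalent to the ground truth query. Penalize queries that are syntactically
invalid or that reference tables and columns not present in the provided schema.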

LLM as a Judge is useful for hard tasks that require complex reasoning, such as code generation, financial analysis, and agents (tool calling).

Summary Quality

Summary quality is used to evaluate the quality of summaries generated by finetuned LLMs. A custom-prompted LLM judge is used to evaluate the quality of the generated summary.

Custom

For cases where the built-in metrics are not sufficient for your task, you can create custom metrics for your finetune. You can set up an API endpoint that accepts the generated text and the ground truth text and returns a score.

Enter the config as a JSON object used to call the user-created API and fetch the custom metric values for the test and validation datasets. This JSON object should contain parameters for a Python requests library POST request.

Example

{
    "url" : "https://api.domain.com/Metrics"
}
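
As a rough sketch, the endpoint on your side could look like the following. The request and response field names (generated_text, ground_truth_text, score) are assumptions made for illustration only; confirm the exact schema expected by your deployment before relying on them.

# Minimal sketch of a custom-metric endpoint, assuming the platform POSTs the
# generated text and the ground truth text as JSON and expects a numeric score back.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/Metrics", methods=["POST"])
def metrics():
    payload = request.get_json()
    generated = payload["generated_text"]        # assumed field name
    ground_truth = payload["ground_truth_text"]  # assumed field name
    # Replace with your own scoring logic.
    score = 1.0 if generated.strip() == ground_truth.strip() else 0.0
    return jsonify({"score": score})             # assumed response shape

if __name__ == "__main__":
    app.run(port=8000)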

Generation Settings

You can set the generation settings in the JSON object below. These settings will be used to perform inference on the test and validation datasets when evaluating the finetuned LLM. This JSON object should contain OpenAI standard generation settings.

Example

{
    "temperature" : 0,
    "top_p" : 1,
    "frequency_penalty" : 0,
    "presence_penalty" : 0,
    "stop" : "\n"
}
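
Since these are standard OpenAI generation parameters, the example above corresponds to a chat completions call roughly like the sketch below. This is only a sketch of how the settings are applied; the client setup and model name are placeholders, and the platform runs this inference internally.

# Sketch only: OpenAI-standard generation settings applied to a chat completions call.
from openai import OpenAI

client = OpenAI()  # placeholder client setup; assumes an API key in the environment

response = client.chat.completions.create(
    model="my-finetuned-model",  # placeholder model name
    messages=[{"role": "user", "content": "Classify the sentiment: 'Great service!'"}],
    temperature=0,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop="\n",
)
print(response.choices[0].message.content)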

Dynamic Max Tokens Settings

For some tasks, it is useful to set a dynamic max_tokens value based on the prompt text. For example, in PII detection it is useful to set the token budget to roughly the same length as the prompt text.

You can configure this JSON object to set a dynamic max_tokens value for each prompt. If dynamic_max_tokens is true, dynamic_max_tokens_config should contain parameters for a Python requests library POST request. default_max_tokens will be used if there is an error in the API call that fetches the max_tokens value.

Example

{
    "dynamic_max_tokens": true,
    "dynamic_max_tokens_config": { 
        "url": "https://api.domain.com/DynamicMaxTokens" 
    },
    "default_max_tokens": 22 
}
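
As a rough sketch, the dynamic max_tokens endpoint on your side could look like the following. The request and response field names (prompt, max_tokens) and the word-count heuristic are assumptions made for illustration only; confirm the exact schema expected by your deployment.

# Minimal sketch of a dynamic max_tokens endpoint, assuming the platform POSTs the
# prompt text as JSON and expects a max_tokens value back.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/DynamicMaxTokens", methods=["POST"])
def dynamic_max_tokens():
    prompt = request.get_json()["prompt"]  # assumed field name
    # For PII-style tasks, allow roughly as many tokens as the prompt itself;
    # a crude word count stands in for a real tokenizer here.
    return jsonify({"max_tokens": max(len(prompt.split()), 1)})  # assumed response shape

if __name__ == "__main__":
    app.run(port=8001)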