Evaluation Configuration
This document provides detailed information on how to configure the Evaluation options for analyzing the performance of different finetunes created for a specific version. The available configurations will vary based on the user's subscription plan.
Evaluation Types
We provide a list of evaluation types you can choose from based on the finetune task. The configuration options displayed depend on the evaluation type selected.
- Metric Name: Name of the metric. This metric will be used for hyperparameter tuning.
- Metric Goal: Whether the metric needs to be minimized or maximized (not configurable for standard metrics, but configurable for custom metrics).
Some evaluation types have multiple metrics available. You can select the appropriate metric based on the task.
Tip
Evaluation results depend on the quality of the evaluation datasets. Prepare high-quality evaluation datasets to get the best evaluation results.
Binary Classification
For binary classification, the following metrics are available: f1_score, accuracy, precision, and recall. You can select any of these metrics to be used for hyperparameter tuning.
- Classification Positive Class: Name of the positive class. For example, if you are creating a finetune for sentiment prediction with two sentiments, positive and negative, you need to set this value as positive. If you are using 1 for positive and 0 for negative, you need to set this value as 1.
- Classification Betas: Enter a list of F betas, for example: [1, 2, 3, 4]. With a beta of 1 (i.e., the F-1 score), precision and recall are given the same weight. A larger beta puts more weight on recall and less on precision; a smaller beta puts more weight on precision and less on recall. See the sketch after this list for how beta affects the score.
You can choose between adding a list of values or providing a range using the radio button.
- If you choose the list: Enter the values in the text box and click the + button to add the values to the list. Added values will have an X button to remove that value.
- If you choose the range: Enter the starting and ending values of the range in the two text boxes.
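To illustrate how the beta value shifts the balance between precision and recall, here is a minimal sketch using scikit-learn. This is for explanation only; it assumes the platform computes F-beta in the standard way shown here.
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# beta < 1 favors precision, beta > 1 favors recall, beta = 1 weights them equally
for beta in (0.5, 1, 2):
    score = fbeta_score(y_true, y_pred, beta=beta)
    print(f"beta={beta}: F-beta={score:.3f} (precision={precision:.3f}, recall={recall:.3f})")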
MultiClass Classification
MultiClass Classification is used to evaluate the performance of a finetuned LLM for a multi-class classification task like sentiment analysis, topic classification, etc. Available metrics are: accuracy, precision_weighted, recall_weighted, f1_weighted, precision_macro, recall_macro, f1_macro, precision_micro, recall_micro, and f1_micro.
- Classification Number of Classes: Represents the number of classes (labels) in the given classification dataset. For example, if you are creating a finetune for sentiment prediction with sentiments positive, neutral, and negative, you need to set this value as 3. Enter the total number of classes in the input box.
- Classification Betas: Same as in the Binary Classification section.
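The weighted, macro, and micro variants differ only in how the per-class scores are averaged. The sketch below uses scikit-learn purely to illustrate that difference; it is not the platform's own evaluation code.
from sklearn.metrics import f1_score

y_true = ["positive", "positive", "neutral", "negative", "negative", "negative"]
y_pred = ["positive", "neutral", "neutral", "negative", "negative", "positive"]

# macro: unweighted mean over classes; weighted: mean weighted by class support;
# micro: computed globally over all predictions
for average in ("macro", "weighted", "micro"):
    print(f"f1_{average}: {f1_score(y_true, y_pred, average=average):.3f}")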
Text Similarity
Text similarity is used to evaluate the similarity between two texts based on text embeddings of the generated text and the ground truth text. It is useful for tasks such as generative AI and chatbots, where the text generated by the finetuned LLM should be similar to the ground truth text.
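Conceptually, the metric embeds both texts and compares the embeddings, typically with cosine similarity. The sketch below illustrates the idea with a placeholder embed function; the actual embedding model used by the platform is not specified here.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def text_similarity(generated: str, ground_truth: str, embed) -> float:
    # `embed` is a hypothetical helper: text -> 1-D embedding vector
    return cosine_similarity(embed(generated), embed(ground_truth))

# Toy usage with a dummy embedding; a real setup would use a sentence-embedding model
dummy_embed = lambda text: np.array([len(text), text.count(" ")], dtype=float)
print(text_similarity("the cat sat on the mat", "a cat sat on a mat", dummy_embed))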
Exact Match
Exact match is used to calculate the percentage of generated texts that match the ground truth text exactly. It is useful for tasks like Question Answering, PII detection, etc.
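As a rough sketch, the metric can be thought of as the fraction of evaluation examples whose generated text equals the ground truth. The whitespace stripping below is an assumption; the platform may normalize differently.
def exact_match_rate(generated_texts, ground_truths):
    # Percentage of examples where the generated text equals the ground truth
    matches = sum(
        gen.strip() == ref.strip()
        for gen, ref in zip(generated_texts, ground_truths)
    )
    return 100.0 * matches / len(ground_truths)

print(exact_match_rate(["John Doe", "jane@x.com"], ["John Doe", "Jane@x.com"]))  # 50.0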
LLM as a Judge
LLM as a Judge is a method in which a capable LLM is asked to evaluate the quality of the generated text. The judge LLM is given the ground truth text and the generated text and asked to assess the quality of the generated text.
- Task: A description of the task being performed. The judge LLM uses this description when evaluating the quality of the generated text.
- Judge LLM: The name of the LLM to be used as the judge. This model will perform the evaluation.
LLM as a Judge is useful for hard tasks that require complex reasoning. For tasks like code generation, financial analysis, and Agents (Tool Calling), LLM as a Judge is a good method for evaluating the quality of the generated text.
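For intuition, a judge prompt is typically assembled from the Task description, the ground truth text, and the generated text. The sketch below is only an illustration; the exact prompt and scoring scale used by the platform are not documented here.
def build_judge_prompt(task: str, ground_truth: str, generated: str) -> str:
    # Hypothetical prompt layout for the judge LLM
    return (
        f"You are evaluating outputs for the following task:\n{task}\n\n"
        f"Ground truth answer:\n{ground_truth}\n\n"
        f"Model-generated answer:\n{generated}\n\n"
        "Rate the quality of the model-generated answer on a scale of 1 to 10 "
        "and briefly justify the score."
    )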
Summary Quality
Summary quality is used to evaluate the quality of summaries generated by finetuned LLMs. A custom-prompted LLM judge is used to evaluate the quality of each summary.
Custom
For cases where the built-in metrics are not sufficient for your task, you can create custom metrics for your finetune. You can set up an API endpoint that accepts the generated text and the ground truth text and returns a score.
Enter the config as a JSON object to call the user-created API to get the custom metric values for the test and validation datasets. This JSON object should contain parameters for a Python requests library POST request.
Example
{
  "url": "https://api.domain.com/Metrics"
}
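For reference, the endpoint itself might look roughly like the FastAPI sketch below. The request and response field names (generated_text, ground_truth, score) and the token-overlap scoring are assumptions for illustration; align them with whatever payload the platform actually sends to your URL. Since the config object is passed as parameters to a requests POST call, it can also carry other request parameters, for example headers for authentication.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class MetricRequest(BaseModel):
    generated_text: str   # assumed field name
    ground_truth: str     # assumed field name

class MetricResponse(BaseModel):
    score: float

@app.post("/Metrics", response_model=MetricResponse)
def compute_metric(req: MetricRequest) -> MetricResponse:
    # Toy scoring logic: token-overlap ratio between generated and ground-truth text
    gen_tokens = set(req.generated_text.lower().split())
    ref_tokens = set(req.ground_truth.lower().split())
    score = len(gen_tokens & ref_tokens) / max(len(ref_tokens), 1)
    return MetricResponse(score=score)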
Generation Settings
You can set the generation settings in the JSON object below. These settings will be used to perform inference on the test and validation datasets when evaluating the finetuned LLM. This JSON object should contain standard OpenAI generation settings.
Example
{
  "temperature": 0,
  "top_p": 1,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "stop": "\n"
}
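For intuition, these settings correspond to the parameters of an OpenAI-style chat completion call, roughly as in the sketch below. The model name and prompt are placeholders, and the platform's actual inference code is not shown here.
from openai import OpenAI

generation_settings = {
    "temperature": 0,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "stop": "\n",
}

client = OpenAI()  # assumes an API key is configured in the environment
response = client.chat.completions.create(
    model="my-finetuned-model",  # placeholder model name
    messages=[{"role": "user", "content": "Example evaluation prompt"}],
    **generation_settings,
)
print(response.choices[0].message.content)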
Dynamic Max Tokens Settings
For some tasks, it is useful to set a dynamic max_tokens value based on the prompt text. For example, in PII detection, setting the token length to be the same as the prompt text length is useful.
You can configure this JSON object to set a dynamic max_tokens value for each prompt. If dynamic_max_tokens is true, dynamic_max_tokens_config should contain parameters for a Python requests library POST request. default_max_tokens will be used if there is an error in the API call to get the max_tokens value.
Example
{
  "dynamic_max_tokens": true,
  "dynamic_max_tokens_config": {
    "url": "https://api.domain.com/DynamicMaxTokens"
  },
  "default_max_tokens": 22
}
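The dynamic max_tokens endpoint might look roughly like the FastAPI sketch below, which allows about as many tokens as the prompt contains (a PII-detection-style heuristic). The request and response field names (prompt, max_tokens) are assumptions for illustration; align them with the payload the platform actually sends.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class MaxTokensRequest(BaseModel):
    prompt: str          # assumed field name

class MaxTokensResponse(BaseModel):
    max_tokens: int

@app.post("/DynamicMaxTokens", response_model=MaxTokensResponse)
def dynamic_max_tokens(req: MaxTokensRequest) -> MaxTokensResponse:
    # Crude token estimate: one token per whitespace-separated word in the prompt
    estimated_prompt_tokens = len(req.prompt.split())
    return MaxTokensResponse(max_tokens=max(estimated_prompt_tokens, 1))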