Multimodal
Many current LLMs are multimodal: in addition to text, they can accept other data types such as images, audio, and documents. EasyLLM supports multimodal finetuning, so you can finetune a multimodal LLM with high-quality data to improve its performance on a specific multimodal task.
Tip
The main intuition behind multimodal finetuning is that the inputs to the LLM can be of different data types, but the output is always text. Currently, the supported multimodal LLMs do not generate any data type other than text.
Data Preparation
Single-Turn
You can create a dataset in simple Excel or CSV format, as in the Data Preparation for Classification guide. For multimodal finetuning, you additionally need a "files" column for the multimodal data. Create a zip file containing the multimodal data files and upload it along with the dataset; in the "files" column, simply mention the filenames that are available in the zip file. You can also use web links and base64-encoded data (see the sketch after the table below).
Example:
| prompt | completion | files |
|---|---|---|
| Describe the given image | drawing in black and white of Egypt pyramids with the caption "PYRAMID" | 0.png |
| Describe the given images | drawing of a man with a telescope looking at stars with the title "GALILEO" and picture of a black and white owl with yellow eyes and title "MODERN BIOLOGY" | 1.png,2.png |
| Describe the given image | image of blue background with the title "MODERN PHYSICS" | https://www.assetscdn.com/3.png |
| Describe the given image | image of blue chemical elements with the title "MODERN CHEMISTRY" | data:image/jpeg;base64,{base64_image} |
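If you prefer to prepare these values programmatically, here is a minimal Python sketch of one way to build the zip file and a base64 data URL for the "files" column (the zip name and file names are placeholders, not EasyLLM requirements):

import base64
import zipfile

# Bundle the local files referenced in the "files" column into a single zip,
# which is then uploaded along with the dataset.
with zipfile.ZipFile("multimodal_files.zip", "w") as zf:
    for name in ["0.png", "1.png", "2.png"]:
        zf.write(name)

# Alternatively, embed a file directly in the "files" column as a base64 data URL.
with open("4.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{base64_image}"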
Multi-Turn
You can follow the chat format from the Data Preparation for Chat guide. For multimodal finetuning, you include the multimodal data directly in the chat history.
Example:
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this image?"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,{base64_image}"}
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "Image of blue chemical elements with the title 'MODERN CHEMISTRY'"
      }
    ]
  }
]
You can include multiple multimodal files in a single user message. For example, you can provide several images and ask the LLM to describe each one or the differences between them.
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Explain the given images one by one"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,{base64_image}"}
      },
      {
        "type": "image_url",
        "image_url": {"url": "https://www.assetscdn.com/3.png"}
      },
      {
        "type": "image_url",
        "image_url": {"url": "https://www.assetscdn.com/1.png"}
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "The first is an image of blue chemical elements with the title 'MODERN CHEMISTRY'. The second is an image of a blue background with the title 'MODERN PHYSICS', and the third is a drawing of a man with a telescope looking at stars with the title 'GALILEO'."
      }
    ]
  }
]
Note
The multi-turn examples above each show only a single training example. You need to build a dataset with many such examples in .jsonl format, one example per line.
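As a rough sketch of how to produce that file, you can assemble the examples in Python and write one JSON object per line. The message-list-per-line layout below is an assumption for illustration; check the Data Preparation for Chat guide for the exact schema EasyLLM expects:

import json

# Assumed layout: each .jsonl line holds one training example as a list of
# chat messages like the samples above (verify against the chat guide).
examples = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://www.assetscdn.com/3.png"}},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "Image of blue background with the title 'MODERN PHYSICS'"}],
        },
    ],
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")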
Supported File Types
EasyLLM supports images (.png, .jpeg, .jpg, .webp, .gif), audio (.mp3), and documents (.pdf, .txt). For base64 encoding, use image/{file_type} for images, audio/mpeg or audio/mp3 for audio, application/pdf for PDFs, and text/plain for text files.
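To pick the right MIME prefix for each supported file type programmatically, a small helper such as the hypothetical to_data_url below can be used (a sketch, not part of EasyLLM):

import base64
from pathlib import Path

# MIME prefixes for the supported file types listed above.
MIME_TYPES = {
    ".png": "image/png",
    ".jpeg": "image/jpeg",
    ".jpg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
    ".mp3": "audio/mpeg",  # audio/mp3 also works, per the note above
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}

def to_data_url(path: str) -> str:
    """Encode a local file as a base64 data URL with the matching MIME type."""
    p = Path(path)
    encoded = base64.b64encode(p.read_bytes()).decode("utf-8")
    return f"data:{MIME_TYPES[p.suffix.lower()]};base64,{encoded}"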
Example:
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Extract the table from the given pdf"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:application/pdf;base64,{base64_pdf}"}
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "The table extracted from the given pdf is as follows:\n {table}"
      }
    ]
  }
]
Tip
You can do video understanding with images. Refer to the Video Understanding guide for more details.
Supported LLMs
- OpenAI: Images
- GCP Vertex AI - Gemini: Images, Audio, Documents
- AWS Bedrock - Nova Lite and Pro: Images