Multimodal

Many current LLMs are multimodal: apart from text, they can accept other data types such as images, audio, and documents. EasyLLM supports multimodal finetuning, so you can finetune multimodal LLMs with better data to improve performance on a specific multimodal task.

Tip

The main intuition behind multimodal finetuning is that the inputs to the LLM can be of different data types, but the output is always text. The currently supported multimodal LLMs do not generate any data type other than text.

Data Preparation

Single-Turn

You can create a dataset in simple Excel or CSV format, as you did in the Data Preparation for Classification guide. For multimodal finetuning, however, you need to add an additional "files" column for the multimodal data. Create a zip file with the multimodal data files and upload it along with the dataset; then mention the filenames from the zip file in the "files" column. You can also use web links and base64-encoded data.

Example:

| prompt | completion | files |
| --- | --- | --- |
| Describe the given image | drawing in black and white of Egypt pyramids with the caption "PYRAMID" | 0.png |
| Describe the given images | drawing of a man with a telescope looking at stars with the title "GALILEO" and picture of a black and white owl with yellow eyes and title "MODERN BIOLOGY" | 1.png,2.png |
| Describe the given image | image of blue background with the title "MODERN PHYSICS" | https://www.assetscdn.com/3.png |
| Describe the given image | image of blue chemical elements with the title "MODERN CHEMISTRY" | data:image/jpeg;base64,{base64_image} |
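
As a rough sketch, you could prepare the CSV and the accompanying zip file with a short Python script. The row contents below come from the table above; the output file names (dataset.csv, files.zip) are hypothetical:

import csv
import zipfile

# Rows for the single-turn dataset; the "files" column holds comma-separated
# filenames from the zip, web links, or base64 data URLs.
rows = [
    {
        "prompt": "Describe the given image",
        "completion": "drawing in black and white of Egypt pyramids with the caption \"PYRAMID\"",
        "files": "0.png",
    },
    {
        "prompt": "Describe the given image",
        "completion": "image of blue background with the title \"MODERN PHYSICS\"",
        "files": "https://www.assetscdn.com/3.png",
    },
]

# Write the CSV with the additional "files" column.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "completion", "files"])
    writer.writeheader()
    writer.writerows(rows)

# Zip only the local files referenced in the "files" column; web links and
# base64 data URLs do not need to be included in the zip.
with zipfile.ZipFile("files.zip", "w") as zf:
    zf.write("0.png")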

Multi-Turn

You can follow the chat format from the Data Preparation for Chat guide. For multimodal finetuning, however, you include the multimodal data in the chat history.

Example:

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{base64_image}"}
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The image shows blue chemical elements with the title 'MODERN CHEMISTRY'"
            }
        ]
    }
]

You can include multiple multimodal files in a single user message. For example, you can provide several images and ask the LLM to describe them one by one, or to describe the differences between them.

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Explain the given images one by one"
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{base64_image}"}
            },
            {
                "type": "image_url",
                "image_url": {"url": "https://www.assetscdn.com/3.png"}
            },
            {
                "type": "image_url",
                "image_url": {"url": "https://www.assetscdn.com/1.png"}
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The first one is an image of blue chemical elements with the title 'MODERN CHEMISTRY'. The second one is an image of a blue background with the title 'MODERN PHYSICS', and the third one is a drawing of a man with a telescope looking at stars with the title 'GALILEO'"
            }
        ]
    }
]
Note

Each of the multi-turn examples above is a single training example. You need to create a dataset with many such examples in .jsonl format.
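
As a minimal sketch, each line of the .jsonl file holds one JSON-encoded list of messages. Assuming the examples are built as Python lists shaped like the ones above (train.jsonl is a hypothetical file name):

import json

# Each element is one multi-turn training example (a list of messages).
examples = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://www.assetscdn.com/3.png"}},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "An image of a blue background with the title 'MODERN PHYSICS'"},
            ],
        },
    ],
    # ... add many more examples here
]

# Write one JSON object (here, a list of messages) per line.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")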

Supported File Types

EasyLLM supports images (.png, .jpeg, .jpg, .webp, .gif), audio (.mp3), and documents (.pdf, .txt). For base64 encoding, use image/{file_type} for images, audio/mpeg or audio/mp3 for audio, application/pdf for PDFs, and text/plain for text files.
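
As a sketch of building a base64 data URL for a local file using the media types listed above (the extension-to-MIME-type mapping and the helper name are assumptions for illustration):

import base64
from pathlib import Path

# Assumed mapping from file extension to the MIME types listed above.
MIME_TYPES = {
    ".png": "image/png",
    ".jpeg": "image/jpeg",
    ".jpg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
    ".mp3": "audio/mpeg",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}

def to_data_url(path: str) -> str:
    """Encode a file as a data URL usable in the "files" column or chat history."""
    p = Path(path)
    encoded = base64.b64encode(p.read_bytes()).decode("ascii")
    return f"data:{MIME_TYPES[p.suffix.lower()]};base64,{encoded}"

# e.g. to_data_url("report.pdf") -> "data:application/pdf;base64,JVBERi0x..."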

Example:

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Extract the table from the given pdf"
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:application/pdf;base64,{base64_pdf}"}
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The table extracted from the given pdf is as follows:\n {table}"
            }
        ]
    }
]
Tip

You can do video understanding with images. Refer to the Video Understanding guide for more details.

Supported LLMs

  • OpenAI: Images
  • GCP Vertex AI - Gemini: Images, Audio, Documents
  • AWS Bedrock - Nova Lite and Pro: Images