[ML Story] Fine-tune Vision Language Model on custom dataset

Nitin Tiwari
Google Developer Experts
9 min read · Apr 27, 2024


Introduction

We’re living in the era of LLMs, and almost every week, you’ll hear a new language model making its way out. From Google’s Gemini and Gemma models to Meta’s latest Llama 3 and Microsoft’s tiny Phi-3 models, a fierce online competition is underway among these industry giants to grab the top spot.

Amidst this flurry of activity, what stands out most is the willingness of these tech giants to open up some of these language models to the developer community.

What are the benefits of making models open?
Opening up models to the developer community brings several advantages, including developers fine-tuning these models for specific use cases to solve interesting problems.

And if you’re an LLM nerd, I’m sure you’ve already attempted to fine-tune at least one of these open models to explore its capabilities. Yet among all these popular models, it’s rare to find one that is both open and multimodal at the same time.

One such hidden gem I recently explored is the Idefics2-8B Vision Language Model built by Hugging Face 🤗. It’s open and multimodal, accepting interleaved sequences of image and text inputs. The Idefics2 model can answer questions about images, describe visual content, create stories from multiple images, and a lot more.

I had been searching for articles and tutorials that detail the steps for fine-tuning Vision Language Models on custom datasets. While most covered the fine-tuning process (on already available datasets), they often overlooked the crucial step of data preparation.

This blog brings exactly that to you: a comprehensive guide to fine-tuning the Idefics2 model, where you’ll not only learn how to fine-tune a Vision Language Model but also how to prepare your own custom dataset from scratch.

But before we start, let’s quickly understand the high-level architecture of Vision Language Models and how they differ from standard LLMs.

Vision Language Models


Vision Language Models are multimodal models that learn from images and text, generating text outputs from image and text inputs. They excel in zero-shot capabilities, generalization, and various tasks like image recognition, question answering, and document understanding.

These models can also capture spatial properties and output bounding boxes or segmentation masks for specific subjects. Learn more about VLMs and their architecture here.

Now that you have a basic understanding of VLMs, it’s showtime. Let’s dive into the code.

Step 1: Data Preparation

Unlike LLMs, the dataset format for VLMs is slightly different due to the introduction of images and videos in addition to the standard textual data.

Today, we’ll fine-tune the Idefics2 model on images of documents for visual question answering. Our training data is derived from a sub-sampled version of the DocVQA dataset, with slight modifications so that we can recreate the entire dataset from scratch.

Clone the following repository.

!git clone https://github.com/NSTiwari/Fine-tune-IDEFICS-Vision-Language-Model

Within the repository, you’ll find a dataset folder that has an images sub-folder. This sub-folder contains both the training and testing sets of images: precisely, 1,000 images for training and 200 images for testing.

Here’s a glimpse of what the data looks like:

Sub-sampled DocVQA image dataset

The images are black and white scanned copies of various documents like news articles, emails, invoices, reports, advertisements, and more.

In addition to the images, there’s a qa_text.csv file in the repository that contains details about all the images. Since we’re working on a question-answering task, the CSV file contains the questions (queries) and their corresponding answers for each image.
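Before stitching the images and text together, it’s worth taking a quick peek at the CSV with pandas. This is just an optional sanity check; it assumes the same file path and column names (id, query, answers) that the preparation script below relies on.

import pandas as pd

# Optional: preview the Q&A annotations and confirm the expected columns.
qa_preview = pd.read_csv('/content/Fine-tune-IDEFICS-Vision-Language-Model/dataset/qa_text.csv')
print(qa_preview.columns.tolist())   # should include 'id', 'query', 'answers'
print(qa_preview.head())
print(len(qa_preview))               # 1200 rows: 1000 train + 200 test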

Sub-sampled DocVQA CSV dataset

Now that we have both the images and their associated content stored separately, let’s merge them to create a dataset formatted for training the model.

Install the following libraries:

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q accelerate datasets peft bitsandbytes
from datasets import Dataset, DatasetDict, Image
import pandas as pd
import os

# Define train and test size.
TRAIN_SAMPLES = 1000
TEST_SAMPLES = 200
TEST_SIZE = 0.166  # 200 test / 1200 total ≈ 0.166

# Define the directory containing the images.
train_images_directory = '/content/Fine-tune-IDEFICS-Vision-Language-Model/dataset/images/train/'
test_images_directory = '/content/Fine-tune-IDEFICS-Vision-Language-Model/dataset/images/test/'

# Read the CSV Q&A text.
qa_text = pd.read_csv('/content/Fine-tune-IDEFICS-Vision-Language-Model/dataset/qa_text.csv')

# Create a list of image paths.
train_image_paths = [os.path.join(train_images_directory, f'train_{i}.jpg') for i in range(TRAIN_SAMPLES)]
test_image_paths = [os.path.join(test_images_directory, f'test_{i}.jpg') for i in range(TEST_SAMPLES)]
image_paths = train_image_paths + test_image_paths

# Create a list of other columns such as id, query, and answer.
ids = qa_text['id'].tolist()
queries = qa_text['query'].tolist()
answers = qa_text['answers'].tolist()

# Create the dataset dictionary.
dataset_dict = {
    'id': ids,
    'image': image_paths,
    'query': queries,
    'answers': answers
}

# Create the dataset.
dataset = Dataset.from_dict(dataset_dict)

# Cast the 'image' column to Image type.
dataset = dataset.cast_column("image", Image())

# Split the dataset into train and test.
split_dataset = dataset.train_test_split(test_size=TEST_SIZE, shuffle=False)

# Push the dataset on Hugging Face Hub.
split_dataset.push_to_hub("NSTiwari/DocumentIDEFICS_QA")

The above script combines the images with the textual queries and answers to create a unified dataset, which is subsequently uploaded to Hugging Face 🤗. You can access the dataset through this link.

DocumentIDEFICS_QA dataset on Hugging Face

Well done on creating your custom dataset. It was indeed straightforward, wasn’t it? If you wish to create a dataset for another use-case, simply follow the same steps as before.

Step 2: Load Dataset

Now that we have the dataset prepared, let’s proceed to load it.

from datasets import load_dataset

train_dataset = load_dataset("NSTiwari/DocumentIDEFICS_QA", split="train")
eval_dataset = load_dataset("NSTiwari/DocumentIDEFICS_QA", split="test")

Inspect the train data.

print(train_dataset[0])
train_dataset[0]['image']

This is how a record in the training dataset appears. The image is embedded as a JpegImageFile within the dataset.

{
  'id': 'train_0',
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=L size=1695x2025>,
  'query': 'what is the date mentioned in this letter?',
  'answers': "['1/8/93']"
}
train_0.jpg
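One detail worth noticing in the record above: the answers field is stored as the string representation of a list (it came straight from the CSV), not as an actual Python list. If you need the individual answers, you can parse the string back, for example with ast.literal_eval; the data collator in Step 4 does the same.

import ast

# 'answers' is stored as a string like "['1/8/93']"; parse it back into a list.
answers = ast.literal_eval(train_dataset[0]['answers'])
print(answers)      # ['1/8/93']
print(answers[0])   # '1/8/93'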

Step 3: Configure LoRA adapters

Training or fine-tuning a language model is a herculean task in itself and is practically impossible unless you own high-end GPUs with plenty of memory.

With the introduction of Parameter-Efficient Fine-Tuning (PEFT), there’s no need to stress about training the entire language model anymore.

Thanks to techniques like LoRA and QLoRA, you can efficiently fine-tune these large models by significantly reducing the number of trainable parameters. This not only accelerates the fine-tuning process but also conserves memory usage.

import torch
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration

DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = True

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        _attn_implementation="flash_attention_2",  # Need GPUs like A100 or H100.
    ).to(DEVICE)

The above chunk of code configures the LoRA adapters for the Idefics2-8B model.

QLoRA, short for Quantized LoRA, is an enhanced iteration of LoRA. As the name suggests, QLoRA quantizes the base model’s frozen weights, compressing each parameter from 32-bit (or 16-bit) precision down to a 4-bit format, while the small LoRA adapters remain trainable.

This sharp reduction in memory demands for the LLM enables easy fine-tuning, particularly beneficial when hardware resources are limited.
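If you’re curious how much of the model actually remains trainable after attaching the adapter, here’s a quick optional check that counts the parameters requiring gradients; only the LoRA layers should show up.

# Optional: count trainable vs. total parameters after adding the LoRA adapter.
# With 4-bit quantization the packed base weights make the total approximate,
# but the trainable count (just the LoRA layers) is exact.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")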

Step 4: Create Data Collator

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

import ast
import random

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["image"]
            # 'query' is a plain string in our custom dataset (the original DocVQA
            # dataset wraps it as {'en': ...}), so handle both cases.
            query = example["query"]
            question = query["en"] if isinstance(query, dict) else query
            # 'answers' is stored as the string representation of a list
            # (e.g. "['1/8/93']"), so parse it back before sampling one answer.
            answers = example["answers"]
            if isinstance(answers, str):
                answers = ast.literal_eval(answers)
            answer = random.choice(answers)
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Answer briefly."},
                        {"type": "image"},
                        {"type": "text", "text": question}
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
            images.append([image])

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels

        return batch

data_collator = MyDataCollator(processor)
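Before handing the collator to the Trainer, it’s a good idea to run it on a couple of samples and confirm the batch looks right. This is just an optional sanity check, not part of the training pipeline.

# Optional: collate two training samples and inspect the resulting tensors.
sample_batch = data_collator([train_dataset[0], train_dataset[1]])
print(sample_batch["input_ids"].shape)     # (2, sequence_length)
print(sample_batch["pixel_values"].shape)  # batched image tensor from the processor
print(sample_batch["labels"].shape)        # same shape as input_ids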

Step 5: Setup training parameters

Configure the hyperparameters to train the model.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="IDEFICS_DocVQA",
    learning_rate=2e-4,
    fp16=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    dataloader_pin_memory=False,
    save_total_limit=3,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=10,
    save_steps=25,
    max_steps=25,
    logging_steps=5,
    remove_unused_columns=False,
    push_to_hub=False,
    label_names=["labels"],
    load_best_model_at_end=False,
    report_to="none",
    optim="paged_adamw_8bit",
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
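One thing worth noting about these hyperparameters: the effective batch size is the per-device batch size multiplied by the gradient accumulation steps, so 25 steps only covers a fraction of the 1,000 training images. A quick back-of-the-envelope check:

# Effective batch size and training coverage with the settings above.
effective_batch_size = 2 * 8                  # per_device_train_batch_size * gradient_accumulation_steps = 16
examples_seen = effective_batch_size * 25     # max_steps = 25 -> 400 examples
fraction_of_train_set = examples_seen / 1000  # ~0.4 of the 1,000 training images
print(effective_batch_size, examples_seen, fraction_of_train_set)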

Step 6: Start Training

trainer.train()

It took me about an hour to fine-tune the model for 25 steps on a T4 GPU in Google Colab. The more training steps and examples you use, the better the results tend to be; given the T4’s limitations, 25 steps was the most I could manage.

Training Results
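If you’d like to look at the loss values once training finishes (for example, to plot a learning curve), the Trainer keeps everything it logged in its state. A small optional snippet:

# Optional: print the training/eval losses recorded every `logging_steps`.
for entry in trainer.state.log_history:
    if "loss" in entry or "eval_loss" in entry:
        print(entry)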

Step 7: Evaluate the model

Now, it’s time to evaluate the model on a test example.

test_example = eval_dataset[0]
test_example["image"]
test_0.jpg
model.eval()

image = test_example["image"]
query = test_example["query"]


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": query}
        ]
    }
]


text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text.strip()], images=[image], return_tensors="pt", padding=True)
# Move the inputs onto the same device as the model before generating.
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)
print(generated_texts)

Question: What the location address of NSDA?
Answer: [‘1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036’, ‘1128 sixteenth st., N. W., washington, D. C. 20036’]

Address of NSDA

That’s fantastic — the model has provided an accurate answer to the question posed from the test image.
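If you want to spot-check a few more test examples instead of just one, you can wrap the same steps in a small loop. This is only an illustrative sketch that reuses the code above; a proper DocVQA evaluation would use the ANLS metric instead of eyeballing the outputs.

# Illustrative only: run the fine-tuned model on the first few test examples.
for idx in range(3):
    example = eval_dataset[idx]
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": example["query"]}
        ]
    }]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=[example["image"]], return_tensors="pt", padding=True)
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    generated_ids = model.generate(**inputs, max_new_tokens=64)
    prediction = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)[0]
    print(f"Q: {example['query']}\nPredicted: {prediction}\nGround truth: {example['answers']}\n")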

Step 8: Push the fine-tuned model on Hugging Face 🤗

It’s a good idea to push the model to the Hugging Face Hub if you want to access it in the future without having to retrain it.

# Login to your HF account.
from huggingface_hub import notebook_login
notebook_login()

from huggingface_hub import whoami, create_repo, upload_folder

# Output directory and target repository.
output_dir = "IDEFICS_DocVQA"
repo_name = "IDEFICS2-DocVQA-fine-tuned"

# notebook_login() has already stored the token, so whoami() can pick it up.
username = whoami()["name"]
repo_id = f"{username}/{repo_name}"

repo_id = create_repo(repo_id, exist_ok=True).repo_id

upload_folder(
    repo_id=repo_id,
    folder_path=output_dir,
    commit_message="Pushed the IDEFICS2 fine-tuned model.",
    ignore_patterns=["step_*", "epoch_*"],
)
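Later, to reuse the fine-tuned model without retraining, you can load the base Idefics2 model and attach the uploaded adapter with PEFT. This is a rough sketch that assumes the adapter files (adapter_config.json and the adapter weights) ended up at the root of the uploaded repository; adjust the repo ID and path if they live in a checkpoint sub-folder.

import torch
from peft import PeftModel
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

# Load the base model and processor, then attach the fine-tuned LoRA adapter.
# Replace the repo ID placeholder with your own username/repository.
base_model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = PeftModel.from_pretrained(base_model, "your-username/IDEFICS2-DocVQA-fine-tuned")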

That’s all, folks! Congratulations on successfully fine-tuning the Idefics2-8B model on a custom dataset. I hope you gained valuable insights into creating an image-text dataset from scratch and fine-tuning a Vision Language Model.

You can find the complete Colab notebook and the raw dataset on this GitHub repository. If you found the blog helpful, please consider giving the repository a ⭐.

I can’t wait to see what you’ll create by fine-tuning Idefics2-8B. If you have any questions or want to share something incredible that you’ve built, feel free to connect with me on LinkedIn.

Thank you for reading the blog. Stay tuned for more fascinating projects and articles headed your way.
