Fine-Tune Mistral v0.3 with ORPO and Unsloth

Low-Rank Adapter Model Fine-tuning

Post-Training LLM
Author: Matthias De Paolis

Published: July 30, 2024


The field of artificial intelligence and machine learning is marked by constant innovation, with new tools and methodologies emerging to expand the horizons of what these technologies can achieve. Recently, significant upgrades were introduced in popular AI model series, enhancing their capabilities and setting new benchmarks in AI development.

However, to fully leverage the potential of these advanced models, it’s essential to employ sophisticated fine-tuning techniques like ORPO (Odds Ratio Preference Optimization) and Unsloth. ORPO simplifies the alignment process by integrating preference optimization directly into the training phase, eliminating the need for a separate alignment step. Unsloth, on the other hand, offers groundbreaking advancements in training efficiency, significantly speeding up the process while reducing memory consumption without compromising accuracy.

In this article, we will explore how to fine-tune Mistral v0.3 using ORPO and Unsloth, demonstrating how these techniques can enhance model performance and efficiency. By understanding and applying these methods, you can unlock new levels of capability and efficiency in your AI projects. The code for this process can be found on Google Colab and in the LLM Tutorial on GitHub.

ORPO

Instruction tuning and preference alignment are crucial for customizing Large Language Models (LLMs) for specific tasks. This typically involves a multi-step process: first, Supervised Fine-Tuning (SFT) on instructions to tailor the model to the desired domain, and second, applying preference alignment techniques such as Reinforcement Learning with Human Feedback (RLHF) or Direct Preference Optimization (DPO) to enhance the probability of producing preferred responses over less desirable ones. Researchers have found that although SFT adjusts the model to the target domain, it also raises the chances of producing both unwanted and desired answers. Therefore, the preference alignment stage is essential to enlarge the disparity between the probabilities of accepted and rejected outputs.

Image

Hong and Lee (2024) introduced ORPO (Odds Ratio Preference Optimization), a groundbreaking method that aligns the language model without a reference model in a single-step manner by assigning a weak penalty to the rejected responses and a strong adaptation signal to the chosen responses with a simple log odds ratio term appended to the negative log-likelihood loss.

Image

This approach enhances the traditional language modeling objective by integrating the negative log-likelihood (NLL) loss with an odds ratio (OR) component. The OR loss imposes a slight penalty on disfavored responses while significantly rewarding favored ones, enabling the model to concurrently master the target task and align with human preferences. The objective function for ORPO is defined as follows:

\mathscr{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}[\mathscr{L}_{SFT} + \lambda \cdot \mathscr{L}_{OR}]

In this formula, the SFT term is the conventional supervised fine-tuning (negative log-likelihood) loss, the OR term is the odds ratio loss, and λ is a weighting factor that balances the two components. This integration ensures that the model adapts effectively to the desired domain while minimizing the generation of undesired outputs.
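
To make the objective concrete, here is a minimal PyTorch sketch of this loss. It assumes that chosen_logps and rejected_logps hold the per-token-averaged log-probabilities of the chosen and rejected responses under the model; the function name and inputs are illustrative, not the exact trl implementation.

import torch
import torch.nn.functional as F

def orpo_loss(nll_loss, chosen_logps, rejected_logps, lam=0.2):
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected responses (strictly negative values).
    # log odds(y) = log p - log(1 - p), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # L_OR = -log sigmoid(log odds ratio): pushes the odds of the chosen
    # response above the odds of the rejected one.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

    # L_ORPO = L_SFT + lambda * L_OR
    return nll_loss + lam * or_loss

In trl, this λ is exposed as the beta parameter of the ORPOConfig, which we set later in this article.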

Unsloth

Unsloth is a fine-tuning framework designed to accelerate the training of large language models (LLMs) like Llama and Mistral, while drastically reducing memory usage. It achieves this through several optimizations:

  1. Manual Derivation and Handwritten GPU Kernels: Unsloth optimizes computational steps by manually deriving and handwriting GPU kernels, bypassing inefficiencies in general-purpose libraries.

  2. Quantization Techniques: Support for 4-bit quantization (QLoRA) and 16-bit LoRA fine-tuning reduces memory requirements without compromising model accuracy.

  3. Optimized Attention Mechanisms: Integrating Flash Attention v2 for faster attention calculations and reduced memory usage.

  4. Enhanced Memory Management: Efficient memory allocation and data transfer processes optimize VRAM usage.

Unsloth can make training up to 2 times faster on a single GPU and reduce memory usage by up to 60% without degrading accuracy. It supports diverse fine-tuning use cases, including instruction fine-tuning and Direct Preference Optimization (DPO).

Fine-Tuning Mistral v0.3 with ORPO and Unsloth

In this example, we will QLoRA fine-tune the Mistral v0.3 7B model using ORPO and the Unsloth framework. ORPO requires a preference dataset that includes a prompt, a chosen answer, and a rejected answer. To this end, we will use llmat/dpo-orpo-mix-38k-balanced, a dataset that merges high-quality DPO datasets and has been further balanced using a clustering-based approach.

Let’s start by installing the required libraries:

!pip install python-dotenv
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Now let’s log in to our W&B workspace:

import wandb
import os
import dotenv

dotenv.load_dotenv()
%env WANDB_NOTEBOOK_NAME=Fine_tune_Mistral_with_ORPO
wandb.login(key=os.environ["WANDB_API_KEY"])

Load the Model and Tokenizer for LoRA

In the following, we will load the Mistral 7B v0.3 model in 4-bit precision.

cache_dir = './model'
model_id = 'mistralai/Mistral-7B-v0.3'
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # maximum sequence length used for training
dtype = None # None = auto-detect (bfloat16 on Ampere+ GPUs, float16 otherwise)
load_in_4bit = True # load the base model in 4-bit precision (QLoRA)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

Loading Checks

After loading the model, it’s crucial to ensure that all parameters are correctly placed on the GPU and that none are overflowing onto the CPU. This can be particularly important for large models where memory management is critical.

To verify the placement of the model’s parameters, you can iterate through the model’s named parameters and check their device type. If any parameter reports the ‘meta’ device type (which means its weights were offloaded rather than materialized on the GPU), it will be printed out. This ensures that your model is fully utilizing the GPU and avoids potential performance bottlenecks.

Here is the code to perform this check:

# Check that no parameters were offloaded (they would show up on the 'meta' device).
for n, p in model.named_parameters():
    if p.device.type=='meta':
        print(f"{n} is on meta!")

Setting Up LoRA Fine-Tuning

To prepare your model for LoRA (Low-Rank Adaptation) fine-tuning, you need to configure it properly. This involves setting up the LoRA configuration. Here’s a brief overview of the parameter settings:

  1. r: This parameter controls the rank of the low-rank adaptation matrices. It’s suggested to choose a value greater than 0, with common choices being 8, 16, 32, 64, or 128. The best setting depends on the specific use case and computational resources, but a good starting point is 8 or 16.

  2. lora_alpha: This parameter scales the magnitude of the LoRA update. A higher value can lead to more significant changes in the model’s behavior. In our example we are setting lora_alpha to 32.

  3. target_modules: This list specifies which modules in the model should be fine-tuned. The settings include key modules like "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", and "down_proj". If the task involves chat fine-tuning, it’s also beneficial to set "lm_head" (language model head) as trainable.

  4. use_gradient_checkpointing: This parameter activates gradient checkpointing to conserve memory. Setting it to "unsloth" uses Unsloth’s optimized implementation, which offloads intermediate activations to save VRAM.

  5. random_state: This parameter sets the seed for random number generation, ensuring reproducibility. The best setting is any integer value; in the code, it’s set to 3407.

  6. use_rslora: This parameter activates RSLoRA, which adjusts the scaling factor of LoRA adapters to be proportional to 1/√r instead of 1/r. This adjustment enhances the stability of learning, particularly for higher adapter ranks, and improves fine-tuning performance as the rank increases.

These settings provide a good starting point for fine-tuning a language model using PEFT. However, the optimal settings may vary depending on the specific task and dataset, so some experimentation may be necessary.

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    lora_alpha = 32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head", # Language model head - best to set this trainable if chat fine-tuning
    ],
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
)

Set up Tokenizer and Padding

Before starting the fine-tuning process, it’s essential to configure the tokenizer and set up padding correctly. This ensures that the model can handle input sequences efficiently and that special tokens are properly managed.

Inspect the Tokenizer

Print out the tokenizer details, including the vocabulary size, beginning-of-sequence (BOS) token, end-of-sequence (EOS) token, and chat template.

print(tokenizer)
print(tokenizer.vocab_size)
print(tokenizer.bos_token)
print(tokenizer.eos_token)
print(tokenizer.chat_template)

Customize Chat Template

When working with Llama/Mistral models, it’s sometimes necessary to customize the chat template to ensure the conversation is formatted correctly. This customization is particularly useful when handling cases where the initial message in the conversation might not be from the assistant. By ensuring the beginning-of-sequence token (bos_token) is correctly placed, we can maintain the proper structure and flow of the conversation.

The following code snippet demonstrates how to set the chat template manually for such scenarios. This template checks if the first message is from the assistant. If not, it adds the bos_token at the beginning. This step is crucial because we format the chosen and rejected responses separately, and we want to avoid adding an extra bos_token before the response when there’s no initial user message.

The template is defined using a Jinja-like syntax, which iterates through the messages and formats them based on their roles (user or assistant). For user messages, it wraps the content with [INST] and [/INST] tags, while for assistant messages, it appends an end-of-sequence token (eos_token).

tokenizer.chat_template = """{% if messages[0]['role'] != 'assistant' %}{{ bos_token }}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% endif %}{% endfor %}
"""

# Test chat template
messages = [
    {'role': 'user', 'content': 'write a quick sort algorithm in python.'},
    {'role': 'assistant', 'content': 'here you are.'},
    {'role': 'user', 'content': 'great.'},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=False)
print(inputs)

Set the Pad Token

When working with tokenizers, it’s essential to designate a token for padding sequences to ensure they all have the same length. This padding token helps maintain the consistency of input shapes when batching data for training models. The following code snippet demonstrates how to set the padding token (pad_token) in your tokenizer by checking for the presence of specific tokens in its vocabulary.

## Set the pad token to <pad> if present, else <|pad|>, else <unk>, else fall back to the EOS token
if '<pad>' in tokenizer.get_vocab():
    print('<pad> token is in the tokenizer. Using <pad> for pad.')
    # Set the pad token
    tokenizer.pad_token = '<pad>'
elif '<|pad|>' in tokenizer.get_vocab():
    print('<|pad|> token is in the tokenizer. Using <|pad|> for pad.')
    # Set the pad token
    tokenizer.pad_token = '<|pad|>'
elif '<unk>' in tokenizer.get_vocab():
    print('<unk> token is in the tokenizer. Using <unk> for pad.')
    # Set the pad token
    tokenizer.pad_token = '<unk>'
else:
    print(f'Using EOS token, {tokenizer.eos_token}, for padding. Warning: this may interfere with the model learning when to stop generating.')
    tokenizer.pad_token = tokenizer.eos_token

Update the Model Configuration

The following code snippet demonstrates how to update the pad token ID in both the model and its configuration to match the tokenizer’s pad token ID. Additionally, it includes checks and print statements to verify the consistency of these IDs and provides information about the tokenizer’s special tokens.

# Update pad token id in model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Check if they are equal
assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID and the tokenizer's pad token ID are not equal"

# Print the pad token ids
print('Tokenizer pad token ID:', tokenizer.pad_token_id)
print('Model pad token ID:', model.pad_token_id)
print('Model config pad token ID:', model.config.pad_token_id)
print('Number of tokens now in tokenizer:', tokenizer.vocab_size)
print('Special tokens map:', tokenizer.special_tokens_map)
print('All special tokens:', tokenizer.all_special_tokens)
print(tokenizer)

Set embed and norm layers to trainable
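
In addition to the LoRA adapters, it can help to make the embedding and normalization layers trainable for chat-style fine-tuning. Below is a minimal, hedged sketch of one way to do this by matching parameter names; the keyword list is an assumption and should be checked against the names reported by model.named_parameters() for your checkpoint.

# Hedged sketch: enable gradients on embedding and normalization layers by name.
# The keyword list is an assumption; adjust it to the module names used by your model.
trainable_keywords = ['embed_tokens', 'norm']

for name, param in model.named_parameters():
    if any(keyword in name for keyword in trainable_keywords):
        param.requires_grad = True

You can verify the effect with the print_trainable_parameters helper defined below.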

Prepare for LoRA fine-tuning

Before starting the LoRA (Low-Rank Adaptation) fine-tuning process, it’s essential to understand which parameters in your model are trainable and which are not. This helps in ensuring that only the desired parameters are updated during training, which is crucial for efficient and effective fine-tuning.

To achieve this, you can use the following function to print the number of trainable parameters in the model and list which parameters are trainable and which are not.

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model and lists which parameters
    """
    trainable_params = 0
    non_trainable_params = 0
    all_params = 0

    print("Trainable Parameters")
    for name, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
            print(f" {name}")
        else:
            non_trainable_params += param.numel()

    print("\nNon-Trainable Parameters:")
    for name, param in model.named_parameters():
        if not param.requires_grad:
            print(f" {name}")

    print(
        f"\nSummary:\n Trainable params: {trainable_params}\n Non-Trainable params: {non_trainable_params}\n All Parameters: {all_params}")

Print the trainable parameters to verify the setup.

print_trainable_parameters(model)

Loading and Preparing the Dataset for Fine-Tuning

When working with large datasets, it’s essential to streamline the process of loading, splitting, and formatting the data to ensure efficient model training and testing. The following Python code demonstrates how to achieve this using the Hugging Face datasets library, along with a tokenizer for text processing.

# Prepared with the help of code from: https://github.com/xfactlab/orpo/.
import json

# Dataset to load
dataset_name = 'llmat/dpo-orpo-mix-38k-balanced'

max_num_samples = None   # None = use the full dataset
#max_num_samples = 10000 # uncomment to train on a 10k-sample subset

from datasets import load_dataset

def build_dataset(tokenizer, data_name, cache_dir=None, max_num_samples=10000, test_size_ratio=0.1):
    # Determine the split specification based on max_num_samples
    split_spec = 'train' if max_num_samples is None else f'train[:{max_num_samples}]'

    # Load the dataset
    full_data = load_dataset(data_name, split=split_spec, cache_dir=cache_dir)

    # Shuffle the dataset when only a subset is used
    if max_num_samples is not None:
        full_data = full_data.shuffle(seed=42)

    # Determine the number of test samples
    num_total_samples = len(full_data)
    test_size = int(test_size_ratio * num_total_samples)

    # Randomly split the data into training and test sets
    dataset = full_data.train_test_split(test_size=test_size)

    column_names = list(dataset['train'].features)

    def apply_dpo_template(example):
        # Function adapted from https://kaitchup.substack.com/p/fine-tune-a-better-go
        if all(k in example.keys() for k in ('chosen', 'rejected')):
            # For DPO/ORPO, the inputs are triples of (prompt, chosen, rejected), where 'chosen'
            # and 'rejected' are full conversations whose final turn is the assistant's answer.
            # We therefore extract the first N-1 turns to form the prompt
            prompt_messages = example['chosen'][:-1]

            # Now we extract the final turn to define chosen/rejected responses
            chosen_messages = example['chosen'][-1:]
            rejected_messages = example['rejected'][-1:]
            example['text_chosen'] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example['text_rejected'] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example['text_prompt'] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
        return example

    dataset = dataset.map(apply_dpo_template, remove_columns=column_names,
                desc='Formatting comparisons with prompt template',)

    for split in ['train', 'test']:
        dataset[split] = dataset[split].rename_columns(
            {'text_prompt': 'prompt', 'text_chosen': 'chosen', 'text_rejected': 'rejected'}
        )

    return dataset['train'], dataset['test']

# Assuming 'tokenizer' and 'dataset_name' are already defined
train, test = build_dataset(tokenizer, dataset_name, cache_dir='./dataset', max_num_samples=max_num_samples)

After preparing and formatting the dataset for fine-tuning, let’s inspect the data to ensure it has been processed correctly. This step helps you verify that the prompt, chosen, and rejected fields are properly formatted and contain the expected information.

print('Prompt:', train['prompt'][0])
print('\n\nChosen:', train['chosen'][0])
print('\n\nRejected:', train['rejected'][0])

Setting Up and Running Training

In this tutorial, we will go through the process of setting up and running the training for your model. This includes configuring training parameters, creating a custom logging callback, and initiating the training process.

Set Training Parameters

Define the training parameters such as the model name, number of epochs, gradient accumulation steps, batch size, and the directory to save the results.

model_name = model_id.split('/')[-1]

epochs=1
grad_accum=4
batch_size=8
fine_tune_tag='ORPO'
save_dir = f'./results/{model_name}_{dataset_name}_{epochs}_epochs_{fine_tune_tag}'
print(save_dir)

Create a Custom Logging Callback

Implement a custom callback to log training metrics to a file. This callback will write the training and evaluation loss to a log file and save the trainable parameters at checkpoint steps. Create an instance of the custom logging callback with the specified log file path.

import transformers
import os
import torch

# Custom callback to log metrics
class LoggingCallback(transformers.TrainerCallback):
    def __init__(self, log_file_path):
        self.log_file_path = log_file_path

    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        if logs is None:
            return
        with open(self.log_file_path, 'a') as f:
            if 'loss' in logs:
                f.write(f'Step: {state.global_step}, Training Loss: {logs["loss"]}\n')
            if 'eval_loss' in logs:
                f.write(f'Step: {state.global_step}, Eval Loss: {logs["eval_loss"]}\n')
            f.flush()  # Force flush the buffered data to file

        # Check if the current step is a checkpoint step
        if state.global_step % int(args.save_steps) == 0:
            # Check if the last checkpoint path exists
            if state.best_model_checkpoint:
                checkpoint_dir = state.best_model_checkpoint
            else:
                # If not, construct the checkpoint directory path
                checkpoint_dir = os.path.join(args.output_dir, f'checkpoint-{state.global_step}')

            # Ensure the checkpoint directory exists
            os.makedirs(checkpoint_dir, exist_ok=True)

            # Save trainable params in the checkpoint directory
            current_trainable_params = {n: p for n, p in model.named_parameters() if p.requires_grad}
            current_trainable_params_state_dict = {n: p.data for n, p in current_trainable_params.items()}
            file_path = os.path.join(checkpoint_dir, 'trainable_params.pt')
            torch.save(current_trainable_params_state_dict, file_path)

# Log file path
cache_dir = './dataset'  # Directory where the training log file will be written
log_file_path = os.path.join(cache_dir, 'training_logs.txt')

# Create an instance of the custom callback
logging_callback = LoggingCallback(log_file_path)

Setting Up ORPO Training

In this section, we’ll walk through setting up and training a model using the ORPOTrainer from the trl library.

I trained the model on the entire dataset (38k samples) using an RTX 4090 GPU (24 GB of VRAM). The training took 7 hours and 35 minutes. You can use a smaller GPU with less VRAM and a smaller batch size; in that case, I recommend loading only a subset of the dataset to speed up training. You can do this by modifying the previous code block, e.g., setting max_num_samples = 10000 to load only 10k samples.

Configure ORPO

We define the configuration for ORPO training, which includes various hyperparameters and settings. An important parameter is beta, which corresponds to the constant λ in the paper’s loss function: it controls how much weight the preference (odds ratio) part receives relative to the cross-entropy part. In our example, we set it to 0.2.

from trl import ORPOTrainer, ORPOConfig
from unsloth import is_bfloat16_supported

orpo_config = ORPOConfig(
    beta=0.2,       # weight (lambda) of the odds-ratio term
    save_steps=500,
    logging_steps=1,
    num_train_epochs=epochs,
    output_dir=save_dir,
    evaluation_strategy='steps',
    do_eval=True,
    eval_steps=0.2, # evaluate every 20% of the training steps
    per_device_eval_batch_size=batch_size,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=grad_accum,
    log_level='debug',
    optim='paged_adamw_8bit',
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    max_grad_norm=0.3,
    lr_scheduler_type='linear',
    warmup_ratio=0.03,
    learning_rate=1e-4,
    max_prompt_length=512,
    max_length=1024,
    max_completion_length=1024,
    remove_unused_columns=True,
)

Initialize ORPOTrainer

Create an instance of ORPOTrainer with the model, datasets, tokenizer, and the configuration defined earlier.

orpo_trainer = ORPOTrainer(
    model,
    args=orpo_config,
    train_dataset=train,
    eval_dataset=test,
    tokenizer=tokenizer,
    callbacks=[logging_callback], # Add the custom logging callback
)

Train the Model

Set the model configuration to avoid cache warnings and start the training process.

model.config.use_cache = False # silence the warnings
orpo_trainer.train()

Plotting Training and Evaluation Losses

After training your model, it’s important to visualize the training and evaluation losses to understand how well your model is performing and to identify any potential issues. Visualizing the losses can help you diagnose problems such as overfitting or underfitting and make informed decisions about further training or model adjustments.

import matplotlib.pyplot as plt

# Initialize lists to hold training and evaluation losses and steps
train_losses = []
eval_losses = []
train_steps = []
eval_steps = []

# Populate the lists from the log history
for entry in orpo_trainer.state.log_history:
    if 'loss' in entry:
        train_losses.append(entry['loss'])
        train_steps.append(entry['step'])
    if 'eval_loss' in entry:
        eval_losses.append(entry['eval_loss'])
        eval_steps.append(entry['step'])

# Plot the losses
plt.plot(train_steps, train_losses, label='Train Loss')
plt.plot(eval_steps, eval_losses, label='Eval Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
Image

Let’s now check the W&B plots. While the loss goes down, we can also see that the gap between the chosen and rejected answers becomes clearer.

Image

Merging Adapters and Saving the Model to Hugging Face Hub

As a last step, we merge the adapters with the original model using 16-bit precision to enhance quality. Initially, we save it locally in the “model” directory before uploading it to the Hugging Face Hub. The trained model is available at llmat/Mistral-v0.3-7B-ORPO.

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("llmat/Mistral-v0.3-7B-ORPO", tokenizer, save_method="merged_16bit")

Conclusion

This article presented a thorough overview of ORPO fine-tuning and its practical application to a Mistral v0.3 7B model. Utilizing QLoRA’s efficient memory management, we successfully fine-tuned a 7B LLM on a high-quality dataset with minimal GPU resources.

I hope you found this guide helpful. If you liked this article, follow me on Hugging Face @llmat. Best of luck with your model fine-tuning!