!pip install python-dotenv
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
The field of artificial intelligence and machine learning is marked by constant innovation, with new tools and methodologies emerging to expand the horizons of what these technologies can achieve. Recently, significant upgrades were introduced in popular AI model series, enhancing their capabilities and setting new benchmarks in AI development.
However, to fully leverage the potential of these advanced models, it’s essential to employ sophisticated fine-tuning techniques like ORPO (Odds Ratio Preference Optimization) and Unsloth. ORPO simplifies the alignment process by integrating preference optimization directly into the training phase, eliminating the need for a separate alignment step. Unsloth, on the other hand, offers groundbreaking advancements in training efficiency, significantly speeding up the process while reducing memory consumption without compromising accuracy.
In this article, we will explore how to fine-tune Mistral v0.3 using ORPO and Unsloth, demonstrating how these techniques can enhance model performance and efficiency. By understanding and applying these methods, you can unlock new levels of capability and efficiency in your AI projects. The code for this process can be found on Google Colab and in the LLM Tutorial on GitHub.
ORPO
Instruction tuning and preference alignment are crucial for customizing Large Language Models (LLMs) for specific tasks. This typically involves a multi-step process: first, Supervised Fine-Tuning (SFT) on instructions to tailor the model to the desired domain, and second, a preference alignment technique such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the probability of producing preferred responses over less desirable ones. Researchers have found that although SFT adapts the model to the target domain, it raises the probability of unwanted answers along with that of desired ones. The preference alignment stage is therefore needed to widen the gap between the probabilities of accepted and rejected outputs.
Hong and Lee (2024) introduced ORPO (Odds Ratio Preference Optimization), a groundbreaking method that aligns the language model without a reference model in a single-step manner by assigning a weak penalty to the rejected responses and a strong adaptation signal to the chosen responses with a simple log odds ratio term appended to the negative log-likelihood loss.
This approach enhances the traditional language modeling objective by integrating the negative log-likelihood (NLL) loss with an odds ratio (OR) component. The OR loss imposes a slight penalty on disfavored responses while significantly rewarding favored ones, enabling the model to concurrently master the target task and align with human preferences. The objective function for ORPO is defined as follows:
\mathscr{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathscr{L}_{SFT} + \lambda \cdot \mathscr{L}_{OR}\right]
In this formula, x is the prompt, y_w is the chosen response, and y_l is the rejected response. The SFT term is the conventional supervised fine-tuning loss (the negative log-likelihood of the chosen response), the OR term is the odds ratio loss, and λ is a weighting factor that balances these two components. This integration ensures that the model adapts effectively to the desired domain while minimizing the generation of undesired outputs.
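To make the two terms concrete, here is a minimal plain-Python sketch of the loss for a single example. It assumes the definitions from the ORPO paper: the odds of a response are p / (1 - p), where p is its length-normalised likelihood, and the odds ratio term is the negative log-sigmoid of the log odds ratio between the chosen and rejected responses. The function names and the numbers are illustrative and are not part of trl or Unsloth.

```python
import math

def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) given log p (avg_logp must be negative)."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(nll_chosen: float, logp_chosen: float, logp_rejected: float,
              lam: float = 0.2) -> float:
    """SFT loss plus lambda times the odds ratio loss, for one example."""
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    loss_or = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + lam * loss_or

# Illustrative values: the chosen answer is much more likely than the rejected one...
separated = orpo_loss(nll_chosen=0.4, logp_chosen=-0.4, logp_rejected=-1.6)
# ...versus both answers being equally likely.
tied = orpo_loss(nll_chosen=0.4, logp_chosen=-0.4, logp_rejected=-0.4)
print(separated, tied)
```

When the two responses are equally likely, the odds ratio term equals log 2; it shrinks as the chosen response becomes more likely than the rejected one, which is the signal that pushes the two apart during training.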
Unsloth
Unsloth is a fine-tuning framework designed to accelerate the training of large language models (LLMs) like Llama and Mistral, while drastically reducing memory usage. It achieves this through several optimizations:
Manual Derivation and Handwritten GPU Kernels: Unsloth optimizes computational steps by manually deriving and handwriting GPU kernels, bypassing inefficiencies in general-purpose libraries.
Quantization Techniques: Loading the base model in 4-bit precision (QLoRA), with 16-bit LoRA also supported, substantially reduces memory requirements while largely preserving model accuracy.
Optimized Attention Mechanisms: Integrating Flash Attention v2 for faster attention calculations and reduced memory usage.
Enhanced Memory Management: Efficient memory allocation and data transfer processes optimize VRAM usage.
Unsloth can make training up to 2 times faster on a single GPU and reduce memory usage by up to 60% without degrading accuracy. It supports diverse fine-tuning use cases, including instruction fine-tuning and Direct Preference Optimization (DPO).
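To get a feel for why 4-bit loading matters, here is a back-of-the-envelope sketch of the memory needed just to hold the model weights. The parameter count is an assumed round figure for a 7B-class model, and the estimate ignores quantization constants, activations, gradients, and optimizer state, so treat it as a rough lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7.25e9  # assumption: approximate parameter count of a 7B-class model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB, 4-bit weights: {int4_gib:.1f} GiB")
```

Storing the weights in 4 bits takes a quarter of the 16-bit footprint, which is what leaves room for adapters, activations, and optimizer state on a 24 GB card.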
Fine-Tuning Mistral v0.3 with ORPO and Unsloth
In this example we will QLoRA fine-tune the Mistral v0.3 7B model using ORPO and the Unsloth framework. ORPO necessitates a preference dataset that includes a prompt, a selected answer, and a discarded answer. To achieve this, we will utilize llmat/dpo-orpo-mix-38k-balanced, a dataset that merges high-quality DPO datasets and has been further balanced using a clustering-based approach.
Let’s start by installing the required libraries with the pip commands shown at the top of this article.
Now let’s log in to our W&B workspace:
import wandb
import os
import dotenv

dotenv.load_dotenv()
%env WANDB_NOTEBOOK_NAME = $Fine_tune_Mistral_with_ORPO
wandb.login(key=os.environ["WANDB_API_KEY"])
Load the Model and Tokenizer for LoRA
In the following, we will load the Mistral 7B v0.3 model in 4-bit precision.
cache_dir = './model'
model_id = 'mistralai/Mistral-7B-v0.3'

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
Loading Checks
After loading the model, it’s crucial to ensure that all parameters have actually been loaded onto the GPU and that none were left behind because the model did not fit. This is particularly important for large models where memory management is critical.
To verify the placement of the model’s parameters, you can iterate through the model’s named parameters and check their device type. A parameter whose device type is ‘meta’ has no real data allocated on the GPU, which typically means it was offloaded instead of being loaded; any such parameter will be printed out. This ensures that your model is fully utilizing the GPU resources and avoids potential performance bottlenecks.
Here is the code to perform this check:
# Check there are no parameters overflowing onto cpu (meta).
for n, p in model.named_parameters():
    if p.device.type == 'meta':
        print(f"{n} is on meta!")
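If you prefer a summary over a list of offenders, a small variation can tally parameters per device type. The sketch below is an illustration only: it uses hypothetical stand-in objects that mimic the `.device.type` attribute so that it runs without a GPU. With a real model you would pass `model.named_parameters()` instead of `fake_params`.

```python
from collections import Counter
from types import SimpleNamespace

def count_by_device(named_parameters):
    """Count parameters per device type (e.g. 'cuda', 'cpu', 'meta')."""
    counts = Counter()
    for _name, param in named_parameters:
        counts[param.device.type] += 1
    return counts

def fake_param(device_type):
    # Hypothetical stand-in exposing only the `.device.type` attribute.
    return SimpleNamespace(device=SimpleNamespace(type=device_type))

fake_params = [
    ("layers.0.q_proj.weight", fake_param("cuda")),
    ("layers.0.k_proj.weight", fake_param("cuda")),
    ("lm_head.weight", fake_param("meta")),
]

counts = count_by_device(fake_params)
print(dict(counts))
```

A non-zero count for 'meta' (or 'cpu') is the cue to reduce the sequence length, use a smaller model, or move to a GPU with more memory.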
Setting Up LoRA Fine-Tuning
To prepare your model for LoRA (Low-Rank Adaptation) fine-tuning, you need to configure it properly. This involves setting up the LoRA configuration. Here’s a brief overview of the parameter settings:
r: This parameter controls the rank of the low-rank adaptation matrices. It must be greater than 0, with common choices being 8, 16, 32, 64, or 128. The best setting depends on the specific use case and computational resources, but a good starting point is 8 or 16.
lora_alpha: This parameter scales the magnitude of the LoRA update. A higher value can lead to more significant changes in the model’s behavior. In our example we are setting lora_alpha to 32.
target_modules: This list specifies which modules in the model receive LoRA adapters. The settings include the key modules "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", and "down_proj". If the task involves chat fine-tuning, it’s also beneficial to set "lm_head" (language model head) as trainable.
use_gradient_checkpointing: This parameter activates gradient checkpointing to conserve memory. Here it is set to "unsloth", which selects Unsloth’s own checkpointing implementation to save additional VRAM.
random_state: This parameter sets the seed for random number generation, ensuring reproducibility. Any integer value works; in the code, it’s set to 3407.
use_rslora: This parameter activates RSLoRA, which adjusts the scaling factor of LoRA adapters to be proportional to 1/√r instead of 1/r. This adjustment enhances the stability of learning, particularly for higher adapter ranks, and improves fine-tuning performance as the rank increases.
These settings provide a good starting point for fine-tuning a language model using PEFT. However, the optimal settings may vary depending on the specific task and dataset, so some experimentation may be necessary.
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    lora_alpha = 32,
    target_modules = [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",  # Language model head - best to set this trainable if chat fine-tuning
    ],
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
)
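To see what use_rslora changes in practice, the following sketch compares the two scaling factors described above: lora_alpha / r for standard LoRA and lora_alpha / √r for RSLoRA. The helper name is hypothetical and the function only reproduces the arithmetic; it does not touch the model.

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

for rank in (8, 64, 128):
    standard = lora_scaling(32, rank, use_rslora=False)
    rank_stabilized = lora_scaling(32, rank, use_rslora=True)
    print(f"r={rank}: standard={standard:.3f}, rslora={rank_stabilized:.3f}")
```

With standard scaling the factor shrinks quickly as the rank grows, while the RSLoRA factor decays much more slowly, which is why it is recommended for higher ranks.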
Set up Tokenizer and Padding
Before starting the fine-tuning process, it’s essential to configure the tokenizer and set up padding correctly. This ensures that the model can handle input sequences efficiently and that special tokens are properly managed.
Inspect the Tokenizer
Print out the tokenizer details, including the vocabulary size, beginning-of-sequence (BOS) token, end-of-sequence (EOS) token, and chat template.
print(tokenizer)
print(tokenizer.vocab_size)
print(tokenizer.bos_token)
print(tokenizer.eos_token)
print(tokenizer.chat_template)
Customize Chat Template
When working with Llama/Mistral models, it’s sometimes necessary to customize the chat template to ensure the conversation is formatted correctly. This customization is particularly useful when the first message passed to the template might be an assistant message instead of a user message. By placing the beginning-of-sequence token (bos_token) only where it belongs, we maintain the proper structure and flow of the conversation.
The following code snippet sets the chat template manually for such scenarios. The template checks whether the first message is from the assistant. If it is not, it adds the bos_token at the beginning. This step is crucial because we format the prompt, the chosen response, and the rejected response separately, and we want to avoid adding an extra bos_token before a response that is formatted without a preceding user message.
The template is written in Jinja syntax. It iterates through the messages and formats them based on their roles (user or assistant). For user messages, it wraps the content in [INST] and [/INST] tags, while for assistant messages, it appends an end-of-sequence token (eos_token).
tokenizer.chat_template = """{% if messages[0]['role'] != 'assistant' %}{{ bos_token }}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% endif %}{% endfor %}"""

# Test chat template
messages = [
    {'role': 'user', 'content': 'write a quick sort algorithm in python.'},
    {'role': 'assistant', 'content': 'here you are.'},
    {'role': 'user', 'content': 'great.'},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=False)
print(inputs)
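If the Jinja one-liner is hard to read, the same logic can be written out in plain Python. This is only a sketch to illustrate what the template does; the function name is hypothetical and the default bos_token and eos_token values are assumptions, so in practice use the tokens reported by your tokenizer.

```python
def render_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python equivalent of the chat template logic described above."""
    parts = []
    # Only add BOS when the sequence does not start with an assistant message.
    if messages[0]["role"] != "assistant":
        parts.append(bos_token)
    for message in messages:
        if message["role"] == "user":
            parts.append("[INST] " + message["content"] + " [/INST]")
        elif message["role"] == "assistant":
            parts.append(message["content"] + eos_token)
    return "".join(parts)

prompt_only = [{"role": "user", "content": "great."}]
response_only = [{"role": "assistant", "content": "here you are."}]

print(render_chat(prompt_only))    # starts with BOS
print(render_chat(response_only))  # no BOS, ends with EOS
```

Formatting the prompt and the response separately and then concatenating them yields exactly one bos_token at the start of the full sequence, which is the point of the first-message check.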
Set the Pad Token
When working with tokenizers, it’s essential to designate a token for padding sequences to ensure they all have the same length. This padding token helps maintain the consistency of input shapes when batching data for training models. The following code snippet demonstrates how to set the padding token (pad_token) in your tokenizer by checking for the presence of specific tokens in its vocabulary.
## Set the pad token to <pad>; if absent, <|pad|>; if absent, <unk>; otherwise fall back to EOS
if '<pad>' in tokenizer.get_vocab():
    print('<pad> token is in the tokenizer. Using <pad> for pad')
    # Set the pad token
    tokenizer.pad_token = '<pad>'
elif '<|pad|>' in tokenizer.get_vocab():
    print('<|pad|> token is in the tokenizer. Using <|pad|> for pad')
    # Set the pad token
    tokenizer.pad_token = '<|pad|>'
elif '<unk>' in tokenizer.get_vocab():
    print('<unk> token is in the tokenizer. Using <unk> for pad')
    # Set the pad token
    tokenizer.pad_token = '<unk>'
else:
    print(f'Using EOS token, {tokenizer.eos_token}, for padding. Warning, this ')
    tokenizer.pad_token = tokenizer.eos_token
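The fallback order can also be expressed as a small reusable helper. The sketch below works on a plain dictionary so that it runs without a tokenizer; the function name and the toy vocabulary are hypothetical. With a real tokenizer you would pass `tokenizer.get_vocab()` and `tokenizer.eos_token`.

```python
def choose_pad_token(vocab, eos_token, candidates=("<pad>", "<|pad|>", "<unk>")):
    """Return the first candidate found in the vocabulary, else the EOS token."""
    for token in candidates:
        if token in vocab:
            return token
    return eos_token

toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}  # hypothetical vocabulary
pad_token = choose_pad_token(toy_vocab, eos_token="</s>")
print(pad_token)
```

Keeping the candidates in a tuple makes the priority order explicit and easy to change if your tokenizer uses a different padding symbol.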
Update the Model Configuration
The following code snippet demonstrates how to update the pad token ID in both the model and its configuration to match the tokenizer’s pad token ID. Additionally, it includes checks and print statements to verify the consistency of these IDs and provides information about the tokenizer’s special tokens.
# Update pad token id in model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# Check if they are equal
assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID"
# Print the pad token ids
print('Tokenizer pad token ID:', tokenizer.pad_token_id)
print('Model pad token ID:', model.pad_token_id)
print('Model config pad token ID:', model.config.pad_token_id)
print('Number of tokens now in tokenizer:', tokenizer.vocab_size)
print('Special tokens map:', tokenizer.special_tokens_map)
print('All special tokens:', tokenizer.all_special_tokens)
print(tokenizer)
Set embed and norm layers to trainable
(recommended for chat fine-tuning if the chat template has been changed)
When fine-tuning a model for chat applications, it’s often beneficial to set specific layers to be trainable, especially if you are changing the chat template. This ensures that the model can adapt to the new input format more effectively.
# List to hold the names of the trainable parameters
trainable_params_names = ['embed_tokens', 'input_layernorm', 'post_attention_layernorm', 'norm']

# Set modules to be trainable
for n, p in model.named_parameters():
    if any(k in n for k in trainable_params_names):
        p.requires_grad_(True)
    else:
        # Optional: freeze the rest. Note that this also freezes LoRA adapter
        # weights that are already attached; drop this branch to keep them trainable.
        p.requires_grad_(False)

# Make a dictionary of trainable parameters
trainable_params = {n: p for n, p in model.named_parameters() if p.requires_grad}

# Convert trainable_params to state_dict format
trainable_params_state_dict = {n: p.data for n, p in trainable_params.items()}
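The selection above relies on simple substring matching against parameter names, so it is worth checking which names a given key actually catches. The sketch below runs the same test on a handful of hypothetical parameter names (real names depend on the model and on whether adapters are attached).

```python
TRAINABLE_KEYS = ["embed_tokens", "input_layernorm", "post_attention_layernorm", "norm"]

def is_selected(param_name, keys=TRAINABLE_KEYS):
    """True if any key occurs as a substring of the parameter name."""
    return any(key in param_name for key in keys)

# Hypothetical parameter names, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "model.norm.weight",
]

selected = [name for name in names if is_selected(name)]
print(selected)
```

Note that the key 'norm' already matches every layer norm name, and that adapter weights such as the `lora_A` entry are not matched by any of the keys.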
Prepare for LoRA fine-tuning
Before starting the LoRA (Low-Rank Adaptation) fine-tuning process, it’s essential to understand which parameters in your model are trainable and which are not. This helps in ensuring that only the desired parameters are updated during training, which is crucial for efficient and effective fine-tuning.
To achieve this, you can use the following function to print the number of trainable parameters in the model and list which parameters are trainable and which are not.
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model and lists which parameters
    are trainable and which are not.
    """
    trainable_params = 0
    non_trainable_params = 0
    all_params = 0

    print("Trainable Parameters")
    for name, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
            print(f" {name}")
        else:
            non_trainable_params += param.numel()

    print("\nNon-Trainable Parameters:")
    for name, param in model.named_parameters():
        if not param.requires_grad:
            print(f" {name}")

    print(
        f"\nSummary:\n Trainable params: {trainable_params}\n Non-Trainable params: {non_trainable_params}\n All Parameters: {all_params}")
Print the trainable parameters to verify the setup.
print_trainable_parameters(model)
Loading and Preparing the Dataset for Fine-Tuning
When working with large datasets, it’s essential to streamline the process of loading, splitting, and formatting the data to ensure efficient model training and testing. The following Python code demonstrates how to achieve this using the Hugging Face datasets library, along with a tokenizer for text processing.
# Prepared with the help of code from: https://github.com/xfactlab/orpo/.
import json

# Load the dataset
dataset_name = 'llmat/dpo-orpo-mix-38k-balanced'  # Ensure this is defined

max_num_samples = None  # Set to None to use the full dataset
# max_num_samples = 10000  # set to None to use the full dataset

from datasets import load_dataset

def build_dataset(tokenizer, data_name, cache_dir=None, max_num_samples=10000, test_size_ratio=0.1):
    # Determine the split specification based on max_num_samples
    split_spec = 'train' if max_num_samples is None else f'train[:{max_num_samples}]'

    # Load the dataset
    full_data = load_dataset(data_name, split=split_spec, cache_dir=cache_dir)

    # Shuffle the dataset
    if max_num_samples is not None:
        full_data = full_data.shuffle(seed=42)
    else:
        full_data = full_data

    # Determine the number of test samples
    num_total_samples = len(full_data)
    test_size = int(test_size_ratio * num_total_samples)

    # Randomly split the data into training and test sets
    dataset = full_data.train_test_split(test_size=test_size)

    column_names = list(dataset['train'].features)

    def apply_dpo_template(example):
        # function adapted from https://kaitchup.substrack.com/p/fine-tune-a-better-go
        if all(k in example.keys() for k in ('chosen', 'rejected')):
            # For DPO, the inputs are triples of (prompt, chosen, rejected), where 'chosen'
            # and 'rejected' are the final turn of a dialogue.
            # We therefore need to extract the N-1 turns to form the prompt
            prompt_messages = example['chosen'][:-1]

            # Now we extract the final turn to define chosen/rejected responses
            chosen_messages = example['chosen'][-1:]
            rejected_messages = example['rejected'][-1:]
            example['text_chosen'] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example['text_rejected'] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example['text_prompt'] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
        return example

    dataset = dataset.map(apply_dpo_template, remove_columns=column_names,
                          desc='Formatting comparisons with prompt template',)

    for split in ['train', 'test']:
        dataset[split] = dataset[split].rename_columns(
            {'text_prompt': 'prompt', 'text_chosen': 'chosen', 'text_rejected': 'rejected'}
        )

    return dataset['train'], dataset['test']

# Assuming 'tokenizer' and 'dataset_name' are already defined
train, test = build_dataset(tokenizer, dataset_name, cache_dir='./dataset', max_num_samples=max_num_samples)
After preparing and formatting your dataset for fine-tuning, let’s inspect the data to ensure that it has been correctly processed. This step helps you verify that the prompt, chosen, rejected, and messages fields are properly formatted and contain the expected information.
print('Prompt:', train['prompt'][0])
print('\n\nChosen:', train['chosen'][0])
print('\n\nRejected:', train['rejected'][0])
print('\n\nMessages (incl. prompt):', train['messages'][0])
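To illustrate how a preference record is turned into a (prompt, chosen, rejected) triple, here is a minimal sketch of the slicing step on a toy example. The record below is made up for illustration and assumes the same layout as in the formatting function: 'chosen' and 'rejected' are lists of messages that share every turn except the last.

```python
def split_preference_pair(example):
    """Split a record into prompt turns and the final chosen/rejected turn."""
    prompt_messages = example["chosen"][:-1]     # all turns except the last
    chosen_messages = example["chosen"][-1:]     # final turn of the chosen dialogue
    rejected_messages = example["rejected"][-1:] # final turn of the rejected dialogue
    return prompt_messages, chosen_messages, rejected_messages

toy_example = {  # hypothetical record
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "5"},
    ],
}

prompt, chosen, rejected = split_preference_pair(toy_example)
print(prompt)
print(chosen)
print(rejected)
```

Each of the three lists is then passed through the chat template separately, which is why the template must not prepend a bos_token to a sequence that starts with an assistant message.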
Setting Up and Running Training
In this tutorial, we will go through the process of setting up and running the training for your model. This includes configuring training parameters, creating a custom logging callback, and initiating the training process.
Set Training Parameters
Define the training parameters such as the model name, number of epochs, gradient accumulation steps, batch size, and the directory to save the results.
model_name = model_id.split('/')[-1]

epochs = 1
grad_accum = 4
batch_size = 8
fine_tune_tag = 'ORPO'
save_dir = f'./results/{model_name}_{dataset_name}_{epochs}_epochs_{fine_tune_tag}'
print(save_dir)
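The batch size and gradient accumulation steps together determine how many samples contribute to each optimizer update. The sketch below works through that arithmetic; the training set size is an assumed figure (roughly 90% of 38k samples), and the step count is an approximation, since the trainer's exact rounding may differ.

```python
import math

def optimizer_steps_per_epoch(num_train_samples, batch_size, grad_accum, num_gpus=1):
    """Return the effective batch size and the approximate optimizer steps per epoch."""
    effective_batch = batch_size * grad_accum * num_gpus
    steps = math.ceil(num_train_samples / effective_batch)
    return effective_batch, steps

# Assumption: about 34,200 training samples remain after the 10% test split.
effective_batch, steps = optimizer_steps_per_epoch(34_200, batch_size=8, grad_accum=4)
print(f"Effective batch size: {effective_batch}, optimizer steps per epoch: ~{steps}")
```

If you lower batch_size to fit a smaller GPU, raising grad_accum by the same factor keeps the effective batch size unchanged.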
Create a Custom Logging Callback
Implement a custom callback to log training metrics to a file. This callback will write the training and evaluation loss to a log file and save the trainable parameters at checkpoint steps. Create an instance of the custom logging callback with the specified log file path.
import transformers
import os
import torch

# Custom callback to log metrics
class LoggingCallback(transformers.TrainerCallback):
    def __init__(self, log_file_path):
        self.log_file_path = log_file_path

    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        with open(self.log_file_path, 'a') as f:
            if 'loss' in logs:
                f.write(f'Step: {state.global_step}, Training Loss: {logs["loss"]}\n')
            if 'eval_loss' in logs:
                f.write(f'Step: {state.global_step}, Eval Loss: {logs["eval_loss"]}\n')
            # Force flush the buffered data to file
            f.flush()

        # Check if the current step is a checkpoint step
        if state.global_step % int(args.save_steps) == 0:
            # Check if the last checkpoint path exists
            if state.best_model_checkpoint:
                checkpoint_dir = state.best_model_checkpoint
            else:
                # If not, construct the checkpoint directory path
                checkpoint_dir = os.path.join(args.output_dir, f'checkpoint-{state.global_step}')

            # Ensure the checkpoint directory exists
            os.makedirs(checkpoint_dir, exist_ok=True)

            # Save trainable params in the checkpoint directory
            current_trainable_params = {n: p for n, p in model.named_parameters() if p.requires_grad}
            current_trainable_params_state_dict = {n: p.data for n, p in current_trainable_params.items()}
            file_path = os.path.join(checkpoint_dir, 'trainable_params.pt')
            torch.save(current_trainable_params_state_dict, file_path)

# Log file path
cache_dir = './dataset'  # Assuming cache_dir is defined elsewhere in your code
log_file_path = os.path.join(cache_dir, 'training_logs.txt')

# Create an instance of the custom callback
logging_callback = LoggingCallback(log_file_path)
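Since the callback writes one line per logged value, the log file is easy to read back later, for example to plot losses after the notebook session has ended. The following sketch parses lines in the format written above; the helper name and the sample lines are illustrative, and lines that do not match the pattern are simply skipped.

```python
import re

LOG_LINE = re.compile(r"Step: (\d+), (Training|Eval) Loss: ([-+.\deE]+)")

def parse_log_lines(lines):
    """Return (train_points, eval_points) as lists of (step, loss) tuples."""
    train_points, eval_points = [], []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a loss line
        step, kind, value = int(match.group(1)), match.group(2), float(match.group(3))
        if kind == "Training":
            train_points.append((step, value))
        else:
            eval_points.append((step, value))
    return train_points, eval_points

# Made-up sample lines in the callback's format.
sample_lines = [
    "Step: 1, Training Loss: 1.5321\n",
    "Step: 2, Training Loss: 1.4\n",
    "Step: 2, Eval Loss: 1.45\n",
]
train_points, eval_points = parse_log_lines(sample_lines)
print(train_points, eval_points)
```

To use it on the real log, pass the open file object, for example `parse_log_lines(open(log_file_path))`.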
Setting Up ORPO Training
In this section, we’ll walk through setting up and training a model using the ORPOTrainer from the trl library.
I trained the model on the entire dataset (38k samples) using an RTX 4090 GPU (24 GB of VRAM). The training took 7 hours and 35 minutes. You can use smaller GPUs with less VRAM and a smaller batch size. In this case, I recommend only loading a subset of the dataset to speed up training. You can do it by modifying the previous code block, like ‘max_num_samples = 10000’ to only load 10k samples.
Configure ORPO
We define the configuration for the ORPO training. This configuration includes various hyperparameters and settings for training. An important parameter to set is beta. beta is the constant λ of the loss function in the paper. It controls how much weight we give to the preference part vs. the cross-entropy part. In our example, we set the value to 0.2.
from trl import ORPOTrainer, ORPOConfig
from unsloth import is_bfloat16_supported

orpo_config = ORPOConfig(
    beta=0.2,
    save_steps=500,
    logging_steps=1,
    num_train_epochs=epochs,
    output_dir=save_dir,
    evaluation_strategy='steps',
    do_eval=True,
    eval_steps=0.2,
    per_device_eval_batch_size=batch_size,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=grad_accum,
    log_level='debug',
    optim='paged_adamw_8bit',
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    max_grad_norm=0.3,
    lr_scheduler_type='linear',
    warmup_ratio=0.03,
    learning_rate=1e-4,

    max_prompt_length=512,
    max_length=1024,

    max_completion_length=1024,
    remove_unused_columns=True,
)
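Two of these settings are given as fractions: warmup_ratio=0.03 and eval_steps=0.2. The sketch below shows how such fractions translate into step counts, assuming that a fractional eval_steps is interpreted as a share of the total training steps and that both values are rounded up. The total step count is a hypothetical figure and the library's exact rounding may differ, so treat the numbers as estimates.

```python
import math

def resolve_schedule(total_steps, warmup_ratio, eval_steps):
    """Estimate the number of warmup steps and the evaluation interval in steps."""
    warmup = math.ceil(total_steps * warmup_ratio)
    if eval_steps < 1:
        # Fractional value: treated as a share of the total training steps.
        eval_every = math.ceil(total_steps * eval_steps)
    else:
        eval_every = int(eval_steps)
    return warmup, eval_every

# Assumption: roughly 1,069 optimizer steps in total for this run.
warmup, eval_every = resolve_schedule(total_steps=1069, warmup_ratio=0.03, eval_steps=0.2)
print(f"Warmup steps: ~{warmup}, evaluate every ~{eval_every} steps")
```

Under these assumptions, eval_steps=0.2 results in about five evaluations over the whole run.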
Initialize ORPOTrainer
Create an instance of ORPOTrainer with the model, datasets, tokenizer, and the configuration defined earlier.
orpo_trainer = ORPOTrainer(
    model,
    args=orpo_config,
    train_dataset=train,
    eval_dataset=test,
    tokenizer=tokenizer,

    callbacks=[logging_callback],  # Add custom callback here
)
Train the Model
Set the model configuration to avoid cache warnings and start the training process.
model.config.use_cache = False  # silence the warnings
orpo_trainer.train()
Plotting Training and Evaluation Losses
After training your model, it’s important to visualize the training and evaluation losses to understand how well your model is performing and to identify any potential issues. Visualizing the losses can help you diagnose problems such as overfitting or underfitting and make informed decisions about further training or model adjustments.
import matplotlib.pyplot as plt

# Initialize lists to hold training and evaluation losses and steps
train_losses = []
eval_losses = []
train_steps = []
eval_steps = []

# Populate the lists from the log history
for entry in orpo_trainer.state.log_history:
    if 'loss' in entry:
        train_losses.append(entry['loss'])
        train_steps.append(entry['step'])
    if 'eval_loss' in entry:
        eval_losses.append(entry['eval_loss'])
        eval_steps.append(entry['step'])

# Plot the losses
plt.plot(train_steps, train_losses, label='Train Loss')
plt.plot(eval_steps, eval_losses, label='Eval Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
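Because the training loss is logged at every step, the raw curve can be quite noisy. One optional way to make the trend easier to read is to smooth it with a trailing moving average before plotting. The helper below is a small sketch of that idea, not part of the original training code; the window size and the sample values are arbitrary.

```python
def moving_average(values, window=3):
    """Trailing moving average; early points average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

noisy_losses = [4.0, 2.0, 3.0, 1.0, 2.0]  # made-up values
smoothed_losses = moving_average(noisy_losses, window=2)
print(smoothed_losses)
```

To apply it to the plot, pass `moving_average(train_losses, window=50)` to `plt.plot` in place of `train_losses`; the output has the same length as the input, so `train_steps` can be reused unchanged.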
Let’s now check the W&B plots. While the loss goes down, we can also see that the difference between the chosen and rejected answers becomes clearer.
Merging Adapters and Saving the Model to Hugging Face Hub
As a last step, we merge the adapters with the original model in 16-bit precision to preserve quality. We first save the merged model locally in the “model” directory and then upload it to the Hugging Face Hub. The trained model is available at llmat/Mistral-v0.3-7B-ORPO.
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("llmat/Mistral-v0.3-7B-ORPO", tokenizer, save_method="merged_16bit")
Conclusion
This article presented a thorough overview of ORPO fine-tuning and its practical application to a Mistral v0.3 7B model. Utilizing QLoRA’s efficient memory management, we successfully fine-tuned a 7B LLM on a high-quality dataset with minimal GPU resources.
I hope you found this guide helpful. If you liked this article, follow me on Hugging Face @llmat. Best of luck with your model fine-tuning!