Direct Preference Optimization (DPO) is a technique for fine-tuning language models on human preference data. In this article, we’ll explore how to build a high-quality DPO dataset by consolidating multiple existing datasets, focusing on extracting only the best answers. We’ll walk through the code implementation step-by-step to demonstrate this process.
What is Direct Preference Optimization (DPO)?
As language models advance, aligning them with human preferences has become increasingly important. Direct Preference Optimization (DPO) is a method that directly integrates human preferences into the training process, eliminating the need for separate reward models or reinforcement learning.
Key Concepts of DPO:
Direct Optimization: DPO adjusts the language model to favor responses preferred by humans, streamlining the alignment process.
Simplified Training: By removing the need for auxiliary models and reinforcement learning, DPO reduces computational overhead and complexity.
Enhanced Alignment: This approach produces outputs more in line with human expectations and values, improving user satisfaction.
DPO Fine-tuning
At the core of the DPO fine-tuning process lies the concept of creating an exact duplicate of the language model (LM) being trained, with its parameters set to remain unchanged. This “frozen” model serves as a reference point during training.
For each data point, both the trained (policy) and frozen (reference) language models evaluate the chosen and rejected responses. A response’s evaluation score is the product of the probabilities the model assigns to each token of the target response. Thanks to the causal decoder architecture of these generative language models, we can compute this score in a single forward pass.
To score a chosen or rejected response for a given prompt, the LM calculates the probability of generating each response token sequentially. These probabilities are multiplied together to obtain the final score for that response.
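As a rough illustration of this scoring step, here is a minimal sketch using the transformers library. The gpt2 checkpoint and the helper name sequence_log_prob are assumptions made for the example, not part of the article’s pipeline; in practice the product of probabilities is accumulated as a sum of log-probabilities for numerical stability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model purely for illustration; any causal LM is scored the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sequence_log_prob(prompt: str, response: str) -> float:
    """Log of the product of the response-token probabilities given the prompt,
    obtained from a single forward pass over the concatenated sequence.
    (For simplicity this assumes the prompt tokenization is a prefix of the
    full tokenization.)"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]                      # token t is predicted from position t-1
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    response_start = prompt_ids.shape[1] - 1       # keep only the response tokens
    return token_log_probs[:, response_start:].sum().item()
The chosen and rejected responses are each scored this way by both the policy and the frozen reference model.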
With scores for both the chosen and rejected responses from both models, we compute, for each response, the ratio between the score from the trained language model (𝑅_policy) and the score from the frozen language model (𝑅_reference). These ratios enter the final loss that guides the model’s weight adjustments during the gradient descent update:
Loss = −log σ( β · log(𝑅_policy(chosen) / 𝑅_reference(chosen)) − β · log(𝑅_policy(rejected) / 𝑅_reference(rejected)) )
where β is a hyperparameter and σ is the sigmoid function.
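The same loss can be sketched in a few lines of PyTorch, working directly on the summed log-probabilities produced by the scoring step above. The function and argument names here are illustrative, not taken from any particular library.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs; each argument is a tensor of
    summed log-probabilities (the log of the response scores described above)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log(R_policy / R_reference), chosen
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log(R_policy / R_reference), rejected
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()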
Building a High-Quality DPO Dataset
To effectively implement Direct Preference Optimization (DPO), it’s essential to curate a dataset of high-quality preference pairs. Several notable datasets can serve as valuable resources:
argilla/distilabel-math-preference-dpo: Developed by Argilla using the Distilabel framework, this dataset comprises 2,418 entries. Each entry includes a math-related instruction, two model-generated responses, and corresponding quality ratings, facilitating the enhancement of mathematical reasoning in language models.
argilla/distilabel-intel-orca-dpo-pairs: This dataset is a “distilabeled” version of the widely used Intel/orca_dpo_pairs. It has been improved using the Distilabel framework to enhance the quality of preference pairs, making it suitable for fine-tuning models with diverse preference data.
argilla/ultrafeedback-binarized-preferences-cleaned: This dataset offers cleaned and binarized preference pairs, providing a refined resource for training models to understand and prioritize user preferences effectively.
M4-ai/prm_dpo_pairs_cleaned: Containing cleaned DPO pairs, this dataset aids in fine-tuning models to align with preferred responses, enhancing their decision-making capabilities.
jondurbin/truthy-dpo-v0.1: Focused on truthfulness, this dataset provides preference pairs that help models discern and prioritize truthful information, crucial for maintaining accuracy and reliability.
unalignment/toxic-dpo-v0.2: This dataset addresses toxicity by offering preference pairs that guide models to avoid generating harmful or offensive content, promoting safer AI interactions.
argilla/Capybara-Preferences: A collection of preference pairs tailored to specific tasks, this dataset assists in fine-tuning models for specialized applications, enhancing their adaptability and performance.
By selecting the highest-rated responses from these datasets, we can curate a collection of superior preference pairs, thereby enhancing the effectiveness of DPO fine-tuning.
Code Implementation
Let’s dive into the code to see how we can achieve this. We’ll use the datasets library from Hugging Face to handle dataset loading and manipulation.
1. Import Necessary Libraries
from datasets import load_dataset, Dataset, concatenate_datasets as hf_concatenate_datasets, DatasetDict, Features, Value
Explanation:
• load_dataset: Loads datasets from the Hugging Face Hub.
• Dataset: A class representing a dataset.
• concatenate_datasets: Function to concatenate multiple datasets (imported here under the alias hf_concatenate_datasets).
• DatasetDict: A dictionary-like class for datasets with multiple splits.
• Features, Value: Used to define a consistent schema for our datasets.
2. Load Datasets
# Load datasets
datasets = {
    "math_preference": load_dataset("argilla/distilabel-math-preference-dpo"),
    "intel_orca": load_dataset("argilla/distilabel-intel-orca-dpo-pairs"),
    "ultrafeedback_binarized": load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned"),
    "prm_dpo": load_dataset("M4-ai/prm_dpo_pairs_cleaned"),
    "truthy_dpo": load_dataset("jondurbin/truthy-dpo-v0.1"),
    "toxic_dpo": load_dataset("unalignment/toxic-dpo-v0.2"),
    "capybara": load_dataset("argilla/Capybara-Preferences"),
}
Explanation:
We load several DPO datasets from the Hugging Face Hub and store them in a dictionary for easy access. Each key corresponds to a dataset name, and the value is the loaded dataset.
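Each value in the dictionary is itself a DatasetDict keyed by split name. A quick, optional way to inspect which splits and columns every source provides (purely an inspection step, not required by the pipeline):
for name, dset_dict in datasets.items():
    for split, dset in dset_dict.items():
        print(f"{name}/{split}: {dset.column_names}")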
3. Define a Consistent Schema
# Define the consistent schema
consistent_features = Features({
    "origin": Value("string"),
    "chosen": [{"content": Value("string"), "role": Value("string")}],
    "rejected": [{"content": Value("string"), "role": Value("string")}],
    "prompt": Value("string"),
})
Explanation:
We define a consistent schema (consistent_features) to standardize the datasets. This schema includes:
• origin: The source of the data.
• chosen: The preferred response, stored as a list of chat messages (each with a content and a role).
• rejected: The less preferred response, in the same message format.
• prompt: The input prompt to which the answers respond.
4. Transform Examples Function
# Function to transform the 'chosen' and 'rejected' features into lists of dictionaries
def transform_example(example):
    if 'prompt' in example and 'chosen' in example:
        example['chosen'] = [
            {"content": example['prompt'], "role": "user"},
            {"content": example['chosen'], "role": "assistant"}
        ]
    if 'prompt' in example and 'rejected' in example:
        example['rejected'] = [
            {"content": example['prompt'], "role": "user"},
            {"content": example['rejected'], "role": "assistant"}
        ]
    return example
Explanation:
This function transforms each example to match our consistent schema. It structures the ‘chosen’ and ‘rejected’ responses as lists of dictionaries, pairing the prompt with the assistant’s response.
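For example, applied to a hypothetical flat record (invented here purely for illustration), the function rewrites the ‘chosen’ field into the chat-style structure the schema expects:
example = {
    "prompt": "What is 2 + 2?",
    "chosen": "2 + 2 equals 4.",
    "rejected": "2 + 2 equals 5.",
}
print(transform_example(example)["chosen"])
# [{'content': 'What is 2 + 2?', 'role': 'user'},
#  {'content': '2 + 2 equals 4.', 'role': 'assistant'}]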
5. Align Dataset Features
# Align dataset features
def align_features(dataset, source_name):
    aligned_data = {
        feature: dataset[feature] if feature in dataset.column_names else [None] * len(dataset)
        for feature in consistent_features
    }
    aligned_data["origin"] = [source_name] * len(dataset)
    return Dataset.from_dict(aligned_data, features=consistent_features)
Explanation:
The align_features function ensures that each dataset conforms to the consistent schema. It fills in missing features with None and adds the origin field.
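As a small sanity check (using a toy in-memory Dataset made up for this example, not one of the sources above), a dataset missing the ‘rejected’ column comes back padded with None and tagged with its origin:
toy = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": [[{"content": "What is 2 + 2?", "role": "user"},
                {"content": "4", "role": "assistant"}]],
})  # no 'rejected' and no 'origin' column
aligned = align_features(toy, "toy_source")
print(aligned[0]["origin"], aligned[0]["rejected"])  # 'toy_source' None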
6. Preprocess Datasets
We preprocess each dataset individually to filter and transform the data according to our requirements.
6.1 Capybara Dataset
# Capybara dataset
datasets['capybara']['train'] = datasets['capybara']['train']\
    .filter(lambda x: x['chosen_rating'] is None or float(x['chosen_rating']) >= 5)\
    .map(lambda x: {'prompt': x['chosen'][0]['content'] if x['chosen'] else "", **x})
Explanation:
• Filter: Keeps entries where the chosen_rating is None or greater than or equal to 5.
• Map: Extracts the prompt from the chosen response.
6.2 PRM DPO Dataset
# PRM DPO dataset
datasets['prm_dpo']['train'] = datasets['prm_dpo']['train']\
    .filter(lambda x: x['is_chosen_correct'])\
    .map(transform_example)
Explanation:
• Filter: Keeps entries where the chosen answer is marked as correct.
• Map: Applies the transform_example function to standardize the data.
6.3 Ultrafeedback Binarized Dataset
# Ultrafeedback binarized dataset
datasets['ultrafeedback_binarized']['train'] = datasets['ultrafeedback_binarized']['train']\
    .filter(lambda x: x['chosen-rating'] is not None and x['chosen-rating'] >= 5)
Explanation:
Filters out entries where the chosen answer has a rating less than 5.
6.4 Intel ORCA Dataset
# Intel ORCA dataset
datasets['intel_orca']['train'] = datasets['intel_orca']['train']\
    .rename_column('input', 'prompt')\
    .filter(lambda x: x['rating'] is not None and x['rating'][0] >= 10 and x['rating'][1] >= 10)\
    .filter(lambda x: not x.get('in_gsm8k_train', False))\
    .map(transform_example)
Explanation:
• Rename Column: Renames ‘input’ to ‘prompt’ for consistency.
• Filter: Keeps entries with high ratings (>=10) for both choices and excludes those in the GSM8K train set.
• Map: Transforms examples to match the schema.
6.5 Math Preference Dataset
# Math preference dataset
datasets['math_preference']['train'] = datasets['math_preference']['train']\
    .rename_column('instruction', 'prompt')\
    .rename_column('chosen_response', 'chosen')\
    .rename_column('rejected_response', 'rejected')\
    .filter(lambda x: x['chosen_rating'] is None or x['chosen_rating'] >= 9)\
    .map(transform_example)
Explanation:
• Rename Columns: Adjusts column names to match our schema.
• Filter: Keeps entries where the chosen rating is None or greater than or equal to 9.
• Map: Applies the transformation function.
6.6 Truthy DPO and Toxic DPO Datasets
# Truthy DPO and Toxic DPO datasets
datasets['truthy_dpo'] = datasets['truthy_dpo'].map(transform_example)
datasets['toxic_dpo'] = datasets['toxic_dpo'].map(transform_example)
Explanation:
Simply applies the transform_example function to standardize these datasets.
7. Align and Collect All Datasets
# Align and collect all datasets
all_datasets = []
for name, dataset_dict in datasets.items():
    for split, dataset in dataset_dict.items():
        aligned_dataset = align_features(dataset, name)
        all_datasets.append(aligned_dataset)
Explanation:
We iterate over each dataset and its splits, align their features, and collect them into a list.
8. Concatenate All Datasets
# Concatenate all datasets
combined_dataset = hf_concatenate_datasets(all_datasets)
Explanation:
We concatenate all the aligned datasets into a single dataset using concatenate_datasets (imported as hf_concatenate_datasets).
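As an optional sanity check (not part of the original script), it is easy to see how many rows each source contributed to the blend:
from collections import Counter
print(Counter(combined_dataset["origin"]))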
9. Create the Final Dataset
# Create the final dataset
final_dataset = DatasetDict({'train': combined_dataset})
Explanation:
We wrap the combined dataset into a DatasetDict with a ‘train’ split.
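If you want to reuse the result later, the DatasetDict can be saved locally or pushed to the Hugging Face Hub. This is an optional step not shown in the original walkthrough; the path and repository name below are placeholders.
final_dataset.save_to_disk("combined-dpo-dataset")
# final_dataset.push_to_hub("your-username/combined-dpo-dataset")  # requires prior `huggingface-cli login`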
10. Verify the Dataset
# Print the combined dataset schema and a few rows to verify
print(final_dataset)
print(final_dataset['train'][:1])
DatasetDict({
train: Dataset({
features: ['origin', 'chosen', 'rejected', 'prompt'],
num_rows: 45011
})
})
{'origin': ['math_preference'], 'chosen': [[{'content': 'How can I simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`?.\n Take a deep breath, think step by step, and give an accurate response', 'role': 'user'}, {'content': 'To simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`, we can follow a few steps:\n\nStep 1: Distribute the division symbol by multiplying the expression by the reciprocal of the denominator. The reciprocal of `2x` is `1/(2x)`, so the expression becomes `(3x^2 - 4y^3) * (1/(2x))`.\n\nStep 2: Simplify within the parentheses by dividing each term separately. \n\n- For the first term, `3x^2`, divide `3x^2` by `2x`. This gives us `(3x^2) / (2x) = (3/2) * (x^2 / x) = (3/2) * x`.\n\n- For the second term, `-4y^3`, divide `-4y^3` by `2x`. This gives us `(-4y^3) / (2x) = (-2) * (y^3 / x)`.\n\nStep 3: Combine the simplified terms from Step 2. The expression now becomes `(3/2) * x - 2 * (y^3 / x)`.\n\nSo, the simplified form of the algebraic expression `(3x^2 - 4y^3) / (2x)` is `(3/2) * x - 2 * (y^3 / x)`.', 'role': 'assistant'}]], 'rejected': [[{'content': 'How can I simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`?.\n Take a deep breath, think step by step, and give an accurate response', 'role': 'user'}, {'content': "To simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`, you can divide each term in the numerator by the denominator.\n\nFirst, let's divide `3x^2` by `2x`. Since both terms have a common factor of `x`, we can simplify this expression to `3x`.\n\nNext, we divide `-4y^3` by `2x`. We can simplify this expression by dividing each term separately. \n\nDividing `-4` by `2` gives `-2`. Then, dividing `y^3` by `x` gives `y^3/x`.\n\nSo, the simplified form of `(3x^2 - 4y^3) / (2x)` is `3x - 2y^3/x`.", 'role': 'assistant'}]], 'prompt': ['How can I simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`?.\n Take a deep breath, think step by step, and give an accurate response']}
Explanation:
We print the structure of the final dataset and the first few entries to ensure everything is correctly formatted.
Conclusion
In this article, we’ve journeyed through the process of building a high-quality DPO dataset by carefully aggregating and standardizing multiple existing datasets. By focusing on extracting only the best answers and ensuring a consistent schema, we’ve created a robust resource that can significantly enhance the performance of language models fine-tuned using Direct Preference Optimization.
This approach not only streamlines the training process but also ensures that our models are more closely aligned with human preferences, leading to outputs that are more accurate, reliable, and satisfying for users. As language models continue to evolve, techniques like DPO and the careful curation of training data become ever more important.
By investing time in building superior datasets, we lay the groundwork for more advanced, aligned, and human-centric AI models in the future. I encourage practitioners to apply these methods, explore further optimizations, and contribute to the ongoing effort to align AI models with human values and expectations.
I hope you found this guide helpful. If you liked this article, follow me on Hugging Face @llmat. Best of luck with your model fine-tuning!
References
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint arXiv:2305.18290.
Hugging Face. (n.d.). Datasets. Retrieved November 1, 2024, from https://huggingface.co/docs/datasets/index