From Fine-Tuning to Deployment

Harnessing Custom LLMs with Ollama and Quantization

Post-Training LLM
Author

Matthias De Paolis

Published

August 31, 2024


Imagine unlocking the full potential of large language models (LLMs) right on your local machine, without relying on costly cloud services. This is where Ollama shines: it lets you harness the power of LLMs directly on your own hardware. While Ollama offers a range of ready-to-use models, there are times when a custom model is necessary, whether it is fine-tuned on specific data or designed for a particular task. Efficiently deploying these custom models on local hardware often requires optimization techniques such as quantization. In this article, we explore the concept of quantization and demonstrate how to apply it to a fine-tuned model from Hugging Face. We then cover how to install Ollama, create a corresponding Modelfile for a custom model, and integrate this custom model into Ollama, showing how easy it is to bring AI capabilities in-house. All the code used in this article is available on Google Colab and in the LLM Tutorial.

Ollama

Ollama is an open-source platform that empowers users to run large language models (LLMs) locally, bypassing the need for cloud-based services. Designed with accessibility in mind, Ollama simplifies the installation and management of a wide range of pre-trained LLMs and embedding models, enabling easy deployment without extensive technical expertise. The platform provides a local API for seamless application integration and supports frameworks such as LangChain. Recently, tool-calling functionality has been introduced. This feature allows models to interact with external tools - such as APIs, web browsers, and code interpreters - enabling them to perform complex tasks and interact with the outside world more effectively. Thanks to a large open-source community, Ollama continues to evolve, making it a robust, cost-effective solution for local AI deployment.

Quantization in Large Language Models

Quantization is a crucial technique in machine learning that involves reducing the precision of a model’s weights and activations, without significantly impacting the model’s performance. Traditionally, these models operate using 32-bit floating point (FP32) formats, but quantization allows for the conversion of these weights to lower precision formats such as 16-bit (FP16), 8-bit (INT8), 4-bit, or even 2-bit. The primary goals of quantization are to reduce the model’s memory footprint and computational demands, thereby making it possible to deploy the model on resource-constrained hardware. There are two types of quantization techniques: post-training quantization and quantization-aware training.
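
To build intuition for what quantization does, here is a minimal sketch (plain NumPy, not a production quantization library) that maps FP32 weights to 8-bit integers with a simple symmetric scheme and measures the round-trip error:

import numpy as np

# Toy FP32 "weights" standing in for one layer of a model
weights = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric 8-bit quantization: one scale per tensor, values mapped into [-127, 127]
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4

# Dequantize to see how much information was lost
deq_weights = q_weights.astype(np.float32) * scale
print("memory reduction:", weights.nbytes / q_weights.nbytes)  # 4.0
print("mean absolute error:", np.abs(weights - deq_weights).mean())

Real quantization schemes such as those used by llama.cpp are considerably more sophisticated (per-block scales, k-means clustering, mixed precisions), but the underlying idea is the same: store the weights with fewer bits and accept a small, controlled approximation error.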

Types of Quantization

• Post-Training Quantization (PTQ): PTQ is a straightforward technique in which the model is quantized after it has been fully trained. This method is quick to implement and does not require retraining the model, making it ideal for scenarios where time or resources are limited. However, it may result in a slight decrease in accuracy since the model was not trained with quantization in mind.

• Quantization-Aware Training (QAT): QAT integrates quantization into the training process, allowing the model to learn to compensate for the reduced precision. This approach generally results in better performance compared to PTQ, as the model adapts to the quantized environment during training. However, QAT requires more computational resources during training and is more complex to implement.

Quantizing a Custom Model

In our example, we will use the GGUF (GPT-Generated Unified Format) quantization format, released by Georgi Gerganov and the llama.cpp team. GGUF employs the post-training quantization technique and supports a range of quantization methods, allowing developers to balance model accuracy and efficiency based on their specific needs. This format is particularly favored by the community for its ability to run efficiently on both CPU and Apple devices, making it an excellent choice for local testing and deployment.

Installing the llama.cpp library

To start quantizing our model, we need to install the llama.cpp library. The library includes utilities to convert models into GGUF format and tools to quantize these models into various bit-widths depending on the hardware constraints.

!git clone https://github.com/ggerganov/llama.cpp
!pip install -r llama.cpp/requirements.txt -q
!cd llama.cpp && make -j 8

Downloading and Preparing the Model

Once we have the necessary tools, the next step is to download the model we want to quantize from Hugging Face. In this example, we use the Mistral-v0.3-7B-ORPO model that we fine-tuned in the last article. We download the model and rename its local folder to ./model.

!git lfs install
!git clone https://huggingface.co/llmat/Mistral-v0.3-7B-ORPO Mistral-v0.3-7B-ORPO
!mv Mistral-v0.3-7B-ORPO model/

Once we have our model, we need to convert it to the GGUF F16 format.

!python ./llama.cpp/convert_hf_to_gguf.py ./model --outfile ./model/Mistral-v0.3-7B-ORPO-f16.gguf --outtype f16

Now, we can choose the method by which we want our model to be quantized. In the context of llama.cpp, quantization methods are typically named following a specific convention: Q#_K_M. Let’s break down what each component means (a short parsing sketch in Python follows the list):

• Q: Stands for “Quantization,” indicating that the model has undergone a process to reduce its numerical precision.

• #: Refers to the number of bits used in the quantization process. For example, 4 in Q4_K_M indicates that the model has been quantized using 4-bit integers.

• K: Denotes the use of k-means clustering in the quantization process. K-means clustering is a technique used to group similar weights, reducing the variation between them and allowing for more efficient quantization with minimal loss of accuracy.

• M: Indicates the size category of the model after quantization, where:

• S = Small

• M = Medium

• L = Large
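
As a quick illustration of this convention, the few lines of Python below split a method name such as Q4_K_M into its components. This is not an official llama.cpp parser, just a sketch of the naming scheme described above:

import re

SIZE_LABELS = {"S": "small", "M": "medium", "L": "large"}

def parse_quant_name(name: str) -> dict:
    """Rough, illustrative parser for llama.cpp-style method names (e.g. Q4_K_M, Q8_0)."""
    m = re.fullmatch(r"Q(?P<bits>\d+)_(?:(?P<variant>\d+)|K(?:_(?P<size>[SML]))?)", name)
    if m is None:
        raise ValueError(f"unrecognized method name: {name}")
    return {
        "bits": int(m.group("bits")),
        "k_means": m.group("variant") is None,  # _K methods use k-means clustering
        "size_category": SIZE_LABELS.get(m.group("size")),
    }

for name in ["Q2_K", "Q4_0", "Q4_K_M", "Q8_0"]:
    print(name, parse_quant_name(name))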

Quantization Methods Explained

Here’s a closer look at the different quantization methods supported by llama.cpp and Ollama, following the Q#_K_M naming convention:

Q2_K: This method uses 2-bit quantization, offering the most significant size reduction but with a considerable loss in accuracy. It’s mainly used in highly constrained environments where memory and processing power are extremely limited.

Q3_K_S: A 3-bit quantization method using k-means clustering, optimized for small models. This method provides significant memory savings and is used when accuracy can be somewhat compromised.

Q3_K_M: Similar to Q3_K_S but optimized for medium-sized models. This method offers a balanced trade-off between memory usage and accuracy.

Q3_K_L: This method is tailored for larger models, using 3-bit quantization with k-means clustering to reduce size while maintaining as much accuracy as possible.

Q4_0: A standard 4-bit quantization method that does not use k-means clustering. This is the default method, offering a good balance between size reduction and maintaining model accuracy. It’s suitable for general use cases where memory is limited but accuracy is still important.

Q4_1: Similar to Q4_0 but with slight variations in how quantization is applied, potentially offering slightly better accuracy at the cost of a small increase in resource usage.

Q4_K_S: A variation of 4-bit quantization optimized for smaller models. It reduces the model size significantly while preserving reasonable accuracy.

Q4_K_M: This method applies 4-bit quantization with k-means clustering to medium-sized models, offering an excellent balance between size and accuracy. It’s one of the most recommended methods for general use.

Q5_0: Uses 5-bit quantization, which offers higher precision than 4-bit methods, resulting in better accuracy. This method is a good choice when you have slightly more memory available and need to maintain higher accuracy.

Q5_1: A refinement of Q5_0, providing even greater accuracy by applying more sophisticated quantization techniques, though at the cost of increased resource requirements.

Q5_K_S: This method uses 5-bit quantization with k-means clustering, optimized for smaller models, providing higher accuracy than 4-bit methods with only a slight increase in resource use.

Q5_K_M: An advanced 5-bit quantization technique optimized for medium-sized models, providing high accuracy with reasonable memory efficiency. This method is often recommended for scenarios where accuracy is critical but resources are still somewhat limited.

Q6_K: This method uses 6-bit quantization, providing a middle ground between 4-bit and 8-bit methods. It’s suitable when you need more accuracy than what 4-bit offers but can’t afford the higher resource demands of 8-bit quantization.

Q8_0: Uses 8-bit quantization, which is nearly as accurate as the original float16 model. This method is best for scenarios where you need to preserve as much accuracy as possible while still reducing the model size.

In our example, we choose to quantize our model to 4-bit using the Q4_K_M method.

!mkdir Mistral-v0.3-7B-ORPO_Q4_K_M
!./llama.cpp/llama-quantize ./model/Mistral-v0.3-7B-ORPO-f16.gguf ./Mistral-v0.3-7B-ORPO_Q4_K_M/Mistral-v0.3-7B-ORPO_Q4_K_M.gguf Q4_K_M
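
If you are unsure which trade-off suits your hardware, you can quantize the same F16 file with several of the methods above and compare the resulting file sizes. Below is a minimal sketch that reuses the llama-quantize binary and the paths from this example (adjust them to your own setup):

import os
import subprocess

src = "./model/Mistral-v0.3-7B-ORPO-f16.gguf"
out_dir = "./quant_comparison"
os.makedirs(out_dir, exist_ok=True)

# Quantize with a few different methods and report the size of each output file
for method in ["Q4_K_M", "Q5_K_M", "Q8_0"]:
    dst = f"{out_dir}/Mistral-v0.3-7B-ORPO-{method}.gguf"
    subprocess.run(["./llama.cpp/llama-quantize", src, dst, method], check=True)
    print(f"{method}: {os.path.getsize(dst) / 1e9:.2f} GB")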

Push model to hub (Optional)

# Hugging Face Hub imports
from huggingface_hub import create_repo, HfApi
import dotenv
import os

# HUGGINGFACE_TOKEN is defined in the .env file
dotenv.load_dotenv()
username = "llmat"
HUGGINGFACE_TOKEN = os.getenv('HUGGINGFACE_TOKEN')
MODEL_NAME = "Mistral-v0.3-7B-ORPO_Q4_K_M"

api = HfApi(token=HUGGINGFACE_TOKEN)

# Create an empty repo for the GGUF files
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
    token=HUGGINGFACE_TOKEN,
)

# Upload the gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
)

Install Ollama

With our model now quantized, the next step is to install and start Ollama. To begin, download Ollama from the link below:

Download link: https://ollama.com/download

For Windows installation: After downloading the executable file, simply run it and Ollama will be installed automatically.

For macOS installation: After the download completes, unzip the downloaded file and drag Ollama.app into your Applications folder.

For Linux installation: Run the command below in your terminal and Ollama will be installed.

!curl -fsSL https://ollama.com/install.sh | sh

Once the installation is complete, start the Ollama server. If you are using Google Colab, execute the following commands to start Ollama:

!pip install colab-xterm  # https://pypi.org/project/colab-xterm/
%load_ext colabxterm
%xterm

# A terminal window pops up; run 'ollama serve' inside it to start the server

If you are running in a local environment, use this command to start the Ollama server:

!ollama serve
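
If you prefer to keep everything inside notebook cells rather than an interactive terminal, a small sketch that launches the server as a background process (assuming the ollama binary is already on your PATH) looks like this:

import subprocess
import time

# Start 'ollama serve' in the background and redirect its logs to a file
with open("ollama.log", "w") as log:
    server = subprocess.Popen(["ollama", "serve"], stdout=log, stderr=log)

time.sleep(5)  # crude wait; for a more robust check, poll http://localhost:11434
print("Ollama server running with PID", server.pid)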

Before we can add our quantized model to the Ollama server, Ollama requires us to create a Modelfile.

Modelfile for Ollama

Ollama’s Modelfile is an evolving syntax that acts as a blueprint, defining the key components and parameters needed to customize model behavior within the Ollama ecosystem. The Modelfile includes several key instructions (a minimal illustrative example follows the list):

• FROM (Required): Specifies the base model or file to build from.

• PARAMETER: Sets various operational parameters like temperature, context window size, and stopping conditions, influencing model output and behavior.

• TEMPLATE: Defines the prompt structure, including system messages and user prompts, which guides the model’s responses.

• SYSTEM: Sets the system message to dictate the model’s behavior.

• ADAPTER: Applies LoRA adapters to the model for further customization.

• LICENSE: Specifies the legal license under which the model is shared.

• MESSAGE: Provides a message history to influence how the model generates responses.
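
To make the syntax concrete, here is a small, purely illustrative Modelfile assembled as a Python string (hypothetical path and values); the Modelfile we actually build for our model below only needs FROM, PARAMETER, and TEMPLATE:

# A minimal, hypothetical Modelfile just to illustrate the syntax of the
# instructions listed above; it is not the one used for our quantized model.
example_modelfile = """FROM ./some-model.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM You are a concise, helpful assistant.
"""
print(example_modelfile)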

Create Custom Modelfile

In our example, we only need to set the path, define the template, and set parameters for the stopping conditions.

Path to the Quantized Model: The Modelfile needs to specify the path to the quantized model stored on our system. This ensures the correct model is loaded for processing.

Template for Message Processing: The template within the Modelfile is based on the chat template used in the base model. It is responsible for processing and formatting messages according to their roles, such as “user” or “assistant.” This structure guarantees that the model’s output adheres to the dialogue format it was fine-tuned for.

Stop Parameters: The stop parameters identify the boundaries of the instructions provided to the model and the responses generated by it. The markers “[INST]” and “[/INST]” signal the start and end of the user’s input, respectively. These delimiters ensure the model recognizes where the user’s message begins and ends.

Below is how we define the path to our quantized model, construct the template content, and set the stop parameters in the Modelfile for our example:

# Creating the content for the Modelfile
template_content = """TEMPLATE """
template_content += '''"""
{{- if .Messages }}
    {{- range $index, $_ := .Messages }}
        {{- if eq .Role "user" }}
            [INST] 
            {{ .Content }}[/INST]
        {{- else if eq .Role "assistant" }}
            {{- if .Content }} {{ .Content }}
            {{- end }}</s>
        {{- end }}
    {{- end }}
{{- else }}
    [INST] 
    {{ .Prompt }}[/INST]
{{- end }} 
{{ .Response }}
{{- if .Response }}</s>
{{- end }}
"""'''

# Write the rest of the parameters to the file
with open('./Mistral-v0.3-7B-ORPO_Q4_K_M/modelfile', 'w') as file:
    file.write('FROM ./Mistral-v0.3-7B-ORPO_Q4_K_M.gguf\n\n')
    file.write('PARAMETER stop "[INST]"\n')
    file.write('PARAMETER stop "[/INST]"\n\n')
    file.write(template_content)

Let’s break down the template content:

• Processing Messages: The template processes the list of messages (.Messages) by identifying the role of each sender (.Role), effectively structuring the conversation.

• Formatting User Messages: Messages from the “user” are enclosed within [INST] tags. If the message is the user’s only input and a system message exists, it is included at the beginning.

• Formatting Assistant Messages: Messages from the “assistant” are output directly without additional tags, with a </s> tag appended to signify the end of the response.

• Handling Edge Cases: If no messages are present, the template provides a fallback instruction within [INST] tags to ensure that the model still generates meaningful content.

• Final Response Handling: The final response is appended and closed with a </s> tag, ensuring the conversation is properly terminated.
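
To see roughly what this template produces, here is a small Python re-creation of its logic for a short conversation. This is illustrative only; the actual rendering is performed by Ollama’s Go template engine, and whitespace handling is simplified here for readability:

# Illustrative only: approximate what the template above renders for a short chat
messages = [
    {"role": "user", "content": "What is one plus one?"},
    {"role": "assistant", "content": "One plus one equals two."},
]

def render(messages):
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']}[/INST]")
        elif message["role"] == "assistant":
            parts.append(f" {message['content']}</s>")
    return "".join(parts)

print(render(messages))
# [INST] What is one plus one?[/INST] One plus one equals two.</s>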

After creating the Modelfile, you can display the file content to verify everything was written as expected:

# Display the Modelfile content
with open('./Mistral-v0.3-7B-ORPO_Q4_K_M/modelfile', 'r') as file:
    content = file.read()
print(content)

With the Modelfile ready, we can now create and add our quantized model to Ollama. This command registers the quantized model with Ollama using the configurations specified in the Modelfile:

!ollama create mistral-v0.3-7B-orpo_Q4_K_M -f ./Mistral-v0.3-7B-ORPO_Q4_K_M/modelfile

We can check whether our quantized model is now listed and ready to use.

!ollama list

Next, we install the necessary library to test the model using the LangChain framework:

!pip install langchain-community langchain-core

Now we run and test the model on Ollama:

from langchain_community.llms import Ollama

# Point LangChain at the local Ollama server and our custom model
ollama = Ollama(base_url="http://localhost:11434", model="mistral-v0.3-7B-orpo_Q4_K_M")

TEXT_PROMPT = "What is one plus one?"

print(ollama.invoke(TEXT_PROMPT))
One plus one equals two.

The model should return a correct response.
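
Alternatively, you can query Ollama’s local REST API directly, without any additional framework. Here is a minimal sketch using the requests library against the default endpoint (http://localhost:11434/api/generate):

import requests

# Ollama exposes a local HTTP API on port 11434 by default
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral-v0.3-7B-orpo_Q4_K_M",
        "prompt": "What is one plus one?",
        "stream": False,  # return a single JSON object instead of a token stream
    },
)
print(response.json()["response"])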

Conclusion

This article has walked you through the process of quantizing a custom model, integrating it with Ollama, and testing it locally. By leveraging the llama.cpp framework, we quantized our custom model to the Q4_K_M format and pushed it to the Hugging Face Hub. We then discussed how to create the corresponding Modelfile and how to integrate our model into the Ollama framework.

Quantization offers significant benefits, including a reduced memory footprint, faster inference times, and lower power consumption. These advantages make it feasible to deploy sophisticated AI models across a variety of hardware configurations, from high-performance servers to low-power edge devices, broadening the scope of where and how AI can be applied. I hope you enjoyed reading this article and learned something new. You can find the quantized model from this example on Hugging Face.

References

Brev.dev. (2024). Convert a fine-tuned model to GGUF format and run on Ollama. https://brev.dev/blog/convert-to-llamacpp

IBM. (2024). GGUF versus GGML. https://www.ibm.com/think/topics/gguf-versus-ggml

Ollama. (2024). Ollama blog. https://ollama.com/blog

PatrickPT’s Blog. (2024). LLM Quantization in a nutshell. https://patrickpt.github.io/posts/quantllm/