credits: https://github.com/ggml-org/llama.cpp/tree/master
Introduction
GGUF (General GGML Universal Format) is a specialized file format used to store optimized, quantized versions of large language models (LLMs). GGUF files are designed for efficiency and portability, enabling large transformer models to run smoothly even on CPUs, laptops, and edge devices with limited resources, with only a modest loss in output quality.
This is achieved through quantization, a process that compresses model weights (e.g., from 16-bit or 32-bit floats down to 2–8 bits). This compression results in faster inference times, lower memory usage, and broader hardware compatibility.
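To make the savings concrete, here is a rough back-of-the-envelope calculation (a minimal sketch; the bits-per-weight figure is an approximation, and real GGUF files also carry metadata and some higher-precision tensors):

# Illustrative estimate only: rough weight-memory footprint at different precisions
params = 3_000_000_000           # a 3B-parameter model
bytes_fp16 = params * 2          # 16-bit floats: 2 bytes per weight
bytes_q4 = params * 4.5 / 8      # ~4.5 bits per weight for a 4-bit-class quant (approximate)

print(f"FP16 weights: ~{bytes_fp16 / 1e9:.1f} GB")
print(f"4-bit quantized weights: ~{bytes_q4 / 1e9:.1f} GB")

For a 3B model this works out to roughly 6 GB of weights in FP16 versus under 2 GB after 4-bit quantization.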
GGUFs are especially useful for:
Deploying LLMs locally without GPUs
Running AI on low-cost cloud instances (e.g., RunPod, Colab)
Sharing lightweight, efficient models on platforms like Hugging Face
Optimizing production applications where speed and resource efficiency are critical
Llama.cpp is one of the most powerful and widely used frameworks in the AI and LLM space. One of its main use cases is quantization: creating GGUF files, an optimized format for running large models efficiently, even on CPUs or in low-resource environments.
Unfortunately, due to a lack of structured documentation, beginners often struggle to use this framework effectively. This guide walks you through the entire process, from setup to uploading your GGUF models, so you can confidently create and share quantized models.
For all tasks related to GGUFs, I personally use RunPod: it's user-friendly, cost-effective, and easily provides the necessary compute power.
Begin by installing the required Python libraries:
pip install huggingface_hub numpy transformers
In this guide, we'll focus on building for CPU (other build options are documented in the llama.cpp repository):
Note: If you are using RunPod, first run these commands to install cmake and libcurl4-openssl-dev:
apt-get update
apt-get install -y cmake
apt-get install -y libcurl4-openssl-dev
Then run the commands below:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
pip install -r requirements.txt
cd ..
Running these commands compiles all the necessary executables into llama.cpp/build/bin.
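If you want to confirm the build succeeded, a quick check like the one below lists the two binaries used later in this guide (a minimal sketch, assuming the default build/bin output directory):

from pathlib import Path

# Default output directory for the CMake build above (assumption: paths unchanged)
bin_dir = Path("llama.cpp/build/bin")

# The two executables used later in this guide
for name in ["llama-quantize", "llama-gguf-split"]:
    binary = bin_dir / name
    print(f"{name}: {'found' if binary.exists() else 'missing'}")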
Log in to the Hugging Face Hub and download your desired model:
from huggingface_hub import login, snapshot_download

login()  # Llama models are gated, so you need a Hugging Face token for an account with access granted

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
model_path = snapshot_download(
    repo_id=MODEL_ID,
    local_dir=MODEL_ID,
)
Here are the GGUF quantization formats you can create; common choices include Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, and Q8_0 (running the llama-quantize binary without arguments prints the full list of supported types).
Before creating quantized GGUFs, you need to convert the Hugging Face model into FP16 format using the provided script.
mkdir fp16
python llama.cpp/convert_hf_to_gguf.py [MODEL_PATH] --outtype f16 --outfile [OUTPUT_FILE]
In my case, MODEL_PATH is meta-llama/Llama-3.2-3B-Instruct and OUTPUT_FILE is fp16/Llama-3.2-3B-Instruct-fp16.gguf:
mkdir fp16
python llama.cpp/convert_hf_to_gguf.py meta-llama/Llama-3.2-3B-Instruct --outtype f16 --outfile fp16/Llama-3.2-3B-Instruct-fp16.gguf
This command will generate:
fp16/Llama-3.2-3B-Instruct-fp16.gguf
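To double-check the conversion, you can peek at the file with the gguf Python package that llama.cpp's requirements install. This is a minimal sketch; the GGUFReader attributes used here are assumptions based on the gguf-py package, so adjust to your installed version:

from gguf import GGUFReader  # gguf-py, pulled in by llama.cpp's requirements.txt

reader = GGUFReader("fp16/Llama-3.2-3B-Instruct-fp16.gguf")
print(f"Metadata fields: {len(reader.fields)}, tensors: {len(reader.tensors)}")

# Print a few tensor names as a spot check
for tensor in reader.tensors[:3]:
    print(tensor.name)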
Next, use the llama-quantize binary to generate a quantized GGUF file. Here, we'll create a Q4_K_M quantized version:
mkdir q4
./llama.cpp/build/bin/llama-quantize [INPUT_FILE] [OUTPUT_FILE] [QUANTIZATION_METHOD]
In my case, INPUT_FILE is fp16/Llama-3.2-3B-Instruct-fp16.gguf (the FP16 file we just generated), OUTPUT_FILE is q4/Llama-3.2-3B-Instruct-q4_k_m.gguf, and QUANTIZATION_METHOD is Q4_K_M:
mkdir q4
./llama.cpp/build/bin/llama-quantize fp16/Llama-3.2-3B-Instruct-fp16.gguf q4/Llama-3.2-3B-Instruct-q4_k_m.gguf Q4_K_M
This will output your 4-bit GGUF at:
q4/Llama-3.2-3B-Instruct-q4_k_m.gguf
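As a quick sanity check, you can compare the FP16 and Q4_K_M file sizes to see how much compression you gained (a minimal sketch, assuming the file paths used above):

from pathlib import Path

fp16_file = Path("fp16/Llama-3.2-3B-Instruct-fp16.gguf")
q4_file = Path("q4/Llama-3.2-3B-Instruct-q4_k_m.gguf")

fp16_gb = fp16_file.stat().st_size / 1e9
q4_gb = q4_file.stat().st_size / 1e9
print(f"FP16: {fp16_gb:.2f} GB, Q4_K_M: {q4_gb:.2f} GB ({fp16_gb / q4_gb:.1f}x smaller)")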
If your model is very large (e.g., 70B+), uploading it in a single file may be impractical. In such cases, sharding splits the file into smaller parts:
mkdir splits
./llama.cpp/build/bin/llama-gguf-split --split --split-max-tensors [MAX_TENSORS] [MODEL_PATH] [OUTPUT_PATH]
In my case, MAX_TENSORS is 256, MODEL_PATH is q4/Llama-3.2-3B-Instruct-q4_k_m.gguf, and OUTPUT_PATH is splits/Llama-3.2-3B-Instruct-q4_k_m:
mkdir splits
./llama.cpp/build/bin/llama-gguf-split --split --split-max-tensors 256 q4/Llama-3.2-3B-Instruct-q4_k_m.gguf splits/Llama-3.2-3B-Instruct-q4_k_m
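The tool writes a numbered series of shard files next to the output prefix. The sketch below simply lists whatever was produced (the exact -00001-of-0000N.gguf naming is an assumption, so treat the glob pattern as illustrative):

from pathlib import Path

# List the shards produced by llama-gguf-split (naming pattern assumed)
for shard in sorted(Path("splits").glob("Llama-3.2-3B-Instruct-q4_k_m-*.gguf")):
    print(shard.name, f"{shard.stat().st_size / 1e9:.2f} GB")

When loading a sharded model, llama.cpp can typically be pointed at the first shard and will pick up the remaining parts automatically.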
To upload the quantized model to Hugging Face, use:
cd q4
huggingface-cli upload [HUGGINGFACE_MODEL_ID] .
In my case, HUGGINGFACE_MODEL_ID is wasifis/Llama-3.2-3B-Instruct-q4_k_m, and the trailing . tells the CLI to upload the contents of the current directory:
cd q4
huggingface-cli upload wasifis/Llama-3.2-3B-Instruct-q4_k_m .
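If you prefer to stay in Python, the huggingface_hub API can do the same upload. This is a minimal sketch assuming the repo ID and local folder used above; with exist_ok=True, create_repo is harmless if the repo already exists:

from huggingface_hub import HfApi

api = HfApi()
repo_id = "wasifis/Llama-3.2-3B-Instruct-q4_k_m"  # repo ID from the example above

# Create the model repo if it doesn't exist yet, then push the GGUF folder
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(folder_path="q4", repo_id=repo_id, repo_type="model")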
💬 Drop a comment below
🔗 Connect with me on LinkedIn
📧 Email me at wasifismcontact@gmail.com
Always happy to support and connect with fellow AI builders!