credits: https://github.com/ggml-org/llama.cpp/tree/master
Introduction
GGUF (General GGML Universal Format) is a specialized file format used to store optimized, quantized versions of large language models (LLMs). GGUF files are designed for efficiency and portability, enabling large transformer models to run smoothly even on CPUs, laptops, and edge devices with limited resources, with only a modest loss in output quality.
This is achieved through quantization, a process that compresses model weights (e.g., from 16-bit or 32-bit floats down to 2–8 bits). This compression results in faster inference times, lower memory usage, and broader hardware compatibility.
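To make the savings concrete, here is a rough back-of-the-envelope calculation (a minimal sketch; the bits-per-weight figure is an approximation, and real GGUF files also carry metadata and some higher-precision tensors):

# Illustrative estimate only: rough weight-memory footprint at different precisions
params = 3_000_000_000           # a 3B-parameter model
bytes_fp16 = params * 2          # 16-bit floats: 2 bytes per weight
bytes_q4 = params * 4.5 / 8      # ~4.5 bits per weight for a 4-bit-class quant (approximate)

print(f"FP16 weights: ~{bytes_fp16 / 1e9:.1f} GB")
print(f"4-bit quantized weights: ~{bytes_q4 / 1e9:.1f} GB")

For a 3B model this works out to roughly 6 GB of weights in FP16 versus under 2 GB after 4-bit quantization.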
GGUFs are especially useful for:
Deploying LLMs locally without GPUs
Running AI on low-cost cloud instances (e.g., RunPod, Colab)
Sharing lightweight, efficient models on platforms like Hugging Face
Optimizing production applications where speed and resource efficiency are critical
Llama.cpp is one of the most powerful and widely used frameworks in the AI and LLM space. One of its main use cases is quantization: creating GGUF files, an optimized format for running large models efficiently, even on CPUs or in low-resource environments.
Unfortunately, due to a lack of structured documentation, beginners often struggle to use this framework effectively. This guide walks you through the entire process, from setup to uploading your GGUF models, so you can confidently create and share quantized models.
For all tasks related to GGUFs, I personally use RunPod: it's user-friendly, cost-effective, and easily provides the necessary compute power.
Begin by installing the required Python libraries:
pip install huggingface_hub numpy transformers
In this guide, we'll focus on building for CPU (other build options are documented in the llama.cpp repository):
Note: If you are using RunPod, first run these commands to install cmake and libcurl4-openssl-dev:
apt-get update
apt-get install -y cmake
apt-get install -y libcurl4-openssl-dev
Then run the commands below:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
pip install -r requirements.txt
cd ..
Running these commands compiles all the necessary executables into llama.cpp/build/bin.
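If you want to confirm the build succeeded, a quick check like the one below lists the two binaries used later in this guide (a minimal sketch, assuming the default build/bin output directory):

from pathlib import Path

# Default output directory for the CMake build above (assumption: paths unchanged)
bin_dir = Path("llama.cpp/build/bin")

# The two executables used later in this guide
for name in ["llama-quantize", "llama-gguf-split"]:
    binary = bin_dir / name
    print(f"{name}: {'found' if binary.exists() else 'missing'}")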
Log in to the Hugging Face Hub and download your desired model:
from huggingface_hub import login, snapshot_download

login()  # Llama models are gated, so you need a Hugging Face token for an account with access granted

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
model_path = snapshot_download(
    repo_id=MODEL_ID,
    local_dir=MODEL_ID,
)
Here are the GGUF quantization formats you can create; common choices include Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, and Q8_0 (running the llama-quantize binary without arguments prints the full list of supported types).
Before creating quantized GGUFs, you need to convert the Hugging Face model into FP16 format using the provided script.
mkdir fp16
python llama.cpp/convert_hf_to_gguf.py [MODEL_PATH] --outtype f16 --outfile [OUTPUT_FILE]
In my case, MODEL_PATH is meta-llama/Llama-3.2-3B-Instruct and OUTPUT_FILE is fp16/Llama-3.2-3B-Instruct-fp16.gguf:
mkdir fp16
python llama.cpp/convert_hf_to_gguf.py meta-llama/Llama-3.2-3B-Instruct --outtype f16 --outfile fp16/Llama-3.2-3B-Instruct-fp16.gguf
This command will generate:
fp16/Llama-3.2-3B-Instruct-fp16.gguf
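To double-check the conversion, you can peek at the file with the gguf Python package that llama.cpp's requirements install. This is a minimal sketch; the GGUFReader attributes used here are assumptions based on the gguf-py package, so adjust to your installed version:

from gguf import GGUFReader  # gguf-py, pulled in by llama.cpp's requirements.txt

reader = GGUFReader("fp16/Llama-3.2-3B-Instruct-fp16.gguf")
print(f"Metadata fields: {len(reader.fields)}, tensors: {len(reader.tensors)}")

# Print a few tensor names as a spot check
for tensor in reader.tensors[:3]:
    print(tensor.name)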
Next, use the llama-quantize binary to generate a quantized GGUF file. Here, we'll create a Q4_K_M quantized version:
mkdir q4
./llama.cpp/build/bin/llama-quantize [INPUT_FILE] [OUTPUT_FILE] [QUANTIZATION_METHOD]
In my case, INPUT_FILE is fp16/Llama-3.2-3B-Instruct-fp16.gguf (the FP16 file we just generated), OUTPUT_FILE is q4/Llama-3.2-3B-Instruct-q4_k_m.gguf, and QUANTIZATION_METHOD is Q4_K_M:
mkdir q4
./llama.cpp/build/bin/llama-quantize fp16/Llama-3.2-3B-Instruct-fp16.gguf q4/Llama-3.2-3B-Instruct-q4_k_m.gguf Q4_K_M
This will output your 4-bit GGUF at:
q4/Llama-3.2-3B-Instruct-q4_k_m.gguf
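As a quick sanity check, you can compare the FP16 and Q4_K_M file sizes to see how much compression you gained (a minimal sketch, assuming the file paths used above):

from pathlib import Path

fp16_file = Path("fp16/Llama-3.2-3B-Instruct-fp16.gguf")
q4_file = Path("q4/Llama-3.2-3B-Instruct-q4_k_m.gguf")

fp16_gb = fp16_file.stat().st_size / 1e9
q4_gb = q4_file.stat().st_size / 1e9
print(f"FP16: {fp16_gb:.2f} GB, Q4_K_M: {q4_gb:.2f} GB ({fp16_gb / q4_gb:.1f}x smaller)")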
If your model is very large (e.g., 70B+), uploading it in a single file may be impractical. In such cases, sharding splits the file into smaller parts:
mkdir splits
./llama.cpp/build/bin/llama-gguf-split --split --split-max-tensors [MAX_TENSORS] [MODEL_PATH] [OUTPUT_PATH]
In my case, MAX_TENSORS is 256, MODEL_PATH is q4/Llama-3.2-3B-Instruct-q4_k_m.gguf, and OUTPUT_PATH is splits/Llama-3.2-3B-Instruct-q4_k_m:
mkdir splits
./llama.cpp/build/bin/llama-gguf-split --split --split-max-tensors 256 q4/Llama-3.2-3B-Instruct-q4_k_m.gguf splits/Llama-3.2-3B-Instruct-q4_k_m
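The tool writes a numbered series of shard files next to the output prefix. The sketch below simply lists whatever was produced (the exact -00001-of-0000N.gguf naming is an assumption, so treat the glob pattern as illustrative):

from pathlib import Path

# List the shards produced by llama-gguf-split (naming pattern assumed)
for shard in sorted(Path("splits").glob("Llama-3.2-3B-Instruct-q4_k_m-*.gguf")):
    print(shard.name, f"{shard.stat().st_size / 1e9:.2f} GB")

When loading a sharded model, llama.cpp can typically be pointed at the first shard and will pick up the remaining parts automatically.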
To upload the quantized model to Hugging Face, use:
cd q4
huggingface-cli upload [HUGGINGFACE_MODEL_ID] .
In my case, HUGGINGFACE_MODEL_ID is wasifis/Llama-3.2-3B-Instruct-q4_k_m, and the trailing . tells the CLI to upload the contents of the current directory:
cd q4
huggingface-cli upload wasifis/Llama-3.2-3B-Instruct-q4_k_m .
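If you prefer to stay in Python, the huggingface_hub API can do the same upload. This is a minimal sketch assuming the repo ID and local folder used above; with exist_ok=True, create_repo is harmless if the repo already exists:

from huggingface_hub import HfApi

api = HfApi()
repo_id = "wasifis/Llama-3.2-3B-Instruct-q4_k_m"  # repo ID from the example above

# Create the model repo if it doesn't exist yet, then push the GGUF folder
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(folder_path="q4", repo_id=repo_id, repo_type="model")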
💬 Drop a comment below
🔗 Connect with me on LinkedIn
📧 Email me at wasifismcontact@gmail.com
Always happy to support and connect with fellow AI builders!