Post-training pruning is an essential technique for reducing the complexity of neural networks after training, enhancing their deployability on resource-constrained devices. This method optimizes computational resource usage while maintaining model performance. In this post, we introduce the Weight Refinement-based ClearCut Pruning Method, a novel post-training pruning technique that applies a target sparsity ratio to weight matrices. This approach efficiently balances the trade-off between model accuracy and computational efficiency. By calculating weight importance scores, ClearCut Pruning selectively removes less significant weights without the need for retraining. Experimental results on state-of-the-art models, including LLaMA-2-7B, LLaMA-3.1-8B, OPT-6.7B, and OPT-2.7B, show significant performance improvements compared to existing pruning methods, achieving up to 50% sparsity with minimal degradation in accuracy.
The rise of large language models (LLMs) has dramatically reshaped the field of artificial intelligence, setting new benchmarks for natural language processing (NLP) tasks. These models, known for their vast number of parameters, are capable of remarkable feats. However, their large size and immense computational requirements pose challenges, especially in environments with limited resources.
To overcome these obstacles, model compression has gained considerable attention. Among the most promising techniques are network pruning and model quantization. While quantization reduces the size of a model by lowering the precision of its weights and activations, pruning focuses on removing redundant weights, resulting in a sparser network.
One of the most exciting methods for introducing sparsity into neural networks is post-training pruning (PTP). PTP allows us to reduce model size after the model is trained, offering a practical solution for deploying large models in resource-constrained settings. However, despite its potential, PTP has been plagued by performance degradation when compared to dense models. Models such as SparseGPT and Wanda have made strides in unstructured pruning, but they often come with significant accuracy drops when aiming for practical speed improvements.
To address these issues, we introduce a new post-training pruning technique specifically designed for LLMs, offering an effective solution for model compression. Our method incorporates two key innovations:
Weight Refinement: This new pruning metric addresses the limitations of traditional methods that typically eliminate entire weight channels. By considering both input and output channels of weight matrices and activation data, weight refinement preserves essential model features.
ClearCut Strategy: This innovative strategy optimally retains important weights, making it more efficient than existing pruning methods.
Our approach is not only robust but also easy to implement, requiring no retraining, and can be seamlessly integrated into models with linear layers. Through extensive evaluations on open-source LLMs like LLaMA, LLaMA-2, and OPT, we demonstrate that our method outperforms existing post-training pruning techniques, offering superior performance in terms of perplexity and inference efficiency for sparse LLMs.

Model compression aims to reduce the computational and storage demands of deep neural networks, with techniques like network pruning, quantization, and low-rank factorization. Early magnitude pruning removes weights based on their size but often needs fine-tuning, while structured pruning prunes entire filters or channels, which can cause performance loss.
N:M sparsity, as seen in SparseGPT and NVIDIA's Ampere-optimized 2:4 sparsity, imposes a regular pattern of non-zero weights (e.g., at most two non-zero weights in every block of four) to speed up inference on modern hardware. Wanda and RIA are post-training pruning (PTP) methods that guide unstructured pruning using activation statistics, but they experience rapid perplexity degradation as sparsity increases.
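To make the 2:4 pattern concrete, here is a minimal sketch (written for this post, not taken from any of the cited implementations) that keeps the two largest-magnitude weights in every group of four consecutive weights along each row:

```python
import torch

def apply_2_4_sparsity(W: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in each group of 4 along the last dimension."""
    out_features, in_features = W.shape
    groups = W.abs().reshape(out_features, in_features // 4, 4)
    # Zero out the 2 smallest-magnitude entries in every group of 4
    _, prune_idx = torch.topk(groups, k=2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, prune_idx, 0.0)
    return W * mask.reshape(out_features, in_features)

W = torch.randn(8, 16)
W_sparse = apply_2_4_sparsity(W)
assert (W_sparse.reshape(8, -1, 4) != 0).sum(-1).max() <= 2  # at most 2 non-zeros per group
```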
Low-rank approximation (e.g., SVD) reduces parameters and floating-point operations, and hybrid methods combine pruning with low-rank techniques for better compression.
Our approach introduces a new weight refinement metric and the ClearCut algorithm, which enforces sparsity based on refined importance scores. Unlike existing methods, it doesn’t require retraining and shows lower perplexity degradation across models like LLaMA-2 and OPT.
In many machine learning models, particularly those utilizing large datasets or complex architectures, the model can become excessively large and computationally expensive. This often leads to inefficiencies in terms of both processing time and memory usage. The primary challenge we address in this study is how to reduce the size of these models without significantly compromising their accuracy or predictive performance. Recently, various model compression techniques have been proposed, one of which is pruning. A specific variant of pruning, known as post-training pruning, has garnered attention due to its ability to reduce model size without the need for retraining, providing an efficient approach to model compression.
Large language models (LLMs) typically have a large number of parameters, making them computationally expensive during inference. Our goal with post-training pruning is to reduce the model size by pruning less important weights, without the need for retraining, while maintaining similar performance in terms of accuracy or task-specific metrics. Specifically, for each linear layer with weight matrix $W \in \mathbb{R}^{m \times n}$, we aim to create a pruned model $\hat{W} = M \odot W$, where $M \in \{0, 1\}^{m \times n}$ is a binary mask that zeroes out a target fraction $s$ of the weights (the sparsity ratio) while preserving the layer's behavior as closely as possible.
To decide which weights to keep, we assign each weight an importance score. For a weight $W_{ij}$ (row $i$, column $j$ of a linear layer's weight matrix), the weight refinement score is:

$$S_{ij} = \left( \frac{|W_{ij}|}{\sum_{k}|W_{ik}|} + \frac{|W_{ij}|}{\sum_{k}|W_{kj}|} + \alpha \, \frac{|W_{ij}|^{2}}{\left(\sum_{k}|W_{ik}|\right)\left(\sum_{k}|W_{kj}|\right)} \right) \cdot \lVert X_{j} \rVert^{a}$$

In this equation, the first term measures the relative importance of each weight in relation to the sum of all incoming weights to the $i$-th output neuron (the row sum). The second term evaluates the importance of the weight relative to all outgoing weights from the $j$-th input neuron (the column sum). The third term introduces a quadratic scaling factor to amplify the impact of larger weights, ensuring that more significant weights are prioritized during pruning; the scaling factor $\alpha$ controls the influence of this quadratic term. Finally, the factor $\lVert X_{j} \rVert^{a}$ scales the score by the norm of the calibration activations flowing through input channel $j$, raised to an exponent $a$, which is how the activation data mentioned earlier enters the metric.
This weight refinement approach improves the pruning process by addressing the limitations of traditional magnitude pruning. By incorporating both the local context (the importance of incoming and outgoing weights) and the global context (the overall network structure), our approach ensures that weights with smaller magnitudes but critical roles in the network are not mistakenly pruned. The quadratic scaling further helps to retain weights that are influential in determining the model's output, even if their magnitude is not the largest.
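A tiny hand-checkable example illustrates this. With the quadratic term and activation scaling switched off ($\alpha = 0$, no activation factor), a weight that is small in magnitude but is the only connection feeding its output neuron outranks a larger weight sitting in a well-connected row (this toy snippet is an illustration only, not part of the released code):

```python
import torch

# Toy 2x2 weight matrix: W[1, 0] = 0.5 is small in magnitude but is the
# only non-zero weight feeding the second output neuron.
W = torch.tensor([[1.0, 8.0],
                  [0.5, 0.0]])
abs_W = W.abs()
row_sum = abs_W.sum(dim=1, keepdim=True)   # incoming weights per output neuron
col_sum = abs_W.sum(dim=0, keepdim=True)   # outgoing weights per input neuron

score = abs_W / row_sum + abs_W / col_sum  # refinement score with alpha = 0
print(score)
# tensor([[0.7778, 1.8889],
#         [1.3333, 0.0000]])
#
# Pruning 50% of the weights by raw magnitude would drop 0.5 (|0.5| < |1.0|),
# disconnecting the second output; the refinement score ranks 0.5 (1.33)
# above 1.0 (0.78), so it is retained instead.
```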
We propose a new method called ClearCut. This algorithm efficiently applies a target sparsity ratio $s$ to the weight matrix $W \in \mathbb{R}^{m \times n}$, operating row by row on the importance scores $S$ defined above.

First, for each row we compute the number of weights to retain:

$$k = \lfloor (1 - s)\, n \rfloor$$

where $s$ is the target sparsity ratio and $n$ is the number of columns of $W$.

Next, for each row, we find the threshold value above which weights are retained. The threshold is determined by finding the $k$-th largest importance score in that row:

$$\tau_i = \operatorname{kthvalue}\!\left(S_{i,:},\; n - k + 1\right)$$

The threshold value is computed with PyTorch's `torch.kthvalue` function, which finds the $k$-th smallest element along a dimension, so the $k$-th largest score sits at position $n - k + 1$ (`torch.kthvalue` is 1-indexed). This value acts as the cutoff for pruning.

Following this, a binary mask is created for each row. The mask is set to 1 for weights whose importance scores are greater than or equal to the threshold, and 0 for those that should be pruned. The binary mask for each row is defined as:

$$M_{ij} = \begin{cases} 1 & \text{if } S_{ij} \ge \tau_i \\ 0 & \text{otherwise} \end{cases}$$

where $S_{ij}$ is the refinement score from the previous section and $\tau_i$ is the row threshold.

Finally, the pruned weight matrix is obtained by multiplying the original weight matrix element-wise with the mask:

$$\hat{W} = M \odot W$$
This operation selectively retains the important weights based on the calculated importance scores, effectively pruning the less significant weights and reducing the model size.
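The per-row procedure above maps directly onto a few tensor operations. The sketch below is a simplified rendering of those steps (the function name and exact interface are illustrative, not necessarily those used in the repository):

```python
import torch

def clearcut_prune_rowwise(W: torch.Tensor, scores: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the lowest-scoring weights in each row at the given sparsity ratio."""
    n = W.shape[1]
    k = int((1 - sparsity) * n)                 # number of weights to keep per row
    # k-th largest score per row == (n - k + 1)-th smallest (torch.kthvalue is 1-indexed)
    thresholds, _ = torch.kthvalue(scores, n - k + 1, dim=1, keepdim=True)
    mask = (scores >= thresholds).to(W.dtype)
    return W * mask

W = torch.randn(4, 8)
scores = W.abs()                                # stand-in for the refinement scores
W_pruned = clearcut_prune_rowwise(W, scores, sparsity=0.5)
print((W_pruned == 0).float().mean())           # ~0.5
```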

In this study, we test the proposed method on four widely used large language models (LLMs): LLaMA-2-7B, LLaMA-3.1-8B, OPT-6.7B, and OPT-2.7B. The models are loaded from the public checkpoints available through the HuggingFace Transformers library. For consistency, we apply uniform pruning across all linear layers of the models, excluding the embedding layers and the output head. Specifically, for the LLaMA models, each self-attention module contains four linear layers, while the MLP modules have three linear layers. In the OPT models, the MLP modules consist of two linear layers. To maintain a fair comparison, all evaluations are carried out using the same codebase across all models.
We utilize the WikiText2 dataset for evaluation, following the standard practices in evaluating language models. The perplexity metric is used to quantify model performance, with lower perplexity indicating better performance. For all experiments, we implement our pruning methods using PyTorch and perform evaluations on NVIDIA T4 GPUs to ensure consistent computational resources.
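For reference, WikiText2 perplexity can be computed with the standard sliding-window recipe sketched below; this is a generic implementation assuming the HuggingFace datasets library, not the repository's eval.py:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext2_perplexity(model, tokenizer, seqlen=2048, device="cuda"):
    """Sliding-window perplexity on the WikiText2 test split."""
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)
    nlls = []
    for i in range(ids.shape[1] // seqlen):
        window = ids[:, i * seqlen:(i + 1) * seqlen]
        loss = model(window, labels=window).loss   # mean negative log-likelihood per token
        nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b",
                                             torch_dtype=torch.float16).cuda()
print(wikitext2_perplexity(model, tokenizer))
```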
Perplexity results on WikiText2 were obtained by applying 50% unstructured sparsity to the model.
| Model | Dense | Magnitude | SparseGPT | Wanda | RIA | ClearCut (ours) |
|---|---|---|---|---|---|---|
| LLaMA-2-7B | 5.68 | 17.28 | 7.24 | 7.26 | 6.81 | 6.78 |
| LLaMA-3.1-8B | - | - | - | - | 9.43 | 9.39 |
| OPT-2.7B | 12.47 | 265 | 13.48 | 14.28 | 14.20 | 14.19 |
| OPT-6.7B | 10.86 | 969 | 11.55 | 11.94 | 11.84 | 11.44 |
As the table above shows, ClearCut performs consistently well across the evaluated LLMs. On LLaMA-2-7B, it reaches a perplexity of 6.78, ahead of RIA (6.81) and well below SparseGPT (7.24) and Wanda (7.26). On LLaMA-3.1-8B, it achieves 9.39 versus 9.43 for RIA. On OPT-6.7B, ClearCut (11.44) outperforms SparseGPT, Wanda, and RIA, while on OPT-2.7B it edges out Wanda and RIA, with SparseGPT remaining slightly ahead (13.48 vs. 14.19). Overall, ClearCut delivers the lowest or near-lowest perplexity among the 50%-sparsity methods while requiring no retraining.
Unstructured pruning, a technique commonly used for model compression, has a practical limitation: at roughly 50% sparsity, half of the weights are zero, yet the weight tensors are still stored densely, so the total parameter count and memory footprint are unchanged. Unless specialized sparse kernels are used, the zero-valued weights still participate in every matrix multiplication during inference, wasting compute on products that contribute nothing. Consequently, the computational cost and floating-point operations (FLOPs) remain unaffected, preventing any meaningful reduction in computational workload.
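A quick check makes the point concrete: masking half of the weights changes neither the number of stored parameters nor the work done by a dense matrix multiply (an illustration, not a benchmark):

```python
import torch

W = torch.randn(4096, 4096)
mask = (torch.rand_like(W) > 0.5).float()
W_sparse = W * mask                          # ~50% zeros, still stored densely

print(W.numel(), W_sparse.numel())           # same storage: 16777216 16777216
print((W_sparse == 0).float().mean())        # ~0.5 of the entries are zero

x = torch.randn(1, 4096)
# Both lines run the same dense GEMM: 4096 * 4096 multiply-accumulates per output row,
# regardless of how many of the weights happen to be zero.
y_dense, y_pruned = x @ W.T, x @ W_sparse.T
```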
*Figure: Low-rank decomposition applied after pruning the model.*
To overcome this limitation, we propose applying Low-Rank Decomposition to compress the weight matrix, thereby enhancing computational efficiency, as shown in the figure above. In this method, the original weight matrix $W \in \mathbb{R}^{m \times n}$ is approximated by the product of two low-rank factors:

$$W \approx A B, \qquad A \in \mathbb{R}^{m \times r}, \quad B \in \mathbb{R}^{r \times n},$$

where $r \ll \min(m, n)$ is the chosen rank. The factors are obtained from the truncated singular value decomposition $W \approx U_r \Sigma_r V_r^{\top}$ by splitting the singular values between them: $A = U_r \Sigma_r^{1/2}$ and $B = \Sigma_r^{1/2} V_r^{\top}$. This reduces the layer's parameter count from $mn$ to $r(m + n)$ and lowers its matrix-multiplication FLOPs proportionally.
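As a quick sanity check of this factorization (a standalone toy, not the repository code), the square-root split of the singular values reproduces the rank-$r$ approximation of $W$ while cutting the parameter count from $mn$ to $r(m + n)$:

```python
import torch

W = torch.randn(512, 512)
r = 64
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] @ torch.diag(S[:r].sqrt())     # (512, 64)
B = torch.diag(S[:r].sqrt()) @ Vh[:r, :]    # (64, 512)

print(A.numel() + B.numel(), "vs", W.numel())       # 65536 vs 262144 parameters
print(torch.linalg.matrix_norm(W - A @ B).item())   # Frobenius error of the rank-64 approximation
```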
| Reduction (%) | Base Pruned Perplexity | Low Rank Perplexity | Original Parameters | New Parameters | Rank |
|---|---|---|---|---|---|
| 10.05% | 18.04 | 23.97 | 402,849,792 | 362,348,544 | 921 |
| 20.01% | 18.04 | 27.26 | 402,849,792 | 322,240,512 | 819 |
| 40.02% | 18.04 | 49.96 | 402,849,792 | 241,631,232 | 614 |
| 60.03% | 18.04 | 323.95 | 402,849,792 | 161,021,952 | 409 |
| 80.04% | 18.04 | 2228.15 | 402,849,792 | 80,412,672 | 204 |
In our experiment, we used the OPT-1.3B model as a baseline for testing. We applied Low-Rank Decomposition to the self-attention layers, specifically targeting the key (K), query (Q), value (V), and output (O) matrices. Across the different ranks shown in the table above, we achieved parameter reductions ranging from 10.05% to 80.04%.
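The parameter counts in the table above can be reproduced from the model geometry. Assuming OPT-1.3B uses 24 decoder layers, each with four 2048×2048 attention projections plus biases (an assumption about the architecture, though it matches the table exactly), the arithmetic is:

```python
layers, d = 24, 2048                  # assumed OPT-1.3B geometry: 24 decoder layers, hidden size 2048
mats = 4 * layers                     # K, Q, V and O projections -> 96 weight matrices

original = mats * (d * d + d)         # dense weight plus bias per projection
print(f"{original:,}")                # 402,849,792 -- matches the 'Original Parameters' column

for rank in (921, 819, 614, 409, 204):
    factored = mats * (d * rank + rank * d + d)   # A (d x r) + B (r x d) + bias
    print(rank, f"{factored:,}", f"{1 - factored / original:.2%}")
# e.g. rank 921 -> 362,348,544 (10.05%), rank 204 -> 80,412,672 (80.04%)
```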
Our findings indicate that the optimal parameter reduction occurs between 20% and 40%, where the reduction in model size is significant and the degradation in perplexity is minimal, as shown in the figure below. For larger models, such as OPT-6.7B, we observed that even a parameter reduction of 50% resulted in only a moderate increase in perplexity, from 11.76 for the pruned baseline to 22.16, as shown in the table below. This demonstrates that substantial model compression can be achieved with relatively minor performance degradation in terms of perplexity.
| Reduction (%) | Base Pruned Perplexity | Low Rank Perplexity | Original Parameters | New Parameters | Rank |
|---|---|---|---|---|---|
| 40.03% | 11.76 | 17.59 | 2,148,007,936 | 1,288,175,616 | 1228 |
| 50.00% | 11.76 | 22.16 | 2,148,007,936 | 1,074,266,112 | 1024 |
| 60.00% | 11.76 | 37.23 | 2,148,007,936 | 859,308,032 | 819 |
This post addresses the challenge of deploying large language models (LLMs) efficiently in resource-constrained environments. We introduced a novel post-training pruning method combining Weight Refinement and the ClearCut pruning algorithm to reduce model size without retraining.
The Weight Refinement metric enhances pruning by assessing weight importance both locally (relative to row/column sums) and globally (activation-based). The ClearCut algorithm uses these refined scores to apply effective sparsity, ensuring critical weights are retained while maximizing parameter reduction.
Experiments on models like LLaMA-2-7B, LLaMA-3.1-8B, and OPT demonstrated that our method outperforms baselines like SparseGPT and Wanda, achieving significantly lower perplexity even at 50% sparsity. Additionally, combining our pruning technique with Low-Rank Decomposition further reduces parameter count and FLOPS.
The main advantage of our approach is its post-training, one-shot nature, eliminating the need for fine-tuning. Our method offers a practical solution for optimizing LLMs, making them more accessible and efficient for deployment.
The core scoring function of our method combines local weight importance with global activation statistics:
```python
import torch

def compute_clearcut_score(W, row_sum, col_sum, activation_norm, alpha=0.1, a=0.5, eps=1e-8):
    """Compute ClearCut importance scores for weight matrix W."""
    abs_W = torch.abs(W)
    # Local importance components
    col_term = abs_W / (col_sum + eps)
    row_term = abs_W / (row_sum + eps)
    # Global interaction term
    interaction = alpha * (abs_W ** 2) / (row_sum * col_sum + eps)
    # Activation-aware adjustment ('a' is passed in here instead of the repo's args.a flag)
    activation_weight = activation_norm ** a
    return (col_term + row_term + interaction) * activation_weight
```
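For the broadcasting in this function to work, `row_sum` and `col_sum` should keep their reduced dimensions and `activation_norm` should be a per-input-channel norm of the calibration activations. A plausible call looks like this (the shapes and exponent value are illustrative, not taken from the repository):

```python
import torch

# assumes compute_clearcut_score from the block above is in scope
torch.manual_seed(0)
W = torch.randn(64, 128)                  # (out_features, in_features) linear-layer weight
X = torch.randn(256, 128)                 # calibration activations: (n_tokens, in_features)

abs_W = W.abs()
row_sum = abs_W.sum(dim=1, keepdim=True)  # (64, 1): incoming weights per output neuron
col_sum = abs_W.sum(dim=0, keepdim=True)  # (1, 128): outgoing weights per input neuron
activation_norm = X.norm(p=2, dim=0)      # (128,): per-input-channel activation norm

scores = compute_clearcut_score(W, row_sum, col_sum, activation_norm, alpha=0.1, a=0.5)
print(scores.shape)                       # torch.Size([64, 128])
```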
SVD-based compression for attention layers:
```python
import torch

def apply_low_rank_decomposition(model,
                                 target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                                 rank_ratio=0.5):
    """Replace each targeted attention projection with a low-rank factorization W ≈ A @ B."""
    for name, module in model.named_modules():
        if not any(name.endswith(t) for t in target_modules):
            continue
        weight = module.weight.data
        # Perform SVD on the (pruned) weight matrix
        U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
        # Truncate to target rank
        min_dim = min(weight.shape)
        target_rank = max(1, int(min_dim * rank_ratio))
        U_r, S_r, Vh_r = U[:, :target_rank], S[:target_rank], Vh[:target_rank, :]
        # Create factorized matrices, splitting the singular values between the two factors
        A = U_r @ torch.diag(torch.sqrt(S_r))
        B = torch.diag(torch.sqrt(S_r)) @ Vh_r
        # Swap the dense layer for a LowRankLinear wrapper (defined elsewhere in the repo)
        parent_name, _, child_name = name.rpartition(".")
        setattr(model.get_submodule(parent_name), child_name, LowRankLinear(A, B))
    return model
```
```
├── main.py           # Main entry point for pruning experiments
├── prune.py          # Core pruning algorithms (ClearCut, RIA, magnitude, wanda, sparsegpt)
├── data.py           # Dataset loading utilities (C4, WikiText2, PTB)
├── eval.py           # Evaluation scripts for perplexity and zero-shot tasks
├── layerwrapper.py   # Layer wrapping utilities for activation collection
├── sparsegpt.py      # SparseGPT implementation
├── quant.py          # Quantization utilities
└── README.md         # This file
```
```bash
# Evaluate on WikiText2
python main.py --model meta-llama/Llama-2-7b-hf --eval_dataset wikitext2

# Evaluate on multiple datasets
python main.py --model facebook/opt-6.7b --eval_dataset c4
```
```bash
# Run zero-shot evaluation on common sense reasoning tasks
python main.py \
  --model meta-llama/Llama-2-7b-hf \
  --prune_method ClearCut \
  --sparsity_type 2:4 \
  --eval_zero_shot
```
Supported zero-shot tasks: BoolQ, RTE, HellaSwag, ARC-Challenge, MNLI
```bash
git clone https://github.com/cRED-f/ClearCut-Pruning-Efficient-Weight-Refinement-for-LLM-Compression.git
```
Dependencies include PyTorch and HuggingFace Transformers.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from main import main
import argparse

# Load model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure pruning arguments
args = argparse.Namespace(
    model=model_name,
    prune_method="ClearCut",     # Choose: magnitude, wanda, ria, ClearCut, sparsegpt
    sparsity_type="2:4",         # unstructured, 1:4, 2:4, 3:4, 4:8
    sparsity_ratio=0.5,          # For unstructured pruning
    calib_dataset="c4",          # c4, wikitext2, ptb
    nsamples=128,
    seqlen=2048,
    reallocation=True,           # Enable channel reallocation
    lsa=True,                    # Enable Linear Sum Assignment
    apply_low_rank=True,         # Apply low-rank decomposition
    rank_ratio=0.5,              # Rank ratio for SVD
    target_modules="q,k,v,o"     # Attention modules for low-rank
)

# Run pruning
main(args)
```
```bash
# ClearCut Pruning with 2:4 sparsity
python main.py \
  --model meta-llama/Llama-2-7b-hf \
  --prune_method ClearCut \
  --sparsity_type 2:4 \
  --calib_dataset c4 \
  --save

# Combined Pruning + Low-rank Decomposition
python main.py \
  --model meta-llama/Llama-2-7b-hf \
  --prune_method ClearCut \
  --sparsity_type 2:4 \
  --apply_low_rank \
  --rank_ratio 0.3 \
  --target_modules q,k,v,o \
  --eval_zero_shot
```
- `--prune_method`: Choose from magnitude, wanda, ria, ClearCut, sparsegpt
- `--sparsity_type`: unstructured, 1:4, 2:4, 3:4, 4:8
- `--sparsity_ratio`: Sparsity level for unstructured pruning (0.0-1.0)
- `--reallocation`: Enable heuristic channel reallocation
- `--lsa`: Enable Linear Sum Assignment optimization
- `--reconstruction`: Use SparseGPT-based weight reconstruction
- `--semi_sparse_acc`: Use PyTorch 2:4 semi-structured acceleration
- `--apply_low_rank`: Apply low-rank decomposition after pruning
- `--rank_ratio`: Fraction of singular values to keep (0.0-1.0)