Google has recently introduced Gemma-2b, a lightweight text-to-text Large Language Model (LLM) with 2 billion parameters. Gemma-2b is part of the Gemma family of LLMs, which are built using similar technology to Google’s Gemini models. Gemma-2b is currently available in English and is well-suited for a variety of text-generation tasks. In this article, we will see how well Gemma-2b performs in summarizing a dialogue between two people. After fine-tuning the model, we will evaluate its response using the ROUGE metrics.
To fine-tune the model, I have chosen the DialogSum dataset, which contains 13,460 dialogues with corresponding manually labeled summaries and topics. Exposing the model to diverse dialogue structures and content sharpens its ability to summarize conversations with precision and efficiency. With proper prompt engineering, we can generate customized summaries tailored to different contexts or topics, and we can focus on multiple aspects of the dialogue, such as sentiment analysis, key-point extraction, and speaker attribution.
Let’s start by installing the necessary libraries and understanding what they do.
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0
!pip install -q rouge_score
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GemmaTokenizer
import os
from google.colab import userdata
from peft import LoraConfig
from datasets import load_dataset
import pandas as pd
import transformers
from trl import SFTTrainer
from datasets import load_metric
import numpy as np
from rouge_score import rouge_scorer
Now, we will set up the environment to train the model. The HF_TOKEN is used to authenticate when downloading models or tokenizers from Hugging Face’s Model Hub.
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN_READ')
model_id = "google/gemma-2b" bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16 ) tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN']) text = """user How does the brain work? model""" device = "cuda:0" inputs = tokenizer(text, return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto") outputs = model.generate(**inputs, max_new_tokens=50) print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The model_id points to the model you’re loading from Hugging Face’s Model Hub (google/gemma-2b in this case). The tokenizer is loaded using the AutoTokenizer class from Hugging Face, which automatically selects the correct tokenizer for the model. The text input (text) is the prompt provided to the model, in this case: “How does the brain work?”.
The tokenizer converts this human-readable text into token IDs (a format the model can process). The tokenized input is returned as a tensor (PyTorch format) with return_tensors="pt", and it’s sent to the GPU (cuda:0) for computation. The AutoModelForCausalLM (Causal Language Model) is loaded from Hugging Face.
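To make this concrete, here is a minimal inspection sketch using the tokenizer and inputs defined above; the exact token IDs and tokens will depend on the Gemma tokenizer version.

# Inspect what the tokenizer produced: a batch of token IDs plus an attention mask
print(inputs["input_ids"].shape)   # e.g. torch.Size([1, N]) - one sequence of N token IDs
print(inputs["input_ids"][0][:10])  # the first few token IDs
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())[:10])  # the corresponding token strings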
Quantization is applied using the BitsAndBytesConfig defined earlier, which reduces the memory and computational requirements by running the model in 4-bit precision. device_map="auto" automatically distributes the model across available devices (such as multiple GPUs).
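As a quick sanity check that 4-bit loading actually pays off, you can ask transformers for the model’s in-memory size; this is a small sketch, and the exact number will vary with the transformers and bitsandbytes versions.

# Approximate memory used by the quantized model weights, in GB
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")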
The model generates text by continuing from the provided input (text), in this case, the response to “How does the brain work?” The generate method uses the tokenized input and generates up to 50 new tokens as output. The outputs returned by the model are token IDs, so they need to be decoded back into human-readable text. tokenizer.decode() converts the token IDs into readable text, and skip_special_tokens=True ensures that special tokens (such as end-of-sequence markers) are not included in the final output.
LoRA is a technique used to efficiently fine-tune large pre-trained models by adding low-rank matrices to certain parts of the model, which allows for parameter-efficient fine-tuning without the need to update all the model parameters. In large models, fine-tuning all parameters is computationally expensive and memory-intensive. LoRA significantly reduces this overhead by modifying only a small subset of the model’s layers with additional low-rank matrices.
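To see why the low-rank trick is cheap, here is a back-of-the-envelope sketch for a single 2048x2048 projection (hypothetical dimensions, chosen only for illustration): a rank-8 update needs roughly two orders of magnitude fewer parameters than the full weight matrix.

d, k, r = 2048, 2048, 8          # hypothetical layer dimensions and LoRA rank
full_params = d * k              # parameters in the original weight matrix W
lora_params = d * r + r * k      # parameters in the low-rank update B @ A
print(full_params, lora_params)  # 4,194,304 vs. 32,768 -> ~128x fewer trainable parameters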
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
The value r=8 means that the added matrices are of rank 8, which determines the size of the low-rank matrices. A higher rank generally gives more expressive power but increases the computational and memory cost. By specifying the modules, you are instructing the model to only fine-tune certain projection layers, while leaving the rest of the model’s parameters untouched, which is what makes LoRA efficient.
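If you want to verify how few weights LoRA actually trains here, peft can wrap the model and report the trainable-parameter count. This is purely an inspection sketch; the SFTTrainer below applies the same config internally via peft_config.

from peft import get_peft_model

# Wrap the base model with the LoRA config and report trainable vs. total parameters
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # e.g. "trainable params: ... || all params: ... || trainable%: ..."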
data = load_dataset("knkarthick/dialogsum")
def formatting_func(example):
    text = f"user\n Write the highlight of this dialogue in one sentence: {example['dialogue'][0]} {example['summary'][0]}"
    return [text]
The function formatting_func takes an example from the dataset and formats it into the text prompt used for supervised fine-tuning.
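It helps to preview one formatted training example before launching training; this is a small sketch using the dataset loaded above (formatting_func expects a batched example, so we slice with [:1]).

sample = data["train"][:1]         # a batch containing the first training example
print(formatting_func(sample)[0])  # the prompt the model will actually be trained on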
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=300,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()
The SFTTrainer is designed to fine-tune the model in a memory-efficient way using LoRA, 8-bit optimizers, and half-precision (fp16) training. Key features include small batch sizes, gradient accumulation, and warm-up steps to ensure smooth and efficient training. The formatting_func is applied to each example to format it as a prompt for the model. This setup allows fine-tuning the model with limited computational resources while still obtaining useful improvements for tasks like summarization or dialogue generation.
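Once training finishes, you will usually want to persist the LoRA adapters so the fine-tuned behavior can be reloaded later. A minimal sketch follows; the output directory name is just an example, not something from the original setup.

# Save only the small LoRA adapter weights (a few MB), not the full base model
trainer.model.save_pretrained("gemma-2b-dialogsum-lora")
tokenizer.save_pretrained("gemma-2b-dialogsum-lora")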
Now, let’s see how well the fine-tuned model performs.
text = """user\n Write the highlight of this dialogue in one sentence: #Person1#: Which of the two do you think is better? I mean, what's the difference between them? #Person2#: Well. . . this one costs more, but it has a much better sound. This part of it is made of wood, not plastic. And there's a tone control, too. #Person1#: I only want it for the kitchen. I like to listen to the news at breakfast time. #Person2#: Hmm. . . well, the other one is good for the money. It's much cheaper. We sell clot of them and all our customers are satisfied with them. #Person1#: Hmm. . . I'd like the cheaper one, please. Can I pay by cheque? #Person2#: Certainly. model: Here is the summary of this dialogue:""" device = "cuda:0" inputs = tokenizer(text, return_tensors="pt").to(device) true_summary = "The shop assistant helps #Person1# compare two products. #Person1# decides to buy the cheaper one by cheque." outputs = model.generate(**inputs, max_new_tokens=50) gemma_summary = tokenizer.decode(outputs[0], skip_special_tokens=True) print(gemma_summary) print('-' * 50) delimiter = "Here is the summary of this dialogue:" end_token = "" highlight = gemma_summary.split(delimiter)[1].split(end_token)[0].strip() #To get only the summary print(f'Generated Summary: {highlight}') print('-' * 50)
<start_of_turn>user Write the highlight of this dialogue in one sentence: #Person1#: Which of the two do you think is better? I mean, what's the difference between them? #Person2#: Well. . . this one costs more, but it has a much better sound. This part of it is made of wood, not plastic. And there's a tone control, too. #Person1#: I only want it for the kitchen. I like to listen to the news at breakfast time. #Person2#: Hmm. . . well, the other one is good for the money. It's much cheaper. We sell clot of them and all our customers are satisfied with them. #Person1#: Hmm. . . I'd like the cheaper one, please. Can I pay by cheque? #Person2#: Certainly. <end_of_turn> <start_of_turn>model: Here is the summary of this dialogue: #Person1# asks #Person2# about the difference between the two radio and which one is cheaper. #Person2# says that the one with the better sound is more expensive but it has a tone control and the other one is cheaper but -------------------------------------------------- Generated Summary: #Person1# asks #Person2# about the difference between the two radio and which one is cheaper. #Person2# says that the one with the better sound is more expensive but it has a tone control and the other one is cheaper but --------------------------------------------------
Since the generated summary can run quite long, we prompt the model to produce the summary in one sentence. We also need to define functions to calculate the ROUGE scores.
ROUGE scores are calculated between the generated summary (highlight) and the reference summary (true_summary). The precision, recall, and F1 scores for various ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, etc.) are printed out to assess the similarity between the generated and original summaries. To learn more about ROUGE metrics, check out this article.
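As a quick intuition check, here is a toy sketch of what ROUGE-1 measures: precision is the fraction of unigrams in the generated text that also appear in the reference, and recall is the fraction of reference unigrams covered by the generated text.

from rouge_score import rouge_scorer

toy_scorer = rouge_scorer.RougeScorer(['rouge1'])
# 5 of the 6 words overlap in both directions, so precision = recall = 5/6 ~= 0.83
print(toy_scorer.score("the cat sat on the mat", "the cat lay on the mat"))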
def calculate_rouge_scores(original_summary, generated_summary):
    rouge = load_metric("rouge")
    scores = rouge.compute(predictions=[generated_summary], references=[original_summary])
    return scores

rouge_scores = calculate_rouge_scores(highlight, true_summary)

rouge_scorer_ = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
rouge_scores = rouge_scorer_.score(highlight, true_summary)

for metric, scores in rouge_scores.items():
    print(f"{metric}:")
    print(f"Precision: {scores.precision}")
    print(f"Recall: {scores.recall}")
    print(f"F1 Score: {scores.fmeasure}")
    print()
Let’s test the fine-tuned model on 10 random entries of the ‘test’ dataset.
test_data = pd.read_csv('/content/test.csv')
test_data_random = test_data.sample(frac=1, random_state=42)
test_data_random = test_data_random.head(10)
test_data_random = test_data_random.reset_index(drop=True)
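If you do not have test.csv saved locally, an equivalent sample can be pulled from the dataset object loaded earlier; this is an assumed alternative, not what the original run used.

# Assumed alternative: sample the test split straight from the Hugging Face dataset
test_data = data["test"].to_pandas()
test_data_random = test_data.sample(frac=1, random_state=42).head(10).reset_index(drop=True)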
num_iterations = len(test_data_random)
avg_scores = {'rouge1': {'precision': 0, 'recall': 0, 'f1': 0},
              'rouge2': {'precision': 0, 'recall': 0, 'f1': 0},
              'rougeL': {'precision': 0, 'recall': 0, 'f1': 0},
              'rougeLsum': {'precision': 0, 'recall': 0, 'f1': 0}}

for idx, row in test_data_random.iterrows():
    dialogue = row['dialogue']
    true_summary = row['summary']

    text = f"""<start_of_turn>user\n Write the highlight of this dialogue in one sentence:{dialogue}<end_of_turn>\n<start_of_turn>model: Here is the summary of this dialogue:"""
    device = "cuda:0"
    inputs = tokenizer(text, return_tensors="pt").to(device)

    outputs = model.generate(**inputs, max_new_tokens=50)
    gemma_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f'True Summary: {true_summary}')
    print('-' * 50)

    delimiter = "Here is the summary of this dialogue:"
    end_token = "<end_of_turn>"
    highlight = gemma_summary.split(delimiter)[1].split(end_token)[0].strip()
    print(f'Generated Summary: {highlight}')
    print('-' * 50)

    rouge_scores = calculate_rouge_scores(highlight, true_summary)
    rouge_scorer_ = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
    rouge_scores = rouge_scorer_.score(highlight, true_summary)

    for metric, scores in rouge_scores.items():
        print(f"{metric}:")
        print(f"Precision: {scores.precision}")
        print(f"Recall: {scores.recall}")
        print(f"F1 Score: {scores.fmeasure}")
        print()
        avg_scores[metric]['precision'] += scores.precision
        avg_scores[metric]['recall'] += scores.recall
        avg_scores[metric]['f1'] += scores.fmeasure

for metric, scores in avg_scores.items():
    avg_scores[metric]['precision'] /= num_iterations
    avg_scores[metric]['recall'] /= num_iterations
    avg_scores[metric]['f1'] /= num_iterations
True Summary: #Person1# proposes to build maintenance procedures to reduce lost production during downtime. -------------------------------------------------- Generated Summary: #Person1# and #Person2# discuss about the ways to reduce the maintenance downtime. -------------------------------------------------- rouge1: Precision: 0.4166666666666667 Recall: 0.4166666666666667 F1 Score: 0.4166666666666667 rouge2: Precision: 0.09090909090909091 Recall: 0.09090909090909091 F1 Score: 0.09090909090909091 rougeL: Precision: 0.3333333333333333 Recall: 0.3333333333333333 F1 Score: 0.3333333333333333 rougeLsum: Precision: 0.3333333333333333 Recall: 0.3333333333333333 F1 Score: 0.3333333333333333 True Summary: Trina accepts Jared's proposal. Then, Jared is astonished to know that Trina already knew from Melissa who saw him buying the ring that he was planning this. Trina has chosen a date and has made a list of four hundred guests and she tells Jared about her arrangements in an ecstasy. Jared finds it hard to get through. -------------------------------------------------- Generated Summary: #Person2# tells #Person1# about their wedding plan. #Person1# is surprised at all these details and asks what else is there. #Person2# tells #Person1# that their uncle could be their florist and his wife -------------------------------------------------- rouge1: Precision: 0.1016949152542373 Recall: 0.18181818181818182 F1 Score: 0.13043478260869565 rouge2: Precision: 0.0 Recall: 0.0 F1 Score: 0.0 rougeL: Precision: 0.05084745762711865 Recall: 0.09090909090909091 F1 Score: 0.06521739130434782 rougeLsum: Precision: 0.05084745762711865 Recall: 0.09090909090909091 F1 Score: 0.06521739130434782 True Summary: Terry Chen in Room 117 calls the housekeeper for a clean-up of her room. -------------------------------------------------- Generated Summary: #Person1# is asking the house keeper to clean the room and #Person2# says yes to it. -------------------------------------------------- rouge1: Precision: 0.2 Recall: 0.1875 F1 Score: 0.19354838709677422 rouge2: Precision: 0.0 Recall: 0.0 F1 Score: 0.0 rougeL: Precision: 0.2 Recall: 0.1875 F1 Score: 0.19354838709677422 rougeLsum: Precision: 0.2 Recall: 0.1875 F1 Score: 0.19354838709677422 True Summary: #Person1# and #Person2# have been waiting for the bus for a long time. They agree they need to get a car. -------------------------------------------------- Generated Summary: #Person1# and #Person2# don't like public transportation and discuss getting a car. -------------------------------------------------- rouge1: Precision: 0.23809523809523808 Recall: 0.38461538461538464 F1 Score: 0.2941176470588235 rouge2: Precision: 0.15 Recall: 0.25 F1 Score: 0.18749999999999997 rougeL: Precision: 0.23809523809523808 Recall: 0.38461538461538464 F1 Score: 0.2941176470588235 rougeLsum: Precision: 0.23809523809523808 Recall: 0.38461538461538464 F1 Score: 0.2941176470588235 True Summary: #Person1# and #Person2# are discussing what to eat at a popular restaurant, and they decide to order until the waitress comes around. -------------------------------------------------- Generated Summary: #Person1# and #Person2# have a conversation about what to have at a restaurant. #Person1# doesn't have a reservation, but gets the last available table for two. 
#Person2# wants to have some wine or -------------------------------------------------- rouge1: Precision: 0.45454545454545453 Recall: 0.29411764705882354 F1 Score: 0.35714285714285715 rouge2: Precision: 0.19047619047619047 Recall: 0.12121212121212122 F1 Score: 0.14814814814814814 rougeL: Precision: 0.4090909090909091 Recall: 0.2647058823529412 F1 Score: 0.3214285714285714 rougeLsum: Precision: 0.4090909090909091 Recall: 0.2647058823529412 F1 Score: 0.3214285714285714 True Summary: #Person1# tells #Person2# about #Person1#'s vacation plan to Canada. -------------------------------------------------- Generated Summary: #Person1# describes his trip to Canada in detail. #Person2# thinks it's wonderful. -------------------------------------------------- rouge1: Precision: 0.5 Recall: 0.38461538461538464 F1 Score: 0.4347826086956522 rouge2: Precision: 0.1111111111111111 Recall: 0.08333333333333333 F1 Score: 0.09523809523809525 rougeL: Precision: 0.3 Recall: 0.23076923076923078 F1 Score: 0.2608695652173913 rougeLsum: Precision: 0.3 Recall: 0.23076923076923078 F1 Score: 0.2608695652173913 True Summary: #Person1# dislikes #Person2#'s idea of getting a tie for someone. #Person2# then shows #Person1# the tie and #Person1# starts to think it's cool. -------------------------------------------------- Generated Summary: #Person1# and #Person2# discuss what is the most boring, typical gift in the world and what is the highlight of this dialouge is. -------------------------------------------------- rouge1: Precision: 0.2 Recall: 0.21739130434782608 F1 Score: 0.20833333333333331 rouge2: Precision: 0.0 Recall: 0.0 F1 Score: 0.0 rougeL: Precision: 0.16 Recall: 0.17391304347826086 F1 Score: 0.16666666666666666 rougeLsum: Precision: 0.16 Recall: 0.17391304347826086 F1 Score: 0.16666666666666666 True Summary: #Person2# calls #Person1# to make an appointment for a checkup. -------------------------------------------------- Generated Summary: #Person1#: David Johnson wants to make an appointment. #Person2# describes that he has a bad cavity on the back of his head and hurts. #Person1# asks him whether he wants a checkup or a cleaning. #Person2 -------------------------------------------------- rouge1: Precision: 0.8 Recall: 0.2222222222222222 F1 Score: 0.3478260869565218 rouge2: Precision: 0.4444444444444444 Recall: 0.11428571428571428 F1 Score: 0.1818181818181818 rougeL: Precision: 0.7 Recall: 0.19444444444444445 F1 Score: 0.30434782608695654 rougeLsum: Precision: 0.7 Recall: 0.19444444444444445 F1 Score: 0.30434782608695654 True Summary: #Person1# offers a discount but #Person2# is not satisfied. After negotiation, they agree on a 10% discount. -------------------------------------------------- Generated Summary: #Person1# tries to negotiate the price with Person2#. Person2# finally agrees to reduce the price. -------------------------------------------------- rouge1: Precision: 0.11764705882352941 Recall: 0.13333333333333333 F1 Score: 0.125 rouge2: Precision: 0.0 Recall: 0.0 F1 Score: 0.0 rougeL: Precision: 0.11764705882352941 Recall: 0.13333333333333333 F1 Score: 0.125 rougeLsum: Precision: 0.11764705882352941 Recall: 0.13333333333333333 F1 Score: 0.125 True Summary: After three years of cooperation, #Person1# applies for the sole agency of David's company's product in the local market. #Person1# tells David about #Person1#'s company's advantages and the minimum annual sales they can guarantee and promises to follow the sole agency's principles. 
-------------------------------------------------- Generated Summary: #Person1# applies for the sole agency of their product in the country and #Person2# tells him the minimum annual sales he can guarantee. -------------------------------------------------- rouge1: Precision: 0.3829787234042553 Recall: 0.782608695652174 F1 Score: 0.5142857142857143 rouge2: Precision: 0.2608695652173913 Recall: 0.5454545454545454 F1 Score: 0.3529411764705882 rougeL: Precision: 0.3617021276595745 Recall: 0.7391304347826086 F1 Score: 0.4857142857142858 rougeLsum: Precision: 0.3617021276595745 Recall: 0.7391304347826086 F1 Score: 0.4857142857142858
As you can see, our fine-tuned model is capable of generating pretty good one-line summaries of dialogues. For ROUGE-1, a precision of around 0.34 and a recall of about 0.32 are moderate, with an F1 score of approximately 0.30 in the same range. For ROUGE-2, a precision of approximately 0.12 and a recall of around 0.12 are relatively low, resulting in an F1 score of about 0.11, which is on the lower side. For ROUGE-L and ROUGE-Lsum, the precision, recall, and F1 scores are similar to those of ROUGE-1 but slightly lower.
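The aggregate numbers quoted above come from the avg_scores dictionary accumulated in the loop; a minimal sketch for printing them (the exact values will differ from run to run):

# Print the averaged ROUGE scores over the 10 test dialogues
for metric, scores in avg_scores.items():
    print(f"{metric}: precision={scores['precision']:.2f}, "
          f"recall={scores['recall']:.2f}, f1={scores['f1']:.2f}")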
Gemma’s underlying architecture allows it to be trained on new datasets specific to a particular task. Through further fine-tuning, these metrics can be improved.
Gemma 2B can successfully generate meaningful dialogue summaries close to the original summaries. In terms of architecture, Gemma relies on transformer self-attention (the 2B model uses multi-query attention), which allows it to focus on the parts of the input sequence most relevant to the current task. Gemma also applies rotary positional embeddings in each layer, which helps the model understand the order of words within a sequence. During training, Gemma processes sequences of up to 8,192 tokens, giving it a decent amount of context to capture the intricacies of language. For these reasons, Gemma 2B can create meaningful summaries of dialogues.
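Several of these architectural details can be read directly off the loaded model’s configuration; this is a small inspection sketch (attribute names follow the transformers GemmaConfig, and the values are whatever the checkpoint reports).

cfg = model.config
print(cfg.num_hidden_layers, cfg.num_attention_heads, cfg.num_key_value_heads)  # depth and attention layout
print(cfg.max_position_embeddings)  # training context length (8192 for Gemma)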
While Gemma’s accessibility is a significant advantage, it comes with inherent trade-offs. Compared to its larger, cloud-based counterparts, Gemma’s performance is inevitably limited by the processing power of a single GPU.
Gemma models like 2B and 7B are designed to address specific computational limitations. While they offer advantages in terms of efficiency and accessibility, they may not match the power of larger AI models like OpenAI’s GPT-4 or Google’s Gemini Ultra and Pro models.
Gemma may excel only in certain use cases, as it is optimized for specific applications rather than being a one-size-fits-all solution.
Gemma uses a decoder-only transformer architecture, limiting it to text-to-text tasks. This means it may not be suitable for tasks involving images or videos, which require the kind of multimodal, encoder-decoder-style processing offered by Google Gemini.
Gemma models, while not perfect at dialogue summarization, offer promising potential. With further fine-tuning and optimization, they can be tailored to excel at this task. Their architecture, featuring efficient attention and rotary positional embeddings, provides a strong foundation for creating meaningful summaries.