Before the advent of multimodal approaches in text generation, most AI models were limited to a single mode—typically text—as the primary input. These early models could process and generate human-like text but could not incorporate contextual understanding from non-textual sources like images, audio, or video. Text generation was thus largely confined to information that could be provided through text alone, without the richness that visual or other sensory cues bring to real-world understanding.
The transition to multimodal systems marked a turning point, as models began to integrate multiple sources of input, such as vision and text, allowing them to perform more sophisticated tasks that involve understanding and combining different types of data.
FusionBot represents a novel advancement in chat-based AI systems, integrating image and text processing into a single, seamless framework to enhance user interaction and provide comprehensive, multimodal responses.
In recent years, conversational AI systems have advanced significantly, becoming essential tools for customer support, personal assistance, education, and beyond. However, most of these systems are limited to interpreting and generating responses based on textual data alone, which restricts their potential for interactive and comprehensive user engagement. As the digital landscape evolves, there is a growing demand for systems that can interpret multiple modes of input—especially text and images—within a single framework to deliver more contextual, intelligent, and accessible responses.
FusionBot is designed to address this need by seamlessly integrating image and text processing capabilities within a chat-based interface, providing users with a unified system capable of multimodal interactions. By enabling users to share images and text interchangeably, FusionBot enhances its understanding of complex queries, allowing for deeper insights, more accurate responses, and an overall enriched user experience. For example, a user could submit a picture alongside a text query, and FusionBot would analyze both inputs collectively, returning a response that considers the combined context.
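As a concrete illustration of this interaction pattern, the sketch below shows one way a combined image-and-text query could be represented. The `MultimodalQuery` class and `answer` function are hypothetical placeholders rather than FusionBot's actual API; the concrete vision-language backend (BLIP-2) is introduced in the following sections.

```python
# Hypothetical sketch of a combined image-and-text query (illustrative only;
# these names are placeholders, not FusionBot's actual API).
from dataclasses import dataclass
from typing import Optional

from PIL import Image


@dataclass
class MultimodalQuery:
    """A single user turn: free-form text plus an optional image."""
    text: str
    image: Optional[Image.Image] = None


def answer(query: MultimodalQuery) -> str:
    # Placeholder: a real implementation would pass both modalities to a
    # vision-language model such as BLIP-2 (shown in the sections below).
    if query.image is None:
        return f"(text-only answer to: {query.text})"
    return f"(answer grounded in both the image and: {query.text})"


# Example usage: a picture and a text question are analysed together.
# query = MultimodalQuery(text="What landmark is shown here?",
#                         image=Image.open("photo.jpg"))
# print(answer(query))
```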
In this paper, we explore the design and implementation of FusionBot, its core algorithms, and the multimodal training methods that enable its robust performance.
BLIP-2 (Bootstrapping Language-Image Pre-training 2) is a scalable multimodal pre-training method that enables Large Language Models (LLMs) to ingest and understand images, unlocking zero-shot image-to-text generation and powering the world's first open-sourced multimodal chatbot prototype. One of BLIP-2's primary advancements lies in its bootstrapping approach: it uses lightweight pre-training on paired image-text data before fine-tuning on specific tasks.
Fig 1: BLIP-2 architecture
First, the image is passed to the image encoder to extract visual features, and the outputs are then passed to the language model to make sense of them. However, there is a challenge: since the frozen language model was not trained on image data, it cannot properly interpret the extracted visual representations without further help. To solve this problem, the Q-Former uses a set of learnable querying vectors and is pre-trained in two stages: (1) vision-language representation learning with a frozen image encoder, and (2) vision-to-language generative learning with a frozen large language model.
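To make this pipeline concrete, the short sketch below loads the same checkpoint used later in this paper and inspects its three components together with the learnable query vectors. It assumes the Hugging Face transformers implementation of BLIP-2, so attribute names and shapes may vary across library versions.

```python
# Minimal inspection sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint
# and the Hugging Face `transformers` implementation of BLIP-2.
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)

print(type(model.vision_model).__name__)    # frozen ViT image encoder
print(type(model.qformer).__name__)         # Q-Former bridging module
print(type(model.language_model).__name__)  # frozen OPT language model

# The learnable querying vectors the Q-Former feeds to the language model:
# shape (1, num_query_tokens, qformer_hidden_size)
print(model.query_tokens.shape)
```

Because only the Q-Former and these query vectors are trained while the image encoder and the language model stay frozen, BLIP-2's pre-training remains lightweight compared to end-to-end vision-language training.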
### 1. Image Captioning
```python
# Required libraries
from urllib.request import urlopen

import numpy as np
import torch
import ipywidgets as widgets
from IPython.display import HTML, display
from PIL import Image
from sklearn.preprocessing import MinMaxScaler
from transformers import AutoProcessor, Blip2ForConditionalGeneration

# Load processor and main model
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)

# Send the model to the GPU (if available) to speed up inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```

```python
# Load an image of the Prime Minister of India
img_path = "https://upload.wikimedia.org/wikipedia/commons/2/2e/Prime_Minister%2C_Shri_Narendra_Modi%2C_in_New_Delhi_on_August_08%2C_2019_%28cropped%29.jpg"
image = Image.open(urlopen(img_path)).convert("RGB")

# Preprocess the image
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
inputs["pixel_values"].shape

# Convert to numpy and go from (1, 3, 224, 224) to (224, 224, 3) in shape
image_inputs = inputs["pixel_values"][0].detach().cpu().numpy()
image_inputs = np.einsum("ijk->kji", image_inputs)
image_inputs = np.einsum("ijk->jik", image_inputs)

# Scale image inputs to 0-255 to represent RGB values
scaler = MinMaxScaler(feature_range=(0, 255))
image_inputs = scaler.fit_transform(
    image_inputs.reshape(-1, image_inputs.shape[-1])
).reshape(image_inputs.shape)
image_inputs = np.array(image_inputs, dtype=np.uint8)

# Convert the numpy array back to an Image to inspect the preprocessed input
Image.fromarray(image_inputs)
```
```python
# Generate caption
inputs = blip_processor(image_inputs, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text
```
Visual Question Answering (VQA) is a field of artificial intelligence that combines computer vision and natural language processing to enable systems to answer questions based on image content. In a VQA task, a model receives an image along with a natural language question about that image. The system processes both the visual and textual inputs, analyzes the image, and generates an accurate, contextually relevant answer.
```python
# Visual Question Answering
prompt = "Question: what is the full name of the person seen in this picture. Answer:"

# Process both the image and the prompt
inputs = blip_processor(image_inputs, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate text
generated_ids = model.generate(**inputs, max_new_tokens=30)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text
```
With multimodal chat-based prompting, a user might submit a question along with an image, and the AI can interpret both inputs simultaneously, producing more nuanced and context-aware responses.
```python
# Chat-like prompting: the previous Q&A pair is prepended to the new question
prompt = (
    "Question: what is the full name of the person seen in this picture. "
    "Answer: narendra modi "
    "Question: When did he become prime minister of India? Answer:"
)

# Generate output
inputs = blip_processor(image_inputs, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text
```
Note: along with the second question, the first question and its answer are also passed in the prompt, so the model can draw on the conversational context.
```python
def text_eventhandler(*args):
    question = args[0]["new"]
    if question:
        args[0]["owner"].value = ""

        # Create prompt from the conversation history plus the new question
        if not memory:
            prompt = " Question: " + question + " Answer:"
        else:
            template = "Question: {} Answer: {}."
            prompt = (
                " ".join(
                    [
                        template.format(memory[i][0], memory[i][1])
                        for i in range(len(memory))
                    ]
                )
                + " Question: "
                + question
                + " Answer:"
            )
        print(prompt)

        # Generate text
        inputs = blip_processor(image_inputs, text=prompt, return_tensors="pt")
        inputs = inputs.to(device, torch.float16)
        generated_ids = model.generate(**inputs, max_new_tokens=100)
        generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
        generated_text = generated_text[0].strip().split("Question")[0]

        # Update memory
        memory.append((question, generated_text))

        # Assign to output
        output.append_display_data(HTML("<b>USER:</b> " + question))
        output.append_display_data(HTML("<b>BLIP-2:</b> " + generated_text))
        output.append_display_data(HTML("<br>"))


# Prepare widgets
in_text = widgets.Text()
in_text.continuous_update = False
in_text.observe(text_eventhandler, "value")
output = widgets.Output()
memory = []

# Display chat box
display(
    widgets.VBox(
        children=[output, in_text],
        layout=widgets.Layout(display="inline-flex", flex_flow="column-reverse"),
    )
)
```
USER: what is the name of the person in the picture.
BLIP-2: narendra modi
USER: When did he become PM.
BLIP-2: in 2014.
FusionBot represents a significant advancement in the evolution of chat-based AI systems, successfully integrating image and text processing into a cohesive framework that enhances user interaction. By enabling a seamless exchange between visual and linguistic data, FusionBot allows users to engage in more natural and contextually rich conversations, leveraging the strengths of both modalities. The architecture of FusionBot not only improves the accuracy and relevance of responses but also broadens the scope of applications across various domains, including education, e-commerce, and technical support.