The applications of Generative AI are flourishing across today's technology infrastructure and environment. Understanding real-world business use cases and automating their solutions through intelligent, data-driven systems is a crucial challenge for today's industries. Generative AI enables remarkable progress in a system's ability to generate content for real-world use cases and to act with a degree of autonomy, reasoning over the generated solutions in a human-like way to support effective decision-making. These systems are powered by LLMs and LVMs, massive foundation models trained on huge datasets from which they learn rich feature representations and become capable of making informed decisions.
In this project, I have developed a multitasking dashboard powered by multiple LLMs that spans two domains of unstructured data: text, addressed through Natural Language Processing (NLP), and graphical data such as images, audio, and video, addressed through Computer Vision. I have integrated several frameworks to build the dashboard, including LangChain, Hugging Face, and Groq, along with external API integrations such as the Mistral API, Ollama, and the Groq API, which allow their models to be adapted to my own tasks within a structured design template. The entire dashboard is served through Streamlit, a Python-based interactive dashboarding library that supports extensive element and layout customization for effective visualization of the resulting application. Through this system I address NLP use cases such as text classification, sentiment analysis, language translation, and RAG, as well as Computer Vision problems such as facial landmark recognition and image inpainting. The system proves effective at solving problems in their respective domains and can handle multiple problems on a single platform, enabling intra-platform interconnection and domain modularity across the tasks for which it was designed.
The project is organized by the domain of its use cases, integrating several from NLP and several from Computer Vision. The NLP use cases include text classification, text summarization, question answering, chatbots, language translation, text generation, and document-based search indexing and retrieval using RAG. The Computer Vision use cases are facial detection with landmark keypoint estimation, and automatic image captioning with image inpainting. The use cases are described below.
Definition: Text summarization is the process of reducing a text document to its most important and relevant information. It can be:
Extractive Summarization: Selects sentences or phrases directly from the source text.
Abstractive Summarization: Generates new sentences, often rephrasing or using synonyms to convey the same meaning concisely.
Definition: Text classification involves categorizing text into predefined classes or labels. For example, classifying emails into spam or non-spam or assigning sentiments to product reviews.
Definition: QA systems provide direct answers to questions posed by users, leveraging a source document or a database. QA tasks are typically divided into:
Closed-domain QA: Focuses on specific topics with limited context.
Open-domain QA: Handles a broad range of topics, often requiring external knowledge sources.
Definition: Text generation involves producing coherent and contextually relevant text, often resembling human writing.
Definition: Chatbots are conversational agents that interact with users in natural language. They can be rule-based or AI-driven.
Definition: Language translation involves converting text or speech from one language to another while preserving the original meaning.
Definition: RAG is a hybrid approach that combines information retrieval and text generation. It retrieves relevant context or facts from a database (or external source) and generates text based on the retrieved information.
Definition: This task involves two components:
Image Captioning: Generating descriptive textual captions for an image, capturing objects, activities, and context.
Image Inpainting: Filling in missing or corrupted parts of an image, restoring it to a complete and visually coherent form.
Definition: This task combines:
Facial Detection: Identifying and localizing faces within an image or video.
Landmark Keypoint Estimation: Detecting specific points on a face (e.g., eyes, nose, mouth, and jawline) to understand its geometry and orientation.
In this project, I have used several LLMs, frameworks such as LangChain, Hugging Face, and Groq, and APIs such as Mistral, Ollama, and Hugging Face. Alongside these, I have used Streamlit, a Python-based interactive dashboarding package, to develop the application UI. The function of each tool, platform, and API used in the project is explained below.
Each tool is listed below with a description of its functionality:
Meta’s LLaMA3.1 and LLaMA3-70B-8192 - Meta's LLaMA (Large Language Model Meta AI) models represent the latest advancements in large-scale transformer-based language models. LLaMA3.1 improves upon its predecessors by incorporating enhanced optimization techniques, such as sparse attention mechanisms and advanced positional encoding for long sequences, making it highly effective for processing context-rich data. The LLaMA3-70B-8192 variant has a massive 70 billion parameters and supports input sequences of up to 8192 tokens, catering to complex tasks such as long-form document summarization and multi-turn dialogue modeling. Applications include chatbots, knowledge retrieval, and extensive text analysis tasks, with high efficiency and scalability for research and industry.
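As a rough illustration, the LLaMA3-70B-8192 variant can be reached through LangChain's Groq integration, as in the minimal sketch below (the GROQ_API_KEY environment variable and the example messages are assumptions for illustration):

import os
from langchain_groq.chat_models import ChatGroq
from langchain.schema import SystemMessage, HumanMessage

# Assumes GROQ_API_KEY has been exported in the environment.
llm = ChatGroq(api_key=os.environ["GROQ_API_KEY"], model_name="llama3-70b-8192",
               temperature=0.8, max_tokens=1024)
reply = llm.invoke([
    SystemMessage(content="You are a concise technical assistant."),
    HumanMessage(content="Explain why long context windows help document summarization."),
])
print(reply.content)  # the model's generated answer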
Facebook/BART-large-CNN - BART-large-CNN is a transformer-based sequence-to-sequence model pre-trained using a denoising autoencoder objective. The model comprises an encoder-decoder architecture with 24 transformer layers, enabling robust understanding and generation of text. Optimized for summarization tasks, the CNN-specific fine-tuning enhances its ability to process news articles and other structured data. Its applications include abstractive summarization, paraphrasing, and data-driven text generation, making it a reliable tool for summarizing complex documents while preserving meaning and coherence.
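A minimal usage sketch with the Hugging Face pipeline API (the article text is a placeholder):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = ("The city council approved a new public transport plan on Monday. "
           "The plan adds three bus routes and extends metro service hours, "
           "aiming to cut downtown congestion by 2026.")
summary = summarizer(article)
print(summary[0]["summary_text"])  # abstractive summary of the article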
facebook/mbart-large-50-many-to-many-mmt - The mBART-large-50 is a multilingual sequence-to-sequence model trained for text-to-text tasks across 50 languages. It features a shared transformer-based encoder-decoder structure, leveraging token embeddings for multilingual contexts. The "many-to-many-mmt" fine-tuning specializes the model for translation tasks, handling various language pairs with high fidelity. Applications span multilingual machine translation, cross-lingual summarization, and content localization, with a focus on enabling seamless communication across diverse linguistic groups.
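A minimal sketch of many-to-many translation with this model; the language codes follow the mBART-50 convention (e.g. en_XX, fr_XX), and the sentence is a placeholder:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "en_XX"                                  # source language: English
encoded = tokenizer("The weather is lovely today.", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])  # target: French
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])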
google-bert/bert-large-uncased-whole-word-masking-finetuned-squad - This is a variant of Google’s BERT (Bidirectional Encoder Representations from Transformers) fine-tuned on the Stanford Question Answering Dataset (SQuAD). The model uses a transformer architecture with 24 layers and 340 million parameters, applying bidirectional attention for deep contextual understanding. The whole-word masking pre-training strategy allows it to understand multi-word expressions better. Its fine-tuning for SQuAD makes it adept at answering questions from paragraphs, with applications in customer support, search engines, and knowledge-based systems.
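A minimal extractive question-answering sketch with this checkpoint (the context and question are illustrative):

from transformers import pipeline

qa = pipeline("question-answering",
              model="google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa({
    "question": "Which library serves the dashboard?",
    "context": "The dashboard is served through Streamlit, a Python-based dashboarding library.",
})
print(result["answer"], result["score"])  # extracted answer span and its confidence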
lxyuan/distilbert-base-multilingual-cased-sentiments-student - This lightweight multilingual model is a distilled version of BERT, with reduced parameters for faster inference and lower computational overhead. It has been fine-tuned specifically for sentiment analysis across multiple languages. The architecture maintains key transformer layers to balance performance and efficiency, enabling it to classify sentiments effectively. Applications include social media analysis, multilingual customer feedback evaluation, and monitoring brand sentiment across diverse languages and regions.
facebook/bart-large-mnli - BART-large-MNLI is a fine-tuned variant of Facebook’s BART model optimized for natural language inference (NLI) tasks. It features the same 24-layer transformer encoder-decoder structure as the original BART model but fine-tuned on the Multi-Genre NLI (MNLI) dataset. This allows it to assess relationships between text pairs, such as entailment, contradiction, or neutrality. Applications include fact-checking, contextual similarity analysis, and content moderation by understanding nuanced relationships between textual inputs.
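A minimal zero-shot classification sketch with this model via the Hugging Face pipeline (the candidate labels mirror the ones offered later in the dashboard; the sentence is a placeholder):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier("I spent the weekend hiking and trying street food in Bangkok.",
                    candidate_labels=["travelling", "sports", "food", "education"])
print(result["labels"][0], result["scores"][0])  # top label and its score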
-- Retrieval Chains - Retrieval chains combine a retrieval mechanism (e.g., vector search) with LLM reasoning. They extract relevant documents or context from a database or corpus and pass it to the model for generating accurate, context-aware responses.
Memory - LangChain memory components preserve conversational state across turns:
Conversation Buffers: Maintain the full history of interactions.
Summary Memory: Condenses past interactions into summaries for better scalability in long conversations.
Knowledge Base Memory: Stores extracted facts for reference.
Structured Parsers - Structured parsers help extract structured data (like JSON or tables) from raw text outputs of LLMs. This is crucial for integrating LLMs into applications requiring structured output for downstream processes like data analytics or automation.
Document Loaders - Document loaders handle the ingestion of external data sources like PDFs, Word documents, HTML, or databases. They provide a standardized way to load data into the LangChain pipeline for tasks like summarization, information retrieval, and analysis.
-- Web-based Loaders - Web loaders fetch content from URLs, APIs, or online repositories for real-time data ingestion. This is useful for building applications that process web content, like summarizing news articles or analyzing forums.
Recursive Character Text Splitter - This utility splits large text documents into manageable chunks based on predefined token or character limits. It ensures logical divisions, like splitting at sentence or paragraph boundaries, optimizing LLM performance in processing large documents.
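A minimal chunking sketch (the chunk sizes mirror the values used later in the RAG section, and long_text is a placeholder):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
long_text = "..."  # placeholder: the full text of a large document
chunks = splitter.split_text(long_text)  # overlapping chunks of roughly 1000 characters
# For LangChain Document objects, use splitter.split_documents(docs) instead.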
Chat Prompt Templates - Chat prompt templates allow developers to define reusable and parameterized prompts for conversational tasks. They enable consistent and dynamic input formatting, improving the usability and flexibility of chat-based applications.
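A small sketch of a parameterized chat prompt (the placeholder names and values are illustrative):

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "You will act as {system_message} and answer the user's question: {human_message}"
)
messages = prompt.format_messages(system_message="a friendly travel guide",
                                  human_message="What should I see in Kolkata in two days?")
# `messages` can be passed directly to a chat model's invoke() method.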
LangChain VectorStores - VectorStores provide a way to store, index, and query embeddings (numerical representations of text). They are used in retrieval-augmented generation, where relevant documents are retrieved based on semantic similarity for enhanced LLM responses.
LangChain Embeddings - Embeddings are numerical representations of text that capture semantic meaning. LangChain integrates various embedding models (e.g., OpenAI, Hugging Face) for tasks like similarity searches, clustering, and text classification.
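A minimal sketch combining an embedding model with a FAISS vector store for similarity search (the sample texts are placeholders; HuggingFaceBgeEmbeddings downloads a default BGE model on first use):

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceBgeEmbeddings()
store = FAISS.from_texts(
    ["LangChain integrates FAISS for vector search.",
     "Groq provides high-throughput LLM inference."],
    embeddings,
)
hits = store.similarity_search("Which vector store does LangChain support?", k=1)
print(hits[0].page_content)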
System, Human, and AI Messages - LangChain formalizes interaction structures:
-- System Messages: Define initial context or rules for the LLM’s behavior.
-- Human Messages: Represent user inputs in conversational applications.
-- AI Messages: Capture LLM-generated responses, maintaining interaction consistency.
-- Hugging Face: Provides access to a wide range of pre-trained transformers for tasks like summarization and translation.
-- Mistral: Integrates with Mistral’s LLMs, known for efficiency and high performance.
-- Ollama: Focuses on fine-tuned LLMs for specific domains or tasks.
-- Groq: Utilizes specialized AI accelerators for high-throughput processing.
-- ChatOllama: Optimized for dialogue and knowledge-based conversations using Ollama’s domain-specific expertise.
-- ChatHuggingface: Leverages Hugging Face transformers for versatile NLP tasks.
-- ChatMistral: Offers efficient and concise conversational capabilities.
-- ChatGroq: Uses Groq's specialized hardware for real-time and large-scale chat applications.
-- NamedTemporaryFile: Creates a temporary file with a name that can be accessed in the filesystem. The file is automatically deleted when it is closed unless specified otherwise with delete=False.
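A short sketch of how an uploaded file can be persisted to disk before being handed to OpenCV or MediaPipe (uploaded_file stands in for the object returned by st.file_uploader):

from tempfile import NamedTemporaryFile

# uploaded_file is assumed to be the object returned by st.file_uploader(...)
with NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
    temp_file.write(uploaded_file.read())
    temp_path = temp_file.name  # the file survives the `with` block because delete=False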
The pipeline module is highly versatile and supports different frameworks, including PyTorch and TensorFlow, ensuring compatibility with a broad range of environments. For example, a sentiment analysis pipeline could take user reviews as input and return the sentiment ("positive" or "negative") with confidence scores. Similarly, a summarization pipeline can condense lengthy documents into concise summaries, aiding in quick content digestion.
-- SAM model registry - The model registry is a centralized repository of pre-trained SAM models. It allows users to access different SAM model variants optimized for specific tasks or datasets. This component ensures flexibility and scalability, enabling developers to select a model variant that balances segmentation quality and computational efficiency.
-- SAM Predictor - The predictor module performs segmentation based on user-provided prompts, such as points, bounding boxes, or text. It leverages SAM's underlying architecture to refine segmentation boundaries dynamically, ensuring high accuracy. This functionality is particularly useful in interactive settings where users guide the segmentation process iteratively, such as medical imaging or annotation tasks
-- SAM Automatic Mask Generator - This module automatically generates masks for all objects in an image without requiring user prompts. It uses SAM's learned features to identify and segment objects of interest efficiently, making it ideal for batch processing or scenarios where precise user input is unavailable. This component facilitates applications in image editing, object tracking, and autonomous systems where real-time mask generation is critical.
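A minimal sketch of automatic mask generation with SAM; the checkpoint path and input image name are assumptions, and the ViT-H checkpoint must be downloaded separately:

import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cpu")  # assumed local checkpoint
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # assumed input image
masks = mask_generator.generate(image)  # list of dicts with 'segmentation', 'area', 'bbox', ...
largest = max(masks, key=lambda m: m["area"])["segmentation"]  # boolean mask of the largest object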
-- Stable Diffusion Inpaint Pipeline - The Stable Diffusion Inpaint Pipeline is a specialized variant of the stable diffusion model, designed for image inpainting tasks. Its architecture builds upon the core stable diffusion framework, incorporating components like a UNet model, variational autoencoder (VAE), and text encoder for guided generation. The roles of these components are outlined below, followed by a short usage sketch.
The UNet model progressively denoises the latent representation of the image, guided by textual prompts or other conditioning information.
The VAE encodes the input image into a latent space and decodes the generated latent representations back into the image space.
A text encoder (e.g., from CLIP or OpenAI's transformers) allows the model to align its inpainting results with textual descriptions, enabling guided completion of missing or masked regions.
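A minimal inpainting sketch under these assumptions: the runwayml/stable-diffusion-inpainting checkpoint is used (as in the dashboard code later), and scene.png / mask.png are placeholder files where white mask pixels mark the region to regenerate:

from diffusers.pipelines import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")

init_image = Image.open("scene.png").convert("RGB").resize((512, 512))  # image to edit
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))   # white = area to inpaint
result = pipe(prompt="a red vintage car parked on the street",
              image=init_image, mask_image=mask_image)
result.images[0].save("inpainted.png")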
The OpenCV (cv2) module is a more comprehensive computer vision library, offering extensive support for both image and video processing. It focuses on low-level image manipulations and advanced tasks like object detection, facial recognition, and real-time video analysis. OpenCV supports a variety of image formats and provides powerful tools for edge detection, morphological transformations, feature extraction, and contour analysis. It integrates well with hardware acceleration (via CUDA), making it highly efficient for real-time applications. OpenCV also excels in integrating machine learning and deep learning models for image-based tasks like segmentation, classification, and tracking.
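A small sketch of the kind of low-level processing OpenCV supports (the file names are placeholders):

import cv2

img = cv2.imread("photo.jpg")                       # OpenCV loads images in BGR order
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                   # edge detection
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cv2.drawContours(img, contours, -1, (0, 255, 0), 2) # outline detected contours in green
cv2.imwrite("contours.png", img)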
-- Tasks module -- This module provides a high-level API for performing end-to-end machine learning tasks, including model inference and result interpretation. It abstracts much of the underlying complexity, enabling developers to focus on task-specific requirements. For example, pre-built tasks like object detection, face detection, and gesture recognition can be readily deployed using this module.
-- Vision Module - The vision module is a cornerstone of MediaPipe, offering pipelines for computer vision tasks. It includes models and utilities for face detection, hand tracking, pose estimation, and object tracking. These pipelines are optimized for real-time processing, even on mobile devices, using frameworks like TensorFlow Lite. For instance, the hand tracking solution detects and tracks hands in video frames, providing real-time landmarks for applications like virtual keyboards or sign language interpretation.
-- Landmark pb2 module - This module is a protocol buffer (protobuf) definition used to represent and serialize the landmarks detected by MediaPipe’s vision models. Landmarks are key points on an object or body (e.g., joints in pose estimation or facial landmarks in face detection). This module is critical for downstream tasks like calculating angles between joints or applying gestures in augmented reality. Developers can parse and manipulate the detected landmarks using this module for fine-grained analysis and integration into applications.
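A minimal face-landmark detection sketch using the tasks and vision modules; the face_landmarker.task model asset and face.jpg are assumed local files:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"),  # assumed model path
    num_faces=1,
)
detector = vision.FaceLandmarker.create_from_options(options)

mp_image = mp.Image.create_from_file("face.jpg")  # assumed input image
result = detector.detect(mp_image)
if result.face_landmarks:
    print(len(result.face_landmarks[0]))  # number of landmarks on the first detected face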
Streamlit allows users to display data using widgets, charts, tables, and media components like images, videos, or audio files. It supports integration with popular data visualization libraries such as Matplotlib, Plotly, and Seaborn, as well as frameworks for handling large datasets like Pandas. Streamlit's reactive framework ensures that whenever a user interacts with a widget (e.g., sliders, dropdowns, or text inputs), the application dynamically re-executes, providing an updated output. This functionality makes Streamlit ideal for building dashboards, data exploration tools, and interactive ML model demonstrations.
import os
import time
import tempfile
from tempfile import NamedTemporaryFile
from io import BytesIO

import validators
import requests
import numpy as np
import cv2
import matplotlib.pyplot as plt
from PIL import Image
import streamlit as st
from dotenv import load_dotenv

from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from diffusers.pipelines import StableDiffusionInpaintPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from transformers import AutoModelForSeq2SeqLM

import mediapipe
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.framework.formats import landmark_pb2

from langchain_groq.chat_models import ChatGroq
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_mistralai import ChatMistralAI
from langchain.schema import SystemMessage, HumanMessage, AIMessage
from langchain_ollama.chat_models import ChatOllama
st.markdown( """ <style> /* Page Styling */ body { background-color: black; background-size: cover; background-position: center; background-repeat: no-repeat; } .sidebar .sidebar-content { background-color: black ; } .title { color: red; font-size: 50px; font-weight: bold; text-align: center; } .sub-title { color: cyan; font-size: 30px; text-align: center; margin: 20px 0; } .sub-sub-title { color: white; font-size: 25px; text-align: center; margin: 30px 0; } .header { font-size: 36px; color: cyan; text-align: center; margin: 40px 0; } </style> """, unsafe_allow_html=True )
# Load API keys from a local .env file instead of hard-coding them in the source.
load_dotenv()
MISTRAL_API_KEY = os.environ["MISTRAL_API_KEY"]
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
models = {"Chatbot": ["ChatOllama", "ChatMistralAI", "ChatGroq"]}
main_tasks = st.sidebar.multiselect("Choose domain", ["NLP", "CV"])

for task in main_tasks:
    if task == "NLP":
        st.markdown("<div class='title'>GEN AI LLM DEPLOYMENT DASHBOARD</div>", unsafe_allow_html=True)
        st.write("\n")
        st.markdown("<div class='sub-title'>Natural Language Processing</div>", unsafe_allow_html=True)
        selected_tasks = ["Chatbot", "Question-Answering", "Text-Translation", "Sentiment Analysis",
                          "Summarization", "Text Classification", "RAG"]
        task_options = st.sidebar.multiselect("Select Tasks", selected_tasks)

        for task in task_options:
            if task == "Chatbot":
                def Generative_AI():
                    selected_models = {}
                    for task in task_options:
                        if task == "Chatbot":
                            selected_models[task] = st.sidebar.multiselect(f"Select models for {task}", models[task])
                    for task, model_list in selected_models.items():
                        if model_list:
                            if task == "Chatbot":
                                st.markdown("<div class='sub-title'>ChatBots</div>", unsafe_allow_html=True)
                                for model in model_list:
                                    if model == "ChatOllama":
                                        st.markdown("<div class='sub-sub-title'>ChatOllama</div>", unsafe_allow_html=True)
                                        llm = ChatOllama(model="llama3.1", temperature=0.8, max_tokens=1024)
                                        prompt = ChatPromptTemplate.from_template(
                                            """
                                            ### Task ###
                                            You will act like the system specified {system_message}
                                            and will answer the questions asked by the user in the form of a: {human_message}
                                            """
                                        )
                                        system_message = st.text_input("Enter message for the system : ")
                                        if 'chat' not in st.session_state:
                                            st.session_state['chat'] = [SystemMessage(content=system_message)]

                                        def get_response(human_message):
                                            if human_message:
                                                st.session_state['chat'].append(HumanMessage(content=human_message))
                                                response = llm.invoke(st.session_state['chat'])
                                                st.session_state['chat'].append(AIMessage(content=response.content))
                                                return response.content

                                        human_message = st.text_input("Enter your message : ")
                                        if st.button("Generate Text"):
                                            with st.spinner("Generating..."):
                                                answer = get_response(human_message)
                                                generated_answer = answer
                                                st.success("Generated Text:")
                                                st.write(generated_answer)
elif model=="ChatMistralAI": st.markdown("<div class='sub-sub-title'>ChatMistralAI</div>", unsafe_allow_html=True) llm= ChatMistralAI(api_key=MISTRAL_API_KEY, temperature=0.8, max_tokens=1024) prompt = ChatPromptTemplate.from_template( """ ### Task You are a helpful assistant designed to act like a system specified by the user as {system_message}. ### You will act like the system specified and will answer the questions asked by the user in the form of a: {human_message} """ ) system_message= st.text_input("Enter message for the system : ") human_message= st.text_input("Enter your message : ") if 'chat' not in st.session_state: st.session_state['chat']= [ SystemMessage(content=system_message), ] def get_response(human_message): st.session_state['chat'].append(HumanMessage(content=human_message)) response= llm.invoke(st.session_state['chat']) st.session_state['chat'].append(AIMessage(content=response.content)) return response.content if st.button("Generate Text"): with st.spinner("Generating..."): answer = get_response(prompt.format(system_message=system_message, human_message=human_message)) generated_text = answer st.success("Generated Text:") st.write(generated_text)
elif model=="ChatGroq": st.markdown("<div class='sub-sub-title'>ChatGroq</div>", unsafe_allow_html=True) llm= ChatGroq(api_key=GROQ_API_KEY, model_name="llama3-70b-8192", temperature=0.8, max_tokens=1024) prompt = ChatPromptTemplate.from_template( """ ### Task You are a helpful assistant designed to act like a system specified by the user as {system_message}. ### You will act like the system specified and will answer the questions asked by the user in the form of a: {human_message} """ ) system_message= st.text_input("Enter message for the system : ") human_message= st.text_input("Enter your message : ") if 'chat' not in st.session_state: st.session_state['chat']= [ SystemMessage(content=system_message), ] def get_response(human_message): st.session_state['chat'].append(HumanMessage(content=human_message)) response= llm.invoke(st.session_state['chat']) st.session_state['chat'].append(AIMessage(content=response.content)) return response.content if st.button("Generate Text"): with st.spinner("Generating..."): answer = get_response(prompt.format(system_message= system_message, human_message= human_message)) generated_text = answer st.success("Generated Text:") st.write(generated_text) else: print("No model selected") Generative_AI()
elif task== "Text-Translation": def text_trans(): for task in task_options: if task=="Text-Translation": st.markdown("<div class='sub-title'>Translate text from One language to another :</div>", unsafe_allow_html=True) tokenizer= MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt") model= MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt") src_lang= st.text_input("Enter source language code : ") tgt_lang= st.text_input("Enter target language code : ") tokenizer.src_lang = src_lang tokenizer.tgt_lang = tgt_lang qa_pipeline = pipeline("translation", model=model, tokenizer=tokenizer, src_lang= tokenizer.src_lang, tgt_lang= tokenizer.tgt_lang) sentence = st.text_area("Enter the sentence here...") translation= st.text_input("Enter language to be translated : ") if sentence: result = qa_pipeline(sentence, translation) if st.button("Translate"): with st.spinner("Translating..."): st.subheader("Answer:") st.success("Translated Text:") st.write(result[0]['translation_text']) text_trans()
elif task=="Question-Answering": def qna(): for task in task_options: if task=="Question-Answering": st.markdown("<div class='sub-title'>Question Answering</div>", unsafe_allow_html=True) context = st.text_area("Enter the context here...") question = st.text_input("Enter your question here...") qa_pipeline = pipeline("question-answering", model="google-bert/bert-large-uncased-whole-word-masking-finetuned-squad") if question and context: result = qa_pipeline({"question": question, "context": context}) if st.button("Answer"): with st.spinner("Answering..."): st.subheader("Answer:") st.success("Provided Answer:") st.write(result['answer']) else: st.write("Please provide both a question and a context.") qna()
elif task=="Summarization": def summarize(): for task in task_options: if task=="Summarization": st.markdown("<div class='sub-title'>Summarize texts </div>", unsafe_allow_html=True) model_name = "facebook/bart-large-cnn" # Replace with the actual Llama model name on Hugging Face tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSeq2SeqLM.from_pretrained(model_name) generator = pipeline("summarization", model=model, tokenizer=tokenizer, trust_remote_code=True) user_input = st.text_area("Enter your article:") if st.button("Summarize"): with st.spinner("Summarizing..."): response = generator(user_input) generated_text = response[0]['summary_text'] st.success("Summarized Text:") st.write(generated_text) summarize()
elif task=="Sentiment Analysis": def sentiment(): for task in task_options: if task=="Sentiment Analysis": st.markdown("<div class='sub-title'>Sentiment Analysis </div>", unsafe_allow_html=True) model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student" model = AutoModelForSequenceClassification.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) nlp = pipeline('text-classification', model=model, tokenizer=tokenizer) question= st.text_input("Enter you sentence : ") if st.button("Generate Sentiment"): with st.spinner("Generating..."): st.subheader("Answer:") result = nlp(question) st.success("Predicted Sentiment:") st.write({result[0]['label'] : result[0]['score']}) sentiment()
elif task== "Text Classification": def classify(): for task in task_options: if task=="Text Classification": st.markdown("<div class='sub-title'>Classify Text with labels </div>", unsafe_allow_html=True) model_name = "facebook/bart-large-mnli" model = AutoModelForSequenceClassification.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) nlp = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer) question= st.text_input("Enter you question : ") labels= st.multiselect("Enter labels to classify ", ['travelling', 'sports', 'food', 'education']) if st.button("Classify"): with st.spinner("Classifying..."): st.subheader("Answer:") result = nlp(question, labels) st.success("Classified Text:") st.write({result['labels'][int(max(result['scores']))] : max(result['scores'])}) classify()
elif task== "RAG": def rag(): for task in task_options: if task=="RAG": st.markdown("<div class='sub-title'>Retrieval Augmented Generation</div>", unsafe_allow_html=True) if "vector" not in st.session_state: website= st.text_input("Enter a URL:", placeholder="https://example.com") if website: if validators.url(website): st.session_state.embeddings= HuggingFaceBgeEmbeddings() st.session_state.loader= WebBaseLoader(website) st.session_state.docs= st.session_state.loader.load() st.session_state.text_doc= RecursiveCharacterTextSplitter(chunk_size= 1000, chunk_overlap = 200) st.session_state.fin_docs= st.session_state.text_doc.split_documents(st.session_state.docs) st.session_state.vectors= FAISS.from_documents(st.session_state.fin_docs, st.session_state.embeddings) llm= ChatGroq(groq_api_key= GROQ_API_KEY, model= "llama3-8b-8192") prompt= ChatPromptTemplate.from_template( """ ### CONTEXT {context} (Description: This section provides all necessary background, relevant information, and framing details that are crucial for understanding the request or generating a relevant response. This may include task instructions, tone or style preferences, background knowledge, and specific objectives. Include any special considerations or additional details here.) ### INPUT {input} (Description: This section contains the user's main question, command, or prompt that requires a response. This is the primary user request, and the answer or generation will be based on both the INPUT and CONTEXT.) ### RESPONSE (Note: Answer or generate text based on the CONTEXT and INPUT provided above. Consider tone, style, and any additional requirements included in CONTEXT. Focus on delivering a precise, contextually relevant, and coherent response to the INPUT.) """ ) doc_chain= create_stuff_documents_chain(llm, prompt) retriever= st.session_state.vectors.as_retriever() ret_chain= create_retrieval_chain(retriever, doc_chain) prompt= st.text_input("Enter your Question : ") if prompt : if st.button("Query Website or Document"): with st.spinner("Querying..."): st.subheader("Retrieving ...") response= ret_chain.invoke({'input' : prompt}) start= time.process_time() print(f"Response Time = {time.process_time() - start}") st.success("Retrieved Data:") st.write(response['answer']) with st.expander("Document Similarity Search"): for i, doc in enumerate(response['context']): st.write (doc.page_content) st.write('---------------------------') rag() else: print("No task selected")
    else:
        selected_tasks = ["Image Inpainting with Automatic Image Captioning",
                          "Facial Keypoint detection with Landmark Recognition"]
        task_options = st.sidebar.multiselect("Select Tasks", selected_tasks)
        for tasks in task_options:
            if tasks == "Image Inpainting with Automatic Image Captioning":
                st.markdown(
                    """
                    <style>
                    /* Page Styling */
                    body {
                        background-color: black;
                        background-size: cover;
                        background-position: center;
                        background-repeat: no-repeat;
                    }
                    .title { color: red; font-size: 50px; font-weight: bold; text-align: center; }
                    .sub-title { color: cyan; font-size: 35px; text-align: center; margin: 20px 0; }
                    .header { font-size: 25px; color: cyan; text-align: center; margin: 40px 0; }
                    .subheader { font-size: 20px; color: yellow; text-align: center; margin: 40px 0; }
                    .custom-text { font-size: 15px; color: purple; text-align: center; margin: 40px 0; }
                    </style>
                    """,
                    unsafe_allow_html=True,
                )
                st.markdown("<div class='title'>Image Inpainting with Automatic Image Captioning</div>", unsafe_allow_html=True)
                st.write("\n")
                st.markdown("<div class='sub-title'>Inpainting and Caption Generation with Stable Diffusion, SAM and BLIP</div>", unsafe_allow_html=True)
                st.markdown("<div class='subheader'>Upload and generate an image and get a caption!</div>", unsafe_allow_html=True)
                st.markdown("<div class='custom-text'>Loading Image</div>", unsafe_allow_html=True)
                uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "jpeg", "png"], key="upload")
                if uploaded_file is not None:
                    with NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
                        temp_file.write(uploaded_file.read())
                        temp_path = temp_file.name
                    image = cv2.imread(temp_path)
                    image = cv2.resize(image, (512, 512))
                    img_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                    img_rgb = cv2.resize(img_rgb, (512, 512))
                    st.image(img_rgb, caption="Uploaded Image", use_column_width=True)
                    st.write(f"Image loaded with shape: {img_rgb.shape}")

                    st.markdown("<div class='sub-header'>Generating segmentation masks</div>", unsafe_allow_html=True)
                    model_type = 'vit_h'
                    checkpoint = "C:/Users/DIBYOJIT/Downloads/sam_vit_h_4b8939.pth"
                    sam = sam_model_registry[model_type](checkpoint=checkpoint).to("cpu")

                    # Generate segmentation masks
                    mask_generator = SamAutomaticMaskGenerator(sam)
                    result = mask_generator.generate(img_rgb)
                    if result:
                        sorted_result = [segment['segmentation'] for segment in sorted(result, key=lambda x: x['area'], reverse=True)]
                        largest_mask = sorted_result[0]
                        plt.figure(figsize=(8, 6))
                        plt.imsave('segmentation_mask.png', largest_mask)
                        plt.imshow(largest_mask)
                    else:
                        raise ValueError("No segmentation masks generated.")
                    st.image('segmentation_mask.png', caption="Segmentation Mask", use_column_width=True)

                    st.markdown("<div class='custom-text'>Image and Mask URL creation and prompting</div>", unsafe_allow_html=True)

                    def download_image(url):
                        if not validators.url(url):
                            st.error("Invalid URL. Please enter a valid image URL.")
                            return None
                        try:
                            response = requests.get(url)
                            response.raise_for_status()
                            url_image = Image.open(BytesIO(response.content)).convert("RGB")
                            return url_image
                        except requests.exceptions.RequestException as e:
                            st.error(f"Error fetching the image: {e}")
                            return None
                        except Exception as e:
                            st.error(f"Error processing the image: {e}")
                            return None

                    image_url = st.text_input("Enter the URL of the image : ")
                    mask_url = st.text_input("Enter the URL of the mask : ")
                    filename = st.text_input("Enter the filename (without extension): ").strip()
                    extension = st.text_input("Enter the file extension (e.g., png, jpg): ").strip()
                    file_path = f"{filename}.{extension}"
                    init_image = download_image(image_url)
                    mask_image = download_image(mask_url)
                    prompt = st.text_input("Enter the prompt : ")
                    st.write("Prompt: ", prompt)
                    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
                    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

                    if st.button("Generate Image"):
                        st.markdown("<div class='custom-text'>Generating Inpainted Image</div>", unsafe_allow_html=True)
                        with st.spinner("Generating..."):
                            image_to_image = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
                            result = image_to_image(prompt=prompt, image=init_image, mask_image=mask_image)
                            st.success("Generated Image: ")
                            generated_image = result.images[0]
                            # Keep the image in RGB order; plt.imsave expects RGB data.
                            plt.imsave(file_path, np.array(generated_image))
                            st.image(file_path, caption="Generated Image", use_column_width=True)

                    diffused_url = st.text_input("Enter the URL of the diffused image : ")
                    if validators.url(diffused_url):
                        st.success("Image URL is valid.")
                        st.markdown("<div class='custom-text'>Generating Inpainted Image</div>", unsafe_allow_html=True)
                        st.write("Automatic Captioning of the Diffused Image ")
                        diffused_image = Image.open(requests.get(diffused_url, stream=True).raw).convert('RGB')
                        if st.button("Generate Caption"):
                            with st.spinner("Generating..."):
                                inputs = processor(diffused_image, return_tensors="pt")
                                outputs = model.generate(**inputs)
                                caption = processor.decode(outputs[0], skip_special_tokens=True)
                                st.success("Generated Caption: ")
                                st.write(caption)
            else:
                st.markdown("<div class='title'>Sample Landmark Detection</div>", unsafe_allow_html=True)
                graphics = st.sidebar.multiselect("Choose type of graphics", ['Image', 'Video'])
                for data in graphics:
                    if data == "Image":
                        st.markdown("<div class='header'>Landmark Detection on Image</div>", unsafe_allow_html=True)
                        uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "jpeg", "png"], key="Image")
                        if uploaded_file is not None:
                            with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as temp_file:
                                temp_file.write(uploaded_file.read())
                                temp_file_path = temp_file.name
                            mp_image = mp.Image.create_from_file(temp_file_path)
                            model_path = "C:/Users/DIBYOJIT/Downloads/face_landmarker.task"

                            def draw_landmarks(img_rgb, detection_result):
                                face_landmarks_list = detection_result.face_landmarks
                                annotated_image = np.copy(img_rgb)
                                for idx in range(len(face_landmarks_list)):
                                    face_landmarks = face_landmarks_list[idx]
                                    face_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
                                    face_landmarks_proto.landmark.extend([
                                        landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z)
                                        for landmark in face_landmarks
                                    ])
                                    mp_drawing = mp.solutions.drawing_utils
                                    mp_drawing.draw_landmarks(
                                        image=annotated_image,
                                        landmark_list=face_landmarks_proto,
                                        connections=mp.solutions.face_mesh.FACEMESH_TESSELATION,
                                        landmark_drawing_spec=None,
                                        connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_tesselation_style()
                                    )
                                    mp_drawing.draw_landmarks(
                                        image=annotated_image,
                                        landmark_list=face_landmarks_proto,
                                        connections=mp.solutions.face_mesh.FACEMESH_CONTOURS,
                                        landmark_drawing_spec=None,
                                        connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_contours_style()
                                    )
                                    mp_drawing.draw_landmarks(
                                        image=annotated_image,
                                        landmark_list=face_landmarks_proto,
                                        connections=mp.solutions.face_mesh.FACEMESH_IRISES,
                                        landmark_drawing_spec=None,
                                        connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_iris_connections_style()
                                    )
                                return annotated_image

                            base_options = python.BaseOptions(model_asset_path=model_path)
                            options = vision.FaceLandmarkerOptions(base_options=base_options,
                                                                   output_face_blendshapes=True,
                                                                   output_facial_transformation_matrixes=True,
                                                                   num_faces=1)
                            detector = vision.FaceLandmarker.create_from_options(options)
                            detection_result = detector.detect(mp_image)
                            annotated_image = draw_landmarks(mp_image.numpy_view(), detection_result)
                            st.image(annotated_image, caption="Face Landmarks", use_column_width=True)

                            face_landmark_list = detection_result.face_landmarks
                            for idx in range(len(face_landmark_list)):
                                face_landmarks = face_landmark_list[idx]
                                face_landmarks = [(landmark.x, landmark.y) for landmark in face_landmarks]

                            def get_face_roi_from_landmarks(landmarks, image_shape):
                                x = []
                                y = []
                                for land in landmarks:
                                    x.append(land[0])
                                    y.append(land[1])
                                x_min = int(min(x) * image_shape[1])
                                y_min = int(min(y) * image_shape[0])
                                x_max = int(max(x) * image_shape[1])
                                y_max = int(max(y) * image_shape[0])
                                return x_min, y_min, x_max, y_max

                            def generate_face_mask_roi(processed_image, face_landmarks, annotated_image_shape):
                                if face_landmarks:
                                    # Get bounding box for the face from the landmarks
                                    x_min, y_min, x_max, y_max = get_face_roi_from_landmarks(face_landmarks, annotated_image_shape)
                                    bbox = (x_min, y_min, x_max, y_max)
                                    cv2.rectangle(processed_image, (x_min, y_min), (x_max, y_max), (0, 0, 255), 2)
                                    cv2.putText(processed_image, "Face", (x_min, y_min - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
                                    plt.imsave("ROI_of_face.png", processed_image)
                                    st.image('ROI_of_face.png', caption="Facial Bounding Box with Landmarks", use_column_width=True)
                                    return processed_image, bbox, face_landmarks

                            generate_face_mask_roi(annotated_image, face_landmarks, annotated_image.shape)
                    else:
                        def process_video_frame(frame, face_landmarker):
                            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                            mp_image = mediapipe.Image(image_format=mediapipe.ImageFormat.SRGB, data=rgb_frame)
                            results = face_landmarker.detect(mp_image)
                            annotated_frame = frame.copy()
                            if results.face_landmarks:
                                for landmarks in results.face_landmarks:
                                    if landmarks:
                                        face_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
                                        face_landmarks_proto.landmark.extend([
                                            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z)
                                            for landmark in landmarks
                                        ])
                                        mp_drawing = mediapipe.solutions.drawing_utils
                                        mp_drawing.draw_landmarks(
                                            image=annotated_frame,
                                            landmark_list=face_landmarks_proto,
                                            connections=mp.solutions.face_mesh.FACEMESH_TESSELATION,
                                            landmark_drawing_spec=None,
                                            connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_tesselation_style()
                                        )
                                        mp_drawing.draw_landmarks(
                                            image=annotated_frame,
                                            landmark_list=face_landmarks_proto,
                                            connections=mp.solutions.face_mesh.FACEMESH_CONTOURS,
                                            landmark_drawing_spec=None,
                                            connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_contours_style()
                                        )
                                        mp_drawing.draw_landmarks(
                                            image=annotated_frame,
                                            landmark_list=face_landmarks_proto,
                                            connections=mp.solutions.face_mesh.FACEMESH_IRISES,
                                            landmark_drawing_spec=None,
                                            connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_iris_connections_style()
                                        )
                                for idx in range(len(results.face_landmarks)):
                                    face_landmarks = results.face_landmarks[idx]
                                    face_landmarks = [(landmark.x, landmark.y) for landmark in face_landmarks]
                                    x_min, y_min, x_max, y_max = get_face_roi_from_landmarks(face_landmarks, annotated_frame.shape)
                                    cv2.rectangle(annotated_frame, (x_min, y_min), (x_max, y_max), (255, 0, 0), 2)
                                    cv2.putText(annotated_frame, "Facial Landmarks", (x_min, y_min - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 255), 2)
                            return annotated_frame

                        def get_face_roi_from_landmarks(landmarks, image_shape):
                            x = []
                            y = []
                            for land in landmarks:
                                x.append(land[0])
                                y.append(land[1])
                            x_min = int(min(x) * image_shape[1])
                            y_min = int(min(y) * image_shape[0])
                            x_max = int(max(x) * image_shape[1])
                            y_max = int(max(y) * image_shape[0])
                            return x_min, y_min, x_max, y_max

                        webcam_or_custom_video = st.sidebar.multiselect("Choose Webcam or Custom Video", ["Webcam", "Custom Video"])
                        for vid in webcam_or_custom_video:
                            if vid == "Webcam":
                                st.markdown("<div class='header'>Landmark Detection on Webcam</div>", unsafe_allow_html=True)

                                def main():
                                    cap = cv2.VideoCapture(0)
                                    model_path = "C:/Users/DIBYOJIT/Downloads/face_landmarker.task"
                                    if not cap.isOpened():
                                        st.error("Error: Could not open video.")
                                        exit()
                                    options = vision.FaceLandmarkerOptions(
                                        base_options=python.BaseOptions(model_asset_path=model_path),
                                        num_faces=1
                                    )
                                    face_landmarker = vision.FaceLandmarker.create_from_options(options)
                                    while True:
                                        ret, frame = cap.read()
                                        if not ret:
                                            break
                                        annotated_frame = process_video_frame(frame, face_landmarker)
                                        cv2.imshow("Facial Landmarks", annotated_frame)
                                        if cv2.waitKey(1) & 0xFF == ord('q'):
                                            break
                                    cap.release()
                                    cv2.destroyAllWindows()

                                if __name__ == "__main__":
                                    main()
                            else:
                                def main2():
                                    st.markdown("<div class='header'>Landmark Detection on Custom Video</div>", unsafe_allow_html=True)
                                    uploaded_file = st.file_uploader("Choose a Video...", type=["mp4", "mp3", "avi"], key="upload")
                                    if uploaded_file is not None:
                                        with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as temp_file:
                                            temp_file.write(uploaded_file.read())
                                            temp_file_path = temp_file.name
                                        st.video(temp_file_path)
                                        model_path = "C:/Users/DIBYOJIT/Downloads/face_landmarker.task"
                                        cap = cv2.VideoCapture(temp_file_path)
                                        if not cap.isOpened():
                                            st.error("Error: Could not open video.")
                                            exit()
                                        options = vision.FaceLandmarkerOptions(
                                            base_options=python.BaseOptions(model_asset_path=model_path),
                                            num_faces=1
                                        )
                                        face_landmarker = vision.FaceLandmarker.create_from_options(options)
                                        while True:
                                            ret, frame = cap.read()
                                            if not ret:
                                                break
                                            annotated_frame = process_video_frame(frame, face_landmarker)
                                            cv2.imshow("Facial Landmarks", annotated_frame)
                                            if cv2.waitKey(1) & 0xFF == ord('q'):
                                                break
                                        cap.release()
                                        cv2.destroyAllWindows()

                                if __name__ == "__main__":
                                    main2()
Screenshots of the Examples given in the UI
1.1 For ChatOllama
1.2 For ChatMistral
1.3 For ChatGroq
This project combines NLP and Computer Vision use cases under Generative AI, dynamically testing the system with user-centred queries and problem statements; the quality of the responses it generates demonstrates its effectiveness in solving those problems.
The development of a "Generative AI Multitasking Dashboard" marks a significant leap forward in seamlessly integrating Natural Language Processing (NLP) and Computer Vision (CV) with generative AI capabilities. By leveraging cutting-edge frameworks like Huggingface, LangChain, and Groq, alongside APIs such as Mistral and Ollama, this project demonstrates the transformative potential of generative AI across diverse tasks and domains.
Through Huggingface, the dashboard harnesses state-of-the-art pre-trained models for text generation, classification, and summarization, ensuring robustness and scalability. LangChain’s modularity further enhances the system by enabling dynamic orchestration of language models, offering powerful workflow customization for real-world applications. Groq's high-performance AI acceleration hardware ensures the seamless handling of complex, multitask operations, achieving unprecedented computational efficiency.
The integration of APIs like Mistral and Ollama expands the dashboard's generative capabilities, allowing it to produce high-quality, context-aware outputs in diverse modalities. This interoperability between frameworks and APIs underscores the platform's versatility, meeting the demands of multitasking workflows in industries such as healthcare, finance, and content creation.
The "Generative AI Multitasking Dashboard" sets a benchmark for the future of AI, proving that the unification of NLP and CV with generative AI not only drives innovation but also enables practical, scalable solutions for complex, multi-dimensional problems. This holistic approach offers a pathway to unlocking the full potential of AI, providing end-users with a seamless and transformative experience in their day-to-day operations.