The applications of Generative AI are flourishing across today's technology infrastructure and environment. Understanding real-world business use cases and automating their solutions through intelligent, data-driven systems is a crucial challenge for today's industries. Generative AI enables remarkable progress in a system's ability to generate content for real-world use cases and to act with a degree of autonomy, reasoning over the generated solutions in a human-like way to support effective decision-making. These systems are powered by LLMs and LVMs, massive foundation models trained on huge datasets from which they learn rich feature representations and become capable of making informed decisions.
In this project, I have developed a multitasking dashboard powered by multiple LLMs that spans two domains of unstructured data: text, addressed through Natural Language Processing (NLP), and graphical data such as images, audio, and video, addressed through Computer Vision. I have integrated several frameworks to build the dashboard, including LangChain, Hugging Face, and Groq, along with external API integrations such as the Mistral API, Ollama, and the Groq API, which allow their models to be adapted to my own tasks within a structured design template. The entire dashboard is served through Streamlit, a Python-based interactive dashboarding library that supports extensive element and layout customization for effective visualization of the resulting application. Through this system I address NLP use cases such as text classification, sentiment analysis, language translation, and RAG, as well as Computer Vision problems such as facial landmark recognition and image inpainting. The system proves effective at solving problems in their respective domains and can handle multiple problems on a single platform, enabling intra-platform interconnection and domain modularity across the tasks for which it was designed.
The project is organized by the domain of its use cases, integrating several from NLP and several from Computer Vision. The NLP use cases include text classification, text summarization, question answering, chatbots, language translation, text generation, and document-based search indexing and retrieval using RAG. The Computer Vision use cases are facial detection with landmark keypoint estimation, and automatic image captioning with image inpainting. The use cases are described below.
Definition: Text summarization is the process of reducing a text document to its most important and relevant information. It can be:
Extractive Summarization: Selects sentences or phrases directly from the source text.
Abstractive Summarization: Generates new sentences, often rephrasing or using synonyms to convey the same meaning concisely.
Definition: Text classification involves categorizing text into predefined classes or labels. For example, classifying emails into spam or non-spam or assigning sentiments to product reviews.
Definition: QA systems provide direct answers to questions posed by users, leveraging a source document or a database. QA tasks are typically divided into:
Closed-domain QA: Focuses on specific topics with limited context.
Open-domain QA: Handles a broad range of topics, often requiring external knowledge sources.
Definition: Text generation involves producing coherent and contextually relevant text, often resembling human writing.
Definition: Chatbots are conversational agents that interact with users in natural language. They can be rule-based or AI-driven.
Definition: Language translation involves converting text or speech from one language to another while preserving the original meaning.
Definition: RAG is a hybrid approach that combines information retrieval and text generation. It retrieves relevant context or facts from a database (or external source) and generates text based on the retrieved information.
Definition: This task involves two components:
Image Captioning: Generating descriptive textual captions for an image, capturing objects, activities, and context.
Image Inpainting: Filling in missing or corrupted parts of an image, restoring it to a complete and visually coherent form.
Definition: This task combines:
Facial Detection: Identifying and localizing faces within an image or video.
Landmark Keypoint Estimation: Detecting specific points on a face (e.g., eyes, nose, mouth, and jawline) to understand its geometry and orientation.
In this project, I have used several LLMs, frameworks such as LangChain, Hugging Face, and Groq, and APIs such as Mistral, Ollama, and Hugging Face. Alongside these, I have used Streamlit, a Python-based interactive dashboarding package, to develop the application UI. The function of each tool, platform, and API used in the project is explained below.
Each tool is listed below with a description of its functionality:
Meta’s LLaMA3.1 and LLaMA3-70B-8192 - Meta's LLaMA (Large Language Model Meta AI) models represent the latest advancements in large-scale transformer-based language models. LLaMA3.1 improves upon its predecessors by incorporating enhanced optimization techniques, such as sparse attention mechanisms and advanced positional encoding for long sequences, making it highly effective for processing context-rich data. The LLaMA3-70B-8192 variant has a massive 70 billion parameters and supports input sequences of up to 8192 tokens, catering to complex tasks such as long-form document summarization and multi-turn dialogue modeling. Applications include chatbots, knowledge retrieval, and extensive text analysis tasks, with high efficiency and scalability for research and industry.
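As a rough illustration, the LLaMA3-70B-8192 variant can be reached through LangChain's Groq integration, as in the minimal sketch below (the GROQ_API_KEY environment variable and the example messages are assumptions for illustration):

import os
from langchain_groq.chat_models import ChatGroq
from langchain.schema import SystemMessage, HumanMessage

# Assumes GROQ_API_KEY has been exported in the environment.
llm = ChatGroq(api_key=os.environ["GROQ_API_KEY"], model_name="llama3-70b-8192",
               temperature=0.8, max_tokens=1024)
reply = llm.invoke([
    SystemMessage(content="You are a concise technical assistant."),
    HumanMessage(content="Explain why long context windows help document summarization."),
])
print(reply.content)  # the model's generated answer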
Facebook/BART-large-CNN - BART-large-CNN is a transformer-based sequence-to-sequence model pre-trained using a denoising autoencoder objective. The model comprises an encoder-decoder architecture with 24 transformer layers, enabling robust understanding and generation of text. Optimized for summarization tasks, the CNN-specific fine-tuning enhances its ability to process news articles and other structured data. Its applications include abstractive summarization, paraphrasing, and data-driven text generation, making it a reliable tool for summarizing complex documents while preserving meaning and coherence.
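A minimal usage sketch with the Hugging Face pipeline API (the article text is a placeholder):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = ("The city council approved a new public transport plan on Monday. "
           "The plan adds three bus routes and extends metro service hours, "
           "aiming to cut downtown congestion by 2026.")
summary = summarizer(article)
print(summary[0]["summary_text"])  # abstractive summary of the article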
facebook/mbart-large-50-many-to-many-mmt - The mBART-large-50 is a multilingual sequence-to-sequence model trained for text-to-text tasks across 50 languages. It features a shared transformer-based encoder-decoder structure, leveraging token embeddings for multilingual contexts. The "many-to-many-mmt" fine-tuning specializes the model for translation tasks, handling various language pairs with high fidelity. Applications span multilingual machine translation, cross-lingual summarization, and content localization, with a focus on enabling seamless communication across diverse linguistic groups.
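A minimal sketch of many-to-many translation with this model; the language codes follow the mBART-50 convention (e.g. en_XX, fr_XX), and the sentence is a placeholder:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "en_XX"                                  # source language: English
encoded = tokenizer("The weather is lovely today.", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])  # target: French
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])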
google-bert/bert-large-uncased-whole-word-masking-finetuned-squad - This is a variant of Google’s BERT (Bidirectional Encoder Representations from Transformers) fine-tuned on the Stanford Question Answering Dataset (SQuAD). The model uses a transformer architecture with 24 layers and 340 million parameters, applying bidirectional attention for deep contextual understanding. The whole-word masking pre-training strategy allows it to understand multi-word expressions better. Its fine-tuning for SQuAD makes it adept at answering questions from paragraphs, with applications in customer support, search engines, and knowledge-based systems.
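A minimal extractive question-answering sketch with this checkpoint (the context and question are illustrative):

from transformers import pipeline

qa = pipeline("question-answering",
              model="google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa({
    "question": "Which library serves the dashboard?",
    "context": "The dashboard is served through Streamlit, a Python-based dashboarding library.",
})
print(result["answer"], result["score"])  # extracted answer span and its confidence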
lxyuan/distilbert-base-multilingual-cased-sentiments-student - This lightweight multilingual model is a distilled version of BERT, with reduced parameters for faster inference and lower computational overhead. It has been fine-tuned specifically for sentiment analysis across multiple languages. The architecture maintains key transformer layers to balance performance and efficiency, enabling it to classify sentiments effectively. Applications include social media analysis, multilingual customer feedback evaluation, and monitoring brand sentiment across diverse languages and regions.
facebook/bart-large-mnli - BART-large-MNLI is a fine-tuned variant of Facebook’s BART model optimized for natural language inference (NLI) tasks. It features the same 24-layer transformer encoder-decoder structure as the original BART model but fine-tuned on the Multi-Genre NLI (MNLI) dataset. This allows it to assess relationships between text pairs, such as entailment, contradiction, or neutrality. Applications include fact-checking, contextual similarity analysis, and content moderation by understanding nuanced relationships between textual inputs.
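A minimal zero-shot classification sketch with this model via the Hugging Face pipeline (the candidate labels mirror the ones offered later in the dashboard; the sentence is a placeholder):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier("I spent the weekend hiking and trying street food in Bangkok.",
                    candidate_labels=["travelling", "sports", "food", "education"])
print(result["labels"][0], result["scores"][0])  # top label and its score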
-- Retrieval Chains - Retrieval chains combine a retrieval mechanism (e.g., vector search) with LLM reasoning. They extract relevant documents or context from a database or corpus and pass it to the model for generating accurate, context-aware responses.
Memory - LangChain memory components preserve conversational state across turns:
Conversation Buffers: Maintain the full history of interactions.
Summary Memory: Condenses past interactions into summaries for better scalability in long conversations.
Knowledge Base Memory: Stores extracted facts for reference.
Structured Parsers - Structured parsers help extract structured data (like JSON or tables) from raw text outputs of LLMs. This is crucial for integrating LLMs into applications requiring structured output for downstream processes like data analytics or automation.
Document Loaders - Document loaders handle the ingestion of external data sources like PDFs, Word documents, HTML, or databases. They provide a standardized way to load data into the LangChain pipeline for tasks like summarization, information retrieval, and analysis.
-- Web-based Loaders - Web loaders fetch content from URLs, APIs, or online repositories for real-time data ingestion. This is useful for building applications that process web content, like summarizing news articles or analyzing forums.
Recursive Character Text Splitter - This utility splits large text documents into manageable chunks based on predefined token or character limits. It ensures logical divisions, like splitting at sentence or paragraph boundaries, optimizing LLM performance in processing large documents.
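A minimal chunking sketch (the chunk sizes mirror the values used later in the RAG section, and long_text is a placeholder):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
long_text = "..."  # placeholder: the full text of a large document
chunks = splitter.split_text(long_text)  # overlapping chunks of roughly 1000 characters
# For LangChain Document objects, use splitter.split_documents(docs) instead.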
Chat Prompt Templates - Chat prompt templates allow developers to define reusable and parameterized prompts for conversational tasks. They enable consistent and dynamic input formatting, improving the usability and flexibility of chat-based applications.
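A small sketch of a parameterized chat prompt (the placeholder names and values are illustrative):

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "You will act as {system_message} and answer the user's question: {human_message}"
)
messages = prompt.format_messages(system_message="a friendly travel guide",
                                  human_message="What should I see in Kolkata in two days?")
# `messages` can be passed directly to a chat model's invoke() method.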
LangChain VectorStores - VectorStores provide a way to store, index, and query embeddings (numerical representations of text). They are used in retrieval-augmented generation, where relevant documents are retrieved based on semantic similarity for enhanced LLM responses.
LangChain Embeddings - Embeddings are numerical representations of text that capture semantic meaning. LangChain integrates various embedding models (e.g., OpenAI, Hugging Face) for tasks like similarity searches, clustering, and text classification.
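A minimal sketch combining an embedding model with a FAISS vector store for similarity search (the sample texts are placeholders; HuggingFaceBgeEmbeddings downloads a default BGE model on first use):

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceBgeEmbeddings()
store = FAISS.from_texts(
    ["LangChain integrates FAISS for vector search.",
     "Groq provides high-throughput LLM inference."],
    embeddings,
)
hits = store.similarity_search("Which vector store does LangChain support?", k=1)
print(hits[0].page_content)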
System, Human, and AI Messages - LangChain formalizes interaction structures:
-- System Messages: Define initial context or rules for the LLM’s behavior.
-- Human Messages: Represent user inputs in conversational applications.
-- AI Messages: Capture LLM-generated responses, maintaining interaction consistency.
-- Hugging Face: Provides access to a wide range of pre-trained transformers for tasks like summarization and translation.
-- Mistral: Integrates with Mistral’s LLMs, known for efficiency and high performance.
-- Ollama: Focuses on fine-tuned LLMs for specific domains or tasks.
-- Groq: Utilizes specialized AI accelerators for high-throughput processing.
-- ChatOllama: Optimized for dialogue and knowledge-based conversations using Ollama’s domain-specific expertise.
-- ChatHuggingface: Leverages Hugging Face transformers for versatile NLP tasks.
-- ChatMistral: Offers efficient and concise conversational capabilities.
-- ChatGroq: Uses Groq's specialized hardware for real-time and large-scale chat applications.
-- NamedTemporaryFile: Creates a temporary file with a name that can be accessed in the filesystem. The file is automatically deleted when it is closed unless specified otherwise with delete=False.
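A short sketch of how an uploaded file can be persisted to disk before being handed to OpenCV or MediaPipe (uploaded_file stands in for the object returned by st.file_uploader):

from tempfile import NamedTemporaryFile

# uploaded_file is assumed to be the object returned by st.file_uploader(...)
with NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
    temp_file.write(uploaded_file.read())
    temp_path = temp_file.name  # the file survives the `with` block because delete=False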
The pipeline module is highly versatile and supports different frameworks, including PyTorch and TensorFlow, ensuring compatibility with a broad range of environments. For example, a sentiment analysis pipeline could take user reviews as input and return the sentiment ("positive" or "negative") with confidence scores. Similarly, a summarization pipeline can condense lengthy documents into concise summaries, aiding in quick content digestion.
-- SAM model registry - The model registry is a centralized repository of pre-trained SAM models. It allows users to access different SAM model variants optimized for specific tasks or datasets. This component ensures flexibility and scalability, enabling developers to select a model variant that balances segmentation quality and computational efficiency.
-- SAM Predictor - The predictor module performs segmentation based on user-provided prompts, such as points, bounding boxes, or text. It leverages SAM's underlying architecture to refine segmentation boundaries dynamically, ensuring high accuracy. This functionality is particularly useful in interactive settings where users guide the segmentation process iteratively, such as medical imaging or annotation tasks
-- SAM Automatic Mask Generator - This module automatically generates masks for all objects in an image without requiring user prompts. It uses SAM's learned features to identify and segment objects of interest efficiently, making it ideal for batch processing or scenarios where precise user input is unavailable. This component facilitates applications in image editing, object tracking, and autonomous systems where real-time mask generation is critical.
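A minimal sketch of automatic mask generation with SAM; the checkpoint path and input image name are assumptions, and the ViT-H checkpoint must be downloaded separately:

import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cpu")  # assumed local checkpoint
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # assumed input image
masks = mask_generator.generate(image)  # list of dicts with 'segmentation', 'area', 'bbox', ...
largest = max(masks, key=lambda m: m["area"])["segmentation"]  # boolean mask of the largest object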
-- Stable Diffusion Inpaint Pipeline - The Stable Diffusion Inpaint Pipeline is a specialized variant of the stable diffusion model, designed for image inpainting tasks. Its architecture builds upon the core stable diffusion framework, incorporating components like a UNet model, variational autoencoder (VAE), and text encoder for guided generation. The roles of these components are outlined below, followed by a short usage sketch.
The UNet model progressively denoises the latent representation of the image, guided by textual prompts or other conditioning information.
The VAE encodes the input image into a latent space and decodes the generated latent representations back into the image space.
A text encoder (e.g., from CLIP or OpenAI's transformers) allows the model to align its inpainting results with textual descriptions, enabling guided completion of missing or masked regions.
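A minimal inpainting sketch under these assumptions: the runwayml/stable-diffusion-inpainting checkpoint is used (as in the dashboard code later), and scene.png / mask.png are placeholder files where white mask pixels mark the region to regenerate:

from diffusers.pipelines import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")

init_image = Image.open("scene.png").convert("RGB").resize((512, 512))  # image to edit
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))   # white = area to inpaint
result = pipe(prompt="a red vintage car parked on the street",
              image=init_image, mask_image=mask_image)
result.images[0].save("inpainted.png")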
The OpenCV (cv2) module is a more comprehensive computer vision library, offering extensive support for both image and video processing. It focuses on low-level image manipulations and advanced tasks like object detection, facial recognition, and real-time video analysis. OpenCV supports a variety of image formats and provides powerful tools for edge detection, morphological transformations, feature extraction, and contour analysis. It integrates well with hardware acceleration (via CUDA), making it highly efficient for real-time applications. OpenCV also excels in integrating machine learning and deep learning models for image-based tasks like segmentation, classification, and tracking.
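A small sketch of the kind of low-level processing OpenCV supports (the file names are placeholders):

import cv2

img = cv2.imread("photo.jpg")                       # OpenCV loads images in BGR order
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                   # edge detection
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cv2.drawContours(img, contours, -1, (0, 255, 0), 2) # outline detected contours in green
cv2.imwrite("contours.png", img)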
-- Tasks module -- This module provides a high-level API for performing end-to-end machine learning tasks, including model inference and result interpretation. It abstracts much of the underlying complexity, enabling developers to focus on task-specific requirements. For example, pre-built tasks like object detection, face detection, and gesture recognition can be readily deployed using this module.
-- Vision Module - The vision module is a cornerstone of MediaPipe, offering pipelines for computer vision tasks. It includes models and utilities for face detection, hand tracking, pose estimation, and object tracking. These pipelines are optimized for real-time processing, even on mobile devices, using frameworks like TensorFlow Lite. For instance, the hand tracking solution detects and tracks hands in video frames, providing real-time landmarks for applications like virtual keyboards or sign language interpretation.
-- Landmark pb2 module - This module is a protocol buffer (protobuf) definition used to represent and serialize the landmarks detected by MediaPipe’s vision models. Landmarks are key points on an object or body (e.g., joints in pose estimation or facial landmarks in face detection). This module is critical for downstream tasks like calculating angles between joints or applying gestures in augmented reality. Developers can parse and manipulate the detected landmarks using this module for fine-grained analysis and integration into applications.
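A minimal face-landmark detection sketch using the tasks and vision modules; the face_landmarker.task model asset and face.jpg are assumed local files:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"),  # assumed model path
    num_faces=1,
)
detector = vision.FaceLandmarker.create_from_options(options)

mp_image = mp.Image.create_from_file("face.jpg")  # assumed input image
result = detector.detect(mp_image)
if result.face_landmarks:
    print(len(result.face_landmarks[0]))  # number of landmarks on the first detected face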
Streamlit allows users to display data using widgets, charts, tables, and media components like images, videos, or audio files. It supports integration with popular data visualization libraries such as Matplotlib, Plotly, and Seaborn, as well as frameworks for handling large datasets like Pandas. Streamlit's reactive framework ensures that whenever a user interacts with a widget (e.g., sliders, dropdowns, or text inputs), the application dynamically re-executes, providing an updated output. This functionality makes Streamlit ideal for building dashboards, data exploration tools, and interactive ML model demonstrations.
import os
import time
import tempfile
from tempfile import NamedTemporaryFile
from io import BytesIO

import validators
import requests
import numpy as np
import cv2
import matplotlib.pyplot as plt
from PIL import Image
import streamlit as st
from dotenv import load_dotenv

from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from diffusers.pipelines import StableDiffusionInpaintPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from transformers import AutoModelForSeq2SeqLM

import mediapipe
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.framework.formats import landmark_pb2

from langchain_groq.chat_models import ChatGroq
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_mistralai import ChatMistralAI
from langchain.schema import SystemMessage, HumanMessage, AIMessage
from langchain_ollama.chat_models import ChatOllama
st.markdown( """ <style> /* Page Styling */ body { background-color: black; background-size: cover; background-position: center; background-repeat: no-repeat; } .sidebar .sidebar-content { background-color: black ; } .title { color: red; font-size: 50px; font-weight: bold; text-align: center; } .sub-title { color: cyan; font-size: 30px; text-align: center; margin: 20px 0; } .sub-sub-title { color: white; font-size: 25px; text-align: center; margin: 30px 0; } .header { font-size: 36px; color: cyan; text-align: center; margin: 40px 0; } </style> """, unsafe_allow_html=True )
# Load API keys from a local .env file instead of hard-coding them in the source.
load_dotenv()
MISTRAL_API_KEY = os.environ["MISTRAL_API_KEY"]
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
models = {"Chatbot": ["ChatOllama", "ChatMistralAI", "ChatGroq"]}
main_tasks = st.sidebar.multiselect("Choose domain", ["NLP", "CV"])

for task in main_tasks:
    if task == "NLP":
        st.markdown("<div class='title'>GEN AI LLM DEPLOYMENT DASHBOARD</div>", unsafe_allow_html=True)
        st.write("\n")
        st.markdown("<div class='sub-title'>Natural Language Processing</div>", unsafe_allow_html=True)
        selected_tasks = ["Chatbot", "Question-Answering", "Text-Translation", "Sentiment Analysis",
                          "Summarization", "Text Classification", "RAG"]
        task_options = st.sidebar.multiselect("Select Tasks", selected_tasks)

        for task in task_options:
            if task == "Chatbot":
                def Generative_AI():
                    selected_models = {}
                    for task in task_options:
                        if task == "Chatbot":
                            selected_models[task] = st.sidebar.multiselect(f"Select models for {task}", models[task])
                    for task, model_list in selected_models.items():
                        if model_list:
                            if task == "Chatbot":
                                st.markdown("<div class='sub-title'>ChatBots</div>", unsafe_allow_html=True)
                                for model in model_list:
                                    if model == "ChatOllama":
                                        st.markdown("<div class='sub-sub-title'>ChatOllama</div>", unsafe_allow_html=True)
                                        llm = ChatOllama(model="llama3.1", temperature=0.8, max_tokens=1024)
                                        prompt = ChatPromptTemplate.from_template(
                                            """
                                            ### Task ###
                                            You will act like the system specified {system_message}
                                            and will answer the questions asked by the user in the form of a: {human_message}
                                            """
                                        )
                                        system_message = st.text_input("Enter message for the system : ")
                                        if 'chat' not in st.session_state:
                                            st.session_state['chat'] = [SystemMessage(content=system_message)]

                                        def get_response(human_message):
                                            if human_message:
                                                st.session_state['chat'].append(HumanMessage(content=human_message))
                                                response = llm.invoke(st.session_state['chat'])
                                                st.session_state['chat'].append(AIMessage(content=response.content))
                                                return response.content

                                        human_message = st.text_input("Enter your message : ")
                                        if st.button("Generate Text"):
                                            with st.spinner("Generating..."):
                                                answer = get_response(human_message)
                                                generated_answer = answer
                                                st.success("Generated Text:")
                                                st.write(generated_answer)
elif model=="ChatMistralAI": st.markdown("<div class='sub-sub-title'>ChatMistralAI</div>", unsafe_allow_html=True) llm= ChatMistralAI(api_key=MISTRAL_API_KEY, temperature=0.8, max_tokens=1024) prompt = ChatPromptTemplate.from_template( """ ### Task You are a helpful assistant designed to act like a system specified by the user as {system_message}. ### You will act like the system specified and will answer the questions asked by the user in the form of a: {human_message} """ ) system_message= st.text_input("Enter message for the system : ") human_message= st.text_input("Enter your message : ") if 'chat' not in st.session_state: st.session_state['chat']= [ SystemMessage(content=system_message), ] def get_response(human_message): st.session_state['chat'].append(HumanMessage(content=human_message)) response= llm.invoke(st.session_state['chat']) st.session_state['chat'].append(AIMessage(content=response.content)) return response.content if st.button("Generate Text"): with st.spinner("Generating..."): answer = get_response(prompt.format(system_message=system_message, human_message=human_message)) generated_text = answer st.success("Generated Text:") st.write(generated_text)
elif model=="ChatGroq": st.markdown("<div class='sub-sub-title'>ChatGroq</div>", unsafe_allow_html=True) llm= ChatGroq(api_key=GROQ_API_KEY, model_name="llama3-70b-8192", temperature=0.8, max_tokens=1024) prompt = ChatPromptTemplate.from_template( """ ### Task You are a helpful assistant designed to act like a system specified by the user as {system_message}. ### You will act like the system specified and will answer the questions asked by the user in the form of a: {human_message} """ ) system_message= st.text_input("Enter message for the system : ") human_message= st.text_input("Enter your message : ") if 'chat' not in st.session_state: st.session_state['chat']= [ SystemMessage(content=system_message), ] def get_response(human_message): st.session_state['chat'].append(HumanMessage(content=human_message)) response= llm.invoke(st.session_state['chat']) st.session_state['chat'].append(AIMessage(content=response.content)) return response.content if st.button("Generate Text"): with st.spinner("Generating..."): answer = get_response(prompt.format(system_message= system_message, human_message= human_message)) generated_text = answer st.success("Generated Text:") st.write(generated_text) else: print("No model selected") Generative_AI()
elif task== "Text-Translation": def text_trans(): for task in task_options: if task=="Text-Translation": st.markdown("<div class='sub-title'>Translate text from One language to another :</div>", unsafe_allow_html=True) tokenizer= MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt") model= MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt") src_lang= st.text_input("Enter source language code : ") tgt_lang= st.text_input("Enter target language code : ") tokenizer.src_lang = src_lang tokenizer.tgt_lang = tgt_lang qa_pipeline = pipeline("translation", model=model, tokenizer=tokenizer, src_lang= tokenizer.src_lang, tgt_lang= tokenizer.tgt_lang) sentence = st.text_area("Enter the sentence here...") translation= st.text_input("Enter language to be translated : ") if sentence: result = qa_pipeline(sentence, translation) if st.button("Translate"): with st.spinner("Translating..."): st.subheader("Answer:") st.success("Translated Text:") st.write(result[0]['translation_text']) text_trans()
elif task=="Question-Answering": def qna(): for task in task_options: if task=="Question-Answering": st.markdown("<div class='sub-title'>Question Answering</div>", unsafe_allow_html=True) context = st.text_area("Enter the context here...") question = st.text_input("Enter your question here...") qa_pipeline = pipeline("question-answering", model="google-bert/bert-large-uncased-whole-word-masking-finetuned-squad") if question and context: result = qa_pipeline({"question": question, "context": context}) if st.button("Answer"): with st.spinner("Answering..."): st.subheader("Answer:") st.success("Provided Answer:") st.write(result['answer']) else: st.write("Please provide both a question and a context.") qna()
elif task=="Summarization": def summarize(): for task in task_options: if task=="Summarization": st.markdown("<div class='sub-title'>Summarize texts </div>", unsafe_allow_html=True) model_name = "facebook/bart-large-cnn" # Replace with the actual Llama model name on Hugging Face tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSeq2SeqLM.from_pretrained(model_name) generator = pipeline("summarization", model=model, tokenizer=tokenizer, trust_remote_code=True) user_input = st.text_area("Enter your article:") if st.button("Summarize"): with st.spinner("Summarizing..."): response = generator(user_input) generated_text = response[0]['summary_text'] st.success("Summarized Text:") st.write(generated_text) summarize()
elif task=="Sentiment Analysis": def sentiment(): for task in task_options: if task=="Sentiment Analysis": st.markdown("<div class='sub-title'>Sentiment Analysis </div>", unsafe_allow_html=True) model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student" model = AutoModelForSequenceClassification.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) nlp = pipeline('text-classification', model=model, tokenizer=tokenizer) question= st.text_input("Enter you sentence : ") if st.button("Generate Sentiment"): with st.spinner("Generating..."): st.subheader("Answer:") result = nlp(question) st.success("Predicted Sentiment:") st.write({result[0]['label'] : result[0]['score']}) sentiment()
elif task== "Text Classification": def classify(): for task in task_options: if task=="Text Classification": st.markdown("<div class='sub-title'>Classify Text with labels </div>", unsafe_allow_html=True) model_name = "facebook/bart-large-mnli" model = AutoModelForSequenceClassification.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) nlp = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer) question= st.text_input("Enter you question : ") labels= st.multiselect("Enter labels to classify ", ['travelling', 'sports', 'food', 'education']) if st.button("Classify"): with st.spinner("Classifying..."): st.subheader("Answer:") result = nlp(question, labels) st.success("Classified Text:") st.write({result['labels'][int(max(result['scores']))] : max(result['scores'])}) classify()
elif task== "RAG": def rag(): for task in task_options: if task=="RAG": st.markdown("<div class='sub-title'>Retrieval Augmented Generation</div>", unsafe_allow_html=True) if "vector" not in st.session_state: website= st.text_input("Enter a URL:", placeholder="https://example.com") if website: if validators.url(website): st.session_state.embeddings= HuggingFaceBgeEmbeddings() st.session_state.loader= WebBaseLoader(website) st.session_state.docs= st.session_state.loader.load() st.session_state.text_doc= RecursiveCharacterTextSplitter(chunk_size= 1000, chunk_overlap = 200) st.session_state.fin_docs= st.session_state.text_doc.split_documents(st.session_state.docs) st.session_state.vectors= FAISS.from_documents(st.session_state.fin_docs, st.session_state.embeddings) llm= ChatGroq(groq_api_key= GROQ_API_KEY, model= "llama3-8b-8192") prompt= ChatPromptTemplate.from_template( """ ### CONTEXT {context} (Description: This section provides all necessary background, relevant information, and framing details that are crucial for understanding the request or generating a relevant response. This may include task instructions, tone or style preferences, background knowledge, and specific objectives. Include any special considerations or additional details here.) ### INPUT {input} (Description: This section contains the user's main question, command, or prompt that requires a response. This is the primary user request, and the answer or generation will be based on both the INPUT and CONTEXT.) ### RESPONSE (Note: Answer or generate text based on the CONTEXT and INPUT provided above. Consider tone, style, and any additional requirements included in CONTEXT. Focus on delivering a precise, contextually relevant, and coherent response to the INPUT.) """ ) doc_chain= create_stuff_documents_chain(llm, prompt) retriever= st.session_state.vectors.as_retriever() ret_chain= create_retrieval_chain(retriever, doc_chain) prompt= st.text_input("Enter your Question : ") if prompt : if st.button("Query Website or Document"): with st.spinner("Querying..."): st.subheader("Retrieving ...") response= ret_chain.invoke({'input' : prompt}) start= time.process_time() print(f"Response Time = {time.process_time() - start}") st.success("Retrieved Data:") st.write(response['answer']) with st.expander("Document Similarity Search"): for i, doc in enumerate(response['context']): st.write (doc.page_content) st.write('---------------------------') rag() else: print("No task selected")
    else:
        selected_tasks = ["Image Inpainting with Automatic Image Captioning",
                          "Facial Keypoint detection with Landmark Recognition"]
        task_options = st.sidebar.multiselect("Select Tasks", selected_tasks)
        for tasks in task_options:
            if tasks == "Image Inpainting with Automatic Image Captioning":
                st.markdown(
                    """
                    <style>
                    /* Page Styling */
                    body {
                        background-color: black;
                        background-size: cover;
                        background-position: center;
                        background-repeat: no-repeat;
                    }
                    .title { color: red; font-size: 50px; font-weight: bold; text-align: center; }
                    .sub-title { color: cyan; font-size: 35px; text-align: center; margin: 20px 0; }
                    .header { font-size: 25px; color: cyan; text-align: center; margin: 40px 0; }
                    .subheader { font-size: 20px; color: yellow; text-align: center; margin: 40px 0; }
                    .custom-text { font-size: 15px; color: purple; text-align: center; margin: 40px 0; }
                    </style>
                    """,
                    unsafe_allow_html=True,
                )
                st.markdown("<div class='title'>Image Inpainting with Automatic Image Captioning</div>", unsafe_allow_html=True)
                st.write("\n")
                st.markdown("<div class='sub-title'>Inpainting and Caption Generation with Stable Diffusion, SAM and BLIP</div>", unsafe_allow_html=True)
                st.markdown("<div class='subheader'>Upload and generate an image and get a caption!</div>", unsafe_allow_html=True)
                st.markdown("<div class='custom-text'>Loading Image</div>", unsafe_allow_html=True)
                uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "jpeg", "png"], key="upload")
                if uploaded_file is not None:
                    with NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
                        temp_file.write(uploaded_file.read())
                        temp_path = temp_file.name
                    image = cv2.imread(temp_path)
                    image = cv2.resize(image, (512, 512))
                    img_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                    img_rgb = cv2.resize(img_rgb, (512, 512))
                    st.image(img_rgb, caption="Uploaded Image", use_column_width=True)
                    st.write(f"Image loaded with shape: {img_rgb.shape}")

                    st.markdown("<div class='sub-header'>Generating segmentation masks</div>", unsafe_allow_html=True)
                    model_type = 'vit_h'
                    checkpoint = "C:/Users/DIBYOJIT/Downloads/sam_vit_h_4b8939.pth"
                    sam = sam_model_registry[model_type](checkpoint=checkpoint).to("cpu")

                    # Generate segmentation masks
                    mask_generator = SamAutomaticMaskGenerator(sam)
                    result = mask_generator.generate(img_rgb)
                    if result:
                        sorted_result = [segment['segmentation'] for segment in sorted(result, key=lambda x: x['area'], reverse=True)]
                        largest_mask = sorted_result[0]
                        plt.figure(figsize=(8, 6))
                        plt.imsave('segmentation_mask.png', largest_mask)
                        plt.imshow(largest_mask)
                    else:
                        raise ValueError("No segmentation masks generated.")
                    st.image('segmentation_mask.png', caption="Segmentation Mask", use_column_width=True)

                    st.markdown("<div class='custom-text'>Image and Mask URL creation and prompting</div>", unsafe_allow_html=True)

                    def download_image(url):
                        if not validators.url(url):
                            st.error("Invalid URL. Please enter a valid image URL.")
                            return None
                        try:
                            response = requests.get(url)
                            response.raise_for_status()
                            url_image = Image.open(BytesIO(response.content)).convert("RGB")
                            return url_image
                        except requests.exceptions.RequestException as e:
                            st.error(f"Error fetching the image: {e}")
                            return None
                        except Exception as e:
                            st.error(f"Error processing the image: {e}")
                            return None

                    image_url = st.text_input("Enter the URL of the image : ")
                    mask_url = st.text_input("Enter the URL of the mask : ")
                    filename = st.text_input("Enter the filename (without extension): ").strip()
                    extension = st.text_input("Enter the file extension (e.g., png, jpg): ").strip()
                    file_path = f"{filename}.{extension}"
                    init_image = download_image(image_url)
                    mask_image = download_image(mask_url)
                    prompt = st.text_input("Enter the prompt : ")
                    st.write("Prompt: ", prompt)
                    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
                    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

                    if st.button("Generate Image"):
                        st.markdown("<div class='custom-text'>Generating Inpainted Image</div>", unsafe_allow_html=True)
                        with st.spinner("Generating..."):
                            image_to_image = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
                            result = image_to_image(prompt=prompt, image=init_image, mask_image=mask_image)
                            st.success("Generated Image: ")
                            generated_image = result.images[0]
                            # Keep the image in RGB order; plt.imsave expects RGB data.
                            plt.imsave(file_path, np.array(generated_image))
                            st.image(file_path, caption="Generated Image", use_column_width=True)

                    diffused_url = st.text_input("Enter the URL of the diffused image : ")
                    if validators.url(diffused_url):
                        st.success("Image URL is valid.")
                        st.markdown("<div class='custom-text'>Generating Inpainted Image</div>", unsafe_allow_html=True)
                        st.write("Automatic Captioning of the Diffused Image ")
                        diffused_image = Image.open(requests.get(diffused_url, stream=True).raw).convert('RGB')
                        if st.button("Generate Caption"):
                            with st.spinner("Generating..."):
                                inputs = processor(diffused_image, return_tensors="pt")
                                outputs = model.generate(**inputs)
                                caption = processor.decode(outputs[0], skip_special_tokens=True)
                                st.success("Generated Caption: ")
                                st.write(caption)
            else:
                st.markdown("<div class='title'>Sample Landmark Detection</div>", unsafe_allow_html=True)
                graphics = st.sidebar.multiselect("Choose type of graphics", ['Image', 'Video'])
                for data in graphics:
                    if data == "Image":
                        st.markdown("<div class='header'>Landmark Detection on Image</div>", unsafe_allow_html=True)
                        uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "jpeg", "png"], key="Image")
                        if uploaded_file is not None:
                            with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as temp_file:
                                temp_file.write(uploaded_file.read())
                                temp_file_path = temp_file.name
                            mp_image = mp.Image.create_from_file(temp_file_path)
                            model_path = "C:/Users/DIBYOJIT/Downloads/face_landmarker.task"

                            def draw_landmarks(img_rgb, detection_result):
                                face_landmarks_list = detection_result.face_landmarks
                                annotated_image = np.copy(img_rgb)
                                for idx in range(len(face_landmarks_list)):
                                    face_landmarks = face_landmarks_list[idx]
                                    face_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
                                    face_landmarks_proto.landmark.extend([
                                        landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z)
                                        for landmark in face_landmarks
                                    ])
                                    mp_drawing = mp.solutions.drawing_utils
                                    mp_drawing.draw_landmarks(
                                        image=annotated_image,
                                        landmark_list=face_landmarks_proto,
                                        connections=mp.solutions.face_mesh.FACEMESH_TESSELATION,
                                        landmark_drawing_spec=None,
                                        connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_tesselation_style()
                                    )
                                    mp_drawing.draw_landmarks(
                                        image=annotated_image,
                                        landmark_list=face_landmarks_proto,
                                        connections=mp.solutions.face_mesh.FACEMESH_CONTOURS,
                                        landmark_drawing_spec=None,
                                        connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_contours_style()
                                    )
                                    mp_drawing.draw_landmarks(
                                        image=annotated_image,
                                        landmark_list=face_landmarks_proto,
                                        connections=mp.solutions.face_mesh.FACEMESH_IRISES,
                                        landmark_drawing_spec=None,
                                        connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_iris_connections_style()
                                    )
                                return annotated_image

                            base_options = python.BaseOptions(model_asset_path=model_path)
                            options = vision.FaceLandmarkerOptions(base_options=base_options,
                                                                   output_face_blendshapes=True,
                                                                   output_facial_transformation_matrixes=True,
                                                                   num_faces=1)
                            detector = vision.FaceLandmarker.create_from_options(options)
                            detection_result = detector.detect(mp_image)
                            annotated_image = draw_landmarks(mp_image.numpy_view(), detection_result)
                            st.image(annotated_image, caption="Face Landmarks", use_column_width=True)

                            face_landmark_list = detection_result.face_landmarks
                            for idx in range(len(face_landmark_list)):
                                face_landmarks = face_landmark_list[idx]
                                face_landmarks = [(landmark.x, landmark.y) for landmark in face_landmarks]

                            def get_face_roi_from_landmarks(landmarks, image_shape):
                                x = []
                                y = []
                                for land in landmarks:
                                    x.append(land[0])
                                    y.append(land[1])
                                x_min = int(min(x) * image_shape[1])
                                y_min = int(min(y) * image_shape[0])
                                x_max = int(max(x) * image_shape[1])
                                y_max = int(max(y) * image_shape[0])
                                return x_min, y_min, x_max, y_max

                            def generate_face_mask_roi(processed_image, face_landmarks, annotated_image_shape):
                                if face_landmarks:
                                    # Get bounding box for the face from the landmarks
                                    x_min, y_min, x_max, y_max = get_face_roi_from_landmarks(face_landmarks, annotated_image_shape)
                                    bbox = (x_min, y_min, x_max, y_max)
                                    cv2.rectangle(processed_image, (x_min, y_min), (x_max, y_max), (0, 0, 255), 2)
                                    cv2.putText(processed_image, "Face", (x_min, y_min - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
                                    plt.imsave("ROI_of_face.png", processed_image)
                                    st.image('ROI_of_face.png', caption="Facial Bounding Box with Landmarks", use_column_width=True)
                                    return processed_image, bbox, face_landmarks

                            generate_face_mask_roi(annotated_image, face_landmarks, annotated_image.shape)
                    else:
                        def process_video_frame(frame, face_landmarker):
                            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                            mp_image = mediapipe.Image(image_format=mediapipe.ImageFormat.SRGB, data=rgb_frame)
                            results = face_landmarker.detect(mp_image)
                            annotated_frame = frame.copy()
                            if results.face_landmarks:
                                for landmarks in results.face_landmarks:
                                    if landmarks:
                                        face_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
                                        face_landmarks_proto.landmark.extend([
                                            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z)
                                            for landmark in landmarks
                                        ])
                                        mp_drawing = mediapipe.solutions.drawing_utils
                                        mp_drawing.draw_landmarks(
                                            image=annotated_frame,
                                            landmark_list=face_landmarks_proto,
                                            connections=mp.solutions.face_mesh.FACEMESH_TESSELATION,
                                            landmark_drawing_spec=None,
                                            connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_tesselation_style()
                                        )
                                        mp_drawing.draw_landmarks(
                                            image=annotated_frame,
                                            landmark_list=face_landmarks_proto,
                                            connections=mp.solutions.face_mesh.FACEMESH_CONTOURS,
                                            landmark_drawing_spec=None,
                                            connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_contours_style()
                                        )
                                        mp_drawing.draw_landmarks(
                                            image=annotated_frame,
                                            landmark_list=face_landmarks_proto,
                                            connections=mp.solutions.face_mesh.FACEMESH_IRISES,
                                            landmark_drawing_spec=None,
                                            connection_drawing_spec=mp.solutions.drawing_styles.get_default_face_mesh_iris_connections_style()
                                        )
                                for idx in range(len(results.face_landmarks)):
                                    face_landmarks = results.face_landmarks[idx]
                                    face_landmarks = [(landmark.x, landmark.y) for landmark in face_landmarks]
                                    x_min, y_min, x_max, y_max = get_face_roi_from_landmarks(face_landmarks, annotated_frame.shape)
                                    cv2.rectangle(annotated_frame, (x_min, y_min), (x_max, y_max), (255, 0, 0), 2)
                                    cv2.putText(annotated_frame, "Facial Landmarks", (x_min, y_min - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 255), 2)
                            return annotated_frame

                        def get_face_roi_from_landmarks(landmarks, image_shape):
                            x = []
                            y = []
                            for land in landmarks:
                                x.append(land[0])
                                y.append(land[1])
                            x_min = int(min(x) * image_shape[1])
                            y_min = int(min(y) * image_shape[0])
                            x_max = int(max(x) * image_shape[1])
                            y_max = int(max(y) * image_shape[0])
                            return x_min, y_min, x_max, y_max

                        webcam_or_custom_video = st.sidebar.multiselect("Choose Webcam or Custom Video", ["Webcam", "Custom Video"])
                        for vid in webcam_or_custom_video:
                            if vid == "Webcam":
                                st.markdown("<div class='header'>Landmark Detection on Webcam</div>", unsafe_allow_html=True)

                                def main():
                                    cap = cv2.VideoCapture(0)
                                    model_path = "C:/Users/DIBYOJIT/Downloads/face_landmarker.task"
                                    if not cap.isOpened():
                                        st.error("Error: Could not open video.")
                                        exit()
                                    options = vision.FaceLandmarkerOptions(
                                        base_options=python.BaseOptions(model_asset_path=model_path),
                                        num_faces=1
                                    )
                                    face_landmarker = vision.FaceLandmarker.create_from_options(options)
                                    while True:
                                        ret, frame = cap.read()
                                        if not ret:
                                            break
                                        annotated_frame = process_video_frame(frame, face_landmarker)
                                        cv2.imshow("Facial Landmarks", annotated_frame)
                                        if cv2.waitKey(1) & 0xFF == ord('q'):
                                            break
                                    cap.release()
                                    cv2.destroyAllWindows()

                                if __name__ == "__main__":
                                    main()
                            else:
                                def main2():
                                    st.markdown("<div class='header'>Landmark Detection on Custom Video</div>", unsafe_allow_html=True)
                                    uploaded_file = st.file_uploader("Choose a Video...", type=["mp4", "mp3", "avi"], key="upload")
                                    if uploaded_file is not None:
                                        with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as temp_file:
                                            temp_file.write(uploaded_file.read())
                                            temp_file_path = temp_file.name
                                        st.video(temp_file_path)
                                        model_path = "C:/Users/DIBYOJIT/Downloads/face_landmarker.task"
                                        cap = cv2.VideoCapture(temp_file_path)
                                        if not cap.isOpened():
                                            st.error("Error: Could not open video.")
                                            exit()
                                        options = vision.FaceLandmarkerOptions(
                                            base_options=python.BaseOptions(model_asset_path=model_path),
                                            num_faces=1
                                        )
                                        face_landmarker = vision.FaceLandmarker.create_from_options(options)
                                        while True:
                                            ret, frame = cap.read()
                                            if not ret:
                                                break
                                            annotated_frame = process_video_frame(frame, face_landmarker)
                                            cv2.imshow("Facial Landmarks", annotated_frame)
                                            if cv2.waitKey(1) & 0xFF == ord('q'):
                                                break
                                        cap.release()
                                        cv2.destroyAllWindows()

                                if __name__ == "__main__":
                                    main2()
Screenshots of the Examples given in the UI
1.1 For ChatOllama
1.2 For ChatMistral
1.3 For ChatGroq
This project combines NLP and Computer Vision use cases under Generative AI, dynamically testing the system with user-centred queries and problem statements; the quality of the responses it generates demonstrates its effectiveness in solving those problems.
The development of a "Generative AI Multitasking Dashboard" marks a significant leap forward in seamlessly integrating Natural Language Processing (NLP) and Computer Vision (CV) with generative AI capabilities. By leveraging cutting-edge frameworks like Huggingface, LangChain, and Groq, alongside APIs such as Mistral and Ollama, this project demonstrates the transformative potential of generative AI across diverse tasks and domains.
Through Huggingface, the dashboard harnesses state-of-the-art pre-trained models for text generation, classification, and summarization, ensuring robustness and scalability. LangChain’s modularity further enhances the system by enabling dynamic orchestration of language models, offering powerful workflow customization for real-world applications. Groq's high-performance AI acceleration hardware ensures the seamless handling of complex, multitask operations, achieving unprecedented computational efficiency.
The integration of APIs like Mistral and Ollama expands the dashboard's generative capabilities, allowing it to produce high-quality, context-aware outputs in diverse modalities. This interoperability between frameworks and APIs underscores the platform's versatility, meeting the demands of multitasking workflows in industries such as healthcare, finance, and content creation.
The "Generative AI Multitasking Dashboard" sets a benchmark for the future of AI, proving that the unification of NLP and CV with generative AI not only drives innovation but also enables practical, scalable solutions for complex, multi-dimensional problems. This holistic approach offers a pathway to unlocking the full potential of AI, providing end-users with a seamless and transformative experience in their day-to-day operations.