ASKBOT is a multi-modal conversational AI platform that integrates advanced language processing with voice, text, and image analysis capabilities. It serves as a sophisticated virtual assistant, enabling seamless communication with users through a range of interfaces, including text chat, a voice assistant, image generation, and image analysis.
The system mitigates hallucination by pairing a Llama language model with a knowledge-retrieval step that grounds responses in retrieved facts rather than relying on next-word prediction alone; a minimal sketch of this idea follows.
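The sketch below illustrates the retrieval-grounding idea under stated assumptions: the `KNOWLEDGE_BASE` list, the naive keyword match, and the `grounded_answer` helper are illustrative only and are not part of the ASKBOT code base.

```python
# Hypothetical sketch: ground the model's answer in retrieved facts.
# KNOWLEDGE_BASE and the keyword matching below are illustrative assumptions.
import os
from groq import Groq

client = Groq(api_key=os.getenv("GROQ_API_key"))

KNOWLEDGE_BASE = [
    "ASKBOT supports text chat, voice, image generation, and image analysis.",
    "The text assistant is powered by the llama-3.1-8b-instant model.",
]

def retrieve_facts(question: str) -> list[str]:
    # Naive keyword overlap; a real system would use embeddings or a vector store
    words = set(question.lower().split())
    return [fact for fact in KNOWLEDGE_BASE if words & set(fact.lower().split())]

def grounded_answer(question: str) -> str:
    # Prepend the retrieved facts so the model answers from them instead of guessing
    facts = "\n".join(retrieve_facts(question)) or "No relevant facts found."
    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": f"Answer using only these facts:\n{facts}\n"
                        "If the facts are insufficient, say you do not know."},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content
```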
Example test 1:
A fast, general-purpose way to test for hallucination is to automatically compare the generative AI's output for a task against one or more ground truths or expected values. Under normal circumstances a large language model predicts the next token probabilistically, which leaves room for hallucinations; this model instead combines the LLM with a knowledge-retrieval-based system to analyze facts.
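As a small illustration of that automated comparison, the sketch below flags an answer when it is not similar enough to any expected ground-truth value. The `is_hallucination` helper, the similarity threshold, and the example strings are assumptions made for this example, not part of the project's test suite.

```python
# Illustrative sketch of an automated ground-truth comparison test.
from difflib import SequenceMatcher

def is_hallucination(answer: str, ground_truths: list[str], threshold: float = 0.7) -> bool:
    """Flag the answer if it is not similar enough to any expected value."""
    best = max(SequenceMatcher(None, answer.lower(), truth.lower()).ratio()
               for truth in ground_truths)
    return best < threshold

# Example usage with made-up values
answer = "The capital of France is Paris."
expected = ["The capital of France is Paris.", "Paris"]
print(is_hallucination(answer, expected))  # False -> the answer matches a ground truth
```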
The system can also render near-photorealistic images using FLUX, a state-of-the-art image-generation model.
The scope and aim of this project was to develop a fast AI system that works at any scale and for any use case, while giving the user several different mediums of interaction.
Other requirements:
Front-end development: Streamlit for the user interface.
Back-end development: Python for the application logic and model integration.
APIs: Groq (chat completions and Whisper transcription) and Deepgram (text-to-speech).
Large language models: Llama 3.1 8B Instant, Llama 3 8B, and Llama 3.2 11B Vision (image analysis), with Whisper Large v3 Turbo for speech-to-text and FLUX.1-dev for image generation.
The text-based model consists of the front end, the API, and a session state that stores previous prompts and responses, with parameter settings exposed on the front end.
First import the required packages, then store the API key used to interact with the Groq models.
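For reference, a plausible set of imports covering the snippets in this article is shown below; the third-party package names (streamlit, groq, deepgram-sdk, audio-recorder-streamlit, Pillow) are assumptions inferred from the calls used in the code, not a confirmed requirements list.

```python
# Standard library
import os
import base64
import pathlib

# Third-party packages (install names are assumptions: streamlit, groq,
# deepgram-sdk, audio-recorder-streamlit, Pillow)
import streamlit as st
from groq import Groq
from deepgram import DeepgramClient, SpeakOptions
from audio_recorder_streamlit import audio_recorder
from PIL import Image
```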
```python
# Store the API key and create the Groq client
client = Groq(api_key=os.getenv("GROQ_API_key"))
```
Use a session state to keep prompts and responses in the assistant's conversation memory.
```python
# Create a session state list that stores all prompts and responses in memory
if "messages" not in st.session_state:
    st.session_state["messages"] = []
```
Store the model parameters as settings whose values can be changed on the front end.
```python
# Parameters
with st.sidebar.expander("⚙ Developer settings"):
    temp = st.slider("Temperature", 0.0, 2.0, value=1.0, help="Randomness of the response")
    max_tokens = st.slider("Max Tokens", 0, 1024, value=300, help="Maximum number of tokens for input and output")
    stream = st.toggle("Stream", value=True, help="Stream generated content as it is produced")
    top_p = st.slider("Top P", 0.0, 1.0, help="Cumulative probability; it is not recommended to alter both temperature and top-p")
    stop_seq = st.text_input("Stop Sequence", help="Word that stops generation")
```
```python
# Re-render every stored message of either role ("user" or "assistant") from the session state
for message in st.session_state.messages:
    # e.g. {"role": "user", "content": "hello world"}
    avatar = (
        "./assets/images/user_avatar.png" if message["role"] == "user"
        else "./assets/images/ai_avatarcom.png" if message["role"] == "assistant"
        else "👋"
    )
    with st.chat_message(message["role"], avatar=avatar):
        st.markdown(message["content"])
```
Finally, generate a response only when a prompt is given, appending the prompt under the user role and the response under the assistant role.
```python
if prompt := st.chat_input(key="input", disabled=not input):
    # Show the new user message and add it to the messages list
    recorded_audio = None
    with st.chat_message("user", avatar="./assets/images/user_avatar.png"):
        st.write(prompt)
    st.session_state.messages.append({"role": "user", "content": prompt})

    # Make the API call and show the model response
    with st.chat_message("assistant", avatar="./assets/images/ai_avatarcom.png"):
        # Create an empty container for the response
        response_text = st.empty()
        # Make the API call to Groq
        completion = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=st.session_state.messages,
            stream=stream,
            temperature=temp,
            max_tokens=max_tokens,
            stop=stop_seq,
            top_p=top_p,
        )
        # Display the full message
        full_response = ""
        if stream:
            for chunk in completion:
                full_response += chunk.choices[0].delta.content or ""
                response_text.write(full_response)
        else:
            with st.spinner("Generating"):
                full_response = completion.choices[0].message.content
                response_text.write(full_response)
    # Add the assistant message to the messages list
    st.session_state.messages.append({"role": "assistant", "content": full_response})
```
Voice model phases:
Transcribing the audio using Whisper:
```python
# Transcribe the audio to text, then process it with the LLM
def transcribe_audio(client, audio_path):
    # Transcription history
    if "voicebxt" not in st.session_state:
        st.session_state["voicebxt"] = []
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3-Turbo",
            file=audio_file,
        )
    print(transcript.text)
    prompt = transcript.text
```
Sending the transcription to the Llama model:
```python
    # Continuation of transcribe_audio: show the transcription and send it to the Llama model
    with st.chat_message("user", avatar="./assets/images/user_avatar.png"):
        st.write(prompt)
    st.session_state.voicebxt.append({"role": "user", "content": prompt})
    st.write("mic response")

    # Make the API call and show the model response
    with st.chat_message("assistant", avatar="./assets/images/ai_avatarcom.png"):
        # Create an empty container for the response
        response_text = st.empty()
        # Make the API call to Groq
        completion = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=st.session_state.voicebxt,
            stream=stream,
            temperature=temp,
            max_tokens=max_tokens,
            stop=stop_seq,
            top_p=top_p,
        )
        # Display the full message
        full_response = ""
        if stream:
            for chunk in completion:
                full_response += chunk.choices[0].delta.content or ""
                response_text.write(full_response)
        else:
            with st.spinner("Generating"):
                full_response = completion.choices[0].message.content
                response_text.write(full_response)
    # Add the assistant message to the messages list
    st.session_state.voicebxt.append({"role": "assistant", "content": full_response})

# Save the recording, then display the transcription and answer
if recorded_audio:
    audio_file = "audio.mp3"
    with open(audio_file, "wb") as f:
        f.write(recorded_audio)
    transcribed_text = transcribe_audio(client, audio_file)
```
Voice assistant (Aura) phases:
The voice model also gives users the ability to change the assistant's voice.
```python
# Store the selected voice in a session state
if "voice" not in st.session_state:
    st.session_state["voice"] = ""

# Voice option for changing voices
st.session_state.voice = st.sidebar.selectbox(
    "Choose A Voice",
    [
        "aura-athena-en", "aura-asteria-en", "aura-stella-en", "aura-luna-en",
        "aura-orion-en", "aura-perseus-en", "aura-helios-en",
    ],
    index=0,
)

# Header for the page
st.header("Use Voice Assistant Aura")

# LLM client for transcribing the audio and giving a response (speech-to-text and LLM)
client = Groq(api_key=os.getenv("GROQ_API_key"))

# Packages for Deepgram (text-to-speech model)
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

# Function for converting or transcribing audio to text
def audio_to_text(client, audio_path):
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3-Turbo",
            file=audio_file,
        )
    print(transcript.text)
    prompt = transcript.text
    return prompt

# Function that generates a response based on the transcribed text
def generate_response(input_text):
    if "voicebot" not in st.session_state:
        st.session_state["voicebot"] = []
    st.session_state.voicebot.append({"role": "user", "content": input_text})
    completion = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=st.session_state.voicebot,
        stream=True,
        temperature=0.7,
        max_tokens=300,
        stop=None,
        top_p=1,
    )
    response = ""
    for chunk in completion:
        response += chunk.choices[0].delta.content or ""
    print(response)
    return response

filename = "output.mp3"

# Function that calls the text-to-speech model
def text2speech(text_input):
    try:
        SPEAK_OPTIONS = {"text": text_input}
        deepgram = DeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))
        options = SpeakOptions(
            # Use the session state so the user can change voices
            model=st.session_state.voice,
            encoding="linear16",
            container="wav",
        )
        response = deepgram.speak.rest.v("1").save(filename, SPEAK_OPTIONS, options)
        return response.content
    except Exception as e:
        print(f"Exception: {e}")

# Autoplay the audio file after the text-to-speech model has produced it
def autoplay_audio(file_path: str):
    with open(file_path, "rb") as f:
        data = f.read()
    b64 = base64.b64encode(data).decode()
    md = f"""
    <audio controls autoplay="true">
        <source src="data:audio/mp3;base64,{b64}" type="audio/mp3">
    </audio>
    """
    st.markdown(md, unsafe_allow_html=True)

# Load the CSS file
def load_css(file_path):
    with open(file_path) as f:
        st.html(f"<style>{f.read()}</style>")

css_path = pathlib.Path("./assets/style.css")
load_css(css_path)

# Voice assistant Aura
with st.container(key="aura"):
    st.title("AURA")
    st.markdown(
        """<div class="spinner"> <div class="spinner1"></div> </div>""",
        unsafe_allow_html=True,
    )
    voice = audio_recorder(
        text="",
        recording_color="#181a9e",
        neutral_color="#c2bfb8",
        icon_size="40px",
        key="voiceai",
    )

# Play the spoken answer once a recording is captured
if voice:
    voice_file = "tovoice.mp3"
    with open(voice_file, "wb") as f:
        f.write(voice)
    stt = audio_to_text(client, voice_file)
    response = generate_response(stt)
    file = text2speech(response)
    hide_audio = """
    <style>
    audio { display: none !important; position: absolute !important; }
    </style>
    """
    st.markdown(hide_audio, unsafe_allow_html=True)
    # Audio is played automatically
    st.button("Stop response", key="stop", on_click=reset_chat, type="primary",
              help="Stopping the response clears the history")
    st.audio("output.mp3", format="audio/mp3", autoplay=True)
    translation = audio_to_text(client, voice_file)
```
The system allows the user to give a simple prompt to generate an image, and to pick the dimensions and resolution of the output image.
if "height" not in st.session_state: st.session_state["height"] = "" #store different height of an image in a session state st.session_state.height = st.sidebar.selectbox("Select Resolution", [ #all values must be a multiple of 16 [1024,768],[7680,4230],[3840,2160],[2560,1440], [1920,1080], [1280,720],[720,480],[640,360],[320,240] ], index=0) @st.cache_data def load_image(image_file): img = Image.open(image_file) return img #generate image after proving the prompt if prompt: with st.spinner("Generating image"): #Model parameters for generating images response = client.images.generate( prompt = prompt, model = "black-forest-labs/FLUX.1-dev", width=st.session_state.height[0], height=st.session_state.height[1], steps = 28, n=1, response_format ="b64_json" ) #converting the response to base64json image_data = response.data[0].b64_json #converting image json values to bytes that can be interepreted image_bytes = base64.b64decode(image_data) #make a directory called images in the environment os.makedirs("images", exist_ok=True) #index for all the files generated image_index =1 #set the path of each file to images folder image_path = os.path.join("images",f"generated_image {image_index}.png") while os.path.exists(image_path): image_index+=1 image_path = os.path.join("images",f"generated_image {image_index}.png") #open the file then with open(image_path, "wb") as image_file: image_file.write(image_bytes) #display the generated image st.image(image_path, caption="Generated Image")
Analysis model: the user can give a simple prompt, whether it is "describe this image" or "solve this math question", and the model will deliver an accurate response.
```python
prompt = st.text_input("Analysis field", value="Describe an image")

with st.popover("Import Image"):
    uploaded_file = st.file_uploader("Upload an image", type=["png", "jpg"])

# Encode the saved image as base64 so it can be sent to the vision model
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def image_to_text(valuex, base64_image):
    chat_completion = img_client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": valuex},
                {"type": "image_url", "image_url": {"url": f"data:image;base64,{base64_image}"}},
            ],
        }],
        model="llama-3.2-11b-vision-preview",
        temperature=temp,
        max_tokens=max_tokens,
        stop=stop_seq,
        top_p=top_p,
    )
    print(chat_completion.choices[0].message.content)
    return chat_completion.choices[0].message.content

if uploaded_file is not None and prompt:
    # Open the image
    image = Image.open(uploaded_file)
    # Define the save directory and ensure the folder exists
    save_folder = "uploads"
    os.makedirs(save_folder, exist_ok=True)
    # Save the image with the original filename
    save_path = os.path.join(save_folder, uploaded_file.name)
    image.save(save_path)
    base64_image = encode_image(save_path)
    described_img = image_to_text(prompt, base64_image)
    st.write(f"output: {described_img}")
elif uploaded_file is not None:
    st.write("Please input a prompt, then press Enter in the analysis field")
```
My method of review was formative feedback gathered from a small focus group.
ASKBOT is an interactive multi-modal system that lets users choose how they want to interact. It is fast, delivers accurate responses, renders clear visuals, and describes imagery in near-perfect detail.