Lets users ask questions about images using text input. Users can select specific objects to include in a query, which streamlines asking questions and removes the need to describe an object's position in the image with spatial words.
  
- CUDA-compatible GPU with at least 8 GB VRAM
- Python >= 3.8
- GPT-4 Vision API key stored in a `.env` file as `OPENAI_API_KEY` and loaded with:

```python
from dotenv import load_dotenv

load_dotenv()
```
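As a quick sanity check (a sketch, not part of the repo; `check_api_key` is a hypothetical helper), you can confirm the key is visible once `load_dotenv()` has populated the environment:

```python
import os

# Hypothetical helper: verify the GPT-4 Vision key is available.
# Assumes load_dotenv() has already copied the .env entries
# into os.environ.
def check_api_key() -> bool:
    return bool(os.getenv("OPENAI_API_KEY"))

if not check_api_key():
    print("OPENAI_API_KEY is missing; add it to your .env file")
```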
Clone the repository:

```shell
git clone https://github.com/shetumohanto/visual-question-answering.git
cd visual-question-answering
```
Download the Segment Anything `vit_h` model checkpoint:

```shell
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```
Install the dependencies:

```shell
pip install -r requirements.txt
```
Run the app:

```shell
streamlit run app.py
```
This project's core architecture builds on the following technologies: