Enables users to query images using text input. Users can select specific objects for their queries, streamlining the process of asking questions and eliminating the need to describe an object's position within the image using spatial language.
CUDA-compatible GPU with at least 8 GB VRAM
Python>=3.8
GPT-4 Vision API key, stored in a .env file as OPENAI_API_KEY and loaded with python-dotenv:

from dotenv import load_dotenv

load_dotenv()
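After loading the .env file, the key is available through the process environment. A minimal sketch of reading it safely (the helper name get_api_key is illustrative, not part of the project):

```python
import os

def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment; fail loudly if it is missing."""
    key = os.getenv(env_var)
    if key is None:
        raise RuntimeError(f"{env_var} is not set; add it to your .env file")
    return key
```

Failing early with a clear message avoids confusing authentication errors deeper in the GPT-4 Vision call.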
git clone https://github.com/shetumohanto/visual-question-answering.git
cd visual-question-answering
Download the Segment Anything vit_h model checkpoint:

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
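Once downloaded, the checkpoint can be loaded with the segment-anything package. A sketch, assuming segment-anything and PyTorch are installed and the checkpoint sits in the working directory:

```python
def load_sam_predictor(checkpoint: str = "sam_vit_h_4b8939.pth"):
    """Load the vit_h SAM checkpoint and wrap it in a SamPredictor.

    Imports are deferred so the function is only a sketch until the
    `segment-anything` package and checkpoint file are actually present.
    """
    import torch
    from segment_anything import sam_model_registry, SamPredictor

    # Prefer the GPU when available (the 8 GB VRAM requirement above).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    sam.to(device)
    return SamPredictor(sam)
```

The predictor can then segment the object a user clicks on, so the selected region can be passed along with the question to GPT-4 Vision.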
pip install -r requirements.txt
streamlit run app.py
This project uses the following technologies in its core architecture: Segment Anything (SAM), GPT-4 Vision, and Streamlit.