A simple yet versatile application built with Gradio, integrating various open-source models from Hugging Face. The app supports a range of tasks, including Image-Text-to-Text, Visual Question Answering, and Text-to-Speech, providing an accessible interface for experimenting with these machine learning models.
| Module | Function | Source |
|---|---|---|
| #M1 | Image-Text-to-Text | microsoft/Florence-2-large |
| #M2 | Visual Question Answering | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 |
| #M3 | Text-to-Speech | coqui/XTTS-v2 |
| Task type | Task details | Usage |
|---|---|---|
| Image Captioning | Generate a short description | `!describe -s` |
| | Generate a detailed description | `!describe -m` |
| | Generate a more detailed description | `!describe -l` |
| | Localize and describe salient regions | `!densecap` |
| Object Detection | Detect objects from text inputs | `!detect obj1 obj2 ...` |
| Image Segmentation | Segment objects from text inputs | `!segment obj1 obj2 ...` |
| Optical Character Recognition | Localize and recognize text | `!ocr` |
Additional features
- Voice options: You can choose the voice for the speech synthesizer; there are currently 2 voice options:
  - David Attenborough
  - Morgan Freeman
- Random bot: A different random bot avatar is used for every input image entry.
Tested environment:
- Ubuntu 22.04
- Python 3.10.12
- NVIDIA driver 555
- CUDA 11.8 & CuDNN 8/9
| Module | GPU | CPU |
|---|---|---|
| #M1 | ✅ | ✅ |
| #M2 | ✅ | ❌ |
| #M3 | ✅ | ✅ |
- Do you need a GPU to run this app?
  - No, you can run the app on CPU, but then you can only use the `Image-Text-to-Text` and `Text-to-Speech` modules, and processing time will be longer.
- You can set `dtype` and `quantization` based on this table so that you can make full use of your GPU (see the sketch below for how this might look in a module config file). For example, with my 6GB GPU:
  - #M1: `gpu - q4 - bfp16`
  - #M2: `gpu - q8 - bfp16`
  - #M3: `cpu - fp32`
- This is the current `gpu_low` specs config.
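For illustration only, the `gpu - q4 - bfp16` style entries above might map onto a module config file (e.g. florence_2_large.yaml) roughly as in this sketch; the key names and layout are assumptions, not the repository's actual schema, so check the real config files.

```yaml
# Hypothetical sketch of a specs profile inside a module config file
# (e.g. florence_2_large.yaml). Key names are assumed for illustration.
specs:
  gpu_low:
    device: gpu        # load the model on GPU
    quantization: q4   # 4-bit quantization so #M1 fits on a ~6GB card
    dtype: bfp16       # reduced-precision dtype, as used in the profiles above
  cpu:
    device: cpu
    quantization: null # no quantization on CPU
    dtype: fp32
```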
This preparation is for running the app locally; you should use a `venv` for a local run.
- CPU: `pip install -r requirements.cpu.txt`
- GPU (CUDA 11.8 & CuDNN 8|9): `pip install -r requirements.txt`
| File | Includes |
|---|---|
| app_config.yaml | General configs |
| florence_2_large.yaml | #M1 configs |
| mini_internvl_chat_2b_v1_5.yaml | #M2 configs |
| xtts_v2.yaml | #M3 configs |
There are 3 profiles for specs configs:
| | cpu | gpu_low | gpu_high |
|---|---|---|---|
| #M1 | cpu - fp32 | gpu - q4 - bfp16 | gpu - fp32 |
| #M2 | ❌ | gpu - q8 - bfp16 | gpu - fp32 |
| #M3 | cpu - fp32 | cpu - fp32 | gpu - fp32 |
| GPU VRAM needed | 0 | ~6GB | > 16GB |
- With `gpu_high`, #M3 will use a longer speaker voice duration for synthesizing.
- The current default profile is `gpu_low`. You can set the specs profile in app_config.yaml (see the sketch below).
- If you want to create a custom profile, remember to add it to all module config files as well.
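As a minimal sketch, selecting the profile in app_config.yaml could look like the following; the key name `specs_profile` is an assumption for illustration and may differ from the actual file.

```yaml
# app_config.yaml (sketch) -- key name assumed for illustration only.
specs_profile: gpu_low   # one of: cpu | gpu_low | gpu_high
```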
- (Optional) Set `share` to `True` under `launch_config` in app_config.yaml before running the app (see the sketch below).
- (Optional) Activate your `venv`.
- Run `python app.py`
- The app is running on http://127.0.0.1:7860/
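For the optional share step, the relevant part of app_config.yaml might look like the sketch below; only `launch_config` and `share` are taken from this README, the surrounding layout is assumed. With Gradio, enabling share also generates a public share link in addition to the local URL.

```yaml
# app_config.yaml (sketch) -- only launch_config and share come from the
# steps above; the exact layout is assumed.
launch_config:
  share: True   # Gradio will also generate a public share link
```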
You need to install the NVIDIA Container Toolkit in order to use Docker with the GPU images.
Remember to change the specs profile in app_config.yaml before building images.
Docker engine build:
- CPU: `docker build -f Dockerfile.cpu -t {image_name}:{tag} .`
- GPU: `docker build -t {image_name}:{tag} .`
Docker compose build:
- CPU: Change `image` in docker-compose.cpu.yaml to your liking, then run `docker compose -f docker-compose.cpu.yaml build`
- GPU: Change `image` in docker-compose.yaml to your liking, then run `docker compose build`
Docker engine run:
- CPU: `docker run -p 7860:7860 {image_name}:{tag}`
- GPU: `docker run --gpus all -p 7860:7860 {image_name}:{tag}`
Docker compose run:
- CPU: `docker compose -f docker-compose.cpu.yaml up`
- GPU: `docker compose up`
The app is running on http://0.0.0.0:7860/
- Docker Hub repository: https://hub.docker.com/r/nguyennpa412/simple-multimodal-ai
- There are 3 tags for the 3 specs profiles: `cpu`, `gpu-low`, `gpu-high`
Docker engine run:
- `cpu`: `docker run --pull=always -p 7860:7860 nguyennpa412/simple-multimodal-ai:cpu`
- `gpu-low`: `docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-low`
- `gpu-high`: `docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-high`
Docker compose run:
- `cpu`: Set `image` in docker-compose.cpu.yaml to nguyennpa412/simple-multimodal-ai:cpu, then run `docker compose -f docker-compose.cpu.yaml up --pull=always`
- `gpu-low`: Set `image` in docker-compose.yaml to nguyennpa412/simple-multimodal-ai:gpu-low, then run `docker compose up --pull=always`
- `gpu-high`: Set `image` in docker-compose.yaml to nguyennpa412/simple-multimodal-ai:gpu-high, then run `docker compose up --pull=always`
The app is running on http://0.0.0.0:7860/