A simple yet versatile application built with Gradio, integrating various open-source models from Hugging Face. This app supports a range of tasks, including Image-Text-to-Text, Visual Question Answering, and Text-to-Speech, providing an accessible interface for experimenting with these advanced machine learning models.
| Module | Function | Source |
|---|---|---|
| #M1 | Image-Text-to-Text | microsoft/Florence-2-large |
| #M2 | Visual Question Answering | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 |
| #M3 | Text-to-Speech | coqui/XTTS-v2 |
| Task type | Task details | Usage |
|---|---|---|
| Image Captioning | Generate a short description | `!describe -s` |
| | Generate a detailed description | `!describe -m` |
| | Generate a more detailed description | `!describe -l` |
| | Localize and describe salient regions | `!densecap` |
| Object Detection | Detect objects from text inputs | `!detect obj1 obj2 ...` |
| Image Segmentation | Segment objects from text inputs | `!segment obj1 obj2 ...` |
| Optical Character Recognition | Localize and recognize text | `!ocr` |
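As an illustration of how these chat commands map to tasks, here is a hypothetical parser sketch (the `parse_command` helper and the task mapping are illustrative only, not the app's actual code; the command names come from the table above):

```python
# Hypothetical router for the chat-style task commands above.
# The helper name and mapping are assumptions, not the app's real API.

TASKS = {
    "!describe": "Image Captioning",
    "!densecap": "Image Captioning",   # dense-caption variant per the table
    "!detect":   "Object Detection",
    "!segment":  "Image Segmentation",
    "!ocr":      "Optical Character Recognition",
}

def parse_command(message: str):
    """Split a chat message into (task, args); return None if not a command."""
    parts = message.strip().split()
    if not parts or parts[0] not in TASKS:
        return None
    return TASKS[parts[0]], parts[1:]

print(parse_command("!describe -s"))    # ('Image Captioning', ['-s'])
print(parse_command("!detect cat dog")) # ('Object Detection', ['cat', 'dog'])
```

A plain text message (no leading `!command`) would fall through to the default Visual Question Answering path in this sketch.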
Additional features
- Voice options: you can choose the voice for the Speech Synthesizer; there are currently 2 voice options:
  - David Attenborough
  - Morgan Freeman
- Random bot: a different random bot avatar is used for each input image entry.
Device support (tested with CUDA 11.8 & CuDNN9):

| Module | GPU | CPU |
|---|---|---|
| #M1 | ✅ | ✅ |
| #M2 | ✅ | ❌ |
| #M3 | ✅ | ✅ |
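The support table above can be expressed as a small lookup; a minimal sketch (the `SUPPORT` dict and helper name are assumptions for illustration, not the app's internals):

```python
# Hypothetical helper encoding the GPU/CPU support table above.
SUPPORT = {
    "#M1": {"gpu": True, "cpu": True},   # Image-Text-to-Text
    "#M2": {"gpu": True, "cpu": False},  # Visual Question Answering
    "#M3": {"gpu": True, "cpu": True},   # Text-to-Speech
}

def available_modules(device: str):
    """Return the modules that can run on the given device ('cpu' or 'gpu')."""
    return [name for name, devices in SUPPORT.items() if devices.get(device)]

print(available_modules("cpu"))  # ['#M1', '#M3']
print(available_modules("gpu"))  # ['#M1', '#M2', '#M3']
```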
- Do you need a GPU to run this app?
  - No, you can run this app on CPU, but then only the Image-Text-to-Text and Text-to-Speech modules are available, and processing time will be longer.
- You can set `dtype` and `quantization` based on this table so that you can make full use of your GPU. For example, with my 6GB GPU:
  - #M1: `gpu - q4 - bfp16`
  - #M2: `gpu - q8 - bfp16`
  - #M3: `cpu - fp32`
- This is the current `gpu_low` specs config.
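The `device - quantization - dtype` spec strings above could be decoded along these lines (a sketch only; the field names in the returned dict are assumptions, not the app's actual config schema):

```python
def parse_spec(spec: str) -> dict:
    """Decode a spec string such as 'gpu - q4 - bfp16' or 'cpu - fp32'.
    The quantization segment is optional, as in the CPU examples above."""
    parts = [p.strip() for p in spec.split("-")]
    if len(parts) == 2:            # no quantization, e.g. "cpu - fp32"
        device, dtype = parts
        quant = None
    else:
        device, quant, dtype = parts
    return {"device": device, "quantization": quant, "dtype": dtype}

print(parse_spec("gpu - q4 - bfp16"))
# {'device': 'gpu', 'quantization': 'q4', 'dtype': 'bfp16'}
print(parse_spec("cpu - fp32"))
# {'device': 'cpu', 'quantization': None, 'dtype': 'fp32'}
```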
This preparation is for a local run; you should use a `venv`.

Install dependencies:
- CPU: `pip install -r requirements.cpu.txt`
- GPU (CUDA 11.8 & CuDNN8|9): `pip install -r requirements.txt`

Config files:

| File | Includes |
|---|---|
| app_config.yaml | General configs |
| florence_2_large.yaml | #M1 configs |
| mini_internvl_chat_2b_v1_5.yaml | #M2 configs |
| xtts_v2.yaml | #M3 configs |
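The per-module config files presumably layer on top of the general config; a minimal sketch with plain dicts (all keys here are made up for illustration, not the app's actual schema):

```python
def merge_configs(general: dict, module: dict) -> dict:
    """Shallow-merge a module config over the general config; module keys win."""
    merged = dict(general)
    merged.update(module)
    return merged

# Hypothetical contents, e.g. loaded from app_config.yaml / florence_2_large.yaml
app_config = {"profile": "gpu_low", "port": 7860}
m1_config = {"model": "microsoft/Florence-2-large"}

print(merge_configs(app_config, m1_config))
# {'profile': 'gpu_low', 'port': 7860, 'model': 'microsoft/Florence-2-large'}
```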
There are 3 profiles for specs configs:

| Module | cpu | gpu_low | gpu_high |
|---|---|---|---|
| #M1 | cpu - fp32 | gpu - q4 - bfp16 | gpu - fp32 |
| #M2 | ❌ | gpu - q8 - bfp16 | gpu - fp32 |
| #M3 | cpu - fp32 | cpu - fp32 | gpu - fp32 |
| GPU VRAM needed | 0 | ~6GB | > 16GB |
- With `gpu_high`, #M3 will use a longer speaker voice duration for synthesizing.
- The current default profile is `gpu_low`. You can set the specs profile in `app_config.yaml`.
- If you want to create a custom profile, remember to add it to all module config files as well.
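The VRAM row of the table suggests a simple rule for picking a profile; a hypothetical sketch (the thresholds come from the table, the function name is an assumption):

```python
def pick_profile(vram_gb: float) -> str:
    """Suggest a specs profile from available GPU VRAM, per the table above:
    0 GB -> cpu, around 6 GB -> gpu_low, more than 16 GB -> gpu_high."""
    if vram_gb <= 0:
        return "cpu"
    if vram_gb > 16:
        return "gpu_high"
    return "gpu_low"

print(pick_profile(0))   # cpu
print(pick_profile(6))   # gpu_low
print(pick_profile(24))  # gpu_high
```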
- If you want a public link, set `share` -> `True` under `launch_config` in `app_config.yaml` before running the app.
- Activate your `venv` (optional), then run `python app.py`.
- The app is running on http://127.0.0.1:7860/
You need to install the NVIDIA Container Toolkit in order to use Docker with the GPU images.
Remember to change the specs profile in `app_config.yaml` before building images.
Docker engine build:
- CPU: `docker build -f Dockerfile.cpu -t {image_name}:{tag} .`
- GPU: `docker build -t {image_name}:{tag} .`

Docker compose build:
- CPU: change `image` in `docker-compose.cpu.yaml` to your liking, then run `docker compose -f docker-compose.cpu.yaml build`
- GPU: change `image` in `docker-compose.yaml` to your liking, then run `docker compose build`

Docker engine run:
- CPU: `docker run -p 7860:7860 {image_name}:{tag}`
- GPU: `docker run --gpus all -p 7860:7860 {image_name}:{tag}`

Docker compose run:
- CPU: `docker compose -f docker-compose.cpu.yaml up`
- GPU: `docker compose up`

The app is running on http://0.0.0.0:7860/
- Docker Hub repository: https://hub.docker.com/r/nguyennpa412/simple-multimodal-ai
- There are 3 tags for the 3 specs profiles: `cpu`, `gpu-low`, `gpu-high`
Docker engine run:
- CPU: `docker run --pull=always -p 7860:7860 nguyennpa412/simple-multimodal-ai:cpu`
- GPU (low): `docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-low`
- GPU (high): `docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-high`

Docker compose run:
- CPU: change `image` in `docker-compose.cpu.yaml` to `nguyennpa412/simple-multimodal-ai:cpu`, then run `docker compose -f docker-compose.cpu.yaml up --pull=always`
- GPU (low): change `image` in `docker-compose.yaml` to `nguyennpa412/simple-multimodal-ai:gpu-low`, then run `docker compose up --pull=always`
- GPU (high): change `image` in `docker-compose.yaml` to `nguyennpa412/simple-multimodal-ai:gpu-high`, then run `docker compose up --pull=always`

The app is running on http://0.0.0.0:7860/