Multimodal AI model based on Vision Transformers, implemented in approximately 500 lines of code