Visual question answering with grounding and user selection priority