This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal models, also known as Multi-Modal Process of Elimination (MM-PoE), a method that enhances vision language models' performance on multiple-choice visual reasoning tasks by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question-answering datasets show the method's effectiveness, particularly in visual reasoning tasks. Previously, PoE had been used only in zero-shot settings and only within a language-only framework; MM-PoE addresses this limitation by extending the approach to multi-modal tasks and by including experimentation techniques for few-shot settings. The link to our software is given here.
The Multi-Modal Process of Elimination (MM-PoE) introduced in this paper operates on a two-step mechanism designed to enhance the decision-making capabilities of vision language models (VLMs) in multiple-choice visual reasoning tasks. This method employs a novel approach to option elimination followed by a focused prediction phase. The strategy is rooted in the belief that separating the elimination of clearly incorrect options from the choice of the best remaining option will improve overall task performance.
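Given a multiple-choice visual reasoning task, the problem setting consists of a question (or context), an accompanying image, and a set of candidate answer options, exactly one of which is correct.
The goal is to develop an in-context learning method that accurately selects the correct option given the question and the image.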
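In the first step of the MM-PoE method, each option is scored for plausibility given the question and the image, and the options judged least plausible are eliminated.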
This elimination strategy intuitively aligns with how humans often discard options that seem clearly incorrect before carefully considering the remaining choices.
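The elimination step can be sketched in a few lines. This is a minimal illustration rather than the package's implementation: it assumes each option already has a numeric plausibility score (higher is better) and uses a below-average threshold as the elimination rule. This section does not spell out the threshold, so treat that rule, the function name `eliminate_options`, and the example scores as assumptions.

```python
from statistics import mean
from typing import List, Sequence


def eliminate_options(options: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Return the options to eliminate.

    Assumption: an option is eliminated when its score is below the
    mean score of all options. Other thresholds are possible.
    """
    threshold = mean(scores)
    return [opt for opt, s in zip(options, scores) if s < threshold]


options = ["West Virginia", "Louisiana", "Arizona", "Oklahoma"]
# Made-up scores for illustration only (e.g. log-likelihoods from a model).
scores = [-1.2, -1.9, -3.4, -2.8]

eliminated = eliminate_options(options, scores)
print(eliminated)
```

With a mean-based threshold, the highest-scoring option can never fall below the threshold, so at least one candidate always survives to the prediction step.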
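The second step involves making the final choice from the non-eliminated options. This step utilizes a binary mask to exclude the eliminated options during the prediction phase. The mask for each option is 0 if the option was eliminated in the first step and 1 otherwise.
The masked context is the original prompt with every eliminated option replaced by a [MASK] placeholder, so that only the options whose mask is 1 remain visible.
The final predicted answer is the highest-scoring option among those that were not eliminated, with each option scored against the masked context.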
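The masking and prediction step might look as follows. This is a sketch under stated assumptions: the prompt template, the helper names (`build_mask`, `masked_prompt`, `predict`), and the toy scoring function are all hypothetical. In a real setup the scoring function would query a vision language model with the image as well as the prompt. The `[MASK]` placeholder mirrors the "Predicted Masks" lines in the worked examples later in this paper.

```python
from typing import Callable, Collection, List, Sequence

MASK_TOKEN = "[MASK]"


def build_mask(options: Sequence[str], eliminated: Collection[str]) -> List[int]:
    """Binary mask: 0 for eliminated options, 1 for the ones kept."""
    return [0 if opt in eliminated else 1 for opt in options]


def masked_prompt(question: str, options: Sequence[str], mask: Sequence[int]) -> str:
    """Hypothetical template: eliminated options are shown as [MASK]."""
    shown = [opt if keep else MASK_TOKEN for opt, keep in zip(options, mask)]
    return f"Question: {question}\nOptions: {', '.join(shown)}"


def predict(question: str, options: Sequence[str], mask: Sequence[int],
            score_fn: Callable[[str, str], float]) -> str:
    """Pick the highest-scoring option among those with mask == 1."""
    prompt = masked_prompt(question, options, mask)
    kept = [opt for opt, keep in zip(options, mask) if keep]
    return max(kept, key=lambda opt: score_fn(prompt, opt))


question = "Which of these states is farthest north?"
options = ["West Virginia", "Louisiana", "Arizona", "Oklahoma"]
mask = build_mask(options, eliminated={"Arizona", "Oklahoma"})
prompt = masked_prompt(question, options, mask)

# Toy stand-in for a model call; the numbers are invented for illustration.
toy_scores = {"West Virginia": -0.4, "Louisiana": -1.1}
answer = predict(question, options, mask, lambda _prompt, opt: toy_scores[opt])
print(prompt)
print("Predicted Option:", answer)
```

Restricting the final `max` to the kept options means an eliminated option can never be returned, even if the model would score it highly against the masked prompt.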
Here are some typical examples of using MM-PoE.
CLI: To run the CLI application, execute the following.
$ python -m mm_poe #or $ mm_poe
The application prompts the user for the inputs to a multiple-choice question: the question, the answer choices, and the path to the image that provides the question's context. Once these inputs are provided, the application displays the predicted answer. Note that the application runs inference on only a single sample at a time.
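MM-PoE achieved the best accuracy in two of the four model-dataset combinations in Table 1 and came within one point of the best-performing baseline in the other two, showing particular strength in logical reasoning. It outperformed multiple-choice prompting (MCP) in every combination. The method's separation of the elimination and prediction tasks was crucial to its success.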
Model | Dataset | LM | AVG | Calibration | Channel | MCP | PoE |
---|---|---|---|---|---|---|---|
microsoft/git-base-vqav2 | ScienceQA | 27.4 | 17.8 | 23.2 | 24.6 | 25.8 | 27.2 |
microsoft/git-base-vqav2 | AI2D | 25.4 | 26.2 | 26.4 | 25.4 | 25.3 | 26.5 |
microsoft/git-base-textvqa | ScienceQA | 21.8 | 20.4 | 25.8 | 23.4 | 23.6 | 28.2 |
microsoft/git-base-textvqa | AI2D | 26.5 | 27.6 | 20.8 | 26.2 | 24.2 | 26.8 |
Table 1: Comparison of Multiple-Choice Prompting (MCP) and Process of Elimination (PoE) accuracy scores, alongside other baselines, on 2 visual question answering datasets for the microsoft/git-base-vqav2 and microsoft/git-base-textvqa models in the zero-shot setting. Each dataset has a different number of answer choices. PoE outperforms MCP in all four model-dataset combinations shown.
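To make the comparison in Table 1 easier to read, the following sketch recomputes the PoE-minus-MCP gain and the best-scoring method for each row. The numbers are copied from the table; the variable names and output format are illustrative choices only.

```python
# Accuracy scores copied from Table 1.
table = {
    ("microsoft/git-base-vqav2", "ScienceQA"): {
        "LM": 27.4, "AVG": 17.8, "Calibration": 23.2,
        "Channel": 24.6, "MCP": 25.8, "PoE": 27.2},
    ("microsoft/git-base-vqav2", "AI2D"): {
        "LM": 25.4, "AVG": 26.2, "Calibration": 26.4,
        "Channel": 25.4, "MCP": 25.3, "PoE": 26.5},
    ("microsoft/git-base-textvqa", "ScienceQA"): {
        "LM": 21.8, "AVG": 20.4, "Calibration": 25.8,
        "Channel": 23.4, "MCP": 23.6, "PoE": 28.2},
    ("microsoft/git-base-textvqa", "AI2D"): {
        "LM": 26.5, "AVG": 27.6, "Calibration": 20.8,
        "Channel": 26.2, "MCP": 24.2, "PoE": 26.8},
}

summary = []
for (model, dataset), scores in table.items():
    gain = round(scores["PoE"] - scores["MCP"], 1)  # PoE gain over MCP
    best = max(scores, key=scores.get)              # best method in this row
    summary.append((model, dataset, gain, best))
    print(f"{model:28s} {dataset:10s} PoE-MCP={gain:+.1f}  best={best}")
```

The gains over MCP range from 1.2 to 4.6 points, while the best overall method is PoE in two rows, LM in one, and AVG in one.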
Question: Which of these states is farthest north?
Options: West Virginia, Louisiana, Arizona, Oklahoma
Ground Truth Option: West Virginia
Predicted Masks: West Virginia, Louisiana, [MASK], [MASK]
Predicted Option: West Virginia
Question: Are phytoplankton predators or prey in this food chain?
Options: producer, predator, prey, NA
Ground Truth Option: prey
Predicted Masks: [MASK], predator, prey, NA
Predicted Option: prey
MM-PoE improves the handling of multiple-choice visual reasoning tasks by mimicking a human-like process of elimination, outperforming multiple-choice prompting in every setting reported in Table 1. Future work will focus on enhancing its generalizability and efficiency, possibly by exploring better masking strategies.