License: Apache 2.0

MM-PoE: Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models

Abstract

This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models, also known as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning tasks by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question-answering datasets show the method's effectiveness, particularly in visual reasoning tasks. The method addresses a key limitation of earlier work, which applied PoE only in zero-shot settings and only within a language-only framework, by extending it to multi-modal tasks and by including experimentation techniques for few-shot settings. The link to our software is given here.

Methodology

The Multi-Modal Process of Elimination (MM-PoE) introduced in this paper operates on a two-step mechanism designed to enhance the decision-making capabilities of vision language models (VLMs) in multiple-choice visual reasoning tasks. This method employs a novel approach to option elimination followed by a focused prediction phase. The strategy is rooted in the belief that separating the elimination of clearly incorrect options from the choice of the best remaining option will improve overall task performance.

Problem Setting

Given a multiple-choice visual reasoning task, we define the problem setting as follows:

  • Let $x$ be the question or context provided.
  • Let $h$ be the image provided.
  • Let $Y = \{y_1, y_2, \ldots, y_n\}$ be the set of multiple-choice options available.
  • Let $y$ be the correct answer from $Y$.

The goal is to develop an in-context learning method that accurately selects $y$ from $Y$ given $x$ and $h$.

Two-Step Scoring Method

Step 1: Elimination

In the first step of the MM-PoE method, each option $y_i$ is scored based on a specified metric. The score function, $\text{score}(x, h, y_i)$, evaluates each option's plausibility given the question $x$ and image $h$. The scores are used to eliminate options that are deemed less likely to be correct. Specifically, options whose scores are below the average score are eliminated. This is calculated as follows:

$$ s_i = \text{score}(x, h, y_i) $$

$$ Y_{\text{wrong}} = \{\, y_i \mid s_i < \text{avg}(s_1, \ldots, s_n) \,\} $$

This elimination strategy intuitively aligns with how humans often discard options that seem clearly incorrect before carefully considering the remaining choices.
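The elimination rule can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the function name `eliminate_below_average` and the example scores are hypothetical, and the scores are assumed to be "higher is more plausible" values already produced by some scoring metric.

```python
from statistics import mean


def eliminate_below_average(options, scores):
    """Split options into (kept, eliminated) using the below-average rule."""
    if len(options) != len(scores):
        raise ValueError("options and scores must have the same length")
    threshold = mean(scores)
    # Options scoring strictly below the average are eliminated.
    kept = [opt for opt, s in zip(options, scores) if s >= threshold]
    eliminated = [opt for opt, s in zip(options, scores) if s < threshold]
    return kept, eliminated


# Hypothetical scores for illustration only (not real model outputs).
options = ["West Virginia", "Louisiana", "Arizona", "Oklahoma"]
scores = [-1.0, -2.0, -4.0, -5.0]
kept, eliminated = eliminate_below_average(options, scores)
print(kept)        # ['West Virginia', 'Louisiana']
print(eliminated)  # ['Arizona', 'Oklahoma']
```

Because the highest score can never fall below the average, at least one option always survives this step, so the prediction step always has a candidate to choose from.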

Step 2: Prediction

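The second step involves making the final choice from the non-eliminated options. This step utilizes a binary mask to exclude the eliminated options, denoted $Y_{\text{wrong}}$, during the prediction phase. The mask $m_i$ for each option $y_i$ is defined as follows:

$$ m_i = \begin{cases} 0 & \text{if } y_i \in Y_{\text{wrong}} \\ 1 & \text{otherwise} \end{cases} $$

The masked context $x_{\text{mask}}$ is then constructed by modifying the original context $x$ to include only the options for which $m_i = 1$. Each option is scored again, but this time within the context that explicitly excludes the eliminated options, possibly by using a template $T$ that masks out $Y_{\text{wrong}}$ in the presentation of the options:

$$ x_{\text{mask}} = T(x, Y, \text{mask}) $$

The final predicted answer $\hat{y}$ is then the option with the highest score among the remaining options, where $h$ is the image:

$$ \hat{y} = \arg\max_{i \,\mid\, m_i = 1} \text{score}(x_{\text{mask}}, h, y_i) $$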
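The prediction step can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's code: `build_mask`, `apply_template`, `predict`, and `toy_score` are hypothetical names, the template format is invented for the example, and `toy_score` is a fixed lookup table standing in for a real vision language model scorer.

```python
def build_mask(scores):
    """Return 1 for options scoring at or above the mean, else 0."""
    avg = sum(scores) / len(scores)
    return [0 if s < avg else 1 for s in scores]


def apply_template(question, options, mask, mask_token="[MASK]"):
    """Build the masked context: eliminated options are shown as mask_token."""
    shown = [opt if m == 1 else mask_token for opt, m in zip(options, mask)]
    return f"{question} Options: {', '.join(shown)}"


def predict(question, image, options, mask, score_fn):
    """Rescore surviving options against the masked context; return the best."""
    x_mask = apply_template(question, options, mask)
    survivors = [opt for opt, m in zip(options, mask) if m == 1]
    return max(survivors, key=lambda opt: score_fn(x_mask, image, opt))


def toy_score(context, image, option):
    # Hypothetical stand-in for a VLM scorer: a fixed lookup table.
    table = {"West Virginia": -0.5, "Louisiana": -1.5}
    return table.get(option, -10.0)


question = "Which of these states is farthest north?"
options = ["West Virginia", "Louisiana", "Arizona", "Oklahoma"]
mask = build_mask([-1.0, -2.0, -4.0, -5.0])  # hypothetical step-1 scores
masked_context = apply_template(question, options, mask)
prediction = predict(question, None, options, mask, toy_score)
print(masked_context)
print(prediction)  # West Virginia
```

Only the surviving options are rescored, so an eliminated option can never be returned even if the scorer would rate it highly in the second pass.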

Usage

Here are some typical examples of using MM-PoE.
CLI: To run the CLI application, execute either of the following.

$ python -m mm_poe
# or
$ mm_poe

The application will prompt the user to provide relevant inputs for a multiple-choice question e.g. a question, multiple answer choices for the question, and the path to the image relevant to the question context. Once the inputs are provided, the predicted answer will be displayed based on prompt outputs. Note that this application runs inference for only a single sample at a time.
[Image: cli.png]

Results

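MM-PoE outperformed multiple-choice prompting (MCP) in all four model-dataset settings, with gains ranging from 1.2 to 4.6 accuracy points. It achieved the best accuracy overall in two of the four settings and stayed within one point of the best-performing baseline in the other two. The method's effectiveness in separating elimination and prediction tasks was crucial to its success.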

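| Model | Dataset | LM | AVG | Calibration | Channel | MCP | PoE |
|---|---|---|---|---|---|---|---|
| microsoft/git-base-vqav2 | ScienceQA | 27.4 | 17.8 | 23.2 | 24.6 | 25.8 | 27.2 |
| microsoft/git-base-vqav2 | AI2D | 25.4 | 26.2 | 26.4 | 25.4 | 25.3 | 26.5 |
| microsoft/git-base-textvqa | ScienceQA | 21.8 | 20.4 | 25.8 | 23.4 | 23.6 | 28.2 |
| microsoft/git-base-textvqa | AI2D | 26.5 | 27.6 | 20.8 | 26.2 | 24.2 | 26.8 |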

Table 1: Comparison of Multiple-Choice Prompting (MCP) and Process of Elimination (PoE) accuracy scores on 2 visual question answering datasets for the microsoft/git-base-vqav2 and microsoft/git-base-textvqa models in the zero-shot setting. Each dataset has a different number of answer choices. PoE outperforms MCP on both visual reasoning datasets for both multi-modal models.
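As a quick check on the PoE versus MCP comparison, the per-setting gains can be computed directly from the two relevant columns of Table 1. The snippet below only restates the table's numbers; the dictionary layout and variable names are chosen for illustration.

```python
# Accuracy values copied from the MCP and PoE columns of Table 1.
results = {
    ("microsoft/git-base-vqav2", "ScienceQA"): {"MCP": 25.8, "PoE": 27.2},
    ("microsoft/git-base-vqav2", "AI2D"): {"MCP": 25.3, "PoE": 26.5},
    ("microsoft/git-base-textvqa", "ScienceQA"): {"MCP": 23.6, "PoE": 28.2},
    ("microsoft/git-base-textvqa", "AI2D"): {"MCP": 24.2, "PoE": 26.8},
}

# PoE minus MCP, rounded to one decimal to match the table's precision.
gains = {key: round(acc["PoE"] - acc["MCP"], 1) for key, acc in results.items()}

for (model, dataset), gain in gains.items():
    print(f"{model} on {dataset}: {gain:+.1f}")
```

The differences are +1.4, +1.2, +4.6, and +2.6 points respectively, so the largest gain appears for microsoft/git-base-textvqa on ScienceQA.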

Examples

ScienceQA Example

[Image: image.png]

Question: Which of these states is farthest north?

Options: West Virginia, Louisiana, Arizona, Oklahoma

Ground Truth Option: West Virginia

Predicted Masks: West Virginia, Louisiana, [MASK], [MASK]

Predicted Option: West Virginia

AI2D Example

[Image: 17.png]

Question: Are phytoplankton predators or prey in this food chain?

Options: producer, predator, prey, NA

Ground Truth Option: prey

Predicted Masks: [MASK], predator, prey, NA

Predicted Option: prey

Conclusion

MM-PoE improves on multiple-choice prompting for multiple-choice visual reasoning tasks by mimicking a human-like process of elimination. Future work will focus on enhancing its generalizability and efficiency, possibly through better masking strategies.