1/7
Introducing MMMU-Pro: A more robust version of MMMU
https://arxiv.org/pdf/2409.02813v1
After launching MMMU, we received valuable feedback from the community:
Some questions were answerable without even seeing the images.
Models often didn’t actually "know" the answer but found shortcuts in the candidate options.
Performance was driven largely by the underlying LLM, with minimal contribution from the vision module.
To tackle these issues, we implemented the following improvements:
1. Filtering out questions that are answerable from the text alone.
2. Augmenting the candidate options from 4 up to 10, curated by human experts.
3. Adding a vision-only input setting, where questions are embedded directly in images, requiring the model to rely purely on visual input.
Why Did We Add the Vision-Only Input Setting?
1. From a foundational perspective, this setting forces AI to genuinely "see" and "read" at the same time, testing a core human cognitive skill: the seamless integration of visual and textual information.
2. From an application standpoint, this approach mirrors how users naturally interact with AI systems—by sharing screenshots or photos, without meticulously separating text from images.
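To make the difference between the two settings concrete, here is a minimal sketch of how a single query could be issued in each setting, using the OpenAI Python client as an example. This is not the evaluation code from the paper; the model name, prompt wording, and image format are placeholders.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def query_standard(question: str, options: list[str], figure_path: str) -> str:
    """Standard setting: question text plus up to 10 options in the prompt, figure attached as an image."""
    option_block = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    prompt = f"{question}\n{option_block}\nAnswer with the letter of the correct option."
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": to_data_url(figure_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content

def query_vision_only(screenshot_path: str) -> str:
    """Vision-only setting: the screenshot already contains the question, options, and figure."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer the question shown in the image with the letter of the correct option."},
                {"type": "image_url", "image_url": {"url": to_data_url(screenshot_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```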
Key Results:
Performance on MMMU-Pro is notably lower than on MMMU, with drops ranging from 16.8% to 26.9% across models. The ranking of models is broadly similar to the original, but some models are noticeably less robust: for example, GPT-4o mini proved less robust than GPT-4o and the other proprietary models, showing significant performance drops on the augmented option set.
More in-depth analysis can be found in the rest of the thread below!
2/7
Impact of Augmented Candidate Options: Expanding the number of candidate options from 4 to 10 led to a notable drop in model performance, with GPT-4o (0513) experiencing a 10.7% decrease in accuracy.
Impact of Vision-Only Setting: The vision-only input setting further challenged model performance, especially for open-source models, which proved less robust in this scenario. GPT-4o (0513) saw a 4.3% drop, while LLaVA-OneVision-72B and VILA-1.5-40B experienced more dramatic declines of 14.0% and 21.8%, respectively.
3/7
A natural question to ask: is OCR the bottleneck in the vision-only input setting?
Answer: No, not for capable models, whether open-source or closed-source.
We actually found that the top-performing models are already very good at OCR: they can accurately extract the question text from the image. The core difficulty is the complexity of integrated information processing. We observed that models are more likely to hallucinate and produce flawed reasoning chains when they have to process visual and textual information simultaneously.
4/7
For example, in the case below, the model accurately extracts the text from the photo in the vision-only input scenario. However, its response tends to be more superficial and lacks in-depth analysis. Integrating visual and textual information appears to increase the cognitive load on the model, which may result in a higher likelihood of errors.
5/7
Finally, as expected, CoT prompting mostly helps in reasoning-heavy scenarios, but the extent of improvement varies significantly among models. In some cases we even observed a significant performance drop, for example with VILA-1.5-40B. This decline might be attributed to weak instruction-following: the model barely generated useful reasoning chains.
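For context, the direct and CoT settings differ mainly in the answer instruction appended to the question. A rough sketch of the two prompt styles is below; the exact templates used for evaluation may differ, so see the released inference code.

```python
# Direct prompting: ask only for the final option letter.
DIRECT_SUFFIX = "Answer directly with the letter of the correct option."

# Chain-of-Thought prompting: ask the model to reason step by step first,
# then end with a parseable final-answer line.
COT_SUFFIX = (
    "Think through the problem step by step, then end your response with a line "
    'of the form "The answer is (X)" so the option letter can be parsed.'
)

def build_prompt(question: str, option_block: str, use_cot: bool) -> str:
    """Assemble a full prompt from the question, the formatted options, and the chosen suffix."""
    suffix = COT_SUFFIX if use_cot else DIRECT_SUFFIX
    return f"{question}\n{option_block}\n{suffix}"
```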
6/7
We would like to thank the community for the great feedback since the MMMU release! We hope this more robust version, MMMU-Pro, will help the community evaluate foundation models more rigorously. This release is a collaboration with @zhengtianyu4 @YuanshengNi @YuboWang726 @DrogoKhal4 @TongPetersb @jamesja69137043 @MingYin_0312 @BotaoYu24 @GeZhang86038849 @hhsun1 @ysu_nlp @WenhuChen @gneubig
7/7
The dataset is available at:
MMMU/MMMU_Pro · Datasets at Hugging Face
The inference code is available at:
MMMU/mmmu-pro at main · MMMU-Benchmark/MMMU
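If you just want to explore the data, a minimal loading sketch with the Hugging Face datasets library is shown below. The config and split names here are assumptions; check the dataset card for the exact ones.

```python
from datasets import load_dataset  # pip install datasets

# Config/split names are assumptions; see the MMMU/MMMU_Pro dataset card for the exact ones.
standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)", split="test")
vision = load_dataset("MMMU/MMMU_Pro", "vision", split="test")

print(len(standard), len(vision))
print(standard[0].keys())  # inspect the available fields of one example
```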