MMKC-Bench

Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

1Shandong University 2University of Science and Technology of China 3Shanghai Jiaotong University 4Nanjing University 5Nanjing University of Posts and Telecommunications 6Jiangnan University 7The Hong Kong University of Science and Technology 8Shanghai AI Laboratory

*Core Contributors
📧 Corresponding author: yuntaodu@sdu.edu.cn

Background

"The tsunami of knowledge washes over the old shores, and new islands rise on the crest of the waves of debate."

– Knowledge Conflict

Data Overview

Introduction

Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks, where contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely unexplored. Furthermore, factual knowledge-based evaluation is often overlooked, and existing datasets lack a thorough investigation of conflict detection capabilities.

To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems.

Contribution 1

We propose MMKC-Bench, a multimodal knowledge conflict benchmark focusing on factual knowledge conflict under both context-memory and inter-context scenarios.

Contribution 2

We propose a novel benchmark-construction pipeline that collects original knowledge, generates conflicting knowledge, and produces evaluation questions in two formats.

Contribution 3

We conduct extensive experiments with various models under both context-memory and inter-context conflicts, covering behavior analysis and conflict detection, and reveal several characteristics of existing LMMs.

Multimodal LLM Knowledge Conflicts Benchmark

MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification.

Benchmark Overview

The overall process of constructing MMKC-Bench. (a) First, obtain high-quality entity categories and instances. (b) Next, collect raw data from Wikipedia and filter for popular content. (c) Then, use GPT-4o to generate summaries of the textual content of the raw data. (d) Subsequently, download images of the query instances from Google and manually filter them. (e) Finally, use GPT-4o to generate counterfactual conflicting knowledge and the corresponding questions.
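As a rough illustration of step (e), the sketch below shows how counterfactual conflict knowledge and paired questions might be generated with GPT-4o through the OpenAI Python client. The function name, prompt wording, and output schema are our own assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch of step (e): asking GPT-4o to rewrite a factual summary into
# a counterfactual version and to produce paired evaluation questions.
# Function name, prompt wording, and output schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def make_conflict(entity: str, original_summary: str) -> dict:
    prompt = (
        f"Entity: {entity}\n"
        f"Original knowledge: {original_summary}\n\n"
        "1. Rewrite the knowledge so that one key fact becomes a plausible but "
        "counterfactual statement.\n"
        "2. Write a multiple-choice question and an open-ended question whose "
        "answers differ between the original and counterfactual versions.\n"
        "Return JSON with keys: conflict_knowledge, mc_question, open_question."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```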


Experimental Results

1. LMMs are more receptive to internal knowledge than to external evidence

As shown in Table 2 and Table 3, under context-memory conflicts, the average OAR exceeds CAR in all cases (6 out of 6), indicating that LMMs tend to favor internal knowledge. Closed-source GPT-4o mini shows results consistent with the open-source models, suggesting that even advanced closed models are insensitive to external evidence. This differs from LLMs, which have shown high receptiveness to external knowledge. One reason for this contrast is the difference in training data formats: LLMs are typically trained on long text contexts involving multiple information sources, while LMMs are mostly trained on isolated image-text pairs. This limits their exposure to multi-source contexts and reduces their ability to integrate external information during inference. This finding is important for designing multimodal RAG systems, as it reveals that LMMs may not naturally leverage retrieved evidence and instead rely on parametric knowledge. Thus, improving LMMs' ability to incorporate external information is important, which may require innovations in training paradigms and model architectures.
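For reference, here is a minimal sketch of how agreement rates of this kind could be computed. Based on the discussion above, we assume OAR counts predictions matching the original (parametric-knowledge) answer and CAR counts predictions matching the conflicting-context answer; the benchmark's exact metric definitions may differ.

```python
# Minimal sketch of agreement-rate computation (assumed definitions, see lead-in).
from typing import Dict, List

def agreement_rates(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Each record has: 'prediction', 'original_answer', 'conflict_answer'."""
    n = len(records)
    oar = sum(r["prediction"].strip().lower() == r["original_answer"].strip().lower()
              for r in records) / n
    car = sum(r["prediction"].strip().lower() == r["conflict_answer"].strip().lower()
              for r in records) / n
    return {"OAR": oar, "CAR": car}

# Example: a model that sticks to its parametric knowledge on 2 of 3 items.
records = [
    {"prediction": "Paris", "original_answer": "Paris", "conflict_answer": "Lyon"},
    {"prediction": "Paris", "original_answer": "Paris", "conflict_answer": "Lyon"},
    {"prediction": "Lyon",  "original_answer": "Paris", "conflict_answer": "Lyon"},
]
print(agreement_rates(records))  # OAR ≈ 0.67, CAR ≈ 0.33
```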


2. LMMs are more sensitive to knowledge-related conflicts and less sensitive to recognition-based conflicts

We group the three conflict types into recognition-based (entity recognition, visual semantics) and knowledge-related (entity knowledge) conflicts. LMMs show lower OARs on knowledge-related conflicts than on recognition-based conflicts, indicating greater sensitivity to factual inconsistencies. For example, entity knowledge conflicts yield an OAR as low as 0.26 on Qwen2.5-VL-7B, while entity recognition conflicts often show the highest OARs, suggesting that LMMs more readily rely on internal knowledge for perception tasks.

3. When provided with more external evidence, LMMs exhibit greater alignment with external information, though the improvement remains limited

Compared to context-memory conflict scenarios, models generally achieve higher CARs under inter-context conflicts, suggesting a slight increase in reliance on external evidence. This is expected: when more external information is provided, the model's output is influenced more strongly by it. However, the overall improvement is limited: the largest average increase in CAR is 21%, while the smallest is only about -2%. These results reaffirm that LMMs predominantly rely on their internal parametric knowledge, even when presented with multiple external sources.

4. Larger models rely more strongly on internal knowledge across all conflict types

As illustrated in Fig. 4 and Fig. 5, the Overall Agreement Rate (OAR) generally increases with model size within the Qwen2.5-VL series. Specifically, the OAR improves progressively as the model scales from 3B to 7B, 32B, and 72B, and this trend holds across entity recognition, entity knowledge, and visual semantic conflicts. It suggests that larger models are more strongly influenced by their internal knowledge, which may stem from exposure to more extensive training data, enabling them to develop stronger mechanisms for resolving conflicts.


Case Study

Our Team