Ask in Any Modality:
A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

1Qatar Computing Research Institute, 2Sharif University of Technology,
3University of Tehran, 4K.N. Toosi University of Technology

Multimodal RAG extends traditional Retrieval-Augmented Generation (RAG) by incorporating diverse data types, such as text, images, audio, and video, from external knowledge sources. By dynamically integrating this retrieved information, it addresses limitations such as hallucinations and outdated knowledge, improving the factual grounding, accuracy, and reasoning of the generated outputs.


Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge because they rely on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, improving the factual grounding and currency of generated outputs. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to further enhance generation. However, cross-modal alignment and reasoning introduce unique challenges that distinguish Multimodal RAG from traditional unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and we also explore the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advances in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases.

Taxonomy of Recent Advances and Enhancements


Taxonomy of Application Domains


Process and Innovations in Multimodal RAG

The typical Multimodal RAG pipeline involves several key stages, alongside crucial innovations in training and robustness, to effectively integrate diverse data types for enhanced generation; minimal, illustrative Python sketches of each stage follow the list:

  1. Multimodal Query and Corpus Encoding: User queries and documents from a multimodal corpus (containing text, images, etc.) are processed by modality-specific encoders. These map the varied inputs into a shared semantic space, enabling cross-modal comparison and alignment.
  2. Retrieval Strategy: A retrieval model identifies relevant documents from the encoded corpus based on similarity to the query. This stage uses efficient search (e.g., maximum inner-product search, MIPS), modality-specific strategies (text-centric, vision-centric, video-centric, document layout-aware), and often re-ranking or selection to prioritize the best candidates.
  3. Fusion Mechanisms: The retrieved information, spanning multiple modalities, is effectively combined. Fusion mechanisms align and integrate these diverse data types using techniques like score fusion, attention-based weighting of cross-modal interactions, or projecting inputs into unified representations.
  4. Augmentation Techniques: Before generation, the retrieved context is refined. This can involve context enrichment (adding related info), adaptive retrieval (adjusting based on query complexity or feedback), or iterative retrieval (refining results over multiple steps).
  5. Generation: Finally, a Multimodal Large Language Model (MLLM) generates a response using the original query and the processed, retrieved multimodal context. Innovations include In-Context Learning with retrieved examples, structured reasoning (like Chain-of-Thought), instruction tuning for specific tasks, and ensuring source attribution.
  6. Training Strategies: Training involves multistage processes, often starting with pretraining on large paired datasets to learn cross-modal relationships (e.g., using contrastive loss like InfoNCE or alignment losses) and followed by fine-tuning on specific tasks using objectives like cross-entropy for generation or specialized losses for alignment and disentanglement.
  7. Robustness and Noise Management: Techniques are employed to handle challenges like noisy retrieval inputs and modality-specific biases. Methods include injecting irrelevant results or noise during training, using knowledge distillation, employing adaptive knowledge selection strategies, or applying regularization techniques (like Query Dropout) to enhance model resilience and focus.

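To make the encoding stage (step 1) concrete, the following is a minimal Python sketch of a pair of modality-specific encoders mapping into a shared embedding space. The encoders are hypothetical stand-ins (random projections over toy features) rather than real CLIP-style towers, so the sketch runs without model weights; DIM, encode_text, and encode_image are names introduced here only for illustration.

# Sketch 1 - multimodal query and corpus encoding (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # assumed shared embedding dimension

# Random projection matrices standing in for trained encoder heads.
TEXT_PROJ = rng.normal(size=(DIM, 1024)) / np.sqrt(1024)
IMAGE_PROJ = rng.normal(size=(DIM, 1024)) / np.sqrt(1024)

def encode_text(text):
    # Placeholder text encoder: hash tokens into a bag-of-words vector,
    # project into the shared space, and L2-normalize.
    bow = np.zeros(1024)
    for tok in text.lower().split():
        bow[hash(tok) % 1024] += 1.0
    vec = TEXT_PROJ @ bow
    return vec / (np.linalg.norm(vec) + 1e-8)

def encode_image(pixels):
    # Placeholder image encoder: flatten raw pixels as "features" and project
    # them into the same shared space so text and images are directly comparable.
    feats = np.asarray(pixels, dtype=float).reshape(-1)[:1024]
    feats = np.pad(feats, (0, 1024 - feats.size))
    vec = IMAGE_PROJ @ feats
    return vec / (np.linalg.norm(vec) + 1e-8)
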
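Continuing the sketch above, retrieval (step 2) over the encoded corpus reduces to maximum inner-product search plus an optional re-ranking pass. The toy three-document corpus, the exhaustive argsort search, and the modality-preference boost are assumptions made for illustration; a production system would typically use an approximate nearest-neighbor index.

# Sketch 2 - MIPS retrieval with a toy re-ranking step (continues Sketch 1).
corpus = [
    {"id": "doc1", "modality": "text",  "content": "a red bus on a city street"},
    {"id": "doc2", "modality": "text",  "content": "a recipe for lentil soup"},
    {"id": "img1", "modality": "image", "content": rng.random((32, 32))},
]

def embed(item):
    if item["modality"] == "text":
        return encode_text(item["content"])
    return encode_image(item["content"])

corpus_matrix = np.stack([embed(d) for d in corpus])  # (num_docs, DIM)

def retrieve(query, k=2):
    q = encode_text(query)
    scores = corpus_matrix @ q            # inner products = MIPS scores
    top = np.argsort(-scores)[:k]         # exhaustive search; use ANN at scale
    # Toy re-ranking: lightly boost image documents when the query hints at one.
    wants_image = "photo" in query or "image" in query
    ranked = sorted(
        ((scores[i] + (0.05 if wants_image and corpus[i]["modality"] == "image" else 0.0), i)
         for i in top),
        reverse=True,
    )
    return [(corpus[i]["id"], float(s)) for s, i in ranked]

print(retrieve("a photo of a bus"))
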
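For the fusion stage (step 3), the sketch below shows two simple strategies: weighted score fusion across modality-specific retrievers, and softmax-attention weighting of retrieved embeddings into a single fused context vector. The functions score_fusion and attention_fusion, and the fixed mixing weight alpha, are illustrative assumptions rather than a specific published mechanism.

# Sketch 3 - fusion mechanisms (illustrative).
import numpy as np

def score_fusion(text_scores, image_scores, alpha=0.6):
    # Convex combination of per-modality relevance scores for the same candidates.
    return alpha * np.asarray(text_scores) + (1 - alpha) * np.asarray(image_scores)

def attention_fusion(query_vec, retrieved_vecs):
    # Softmax attention over retrieved items; returns one fused context vector
    # (a stand-in for richer cross-attention inside an MLLM).
    retrieved = np.stack(retrieved_vecs)      # (n, DIM)
    logits = retrieved @ query_vec            # (n,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ retrieved                # (DIM,)
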
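Augmentation (step 4) can be sketched as a retrieval loop that adapts to feedback: if the best score is too low, the query is enriched and retrieval is retried. The threshold, step budget, and expand_query helper are hypothetical; the loop reuses retrieve from Sketch 2.

# Sketch 4 - adaptive / iterative retrieval (illustrative, continues Sketch 2).
def expand_query(query, hits):
    # Naive context enrichment: append the id of the best hit so far.
    return query + " " + hits[0][0]

def iterative_retrieve(query, threshold=0.5, max_steps=3):
    hits = []
    for step in range(max_steps):
        hits = retrieve(query)
        if hits[0][1] >= threshold:        # adaptive stop: retrieval is good enough
            return hits, step + 1
        query = expand_query(query, hits)  # refine the query and retry
    return hits, max_steps
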
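For generation (step 5), the main engineering step is assembling the retrieved multimodal context into a prompt that supports source attribution. The call_mllm function below is a hypothetical stand-in for whatever multimodal LLM API is used; the point of the sketch is only the prompt layout, with retrieved items tagged by source id so the answer can cite them.

# Sketch 5 - generation with retrieved multimodal context (illustrative).
def build_prompt(query, retrieved_docs):
    context_lines = []
    for doc in retrieved_docs:
        if doc["modality"] == "text":
            context_lines.append(f"[{doc['id']}] {doc['content']}")
        else:
            context_lines.append(f"[{doc['id']}] <image attached>")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite source ids like [doc1].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def call_mllm(prompt, images=None):
    # Placeholder for the actual multimodal LLM call.
    return "(model output here)"

answer = call_mllm(build_prompt("What color is the bus?", corpus[:2]))
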
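Step 6 mentions contrastive pretraining with losses such as InfoNCE; below is a minimal PyTorch sketch of one common symmetric form of that loss, where matched text/image pairs lie on the diagonal of the similarity matrix. The temperature value and batch layout are assumptions.

# Sketch 6 - a minimal symmetric InfoNCE loss for cross-modal alignment.
import torch
import torch.nn.functional as F

def info_nce(text_emb, image_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)              # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)          # image -> text
    return 0.5 * (loss_t2i + loss_i2t)

# Usage: loss = info_nce(text_batch, image_batch); loss.backward()
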
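Finally, the robustness techniques in step 7 can be sketched as two small training-time transforms: injecting irrelevant documents into the retrieved context, and randomly dropping query tokens as a query-dropout-style regularizer. Both functions are illustrative assumptions about how such techniques might be wired in, not a specific published recipe.

# Sketch 7 - noise injection and query dropout during training (illustrative).
import random

def inject_noise(retrieved_docs, full_corpus, noise_ratio=0.2):
    # Mix in random distractor documents so the generator learns to ignore them.
    n_noise = max(1, int(noise_ratio * len(retrieved_docs)))
    distractors = random.sample(full_corpus, k=min(n_noise, len(full_corpus)))
    noisy = list(retrieved_docs) + distractors
    random.shuffle(noisy)
    return noisy

def query_dropout(query, p=0.15):
    # Randomly drop query tokens so the model does not over-fit to exact phrasing.
    kept = [tok for tok in query.split() if random.random() > p]
    return " ".join(kept) if kept else query
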
Open Problems and Future Directions

Key challenges and future research opportunities in Multimodal RAG include:

  • Improving Generalization, Explainability, and Robustness: Addressing domain adaptation issues, modality biases (like over-reliance on text), the lack of precise source attribution (especially for non-textual data), and vulnerability to adversarial inputs or low-quality sources.
  • Enhancing Reasoning, Alignment, and Retrieval: Developing better compositional reasoning across modalities, improving entity-aware retrieval, mitigating retrieval biases (e.g., position sensitivity, redundancy), exploring knowledge graphs, and creating truly unified embedding spaces for direct multimodal search.
  • Developing Agent-Based and Self-Guided Systems: Moving towards interactive, agentic systems that use feedback (like reinforcement learning or human alignment) to self-assess retrieval needs, evaluate relevance, dynamically choose modalities, and refine outputs iteratively, achieving robust "any-to-any" modality support.
  • Integrating with Real-World Data and Embodied AI: Incorporating diverse real-world sensor data alongside traditional modalities to enhance situational awareness, aligning with trends towards embodied AI for applications in robotics, navigation, and physics-informed reasoning.
  • Addressing Long-Context, Efficiency, Scalability, and Personalization: Overcoming computational bottlenecks in processing long videos or multi-page documents, optimizing the speed-accuracy trade-off for efficiency and scalability (especially for edge devices), exploring user-specific personalization while ensuring privacy, and creating better datasets for evaluating complex reasoning and robustness.

BibTeX

@misc{abootorabi2025askmodalitycomprehensivesurvey,
  title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation},
  author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
  year={2025},
  eprint={2502.08826},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.08826},
}