Ask in Any Modality:
A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

1Qatar Computing Research Institute, 2Saarland University, 3Zuse School ELIZA,
4University of Tehran, 5Max Planck Institute for Software Systems,
6K.N. Toosi University of Technology, 7Sharif University of Technology

Multimodal RAG extends traditional Retrieval-Augmented Generation (RAG) frameworks by incorporating diverse data types, such as text, images, audio, and video, from external knowledge sources. By dynamically integrating this retrieved information, the approach addresses limitations such as hallucinations and outdated knowledge, improving the factual grounding, accuracy, and reasoning of the generated outputs.

[Figure: Overview of Multimodal Retrieval-Augmented Generation]

Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, improving factual grounding and keeping generated content up to date. Recent advances in multimodal learning have led to the development of Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions, and explore the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.

Taxonomy of Recent Advances and Enhancements

[Figure: Taxonomy of recent advances and enhancements in Multimodal RAG]

Taxonomy of Application Domains

[Figure: Taxonomy of application domains for Multimodal RAG]

Process and Innovations in Multimodal RAG

The typical Multimodal RAG pipeline involves several key stages, alongside crucial innovations in training and robustness, to effectively integrate diverse data types for enhanced generation; minimal code sketches illustrating several of these stages follow the list:

  1. Multimodal Query and Corpus Encoding: User queries and documents from a multimodal corpus (containing text, images, etc.) are processed by modality-specific encoders. These map the varied inputs into a shared semantic space, enabling cross-modal comparison and alignment.
  2. Retrieval Strategy: A retrieval model identifies relevant documents from the encoded corpus based on similarity to the query. This stage uses efficient search (e.g., maximum inner-product search, MIPS), modality-specific strategies (text-centric, vision-centric, video-centric, document layout-aware), and often re-ranking or selection to prioritize the best candidates.
  3. Fusion Mechanisms: The retrieved information, spanning multiple modalities, is effectively combined. Fusion mechanisms align and integrate these diverse data types using techniques like score fusion, attention-based weighting of cross-modal interactions, or projecting inputs into unified representations.
  4. Augmentation Techniques: Before generation, the retrieved context is refined. This can involve context enrichment (adding related info), adaptive retrieval (adjusting based on query complexity or feedback), or iterative retrieval (refining results over multiple steps).
  5. Generation: Finally, a Multimodal Large Language Model (MLLM) generates a response using the original query and the processed, retrieved multimodal context. Innovations include In-Context Learning with retrieved examples, structured reasoning (like Chain-of-Thought), instruction tuning for specific tasks, and ensuring source attribution.
  6. Training Strategies: Training typically follows a multistage process: pretraining on large paired datasets to learn cross-modal relationships (e.g., using contrastive objectives such as InfoNCE or alignment losses), followed by fine-tuning on specific tasks using objectives such as cross-entropy for generation or specialized losses for alignment and disentanglement.
  7. Robustness and Noise Management: Techniques are employed to handle challenges like noisy retrieval inputs and modality-specific biases. Methods include injecting irrelevant results or noise during training, using knowledge distillation, employing adaptive knowledge selection strategies, or applying regularization techniques (like Query Dropout) to enhance model resilience and focus.
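
A minimal sketch of stages 1 and 2 (encoding and retrieval), assuming a CLIP-style dual encoder from the Hugging Face transformers library; the checkpoint name, file names, and toy corpus are illustrative placeholders rather than the setup of any particular surveyed system.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode a small mixed corpus (text passages and an image) into one shared space.
texts = ["The Eiffel Tower is 330 metres tall.", "Basil pesto is made with pine nuts."]
images = [Image.open("tower_photo.jpg")]  # hypothetical local file

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=texts, return_tensors="pt", padding=True))
    image_emb = model.get_image_features(
        **processor(images=images, return_tensors="pt"))
    corpus = F.normalize(torch.cat([text_emb, image_emb]), dim=-1)

    # Encode the user query; on unit-normalized vectors, maximum inner-product
    # search (MIPS) coincides with cosine-similarity search.
    query_emb = F.normalize(model.get_text_features(
        **processor(text=["How tall is the Eiffel Tower?"],
                    return_tensors="pt", padding=True)), dim=-1)

scores = (query_emb @ corpus.T).squeeze(0)  # one relevance score per corpus item
top_k = scores.topk(k=2).indices            # indices of the retrieved documents
```

In practice the corpus embeddings would be indexed with an approximate nearest-neighbor library rather than scored with a single dense matrix product.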
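
A minimal sketch of score-level fusion (stage 3), assuming per-document relevance scores are already available from a text retriever and an image retriever; the weight alpha and the example scores are hypothetical.

```python
import numpy as np

def fuse_scores(text_scores, image_scores, alpha=0.6):
    """Min-max normalize each modality's scores, then take a weighted sum."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * minmax(text_scores) + (1.0 - alpha) * minmax(image_scores)

fused = fuse_scores(text_scores=[0.82, 0.31, 0.55],
                    image_scores=[0.12, 0.77, 0.40])
reranked = np.argsort(-fused)  # document indices, best first
```

Attention-based fusion would instead learn these weights jointly with the generator rather than fixing them by hand.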
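
A minimal sketch of iterative retrieval (stage 4); retrieve and refine_query are hypothetical callables standing in for a concrete retriever and the (M)LLM that proposes follow-up queries.

```python
def iterative_retrieve(query, retrieve, refine_query, max_rounds=3, k=5):
    """Retrieve, let the model refine the query from what was found so far,
    and repeat until it signals that the accumulated context is sufficient."""
    context, current = [], query
    for _ in range(max_rounds):
        context.extend(retrieve(current, k=k))   # top-k multimodal documents
        current = refine_query(query, context)   # e.g., an LLM-generated follow-up query
        if current is None:                      # model judges the context sufficient
            break
    return context
```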
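
A minimal sketch of how the generator's input can be assembled in stage 5; the plain-text prompt format and the document dictionary fields are illustrative assumptions, not the interface of any specific MLLM.

```python
def build_prompt(query, retrieved):
    """`retrieved` is a list of dicts such as
    {"modality": "text" | "image", "content": "...", "source": "..."}."""
    lines = ["Answer the question using only the context below.",
             "Cite the numbered source of every fact you use.",
             "",
             "Context:"]
    for i, doc in enumerate(retrieved, start=1):
        if doc["modality"] == "text":
            lines.append(f"[{i}] ({doc['source']}) {doc['content']}")
        else:
            # Non-text items are passed by caption/reference here; a true MLLM
            # would receive the raw media alongside the text prompt.
            lines.append(f"[{i}] ({doc['source']}) <{doc['modality']}: {doc['content']}>")
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)
```

Numbering the context items is what makes source attribution in the generated answer checkable afterwards.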
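
A minimal sketch of the symmetric InfoNCE objective mentioned in stage 6, computed over a batch of paired text/image embeddings: matched pairs lie on the diagonal of the similarity matrix and all other in-batch pairs act as negatives. The embedding dimension and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, image_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)            # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)          # image -> text direction
    return 0.5 * (loss_t2i + loss_i2t)

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))  # dummy batch of 8 pairs
```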
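
A minimal sketch of training-time noise injection (stage 7): randomly sampled, likely irrelevant corpus documents are mixed into the retrieved context so the generator learns to ignore distractors. The noise ratio is an illustrative choice, not a value prescribed by the survey.

```python
import random

def add_retrieval_noise(relevant_docs, corpus, noise_ratio=0.3, seed=0):
    rng = random.Random(seed)
    num_noise = max(1, int(len(relevant_docs) * noise_ratio))
    noisy = relevant_docs + rng.sample(corpus, k=num_noise)  # inject distractors
    rng.shuffle(noisy)  # shuffle so position does not reveal which items are relevant
    return noisy
```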

Open Problems and Future Directions

Key challenges and future research opportunities in Multimodal RAG include:

  • Improving Generalization, Explainability, and Robustness: Addressing domain adaptation issues, modality biases (like over-reliance on text), the lack of precise source attribution (especially for non-textual data), and vulnerability to adversarial inputs or low-quality sources.
  • Enhancing Reasoning, Alignment, and Retrieval: Developing better compositional reasoning across modalities, improving entity-aware retrieval, mitigating retrieval biases (e.g., position sensitivity, redundancy), exploring knowledge graphs, and creating truly unified embedding spaces for direct multimodal search.
  • Developing Agent-Based and Self-Guided Systems: Moving towards interactive, agentic systems that use feedback (like reinforcement learning or human alignment) to self-assess retrieval needs, evaluate relevance, dynamically choose modalities, and refine outputs iteratively, achieving robust "any-to-any" modality support.
  • Integrating with Real-World Data and Embodied AI: Incorporating diverse real-world sensor data alongside traditional modalities to enhance situational awareness, aligning with trends towards embodied AI for applications in robotics, navigation, and physics-informed reasoning.
  • Addressing Long-Context, Efficiency, Scalability, and Personalization: Overcoming computational bottlenecks in processing long videos or multi-page documents, optimizing the speed-accuracy trade-off for efficiency and scalability (especially for edge devices), exploring user-specific personalization while ensuring privacy, and creating better datasets for evaluating complex reasoning and robustness.

BibTeX

@inproceedings{abootorabi-etal-2025-ask,
                title = "Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation",
                author = "Abootorabi, Mohammad Mahdi  and
                  Zobeiri, Amirhosein  and
                  Dehghani, Mahdi  and
                  Mohammadkhani, Mohammadali  and
                  Mohammadi, Bardia  and
                  Ghahroodi, Omid  and
                  Baghshah, Mahdieh Soleymani  and
                  Asgari, Ehsaneddin",
                editor = "Che, Wanxiang  and
                  Nabende, Joyce  and
                  Shutova, Ekaterina  and
                  Pilehvar, Mohammad Taher",
                booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
                month = jul,
                year = "2025",
                address = "Vienna, Austria",
                publisher = "Association for Computational Linguistics",
                url = "https://aclanthology.org/2025.findings-acl.861/",
                doi = "10.18653/v1/2025.findings-acl.861",
                pages = "16776--16809",
                ISBN = "979-8-89176-256-5",
                abstract = "Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. All resources are publicly available at https://github.com/llm-lab-org/Multimodal-RAG-Survey."
            }