Augmenting Multi-modal Question Answering Systems with Retrieval Methods
Abstract
The quest to develop artificial intelligence systems capable of handling intricate tasks has propelled the prominence of deep learning, particularly since 2016, when neural network models became the mainstream approach. With applications ranging from recommender systems to speech recognition, these models have revolutionised many domains. However, challenges persist, especially in incorporating extensive domain-specific knowledge and in mitigating the hallucination inherent in large language models.
This thesis explores the integration of retrieval-augmented generation (RAG) into multi-modal question answering (QA) systems as a solution to these challenges. By leveraging external knowledge sources, RAG improves model accuracy and provides access to domain-specific information. The research unfolds as follows:
Firstly, to leverage external knowledge efficiently and effectively in answering knowledge-intensive, visually grounded questions, we introduce RA-VQA (Retrieval Augmented Visual Question Answering), a framework tailored for knowledge-based visual question answering (KB-VQA). We demonstrate the efficacy of jointly training the retriever and generator models to maximise performance.
Secondly, FVQA (Fact-based Visual Question Answering) 2.0 introduces semi-automatically annotated adversarial samples to address data distribution imbalances and enhance system robustness, showcasing substantial improvements in handling challenging scenarios.
Thirdly, the development of FLMR (Fine-grained Late-interaction Multi-modal Retriever), a state-of-the-art multi-modal retriever, and of its scaled-up version, PreFLMR (Pre-trained FLMR), underscores the significance of late-interaction models in achieving superior multi-modal retrieval performance. We show that the proposed models capture finer-grained interactions between queries and contexts, offering efficient and accurate retrieval across a wide range of multi-modal retrieval tasks.
Finally, the focus pivots to retrieval methods in TableQA, introducing ITR (Inner Table Retriever) for closed-domain scenarios and LI-RAGE (Late Interaction Retrieval Augmented Generation with Explicit Signals) for open-domain TableQA tasks. Both frameworks deliver remarkable performance improvements over existing approaches. We show that incorporating retrieval methods into TableQA substantially pushes the research boundary, offering state-of-the-art question answering performance.
Through meticulous experimentation and innovation, this thesis not only advances the theoretical understanding of multi-modal retrieval-augmented systems but also contributes practical frameworks and datasets that address critical challenges in question answering across diverse domains. As the journey towards effective AI systems continues, these contributions provide a solid foundation for future advances in information retrieval and question answering in multi-modal contexts.