Social Science Research Council Research AMP Just Tech
Citation

Beyond Text: Leveraging Vision-Language Models for Misinformation Detection

Authors:
Grewal, Parminder Kaur; Ernst, Marina; Hopfgartner, Frank
Year:
2025

The rapid expansion of social media and digital platforms has fueled the spread of misinformation, including disinformation and propaganda intentionally designed to deceive the public and shape opinion on critical issues. Like most online content, malicious content is rarely limited to text; it is typically enriched with multimedia such as images and videos. This diversity poses significant challenges for detection, since standard techniques that treat each data type separately are usually ineffective in multimodal scenarios. Addressing this challenge requires advanced detection methods capable of jointly analyzing multimodal content, such as text and images. This study explores the effectiveness of multimodal frameworks, namely OpenAI's CLIP, ViLT, and FLAVA, in detecting misinformation using the Fakeddit dataset, a widely used benchmark for multimodal research. Leveraging these pre-trained models, we evaluate their performance on this task. In addition to textual and visual inputs, we incorporate metadata features to improve model performance. The results demonstrate these models' potential to enhance the robustness and accuracy of misinformation detection, thereby countering the increasing sophistication of multimodal disinformation campaigns.
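To illustrate the general idea of combining textual, visual, and metadata features, the sketch below shows one common approach: late fusion, where pre-extracted embeddings from each modality are concatenated into a single feature vector and fed to a classifier. This is a minimal illustration, not the authors' pipeline; the embedding dimensions, metadata fields, and synthetic random data are all assumptions for demonstration purposes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for pre-extracted embeddings (shapes are illustrative;
# a real pipeline would obtain these from a model such as CLIP, ViLT, or FLAVA).
n = 200
text_emb = rng.normal(size=(n, 512))   # e.g. text-encoder output per post
img_emb = rng.normal(size=(n, 512))    # e.g. image-encoder output per post
meta = rng.normal(size=(n, 3))         # e.g. score, upvote ratio, comment count
labels = rng.integers(0, 2, size=n)    # 0 = credible, 1 = misinformation (synthetic)

# Late fusion: concatenate all modalities into one feature vector per post.
X = np.hstack([text_emb, img_emb, meta])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

# A simple linear classifier on the fused features.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
preds = clf.predict(X_te)
```

Because the labels here are random, the classifier's accuracy is not meaningful; the point is only the mechanics of fusing modalities. In practice, metadata features (such as Fakeddit's engagement statistics) are normalized before concatenation so they are not dwarfed by the high-dimensional embeddings.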