

This content will become publicly available on December 1, 2024

Title: Inconsistent Matters: A Knowledge-Guided Dual-Consistency Network for Multi-Modal Rumor Detection
Rumor spreaders are increasingly utilizing multimedia content to attract the attention and trust of news consumers. Though quite a few rumor detection models have exploited multi-modal data, they seldom consider the inconsistent semantics between images and texts, and rarely spot the inconsistency between the post content and background knowledge. In addition, they commonly assume the completeness of multiple modalities and thus are incapable of handling missing modalities in real-life scenarios. Motivated by the intuition that rumors in social media are more likely to have inconsistent semantics, a novel Knowledge-guided Dual-consistency Network is proposed to detect rumors with multimedia content. It uses two consistency detection subnetworks to capture the inconsistency at the cross-modal level and the content-knowledge level simultaneously. It also enables robust multi-modal representation learning under different missing visual modality conditions, using a special token to discriminate between posts with and without the visual modality. Extensive experiments on three public real-world multimedia datasets demonstrate that our framework can outperform the state-of-the-art baselines under both complete and incomplete modality conditions.
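To make the described architecture concrete, the following is a minimal sketch of a dual-consistency scorer, assuming pre-extracted text, image, and background-knowledge embeddings (e.g., from standard text and image encoders and a knowledge retriever). All module names, dimensions, and the cosine-based consistency scores are illustrative placeholders; this is not the authors' implementation.

```python
# Minimal sketch of a dual-consistency rumor scorer over pre-extracted
# embeddings. Hypothetical names and dimensions; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualConsistencyScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Learnable placeholder that stands in for a missing image,
        # playing the role of the paper's "special token".
        self.no_image_token = nn.Parameter(torch.randn(dim))
        self.classifier = nn.Sequential(
            nn.Linear(dim * 3 + 2, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, text_emb, image_emb, knowledge_emb, has_image):
        # Substitute the special token when the visual modality is missing.
        image_emb = torch.where(has_image.unsqueeze(-1).bool(),
                                image_emb,
                                self.no_image_token.expand_as(image_emb))
        # Cross-modal consistency: similarity between text and image.
        cross_modal = F.cosine_similarity(text_emb, image_emb, dim=-1)
        # Content-knowledge consistency: similarity between post content
        # and retrieved background-knowledge representation.
        content_knowledge = F.cosine_similarity(text_emb, knowledge_emb, dim=-1)
        features = torch.cat(
            [text_emb, image_emb, knowledge_emb,
             cross_modal.unsqueeze(-1), content_knowledge.unsqueeze(-1)], dim=-1)
        return self.classifier(features)  # logits over {non-rumor, rumor}

# Toy usage: a batch of 4 posts with random embeddings, one lacking an image.
scorer = DualConsistencyScorer(dim=256)
text, image, know = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
has_image = torch.tensor([1, 1, 0, 1])
print(scorer(text, image, know, has_image).shape)  # torch.Size([4, 2])
```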
Award ID(s):
2008155
NSF-PAR ID:
10477822
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
IEEE Transactions on Knowledge and Data Engineering
Volume:
35
Issue:
12
ISSN:
1041-4347
Page Range / eLocation ID:
12736 to 12749
Subject(s) / Keyword(s):
["Feature Extraction","Visualization","Social Networking Online","Semantics","Fake News","Electronic Mail","Data Mining","Multi Modal Learning","Rumor Detection","Social Media Analysis","Social Media","Real World Datasets","Visual Modality","Multimodal Learning","Post Content","Multimedia Content","Multimodal Representation","Special Token","Data Visualization","Types Of Information","Visual Information","Visual Features","Text Data","Generative Adversarial Networks","Content Knowledge","Textual Information","Graph Convolutional Network","Entity Pairs","Largest Distance","Twitter Dataset","Text Modality","Text Representation","Textual Features","Missing Patterns","Confidence Interval CI","Representation Of Entities","Data Instances"]
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Most of the research in the field of affective computing has focused on detecting and classifying human emotions through electroencephalogram (EEG) or facial expressions. Designing multimedia content to evoke certain emotions has been largely motivated by manual ratings provided by users. Here we present insights from the correlation of affective features between three modalities, namely affective multimedia content, EEG, and facial expressions. Interestingly, low-level audio-visual features, such as contrast and homogeneity of the video and tone of the audio in the movie clips, are most correlated with changes in facial expressions and EEG. We also identify the regions of the human face and the brain (in addition to the EEG frequency bands) that are most representative of affective responses. The computational modeling between the three modalities showed a high correlation between features from these regions and user-reported affective labels. Finally, the correlation between different layers of convolutional neural networks with EEG and face images as input provides insights into human affect. Together, these findings will assist in (1) designing more effective multimedia content to engage or influence viewers, (2) understanding the brain/body biomarkers of affect, and (3) developing new brain-computer interfaces as well as facial-expression-based algorithms to read viewers' emotional responses.

     
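As a rough illustration of the correlation analysis described above, the sketch below computes Pearson correlations between hypothetical per-clip audio-visual features and per-clip EEG and facial-expression responses; the feature names and random data are placeholders, not the authors' pipeline.

```python
# Cross-modal correlation sketch: how strongly do low-level stimulus features
# track physiological and facial responses? Placeholder features and data.
import numpy as np

rng = np.random.default_rng(0)
n_clips = 40

# One aggregated value per clip for each feature (stand-ins for real extractors).
video_contrast  = rng.normal(size=n_clips)   # e.g., GLCM contrast of frames
audio_tone      = rng.normal(size=n_clips)   # e.g., mean spectral centroid
eeg_alpha_power = rng.normal(size=n_clips)   # e.g., alpha-band power per clip
face_expression = rng.normal(size=n_clips)   # e.g., smile-intensity score

def pearson(x, y):
    """Pearson correlation coefficient between two per-clip feature vectors."""
    return np.corrcoef(x, y)[0, 1]

for name, stimulus in [("contrast", video_contrast), ("audio tone", audio_tone)]:
    print(f"{name} vs EEG alpha:  r = {pearson(stimulus, eeg_alpha_power):+.2f}")
    print(f"{name} vs expression: r = {pearson(stimulus, face_expression):+.2f}")
```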
  2. Over the last decade, research has revealed the high prevalence of cyberbullying among youth and raised serious concerns in society. Information on the social media platforms where cyberbullying is most prevalent (e.g., Instagram, Facebook, Twitter) is inherently multi-modal, yet most existing work on cyberbullying identification has focused solely on building generic classification models that rely exclusively on text analysis of online social media sessions (e.g., posts). Despite their empirical success, these efforts ignore the multi-modal information manifested in social media data (e.g., image, video, user profile, time, and location), and thus fail to offer a comprehensive understanding of cyberbullying. Conventionally, when information from different modalities is presented together, it often reveals complementary insights about the application domain and facilitates better learning performance. In this paper, we study the novel problem of cyberbullying detection within a multi-modal context by exploiting social media data in a collaborative way. This task, however, is challenging due to the complex combination of cross-modal correlations among modalities, structural dependencies between different social media sessions, and the diverse attribute information of different modalities. To address these challenges, we propose XBully, a novel cyberbullying detection framework that first reformulates multi-modal social media data as a heterogeneous network and then learns node embedding representations upon it. Extensive experimental evaluations on real-world multi-modal social media datasets show that the XBully framework is superior to the state-of-the-art cyberbullying detection models.
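The heterogeneous-network reformulation described above can be illustrated with a generic DeepWalk-style sketch: build a graph over sessions, users, images, and keywords, then learn node embeddings from random walks. The node names, edge types, and embedding method below are illustrative stand-ins, not the actual XBully model.

```python
# Sketch: multi-modal social-media sessions as a heterogeneous graph, with
# generic random-walk + skip-gram node embeddings. Hypothetical toy data.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.Graph()
# Node types: session, user, image, keyword; edges link a session to its attributes.
G.add_edges_from([
    ("session_1", "user_alice"), ("session_1", "image_42"),
    ("session_1", "kw_insult"),  ("session_2", "user_alice"),
    ("session_2", "kw_threat"),  ("session_3", "user_bob"),
    ("session_3", "image_42"),   ("session_3", "kw_insult"),
])

def random_walks(graph, walks_per_node=10, walk_len=8, seed=0):
    """Uniform random walks used as 'sentences' for skip-gram training."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in graph.nodes():
            walk = [node]
            while len(walk) < walk_len:
                walk.append(rng.choice(list(graph.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

# Skip-gram over the walks yields one embedding per node; session embeddings
# could then feed a downstream cyberbullying classifier.
model = Word2Vec(sentences=random_walks(G), vector_size=32, window=3,
                 min_count=1, sg=1, workers=1, seed=0)
print(model.wv["session_1"].shape)  # (32,)
```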
  3. Wysocki, Bryant T.; Holt, James; Blowers, Misty (Eds.)
    The information era has flourished thanks to the abundance of digital media content distributed through technological broadcasting resources. Among information providers, social media platforms remain a popular venue for the widespread reach of digital content. Along with accessibility and reach, social media platforms are also a major channel for spreading misinformation, since the data is not curated by trusted authorities. With many malicious participants involved, artificially generated media or strategically altered content can undermine the integrity of targeted organizations. Popular content generation tools like DeepFake have allowed perpetrators to create realistic media content by manipulating the targeted subject with a fake identity or actions. Media metadata such as time- and location-based information can be altered to create a false perception of real events. In this work, we propose a Decentralized Electrical Network Frequency (ENF)-based Media Authentication (DEMA) system to verify media metadata and digital multimedia integrity. Leveraging the environmental ENF fingerprint captured by digital media recorders, altered media content is detected by exploiting the ENF's consistency with the claimed time and location of recording, along with its spatial consistency throughout the captured frames. A decentralized and hierarchical ENF map is created as a reference database for time and location verification. For digital media uploaded to a broadcasting service, the proposed DEMA system correlates the underlying ENF fingerprint with the stored ENF map to authenticate the media metadata. With the media metadata intact, the embedded ENF in the recording is compared with a reference ENF based on the time of recording, and a correlation-based metric is used to evaluate the media authenticity. If the metadata is missing, the frames are divided spatially to compare the ENF consistency throughout the recording.
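A minimal sketch of the correlation-based authenticity check described above, assuming an ENF trace has already been extracted from the media and a reference map keyed by claimed region and time is available. The lookup structure, threshold, and signals are illustrative assumptions, not the DEMA implementation.

```python
# ENF authenticity check sketch: compare the ENF trace extracted from a
# recording against a reference trace for the claimed time and region.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical reference map: (region, hour) -> nominal-frequency deviations (Hz).
reference_enf_map = {
    ("us_east", "2023-05-01T10"): 60.0 + 0.02 * rng.standard_normal(600),
}

def enf_correlation(media_enf, reference_enf):
    """Pearson correlation between two equal-length ENF traces."""
    return np.corrcoef(media_enf, reference_enf)[0, 1]

def authenticate(media_enf, claimed_region, claimed_hour, threshold=0.8):
    # Illustrative threshold; a deployed system would calibrate this value.
    reference = reference_enf_map[(claimed_region, claimed_hour)]
    return enf_correlation(media_enf, reference) >= threshold

# A genuine recording embeds a noisy copy of the true ENF; a tampered or
# mislabeled one does not.
true_enf = reference_enf_map[("us_east", "2023-05-01T10")]
genuine  = true_enf + 0.005 * rng.standard_normal(600)
tampered = 60.0 + 0.02 * rng.standard_normal(600)

print(authenticate(genuine,  "us_east", "2023-05-01T10"))  # True (high correlation)
print(authenticate(tampered, "us_east", "2023-05-01T10"))  # False (low correlation)
```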
  4. Over the last decade, facial landmark tracking and 3D reconstruction have gained considerable attention due to their numerous applications, such as human-computer interaction, facial expression analysis, and emotion recognition. Traditional approaches require users to be confined to a particular location and to face a camera under constrained recording conditions (e.g., without occlusions and under good lighting). This highly restricted setting prevents them from being deployed in many application scenarios involving human motion. In this paper, we propose the first single-earpiece lightweight biosensing system, BioFace-3D, that can unobtrusively, continuously, and reliably sense entire facial movements, track 2D facial landmarks, and further render 3D facial animations. Our single-earpiece biosensing system takes advantage of a cross-modal transfer learning model to transfer the knowledge embodied in a high-grade visual facial landmark detection model to the low-grade biosignal domain. After training, BioFace-3D can directly perform continuous 3D facial reconstruction from the biosignals, without any visual input. By removing the need for a camera positioned in front of the user, this paradigm shift from visual sensing to biosensing opens new opportunities in many emerging mobile and IoT applications. Extensive experiments involving 16 participants under various settings demonstrate that BioFace-3D can accurately track 53 major facial landmarks with only 1.85 mm average error and 3.38% normalized mean error, which is comparable with most state-of-the-art camera-based solutions. The rendered 3D facial animations, which are consistent with the real human facial movements, also validate the system's capability for continuous 3D facial reconstruction.
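For reference, the sketch below shows how landmark-error metrics like those quoted above are typically computed; normalizing the mean error by the inter-ocular distance is a common convention and an assumption here, as the paper may use a different normalization factor.

```python
# Landmark-error metrics sketch: mean per-landmark error and a normalized
# mean error (NME). Synthetic landmarks; hypothetical eye-corner indices.
import numpy as np

rng = np.random.default_rng(2)

n_landmarks = 53
gt   = rng.uniform(0, 100, size=(n_landmarks, 2))         # ground-truth (x, y) in mm
pred = gt + rng.normal(scale=1.5, size=(n_landmarks, 2))  # predicted landmarks

per_landmark_err = np.linalg.norm(pred - gt, axis=1)      # Euclidean error per landmark
mean_err = per_landmark_err.mean()

# Hypothetical indices of the two outer eye-corner landmarks.
LEFT_EYE, RIGHT_EYE = 0, 1
inter_ocular = np.linalg.norm(gt[LEFT_EYE] - gt[RIGHT_EYE])
nme = mean_err / inter_ocular

print(f"mean error: {mean_err:.2f} mm, NME: {100 * nme:.2f}%")
```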