Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps listeners get the gist of an episode and decide whether to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube's 'most replayed' feature. We frame podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight language models finetuned with segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models finetuned with in-domain data significantly outperform their zero-shot counterparts. The finetuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges of fine-grained information access in long-form spoken media.
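As a concrete illustration of the segment-level setup described above, here is a minimal, hypothetical sketch of a binary classification head over per-segment embeddings; the class name, dimensions, and toy data are assumptions for illustration, not Rhapsody's released code.

```python
import torch
import torch.nn as nn

class SegmentHighlightHead(nn.Module):
    """Hypothetical head: maps one segment embedding to one highlight logit."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, segment_embeddings: torch.Tensor) -> torch.Tensor:
        # segment_embeddings: (num_segments, hidden_dim), e.g., pooled
        # transcript and/or speech-feature representations per segment.
        return self.classifier(segment_embeddings).squeeze(-1)

head = SegmentHighlightHead()
segments = torch.randn(12, 768)             # 12 episode segments (toy data)
labels = torch.zeros(12); labels[3] = 1.0   # e.g., segment 3 is a highlight
loss = nn.BCEWithLogitsLoss()(head(segments), labels)
probs = torch.sigmoid(head(segments))       # per-segment highlight probabilities
```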
                    
                            
iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022
                        
                    
    
Rumble has emerged as a prominent platform hosting controversial figures facing restrictions on YouTube. Despite this, the academic community's engagement with Rumble has been minimal. To help researchers address this gap, we introduce a comprehensive dataset of about 6.7K podcast videos from August 2020 to December 2022, amounting to over 5.6K hours of content. Besides covering metadata of these podcast videos, we provide speech-to-text transcriptions for future analysis. We also provide speaker diarization information, a collection of 168K unique representative images from podcast videos, and face embeddings of more than 400K extracted faces. With the rising influence of podcasts and populist figures, this dataset provides a rich resource for identifying challenges in cyber social threats in a relatively underexplored space.
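As a hypothetical illustration of how the released face embeddings might be used, the sketch below runs a cosine-similarity search to find recurring faces across videos; the array shapes and random stand-in data are assumptions, not the dataset's actual format.

```python
import numpy as np

# Random stand-in for the released face embeddings (the actual release
# contains 400K+ faces); real data would be loaded from disk instead.
embeddings = np.random.randn(10_000, 512).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = embeddings[0]              # a face of interest
scores = embeddings @ query        # cosine similarity against all faces
top10 = np.argsort(-scores)[:10]   # indices of the ten most similar faces
```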
- Award ID(s): 2046590
- PAR ID: 10586413
- Publisher / Repository: International Workshop on Cyber Social Threats 2024
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We address the problem of human action classification in drone videos. Due to the high cost of capturing and labeling large-scale drone videos with diverse actions, we present unsupervised and semi-supervised domain adaptation approaches that leverage both existing fully annotated action recognition datasets and unannotated (or only sparsely annotated) videos from drones. To study the emerging problem of drone-based action recognition, we create a new dataset, NEC-DRONE, containing 5,250 videos to evaluate the task. We tackle both problem settings with 1) the same and 2) different action label sets for the source (e.g., the Kinetics dataset) and target domains (drone videos). We present a combination of video- and instance-based adaptation methods, paired with either a classifier- or an embedding-based framework, to transfer knowledge from source to target. Our results show that the proposed adaptation approach substantially improves performance on these challenging and practical tasks. We further demonstrate the applicability of our method to learning cross-view action recognition on the Charades-Ego dataset, and provide qualitative analysis to understand the behavior of our approaches. (A minimal adaptation sketch appears after this list.)
- Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open-access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure-to-subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat. (An alignment sketch appears after this list.)
- We present a new multimodal, context-based dataset for continuous authentication. The dataset contains 27 subjects, with an age range of [8, 72], where data has been collected across multiple sessions while the subjects watch videos meant to elicit an emotional response. Collected data includes accelerometer data, heart rate, electrodermal activity, skin temperature, and face videos. We also propose a baseline approach for fair comparisons when using the proposed dataset. The approach combines a pretrained backbone network with a supervised contrastive loss for the face videos; time-series features are extracted from the physiological signals and used for classification. On the proposed dataset, this approach achieves an average accuracy, precision, and recall of 76.59%, 88.90%, and 53.25%, respectively, on the physiological signals, and 90.39%, 98.77%, and 75.71%, respectively, on the face videos. (A sketch of the supervised contrastive loss appears after this list.)
- In this research, we take an innovative approach to the Video Corpus Visual Answer Localization (VCVAL) task using the MedVidQA dataset, extending it with causal inference for medical videos, a novel approach in this field. By leveraging the state-of-the-art GPT-4 and Gemini Pro 1.5 models, the system localizes temporal segments in videos and analyzes cause-effect relationships from subtitles to enhance medical decision-making. This work extends the MedVidQA challenge by introducing causality extraction to improve the interpretability of localized video content. Subtitles are segmented to identify causal units such as cause, effect, condition, action, and signal. Prompts guide GPT-4 and Gemini Pro 1.5 in detecting and quantifying causal structures while analyzing explicit and implicit relationships, including those spanning multiple subtitle fragments. Our results reveal that both models perform better when handling queries individually but face challenges in batch processing for both temporal localization and causality extraction, and preliminary results vary considerably across videos. The successful integration of temporal localization with causal inference can significantly improve the scalability and overall performance of medical video analysis, demonstrating how AI systems can uncover valuable insights from medical videos and drive progress in medical AI applications. (A hypothetical prompting sketch appears after this list.)
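For the drone action-recognition item above, here is a minimal sketch of embedding-level source-to-target transfer, assuming per-clip feature vectors; the alignment objective is an illustrative stand-in, not the paper's exact video- and instance-level adaptation losses.

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 10
classifier = nn.Linear(feat_dim, num_classes)

source_feats = torch.randn(32, feat_dim)              # labeled source clips (e.g., Kinetics)
source_labels = torch.randint(0, num_classes, (32,))
target_feats = torch.randn(32, feat_dim)              # unlabeled drone clips

# Supervised loss on the source domain.
cls_loss = nn.CrossEntropyLoss()(classifier(source_feats), source_labels)
# Align mean feature statistics across domains -- a simple stand-in for
# the paper's adaptation objectives.
align_loss = (source_feats.mean(0) - target_feats.mean(0)).pow(2).sum()
loss = cls_loss + 0.1 * align_loss
```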
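For the MedICaT item, a hedged sketch of subfigure-to-subcaption alignment via a shared embedding space (CLIP-style scoring); the function name and the use of precomputed embeddings are assumptions, not necessarily the paper's method.

```python
import torch

def align_subfigures(subfig_embs: torch.Tensor, subcap_embs: torch.Tensor) -> torch.Tensor:
    # Normalize, then score every subfigure against every subcaption.
    f = subfig_embs / subfig_embs.norm(dim=-1, keepdim=True)
    c = subcap_embs / subcap_embs.norm(dim=-1, keepdim=True)
    similarity = f @ c.T              # (num_subfigs, num_subcaps)
    return similarity.argmax(dim=-1)  # best-matching subcaption per subfigure

# Toy example: 3 subfigures scored against 4 candidate subcaptions.
matches = align_subfigures(torch.randn(3, 512), torch.randn(4, 512))
```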
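For the continuous-authentication item, a minimal sketch of the supervised contrastive loss that abstract names, applied to face embeddings; the temperature, normalization, and toy data are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # features: (batch, dim) face embeddings; labels: (batch,) subject IDs.
    features = F.normalize(features, dim=1)
    sim = features @ features.T / tau
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))   # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0 below
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Average log-probability of same-subject pairs for each anchor.
    per_anchor = (log_prob * positives).sum(1) / positives.sum(1).clamp(min=1)
    return -per_anchor.mean()

loss = supcon_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```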
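For the VCVAL item, a hypothetical sketch of prompting for causal-unit extraction from a subtitle segment; the prompt wording and example subtitle are invented for illustration, and no specific model API is assumed.

```python
subtitle_segment = (
    "If the swelling persists, apply a cold compress to reduce inflammation."
)

prompt = (
    "From the subtitle below, extract causal units as JSON with the keys "
    "cause, effect, condition, action, and signal (use null when absent).\n\n"
    f"Subtitle: {subtitle_segment}"
)
# The abstract reports better results when each query is handled individually,
# so one would send `prompt` to GPT-4 or Gemini Pro 1.5 per subtitle segment.
print(prompt)
```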