skip to main content


Title: ScenarioNet: An Interpretable Data-Driven Model for Scene Understanding
The ability for computational agents to reason about the high-level content of real world scene images is important for many applications. Existing attempts at complex scene understanding lack representational power, efficiency, and the ability to create robust meta- knowledge about scenes. We introduce scenarios as a new way of representing scenes. The scenario is an interpretable, low-dimensional, data-driven representation consisting of sets of frequently co-occurring objects that is useful for a wide range of scene under- standing tasks. Scenarios are learned from data using a novel matrix factorization method which is integrated into a new neural network architecture, the Scenari-oNet. Using ScenarioNet, we can recover semantic in- formation about real world scene images at three levels of granularity: 1) scene categories, 2) scenarios, and 3) objects. Training a single ScenarioNet model enables us to perform scene classification, scenario recognition, multi-object recognition, content-based scene image retrieval, and content-based image comparison. ScenarioNet is efficient because it requires significantly fewer parameters than other CNNs while achieving similar performance on benchmark tasks, and it is interpretable because it produces evidence in an understandable format for every decision it makes. We validate the utility of scenarios and ScenarioNet on a diverse set of scene understanding tasks on several benchmark datasets.  more » « less
Award ID(s):
1747778
NSF-PAR ID:
10105316
Author(s) / Creator(s):
;
Date Published:
Journal Name:
IJCAI Workshop on Explainable Artificial Intelligence (XAI) 2018
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Segmentation of moving objects in dynamic scenes is a key process in scene understanding for navigation tasks. Classical cameras suffer from motion blur in such scenarios rendering them effete. On the contrary, event cameras, because of their high temporal resolution and lack of motion blur, are tailor-made for this problem. We present an approach for monocular multi-motion segmentation, which combines bottom-up feature tracking and top-down motion compensation into a unified pipeline, which is the first of its kind to our knowledge. Using the events within a time-interval, our method segments the scene into multiple motions by splitting and merging. We further speed up our method by using the concept of motion propagation and cluster keyslices.The approach was successfully evaluated on both challenging real-world and synthetic scenarios from the EV-IMO, EED, and MOD datasets and outperformed the state-of-the-art detection rate by 12%, achieving a new state-of-the-art average detection rate of 81.06%, 94.2% and 82.35% on the aforementioned datasets. To enable further research and systematic evaluation of multi-motion segmentation, we present and open-source a new dataset/benchmark called MOD++, which includes challenging sequences and extensive data stratification in-terms of camera and object motion, velocity magnitudes, direction, and rotational speeds. 
    more » « less
  2. Background: Drivers gather most of the information they need to drive by looking at the world around them and at visual displays within the vehicle. Navigation systems automate the way drivers navigate. In using these systems, drivers offload both tactical (route following) and strategic aspects (route planning) of navigational tasks to the automated SatNav system, freeing up cognitive and attentional resources that can be used in other tasks (Burnett, 2009). Despite the potential benefits and opportunities that navigation systems provide, their use can also be problematic. For example, research suggests that drivers using SatNav do not develop as much environmental spatial knowledge as drivers using paper maps (Waters & Winter, 2011; Parush, Ahuvia, & Erev, 2007). With recent growth and advances of augmented reality (AR) head-up displays (HUDs), there are new opportunities to display navigation information directly within a driver’s forward field of view, allowing them to gather information needed to navigate without looking away from the road. While the technology is promising, the nuances of interface design and its impacts on drivers must be further understood before AR can be widely and safely incorporated into vehicles. Specifically, an impact that warrants investigation is the role of AR HUDS in spatial knowledge acquisition while driving. Acquiring high levels of spatial knowledge is crucial for navigation tasks because individuals who have greater levels of spatial knowledge acquisition are more capable of navigating based on their own internal knowledge (Bolton, Burnett, & Large, 2015). Moreover, the ability to develop an accurate and comprehensive cognitive map acts as a social function in which individuals are able to navigate for others, provide verbal directions and sketch direction maps (Hill, 1987). Given these points, the relationship between spatial knowledge acquisition and novel technologies such as AR HUDs in driving is a relevant topic for investigation. Objectives: This work explored whether providing conformal AR navigational cues improves spatial knowledge acquisition (as compared to traditional HUD visual cues) to assess the plausibility and justification for investment in generating larger FOV AR HUDs with potentially multiple focal planes. Methods: This study employed a 2x2 between-subjects design in which twenty-four participants were counterbalanced by gender. We used a fixed base, medium fidelity driving simulator for where participants drove while navigating with one of two possible HUD interface designs: a world-relative arrow post sign and a screen-relative traditional arrow. During the 10-15 minute drive, participants drove the route and were encouraged to verbally share feedback as they proceeded. After the drive, participants completed a NASA-TLX questionnaire to record their perceived workload. We measured spatial knowledge at two levels: landmark and route knowledge. Landmark knowledge was assessed using an iconic recognition task, while route knowledge was assessed using a scene ordering task. After completion of the study, individuals signed a post-trial consent form and were compensated $10 for their time. Results: NASA-TLX performance subscale ratings revealed that participants felt that they performed better during the world-relative condition but at a higher rate of perceived workload. However, in terms of perceived workload, results suggest there is no significant difference between interface design conditions. Landmark knowledge results suggest that the mean number of remembered scenes among both conditions is statistically similar, indicating participants using both interface designs remembered the same proportion of on-route scenes. Deviance analysis show that only maneuver direction had an influence on landmark knowledge testing performance. Route knowledge results suggest that the proportion of scenes on-route which were correctly sequenced by participants is similar under both conditions. Finally, participants exhibited poorer performance in the route knowledge task as compared to landmark knowledge task (independent of HUD interface design). Conclusions: This study described a driving simulator study which evaluated the head-up provision of two types of AR navigation interface designs. The world-relative condition placed an artificial post sign at the corner of an approaching intersection containing a real landmark. The screen-relative condition displayed turn directions using a screen-fixed traditional arrow located directly ahead of the participant on the right or left side on the HUD. Overall results of this initial study provide evidence that the use of both screen-relative and world-relative AR head-up display interfaces have similar impact on spatial knowledge acquisition and perceived workload while driving. These results contrast a common perspective in the AR community that conformal, world-relative graphics are inherently more effective. This study instead suggests that simple, screen-fixed designs may indeed be effective in certain contexts. 
    more » « less
  3. As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (COSIM) which is designed to evaluate the ability of AI systems to reason about scene change imagination. In this task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, environment change, etc.) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging counterfactual, scene imagination task. 
    more » « less
  4. Single-photon sensitive image sensors have recently gained popularity in passive imaging applications where the goal is to capture photon flux (brightness) values of different scene points in the presence of challenging lighting conditions and scene motion. Recent work has shown that high-speed bursts of single-photon timestamp information captured using a single-photon avalanche diode camera can be used to estimate and correct for scene motion thereby improving signal-to-noise ratio and reducing motion blur artifacts. We perform a comparison of various design choices in the processing pipeline used for noise reduction, motion compensation, and upsampling of single-photon timestamp frames. We consider various pixelwise noise reduction techniques in combination with state-of-the-art deep neural network upscaling algorithms to super-resolve intensity images formed with single-photon timestamp data. We explore the trade space of motion blur and signal noise in various scenes with different motion content. Using real data captured with a hardware prototype, we achieved superresolution reconstruction at frame rates up to 65.8 kHz (native sampling rate of the sensor) and captured videos of fast-moving objects. The best reconstruction is obtained with the motion compensation approach, which achieves a structural similarity (SSIM) of about 0.67 for fast moving rigid objects. We are able to reconstruct subpixel resolution. These results show the relative superiority of our motion compensation compared to other approaches that do not exceed an SSIM of 0.5. 
    more » « less
  5. Mobile Augmented Reality (AR), which overlays digital information with real-world scenes surrounding a user, provides an enhanced mode of interaction with the ambient world. Contextual AR applications rely on image recognition to identify objects in the view of the mobile device. In practice, due to image distortions and device resource constraints, achieving high performance image recognition for AR is challenging. Recent advances in edge computing offer opportunities for designing collaborative image recognition frameworks for AR. In this demonstration, we present CollabAR, an edge-assisted collaborative image recognition framework. CollabAR allows AR devices that are facing the same scene to collaborate on the recognition task. Demo participants develop an intuition for different image distortions and their impact on image recognition accuracy. We showcase how heterogeneous images taken by different users can be aggregated to improve recognition accuracy and provide a better user experience in AR. 
    more » « less