

Title: MAttNet: Modular Attention Network for Referring Expression Comprehension
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.
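As a minimal sketch of the score-combination step the abstract describes, the snippet below mixes the subject, location, and relationship module scores with language-driven weights. The names, shapes, and the single linear projection standing in for the language attention network are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mattnet_score(expr_embedding, W_module, subj_score, loc_score, rel_score):
    """Combine per-module matching scores with language-driven module weights.

    expr_embedding : (d,) pooled embedding of the referring expression
    W_module       : (3, d) projection mapping the expression to three logits
    subj/loc/rel_score : scalar matching scores from the subject, location,
                         and relationship modules for one candidate region
    All names and shapes here are illustrative assumptions.
    """
    # The language side decides how much each module matters for this
    # expression, e.g. "the red shirt" should up-weight the subject module.
    w_subj, w_loc, w_rel = softmax(W_module @ expr_embedding)
    # Overall score is the dynamically weighted sum of the module scores.
    return w_subj * subj_score + w_loc * loc_score + w_rel * rel_score

# Toy usage: score one candidate region given an 8-dim expression embedding.
rng = np.random.default_rng(0)
score = mattnet_score(rng.normal(size=8), rng.normal(size=(3, 8)), 0.9, 0.2, 0.1)
print(score)
```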
Award ID(s):
1633295
NSF-PAR ID:
10066897
Author(s) / Creator(s):
Licheng Yu; Zhe Lin; Xiaohui Shen; Jimei Yang; Xin Lu; Mohit Bansal; Tamara L. Berg
Date Published:
2018
Journal Name:
IEEE Conference on Computer Vision and Pattern Recognition
ISSN:
2163-6648
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer's feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets.
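As a rough illustration of the joint training this abstract describes, the sketch below folds the three modules' objectives into one scalar loss. The loss forms, weighting scheme, and names are assumptions rather than the paper's formulation.

```python
def joint_loss(speaker_loss, listener_loss, reinforcer_reward,
               lambda_listener=1.0, lambda_reward=1.0):
    """Single training objective for a speaker-listener-reinforcer setup.

    speaker_loss      : generation loss for the speaker (e.g. word-level
                        cross-entropy on ground-truth expressions)
    listener_loss     : comprehension loss for the listener (e.g. a ranking
                        loss over candidate objects)
    reinforcer_reward : reward on sampled expressions; subtracting it means
                        more discriminative samples lower the total loss
    All forms, weights, and names here are illustrative assumptions.
    """
    return (speaker_loss
            + lambda_listener * listener_loss
            - lambda_reward * reinforcer_reward)
```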
  2. Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either aim only to tackle VL tasks such as image-text retrieval, visual question answering (VQA), and image captioning that test high-level understanding of images, or target only region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is pre-trained either only on image-text data or only on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both kinds of data efficiently: (i) coarse-grained pre-training based on image-text data, followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use orders of magnitude more data. Code is released at https://github.com/microsoft/FIBER.
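The sketch below illustrates the fusion-in-the-backbone idea: cross-attention inserted inside the uni-modal backbones so image and text features interact at every block, rather than only in dedicated fusion layers stacked on top. The dimensions, the residual wiring, and the block placement are assumptions and do not reproduce FIBER's actual architecture; see the released code for that.

```python
import torch
from torch import nn

class FusionBlock(nn.Module):
    """One backbone block with cross-attention fusion inserted.

    A minimal sketch: each modality's tokens attend to the other modality
    inside the backbone, with residual connections keeping the uni-modal
    pathway intact. Sizes and placement are illustrative assumptions.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # Image tokens query the text tokens, and vice versa.
        img_fused, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_fused, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        return img_feats + img_fused, txt_feats + txt_fused

# Toy usage: 10 image tokens and 6 text tokens for a batch of 2.
img, txt = torch.randn(2, 10, 256), torch.randn(2, 6, 256)
img_out, txt_out = FusionBlock()(img, txt)
```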
  3. Referring expression comprehension aims to localize objects identified by natural language descriptions. This is a challenging task as it requires understanding of both the visual and language domains. One inherent property is that each object can be described by many synonymous sentences and paraphrases, and such variety in language has a critical impact on learning a comprehension model. While prior work usually treats each sentence separately when attending it to an object, we focus on learning a referring expression comprehension model that accounts for this property of synonymous sentences. To this end, we develop an end-to-end trainable framework to learn contrastive features at the image and object instance levels, where features extracted from synonymous sentences describing the same object should be closer to each other after mapping to the visual domain. We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets, and demonstrate that our method performs favorably against state-of-the-art approaches. Furthermore, since the variety of expressions grows across datasets that describe objects in different ways, we present cross-dataset and transfer learning settings to validate the transferability of our learned features.
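A minimal sketch of the contrastive idea above, assuming an InfoNCE-style loss: features from synonymous sentences describing the same object act as positives to be pulled together, while features for other objects act as negatives. The paper's exact loss, feature extraction, and sampling may differ.

```python
import torch
import torch.nn.functional as F

def synonym_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Pull synonymous-sentence features together, push others apart.

    anchor    : (d,)   feature of one sentence mapped to the visual domain
    positives : (p, d) features from synonymous sentences, same object
    negatives : (n, d) features for other objects / instances
    The loss form and names are illustrative assumptions.
    """
    pos = F.cosine_similarity(anchor.unsqueeze(0), positives) / temperature
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives) / temperature
    # Each positive competes against all negatives; the positive sits at
    # column 0, so the target class for every row is 0.
    logits = torch.cat([pos.unsqueeze(1),
                        neg.unsqueeze(0).expand(len(pos), -1)], dim=1)
    labels = torch.zeros(len(pos), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage with random 16-dim features.
loss = synonym_contrastive_loss(torch.randn(16), torch.randn(3, 16), torch.randn(8, 16))
```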
  4. Nowadays, cyberattack incidents happen on a daily basis. As a result, the demand for a larger and more highly skilled cybersecurity workforce keeps growing. To meet this demand, academic institutions have added cybersecurity courses and degree programs to their curricula; however, more effort is needed to close the gap between the workforce shortage and the number of qualified graduates available to fill open positions. We approach this problem by introducing cybersecurity concepts at an early stage of the undergraduate computer science and engineering curricula. Secure programming is critical, as many cybersecurity incidents stem from software vulnerabilities, yet most undergraduate-level programming courses pay little attention to secure programming practices. As a result, many students graduate with limited knowledge of the security vulnerabilities that might plague the software they develop. Our goal in this work is to introduce secure programming in introductory-level programming courses so that students become aware of cybersecurity issues and carry this security mindset into advanced courses and projects in their degree programs. To accomplish this goal, we developed intuitive and interactive modules emphasizing secure programming in C++ and Java courses to help students become secure software developers. The modules are used alongside the coursework to highlight specific vulnerabilities within the programming environment of each language and to give students a solid foundation in cybersecurity topics. We developed the modules for C++ and Java because they are among the most popular languages used in introductory programming courses. While designing the modules, we kept in mind that the topics must be relevant to real-world issues in the software industry, and we used established resources and benchmarks, including the Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE), to ensure the authenticity of the chosen topics. We also imposed some constraints: for example, the topics must be introductory and easy to understand, since the modules are geared toward freshman- and sophomore-level undergraduate students who have just started programming. Each security module has four components: PowerPoint slides, a lab description, a code template for the lab, and a complete solution; the solution is provided to instructors who adopt the modules so they can check students' work. The modules developed for the C++ course include labs on input validation, integer overflow, random number generation, function calls with incorrect argument types, and dangling pointers. For Java, we developed lab modules on input validation, integer overflow, null object references, random number generation, and data encapsulation.
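As a language-neutral illustration of the input-validation topic covered by the modules (the modules themselves target C++ and Java), here is a hypothetical Python sketch of validating untrusted input before use; the function and its parameters are invented for illustration.

```python
def read_port(raw: str) -> int:
    """Validate untrusted input before using it: the core lesson of the
    input-validation labs described above. (This Python sketch is
    illustrative; the actual course modules target C++ and Java.)
    """
    if not raw.isdigit():                # reject non-numeric input outright
        raise ValueError("port must be a non-negative integer")
    port = int(raw)
    if not 1 <= port <= 65535:           # enforce the valid range, not just the type
        raise ValueError("port must be in 1..65535")
    return port

print(read_port("8080"))   # ok
# read_port("99999")       # raises ValueError: out of range
```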
  5. For robots to be useful for real-world applications, they must be safe around humans, be adaptable to their environment, and operate in an untethered manner. Soft robots could potentially meet these requirements; however, existing soft robotic architectures are limited in their ability to scale to human sizes and to operate at these scales without a tether to transmit power or pressurized air from an external source. Here, we report an untethered, inflated robotic truss, composed of thin-walled inflatable tubes, capable of shape change by continuously relocating its joints while its total edge length remains constant. Specifically, a set of identical roller modules each pinch the tube to create an effective joint that separates two edges, and modules can be connected to form complex structures. Driving a roller module along a tube changes the overall shape, lengthening one edge and shortening another, while the total edge length and hence fluid volume remain constant. This isoperimetric behavior allows the robot to operate without compressing air or requiring a tether. Our concept brings together advantages from three distinct types of robots (soft, collective, and truss-based) while overcoming certain limitations of each. Our robots are robust and safe, like soft robots, but not limited by a tether; are modular, like collective robots, but not limited by complex subunits; and are shape-changing, like truss robots, but not limited by rigid linear actuators. We demonstrate two-dimensional (2D) robots capable of shape change and a human-scale 3D robot capable of punctuated rolling locomotion and manipulation, all constructed with the same modular rollers and operating without a tether.
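A toy sketch of the isoperimetric constraint described above: driving a roller lengthens one edge and shortens its neighbor by the same amount, so the total edge length (and hence the sealed tube's fluid volume) is conserved. The 1D edge-list model and all names are illustrative assumptions.

```python
def move_roller(edges, joint, delta):
    """Relocate a joint by driving a roller module along the tube.

    The edge before the joint lengthens by `delta` while the edge after it
    shortens by the same amount, so the total edge length is unchanged.
    A toy model of the isoperimetric behavior; names are illustrative.
    """
    edges = list(edges)
    edges[joint] += delta
    edges[joint + 1] -= delta
    assert min(edges) > 0, "a roller cannot move past a neighboring joint"
    return edges

before = [1.0, 2.0, 1.5]
after = move_roller(before, joint=0, delta=0.4)   # -> [1.4, 1.6, 1.5]
assert abs(sum(before) - sum(after)) < 1e-9       # total length conserved
```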