skip to main content


Title: Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
We propose a neural-based approach for rule synthesis designed to help bridge the gap between the interpretability, precision and maintainability exhibited by rule-based information extraction systems with the scalability and convenience of statistical information extraction systems. This is achieved by avoiding placing the burden of learning another specialized language on domain experts and instead asking them to provide a small set of examples in the form of highlighted spans of text. We introduce a transformer-based architecture that drives a rule synthesis system that leverages a self-supervised approach for pre-training a large-scale language model complemented by an analysis of different loss functions and aggregation mechanisms for variable length sequences of user-annotated spans of text. The results are encouraging and point to different desirable properties, such as speed and quality, depending on the choice of loss and aggregation method.  more » « less
Award ID(s):
2006583
NSF-PAR ID:
10436178
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Page Range / eLocation ID:
85–93
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hols, Thorsten Holz ; Ristenpart, Thomas (Ed.)
    Automated attack discovery techniques, such as attacker synthesis or model-based fuzzing, provide powerful ways to ensure network protocols operate correctly and securely. Such techniques, in general, require a formal representation of the protocol, often in the form of a finite state machine (FSM). Unfortunately, many protocols are only described in English prose, and implementing even a simple network protocol as an FSM is time-consuming and prone to subtle logical errors. Automatically extracting protocol FSMs from documentation can significantly contribute to increased use of these techniques and result in more robust and secure protocol implementations.In this work we focus on attacker synthesis as a representative technique for protocol security, and on RFCs as a representative format for protocol prose description. Unlike other works that rely on rule-based approaches or use off-the-shelf NLP tools directly, we suggest a data-driven approach for extracting FSMs from RFC documents. Specifically, we use a hybrid approach consisting of three key steps: (1) large-scale word-representation learning for technical language, (2) focused zero-shot learning for mapping protocol text to a protocol-independent information language, and (3) rule-based mapping from protocol-independent information to a specific protocol FSM. We show the generalizability of our FSM extraction by using the RFCs for six different protocols: BGPv4, DCCP, LTP, PPTP, SCTP and TCP. We demonstrate how automated extraction of an FSM from an RFC can be applied to the synthesis of attacks, with TCP and DCCP as case-studies. Our approach shows that it is possible to automate attacker synthesis against protocols by using textual specifications such as RFCs. 
    more » « less
  2. Modern enterprises rely on Data Leakage Prevention (DLP) systems to enforce privacy policies that prevent unintentional flow of sensitive information to unauthorized entities. However, these systems operate based on rule sets that are limited to syntactic analysis and therefore completely ignore the semantic relationships between participants involved in the information exchanges. For similar reasons, these systems cannot enforce complex privacy policies that require temporal reasoning about events that have previously occurred. To address these limitations, we advocate a new design methodology for DLP systems centered on the notion of Contextual Integrity (CI).We use the CI framework to abstract real-world communication exchanges into formally defined information flows where privacy policies describe sequences of admissible flows. CI allows us to decouple (1) the syntactic extraction of flows from information exchanges, and (2) the enforcement of privacy policies on these flows. We applied this approach to built VACCINE, a DLP auditing system for emails. VACCINE uses state-of-the-art techniques in natural language processing to extract flows from email text. It also provides a declarative language for describing privacy policies. These policies are automatically compiled to operational rules that the system uses for detecting data leakages. We evaluated VACCINE on the Enron email corpus and show that it improves over the state of the art both in terms of the expressivity of the policies that DLP systems can enforce as well as its precision in detecting data leakages. 
    more » « less
  3. Tang, P. ; Grau, D. ; El Asmar, M. (Ed.)
    Existing automated code checking (ACC) systems require the extraction of requirements from regulatory textual documents into computer-processable rule representations. The information extraction processes in those ACC systems are based on either human interpretation, manual annotation, or predefined automated information extraction rules. Despite the high performance they showed, rule-based information extraction approaches, by nature, lack sufficient scalability—the rules typically need some level of adaptation if the characteristics of the text change. Machine learning-based methods, instead of relying on hand-crafted rules, automatically capture the underlying patterns of the existing training text and have a great capability of generalizing to a variety of texts. A more scalable, machine learning-based approach is thus needed to achieve a more robust performance across different types of codes/documents for automatically generating semantically-enriched building-code sentences for the purpose of ACC. To address this need, this paper proposes a machine learning-based approach for generating semantically-enriched building-code sentences, which are annotated syntactically and semantically, for supporting IE. For improved robustness and scalability, the proposed approach uses transfer learning strategies to train deep neural network models on both general-domain and domain-specific data. The proposed approach consists of four steps: (1) data preparation and preprocessing; (2) development of a base deep neural network model for generating semantically-enriched building-code sentences; (3) model training using transfer learning strategies; and (4) model evaluation. The proposed approach was evaluated on a corpus of sentences from the 2009 International Building Code (IBC) and the Champaign 2015 IBC Amendments. The preliminary results show that the proposed approach achieved an optimal precision of 88%, recall of 86%, and F1-measure of 87%, indicating good performance. 
    more » « less
  4. Biomedical Entity Linking (BEL) is the task of mapping of spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing for both translational information extraction applications and providing context for downstream tasks like relationship extraction. In this paper, we will survey the progression of BEL from its inception in the late 80s to present day state of the art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field. 
    more » « less
  5. Abstract Background Diabetic retinopathy (DR) is a leading cause of blindness in American adults. If detected, DR can be treated to prevent further damage causing blindness. There is an increasing interest in developing artificial intelligence (AI) technologies to help detect DR using electronic health records. The lesion-related information documented in fundus image reports is a valuable resource that could help diagnoses of DR in clinical decision support systems. However, most studies for AI-based DR diagnoses are mainly based on medical images; there is limited studies to explore the lesion-related information captured in the free text image reports. Methods In this study, we examined two state-of-the-art transformer-based natural language processing (NLP) models, including BERT and RoBERTa, compared them with a recurrent neural network implemented using Long short-term memory (LSTM) to extract DR-related concepts from clinical narratives. We identified four different categories of DR-related clinical concepts including lesions, eye parts, laterality, and severity, developed annotation guidelines, annotated a DR-corpus of 536 image reports, and developed transformer-based NLP models for clinical concept extraction and relation extraction. We also examined the relation extraction under two settings including ‘gold-standard’ setting—where gold-standard concepts were used–and end-to-end setting. Results For concept extraction, the BERT model pretrained with the MIMIC III dataset achieve the best performance (0.9503 and 0.9645 for strict/lenient evaluation). For relation extraction, BERT model pretrained using general English text achieved the best strict/lenient F1-score of 0.9316. The end-to-end system, BERT_general_e2e, achieved the best strict/lenient F1-score of 0.8578 and 0.8881, respectively. Another end-to-end system based on the RoBERTa architecture, RoBERTa_general_e2e, also achieved the same performance as BERT_general_e2e in strict scores. Conclusions This study demonstrated the efficiency of transformer-based NLP models for clinical concept extraction and relation extraction. Our results show that it’s necessary to pretrain transformer models using clinical text to optimize the performance for clinical concept extraction. Whereas, for relation extraction, transformers pretrained using general English text perform better. 
    more » « less