skip to main content

This content will become publicly available on October 16, 2024

Title: Data Extraction via Semantic Regular Expression Synthesis

Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Page Range / eLocation ID:
1848 to 1877
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically derived sketches. We evaluate on two prior datasets (Kushman and Barzilay 2013 ; Locascio et al. 2016 ) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on. 1 
    more » « less
  2. Regular expressions are a popular target for programming by example (PBE) systems, which seek to learn regexes from user-provided examples. Synthesizing from only positive examples remains an unsolved challenge, as the unrestricted search space makes it difficult to avoid over- and under- generalizing. Prior work has approached this in two ways: search-based techniques which require extra input, such as user feedback and/or a natural language description, and neural techniques. The former puts an extra burden on the user, while the latter requires large representative training data sets which are almost nonexistent for this domain. To tackle this challenge we present Regex+, a search-based syn- thesizer that infers regexes from just a few positive examples. Regex+ avoids over/under-generalization by using minimum description length (MDL) learning, adapted to version space algebras in order to efficiently search for an optimal regex according to a compositional MDL ranking function. Our evaluation shows that Regex+ more than triples the accu- racy of existing neural and search-based regex synthesizers on benchmarks with only positive examples 
    more » « less
  3. Although there are tools to help developers understand the matching behaviors between a regular expression and a string, regular-expression related faults are still common. Learning developers’ behavior through the change history of regular expressions can identify common edit patterns, which can inform the creation of mutation and repair operators to assist with testing and fixing regular expressions. In this work, we explore how regular expressions evolve over time, focusing on the characteristics of regular expression edits, the syntactic and semantic difference of the edits, and the feature changes of edits. Our exploration uses two datasets. First, we look at GitHub projects that have a regular expression in their current version and look back through the commit logs to collect the regular expressions’ edit history. Second, we collect regular expressions composed by study participants during problem- solving tasks. Our results show that 1) 95% of the regular expressions from GitHub are not edited, 2) most edited regular expressions have a syntactic distance of 4-6 characters from their predecessors, 3) over 50% of the edits in GitHub tend to expand the scope of regular expression, and 4) the number of features used indicates the regular expression language usage increases over time. This work has implications for supporting regular expression repair and mutation to ensure test suite quality. 
    more » « less
  4. The commonsense natural language inference (CNLI) tasks aim to select the most likely follow-up statement to a contextual description of ordinary, everyday events and facts. Current approaches to transfer learning of CNLI models across tasks require many labeled data from the new task. This paper presents a way to reduce this need for additional annotated training data from the new task by leveraging symbolic knowledge bases, such as ConceptNet. We formulate a teacher-student framework for mixed symbolic-neural reasoning, with the large-scale symbolic knowledge base serving as the teacher and a trained CNLI model as the student. This hybrid distillation process involves two steps. The first step is a symbolic reasoning process. Given a collection of unlabeled data, we use an abductive reasoning framework based on Grenander's pattern theory to create weakly labeled data. Pattern theory is an energy-based graphical probabilistic framework for reasoning among random variables with varying dependency structures. In the second step, the weakly labeled data, along with a fraction of the labeled data, is used to transfer-learn the CNLI model into the new task. The goal is to reduce the fraction of labeled data required. We demonstrate the efficacy of our approach by using three publicly available datasets (OpenBookQA, SWAG, and HellaSWAG) and evaluating three CNLI models (BERT, LSTM, and ESIM) that represent different tasks. We show that, on average, we achieve 63% of the top performance of a fully supervised BERT model with no labeled data. With only 1000 labeled samples, we can improve this performance to 72%. Interestingly, without training, the teacher mechanism itself has significant inference power. The pattern theory framework achieves 32.7% accuracy on OpenBookQA, outperforming transformer-based models such as GPT (26.6%), GPT-2 (30.2%), and BERT (27.1%) by a significant margin. We demonstrate that the framework can be generalized to successfully train neural CNLI models using knowledge distillation under unsupervised and semi-supervised learning settings. Our results show that it outperforms all unsupervised and weakly supervised baselines and some early supervised approaches, while offering competitive performance with fully supervised baselines. Additionally, we show that the abductive learning framework can be adapted for other downstream tasks, such as unsupervised semantic textual similarity, unsupervised sentiment classification, and zero-shot text classification, without significant modification to the framework. Finally, user studies show that the generated interpretations enhance its explainability by providing key insights into its reasoning mechanism. 
    more » « less
  5. Regular expressions (regexes) are ubiquitous in modern software. There is a variety of implementation techniques for regex matching, which can be roughly categorized as (1) relying on backtracking search, or (2) being based on finite-state automata. The implementations that use backtracking are often chosen due to their ability to support advanced pattern-matching constructs. Unfortunately, they are known to suffer from severe performance problems. For some regular expressions, the running time for matching can be exponential in the size of the input text. In order to provide stronger guarantees of matching efficiency, automata-based regex matching is the preferred choice. However, even these regex engines may exhibit severe performance degradation for some patterns. The main reason for this is that regexes used in practice are not exclusively built from the classical regular constructs, i.e., concatenation, nondeterministic choice and Kleene's star. They involve additional constructs that provide succinctness and convenience of expression. The most common such construct is bounded repetition (also called counting), which describes the repetition of the pattern a fixed number of times. In this paper, we propose a new algorithm for the efficient matching of regular expressions that involve bounded repetition. Our algorithms are based on a new model of automata, which we call nondeterministic bit vector automata (NBVA). This model is chosen to be expressively equivalent to nondeterministic counter automata with bounded counters, a very natural model for expressing patterns with bounded repetition. We show that there is a class of regular expressions with bounded repetition that can be matched in time that is independent from the repetition bounds. Our algorithms are general enough to cover the vast majority of challenging bounded repetitions that arise in practice. We provide an implementation of our approach in a regex engine, which we call BVA-Scan. We compare BVA-Scan against state-of-the-art regex engines on several real datasets. 
    more » « less