Title: MLRegTest: A benchmark for the machine learning of regular languages [DATASET]
MLRegTest is a benchmark for machine learning systems on sequence classification, which contains training, development, and test sets from 1,800 regular languages. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies.
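As a rough sketch of how a benchmark organized this way can be navigated programmatically, the snippet below enumerates the logic-level-by-literal-kind grid described in the abstract and reads one train/dev/test split. The directory layout, file naming, and tab-separated format are assumptions made for illustration only; consult the documentation in the Dryad package for the actual structure.

    # Illustrative sketch only: the file names and two-column TSV layout
    # below are assumptions, not the documented structure of the package.
    from itertools import product
    from pathlib import Path

    LOGIC_LEVELS = ["monadic second order", "first order", "propositional", "monomial"]
    LITERAL_KINDS = ["string", "tier-string", "subsequence"]  # the benchmark also combines these

    def language_classes():
        """Enumerate the logic-level x literal-kind grid used to organize the 1,800 languages."""
        return list(product(LOGIC_LEVELS, LITERAL_KINDS))

    def load_split(data_dir, language_id, split):
        """Read one split ("train", "dev", or "test") as (string, label) pairs.

        Assumes a hypothetical tab-separated file named <language_id>_<split>.txt;
        adjust to the format documented in the dataset's README.
        """
        path = Path(data_dir) / f"{language_id}_{split}.txt"  # hypothetical naming
        pairs = []
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                string, label = line.rstrip("\n").split("\t")
                pairs.append((string, label))
        return pairs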
Award ID(s):
2125295
PAR ID:
10615154
Author(s) / Creator(s):
Publisher / Repository:
Dryad
Date Published:
Subject(s) / Keyword(s):
sequence classification; Supervised machine learning; regular languages; logical complexity; Recurrent neural networks; FOS: Computer and information sciences
Format(s):
Medium: X Size: 106406182796 bytes
Size(s):
106406182796 bytes
Right(s):
Creative Commons Zero v1.0 Universal
Sponsoring Org:
National Science Foundation
More Like this
  1. Edwards, Peter (Ed.)
    Game theory is used by all behavioral sciences, but its development has long centered on the economic interpretation of equilibrium outcomes in relatively simple games and toy systems. Game theory has another potential use, however: the high-level design of large game compositions that express complex architectures and represent real-world institutions faithfully. Compositional game theory, grounded in the mathematics underlying programming languages, and introduced here as a general computational framework, increases the parsimony of game representations through abstraction and modularity, accelerates search and design, and helps theorists across disciplines express real-world institutional complexity in well-defined ways. Relative to existing approaches in game theory, compositional game theory is especially promising for solving game systems with long-range dependencies, for comparing large numbers of structurally related games, and for nesting games into the larger logical or strategic flows typical of real-world policy or institutional systems.
  2. Semantic automata were developed to compare the complexity of generalized quantifiers in terms of the string languages that describe their truth conditions. An important point that has gone unnoticed so far is that these string languages are remarkably simple for most quantifiers, in particular those that can be realized by a single lexical item. Whereas complex quantifiers such as "an even number of" correspond to specific regular languages, the lexical quantifiers "every", "no", and "some", as well as numerals, do not reach this level of complexity. Instead, they all stay close to the bottom of the so-called subregular hierarchy. What is more, the class of tier-based strictly local languages provides a remarkably tight characterization of the class of lexical quantifiers. A significant number of recent publications have also argued for the central role of tier-based strict locality in phonology, morphology, and syntax. This suggests that subregularity in general, and tier-based strict locality in particular, may be a unifying property of natural language across all its submodules. [A toy illustration of these string languages appears after this list.]
  3. This paper investigates bounds on the generative capacity of prosodic processes, by focusing on the complexity of recursive prosody in coordination contexts in English (Wagner, 2010). Although all phonological processes and most prosodic processes are computationally regular string languages, we show that recursive prosody is not. The output string language is instead parallel multiple context-free (Seki et al., 1991). We evaluate the complexity of the pattern over strings, and then move on to a characterization over trees that requires the expressivity of multi bottom-up tree transducers. In doing so, we provide a foundation for future mathematically grounded investigations of the syntax-prosody interface. 
  4. Regular expressions (regexps) are a convenient way for programmers to express complex string searching logic. Several popular programming languages expose an interface to a regexp matching subsystem, either by language-level primitives or through standard libraries. The implementations behind these matching systems vary greatly in their capabilities and running-time characteristics. In particular, backtracking matchers may exhibit worst-case running time that is either linear, polynomial, or exponential in the length of the string being searched. Such super-linear worst-case regexps expose applications to Regular Expression Denial-of-Service (ReDoS) when inputs can be controlled by an adversarial attacker. In this work, we investigate the impact of ReDoS in backtracking engines, a popular type of engine used by most programming languages. We evaluate several existing tools against a dataset of broadly collected regexps, and find that despite extensive theoretical work in this field, none are able to achieve both high precision and high recall. To address this gap in existing work, we develop REGULATOR, a novel dynamic, fuzzer-based analysis system for identifying regexps vulnerable to ReDoS. We implement this system by directly instrumenting a popular backtracking regexp engine, which increases the scope of supported regexp syntax and features over prior work. Finally, we evaluate this system against three common regexp datasets, and demonstrate a seven-fold increase in true positives discovered when comparing against existing tools. [A minimal demonstration of catastrophic backtracking appears after this list.]
  5. Compositional compiler verification is a difficult problem that focuses on separate compilation of program components with possibly different verified compilers. Logical relations are widely used in proving correctness of program transformations in higher-order languages; however, they do not scale to compositional verification of multi-pass compilers due to their lack of transitivity. The only known technique that applies to compositional verification of multi-pass compilers for higher-order languages is parametric inter-language simulations (PILS), which is, however, significantly more complicated than traditional proof techniques for compiler correctness. In this paper, we present a novel verification framework for lightweight compositional compiler correctness. We demonstrate that by imposing the additional restriction that program components are compiled by pipelines that go through the same sequence of intermediate representations, logical relation proofs can be transitively composed in order to derive an end-to-end compositional specification for multi-pass compiler pipelines. Unlike traditional logical-relation frameworks, our framework supports divergence preservation, even when transformations reduce the number of program steps. We achieve this by parameterizing our logical relations with a pair of relational invariants. We apply this technique to verify a multi-pass, optimizing middle-end pipeline for CertiCoq, a compiler from Gallina (Coq's specification language) to C. The pipeline optimizes and closure-converts an untyped functional intermediate language (ANF or CPS) to a subset of that language without nested functions, which can be easily code-generated to low-level languages. Notably, our pipeline performs more complex closure-allocation optimizations than the state of the art in verified compilation. Using our novel verification framework, we prove an end-to-end theorem for our pipeline that covers both termination and divergence and applies to whole-program and separate compilation, even when different modules are compiled with different optimizations. Our results are mechanized in the Coq proof assistant. [A toy illustration of chaining per-pass correctness relations appears after this list.]
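For the second item above (semantic automata and lexical quantifiers), the sketch below illustrates how the truth conditions of "every", "no", and "some" can be stated as tier-based strictly 2-local constraints over the standard semantic-automata encoding, in which each A-individual in "Q A B" is written "1" if it is also a B and "0" otherwise. The acceptor and the particular tier grammars are illustrative assumptions, not taken from the paper.

    def tsl2_accepts(string, tier, forbidden):
        """Accept iff no forbidden bigram occurs in the boundary-marked tier projection."""
        projection = [">"] + [symbol for symbol in string if symbol in tier] + ["<"]
        return all(pair not in forbidden for pair in zip(projection, projection[1:]))

    QUANTIFIERS = {
        # "every A is B": no A-individual falls outside B, i.e. no "0" may occur.
        "every": ({"0"}, {(">", "0")}),
        # "no A is B": no "1" may occur.
        "no": ({"1"}, {(">", "1")}),
        # "some A is B": at least one "1" must occur; forbidding an empty tier
        # projection is exactly where the tier does work that plain strict
        # locality cannot.
        "some": ({"1"}, {(">", "<")}),
    }

    for name, (tier, forbidden) in QUANTIFIERS.items():
        for sample in ["111", "101", "000", ""]:
            print(name, repr(sample), tsl2_accepts(sample, tier, forbidden))

By contrast, a quantifier like "an even number of" must track the parity of the "1"s, which no finite set of forbidden tier factors can express, matching the contrast the abstract draws between lexical and complex quantifiers.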
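For the fourth item (ReDoS in backtracking engines), here is a minimal demonstration of super-linear matching using Python's backtracking `re` module rather than the instrumented engine the paper targets. The pattern r"(a+)+$" forces the matcher to retry every way of splitting the run of "a"s once the trailing "b" makes the overall match fail, so the running time grows roughly exponentially in the input length.

    # Minimal catastrophic-backtracking demonstration; expect the last few
    # iterations to take noticeably longer than the first.
    import re
    import time

    pattern = re.compile(r"(a+)+$")

    for n in range(16, 25, 2):
        subject = "a" * n + "b"             # the trailing "b" guarantees failure
        start = time.perf_counter()
        pattern.match(subject)              # backtracks over every split of the "a" run
        elapsed = time.perf_counter() - start
        print(f"n={n:2d}  {elapsed:.3f}s")  # time roughly doubles per extra "a"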
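For the fifth item (lightweight compositional compiler correctness), the following deliberately tiny sketch illustrates only the structural point that per-pass correctness relations chain into an end-to-end one when every pass stays within the same intermediate representation. The IR, the two passes, and the extensional "same value" relation are invented for illustration; they are nothing like the step-indexed logical relations, divergence preservation, or CertiCoq pipeline in the paper.

    # Toy IR: ("lit", n), ("var", name), or ("add", e1, e2).
    def evaluate(expr, env):
        """Reference semantics shared by the source and target of every pass."""
        tag = expr[0]
        if tag == "lit":
            return expr[1]
        if tag == "var":
            return env[expr[1]]
        _, left, right = expr
        return evaluate(left, env) + evaluate(right, env)

    def fold_constants(expr):
        """Pass 1: collapse additions of two literals."""
        if expr[0] in ("lit", "var"):
            return expr
        _, left, right = expr
        left, right = fold_constants(left), fold_constants(right)
        if left[0] == "lit" and right[0] == "lit":
            return ("lit", left[1] + right[1])
        return ("add", left, right)

    def drop_add_zero(expr):
        """Pass 2: remove additions of literal zero, staying in the same IR."""
        if expr[0] in ("lit", "var"):
            return expr
        _, left, right = expr
        left, right = drop_add_zero(left), drop_add_zero(right)
        if right == ("lit", 0):
            return left
        if left == ("lit", 0):
            return right
        return ("add", left, right)

    def related(src, tgt, env):
        """Per-pass correctness relation: both sides compute the same value."""
        return evaluate(src, env) == evaluate(tgt, env)

    env = {"x": 7}
    program = ("add", ("add", ("lit", -1), ("lit", 1)), ("add", ("var", "x"), ("lit", 0)))
    after_p1 = fold_constants(program)      # ("add", ("lit", 0), ("add", ("var", "x"), ("lit", 0)))
    after_p2 = drop_add_zero(after_p1)      # ("var", "x")
    assert related(program, after_p1, env) and related(after_p1, after_p2, env)
    assert related(program, after_p2, env)  # chained end-to-end guarantee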