Title: Static Analysis for Checking the Disambiguation Robustness of Regular Expressions
Regular expressions are commonly used for finding and extracting matches from sequence data. Due to the inherent ambiguity of regular expressions, a disambiguation policy must be considered for the match extraction problem, in order to uniquely determine the desired match out of the possibly many matches. The most common disambiguation policies are the POSIX policy and the greedy (PCRE) policy. The POSIX policy chooses the longest match out of the leftmost ones. The greedy policy chooses a leftmost match and further disambiguates using a greedy interpretation of Kleene iteration to match as many times as possible. The choice of disambiguation policy can affect the output of match extraction, which can be an issue for reusing regular expressions across regex engines. In this paper, we introduce and study the notion of disambiguation robustness for regular expressions. A regular expression is robust if its extraction semantics is indifferent to whether the POSIX or greedy disambiguation policy is chosen. This gives rise to a decision problem for regular expressions, which we prove to be PSPACE-complete. We propose a static analysis algorithm for checking the (non-)robustness of regular expressions and two performance optimizations. We have implemented the proposed algorithms and we have shown experimentally that they are practical for analyzing large datasets of regular expressions derived from various application domains.
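To make the difference between the two policies concrete, here is a minimal Python sketch; the specific patterns are illustrative and not drawn from the paper. Python's re module implements greedy (PCRE-style) disambiguation, while a POSIX leftmost-longest engine (e.g., the POSIX regexec API) extracts a different match from the same pattern and input:

```python
import re

# Greedy (PCRE-style) disambiguation, as in Python's re module: alternation
# is tried left to right, so "a|ab" commits to the first alternative "a".
m = re.match(r"a|ab", "ab")
print(m.group())  # -> "a"

# A POSIX leftmost-longest engine (e.g., the regexec API) would instead
# report the longest match at the same starting position: "ab".
# Since the two policies extract different matches, "a|ab" is not robust
# in the sense defined above.

# A pattern whose alternatives cannot overlap, such as "ab|cd", extracts
# the same match under both policies and is therefore robust.
print(re.match(r"ab|cd", "ab").group())  # -> "ab"
```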
Award ID(s):
2313062
PAR ID:
10612723
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Association for Computing Machinery (ACM)
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Volume:
8
Issue:
PLDI
ISSN:
2475-1421
Format(s):
Medium: X
Size(s):
p. 2073-2097
Sponsoring Org:
National Science Foundation
More Like this
  1. Hindsight Optimality in Two-Way Matching Networks. In "On the Optimality of Greedy Policies in Dynamic Matching", Kerimov, Ashlagi, and Gurvich study centralized dynamic matching markets with finitely many agent types and heterogeneous match values. A matching policy is hindsight optimal if the policy can (nearly) maximize the total value simultaneously at all times. The article establishes that suitably designed greedy policies are hindsight optimal in two-way matching networks. This implies that there is essentially no positive externality from having agents waiting to form future matches. Proposed policies include the greedy longest-queue policy, with a minor variation, as well as a greedy static priority policy. The matching networks considered in this work satisfy a general position condition. General position is a weak (but necessary) condition that holds when the static-planning problem (a linear program that optimizes the first-order matching rates) has a unique and nondegenerate optimal solution.
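As a purely illustrative flavor of such a policy, here is a minimal Python sketch of a greedy, longest-queue-first matching rule; the agent types, match values, and tie-breaking are assumptions made for this example, not details from the article:

```python
from collections import defaultdict

# Illustrative two-way matching network: each match pairs two distinct agent
# types and yields a value. Types and values are made up for this sketch.
MATCH_VALUES = {("buyer", "seller"): 3.0, ("buyer", "broker"): 1.0}

queues = defaultdict(list)  # agent type -> FIFO queue of waiting agent ids

def arrive(agent_type, agent_id):
    """Greedy policy: on every arrival, immediately form a feasible match
    if one exists, preferring pairs with the longest combined queues."""
    queues[agent_type].append(agent_id)
    feasible = [(a, b, v) for (a, b), v in MATCH_VALUES.items()
                if queues[a] and queues[b]]
    if not feasible:
        return None  # no feasible match; the agent waits
    # Longest-queue flavor: prefer the pair whose queues hold the most
    # waiting agents, breaking ties by match value.
    a, b, v = max(feasible,
                  key=lambda t: (len(queues[t[0]]) + len(queues[t[1]]), t[2]))
    return (queues[a].pop(0), queues[b].pop(0), v)

print(arrive("buyer", "b1"))   # None: nobody to match with yet
print(arrive("seller", "s1"))  # ('b1', 's1', 3.0): matched greedily
```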
  2. Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools. 
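The paper's concrete design is not reproduced here, but the core idea can be sketched in a few lines of Python: a semantic regex pairs an ordinary syntactic pattern with a semantic oracle predicate. The is_city oracle and the city list below are hypothetical stand-ins for the external resources (e.g., databases or neural models) that Smore would consult:

```python
import re

def semantic_matches(pattern, oracle, text):
    """Minimal sketch: keep only the syntactic regex matches that the
    semantic oracle also accepts."""
    return [m.group() for m in re.finditer(pattern, text) if oracle(m.group())]

# Hypothetical oracle; in a real system this could be a database lookup
# or a query to a language model.
KNOWN_CITIES = {"Austin", "Berlin", "Paris"}
def is_city(s):
    return s in KNOWN_CITIES

text = "Alice moved from Austin to Berlin, then visited Paris."
# Syntactic part: capitalized words. Semantic part: must denote a city.
print(semantic_matches(r"\b[A-Z][a-z]+\b", is_city, text))
# -> ['Austin', 'Berlin', 'Paris']  ("Alice" passes syntax, fails the oracle)
```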
  3. Regular expressions are pervasive in modern systems. Many real-world regular expressions are inefficient, sometimes to the extent that they are vulnerable to complexity-based attacks. While much research has focused on detecting inefficient regular expressions or on accelerating regular expression matching at the hardware level, we investigate automatically transforming regular expressions to remove inefficiencies. We reduce this problem to general expression optimization, an important task in a variety of domains beyond compilers, e.g., digital logic design. Syntax-guided synthesis (SyGuS) with a cost function can be used for this purpose, but ordered enumeration through a large space of candidate expressions can be prohibitively expensive. Equality saturation is an alternative approach that allows efficient construction and maintenance of expression equivalence classes generated by rewrite rules, but the procedure may not reach saturation, meaning global minimality cannot be confirmed. We present a new approach called rewrite-guided synthesis (ReGiS), in which a unique interplay between SyGuS and equality saturation-based rewriting helps to overcome these problems, resulting in an efficient, scalable framework for expression optimization.
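ReGiS itself is not reconstructed here; the Python sketch below only illustrates the rewrite-rule side of the design space, using a handful of made-up regex simplification rules applied to a fixpoint. Unlike equality saturation, this commits to a single rewrite order and can therefore miss better forms, which is precisely the limitation e-graphs are designed to avoid:

```python
# Toy, purely string-level rewrite rules over regex syntax (illustrative only;
# a real system would rewrite ASTs and verify semantic equivalence).
REWRITE_RULES = [
    ("(a|a)", "a"),   # duplicate alternative: (a|a) == a
    ("(a*)*", "a*"),  # nested Kleene star collapses: (a*)* == a*
    ("aa*", "a+"),    # one-or-more sugar: aa* == a+
]

def simplify(regex, rules):
    """Apply rewrite rules greedily until no rule fires (a fixpoint).
    Because rules fire in a fixed order, the result may be a local rather
    than a global optimum; equality saturation sidesteps this by keeping
    all equivalent forms in an e-graph."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in regex:
                regex = regex.replace(lhs, rhs)
                changed = True
    return regex

print(simplify("(a|a)(a*)*", REWRITE_RULES))  # -> "a+"
```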
  4. This paper is about semantic regular expressions (SemREs), a concept recently introduced in Smore (Chen et al. 2023), in which classical regular expressions are extended with a primitive to query external oracles such as databases and large language models (LLMs). SemREs can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus of that paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string w matches a SemRE r in O(|r|²|w|² + |r||w|³) time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in O(|r|²|w|²) time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline and incurs a ≈2× overhead over the time needed for interaction with the oracle. Second, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms that are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption; in contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an Ω(|w|²) lower bound on the number of oracle queries necessary to make this determination.
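The two-pass NFA algorithm is not reconstructed here; the Python sketch below illustrates only the oracle-cost dimension that the lower bound speaks to. A matcher that needs semantic verdicts on arbitrary spans of w faces on the order of |w|² candidate substrings, and memoizing the oracle ensures each distinct substring is queried at most once. The toy istitle predicate is a stand-in for a real oracle:

```python
from functools import lru_cache

oracle_calls = 0

def oracle(substring):
    """Stand-in for an external oracle (database or LLM); counts its calls."""
    global oracle_calls
    oracle_calls += 1
    return substring.istitle()  # toy semantic predicate

@lru_cache(maxsize=None)
def cached_oracle(substring):
    # Memoization: each distinct substring triggers at most one real query,
    # even if the matcher examines the same span many times.
    return oracle(substring)

w = "Ada met Bob"
# A SemRE matcher may need verdicts on arbitrary spans of w; there are
# |w|(|w|+1)/2 of them, which is where the Omega(|w|^2) query bound bites.
spans = {(i, j): cached_oracle(w[i:j])
         for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
print(f"{oracle_calls} oracle calls for {len(spans)} spans")
```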
  5. Although there are tools to help developers understand the matching behaviors between a regular expression and a string, regular-expression-related faults are still common. Learning developers' behavior through the change history of regular expressions can identify common edit patterns, which can inform the creation of mutation and repair operators to assist with testing and fixing regular expressions. In this work, we explore how regular expressions evolve over time, focusing on the characteristics of regular expression edits, the syntactic and semantic differences of the edits, and the feature changes of edits. Our exploration uses two datasets. First, we look at GitHub projects that have a regular expression in their current version and look back through the commit logs to collect the regular expressions' edit history. Second, we collect regular expressions composed by study participants during problem-solving tasks. Our results show that 1) 95% of the regular expressions from GitHub are not edited, 2) most edited regular expressions have a syntactic distance of 4-6 characters from their predecessors, 3) over 50% of the edits in GitHub tend to expand the scope of the regular expression, and 4) the number of features used indicates that regular expression language usage increases over time. This work has implications for supporting regular expression repair and mutation to ensure test suite quality.
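To make the "syntactic distance" measurement concrete, here is a standard character-level Levenshtein distance in Python; the before/after patterns are invented examples of a scope-widening edit, not regexes from the study's datasets:

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions),
    computed with a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Invented edit that widens a phone-number regex to accept more separators,
# landing in the 4-6 character range the study reports as typical.
before = r"\d{3}-\d{4}"
after = r"\d{3}[-. ]\d{4}"
print(levenshtein(before, after))  # -> 4
```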