An enormous amount of realworld data exists in the form of graphs. Oftentimes, interesting patterns that describe the complex dynamics of these graphs are captured in the form of frequently reoccurring substructures. Recent work at the intersection of formal language theory and graph theory has explored the use of graph grammars for graph modeling and pattern mining. However, existing formulations do not extract meaningful and easily interpretable patterns from the data. The present work addresses this limitation by extracting a special type of vertex replacement grammar, which we call a KT grammar, according to the Minimum Description Length (MDL) heuristic. In experiments on synthetic and realworld datasets, we show that KTgrammars can be efficiently extracted from a graph and that these grammars encode meaningful patterns that represent the dynamics of the realworld system.
more »
« less
Modeling Graphs with Vertex Replacement Grammars
One of the principal goals of graph modeling is to capture the building blocks of network data in order to study various physical and natural phenomena. Recent work at the intersection of formal language theory and graph theory has explored the use of graph grammars for graph modeling. However, existing graph grammar formalisms, like Hyperedge Replacement Grammars, can only operate on small treelike graphs. The present work relaxes this restriction by revising a different graph grammar formalism called Vertex Replacement Grammars (VRGs). We show that a variant of the VRG called Clusteringbased Node Replacement Grammar (CNRG) can be efficiently extracted from many hierarchical clusterings of a graph. We show that CNRGs encode a succinct model of the graph, yet faithfully preserves the structure of the original graph. In experiments on large realworld datasets, we show that graphs generated from the CNRG model exhibit a diverse range of properties that are similar to those found in the original networks.
more »
« less
 Award ID(s):
 1652492
 NSFPAR ID:
 10136394
 Date Published:
 Journal Name:
 IEEE International Conference on Data Mining
 Page Range / eLocation ID:
 558 to 567
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Discovering the underlying structures present in large real world graphs is a fundamental scientific problem. Recent work at the intersection of formal language theory and graph theory has found that a Probabilistic Hyperedge Replacement Grammar (PHRG) can be extracted from a tree decomposition of any graph. However, because the extracted PHRG is directly dependent on the shape and contents of the tree decomposition, rather than from the dynamics of the graph, it is unlikely that informative graphprocesses are actually being captured with the PHRG extraction algorithm. To address this problem, the current work adapts a related formalism called Probabilistic Synchronous HRG (PSHRG) that learns synchronous graph production rules from temporal graphs. We introduce the PSHRG model and describe a method to extract growth rules from the graph. We find that SHRG rules capture growth patterns found in temporal graphs and can be used to predict the future evolution of a temporal graph. We perform a brief evaluation on small synthetic networks that demonstrate the prediction accuracy of PSHRG versus baseline and state of the art models. Ultimately, we find that PSHRGs seem to be very good at modelling dynamics of a temporal graph; however, our prediction algorithm, which is based on string parsing and generation algorithms, does not scale to practically useful graph sizes.more » « less

Mutzel, Petra and (Ed.)Grammar compression is, next to LempelZiv (LZ77) and runlength BurrowsWheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a contextfree grammar whose language is precisely the input string. This is called a straightline grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003] is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: 𝒪(z log n) where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical. We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical RePair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a runlength BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structuresmore » « less

null (Ed.)A graph G is called {\em selfordered} (a.k.a asymmetric) if the identity permutation is its only automorphism. Equivalently, there is a unique isomorphism from G to any graph that is isomorphic to G. We say that G=(VE) is {\em robustly selfordered}if the size of the symmetric difference between E and the edgeset of the graph obtained by permuting V using any permutation :VV is proportional to the number of nonfixedpoints of . In this work, we initiate the study of the structure, construction and utility of robustly selfordered graphs. We show that robustly selfordered boundeddegree graphs exist (in abundance), and that they can be constructed efficiently, in a strong sense. Specifically, given the index of a vertex in such a graph, it is possible to find all its neighbors in polynomialtime (i.e., in time that is polylogarithmic in the size of the graph). We provide two very different constructions, in tools and structure. The first, a direct construction, is based on proving a sufficient condition for robust selfordering, which requires that an auxiliary graph, on {\em pairs} of vertices of the original graph, is expanding. In this case the original graph is (not only robustly selfordered but) also expanding. The second construction proceeds in three steps: It boosts the mere existence of robustly selfordered graphs, which provides explicit graphs of sublogarithmic size, to an efficient construction of polynomialsize graphs, and then, repeating it again, to exponentialsize(robustly selfordered) graphs that are locally constructible. This construction can yield robustly selfordered graphs that are either expanders or highly disconnected, having logarithmic size connected components. We also consider graphs of unbounded degree, seeking correspondingly unbounded robustness parameters. We again demonstrate that such graphs (of linear degree)exist (in abundance), and that they can be constructed efficiently, in a strong sense. This turns out to require very different tools. Specifically, we show that the construction of such graphs reduces to the construction of nonmalleable twosource extractors with very weak parameters but with some additional natural features. We actually show two reductions, one simpler than the other but yielding a less efficient construction when combined with the known constructions of extractors. We demonstrate that robustly selfordered boundeddegree graphs are useful towards obtaining lower bounds on the query complexity of testing graph properties both in the boundeddegree and the dense graph models. Indeed, their robustness offers efficient, local and distance preserving reductions from testing problems on ordered structures (like sequences) to the unordered (effectively unlabeled) graphs. One of the results that we obtain, via such a reduction, is a subexponential separation between the query complexities of testing and tolerant testing of graph properties in the boundeddegree graph model. Changes to previous version: We retract the claims made in our initial posting regarding the construction of nonmalleable twosource extractors (which are quasiorthogonal) as well as the claims about the construction of relocationdetecting codes (see Theorems 1.5 and 1.6 in the original version). The source of trouble is a fundamental flaw in the proof of Lemma 9.7 (in the original version), which may as well be wrong. Hence, the original Section 9 was omitted, except that the original Section 9.3 was retained as a new Section 8.3. The original Section 8 appears as Section 8.0 and 8.1, and Section 8.2 is new.more » « less

It takes great effort to manually or semiautomatically convert freetext phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in largescale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of freetext phenotypes that would need postpublication curation. In addition, both manual and machine learning approaches face great challenges: the problem of intercurator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keywords to ontology concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that intercurator variation in translating phenotype characters to EntityQuality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for AuthorDriven Computable Data and Ontology Production for Taxonomists, which consists of three components: a webbased, ontologyaware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produce RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. a webbased, ontologyaware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produce RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform. The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken. We also report on how a custom color palette of RGB values obtained from real specimens or highquality specimen images, can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both humanreadable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontologyaware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken. We also report on how a custom color palette of RGB values obtained from real specimens or highquality specimen images, can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both humanreadable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontologyaware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3day usability study of the entire system of the Platform for AuthorDriven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies.more » « less