skip to main content

Title: NTP-Miner: Nonoverlapping Three-Way Sequential Pattern Mining
Nonoverlapping sequential pattern mining is an important type of sequential pattern mining (SPM) with gap constraints, which not only can reveal interesting patterns to users but also can effectively reduce the search space using the Apriori (anti-monotonicity) property. However, the existing algorithms do not focus on attributes of interest to users, meaning that existing methods may discover many frequent patterns that are redundant. To solve this problem, this article proposes a task called nonoverlapping three-way sequential pattern (NTP) mining, where attributes are categorized according to three levels of interest: strong, medium, and weak interest. NTP mining can effectively avoid mining redundant patterns since the NTPs are composed of strong and medium interest items. Moreover, NTPs can avoid serious deviations (the occurrence is significantly different from its pattern) since gap constraints cannot match with strong interest patterns. To mine NTPs, an effective algorithm is put forward, called NTP-Miner, which applies two main steps: support (frequency occurrence) calculation and candidate pattern generation. To calculate the support of an NTP, depth-first and backtracking strategies are adopted, which do not require creating a whole Nettree structure, meaning that many redundant nodes and parent–child relationships do not need to be created. Hence, time and space efficiency is improved. To generate candidate patterns while reducing their number, NTP-Miner employs a pattern join strategy and only mines patterns of strong and medium interest. Experimental results on stock market and protein datasets show that NTP-Miner not only is more efficient than other competitive approaches but can also help users find more valuable patterns. More importantly, NTP mining has achieved better performance than other competitive methods in clustering tasks. Algorithms and data are available at: .  more » « less
Award ID(s):
1763452 1828181
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
ACM Transactions on Knowledge Discovery from Data
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Risk patterns are crucial in biomedical research and have served as an important factor in precision health and disease prevention. Despite recent development in parallel and high-performance computing, existing risk pattern mining methods still struggle with problems caused by large-scale datasets, such as redundant candidate generation, inability to discover long significant patterns, and prolonged post pattern filtering. In this article, we propose a novel dynamic tree structure, Risk Hierarchical Pattern Tree (RHPTree), and a top-down search method, RHPSearch, which are capable of efficiently analyzing a large volume of data and overcoming the limitations of previous works. The dynamic nature of the RHPTree avoids costly tree reconstruction for the iterative search process and dataset updates. We also introduce two specialized search methods, the extended target search (RHPSearch-TS) and the parallel search approach (RHPSearch-SD), to further speed up the retrieval of certain items of interest. Experiments on both UCI machine learning datasets and sampled datasets of the Simons Foundation Autism Research Initiative (SFARI)—Simon’s Simplex Collection (SSC) datasets demonstrate that our method is not only faster but also more effective in identifying comprehensive long risk patterns than existing works. Moreover, the proposed new tree structure is generic and applicable to other pattern mining problems. 
    more » « less
  2. Abstract

    Most studies of land‐atmosphere coupling have focused on bivariate linear statistics like correlation. However, more complex dependencies exist, including nonlinear relationships between components of land‐atmosphere coupling and the transmutability of relationships between soil moisture and surface heat fluxes under different environmental conditions. In this study, a technique called multivariate mutual information, based on information theory, is proposed to quantify how surface heat fluxes depend on both surface energy and wetness conditions, that is, net radiation and soil moisture, seasonally across the globe using reanalysis data. Such interdependency is then decomposed into linear and nonlinear contributions, which are further decomposed as different components explainable as the unique contribution from individual surface conditions, redundant contributions shared by both surface conditions, and the synergistic contribution from the concurrent action of net radiation and soil moisture. In reanalysis data, the dependency linearly contributed from soil moisture bears a similar global pattern to previously identified hot spots of coupling. The linear unique contributions of net radiation and soil moisture are mainly nonoverlapping, which suggests two separate regimes are governed by either energy or water limitations. These patterns persist when the nonlinearity is superimposed, thus reinforcing the validity of the land‐atmospheric coupling hot spot paradigm and the spatial division of energy‐limited as well as water‐limited regions. Nevertheless, strong nonlinear relationships are detected, particularly over subtropical regions. Synergistic components are found across the globe, implying widespread multidimensional physical relationships among net radiation, soil moisture, and surface heat fluxes that previously had only been inferred locally.

    more » « less
  3. Abstract

    A major goal of community ecology is to identify the patterns of species associations and the processes that shape them. Arboreal ants are extremely diverse and abundant, making them an interesting and valuable group for tackling this issue. Numerous studies have used observational data of species co‐occurrence patterns to infer underlying assembly processes, but the complexity of these communities has resulted in few solid conclusions. This study takes advantage of an observational dataset that is unusually well‐structured with respect to habitat attributes (tree species, tree sizes, and vegetation structure), to disentangle different factors influencing community organization. In particular, this study assesses the potential role of interspecific competition and habitat selection on the distribution patterns of an arboreal ant community by incorporating habitat attributes into the co‐occurrence analyses. These findings are then contrasted against species traits, to explore functional explanations for the identified community patterns. We ran a suite of null models, first accounting only for the species incidence in the community and later incorporating habitat attributes in the null models. We performed analyses with all the species in the community and then with only the most common species using both a matrix‐level approach and a pairwise‐level approach. The co‐occurrence patterns did not differ from randomness in the matrix‐level approach accounting for all ant species in the community. However, a segregated pattern was detected for the most common ant species. Moreover, with the pairwise approach, we found a significant number of negative and positive pairs of species associations. Most of the segregated associations appear to be explained by competitive interactions between species, not habitat affiliations. This was supported by comparisons of species traits for significantly associated pairs. These results suggest that competition is the most important influence on the distribution patterns of arboreal ants within the focal community. Habitat attributes, in contrast, showed no significant influence on the matrix‐wide results and affected only a few associations. In addition, the segregated pairs shared more biological characteristic in common than the aggregated and random ones.

    more » « less
  4. Faisal, Aldo A (Ed.)
    Animals display characteristic behavioural patterns when performing a task, such as the spiraling of a soaring bird or the surge-and-cast of a male moth searching for a female. Identifying such recurring sequences occurring rarely in noisy behavioural data is key to understanding the behavioural response to a distributed stimulus in unrestrained animals. Existing models seek to describe the dynamics of behaviour or segment individual locomotor episodes rather than to identify the rare and transient sequences of locomotor episodes that make up the behavioural response. To fill this gap, we develop a lexical, hierarchical model of behaviour. We designed an unsupervised algorithm called “BASS” to efficiently identify and segment recurring behavioural action sequences transiently occurring in long behavioural recordings. When applied to navigating larval zebrafish, BASS extracts a dictionary of remarkably long, non-Markovian sequences consisting of repeats and mixtures of slow forward and turn bouts. Applied to a novel chemotaxis assay, BASS uncovers chemotactic strategies deployed by zebrafish to avoid aversive cues consisting of sequences of fast large-angle turns and burst swims. In a simulated dataset of soaring gliders climbing thermals, BASS finds the spiraling patterns characteristic of soaring behaviour. In both cases, BASS succeeds in identifying rare action sequences in the behaviour deployed by freely moving animals. BASS can be easily incorporated into the pipelines of existing behavioural analyses across diverse species, and even more broadly used as a generic algorithm for pattern recognition in low-dimensional sequential data. 
    more » « less
  5. Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs. Many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on large-diameter graphs.

    To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC). It breaks the synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag. It uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round).

    We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to the state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 out of 18 graphs. On average (geometric means) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithms, and 2.7× faster than the best existing implementation on each graph.

    We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists). Our implementations improve the performance of the state-of-the-art parallel implementations for these two problems.

    more » « less