skip to main content

Search for: All records

Creators/Authors contains: "Wang, Xiaoling"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. A frequent pattern is a substructure that appears in a database with frequency (aka. support) no less than a user-specified threshold, while a closed pattern is one that has no super-pattern that has the same support. Here, a substructure can refer to different structural forms, such as itemsets, subsequences, subtrees, and subgraphs, and mining such substructures is important in many real applications such as product recommendation and feature extraction. Currently, there lacks a general programming framework that can be easily customized to mine different types of patterns, and existing parallel and distributed solutions are IO-bound rendering CPU cores underutilized. Since mining frequent and/or closed patterns are NP-hard, it is important to fully utilize the available CPU cores. This paper presents such a general-purpose framework called PrefixFPM. The framework is based on the idea of prefix projection which allows a divide-and-conquer mining paradigm. PrefixFPM exposes a unified programming interface to users who can readily customize it to mine their desired patterns. We have adapted the state-of-the-art serial algorithms for mining patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of CPU cores.
  2. Given a data matrix 𝐷, a submatrix 𝑆 of 𝐷 is an order-preserving submatrix (OPSM) if there is a permutation of the columns of 𝑆, under which the entry values of each row in 𝑆 are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preference. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work on OPSM has ever considered uncertain value intervals using the well-established possible world semantics. We establish two different definitions of significant OPSMs based on the possible world semantics: (1) expected support-based and (2) probabilistic frequentness-based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, with a closed-form formula derived to efficiently handle the special case of uniform value distribution and an accurate cubic spline approximation approach that works well with any uncertain value distributions. To efficiently check the probabilistic frequentness, several effective pruning rules are designed to efficiently prune insignificant OPSMs; two approximation techniques based on the Poisson and Gaussian distributions, respectively, are proposed for further speedup.more »These techniques are integrated into our two OPSM mining algorithms, based on prefix-projection and Apriori, respectively. We further parallelize our prefix-projection-based mining algorithm using PrefixFPM, a recently proposed framework for parallel frequent pattern mining, and we achieve a good speedup with the number of CPU cores. Extensive experiments on real microarray data demonstrate that the OPSMs found by our algorithms have a much higher quality than those found by existing approaches.« less
  3. Mining frequent subtree patterns in a tree database (or, forest) is useful in domains such as bioinformatics and mining semi-structured data. We consider the problem of mining embedded subtrees in a database of rooted, labeled, and ordered trees. We compare two existing serial mining algorithms, PrefixTreeSpan and TreeMiner, and adapt them for parallel execution using PrefixFPM, our general-purpose framework for frequent pattern mining that is designed to effectively utilize the CPU cores in a multicore machine. Our experiments show that TreeMiner is faster than its successor PrefixTreeSpan when a limited number of CPU cores are used, as the total mining workloads is smaller; however, PrefixTreeSpan has a much higher speedup ratio and can beat TreeMiner when given enough CPU cores.
  4. NMR-assisted crystallography—the integrated application of solid-state NMR, X-ray crystallography, and first-principles computational chemistry—holds significant promise for mechanistic enzymology: by providing atomic-resolution characterization of stable intermediates in enzyme active sites, including hydrogen atom locations and tautomeric equilibria, NMR crystallography offers insight into both structure and chemical dynamics. Here, this integrated approach is used to characterize the tryptophan synthase α-aminoacrylate intermediate, a defining species for pyridoxal-5′-phosphate–dependent enzymes that catalyze β-elimination and replacement reactions. For this intermediate, NMR-assisted crystallography is able to identify the protonation states of the ionizable sites on the cofactor, substrate, and catalytic side chains as well as the location and orientation of crystallographic waters within the active site. Most notable is the water molecule immediately adjacent to the substrate β-carbon, which serves as a hydrogen bond donor to the ε-amino group of the acid–base catalytic residue βLys87. From this analysis, a detailed three-dimensional picture of structure and reactivity emerges, highlighting the fate of the L-serine hydroxyl leaving group and the reaction pathway back to the preceding transition state. Reaction of the α-aminoacrylate intermediate with benzimidazole, an isostere of the natural substrate indole, shows benzimidazole bound in the active site and poised for, but unable to initiate, the subsequent bondmore »formation step. When modeled into the benzimidazole position, indole is positioned with C3 in contact with the α-aminoacrylate C β and aligned for nucleophilic attack. Here, the chemically detailed, three-dimensional structure from NMR-assisted crystallography is key to understanding why benzimidazole does not react, while indole does.« less
  5. Frequent pattern mining (FPM) has been a focused theme in data mining research for decades, but there lacks a general programming framework that can be easily customized to mine different kinds of frequent patterns, and existing solutions to FPM over big transaction databases are IO-bound rendering CPU cores underutilized even though FPM is NP-hard. This paper presents, PrefixFPM, a general-purpose framework for FPM that is able to fully utilize the CPU cores in a multicore machine. PrefixFPM follows the idea of prefix projection to partition the workloads of PFM into independent tasks by divide and conquer. PrefixFPM exposes a unified programming interface to users who can customize it to mine their desired patterns, and the parallel execution engine is transparent to end-users and can be reused for mining all kinds of patterns. We have adapted the state-of-the-art serial algorithms for mining frequent patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of cores. A demo is available at https://youtu.be/PfioC0GDpsw; the code is available at https://github.com/yanlab19870714/PrefixFPM.