Abstract Statistical relational learning (SRL) frameworks are effective at defining probabilistic models over complex relational data. They often use weighted first-order logical rules where the weights of the rules govern probabilistic interactions and are usually learned from data. Existing weight learning approaches typically attempt to learn a set of weights that maximizes some function of data likelihood; however, this does not always translate to optimal performance on a desired domain metric, such as accuracy or F1 score. In this paper, we introduce a taxonomy of search-based weight learning approaches for SRL frameworks that directly optimize weights on a chosen domain performance metric. To effectively apply these search-based approaches, we introduce a novel projection, referred to as scaled space (SS), that is an accurate representation of the true weight space. We show that SS removes redundancies in the weight space and captures the semantic distance between the possible weight configurations. In order to improve the efficiency of search, we also introduce an approximation of SS which simplifies the process of sampling weight configurations. We demonstrate these approaches on two state-of-the-art SRL frameworks: Markov logic networks and probabilistic soft logic. We perform empirical evaluation on five real-world datasets and evaluate them eachmore »
Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation Methods
Given a data matrix 𝐷, a submatrix 𝑆 of 𝐷 is an order-preserving submatrix (OPSM) if there is a permutation of the columns of 𝑆, under which the entry values of each row in 𝑆 are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preference. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work on OPSM has ever considered uncertain value intervals using the well-established possible world semantics. We establish two different definitions of significant OPSMs based on the possible world semantics: (1) expected support-based and (2) probabilistic frequentness-based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, with a closed-form formula derived to efficiently handle the special case of uniform value distribution and an accurate cubic spline approximation approach that works well with any uncertain value distributions. To efficiently check the probabilistic frequentness, several effective pruning rules are designed to efficiently prune insignificant OPSMs; two approximation techniques based on the Poisson and Gaussian distributions, respectively, are proposed for further speedup. more »
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- ACM transactions on database systems
- Sponsoring Org:
- National Science Foundation
More Like this
A frequent pattern is a substructure that appears in a database with frequency (aka. support) no less than a user-specified threshold, while a closed pattern is one that has no super-pattern that has the same support. Here, a substructure can refer to different structural forms, such as itemsets, subsequences, subtrees, and subgraphs, and mining such substructures is important in many real applications such as product recommendation and feature extraction. Currently, there lacks a general programming framework that can be easily customized to mine different types of patterns, and existing parallel and distributed solutions are IO-bound rendering CPU cores underutilized. Since mining frequent and/or closed patterns are NP-hard, it is important to fully utilize the available CPU cores. This paper presents such a general-purpose framework called PrefixFPM. The framework is based on the idea of prefix projection which allows a divide-and-conquer mining paradigm. PrefixFPM exposes a unified programming interface to users who can readily customize it to mine their desired patterns. We have adapted the state-of-the-art serial algorithms for mining patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of CPU cores.
Modern Internet of Things ( IoT ) applications generate massive amounts of time-stamped data, much of it in the form of discrete, symbolic sequences. In this work, we present a new system called TOP that deTects Outlier Patterns from these sequences. To solve the fundamental limitation of existing pattern mining semantics that miss outlier patterns hidden inside of larger frequent patterns, TOP offers new pattern semantics based on contextual patterns that distinguish the independent occurrence of a pattern from its occurrence as part of its super-pattern. We present efficient algorithms for the mining of this new class of contextual patterns. In particular, in contrast to the bottom-up strategy for state-of-the-art pattern mining techniques, our top-down Reduce strategy piggy backs pattern detection with the detection of the context in which a pattern occurs. Our approach achieves linear time complexity in the length of the input sequence. Effective optimization techniques such as context-driven search space pruning and inverted index-based outlier pattern detection are also proposed to further speed up contextual pattern mining. Our experimental evaluation demonstrates the effectiveness of TOP at capturing meaningful outlier patterns in several real-world IoT use cases. We also demonstrate the efficiency of TOP, showing it to bemore »
Inference of unknown opinions with uncertain, adversarial (e.g., incorrect or conflicting) evidence in large datasets is not a trivial task. Without proper handling, it can easily mislead decision making in data mining tasks. In this work, we propose a highly scalable opinion inference probabilistic model, namely Adversarial Collective Opinion Inference (Adv-COI), which provides a solution to infer unknown opinions with high scalability and robustness under the presence of uncertain, adversarial evidence by enhancing Collective Subjective Logic (CSL) which is developed by combining SL and Probabilistic Soft Logic (PSL). The key idea behind the Adv-COI is to learn a model of robust ways against uncertain, adversarial evidence which is formulated as a min-max problem. We validate the out-performance of the Adv-COI compared to baseline models and its competitive counterparts under possible adversarial attacks on the logic-rule based structured data and white and black box adversarial attacks under both clean and perturbed semi-synthetic and real-world datasets in three real world applications. The results show that the Adv-COI generates the lowest mean absolute error in the expected truth probability while producing the lowest running time among all.
We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients---(i) merge-based load-balancing and (ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.