Discovering patterns from a set of text or, more generally, categorical data is an important problem in many disciplines such as biomedical research, linguistics, artificial intelligence and sociology. We consider here the well-known ‘market basket’ problem that is often discussed in the data mining community, and is also quite ubiquitous in biomedical research. The data under consideration are a set of ‘baskets’, where each basket contains a list of ‘items’. Our goal is to discover ‘themes’, which are defined as subsets of items that tend to co-occur in a basket. We describe a generative model, i.e. the theme dictionary model, for such data structures and describe two likelihood-based methods to infer themes that are hidden in a collection of baskets. We also propose a novel sequential Monte Carlo method to overcome computational challenges. Using both simulation studies and real applications, we demonstrate that the new approach proposed is significantly more powerful than existing methods, such as association rule mining and topic modelling, in detecting weak and subtle interactions in the data.
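As a rough illustration of the kind of generative process described above, the following minimal sketch assumes that each theme in a dictionary is selected independently with its own probability and that a basket is the union of the selected themes. The theme dictionary, the probabilities, and the function names `sample_basket` and `pair_counts` are hypothetical and chosen only for this example; the sketch does not implement the likelihood-based inference or the sequential Monte Carlo method.

```python
import random
from collections import Counter
from itertools import combinations


def sample_basket(themes, probs, rng):
    """Draw one basket as the union of independently selected themes.

    Assumption for this sketch: theme j is included with probability probs[j],
    independently of all other themes.
    """
    basket = set()
    for theme, p in zip(themes, probs):
        if rng.random() < p:
            basket |= set(theme)
    return basket


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


# Hypothetical dictionary: two single-item themes and one two-item theme.
themes = [("bread",), ("milk",), ("beer", "chips")]
probs = [0.5, 0.4, 0.2]

rng = random.Random(0)
baskets = [sample_basket(themes, probs, rng) for _ in range(1000)]
counts = pair_counts(baskets)

# Items of a multi-item theme always appear together in this toy setup,
# whereas bread and milk co-occur only by chance.
print(counts[("beer", "chips")], counts[("bread", "milk")])
```

In this toy setting the two-item theme shows up as a pair whose members never occur apart, which is the sort of co-occurrence signal that theme inference aims to recover from observed baskets.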
- PAR ID:
- 10401340
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Journal of the Royal Statistical Society Series B: Statistical Methodology
- Volume:
- 76
- Issue:
- 2
- ISSN:
- 1369-7412
- Size(s):
- p. 319-347
- Sponsoring Org:
- National Science Foundation
More Like this
-
Next-basket recommendation (NBR) is prevalent in the e-commerce and retail industries. In this scenario, a user purchases a set of items (a basket) at a time. NBR performs sequential modeling and recommendation based on a sequence of baskets. NBR is in general more complex than the widely studied sequential (session-based) recommendation, which recommends the next item based on a sequence of items. Recurrent neural networks (RNNs) have proved very effective for sequential modeling and have thus been adapted for NBR. However, we argue that existing RNNs cannot directly capture item frequency information in the recommendation scenario. Through careful analysis of real-world datasets, we find that personalized item frequency (PIF) information (which records the number of times that each item is purchased by a user) provides two critical signals for NBR. However, this has been largely ignored by existing methods. Even though existing methods such as RNN-based methods have strong representation ability, our empirical results show that they fail to learn and capture PIF. As a result, existing methods cannot fully exploit the critical signals contained in PIF. Given this inherent limitation of RNNs, we propose a simple item-frequency-based k-nearest neighbors (kNN) method to directly utilize these critical signals. We evaluate our method on four public real-world datasets. Despite its relative simplicity, our method frequently outperforms the state-of-the-art NBR methods, including deep-learning-based methods using RNNs, when patterns associated with PIF play an important role in the data.
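To make the idea of using PIF with nearest neighbors concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the distance measure, how a user's own frequencies are combined with those of the neighbors, or any weighting over time, so the Euclidean distance, the blending weight `alpha`, and the function names `pif_vector` and `recommend` are all hypothetical choices for illustration rather than the authors' method.

```python
import math
from collections import Counter


def pif_vector(baskets, items):
    """Personalized item frequency: how many times the user bought each item."""
    counts = Counter(item for basket in baskets for item in basket)
    return [counts[item] for item in items]


def recommend(target, others, items, k=2, alpha=0.5, top_n=2):
    """Score items by blending the target user's PIF vector with the mean
    PIF vector of its k nearest neighbours (Euclidean distance, assumed)."""
    target_vec = pif_vector(target, items)
    other_vecs = [pif_vector(baskets, items) for baskets in others]
    nearest = sorted(other_vecs, key=lambda v: math.dist(target_vec, v))[:k]
    if nearest:
        neighbour_mean = [sum(col) / len(nearest) for col in zip(*nearest)]
    else:
        neighbour_mean = [0.0] * len(items)
    scores = [alpha * t + (1 - alpha) * n
              for t, n in zip(target_vec, neighbour_mean)]
    # Rank by descending score; ties are broken alphabetically by item name.
    ranked = sorted(range(len(items)), key=lambda i: (-scores[i], items[i]))
    return [items[i] for i in ranked[:top_n]]


# Hypothetical purchase histories: each user is a list of baskets.
items = ["apple", "bread", "milk", "tea"]
target = [{"milk", "bread"}, {"milk"}]
others = [
    [{"milk", "tea"}, {"milk", "tea"}],
    [{"apple"}, {"apple"}, {"apple"}],
]

print(recommend(target, others, items, k=1))  # → ['milk', 'tea']
```

With `alpha=1.0` the ranking depends only on the user's own purchase counts, and lowering `alpha` lets frequently bought items of similar users enter the recommendation.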
-
Risk patterns are crucial in biomedical research and have served as an important factor in precision health and disease prevention. Despite recent developments in parallel and high-performance computing, existing risk pattern mining methods still struggle with problems caused by large-scale datasets, such as redundant candidate generation, inability to discover long significant patterns, and prolonged post-pattern filtering. In this article, we propose a novel dynamic tree structure, Risk Hierarchical Pattern Tree (RHPTree), and a top-down search method, RHPSearch, which are capable of efficiently analyzing a large volume of data and overcoming the limitations of previous works. The dynamic nature of the RHPTree avoids costly tree reconstruction for the iterative search process and dataset updates. We also introduce two specialized search methods, the extended target search (RHPSearch-TS) and the parallel search approach (RHPSearch-SD), to further speed up the retrieval of certain items of interest. Experiments on both UCI machine learning datasets and sampled datasets of the Simons Foundation Autism Research Initiative (SFARI) Simons Simplex Collection (SSC) demonstrate that our method is not only faster but also more effective in identifying comprehensive long risk patterns than existing works. Moreover, the proposed new tree structure is generic and applicable to other pattern mining problems.
-
It is widely recognized that the ability to exploit Natural Language Processing (NLP) text mining strategies has the potential to increase productivity and innovation in the sciences by orders of magnitude, by enabling scientists to pull information from research articles in scientific disciplines such as genomics and biomedicine. These methods enable scientists to rapidly identify publications relevant to their own research as well as make scientific discoveries by scouring hundreds of research papers for associations and connections (such as between drugs and side effects, or genes and disease pathways) that humans reading each paper individually might not notice. The goal of our work is to enable rapid development of workflows for mining scientific publications and, crucially, means to adapt tools and workflows to data for specific disciplines (domain adaptation). Our work on this project is still ongoing; however, as a starting point we initially addressed the needs for mining biomedical publications, and a Galaxy instance of the LAPPS Grid tailored to mining biomedical publications is currently maintained on the JetStream cloud environment (https://jetstream.lappsgrid.org).
-
We describe a data mining pipeline to convert data from educational systems into knowledge component (KC) models. In contrast to other approaches, our approach employs and compares multiple model search methodologies (e.g., sparse factor analysis, covariance clustering) within a single pipeline. In this preliminary work, we describe our approach's results on two datasets when using two model search methodologies for inferring item or KC relations (i.e., implied transfer). The first method uses item covariances, which are clustered to determine related KCs, and the second method uses sparse factor analysis to derive the relationship matrix for clustering. We evaluate these methods on data from experimentally controlled practice of statistics items as well as data from the Andes physics system. We explain our plans to upgrade our pipeline to include additional methods of finding item relationships and creating domain models. We discuss advantages of improving the domain model that go beyond model fit, including the fact that models with clustered item KCs result in performance predictions transferring between KCs, enabling the learning system to be more adaptive and better able to track student knowledge.
-
Living systems use chemical fuels to transiently assemble functional structures. As a step toward constructing abiotic mimics of such structures, we herein describe dissipative formation of covalent basket cage CBC 5 by reversible imine condensation of cup-shaped aldehyde 2 (i.e., basket) with trivalent aromatic amine 4. This nanosized [4+4] cage (V = 5 nm³, Mw = 6150 Da) has the shape of a truncated tetrahedron with four baskets at its vertices and four aromatic amines forming the faces. Importantly, tris-aldehyde basket 2 and aliphatic tris-amine 7 undergo condensation to give small [1+1] cage 6. The imine metathesis of 6 and aromatic tris-amine 4 into CBC 5 was optimized to bias the equilibrium favouring 6. Addition of tribromoacetic acid (TBA) as a chemical fuel perturbs this equilibrium to result in the transient formation of CBC 5, with subsequent consumption of TBA via decarboxylation driving the system back to the starting state.