Title: LMLFM: Longitudinal Multi-Level Factorization Machine
We consider the problem of learning predictive models from longitudinal data, consisting of irregularly repeated, sparse observations from a set of individuals over time. Such data often exhibit longitudinal correlation (LC) (correlations among observations for each individual over time), cluster correlation (CC) (correlations among individuals that have similar characteristics), or both. These correlations are often accounted for using mixed effects models that include fixed effects and random effects, where the fixed effects capture the regression parameters that are shared by all individuals, whereas random effects capture those parameters that vary across individuals. However, the current state-of-the-art methods are unable to select the most predictive fixed effects and random effects from a large number of variables, while accounting for complex correlation structure in the data and non-linear interactions among the variables. We propose Longitudinal Multi-Level Factorization Machine (LMLFM), to the best of our knowledge, the first model to address these challenges in learning predictive models from longitudinal data. We establish the convergence properties, and analyze the computational complexity, of LMLFM. We present results of experiments with both simulated and real-world longitudinal data which show that LMLFM outperforms the state-of-the-art methods in terms of predictive accuracy, variable selection ability, and scalability to data with a large number of variables. The code and supplemental material are available at https://github.com/junjieliang672/LMLFM.
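To make the fixed-effects/random-effects decomposition described above concrete, the sketch below alternates between fitting shared fixed effects and per-individual random intercepts on toy longitudinal data. It illustrates the general mixed-effects idea only, not the LMLFM estimator; the simulation setup and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy longitudinal data: individuals observed an irregular number of times.
n_ind, n_feat = 50, 10
beta_true = rng.normal(size=n_feat)          # fixed effects, shared by all
b_true = rng.normal(scale=0.7, size=n_ind)   # random intercepts, one per individual

ind, X, y = [], [], []
for i in range(n_ind):
    for _ in range(int(rng.integers(2, 9))):  # irregularly repeated visits
        x = rng.normal(size=n_feat)
        ind.append(i)
        X.append(x)
        y.append(x @ beta_true + b_true[i] + rng.normal(scale=0.1))
ind, X, y = np.array(ind), np.array(X), np.array(y)

# Block-coordinate estimation: alternate between the shared fixed effects
# and the per-individual random intercepts.
beta, b = np.zeros(n_feat), np.zeros(n_ind)
for _ in range(50):
    beta = np.linalg.lstsq(X, y - b[ind], rcond=None)[0]  # update fixed effects
    resid = y - X @ beta
    for i in range(n_ind):                                # update random effects
        b[i] = resid[ind == i].mean()

print("fixed-effect recovery error:", np.linalg.norm(beta - beta_true))
```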
Award ID(s): 1518732, 1636795
NSF-PAR ID: 10202721
Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 34
Issue: 04
ISSN: 2159-5399
Page Range / eLocation ID: 4811–4818
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1.
    Gaussian processes offer an attractive framework for predictive modeling from longitudinal data, i.e., irregularly sampled, sparse observations from a set of individuals over time. However, such methods have two key shortcomings: (i) they rely on ad hoc heuristics or expensive trial and error to choose effective kernels, and (ii) they fail to handle multilevel correlation structure in the data. We introduce Longitudinal deep kernel Gaussian process regression (L-DKGPR) to overcome these limitations by fully automating the discovery of complex multilevel correlation structure from longitudinal data. Specifically, L-DKGPR eliminates the need for ad hoc heuristics or trial and error using a novel adaptation of deep kernel learning that combines the expressive power of deep neural networks with the flexibility of non-parametric kernel methods. L-DKGPR effectively learns the multilevel correlation with a novel additive kernel that simultaneously accommodates both time-varying and time-invariant effects. We derive an efficient algorithm to train L-DKGPR using latent space inducing points and variational inference. Results of extensive experiments on several benchmark data sets demonstrate that L-DKGPR significantly outperforms the state-of-the-art longitudinal data analysis (LDA) methods.
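    The additive kernel idea can be illustrated with a small sketch. This is not the published L-DKGPR kernel, just an assumed composition of two RBF components: one over per-individual embeddings (time-invariant effects) and one over observation times (time-varying effects).

```python
import numpy as np

def rbf(a, b, lengthscale):
    """Squared-exponential kernel between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def additive_kernel(z, t, ls_ind=1.0, ls_time=1.0):
    """Additive covariance: an individual-level (time-invariant) component
    plus a temporal (time-varying) component, so the GP captures both
    cluster correlation and longitudinal correlation."""
    return rbf(z, z, ls_ind) + rbf(t, t, ls_time)

# Toy usage: 6 observations with individual embeddings z (e.g., produced by
# a deep net, as in deep kernel learning) and observation times t.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 3))
t = rng.uniform(0.0, 1.0, size=(6, 1))
K = additive_kernel(z, t)
print(K.shape)  # (6, 6)
```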
  2. Background: Recent applications of wastewater-based epidemiology (WBE) have demonstrated its ability to track the spread and dynamics of COVID-19 at the community level. Despite the growing body of research, quantitative synthesis of SARS-CoV-2 RNA levels in wastewater generated from studies across space and time using diverse methods has not been performed. Objective: The objective of this study is to examine the correlations between SARS-CoV-2 RNA levels in wastewater and epidemiological indicators across studies, stratified by key covariates in study methodologies. In addition, we examined the association of proportions of positive detections in wastewater samples and methodological covariates. Methods: We systematically searched the Web of Science for studies published by February 16th, 2021, performed a reproducible screening, and employed mixed-effects models to estimate the levels of SARS-CoV-2 viral RNA quantities in wastewater samples and their correlations to the case prevalence, the sampling mode (grab or composite sampling), and the wastewater fraction analyzed (i.e., solids, solid–supernatant mixtures, or supernatants/filtrates). Results: A hundred and one studies were found; twenty studies (671 biosamples and 1751 observations) were retained following a reproducible screening. The mean positivity across all studies was 0.68 (95%-CI, [0.52; 0.85]). The mean viral RNA abundance was 5244 marker copies per mL (95%-CI, [0; 16 432]). The Pearson correlation coefficients between the viral RNA levels and case prevalence were 0.28 (95%-CI, [0.01; 0.51]) for daily new cases or 0.29 (95%-CI, [−0.15; 0.73]) for cumulative cases. The fraction analyzed accounted for 12.4% of the variability in the percentage of positive detections, followed by the case prevalence (9.3% by daily new cases and 5.9% by cumulative cases) and sampling mode (0.6%). Among observations with positive detections, the fraction analyzed accounted for 56.0% of the variability in viral RNA levels, followed by the sampling mode (6.9%) and case prevalence (0.9% by daily new cases and 0.8% by cumulative cases). While the sampling mode and fraction analyzed both significantly correlated with the SARS-CoV-2 viral RNA levels, the magnitude of the increase in positive detection associated with the fraction analyzed was larger. The mixed-effects model treating studies as random effects and case prevalence as fixed effects accounted for over 90% of the variability in SARS-CoV-2 positive detections and viral RNA levels. Interpretations: Positive pooled means and confidence intervals in the Pearson correlation coefficients between the SARS-CoV-2 viral RNA levels and case prevalence indicators provide quantitative evidence that reinforces the value of wastewater-based monitoring of COVID-19. Large heterogeneities among studies in proportions of positive detections, viral RNA levels, and Pearson correlation coefficients suggest a strong demand for methods to generate data accounting for cross-study heterogeneities and more detailed metadata reporting. Large variance was explained by the fraction analyzed, suggesting sample pre-processing and fractionation as a direction that needs to be prioritized in method standardization. Mixed-effects models accounting for study-level variations provide a new perspective to synthesize data from multiple studies.
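    A sketch of the kind of study-level mixed-effects fit described above, using statsmodels; the input file and every column name (log_rna, daily_new_cases, sampling_mode, fraction_analyzed, study_id) are hypothetical placeholders, not the study's actual data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical observation-level table: one row per wastewater sample.
df = pd.read_csv("wbe_observations.csv")  # placeholder file name

# Study-level random intercepts; case prevalence and method covariates
# enter as fixed effects, mirroring the meta-analytic model sketched above.
model = smf.mixedlm(
    "log_rna ~ daily_new_cases + C(sampling_mode) + C(fraction_analyzed)",
    data=df,
    groups=df["study_id"],
)
result = model.fit()
print(result.summary())
```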
  3. Integrating regularization methods with standard loss functions such as least squares and hinge loss within a regression framework has become a popular choice for researchers to learn predictive models with lower variance and better generalization ability. Regularizers also aid in building interpretable models with high-dimensional data, which makes them very appealing. Each regularizer is uniquely formulated to capture data-specific properties such as correlation, structured sparsity, and temporal smoothness. The problem of obtaining a consensus among such diverse regularizers while learning a predictive model is extremely important in order to determine the optimal regularizer for the problem. The advantage of such an approach is that it preserves the simplicity of the final model learned by selecting a single candidate model, which is not the case with ensemble methods, as they use multiple candidate models for prediction. This is called the consensus regularization problem, which has not received much attention in the literature due to the inherent difficulty associated with learning and selecting a model from an integrated regularization framework. To solve this problem, in this paper, we propose a method to generate a committee of non-convex regularized linear regression models, and use a consensus criterion to determine the optimal model for prediction. Each corresponding non-convex optimization problem in the committee is solved efficiently using the cyclic coordinate descent algorithm with the generalized thresholding operator. Our Consensus RegularIzation Selection based Prediction (CRISP) model is evaluated on electronic health records (EHRs) obtained from a large hospital for the congestive heart failure readmission prediction problem. We also evaluate our model on high-dimensional synthetic datasets to assess its performance. The results indicate that CRISP outperforms several state-of-the-art methods such as additive, interactions-based, and other competing non-convex regularized linear regression methods.
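    The thresholding-based coordinate descent referred to above can be sketched for the convex (lasso) special case; CRISP's non-convex penalties would swap the soft-thresholding step for a generalized thresholding operator of the same shape. The code and data below are illustrative only.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the l1 penalty; non-convex penalties (e.g.,
    SCAD, MCP) replace this with a generalized thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n)||y - Xw||^2 + lam*||w||_1,
    updating one coordinate at a time with all others held fixed."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]   # partial residual excluding j
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
w_true = np.zeros(30)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + rng.normal(scale=0.1, size=200)
print(np.round(coordinate_descent_lasso(X, y, lam=0.1), 2)[:6])
```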
  4.
    We consider user retention analytics for online freemium role-playing games (RPGs). RPGs constitute a very popular genre of computer-based games that, along with a player’s gaming actions, focus on the development of the player’s in-game virtual character through a persistent exploration of the gaming environment. Most RPGs follow the freemium business model, in which gamers can play for free but are charged for premium add-on amenities. As with other freemium products, RPGs suffer from the curse of high dropout rates. This makes retention analysis extremely important for successful operation and survival of their gaming portals. Here, we develop a disciplined statistical framework for retention analysis by modelling multiple in-game player characteristics along with the dropout probabilities. We capture players’ motivations through engagement times, collaboration and achievement score at each level of the game, and jointly model them using a generalized linear mixed model (glmm) framework that further includes a time-to-event variable corresponding to churn. We capture the interdependencies in a player’s level-wise engagement, collaboration, and achievement with dropout through a shared parameter model. We illustrate interesting changes in player behaviours as the gaming level progresses. The parameters in our joint model were estimated by a Hamiltonian Monte Carlo algorithm which incorporated a divide-and-recombine approach for increased scalability in glmm estimation that was needed to accommodate our large longitudinal gaming data set. By incorporating the level-wise changes in a player’s motivations and using them for dropout rate prediction, our method greatly improves on state-of-the-art retention models. Based on data from a popular action-based RPG, we demonstrate the competitive optimality of our proposed joint modelling approach by exhibiting its improved predictive performance over competitors. In particular, we outperform aggregate statistics based methods that ignore level-wise progressions, as well as progression-tracking non-joint models such as the Cox proportional hazards model. We also display improved predictions of popular marketing retention statistics and discuss how they can be used in managerial decision making.
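    A toy generative sketch of the shared-parameter linkage described above: one per-player random effect enters both the level-wise engagement model and the dropout odds, inducing dependence between the two processes. Parameter values and names are invented for illustration; the actual model also covers collaboration and achievement and is fitted by Hamiltonian Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

n_levels = 20
beta = 0.3    # fixed effect of level on engagement
gamma = -2.5  # baseline log-odds of dropping out at any level
alpha = 1.2   # association: how strongly the shared effect shifts dropout odds

for i in range(3):  # simulate a few player trajectories
    b_i = rng.normal()  # shared random effect for player i
    for level in range(1, n_levels + 1):
        engagement = beta * level + b_i + rng.normal(scale=0.5)
        p_drop = 1.0 / (1.0 + np.exp(-(gamma + alpha * b_i)))
        if rng.random() < p_drop:
            print(f"player {i}: dropped at level {level}, "
                  f"last engagement {engagement:.2f}")
            break
    else:
        print(f"player {i}: completed all {n_levels} levels")
```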

     