<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Integrating hypertension phenotype and genotype with hybrid non-negative matrix factorization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/15/2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10079789</idno>
					<idno type="doi">10.1093/bioinformatics/bty804</idno>
					<title level='j'>Bioinformatics</title>
<idno>1367-4803</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yuan Luo</author><author>Chengsheng Mao</author><author>Yiben Yang</author><author>Fei Wang</author><author>Faraz S Ahmad</author><author>Donna Arnett</author><author>Marguerite R Irvin</author><author>Sanjiv J Shah</author><author>Jonathan Wren</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Motivation: Hypertension is a heterogeneous syndrome in need of improved subtyping using phenotypic and genetic measurements with the goal of identifying subtypes of patients who share similar pathophysiologic mechanisms and may respond more uniformly to targeted treatments. Existing machine learning approaches often face challenges in integrating phenotype and genotype information and presenting to clinicians an interpretable model. We aim to provide informed patient stratification based on phenotype and genotype features. Results: In this article, we present a hybrid non-negative matrix factorization (HNMF) method to integrate phenotype and genotype information for patient stratification. HNMF simultaneously approximates the phenotypic and genetic feature matrices using different appropriate loss functions, and generates patient subtypes, phenotypic groups and genetic groups. Unlike previous methods, HNMF approximates phenotypic matrix under Frobenius loss, and genetic matrix under Kullback-Leibler (KL) loss. We propose an alternating projected gradient method to solve the approximation problem. Simulation shows HNMF converges fast and accurately to the true factor matrices. On a real-world clinical dataset, we used the patient factor matrix as features and examined the association of these features with indices of cardiac mechanics. We compared HNMF with six different models using phenotype or genotype features alone, with or without NMF, or using joint NMF with only one type of loss We also compared HNMF with 3 recently published methods for integrative clustering analysis, including iClusterBayes, Bayesian joint analysis and JIVE. HNMF significantly outperforms all comparison models. HNMF also reveals intuitive phenotype-genotype interactions that characterize cardiac abnormalities. Availability and implementation: Our code will be made publicly available on github upon publication.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Precision medicine aims to utilize information from multiple modalities-including phenotypic and genetic measurements-to develop an individualized and comprehensive view of a patient's pathophysiologic progression, to identify unique disease subtypes, and to administer personalized therapies <ref type="bibr">(Kohane, 2015)</ref>. Existing efforts are often based on only a selected set of biomarkers. The rapid growth of phenotypic and genetic data for many common diseases poses technical challenges for subtyping them, due to the high dimensionality of data, diversity of data types, uncertainty and missing data. However, the rapid growth of multiple data modalities, when linked to the right patients, may provide a prismatic view of the underlying pathophysiology of these diseases and offers a basis for meaningful subtyping of these patients.</p><p>Hypertension is an example of a complex, heterogeneous clinical syndrome characterized by elevated blood pressure. Although typically considered a single disease, primary hypertension (i.e., essential hypertension) is in fact a heterogeneous group of subtypes with varying etiologies and pathophysiology. This common form of hypertension is highly prevalent and is polygenic in nature. However, genetic studies of hypertension have focused primarily on analyzing single variants at a time and then ranking them in terms of significance, as has been done in several genome-wide association studies [see <ref type="bibr">Poulter et al. (2015)</ref> for a review]. However, it is more likely that genetic variants interact with each other to increase susceptibility to disease. Furthermore, genetic variants interact with phenotypic risk factors to further promote the development of diseases such as hypertension. With the growing availability of high throughput genotyping and phenotyping data (such as through NIH/NHLBI TOPMed program), the need for integrating both data modalities is becoming increasingly pressing. Thus, it is critical to develop a methodology to combine phenotypic and genetic data when clustering patients for the identification of novel subtypes of the disease. Such work could help identify novel molecular and pathophysiological pathways of disease and also may identify subgroups of patients who are more homogeneous in their response to specific therapies.</p><p>Major contributions of this paper are: (i) Aiming to provide informed patient stratification, we propose Hybrid Non-negative Matrix Factorization (HNMF) that approximates phenotype and genotype matrices using different appropriate loss functions, instead of single loss function in previous joint NMF methods. (ii) We use simulation to show HNMF converges fast and accurately to true factor matrices, and we use a real-world clinical dataset to show HNMF-generated patient factor matrix is more effective in predicting indices of cardiac mechanics compared to multiple non-NMF, NMF and joint NMF based methods. (iii) We show that HNMFgenerated group matrices lead to insights on phenotype-genotype interactions that characterize cardiac abnormalities.</p><p>From the clinical perspective, there have been only a few previous studies that have examined the clustering of hypertensive patients. <ref type="bibr">Katz et al. applied model-based</ref> clustering to a cohort of 1273 hypertensive individuals, using only phenotypic data as features <ref type="bibr">(Katz et al., 2017)</ref>. Study participants were clustered into two distinct groups that differed markedly in clinical characteristics, cardiac structure/function, and indices of cardiac mechanics. <ref type="bibr">Guo et al. (2017)</ref> used K-means clustering of phenotypic data (clinical and blood pressure characteristics) and found four groups of interest. However, neither of these studies utilized genetic data, which could have provided an additional important dimension to the clustering of hypertension, particularly when combined with phenotypic data.</p><p>From the method perspective, non-negative matrix factorization (NMF) refers to the set of problems on approximating a nonnegative matrix as the product of several non-negative matrices. The problem has become popular since Lee and Seung's Nature paper <ref type="bibr">(Lee and Seung, 1999)</ref>, where they form a nonnegative matrix by concatenating the set of pixel intensity vectors stretched from human facial images. After factorizing such matrix into the product of two matrices, they found that one matrix can be interpreted as the set of image basis with part based representation of human faces, and the other matrix is the coefficients if we were to reconstruct the face image from those bases. Because of the non-negativity constraints, NMF is not a convex problem and they developed a multiplicative update algorithm to obtain a stationary solution, with provable convergence of the algorithm <ref type="bibr">(Lee and Seung, 2001)</ref>.</p><p>Since then researchers have been working on NMF from various aspects. <ref type="bibr">Ding et al. (2005)</ref> showed that there is some equivalence between NMF and Kmeans/spectral clustering and claim NMF can be used for data clustering. <ref type="bibr">Ding et al. (2006)</ref> further developed a t-NMF approach that can perform co-clustering on both matrix columns and rows. They also discussed the various NMF variants <ref type="bibr">(Ding et al.</ref>,2 0 1 0 ). <ref type="bibr">Sra and Dhillon (2006)</ref> extended NMF to the case when the matrix approximation loss is measured by Bregman divergence, which is a much more general loss with both Frobenius norm and KL divergence, which are discussed in <ref type="bibr">(Lee and Seung, 2001</ref>)) as its special cases. On the solution procedure aspect, multiplicative updates have been recognized for its slow convergence and poor quality. Lin <ref type="bibr">(Lin, 2007)</ref> proposed a projected gradient approach for NMF. <ref type="bibr">Kim and Park (2011)</ref> also proposed an active set type of method called principal block pivoting to solve the NMF problem.</p><p>NMF is a highly effective unsupervised method to cluster similar patients <ref type="bibr">(Hofree et al., 2013;</ref><ref type="bibr">Luo et al., 2016b)</ref> and sample cell lines <ref type="bibr">(Mu &#168;ller et al., 2008)</ref>, and to identify subtypes of diseases <ref type="bibr">(Collisson et al., 2011)</ref>. Conventional NMF can only model either phenotypic measurements (e.g. using Frobenius loss) or genetic variants (e.g., using KL loss) but not both. Recent studies have investigated methods for joint matrix factorization, serving the purpose of metaanalysis <ref type="bibr">(Wang et al., 2015)</ref>, multi-view clustering <ref type="bibr">(Liu et al., 2013)</ref> or imposing multiple characterization of documents <ref type="bibr">(Kim et al., 2015)</ref>. However, these methods focus on using Frobenius loss to measure approximation accuracy of multiple matrices, and cannot readily integrate phenotypic measurements and genetic variants where approximating the two matrices admit different types of loss functions. <ref type="bibr">Gunasekar et al. (2016)</ref> proposed collective matrix factorization based on the Bregman divergence framework to integrate multi-source EHR phenotyping data, implemented KL-divergence as matrix approximation loss and experimented on discrete diagnosis and medications data.</p><p>In theory, both KL divergence and Frobenius loss are special cases of Bregman divergence, but care needs to be taken when materializing the theoretical framework to the concrete case of hybrid genotypic and continuous phenotypic data. Challenges include how to derive useful genetic variant information from terabytes of whole exome sequencing data, how to filter deleterious variants, how to properly implement HNMF with missing continuous data, etc. Our paper is one such concrete materialization to integrate phenotypic and genotypic information for patient subtyping. Addressing both the clinical and methodological challenges, we propose the model of HNMF which models the approximations of phenotypic and genetic matrices under Frobenius loss and KL loss respectively. We develop an alternating project gradient descent method for optimizing the HNMF objective, and demonstrate its fast convergence and effectiveness in integrating both the phenotypic and genetic data using both simulated and real-world studies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Materials and methods</head><p>We develop a hybrid matrix factorization method to integrate both phenotypic and genetic features. The model applies non-negative matrix factorization to discover groups of phenotypic variables and genetic variants simultaneously that collectively and interactively characterize the groups of the patients. The approximation error is measured using Frobenius loss for the phenotypic matrix, and KL loss for the genetic matrix; hence we name our algorithm the HNMF. We have made the following assumptions throughout the paper:</p><p>1. The phenotype matrix and the genotype matrix share a common set of subtypes; 2. The relationship between the variables and the groups are linear and can be modeled with matrix multiplication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Workflow of the study</head><p>We first outline the workflow of the study in Figure <ref type="figure">1</ref>. This study considers two types of patient data: phenotypic measurements and genetic variants. We first impute missing values in the phenotypic variables. For genetic variants, we first annotate the variants and then keep those variants that are likely gene disruptive (LGD). The pre-processed phenotypic measurements and genetic variants are then used as input to our HNMF algorithm. The patient factor matrix is then used as the feature matrix to perform regression analysis to predict main cardiac mechanistic outcomes. We next explain each step in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Cohort construction and data collection</head><p>Our cohort comes from the Hypertension Genetic Epidemiology Network (HyperGEN) study. HyperGEN, part of the NIH Family Blood Pressure Program, is a cross-sectional study consisting of individuals with hypertension, their siblings and offspring, and a random sample of normotensives, all recruited from 4 cities in the United States <ref type="bibr">(Williams et al.</ref>,2 0 0 0 ). We focus on the African American participants (660 total), for whom we have both the phenotypic data (e.g., vitals) and whole exome sequencing (WES) data. We used two measurements from the echocardiograms that are main reflectors of systolic (longitudinal strain) and diastolic (septal e' velocity) cardiac mechanics as outcome variables (Table <ref type="table">1</ref>) <ref type="bibr">(Mitter et al., 2017;</ref><ref type="bibr">Mor-Avi et al.,2011)</ref>. As opposed to conventional cardiac function measures such as ejection fraction, indices of cardiac mechanics obtained by speckle-tracking echocardiography are more sensitive measures of intrinsic cardiomyocyte function <ref type="bibr">(Shah et al., 2014)</ref>. Furthermore, indices of cardiac mechanics are thought to be subclinical measures of myocardial dysfunction that occur during the transition from risk factors (e.g., hypertension, obesity, diabetes, renal disease) to overt heart failure <ref type="bibr">(Selvaraj et al., 2016)</ref>. WES identifies the variants found in the coding region of genes (exons). In order to accurately and consistently call variants from across all datasets, we adopt the GATK framework <ref type="bibr">(DePristo et al., 2011)</ref> for a standardized processing of WES data. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Imputation on phenotypic variables</head><p>Biomedical, epidemiological and clinical data often contain missing values for test results, some due to issues during data acquisition and archiving, but others due to the fact that clinicians do not order certain tests based on patient-specific diagnostic and treatment course. The missing percentage of the phenotypic variables considered in our study ranges from 0% to 37%. We had our cardiologist colleagues pick rather inclusively 129 phenotypic variables (see Supplementary Material) that can characterize the hypertension risk and cardiac physiology of the patients. We are rather tolerant on missing rate in order to retain as many variables as possible. Nevertheless, only 13/129 (10%) of the phenotypic variables had missingness &gt; 10%. Six of these variables with missingness &gt; 10% (including those with missingness &gt; 30%) were phenotypes related to mitral inflow, which characterize diastolic function. Given the importance of diastolic dysfunction (i.e., abnormal cardiac relaxation and/or reduced cardiac compliance) in the setting of hypertension, we chose to retain these variables because of their clinical importance. We use the Multivariate Imputation by Chained Equations (MICE) algorithm to perform the imputation. This approach assumes a conditional model for each variable to be imputed, with the other variables as possible predictors (van Buuren and Groothuis-Oudshoorn, 2011). The term chained equation comes from the adoption of a Gibbs sampler, which is an iterative Markov Chain Monte Carlo (MCMC) algorithm. Previous studies e.g., <ref type="bibr">Luo et al. (2016a))</ref> showed that even at the presence of high missing rate (over 50%), MICE imputation may still render clinically useful information to predict patient outcome due to redundant information in phenotypic variables.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Annotation-based variant filtration and LGD variant detection</head><p>We next used the ANNOVAR toolkit <ref type="bibr">(Wang et al., 2010)</ref> to comprehensively annotate called variants with a wide array of information, including their hosting gene ;[using several gene models such as RefSeq, UCSC Known Gene, Gencode <ref type="bibr">(Harrow et al., 2012)</ref>]; the variant function; its predicted pathogenicity according to PolyPhen2 <ref type="bibr">(Adzhubei et al., 2013)</ref>, SIFT <ref type="bibr">(Kumar et al., 2009)</ref>, CADD <ref type="bibr">(Kircher et al., 2014)</ref>, and other meta predictors; its minor allele frequency among the 1000 Genomes populations and ExAC <ref type="bibr">(Lek et al., 2016)</ref>; and its phenotype associations according to ClinVar, and HGMD <ref type="bibr">(Stenson et al., 2012)</ref>.</p><p>To address issues of reference mis-annotation, we resort to the recently released Exome Aggregation Consortium (ExAC) exome dataset <ref type="bibr">(Lek et al., 2016)</ref>, which aims to aggregate exome sequencing data from a wide range of large-scale sequencing projects including the cohorts of Myocardial Infarction Genetics Consortium, Swedish Schizophrenia &amp; Bipolar Studies and The Cancer Genome Atlas (TCGA). We filter out those variants whose allele frequencies are observed to be over 90% among the 60, 706 individuals aggregated by ExAC. We also apply a similar 90% filtering threshold on the alternate allele frequency in our cohort. We further focus on likely gene disruptive (LGD) variants, which include frame-shift insertion, frame-shift deletions, nonsense variants and splice site alterations. We have 6430 gene features for our cohort, 660 subjects. We follow the common practice and exclude the genes that have very rare variants (&lt;10 subjects) or very frequent variants (&gt; 50% of the subjects), resulting in 1481 genes. We then follow the common approach of gene prioritization <ref type="bibr">(Moreau and Tranchevent, 2012)</ref> and further select the genes that show significant difference between the two hypertension groups (patient taking 1 vs. multiple anti-hypertensive medications) by two-tailed binomial exact tests <ref type="bibr">(Howell, 2012)</ref>. The gene selection is based on the entire patient cohort but uses a categorical label that is different from the final continuous outcomes of cardiac mechanics indexes. Eventually, 349 (110) genes (Supplementary Material) are selected for our cohort with p-value of binomial test less than 0.1 (0.01). Each entry of our genetic matrix specifies how many variants a patient has on that gene.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">Hybrid NMF</head><p>We propose the hybrid NMF (HNMF) model that integrates both phenotypic and genetic measurements of patients. The phenotypic measurements we consider are continuous values, hence we use Gaussian distribution to model the approximation error. The genetic measurements are counts of the genetic variants that happen to a particular gene, thus we use Poisson distribution to model the variant count. A schematic view of our HNMF model is shown in Figure <ref type="figure">2</ref>.</p><p>Our goal is to maximize the joint likelihood of the two approximations. Let the variables be defined as in Table <ref type="table">2</ref>, we establish the following constrained optimization problem</p><p>where k indicates the trade-off between the phenotypic approximation and genetic approximation (k &#188; 1 for our experiment), and the log likelihood functions are defined as follows.  Genotype group assignment matrix</p><p>We require F, G g and G p to be nonnegative in order to achieve better interpretability. Since the values of the entries in both X p and X g are nonnegative, with the nonnegativity constraints on F, G g and G p we are essentially assuming an additive reconstruction of X p and X g from the product of those factors under a hybrid loss. According to the seminal paper from Lee and Seung in Nature <ref type="bibr">(Lee and Seung, 1999)</ref>, such additive reconstruction can result in better interpretation of F, G g and G p .</p><p>By minimizing the negative log likelihood, we arrive at the following objective function.</p><p>where</p><p>where Xp &#188; F T G p and Xg &#188; F T G g . We can use the following alternating projected gradient descent procedure to solve the objective and establish the stopping criteria that the partial gradients should be small enough or all factor matrix updates cannot produce a feasible direction along which the objective function decreases (let P &#254; &#240;&#193;&#222; denote the non-negative projector):</p><p>These equations take turns in optimizing each factor matrix while keeping the other two fixed. We next present the partial gradients with respect to each of the three factor matrices. For phenotype group matrix G p , we have</p><p>Let Xg &#188; F T G g , and Xgij &#188; X g ij = Xg ij , for genotype group matrix G g , we have</p><p>where E G 2 R n&#194;mg is an all-one matrix. For the patient group matrix F, we have</p><p>With those gradients, we can adopt an alternating projected gradient descent procedure to solve the hybrid matrix factorization problem. This is an iterative procedure, at each iteration, the algorithm optimizes the objective with one specific group of variables with all other variables fixed. The optimization procedure used at each iteration will be projected gradient descent. In order to determine the step size at each gradient descent step, we use the Armijo rule as a sub-procedure which looks for the largest g (step size) that satisfies the following sufficient decrease condition. Let H; H new denote the parameters (e.g., F, G g and G p ) before and after each iteration respectively, and d 2&#240;0; 1&#222; be a predefined number. General sufficient decrease condition can be written as</p><p>If L is a quadratic form of H, we have a special fast-to-check sufficient decrease condition as Formula (13) <ref type="bibr">(Lin, 2007)</ref>. The algorithm for projected gradient descent with Armijo rule can be outlined as Algorithm 1. Note that q in the algorithm is a step size controlling parameter that is set to the common choice of 0.1 <ref type="bibr">(Lin, 2007)</ref>.</p><p>2.6 Feature group discovery using HNMF</p><p>In HNMF, the row vectors in the phenotype factor matrix G p and in the genetic factor matrix G g specify the grouping of phenotypic measurements and genetic variants respectively. Such groupings can be viewed as mixtures of phenotypic (or genetic) features, as they allow sharing of these features among different groups as specified by its fractional weights across groups. The motivation is to identify paired phenotypic group and genetic group that together characterize pathophysiologic underpinnings. The approximated phenotypic matrix can be viewed as rank-one sum of outer-product of patient group (e.g., F T &#189; &#193;j , jth column of the patient group matrix) and phenotypic group (e.g., &#189;G p j&#193; , jth row of the phenotypic group matrix). Similar argument holds for genetic group matrix. Thus the patient group (e.g., F T &#189; &#193;j ) bridges the corresponding phenotypic group (e.g., G p &#194;&#195; j&#193; ) and genetic group (e.g., G g &#194;&#195; j&#193; ). We used the patient group matrix F T as the instance-feature matrix in Ridge regression and used the numeric values of the cardiac mechanic variables as outcomes (listed in Table <ref type="table">1</ref>), and identify a column with maximum coefficient (e.g., F T &#189; &#193;j ). We selected the corresponding phenotypic and genetic groups (e.g., &#189;G p j&#193; and G g &#194;&#195; j&#193; ), which are paired through the shared patient group (e.g., F T &#189; &#193;j ) and provide interpretation</p><p>Algorithm 1 Projected gradient descent with Armijo rule 1: Initialize H. Set g &#188; 1 2: for i &#188; 1t ok do 3: if g satisfies Eq. ( <ref type="formula">13</ref>) (or ( 12 ) if quadratic) then 4:</p><p>Repeatedly increase g as g g=q until either g does not satisfy Eq. ( <ref type="formula">13</ref>)) (or ( <ref type="formula">12</ref>)) if quadratic) or H&#240;g=q&#222;&#188;H 5: else 6:</p><p>Repeatedly decrease g as g qg until g satisfy Eq. (13) (or (12) if quadratic) 7:</p><p>end if 8:</p><p>Set H new &#188; max&#240;0; H &#192; gr H L&#240;H&#222;&#222; 9: end for advantage. Using the trained regression model, we rank the patient groups by their regression coefficients and focus on the top patient groups (and associated phenotypic and genetic groups) that are associated with large effect size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.7">Evaluating the groups discovered by HNMF</head><p>Because there is no innate way (except for simulation) to determine whether the groupings of phenotypic measurements and genetic variants discovered by HNMF are good or poor, we evaluate their utility as features, abstracted from the base data, in a prediction model. We assume that good features will improve prediction and will give us some insights into which phenotypic and genetic patterns are indicative of patient cardiac mechanic abnormality. We use the phenotypic and genetic data for participants from the hypertension genetic epidemiology network (HyperGEN) study. We take a subset of the African American patients who are hypertensive, and for whom we have both phenotypic and genetic data available at large scale. We predict the numeric values of the cardiac mechanic variables as outcomes (listed in 1). For each outcome variable, we randomly split these patients into a 7: 3 train and held-out test dataset, and repeat the random initializations of HNMF and other NMF based comparison models 50 times in order to improve the statistical robustness of the results. We did not require that all the individuals from the same family to be included in either the training or the test set, but not both. This is out of the consideration that we want to minimize the potential bias from family variant patterns during model training. However, we did perform additional experiments requiring all the individuals from the same family to be included in either the training or the test set, but not both, which yielded similar numerical results, please refer to the Supplementary Material for more details.</p><p>To evaluate the effectiveness of HNMF in abstracting raw data into more predictive features, we use the patient factor matrix F to train a Ridge regression model. We chose Ridge regression over alternatives such as support vector regression or random forest regression for its capability to generate deterministic weights for individual features. We match the groups in the phenotypic factor matrix and genetic factor matrix according to their row indices, and link them to the corresponding row in the patient factor matrix F. Linear regression then provides a convenient way to directly assess phenotypic and genetic group contribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results</head><p>In this section, we first evaluate the algorithmic performance using a simulated dataset where the actual factor matrices are known. Then, we evaluate the hybrid matrix factorization performance using the HyperGEN dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Simulation</head><p>We first analyze simulated data where the underlying factor matrices are known. Specifically, we consider a 20 &#194; 10 X p matrix and a 20 &#194; 100 X g matrix with the true number of factors being 3. That is, they are generated by a 3 &#194; 20 F matrix, a 3 &#194; 10 G p matrix, and a 3 &#194; 100 G g matrix. We first sample the F, G p , and G g matrices. We then generate the X p matrix by adding an error term p on top of F T G p where p adopts standard normal distribution. Next we generate the X g matrix by sampling according to Poisson distribution with the parameter set to F T G g .</p><p>In order to evaluate the similarity between the factorized matrix and its true counterpart, we use the following similarity score:</p><p>where tr &#193; &#240;&#222; is trace and tr A T B &#240;&#222; can be considered as matrix inner product. This similarity score is essentially the cosine similarity, which quantifies the closeness between the computed solution and the actual factor matrix and provides a single number between 0 and 1 <ref type="bibr">( Chi and Kolda, 2012)</ref>. In order to test the sensitivity of estimates to the initialization, we performed random initialization 10 times.</p><p>The simulation results are shown in Figure <ref type="figure">3a</ref> where the similarity score is plotted as a function of maximum number of iterations for sub-procedures (optimizing F, G p , G g one at a time while fixing the other two, using the Armijo rule), which represents the closeness to the sub-problem optima. Figure <ref type="figure">3</ref> shows that as we have extra subprocedure iterations, the similarity scores first rise slightly and then plateau quickly. We can also see that the similarity between the true factor matrices and those recovered by HNMF quickly reaches to an accurate level (&gt;0.9). Figure <ref type="figure">3b</ref> shows the convergence speed of the proposed alternating projected gradient descent method with the number of iterations for sub-procedures set to 100. We can see that both loss functions (Frobenius loss for phenotype matrix and KL loss for genotype matrix) quickly decrease within a few iterations. In fact, for our simulation, the stopping criteria is usually met in less than 50 iterations. Regarding the sensitivity of estimates, Figure <ref type="figure">3a</ref> shows that the variation across runs with different initializations is relatively low; Figure <ref type="figure">3b</ref> shows that although the loss function curves may differ in the first few iterations across different initializations, they usually converge to the same levels quickly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Application on cardiac mechanics</head><p>We then evaluate HNMF on its effectiveness of abstracting raw data into more predictive features. Using the 2 indices of cardiac mechanics listed in Table <ref type="table">1</ref> as the outcome and the patient factor matrix F as the predictors, we train a Ridge regression model. We evaluate the root-mean-square error (RMSE) of our model on the held-out test set, and compare it against two baselines: (b1) Using only genetic variants as regression features; (b2) Using only phenotypic measurements as regression features. We also established five groups of comparison models as follows: (c1) Using only the genetic groups as regression features by applying NMF on the genetic variant matrix only; (c2) In disease with polygenic risk factors, each variant may contribute a small portion of risk, thus we added the total count of risky variants as additional feature to the genetic groups (Liu et al., (c4) Using joint NMF but use Frobenius loss for both matrices; (c5) Using other recently published methods include: iCBayes-Bayesian latent variable model for integrative clustering analysis <ref type="bibr">(Mo et al., 2018)</ref>; jBayes-Bayesian joint analysis <ref type="bibr">(Ray et al., 2014)</ref>; JIVE -Joint and individual variation explained <ref type="bibr">(Lock et al., 2013)</ref>. For the two suggested Bayesian sampling methods, we used the optimal setting described in their respective papers regarding sampling iterations (burn-in iterations and max iterations, e.g., 3000 and 4000 respectively for jBayes). We follow <ref type="bibr">Ho et al. (2014)</ref> on the evaluation procedure in that we vary the group number k from the smallest 2 to where the evaluation metric plateaus and show that across the spectrum HNMF outperforms multiple separate and joint NMF comparison models. The baseline RMSE performances are: 1.25 and 3.88 for genobaseline on septal e' velocity and longitudinal strain respectively, 1.20 and 3.55 for pheno-baseline respectively. The RMSE performance results of HNMF and comparison models are shown in Figure <ref type="figure">4</ref>. Comparing all the factorization models and nonfactorization models, we can see that using factor matrices as features results in significant improvement (smaller RMSE) over using phenotypic measurements and genetic variants directly as features.</p><p>Phenotype-only factor matrices often show better regression accuracy than genotype-only factor matrices, likely due to the fact that genetic raw matrix is much sparser than the phenotypic raw matrix. The HNMF factor matrix for regression also significantly outperforms all comparison models including genotype-only or phenotypeonly factor matrix for regression, as well as the two joint NMF model results using either KL loss or Frobenius loss for both matrices. This suggests that HNMF can effectively integrate the phenotype and genotype features to predict cardiac mechanics outcomes. HNMF also outperformed recently published methods including iCBayes, jBayes, and JIVE regarding both cardiac mechanics indexes. Note that JIVE is a deterministic model (hence no confidence intervals in the figure) whose performance varies little with the rank of the matrix corresponding to joint variation (hence appearing as a flat line in the figure). The joint Bayesian methods occasionally may have large variations possibly due to the fact that our study has a moderate number of subjects with both phenotype and genetic data. Bayesian sampling based methods likely prefer more subjects to achieve stable estimation while HNMF is more stable as it directly optimizes the objective function. We also noted that jBayes occasionally produced large RMSEs (e.g., k &#188; 13 for septal s' velocity), when the corresponding matrices contain large negative entries. This likely suggests overfitting; on the contrary, HNMF produced matrices with entries that have controlled magnitude due to non-negative constraints, and likely reduced overfitting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Sensitivity analysis</head><p>When performing annotation-based genetic variant filtration, we select the genes that show significant difference in number of LGD variants between the two hypertension groups (patient taking 1 vs. multiple anti-hypertensive medications) by two-tailed binomial exact tests. Using a P-value threshold of being less than 0.01 produces 110 genes for our cohort. This is a relatively stringent threshold and in this section we perform sensitivity analysis by varying the P-value threshold and including 0.05 and 0.1. With these P-value thresholds, we include considerably more genes into consideration: 239 genes for 0.05 as threshold and 349 genes for 0.1 as threshold. The genotype baseline RMSEs are 4.87 (4.63) for longitudinal strain and 1.56 (1.50) for septal e' velocity under P-value threshold 0.1 (0.05). Supplementary Figure <ref type="figure">S1</ref> (Supplementary materials) shows the results of the sensitivity analysis in comparison with Figure <ref type="figure">4</ref>. Comparing these figures, it is easy to see that under all p-value thresholds, HNMF consistently outperforms all baselines and NMF comparison models including pheno-and geno-separate NMF models and joint NMF models with KL or Frobenius losses. On the other hand, as one tightens the P-value threshold, the plateau region becomes wider, suggesting that the regression performance is less sensitive as the group number varies in the plateau region. Thus in the following phenotype and genotype group analysis, we chose P-value threshold of 0.01. Another reason is that with a stricter P-value threshold, we are more confident that selected genes are likely implicated in the pathogenesis of abnormal cardiac mechanics. We also note that with large enough patient cohort size, techniques such as cross-validations can be used to accurately determine the optimal group number. The larger the patient cohort size, the more effective cross-validation is, under more relaxed filtering criteria that result in more genes to consider.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Discussion</head><p>Using the method in the feature group discovery section, we identified the top phenotypic and genetic groups that are associated with (a) (b) Fig. <ref type="figure">4</ref>. RMSE with 95% confidence interval for HNMF and comparison methods. gNMF -using genotype factor matrix as features; pNMF -phenotype factor matrix as features; hNMF -hybrid factor matrix as features; jNMF(KL)joint matrix factorization using KL loss; jNMF(Fro) -joint matrix factorization using Frobenius loss; gNMF&#254;c -genotype factor matrix and the total count of risky variants. Other recently published methods include: iCBayes: Bayesian latent variable model for integrative clustering analysis <ref type="bibr">(Mo et al., 2018)</ref>; jBayes -Bayesian joint analysis <ref type="bibr">(Ray et al., 2014)</ref>; JIVE -Joint and individual variation explained <ref type="bibr">(Lock et al., 2013)</ref> worse cardiac mechanics. Due to space limitation, we only show the top phenotypic and genetics groups associated with lower values of septal e' velocity and longitudinal strain, as listed in Table <ref type="table">3</ref>. The phenotypic groups can help us identify variables that are correlated with abnormal cardiac mechanics. The associated genetic group consists of genes that potentially mediate the corresponding multivariable phenotypic abnormality. They collectively indicate problematic multi-factor genotype and phenotype interaction and attribute such interaction to a specific patient group (in F), thus can more comprehensively and precisely characterize and stratify these patients in an evidence-driven fashion.</p><p>More specifically, the echocardiographic septal e' velocity is one of several variables used during the assessment of diastolic dysfunction. In general lower septal e' values are reflective of a higher degree of diastolic dysfunction, which is associated with the development of heart failure and/or adverse cardiovascular outcomes <ref type="bibr">(Mitter et al., 2017)</ref>. In septal-e' phenotype group, preserved (higher) left ventricular ejection fraction is often present in patients with diastolic dysfunction, other variables are associated with the development of diastolic dysfunction, including abnormal sodium, calcium, and albumin levels, and abnormal left ventricular wall thickness during diastole. In the septal-e' gene group, TPM2 shows strong susceptibility to variants that lead to cardiomyopathies and IDI2 to chronic kidney disease (comorbidity and risk factor for cardiovascular disease). NPR2 is linked to cardiac conduction. GPRC6A is responsible for calcium sensing that affects L-type calcium channel and is critical to cardiac cell function <ref type="bibr">(Mackenzie et al., 2005)</ref>. MSMP is involved in resting heart rate modulation. For longitudinal strain, lower values suggest worse longitudinal systolic function of the subendocardium (inner layer of the heart), thus worse cardiac mechanics <ref type="bibr">(Shah et al., 2014)</ref>. In longitudinal strain phenotype group, besides abnormal sodium, calcium and albumin levels, both higher waist/hip ratio and faster sitting heart rate have a known association with the development of heart failure <ref type="bibr">(Bui et al., 2011)</ref>. In the longitudinal strain gene group, COX6B2 is in the cardiac muscle contraction pathway, CLDN5 is expressed in heart muscle, other genes also show strong susceptibility to variants that lead to cardiomyopathies (TPM2), other cardiovascular diseases (BMP4), and obesity as comorbidity (PAX5).</p><p>This study is subject to potential limitations. First, we only consider the genetic variants that are in coding regions. Genetic variations in coding regions are thought to be the most clinically significant because they often result in a change in the amino acid sequence of a protein. Thus, variations in coding regions of genes typically are associated with more clinical sequelae than variants in non-coding regions. However, variants in non-coding regions could have clinical implications through gene regulation or epigenetic modifications etc. The lack of non-coding variants is a limitation in our study. Applying HNMF on both coding and non-coding variants will be more computationally intensive. Thus in future work, we will develop a more computationally efficient algorithm, and obtain Whole Genome Sequencing (WGS) data to systematically capture potential regulatory variants. Regarding the identified subgroups, we only assessed and discussed their consistency to known knowledge. In the future, we also plan to provide more evidence, and in particular, biological validation to confirm potential novel discoveries. The second limitation concerns the gene feature selection using the entire patient cohort. This is out of consideration that genetic features are sparse and we only have a moderate sized patient cohort. In addition, we use a categorical label that is different from the final continuous outcomes of cardiac mechanics indexes to reduce the impact on generalizability evaluation. Despite our best efforts, we acknowledge that the impact on generalizability cannot be fully eliminated, and we plan to sequence more subjects from external sites to more strictly evaluate the generalizability of our algorithm and how applicable the selected genes would be to future cohorts. The third limitation concerns the fact that some individuals from the same family may be split into the training set while others in the test set. We did so in order to minimize the potential bias from family variant patterns during model training. This may result in an overly optimistic view of the generalizability. However, as neither HNMF nor all the comparison methods explore the family structure, we expect that their relative performances are similar and models' ranks will hold in general. We also performed additional experiments by assigning all individuals from the split families to the training set, therefore guaranteeing family-preserving training-testing split. As shown in Supplementary Figure <ref type="figure">S2</ref>, these experiments yielded similar numerical results and confirmed our expectation, please refer to the Supplementary Material for more detail.</p><p>To sum, we proposed a novel HNMF algorithm that integrates both phenotypic measurements and genetic variants as features in order to subtype patients. HNMF models the approximation error for the phenotypic matrix using Gaussian distribution, and models the variant count for the genetic matrix using Poisson distribution. The objective function is the negative log-likelihood of the data given parameters. We developed an alternating projected gradient descent method to solve the approximation problem. Using the simulated dataset, we demonstrated that HNMF has fast convergence and high accuracy when approximating the true factor matrices. Using the real-world HyperGEN dataset, we demonstrated the effectiveness of HNMF in integrating both the phenotypic and genetic features to derive informative patient subgroupings. We used the patient factor matrix as features to predict the cardiac mechanics outcome variables. We compared HNMF with six different models using phenotype or genotype features directly, using NMF on these features separately, and using joint matrix factorization but with only one type of loss function. HNMF significantly outperforms all comparison models. Analyzing the identified phenotype and genotype groups reveals intuitive phenotype-genotype interactions that characterize cardiac abnormality. For future study, we plan to extend HNMF to consider prior medical knowledge (e.g., known phenotypic and genotypic characteristics associated with heart failure) in guiding the generation of the factor matrices for better patient stratification. We also plan to extend HNMF to a trifactorization model that allows for different group numbers in patient, genotype and phenotype factor matrices, in order to benefit HNMF with more flexibility to handle heterogeneous and distinct modality of data sources. We plan to model the genetic matrix approximation using zero-inflated Poisson distribution, as genetic matrix is sparse. We also plan to relax LGD criteria to include more genetic variants and obtain Whole Genome Sequencing data to systematically capture potential regulatory variants.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty804/5098530 by Cornell University Library user on 18 November 2018</p></note>
		</body>
		</text>
</TEI>
