<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Machine Learning Selection of Most Predictive Brain Proteins Suggests Role of Sugar Metabolism in Alzheimer’s Disease</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>03/21/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10404580</idno>
					<idno type="doi">10.3233/JAD-220683</idno>
					<title level='j'>Journal of Alzheimer's Disease</title>
<idno>1387-2877</idno>
<biblScope unit="volume">92</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Raghav Tandon</author><author>Allan I. Levey</author><author>James J. Lah</author><author>Nicholas T. Seyfried</author><author>Cassie S. Mitchell</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Background: The complex and not yet fully understood etiology of Alzheimer’s disease (AD) shows important proteopathic signs which are unlikely to be linked to a single protein. However, protein subsets from deep proteomic datasets can be useful in stratifying patient risk, identifying stage dependent disease markers, and suggesting possible disease mechanisms. Objective: The objective was to identify protein subsets that best classify subjects into control, asymptomatic Alzheimer’s disease (AsymAD), and AD. Methods: Data comprised 6 cohorts; 620 subjects; 3,334 proteins. Brain tissue-derived predictive protein subsets for classifying AD, AsymAD, or control were identified and validated with label-free quantification and machine learning. Results: A 29-protein subset accurately classified AD (AUC=0.94). However, an 88-protein subset best predicted AsymAD (AUC=0.92) or Control (AUC=0.92) from AD (AUC=0.98). AD versus Control: APP, DHX15, NRXN1, PBXIP1, RABEP1, STOM, and VGF. AD versus AsymAD: ALDH1A1, BDH2, C4A, FABP7, GABBR2, GNAI3, PBXIP1, and PRKAR1B. AsymAD versus Control: APP, C4A, DMXL1, EXOC2, PITPNB, RABEP1, and VGF. Additional predictors: DNAJA3, PTBP2, SLC30A9, VAT1L, CROCC, PNP, SNCB, ENPP6, HAPLN2, PSMD4, and CMAS. Conclusion: Biomarkers were dynamically separable across disease stages. Predictive proteins were significantly enriched to sugar metabolism.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Early elucidation of Alzheimer's disease (AD) is pivotal for constructing clinically impactful treatments. However, the pathophysiology of AD and the driving biochemical changes are not fully understood. Assessment of changes in protein expression in the brain may assist in elucidating the multifactorial biochemical changes that lead to AD <ref type="bibr">[1]</ref>. Given the complexity and heterogeneity of AD, no single protein is likely to be predictive of all mechanisms or phenotypes which result in AD <ref type="bibr">[2]</ref>. Nonetheless, predictive protein models may suggest novel disease mechanisms, improve assessment of patient risk, and signify disease stage-dependent biomarkers <ref type="bibr">[3]</ref>.</p><note place="foot">ISSN 1387-2877 &#169; 2023 - The authors. Published by IOS Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (CC BY-NC 4.0).</note><p>This work identifies protein subsets that differentiate diagnostic labels for AD. AD diagnosis is often based on clinically measured functional cognitive decline. AD diagnosis is typically determined using a battery of neuropsychological tests in combination with suggestive imaging, genomic, or other clinical features. Common cognitive tests used in AD diagnosis include the Montreal Cognitive Assessment and the Consortium to Establish a Registry for Alzheimer's Disease (CERAD) neuropsychological battery <ref type="bibr">[4]</ref>. There is no universal definition of asymptomatic AD (AsymAD). AsymAD is typically characterized by changes in age-adjusted biomarkers, such as increased amyloid-&#946; and tau in the brain, without overt cognitive decline <ref type="bibr">[5]</ref>. 
Control subjects typically show no overt cognitive losses and no significant change in age-adjusted biomarkers.</p><p>In particular, identification of subsets of proteins that better predict and stratify the asymptomatic AD stage is pivotal. Earlier identification of patients likely to transition to AD could enable earlier intervention. The ability to intervene early is likely key to improving outcomes, such as slowing progression or improving symptom-related quality of life. The amyloid-&#946; cascade, tauopathy, and Apolipoprotein E (ApoE) are known aberrant protein signatures in AD <ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref>. However, other proteins may provide earlier clues during asymptomatic changes. For example, metabolomic <ref type="bibr">[10]</ref>, lipidomic, and inflammation-related proteins have also been suggested to be involved in aging and dementia <ref type="bibr">[11]</ref>.</p><p>The study goal was to determine which proteins in the brain (beyond amyloid-&#946; and phosphorylated tau) are most important for classifying a human subject as either control, AsymAD, or AD. Data consisted of 3,334 brain tissue-derived proteins measured via label-free quantification (LFQ) <ref type="bibr">[3]</ref> in six different clinical cohorts. Machine learning classification with recursive feature elimination was used to select the "best" or most predictive proteins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>METHODS</head><p>Methods consist of data collection and preprocessing; protein selection using a machine learning algorithm to identify the "best" subset of proteins to predict patient diagnostic classification; validation of the algorithm to accurately classify control, AsymAD, or AD patients using only the identified "best" subset of predictive proteins; and assessment of predictive protein functions. All data preprocessing, machine learning, and analyses were performed in Python 3.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Patient diagnostic class labels</head><p>Note that the patient diagnostic labels (Control, AsymAD, AD) were inherited from previously published work. Briefly, according to the definitions outlined by Johnson et al. <ref type="bibr">[3]</ref>, the neuropathological diagnostic classes were determined using CERAD criteria to quantify neuritic plaque distribution and Braak staging to quantify extent of neurofibrillary tangle pathology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data used for protein biomarker identification</head><p>Six public data sets were utilized <ref type="bibr">[3]</ref>: Baltimore Longitudinal Study of Aging (BLSA) <ref type="bibr">[12]</ref>, Banner Sun Health Research Institute (Banner) <ref type="bibr">[13]</ref>, Mount Sinai School of Medicine Brain Bank (MSSB) <ref type="bibr">[2]</ref>, Adult Changes in Thought Study (ACT), Mayo Clinic Brain Bank (Mayo), and University of Pennsylvania School of Medicine Brain Bank (UPenn). Four data sets (n = 419 subjects) were utilized for initial model construction and "best" protein selection: BLSA, Banner, ACT, and MSSB. Two data sets (n = 201 subjects) were used to independently validate the ability of the selected best protein subset to classify the diagnostic label of subjects: the Mayo and UPenn cohorts. For all cohorts except Mayo, the tissue was taken from the dorsolateral prefrontal cortex. For the Mayo cohort, the tissue was taken from the temporal cortex. As shown in Fig. <ref type="figure">1a</ref>, as part of data preparation, missing values were imputed using the k-nearest neighbor (kNN) algorithm. The optimal number of neighbors for imputation of missing values was determined to be 20 (Supplementary Figure <ref type="figure">1</ref>). Figure <ref type="figure">1b</ref> shows the number of subjects and quantified proteins for each cohort. Supplementary Figure <ref type="figure">2</ref> illustrates the overall distribution of amyloid-&#946;, tau, APP, CERAD score, and Braak stage in the data sets used for protein selection. Because amyloid-&#946; and tau were utilized to determine the class labels <ref type="bibr">[3]</ref> in the original data sets, tau and amyloid-&#946; levels are not explicitly utilized as part of the protein identification and selection process here. Including amyloid-&#946; and tau would have made the analysis circular and confounded the results. 
However, their pathways are indirectly represented via upstream biomarkers like APP.</p></div>
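<div xmlns="http://www.tei-c.org/ns/1.0"><p>The imputation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn's KNNImputer, a made-up subjects-by-proteins matrix, and default distance and weighting settings. Only the neighbor count of 20 comes from the text.</p>
<ab><![CDATA[
```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy stand-in for a subjects-by-proteins abundance matrix (values are made up).
rng = np.random.default_rng(0)
abundance = rng.normal(loc=0.0, scale=1.0, size=(40, 6))

# Blank out roughly 10% of entries to mimic proteins not quantified in some subjects.
missing_mask = rng.random(abundance.shape) < 0.10
abundance_missing = abundance.copy()
abundance_missing[missing_mask] = np.nan

# k = 20 is the value reported in the text; every other setting is a library default
# and therefore an assumption about the original pipeline.
imputer = KNNImputer(n_neighbors=20)
abundance_imputed = imputer.fit_transform(abundance_missing)

print("missing before:", int(np.isnan(abundance_missing).sum()))
print("missing after:", int(np.isnan(abundance_imputed).sum()))
```
]]></ab>
<p>Observed values pass through unchanged; only the missing entries are replaced by an average over the nearest subjects that have that protein quantified.</p></div>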
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Protein selection using machine learning</head><p>Fig. <ref type="figure">1</ref>. Diagram explaining the data and machine learning pipeline to identify a subset of "best" predictive protein biomarkers to accurately classify Alzheimer's disease (AD), asymptomatic Alzheimer's disease (AsymAD), or Control. a) Machine learning pipeline consisted of data preparation, protein selection, and model validation with selected proteins. Data was prepared by aggregating four cohorts (n = 419 subjects, n = 3,334 unique proteins) and imputing missing values using k-nearest neighbor algorithm. The most predictive proteins were selected using recursive feature elimination (RFE) to construct and train a support vector machine (SVM) classifier that can predict diagnosis using only the selected "best" proteins (n = 29 or n = 88 best predictive proteins). Finally, the developed classifier model was independently validated using 2 additional cohorts (n = 201 subjects) to ensure the model's performance generalizes to new data. b) Details of six data cohorts used in protein selection (4 cohorts) and validation (2 cohorts), including sample sizes.</p><p>As shown in the protein selection row of Fig. <ref type="figure">1a</ref>, proteins from the selection cohort (data from BLSA, Banner, ACT, and MSSB) were selected using a combination of classification algorithms with recursive feature elimination (RFE). RFE is a feature selection algorithm which recursively eliminates less important data features until a pre-defined number of features remain in the dataset. In this study, the "features" are the measured proteins. This iterative procedure is an instance of backward selection <ref type="bibr">[14]</ref>. RFE <ref type="bibr">[14]</ref> is used to determine the most predictive proteins for successful classification. 
The resultant predictive protein subset was then used to classify each subject as either control, AsymAD, or AD.</p><p>RFE is a wrapper-based feature selection algorithm where recursive rounds of elimination are used to determine the subset of proteins that best predict patient diagnostic classification. The final set of selected predictive proteins is, in part, sensitive to the classification method. Thus, two popular linear classification methods were independently used with RFE in the scikit-learn package of Python: a support vector machine (SVM) with a linear kernel and logistic regression (LR) <ref type="bibr">[15]</ref>. The two classifiers, SVM and LR, separately select a specified number of most predictive proteins equal to the RFE criterion. The RFE criterion is the number of proteins the algorithm is allowed to retain. Note that other classifiers were also tried in place of or in combination with SVM and LR. However, the intersection of proteins selected by SVM and LR was most consistent and accurate; hence, all results shown utilized this method.</p><p>Proteins are selected based on their superior classification ability as quantitatively measured by the area under the precision-recall curve (AUPRC). The intersecting most predictive proteins become the "best proteins". The RFE algorithm assessed RFE criteria ranging from 10 to 150 proteins. For example, the Venn diagram of Fig. <ref type="figure">1a</ref> for protein selection illustrates that an RFE criterion of 50 for SVM and LR resulted in an intersecting set of 29 best proteins. Upon completion of protein selection using RFE, a new SVM classifier is constructed, trained, and validated to classify diagnosis (AD, control, AsymAD) using only the selected best proteins. 
With three classes (control, AsymAD, AD), a one-versus-rest approach was utilized (AD versus non-AD, AsymAD versus non-AsymAD, control versus non-control).</p><p>Note that alternative methods to RFE to identify the most predictive proteins were considered and tried on LFQ as well as held-out data: penalized lasso (Supplementary Figure <ref type="figure">5</ref>), random forest feature importance (Supplementary Figure <ref type="figure">6</ref>), and statistical differential protein expression using the F-statistic (Supplementary Figure <ref type="figure">7</ref>). Also, random forests were coupled with RFE to obtain a more stringent selection criterion, in which a protein is included only when it is selected by all three algorithms: SVM, logistic regression, and random forests (Supplementary Figure <ref type="figure">8</ref>). A performance comparison to a neural network, which played no role in feature selection, is also shown in Supplementary Figures <ref type="figure">5</ref><ref type="figure">6</ref><ref type="figure">7</ref><ref type="figure">8</ref>. In some cases, the alternate methods shown in the supplementary figures performed marginally better on the UPenn dataset, which has only binary labels (Control/AD). In all cases, RFE-chosen proteins performed substantially better on the LFQ dataset, which has more samples (n = 419), more classes (Control/AsymAD/AD), and comprises 4 different datasets (ACT, MSSB, Banner, BLSA) (Fig. <ref type="figure">1</ref>). Because of its superior multi-class performance and generalizability, the RFE-based primary method shown in Fig. <ref type="figure">1a</ref> was used to produce all results shown in the main article.</p></div>
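<div xmlns="http://www.tei-c.org/ns/1.0"><p>The selection scheme described in this section, in which SVM and LR each run RFE down to the same criterion and only proteins retained by both are kept, might look roughly like the sketch below. It uses scikit-learn, which the text names, but the function name intersect_rfe_selection, the elimination step size, the standardization, and the synthetic data are all assumptions made for illustration.</p>
<ab><![CDATA[
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def intersect_rfe_selection(X, y, rfe_criterion, step=5):
    """Return sorted column indices retained by both SVM-RFE and LR-RFE.

    Hypothetical helper: rfe_criterion is the number of features each
    classifier may keep; step (features dropped per round) is an assumption.
    """
    estimators = [SVC(kernel="linear"), LogisticRegression(max_iter=2000)]
    selected = []
    for estimator in estimators:
        rfe = RFE(estimator, n_features_to_select=rfe_criterion, step=step)
        rfe.fit(X, y)
        selected.append({int(i) for i in rfe.get_support(indices=True)})
    return sorted(selected[0] & selected[1])


# Synthetic three-class data standing in for the protein matrix (not real data).
X, y = make_classification(
    n_samples=150, n_features=60, n_informative=8, n_redundant=0,
    n_classes=3, random_state=0,
)
X = StandardScaler().fit_transform(X)

best = intersect_rfe_selection(X, y, rfe_criterion=40)
print(len(best), "features kept by both classifiers")
```
]]></ab>
<p>Because each classifier keeps exactly the criterion number of features, the intersection can never exceed the criterion, which is consistent with a criterion of 50 yielding 29 proteins in the text.</p></div>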
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Validation of "best" protein subsets to classify diagnosis</head><p>As shown in the Protein Validation row of Fig. <ref type="figure">1a</ref>, the trained SVM classifier was independently tested using validation cohort data (Mayo, UPenn data sets). As part of independent validation, the best set(s) of proteins determined during protein selection with RFE were used to predict validation cohort diagnostic classes. However, there were a couple of exceptions due to required data harmonization. In the Mayo cohort, one of the "best" proteins was not quantified (CROCC|Q5TZA2), and a different protein isoform was quantified for APP (APP|A0A0A0MRG2 instead of APP|E9PG40). Similarly, for the UPenn cohort, two of the "best" proteins were not quantified (C4A|P0C0L4, DMXL1|Q9Y485), and a different isoform was quantified for APP (APP|A0A0A0MRG2 instead of APP|E9PG40).</p><p>Confusion matrices illustrate true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) used to calculate classification performance. Additionally, a precision-recall curve (PRC) is generated to assess aggregate final model performance. PRC assesses the model's classification accuracy when using only the "best" protein subsets to classify diagnosis. PRC is a plot of precision versus recall. Recall is defined as TP/(TP+FN), and precision is defined as TP/(TP+FP). Area under the curve (AUC) provides an aggregate measure of performance across all classification thresholds.</p><p>A separate unsupervised learning technique, t-distributed stochastic neighbor embedding (t-SNE), was used to assess separability of AD, AsymAD, and Control subjects using only the selected best protein subsets determined during supervised learning with RFE.</p><p>Finally, principal component analysis (PCA), a dimensionality reduction technique, was used to explore and validate RFE criteria. 
The scree plot and elbow method were used to separately verify how many proteins are necessary to explain the preponderance of variance. The elbow approximated the number of intersecting proteins selected during RFE for optimal diagnostic classification.</p></div>
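<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked example of the recall and precision definitions above, the snippet below applies them to the AD counts quoted in the Results for the 29-protein model. The true-positive and false-negative counts come from the text; false-positive counts are not quoted there, so the value used for precision is purely hypothetical.</p>
<ab><![CDATA[
```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# AD class counts quoted in the Results for the 29-protein model: (TP, FN).
ad_counts = {
    "LFQ": (186, 25 + 19),  # 230 AD subjects; 25 called AsymAD, 19 called Control
    "Mayo": (71, 11),
    "UPenn": (37, 12),
}
ad_recall = {cohort: recall(tp, fn) for cohort, (tp, fn) in ad_counts.items()}
for cohort, value in ad_recall.items():
    print(f"{cohort}: AD recall = {value:.3f}")

# False positives are not reported in the text, so this count is hypothetical.
hypothetical_fp = 10
example_precision = precision(tp=71, fp=hypothetical_fp)
print(f"Mayo precision with a hypothetical FP of 10: {example_precision:.3f}")
```
]]></ab>
<p>Sweeping the classifier's decision threshold and recomputing both quantities at each setting traces out the precision-recall curve whose area is reported as AUC.</p></div>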
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Analysis of protein function modules</head><p>Selected proteins were matched to their protein function using the color modules and algorithms published by Johnson et al. <ref type="bibr">[3]</ref>. There are 14 possible functional modules comprising the entire protein data set (n = 3,334 unique proteins). The percent composition of specific functional modules in the selected "best" protein subsets was compared to that of the original, full protein set. Significant differences were assessed using two-sided binomial tests at an alpha of 0.05.</p></div>
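<div xmlns="http://www.tei-c.org/ns/1.0"><p>A two-sided binomial test of module enrichment can be sketched with the standard library alone. The example below is an assumption-laden illustration: it uses the sugar metabolism (M4) figures reported in the Results (5.6% of the source set, 20.7% of 29, 13.6% of 88), infers the underlying counts as 6 of 29 and 12 of 88 from those percentages, and defines the two-sided p-value as the total probability of outcomes no more likely than the observed one, which may differ from the convention the authors used.</p>
<ab><![CDATA[
```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_p(k, n, p):
    """Exact two-sided p-value: sum of probabilities of all outcomes that are
    no more likely than the observed count k (small tolerance for float ties)."""
    observed = binom_pmf(k, n, p)
    threshold = observed * (1 + 1e-7)
    total = sum(
        binom_pmf(i, n, p) for i in range(n + 1) if binom_pmf(i, n, p) <= threshold
    )
    return min(1.0, total)


source_fraction = 0.056  # share of M4 proteins among the 3,334, as reported

# Counts are inferred from the reported percentages (assumption):
# 6/29 = 20.7% and 12/88 = 13.6%.
p_29 = two_sided_binom_p(6, 29, source_fraction)
p_88 = two_sided_binom_p(12, 88, source_fraction)
print(f"n = 29: p = {p_29:.4f}")
print(f"n = 88: p = {p_88:.4f}")
```
]]></ab>
<p>Under these assumptions both subsets fall below the 0.05 threshold, in line with the enrichment the article reports for the sugar metabolism module.</p></div>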
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RESULTS</head><p>A machine learning classification and recursive feature elimination process (Fig. <ref type="figure">1a</ref>) determined which of 3,334 possible clinically measured proteins were most important for classifying control, AsymAD, or AD. RFE was used to identify the proteins that best predicted diagnostic class. Six public data sets were utilized (Fig. <ref type="figure">1b</ref>). Four data sets (n = 419 subjects) were used for protein selection, which consisted of identifying the "best proteins". Two data sets (n = 201 subjects) were used for independent protein validation. Independent validation on unseen data tested whether the model was generalizable, that is, whether it can correctly classify the diagnosis of new subjects using only the selected best protein subset(s). An RFE criterion of 50, which resulted in 29 best proteins, was found sufficient to distinguish AD from control. However, an RFE criterion of 150, which resulted in 88 best proteins, was found necessary to optimally distinguish AsymAD from AD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Classification performance with 29 best proteins</head><p>Amyloid precursor protein (APP) is linked to the well-known amyloid-&#946; pathway <ref type="bibr">[16]</ref>. It was also selected by RFE as one of the 29 best proteins. Hence, it was important to carefully assess if APP was impacting or biasing classification compared to other proteins. Figure <ref type="figure">2</ref> illustrates the classification performance using APP alone (Fig. <ref type="figure">2a-c</ref>), the selected 29 best proteins (Fig. <ref type="figure">2d-f</ref>), and APP excluded from the panel of 29 proteins (Fig. <ref type="figure">2g-i</ref>) for three datasets (LFQ, Mayo, and UPenn). Figure <ref type="figure">2d</ref> illustrates the performance confusion matrix for the LFQ cohort, Fig. <ref type="figure">2e</ref> the Mayo validation cohort, and Fig. <ref type="figure">2f</ref> the UPenn validation cohort, with the selected 29 best proteins. For the LFQ cohort, the model correctly classified 186 of the 230 AD patients (80.9%), while 25 AD patients (10.9%) were misclassified as AsymAD, and 19 AD patients (8.3%) were misclassified as Control. For the Mayo validation cohort, where proteins were measured in the temporal cortex, 71 AD patients (86.6%) were correctly classified, whereas 11 AD patients (13.4%) were misclassified as Control. For the UPenn validation cohort, where proteins were measured in the dorsolateral prefrontal cortex, 37 AD patients (75.5%) were correctly classified, whereas 12 AD patients (24.5%) were misclassified as Control. A similar trend was seen in Fig. <ref type="figure">2g</ref>-i when APP was removed from the panel of 29 best proteins and the remaining proteins were used. In summary, the exclusion of APP did not significantly diminish the ability of the remaining 28 proteins to predict diagnosis. 
This result indicates classification is not dependent on APP alone.</p><p>A precision-recall (PR) curve is used to assess aggregate classifier performance. AUC is used to quantify the aggregate classification performance using the selected best protein subset(s). AUC = 1 is a perfect classifier; thus, an AUC closer to 1 is desirable. Figure <ref type="figure">2j</ref> illustrates the PR curve and corresponding AUC for the protein selection cohort for AD, control, and AsymAD, respectively, using the selected 29 best proteins. The shaded area represents the standard error, &#177;&#963;. The 29 best proteins do well in correctly classifying AD (AUC = 0.94 &#177; 0.01; shown in red in Fig. <ref type="figure">2j</ref>) and control (AUC = 0.83 &#177; 0.02; shown in green in Fig. <ref type="figure">2j</ref>). However, the best 29 proteins are poor at classifying AsymAD (AUC = 0.68 &#177; 0.02; shown in yellow in Fig. <ref type="figure">2j</ref>). Figure <ref type="figure">2k</ref> illustrates the PR curve and corresponding AUC for the independent validation cohorts of Mayo and UPenn, respectively. The AUC for the Mayo cohort was 0.96 &#177; 0.01 whereas the UPenn cohort AUC was 0.90 &#177; 0.03. Thus, the model performed comparably well in diagnosing unseen AD and Control patients in both independent validation cohorts with the 29-protein subset. The fact that the validation data originated from different brain regions provides further confidence that the 29-protein subset model is generalizable to other future data sets. The selected proteins do not have a high degree of correlation between them, which supports that the predictive ability does not rest upon a few proteins in the set (Supplementary Figure <ref type="figure">3</ref>). Moreover, the unsupervised clustering method, t-SNE, illustrated good separability of the AD, AsymAD, and control classes using the selected subset of 29 "best" proteins (Supplementary Figure <ref type="figure">4</ref>).</p><p>Fig. <ref type="figure">2</ref>. Examination of classification performance for Alzheimer's disease (AD), asymptomatic Alzheimer's disease (AsymAD), and control using n = 29 best predictive proteins in a one-versus-rest classification setting. Confusion matrices (a-i) illustrate numeric classification results, whereas precision-recall curves (j-k) denote the area under the curve (AUC) with standard error (&#177; &#963;) to quantify overall classification performance. a-c) Confusion matrices illustrating classification results when using APP alone to classify patient diagnostic class in the (a) LFQ, (b) Mayo, and (c) UPenn datasets. d-f) Confusion matrices illustrating classification results with all 29 "best" or most predictive proteins. g-i) Confusion matrices illustrating results when APP was excluded and the remaining 28 predictive proteins were used for classification of patient diagnosis. APP is widely considered pivotal to the AD etiology. However, these results illustrate APP is not overtly biasing diagnostic classification ability. j) Precision-recall curve for the 3 classes (Control, AsymAD, AD) in the LFQ dataset. k) Precision-recall curves for the validation datasets (Mayo and UPenn), which consisted of 2 classes (control/AD). In all cases shown (a-k), an SVM classifier is used with a 6-fold cross-validation strategy, and aggregated results from the test sets are shown.</p><p>The best 29 proteins are listed in Fig. <ref type="figure">3a</ref> with their corresponding model coefficient weight as determined from the SVM classification model. Since the one-versus-rest approach for multi-class classification is used, it results in three coefficients for each protein. The three coefficients for every protein correspond to the three diagnostic classes (AD, AsymAD, and Control). The corresponding heatmap illustrates how the selected proteins (n = 29) drive classification of AD, AsymAD, or Control. Purple represents negative drivers and blue positive drivers. 
The depth of the hue corresponds to relative magnitude of the coefficient as shown on the heatmap coefficient scale in Fig. <ref type="figure">3a</ref>. For example, increased APP strongly drives up AD classification, strongly drives down Control, and slightly drives up AsymAD. Similar interpretations can be made for all proteins and their effect on each class.</p><p>Figure <ref type="figure">3b</ref> examines the overlap of the 29 selected proteins in driving diagnostic class (AD, AsymAD, Control). AD and Control, labeled as area 1 on the Venn diagram, share seven driving proteins: APP, DHX15, NRXN1, PBXIP1, RABEP1, STOM, and VGF. AD and AsymAD, labeled as area 2 on the Venn diagram, share eight driving proteins: ALDH1A1, BDH2, C4A, FABP7, GABBR2, GNAI3, PBXIP1, and PRKAR1B. AsymAD and Control, labeled as area 3 on the Venn diagram, share seven driving proteins: APP, C4A, DMXL1, EXOC2, PITPNB, RABEP1, and VGF. Note that the color coding of each protein in Fig. <ref type="figure">3</ref> corresponds to function as described in the Functional Themes in the Selected Proteins section. The most predictive proteins tend to have opposite signs for coefficient modulation between discriminatory class pairs (AD and Control; AD and AsymAD; and AsymAD and Control).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Classification performance with 88 proteins</head><p>An RFE criterion of 50, resulting in 29 selected proteins, was sufficient for differentiating AD and Control classes. However, an RFE criterion of 150, resulting in 88 selected proteins, was optimal for differentiating AD and AsymAD classes. Figures <ref type="figure">4a</ref> and 4b illustrate the PR curve and AUC for each class utilizing the 88-protein subset for predicting diagnostic classification. Utilizing the 88-protein subset increased AD and Control classification performance by approximately 4 and 9 percentage points of AUC, respectively, compared to the 29-protein subset (0.94 to 0.98 and 0.83 to 0.92; Fig. <ref type="figure">4a</ref>). Utilizing the 88-protein subset increased AsymAD classification performance by approximately 24 percentage points of AUC (0.68 to 0.92; Fig. <ref type="figure">4a</ref>). In the independent validation cohorts (Fig. <ref type="figure">4b</ref>), which did not contain any AsymAD patients, the 29-protein subset marginally outperformed the 88-protein subset. In short, AsymAD requires substantially more proteins for accurate predictive classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Further exploration of classification performance as a function of RFE criterion</head><p>The RFE criterion for protein selection was varied to determine the optimal protein subsets. Again, the RFE criterion determines the number of proteins each classifier (SVM and LR) can select. For a given RFE criterion, the intersection of proteins selected by both SVM and LR becomes the resultant set of "best" predictive proteins. As described above, an RFE criterion of 50, resulting in 29 proteins, was sufficient to classify AD versus control. The number of intersecting best predictive proteins for diagnostic classification is not random. Rather, these thresholds are explained by examining dimensionality reduction with PCA. Figure <ref type="figure">4c</ref> illustrates variance explained as a function of number of components. The scree plot approximates the minimum number of components needed to explain the preponderance of variance. The "elbow" of the scree plot denotes the optimal range of components needed to account for the preponderance of variance. Figure <ref type="figure">4c</ref> shows that 29 components (red dot) correspond to the start of the elbow and 88 components (black dot) correspond to the end of the elbow. Variance per component beyond the elbow asymptotically approaches zero. Hence, those additional components should not substantively improve model performance. Figure <ref type="figure">4d</ref> examines the impact of RFE criterion and the resultant number of selected best predictive proteins on diagnostic classification performance. Selecting an RFE criterion greater than 150 (not shown) did not result in increased classification performance.</p></div>
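<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scree analysis described above rests on the variance explained by each principal component. The sketch below computes that quantity with NumPy on synthetic low-rank data. The 90% cumulative-variance cutoff is an assumption introduced only to make the example concrete; the article reads the elbow visually, and the helper names are hypothetical.</p>
<ab><![CDATA[
```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    return variances / variances.sum()


def components_for_variance(ratios, target=0.90):
    """Smallest number of leading components whose cumulative share reaches target."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, target) + 1)


# Synthetic data: 5 latent factors spread over 50 features plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 50))

ratios = explained_variance_ratio(X)
n_needed = components_for_variance(ratios, target=0.90)
print("components needed for 90% of variance:", n_needed)
```
]]></ab>
<p>Plotting the ratios against component index gives the scree plot; the point where the curve flattens is the elbow that the text compares with the 29- and 88-protein subsets.</p></div>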
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Functional themes in the selected proteins</head><p>The biomarker proteins were mapped to their corresponding modules, which were identified by using the weighted gene correlation network analysis (WGCNA) algorithm. Colors and module number are used to define the function of the biomarkers <ref type="bibr">[3]</ref> and are recapitulated in the first 3 columns of Fig. <ref type="figure">5</ref>. Functional module frequency (expressed as a percentage) of selected best proteins (for n = 29 and n = 88) was compared to the source frequency for the total protein set (n = 3,334). Change in frequency of selected protein module compared to source frequency is indicative of relative importance of a functional module in predicting diagnostic class (AD, AsymAD, Control). The most enriched module in selected protein sets for both n = 29 and n = 88 corresponds to sugar metabolism. Sugar metabolism (M4, yellow) most strongly correlated with AD associated traits (cognition r = -0.67, p = 8.5 &#215; 10<hi rend="superscript">-23</hi>; neurofibrillary tangle r = 0.49, p = 4.7 &#215; 10<hi rend="superscript">-27</hi>; amyloid-&#946; plaque r = 0.46, p = 1.3 &#215; 10<hi rend="superscript">-23</hi>; and functional status r = 0.52, p = 2.6 &#215; 10<hi rend="superscript">-12</hi>), as reported in <ref type="bibr">[3]</ref>. The 29 best proteins are significantly (p &lt; 0.05) enriched with the sugar metabolism module (M4, yellow). Proteins belonging to sugar metabolism (M4) constituted 5.6% of the 3,334 total proteins analyzed. However, sugar metabolism proteins constituted 20.7% of the selected best 29 proteins and 13.6% of the selected best 88 proteins (Fig. <ref type="figure">5</ref>). The remaining functional modules are not significantly different in their representation in either set of selected best proteins. Figure <ref type="figure">6</ref> lists the individually selected 29 best proteins (n = 29) and 88 best proteins (n = 88) color-coded by functional module. 
Note the 29-protein set is a subset of the 88-protein set (i.e., the best 29 proteins are all present within the best 88 proteins).</p><p>Fig. 4. [...] This was based on an assessment of the RFE criterion, principal component analysis (PCA), and evaluation of classifier using area under the curve (AUC) with standard error (&#177; &#963;). a) AUC of the precision-recall curve for classification of AD, AsymAD, Control using the n = 88 best predictive protein set. b) AUC of the precision-recall curve for classification of Control versus AD using the n = 88 proteins in the Mayo and UPenn datasets. Of the 88, Mayo dataset had 77 and UPenn dataset had 63 proteins respectively. c) PCA examining variance explained versus number of principal components. Red dot corresponds to the n = 29 selected protein subset and blue dot to the n = 88 selected protein subset. The "elbow" of the scree plot ends by about 88 principal components. d) Analysis of impact of RFE criterion and resultant number of selected best predictive proteins in the LFQ dataset and the validation datasets using each respective resultant protein subset for classification.</p><p>Fig. <ref type="figure">5</ref>. Functional Protein Module of Selected "Best" Predictive Proteins. The functional protein modules are as defined by Johnson et al. <ref type="bibr">[3]</ref>. Source frequency is the frequency of the module in the source protein set (n = 3,334 unique proteins). The selected frequency is for selected best proteins, n = 29 or n = 88. The M4 yellow module for sugar metabolism is significantly enriched (p &lt; 0.05) in selected proteins compared to their frequency in source. Enrichment of sugar metabolism in the selected predictive proteins signifies their importance to diagnostic classification.</p><p>Fig. <ref type="figure">6</ref>. Individual proteins comprising the selected "best" predictive protein subsets color-coded by functional module. The n = 29 subset was sufficient for differentiating AD from Control. However, the n = 88 subset was optimal for differentiating AD and AsymAD. Note all 29 proteins in the n = 29 selected set are contained within the n = 88 selected set. The Venn diagram inset pictorially summarizes the Recursive Feature Elimination (RFE) algorithm used to select the best protein subsets. The intersecting predictive proteins selected by both the support vector machine (SVM) and logistic regression (LR) classifiers during RFE became the "best proteins". An RFE criterion = 50 resulted in 29 best proteins (intersection shown on the left Venn inset). An RFE criterion = 150 resulted in 88 best proteins (intersection shown on the right Venn inset).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DISCUSSION</head><p>Of the 3,334 proteins, machine learning determined a minimum 29-protein subset necessary to accurately classify AD and Control, but an 88-protein subset was necessary to accurately classify AsymAD. The additional proteins needed for AsymAD classification are likely due to the greater complexity and heterogeneity of the AsymAD disease state. The "best" predictive protein subsets (n = 29 and n = 88) were significantly enriched for sugar metabolism (Fig. <ref type="figure">6</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Homeostatic regulatory dynamics key to disease progression</head><p>There was relatively little overlap between the predictive proteins that drive control-AsymAD changes and the predictive proteins that drive AsymAD-AD changes (see Fig. <ref type="figure">3b</ref>). This finding suggests an association with a multifactorial, dynamically progressing disease etiology. In short, the most predictive proteins change dynamically with disease stage (see Fig. <ref type="figure">3a</ref>). Whether AD is familial or sporadic, different underlying proteomic perturbations may result in multi-scalar system destabilization (e.g., failed homeostasis) with corresponding functional disease phenotypes. Homeostasis is critical for maintaining health, and thus, instabilities often appear in disease <ref type="bibr">[17]</ref>. Multifactorial homeostatic instability has been suggested as an underlying propagating mechanism in other neurological pathologies, including amyotrophic lateral sclerosis <ref type="bibr">[18]</ref>, absence epileptic seizures <ref type="bibr">[19]</ref>, Parkinson's disease <ref type="bibr">[20]</ref>, and secondary spinal cord injury <ref type="bibr">[21]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Overlapping proteins for class discrimination</head><p>Supplementary Table <ref type="table">1</ref> presents literature details on the functions and cited associations of each member of the 29-protein subset. Supplementary Table <ref type="table">1</ref> includes each protein's unique ID, a brief description of its function and role in AD (if known), and a corresponding reference. Five proteins of the 29-protein subset overlapped in class discrimination (Fig. <ref type="figure">3b</ref>): APP, VGF, RABEP1, C4A, PBXIP1. APP (upregulated in AD, AsymAD) was expected given its role in the amyloid cascade <ref type="bibr">[16]</ref>. VGF (downregulated in AD) protects against amyloid-&#946; pathology <ref type="bibr">[22]</ref>. RABEP1 was key for differentiating Control (upregulated) from AD or AsymAD (downregulated). RABEP1 is tied to longevity and AD <ref type="bibr">[23]</ref>. C4A was key for differentiating AsymAD (downregulated) from AD (upregulated) or Control. Increased C4A copy number <ref type="bibr">[24]</ref> impacts AD risk and schizophrenia <ref type="bibr">[25]</ref>. PBXIP1 was key for differentiating AD (upregulated) from AsymAD (downregulated) or Control. PBXIP1 is cited as altering cell viability and motility through rearrangements of the actin cytoskeleton <ref type="bibr">[26]</ref>. Interestingly, many of the 29 proteins are also biomarkers for various non-neural cancers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sugar metabolism biomarkers enriched in 29-protein and 88-protein set</head><p>Sugar metabolism proteins in both the 29-protein and 88-protein sets (Fig. <ref type="figure">6</ref>) included: APP, BDH2, C4A, CROCC, FABP7, PBXIP1. Sugar metabolism proteins in the expanded 88-protein subset included CD44, an immune marker associated with AD <ref type="bibr">[27]</ref>; PADI2, an age and AD-related marker <ref type="bibr">[28]</ref>; BANF1, implicated in aging and progeria <ref type="bibr">[29]</ref>; HSPB8, an inhibitor of amyloid-&#946; formation <ref type="bibr">[30]</ref>; SMC1A, where increased copy number is implicated in epilepsy, AD, and other neurodegenerative diseases <ref type="bibr">[31]</ref>; and BBOX1, implicated in diabetic kidney disease, lipid metabolic disorders, and schizophrenia <ref type="bibr">[32]</ref>.</p><p>The significantly enriched sugar metabolism module (Fig. <ref type="figure">5</ref>) supports the recent perspective that asymptomatic and symptomatic AD are characterized by dysregulation of energy metabolism <ref type="bibr">[33,</ref><ref type="bibr">34]</ref>. In short, the presented work supports the hypothesis that sugar metabolism becomes more impacted with disease progression. Insulin resistance in the brain modulates AD inflammatory markers and decreases amyloid clearance <ref type="bibr">[35]</ref>. The exact link between AD and type 1 or 2 diabetes is under debate. Nonetheless, poorly controlled blood sugar appears to increase risk of AD <ref type="bibr">[35]</ref>. Some researchers have referred to the dysregulation of blood sugar in the brain in AD as "type 3 diabetes" <ref type="bibr">[36]</ref>. Interconnections between inflammation, metabolism, and protein clearance are further evidence of a multifactorial homeostatic instability contributing to AD progression <ref type="bibr">[3,</ref><ref type="bibr">10,</ref><ref type="bibr">34,</ref><ref type="bibr">37]</ref>.</p></div>
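<div xmlns="http://www.tei-c.org/ns/1.0"><p>One common way to test whether a functional module is over-represented among selected proteins is a one-sided hypergeometric test. The text does not state which test produced the reported significance, so the sketch below is an assumption for illustration only. The source set size (3,334) and subset size (88) come from the text; the module size and the number of selected module members are hypothetical placeholders, and no resulting p-value should be read as a study result.</p>
<p><code lang="python"><![CDATA[
```python
from math import comb


def hypergeom_enrichment_p(
    n_source: int, n_module: int, n_selected: int, n_hits: int
) -> float:
    """One-sided P(X >= n_hits) when n_selected proteins are drawn without
    replacement from n_source proteins, n_module of which belong to the module."""
    upper = min(n_module, n_selected)
    favorable = sum(
        comb(n_module, i) * comb(n_source - n_module, n_selected - i)
        for i in range(n_hits, upper + 1)
    )
    return favorable / comb(n_source, n_selected)


N_SOURCE = 3334   # unique proteins in the source set (from the text)
N_SELECTED = 88   # size of the larger selected subset (from the text)
N_MODULE = 200    # hypothetical module size in the source set
N_HITS = 12       # hypothetical number of selected proteins in the module

p_value = hypergeom_enrichment_p(N_SOURCE, N_MODULE, N_SELECTED, N_HITS)
print(f"one-sided enrichment p-value: {p_value:.4g}")
```
]]></code></p>
<p>Exact integer arithmetic via math.comb avoids a dependency on a statistics library; a library routine for the hypergeometric survival function would serve equally well.</p></div>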
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The good and the bad of APP</head><p>APP is another example where homeostatic instability may play a role in disease progression. APP has long been known to be involved in the formation of amyloid-&#946;. A recent genome-wide association study identified APP as a relevant pathway in both familial and sporadic AD <ref type="bibr">[38]</ref>. Typically, APP is associated with the formation of toxic soluble amyloid-&#946; oligomers. However, researchers have also suggested that the production of soluble APP alpha (sAPP&#945;) may be a compensatory mechanism to help stave off AD pathology <ref type="bibr">[39]</ref>. In particular, amyloid-&#946; monomers have properties similar to those of sAPP&#945;: they are neurotrophic and neuroprotective and enhance neurogenesis. Hence, deciphering the possible neuroprotective versus the neurogenic role of APP in AD is an ongoing area of research <ref type="bibr">[40]</ref>. The present study cannot confirm or deny the precise causal role of APP as protective, destructive, or a combination of both. The present study's association-based results do show that APP is an important diagnostic classifier in disambiguating the three stages (Control, AsymAD, AD), as shown in Fig. <ref type="figure">3</ref>. Nonetheless, the diagnostic classification ability of APP is complex and intertwined with other biomarkers (Fig. <ref type="figure">2d-f</ref>). When APP was used alone without any other biomarkers, it was not a good classifier (Fig. <ref type="figure">2a-c</ref>), especially for AsymAD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Blood-based biomarkers</head><p>Biomarkers detected in the blood are preferable for AD risk assessment and early diagnosis <ref type="bibr">[41]</ref>. Only one blood-based module protein was selected in the 29-protein subset: PNP, a purine-related metabolite altered early in AD <ref type="bibr">[42]</ref>. Two additional blood-based proteins were in the 88-protein subset: AHSG and APOC3. A higher apoE level in high density lipoprotein that lacks apoC3 was associated with better cognitive function <ref type="bibr">[43]</ref>. AHSG, a highly glycosylated protein, appears downregulated in AD <ref type="bibr">[44]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assessment of alternatives and limitations</head><p>The RFE method presented in Fig. <ref type="figure">1</ref>, and the corresponding results above, were thoroughly vetted and compared with several other statistics-based and machine learning-based alternatives. The presented method consistently outperformed the alternative methods and models (see Supplementary Figures <ref type="figure">5</ref>, <ref type="figure">6</ref>, <ref type="figure">7</ref>, and <ref type="figure">8</ref>), especially in the mega-LFQ data set with three classes (Control, AsymAD, and AD). In summary, the presented 29-protein and 88-protein lists for diagnostic classification were quite stable. Relaxing the RFE criterion to include more proteins beyond the 88 selected proteins did not improve classification results (Fig. <ref type="figure">4</ref>). Nonetheless, no model or method is perfect. While the model is stable, it is fair to expect that a small number of proteins on the final presented list(s) could be replaced by non-included proteins (e.g., similarly co-expressed proteins or proteins from the same functional module). As such, regardless of method, a few proteins that conveyed similar, correlated, or mutual information relative to the selected proteins may not have made the final presented list(s). In full transparency, the performance of the protein list generated by each alternative method is shown in Supplementary Figures <ref type="figure">5</ref>, <ref type="figure">6</ref>, <ref type="figure">7</ref>, and <ref type="figure">8</ref>. Notably, many of the final RFE-selected proteins presented in the main article were repeatedly selected by the alternative methods. Finally, any proteins not included in the presented final lists (or even in the input study data) could have their relative importance deduced based on co-expression or their functional modules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Future directions</head><p>The LFQ data covered 3,334 proteins, from which the identified "best" biomarker subsets were derived. However, the presented method could be extended to more comprehensive future data sets, such as tandem mass tag data, to further optimize results. Future addition of larger validation cohorts, especially for AsymAD, will help ensure model generalizability. Additionally, future inclusion of traits such as gender and race (when available) is important to determine whether there are specific feature biases that impact the predictive ability or discriminative expression of proteins. Finally, this work utilized the common 3-class AD staging system: control, AsymAD, or AD. However, it is possible that there is a more optimal temporal disease staging system. For example, an integrative machine learning analysis of Alzheimer's Disease Neuroimaging Initiative data suggested at least four clusters of symptomatic AD patients <ref type="bibr">[45]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions</head><p>Machine learning successfully identified the protein subsets most predictive for classifying AD, AsymAD, and Control subjects. The most predictive protein subsets comprised &lt; 3% of the 3,334 proteins assessed. A 29-protein subset accurately classified AD versus Control, but an 88-protein subset was needed to accurately classify AsymAD. The protein subsets resulted in a robust classifier model. The presented model generalized to accurately predict diagnostic labels on unseen data in independent validation cohorts regardless of brain region or minor data set differences. The predictive protein subsets included known important proteins like APP. However, diagnostic classification performance did not hinge upon APP or any single protein or pathway. Finally, the most predictive subsets were significantly enriched in proteins linked to sugar metabolism.</p></div></body>
		</text>
</TEI>
