<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Subject Harmonization of Digital Biomarkers: Improved Detection of Mild Cognitive Impairment from Language Markers</title></titleStmt>
			<publicationStmt>
				<publisher>WORLD SCIENTIFIC</publisher>
				<date>12/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10542599</idno>
					<idno type="doi">10.1142/9789811286421_0015</idno>
					
					<author>Bao Hoang</author><author>Yijiang Pang</author><author>Hiroko H Dodge</author><author>Jiayu Zhou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Mild cognitive impairment (MCI) represents the early stage of dementia, including Alzheimer's disease (AD), and is a crucial stage for therapeutic interventions and treatment. Early detection of MCI offers opportunities for early intervention and significantly benefits cohort enrichment for clinical trials. Imaging markers and in vivo biomarkers in plasma and cerebrospinal fluid have high detection performance, yet their prohibitive costs and intrusiveness demand more affordable and accessible alternatives. Recent advances in digital biomarkers, especially language markers, have shown great potential, where variables informative of MCI are derived from linguistic and/or speech data and later used for predictive modeling. A major challenge in modeling language markers comes from the variability in how each person speaks. As the cohort size for language studies is usually small due to extensive data collection efforts, the variability among persons makes language markers hard to generalize to unseen subjects. In this paper, we propose a novel subject harmonization tool to address the issue of distributional differences in language markers across subjects, thus enhancing the generalization performance of machine learning models. Our empirical results show that machine learning models built on our harmonized features have improved prediction performance on unseen data. The source code and experiment scripts are available at https://github.com/illidanlab/subject harmonization.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Alzheimer's disease (AD) is a major type of dementia and ranked as the seventh-leading cause of death in the United States in 2020. <ref type="bibr">1</ref> Mild Cognitive Impairment (MCI) is the prodromal stage of dementia, including AD, characterized by minor problems with memory, language, or judgment. Early detection of MCI is critical for early intervention and cohort enrichment. In vivo biomarkers such as A&#946;-amyloid identified by cerebrospinal fluid A&#946;42 or PET amyloid imaging are sensitive to the early or pre-clinical stage. Yet, they are neither easily accessible nor affordable for massive screening of general older adults, especially those with limited healthcare access.</p><p>Recently developed digital biomarkers have offered an affordable and non-intrusive alternative. In particular, language markers, 2-4 i.e., linguistic and speech variables derived from structured 5 or semi-structured <ref type="bibr">4</ref> conversations, have shown a significant correlation with the cognitive capability of the subjects and have recently been used for MCI detection. <ref type="bibr">6</ref> Digital biomarkers are generally derived and utilized in a data-driven fashion. For example, language markers are derived from carefully designed cohorts <ref type="bibr">4,</ref><ref type="bibr">7</ref> to build predictive models that take language features as input and clinical variables as output. One significant challenge of digital biomarkers is the limited cohort sample size, since specially designed collection protocols and devices must be deployed for data collection. For example, in the studies of language markers, the I-CONECT study <ref type="bibr">4</ref> collected semi-structured conversation data from 74 subjects in a five-year clinical trial, and the ADReSS data from DementiaBank contains spontaneous speech from 158 subjects. 
<ref type="bibr">7</ref> As the small sample size greatly limits the machine learning models that can be used for analysis, a standard practice to enrich the sample size is to construct multiple data points from the same subject, each associated with the subject's clinical label as the prediction target. In sensor studies, for example, by using a fixed time window, multiple time series are derived from the same subject as data points. <ref type="bibr">8,</ref><ref type="bibr">9</ref> Another example is in language marker studies, where linguistic and speech markers are derived from one conversation, and thus multiple conversations from the same subject are treated as different data points. <ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[4]</ref> Even though these treatments greatly increase the sample size for predictive modeling, they violate the basic assumption of most analytic approaches: that data points should be independent and identically distributed (i.i.d.). The non-i.i.d. issue is compounded by another challenge of digital biomarkers, which usually have high individual variability compared to other biomarkers, leading to unstable prediction performance and poor generalization to unseen subjects. <ref type="bibr">10</ref> Taking language markers as an example again: the way people speak can differ drastically, and such differences are far more pronounced than the subtle differences characterizing cognitive capability. The intuitive idea is to harmonize away the distributional bias from subjects, similar to the harmonization that removes confounding factors from demographic data or eliminates batch effects. 
However, subject harmonization differs drastically from eliminating typical confounding variables: the subjects in the testing/inference stage are not accessible during training, and the embedding of subject information is implicit and may be non-linearly correlated with multiple dimensions of the original feature representations. Therefore, existing harmonization approaches cannot be used to quantify and remove the subject effects.</p><p>In this paper, we propose a novel framework for subject harmonization. The proposed approach uses an auxiliary classification task on the subjects to learn a deep harmonization network, which eliminates both linear and non-linear effects in differentiating subjects. Our empirical results show that the language markers harmonized by the proposed approach can improve MCI detection performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Detection of MCI. There are many approaches developed for detecting MCI using a combination of clinical information, 11 brain imaging, <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref> and genetics. <ref type="bibr">16,</ref><ref type="bibr">17</ref> For example, machine learning models built on brain imaging such as MRI and FDG PET have been shown effective at capturing structural and metabolic information of the brain that is strongly associated with the development of AD. <ref type="bibr">14,</ref><ref type="bibr">18</ref> Yet these biomarkers are often expensive and intrusive, making them impractical for screening general older adults. More recently, digital biomarkers <ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[4]</ref> have offered a promising, affordable, and non-intrusive alternative for broader adoption. The development of language markers is still in its early stage. Digital markers derived from behavior are highly variable across individuals, and language markers derived from limited data often yield unstable detection models that are hard to generalize to unseen populations. Data Harmonization. A fundamental challenge of data analysis is the harmonization of confounding variables, i.e., eliminating the effects of confounding variables. <ref type="bibr">19,</ref><ref type="bibr">20</ref> With explicit confounding variables, common harmonization approaches eliminate the confounding variables' influence on the original input features or output. <ref type="bibr">21,</ref><ref type="bibr">22</ref> Recent deep learning models require the harmonization of non-linear effects, leading to the development of end-to-end frameworks that combine the task prediction loss with a penalty loss that usually minimizes the dependence between confounders and prediction outcomes. 
<ref type="bibr">[23]</ref><ref type="bibr">[24]</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref> Meanwhile, fair machine learning schemes exploit distributionally robust optimization to control implicit demographic confounding effects (bias). <ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref> From another perspective, the confounding variables can be considered a strong signal in the original features that is irrelevant to the prediction goal, in which case feature engineering helps reduce their effects. <ref type="bibr">30</ref> Most existing harmonization approaches require the confounding variables to be accessible during training in order to secure generalization to unseen groups. However, in digital biomarker studies where subjects are treated as a confounding variable, the challenge arises that testing subjects are not seen during training, demanding a harmonization that generalizes across subjects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data</head><p>We use semi-structured conversational data from the clinical trial I-CONECT (Clinicaltrials.gov: NCT02871921). The data is available upon request at <ref type="url">https://www.i-conect.org/</ref>. This clinical trial aims to investigate the potential benefits of regular video chat conversations on the cognitive functions and psychological well-being of individuals aged 75 and older. The dataset has 6771 conversation sessions from 74 participants, with 36 participants being cognitively normal (NL) and 38 diagnosed with mild cognitive impairment (MCI). Each conversational session is about 30 minutes in length. Table <ref type="table">1</ref> shows the participants' demographic information. Linguistic features are derived using the Linguistic Inquiry and Word Count (LIWC) lexicon. <ref type="bibr">3</ref> We first generate a 64-dimensional LIWC feature vector for every word in each conversation, with each dimension corresponding to a specific LIWC category (1 = the word belongs to the category, 0 = it does not); we then sum the feature vectors of all words in the conversation, resulting in a single 64-dimensional vector representing the linguistic features of that conversation. Syntactic complexity represents the range and intricacy of grammatical structures employed in language production. <ref type="bibr">32</ref> We used the L2 Syntactic Complexity Analyzer 33 to extract the syntactic complexity features. This tool is specifically designed to automate the analysis of syntactic complexity in English texts produced by advanced learners of English. We extract a 23-dimensional vector from each conversation representing its syntactic complexity, with each dimension corresponding to a specific English syntactic complexity measure from the tool. Lexical diversity is the range of different words within a given text, wherein a wider range indicates greater diversity. 
<ref type="bibr">34</ref> Given a text input, lexical diversity has been measured using the type-token ratio (TTR), <ref type="bibr">35</ref> obtained by dividing the number of unique words by the overall word count. To adopt this in our study, we extract the TTR from participants' conversational responses, as well as its variations, such as the moving-average type-token ratio (MATTR) <ref type="bibr">36</ref> and the mean segmental type-token ratio (MSTTR). We also use additional lexical diversity measures, including the Hypergeometric Distribution D (HD-D) and the Measure of Textual Lexical Diversity (MTLD). <ref type="bibr">37</ref> In total, we derive a 10-dimensional vector representing a conversation's lexical diversity, with each dimension corresponding to one of the aforementioned lexical diversity measures or its respective variation.</p><p>Response length: Our analysis suggests that NL individuals tend to provide lengthier responses to questions posed by the interviewer than MCI individuals, showing great potential for distinguishing between MCI and NL individuals. We extract the mean and variance of participants' response lengths within each conversation.</p></div>
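As an illustration of how the type-token measures above are computed, here is a minimal sketch; the function names and the window size are illustrative, not taken from the study's implementation:

```python
def ttr(tokens):
    """Type-token ratio: number of unique words / total words."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-average type-token ratio (MATTR): average TTR over a
    sliding window, which reduces plain TTR's sensitivity to text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

words = "the quick brown fox jumps over the lazy dog the fox".split()
print(ttr(words))             # 8 unique types out of 11 tokens
print(mattr(words, window=5))
```

Both values lie in (0, 1], with higher values indicating a more diverse vocabulary.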
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Generalized Least Squares</head><p>Generalized least squares is a widely used harmonization approach to remove linear effects given confounding variables, such as age, gender, and education. <ref type="bibr">21,</ref><ref type="bibr">38</ref> For each conversation's extracted language marker features x_i, we assume that these features are linearly biased by three confounding variables, the age, sex, and education of the subject, denoted c_i = [age, sex, education], such that: x_i = w c_i + x_i^harmonized,</p><p>where w is the weight matrix and x_i^harmonized is our goal, the harmonized language markers. The objective function for the generalized least squares method is given by: min_w &#8721;_i ||x_i &#8722; w c_i||&#178;.</p><p>After obtaining the weight matrix w by solving the above objective function, the harmonized language markers are derived by: x_i^harmonized = x_i &#8722; w c_i.</p></div>
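This residualization can be sketched in a few lines of numpy; the function name and the synthetic data below are illustrative assumptions, not the study's code:

```python
import numpy as np

def gls_harmonize(X, C):
    """Remove linear confounder effects: regress every feature column of X
    on the confounder matrix C and keep the residuals.
    X: (n_samples, n_features) language markers
    C: (n_samples, n_confounders), e.g. columns for age, sex, education
    """
    w, *_ = np.linalg.lstsq(C, X, rcond=None)   # solves min_w ||X - C w||^2
    return X - C @ w

rng = np.random.default_rng(0)
C = rng.normal(size=(100, 3))                   # synthetic confounders
X = C @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(100, 5))
X_h = gls_harmonize(X, C)
# the residuals are numerically orthogonal to the confounder columns
print(np.abs(C.T @ X_h).max())
```

By construction the harmonized features live in the orthogonal complement of the confounders' column space, which is exactly the "linear effects removed" property discussed next.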
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Subject Harmonization for Non-linear Predictive Modeling</head><p>Unlike other types of in vivo biomarkers, digital markers show great individual variability. In language markers, for instance, how one speaks a language can differ greatly, even among native speakers. The differences can be visualized by checking the distributions of language features. Our empirical results in Sec. 4.1 show that the feature variables have clear clustering structures w.r.t. subjects. As such, successful analysis and predictive modeling need careful harmonization to eliminate individual variability. The harmonization mechanism of generalized least squares eliminates the linear subspace that is predictive of the confounding variables and uses the orthogonal complement subspace as the harmonized features. Though all linear effects are removed by this approach, it does not remove any non-linear effects from the data. For example, if the product of two confounding variables (e.g., age and gender) affects the data, such effects will not be removed and will be picked up by non-linear models such as random forests and deep learning models. Another challenge comes from the generalization of harmonization, where digital biomarkers demand a harmonization procedure that can be generalized to unseen subjects. To address the above challenges, we propose a deep harmonization network to facilitate analytics with digital biomarkers. In the context of predicting MCI from language markers, we are given a set of conversations collected from a set of different subjects, and we would like to build a predictive model for MCI using these conversations. Following the previous section, we extract features for each conversation and form a feature vector per conversation. 
The setting of predictive modeling is to classify each conversation/feature vector into a label (MCI or not), which will later be aggregated into a prediction for the subject. The feature vectors of one subject will be used in either training or testing, but not both. The goal of harmonization is to remove the confounding factor of subjects from the feature vectors. The proposed approach has two stages: in the first stage, we construct an auxiliary task to learn the deep harmonization network; in the second, the learned harmonization network is used to transform the data points, and the harmonized data is then used for building a downstream classifier of MCI.</p><p>The design of a deep harmonization network is based on two intuitions: 1) a good harmonization should remove all linear and non-linear effects from subjects, and therefore the harmonized features should not be able to differentiate subjects even under deep models; 2) the harmonized features should be as close to the original features as possible (otherwise, the harmonization admits a trivial solution where all features are wiped and set to the same value). Following these intuitions, the proposed approach seeks to minimize the subject differentiation between data points obtained from different subjects while minimizing the differences between the harmonized and original language markers. Generally, for M pairs of extracted language features and corresponding subject labels (x_i, y_i^s), we denote f_FH(&#8226;) : x &#8594; x as the feature harmonization network parameterized by &#952;_FH, and f_s(&#8226;) : x &#8594; s as the auxiliary subject classifier parameterized by &#952;_s. The composite function f_s &#8226; f_FH denotes a classifier f_s using harmonized features f_FH. 
The objective for learning feature harmonization is given by: min_{&#952;_FH, &#952;_s} &#8721;_{i=1}^M [ &#8722;&#8467;_ent(f_s &#8226; f_FH(x_i), y_i^s) + &#955; &#8467;_mse(f_FH(x_i), x_i) ],</p><p>where &#8467;_ent(&#8226;) is the cross-entropy loss, and minimizing &#8722;&#8467;_ent(&#8226;) encourages that the harmonized features cannot be differentiated by subject identities; &#8467;_mse(&#8226;) is the mean squared error, which encourages similarity between the original and harmonized features; and &#955; balances the two terms. Note that we do not restrict the type of classifier to be used as f_s, but a non-linear model is preferred due to the design of deep harmonization. In our study, we use a 3-layer MLP for the harmonization network.</p></div>
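The combined training signal can be sketched as below, assuming a softmax subject classifier; the function names and the trade-off weight `lam` are our own illustrative choices, and the paper's exact formulation of the objective may differ:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def harmonization_loss(x, x_harm, subject_logits, subject_ids, lam=1.0):
    """Loss minimized by the harmonization network: the negated subject
    cross-entropy pushes harmonized features to be uninformative about
    subject identity, while the MSE term keeps them close to the original
    features (avoiding the trivial all-constant solution)."""
    p = softmax(subject_logits)
    n = len(subject_ids)
    ce = -np.log(p[np.arange(n), subject_ids] + 1e-12).mean()
    mse = ((x_harm - x) ** 2).mean()
    return -ce + lam * mse

# with uniform subject logits, the cross-entropy equals log(#subjects),
# so an identity harmonization of 2 subjects scores roughly -log(2)
x = np.zeros((4, 3))
loss = harmonization_loss(x, x, np.zeros((4, 2)), np.array([0, 1, 0, 1]))
print(loss)
```

In practice both terms would be backpropagated through the harmonization network; only the loss computation is shown here.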
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">MCI Detection using Harmonized Features</head><p>After the harmonization process, we use the harmonized features, with confounding effects removed, for the downstream task of MCI detection. MCI detection can be modeled by two classification tasks: a) conversation classification, which identifies whether a given conversation is from an MCI subject or an NL subject using language markers extracted from the conversation, and b) subject classification, which collectively uses the results of conversation classification on one subject's conversations to predict whether the subject is MCI or NL. We model conversation classification as a standard machine learning task that seeks a classifier taking language markers as input and outputting a binary prediction. Formally, we have M pairs of extracted features and corresponding cognitive status labels (x_i, y_i^t).</p><p>We denote f_t(&#8226;) : x &#8594; t as the MCI classifier parameterized by &#952;_t. In our study, we use two classifiers: a linear model (logistic regression, LR) and a non-linear model (2-layer multi-layer perceptron, MLP). Then, the objective function for cognitive status classification is formulated as: min_{&#952;_t} &#8721;_{i=1}^M &#8467;(f_t(x_i), y_i^t),</p><p>where &#8467;(&#8226;) is the binary cross-entropy loss. To achieve subject classification, we use a majority vote strategy: if more than 50% of a subject's conversations are predicted as MCI by the conversation classifier, we classify that subject as MCI, and NL otherwise. For both settings, we randomly sample 80% of subjects as training subjects and use the remaining subjects as test subjects. The conversations from training subjects are used to train the conversation classifier. The complete framework is illustrated in Figure <ref type="figure">1</ref>. </p></div>
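The majority-vote aggregation can be sketched as follows (conversation predictions encoded as 1 for MCI, 0 for NL; a tie falls to NL, which follows from the "more than 50%" rule):

```python
def classify_subject(conversation_preds):
    """Aggregate per-conversation predictions (1 = MCI, 0 = NL) into a
    subject-level label: MCI only if strictly more than half of the
    subject's conversations are predicted MCI."""
    mci_share = sum(conversation_preds) / len(conversation_preds)
    return 1 if mci_share > 0.5 else 0

print(classify_subject([1, 1, 0]))  # 2/3 of conversations MCI -> 1
print(classify_subject([1, 0]))     # exactly half -> 0 (NL)
```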
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Results and Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Effectiveness and Generalizability of Subject Harmonization</head><p>The harmonization is designed to remove the confounding effect of the subject variable. Therefore, we investigate the predictive power towards subjects using features before and after harmonization. The stronger the confounding variable, the better the features' ability to differentiate subjects, and a successful harmonization should largely eliminate such predictive power. In this experiment, conversations from the same subject are assigned the same label, while conversations from different subjects are assigned distinct labels. For example, all conversations from the first subject have the label 1, and all those from the second subject have the label 2. With a total of 74 subjects, we have 74 unique labels. We randomly split the data (original or harmonized) into training and testing sets, with 80% of conversations for training and 20% for testing. We build a linear classifier (logistic regression) and a deep classifier (multi-layer perceptron) using the training data and evaluate the performance in terms of accuracy on the testing data. For the harmonization network, we use a 3-layer MLP. We repeat the experiment for 100 random seeds and report the average accuracy of predicting testing conversations' subject labels before and after harmonization in Table 2. We use the same training and testing conversations for each random seed when evaluating before and after harmonization. We see a substantial decrease in subject classification performance for both models, showing the effectiveness of the harmonization design in removing the confounding variable's linear and non-linear effects.</p><p>We conduct a qualitative study that visualizes the distributions of the language markers before and after the subject harmonization in Figure <ref type="figure">2</ref>. 
We use t-SNE <ref type="bibr">39</ref> to plot the 99-dimensional language markers in a comprehensible 2-dimensional space, where conversations from the same subjects are assigned matching colors. From the visualization, we see that data points from the same subjects show a clear clustering structure, indicating subject bias in the language markers.</p><p>Fig. <ref type="figure">2</ref>. The visualization of language markers extracted from conversations collected from 10 randomly selected subjects before and after subject harmonization. We see that a clear clustering structure exists before subject harmonization, which is successfully destroyed by the harmonization.</p><p>After the harmonization, such clustered structure is visually destroyed, showing the effectiveness of the proposed harmonization strategy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">MCI Detection via Harmonized Language Markers</head><p>We now investigate the predictive power of language markers in detecting MCI subjects. We compare a set of different harmonization approaches: a) generalized least squares, <ref type="bibr">21,</ref><ref type="bibr">38</ref> commonly used for harmonizing linear effects, using age/gender/education as confounding variables; b) the proposed deep subject harmonization, which harmonizes against the subject variable but does not use the demographic variables (age/gender/education); c) deep harmonization that does not use subject information and jointly harmonizes all demographic variables; and d) deep harmonization approaches that harmonize only individual demographic variables. When harmonizing demographic variables using a deep harmonization network, we construct categorical variables from age/gender/education (e.g., age 75-79 as category 1, age 80-84 as category 2) and train the harmonization network with the same objective. We repeat the experiments for 100 random seeds and report the average and standard deviation of the area under the ROC curve (AUC), F1, sensitivity, and specificity on the test data in Table <ref type="table">3</ref>.</p><p>From the results, we find the following: 1) The non-linear MLP model using features from deep subject harmonization, which harmonizes the subject variable using a deep model, provides the best downstream classification performance for both conversation and subject predictions. 2) Both the linear and non-linear models benefit more from deep subject harmonization than from generalized least squares. 3) For the MLP, deep harmonization of demographic variables performs worse than generalized least squares, even though both jointly harmonize against all three demographic variables.</p></div>
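The categorical encoding of the demographic confounders can be illustrated as follows; the 5-year bin width matches the example in the text, while extending it uniformly to all ages is our assumption:

```python
def age_category(age):
    """Map age to a 5-year bin index starting at 75, mirroring the
    example in the text: 75-79 -> 1, 80-84 -> 2, 85-89 -> 3, ..."""
    return (age - 75) // 5 + 1

print(age_category(77))  # 1
print(age_category(83))  # 2
```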
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Performance on Different Sub-Populations</head><p>Table <ref type="table">4</ref> presents the performance of conversation and subject classification on different sub-populations, i.e., different gender groups, education levels, and age groups. By zooming in on the performance of different sub-population groups, we want to inspect how the proposed subject harmonization impacts these groups, given that demographic variables are not used in the harmonization process. From the results, we see that the proposed subject harmonization consistently improves the performance of most groups, with two exceptions: 1) the more highly educated group (19-21 years of education), for both conversation and subject classification, and 2) a minor performance drop in the Male group for subject classification. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Important Language Markers Before and After Harmonization</head><p>In this section, we investigate feature importance and compare the top language markers before and after harmonization. For linear models, feature importance can be derived directly from the model weights; for the non-linear MLP models used in this paper, there is no such straightforward way. We therefore adopt the commonly used permutation feature importance <ref type="bibr">40</ref> to estimate feature importance. We permute each feature's values and subsequently feed the modified dataset into our pipeline. We then derive the AUC score for both conversation and subject classification using this permuted dataset. The importance of a feature is determined by computing the difference between the AUC values obtained from the original dataset and the permuted dataset: a larger decrease in AUC indicates a higher importance of the respective feature in the classification model.</p><p>In Table <ref type="table">5</ref>, we present the top 10 language features before and after feature harmonization for both conversation and subject classification. We see that: 1) the top features differ considerably before and after harmonization. Notably, "Nonfluencies" is the most important feature after harmonization, which better matches the pathology of dementia: dementia (even at the preclinical stage) may make it harder for a subject to find the right words, manifesting as a higher number of nonfluencies during communication. 2) More syntactic complexity features appear after harmonization for subject classification. The top features "T-unit per sentence" and "mean length of sentence" directly relate to the language capability of constructing longer sentences.</p></div>
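The permutation procedure can be sketched as below; `score_fn` stands in for the trained pipeline's AUC evaluation, which is not shown here, and the toy scorer in the example is purely illustrative:

```python
import numpy as np

def permutation_importance(score_fn, X, y, seed=0):
    """Permutation feature importance: shuffle one feature column at a
    time and record the drop in the score (e.g., AUC). A larger drop
    means the model relied more heavily on that feature."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, j])          # destroy feature j's signal
        drops.append(base - score_fn(X_perm, y))
    return np.array(drops)

# toy scorer that only looks at feature 0: feature 1 gets zero importance
X = np.array([[1.0, 5.0], [-1.0, 5.0]] * 10)
y = np.sign(X[:, 0])
score = lambda X_, y_: float((np.sign(X_[:, 0]) == y_).mean())
imp = permutation_importance(score, X, y)
print(imp)
```

Because the second column is constant, shuffling it changes nothing and its importance is exactly zero, while the informative first column takes a hit.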
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>In this paper, we propose a subject harmonization algorithm to mitigate the distributional differences in digital biomarkers induced by subject variability. Our empirical results show that applying subject harmonization to language markers improves the performance of MCI detection. We show the effects of subject variability from a quantitative perspective using a subject prediction task, and from a qualitative perspective through visible clusters in the visualization of language markers. Our experiments show that the proposed subject harmonization approach effectively mitigates subject variability, so that the harmonized data has much less power to differentiate among subjects. Meanwhile, we show that MCI detection models built from language markers harmonized by the proposed approach improve predictive performance: the harmonization improves the AUC of MCI prediction from 0.594 to 0.646 in the conversation classification task and from 0.626 to 0.657 in the subject classification task. We further investigated the sub-group performance across different ages, genders, and years of education, and we see that the performance of most groups has improved. Despite the improvement in prediction performance achieved by the harmonization algorithm, several directions remain for future study. Firstly, the prediction performance from language markers can still be improved; a possible reason is the quality of the language markers, as we only used linguistic and syntactic information. We will study subject harmonization on additional feature variables, such as speech and video. Secondly, performing subject harmonization on demographic variables led to reduced predictive performance, indicating that the proposed deep harmonization network is currently not applicable to general harmonization usage. 
We plan to investigate the theoretical relationship between the two harmonization types and improve the deep harmonization network to handle demographic variables. Thirdly, while we have successfully validated the positive impact of harmonization on language markers, its efficacy on other data types remains to be confirmed. We plan to dedicate considerable time to applying the harmonization algorithm to different types of markers, such as clinical data or brain imaging data. This broader exploration will enable us to assess the generalizability and versatility of the harmonization technique across various data modalities, facilitating a more comprehensive understanding of its potential applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Acknowledgement</head><p>This material is based in part upon work supported by the National Science Foundation under Grant IIS-2212174, IIS-1749940, Office of Naval Research N00014-20-1-2382, and National Institute on Aging (NIA) RF1AG072449, R01AG051628, R01AG056102.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Biocomputing 2024 Downloaded from www.worldscientific.com by UNIVERSITY OF MICHIGAN ANN ARBOR on 09/18/24. Re-use and distribution is strictly not permitted, except for Open Access articles.</p></note>
		</body>
		</text>
</TEI>
