<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Information-incorporated clustering analysis of disease prevalence trends</title></titleStmt>
			<publicationStmt>
				<publisher>Institute of Mathematical Statistics</publisher>
				<date>06/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10512845</idno>
					<idno type="doi">10.1214/23-AOAS1821</idno>
					<title level='j'>The Annals of Applied Statistics</title>
<idno>1932-6157</idno>
<biblScope unit="volume">18</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Chenjin Ma</author><author>Cunjie Lin</author><author>Yuan Xue</author><author>Sanguo Zhang</author><author>Qingzhao Zhang</author><author>Shuangge Ma</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In biomedical research the analysis of disease prevalence is of critical importance. While most of the existing prevalence studies focus on individual diseases, there has been increasing effort that jointly examines the prevalence values and their trends of multiple diseases. Such joint analysis can provide valuable insights not shared by individual-disease analysis. A critical limitation of the existing analysis is that there is a lack of attention to existing information, which has been accumulated through a large number of studies and can be valuable especially when there are a large number of diseases but the number of prevalence values for a specific disease is limited. In this study we conduct the functional clustering analysis of prevalence trends for a large number of diseases. A novel approach based on the penalized fusion technique is developed to incorporate information mined from published articles. It is innovatively designed to take into account that such information may not be fully relevant or correct. Another significant development is that statistical properties are rigorously established. Simulation is conducted and demonstrates its competitive performance. In the analysis of data from Taiwan NHIRD (National Health Insurance Research Database), new and interesting findings that differ from the existing ones are made.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction.</head><p>In biomedical research the analysis of prevalence has had a pivotal role. There have been extensive studies on the "snapshot" values of disease prevalence and their associated factors. Quite a few other studies are on disease prevalence trends, and such analysis can assist prioritizing diseases (e.g., those with fast increasing prevalence should receive more attention), identifying new risk factors for etiology (which can facilitate developing prevention and treatment strategies), and managing diseases in clinical practice (e.g., by making proper resource planning and allocation). Most of the existing prevalence studies focus on individual diseases (or small classes of preselected, tightly connected diseases). On the other hand, there has been increasing effort that jointly analyzes the prevalence values of two or more diseases <ref type="bibr">(Joffres et al. (2013</ref><ref type="bibr">), Romanowski et al. (2015)</ref>, <ref type="bibr">Jadhav et al. (2021)</ref>). For example, a prospective observational study conducted in the U.S. examines the prevalence of chest pain and acute myocardial infarction (MI) and shows that patients without chest pain on presentation represent a large segment of the MI population and have an increased risk for delays in seeking treatment <ref type="bibr">(Canto et al. (2000)</ref>). Another example is the examination of the prevalence values and trends of multiple cancers under the metastasis networks <ref type="bibr">(Chen et al. (2009)</ref>). Also, using the Taiwan NHI (National Health Insurance) data, studies have been conducted on diseases sharing similar prevalence trends with HIV/AIDS <ref type="bibr">(Lai (2015)</ref>) and amyotrophic lateral sclerosis <ref type="bibr">(Tsai, Hu and Lee (2019)</ref>).</p><p>The growth in joint prevalence analysis fits the paradigm shifting in biomedical research from individual-disease to pan-disease analysis. One of the early breakthroughs is the human disease network (HDN) analysis, which examines the interconnections among diseases based on their genetic risk factors <ref type="bibr">(Goh et al. (2007)</ref>). The phenotypic HDN (pHDN) analysis has been subsequently conducted and differs from the molecular HDN analysis by focusing on clinical phenotypes <ref type="bibr">(Zhou et al. (2014)</ref>). With the consideration that both molecular basis and disease phenotypes are not "close enough" to clinics, pan-disease analysis of the interconnections between disease clinical treatment measures, such as treatment cost <ref type="bibr">(Ma et al. (2020)</ref>) and inpatient length of stay, has been conducted. Here we note that disease prevalence trend is correlated with the aforementioned variables (e.g., a disease with low prevalence is likely to have limited treatment cost and inpatient stay) but cannot be fully derived from them. As such, this work is expected to complement but not strongly overlap with the aforementioned ones.</p><p>In this study we examine the interrelationships among diseases in terms of prevalence trend. To fix ideas, in the upper-left panel of Figure <ref type="figure">1</ref>, we present the prevalence trends of 10 diseases. In the upper-right panel, we move the curves vertically and find four clusters, with those in the same cluster having similar patterns. Although functional clustering has been extensively conducted <ref type="bibr">(Jacques and Preda (2014)</ref>), its application to disease prevalence trends has been very limited but can have important implications. First, the temporal variations of disease prevalence can be largely attributed to the development of prevention programs, improvement in diagnosis, change in environmental conditions, and other time-dependent influential factors. As such, if multiple diseases have similar temporal trends, it is sensible to hypothesize that they share time-dependent risk factors and/or are affected by similar prevention/diagnosis/treatment programs. Identifying such shared factors can advance our understanding of diseases and inform clinical practice. For example, epidemiologic studies suggest that cognitive dysfunction and type 2 diabetes are "correlated," which is manifested in the shared patterns of their prevalence trends <ref type="bibr">(Luchsinger et al. (2007)</ref>). Motivated by such observations, researchers have examined the micro-relation of glycemic status with different domains of cognitive functions and found a number of vascular and neurodegenetive mechanisms through which type 2 diabetes and cognitive function are interconnected. Second, clustering analysis can lead to a new/alternative way of disease classification and characterization. Different classifications are needed to serve different purposes <ref type="bibr">(Zhou et al. (2018)</ref>), and the classification, as exemplified in the upper-right panel of Figure <ref type="figure">1</ref>, has a basis different from those based on organ, symptom, and genetics. It can be "closer" to public health as well as medical care management and planning. Third, clustering analysis can lead to simpler data structures and more accurate estimation, as estimation only needs to be done at the cluster (as opposed to individual) level.</p><p>Among the existing studies, the most relevant is Jadhav et al. ( <ref type="formula">2021</ref>), which also conducts the functional clustering analysis of disease prevalence trends and analyzes the NHI data. The present study advances from <ref type="bibr">Jadhav et al. (2021)</ref> in multiple important aspects. As exemplified in the upper-left panel of Figure <ref type="figure">1</ref>, for each disease the number of measurements (usually one prevalence value per year) is limited. Combined with the large number of diseases, this leads to a lack of information and hence unreliable estimation and unsatisfactory clustering. Many diseases have been extensively studied in published literature, and quite a few studies have touched on the "interconnections" among diseases. In Figure <ref type="figure">2</ref>, for selected diseases, we present the numbers of published studies that have mentioned disease pairs (more details on information extraction is presented in Section 2.2). For example, our information extraction identifies 1431 publications that have mentioned both bipolar and dementias. Information as sketched in Figure <ref type="figure">2</ref> has been accumulated through a large number of studies and can be valuable. Methodologically, this study significantly advances from Jadhav et al. ( <ref type="formula">2021</ref>) by developing a novel strategy to incorporate existing information. Borrowing strength from outside information is not a new concept, however, limitedly pursued in functional clustering. In the middle and lower panels of Figure <ref type="figure">1</ref>, we show two scenarios of existing information and the corresponding estimates when such information is incorporated, which suggests that incorporating information does have an impact on clustering. A foreseeable significant challenge is that information obtained from mining published studies can be partially correct or even incorrect. This is highly likely when it is not possible to accurately scrutinize each piece of information. For example, a study that mentions the interconnection between two diseases may derive that on a molecular basis. In this case the information is partially relevant. Consider another example where a study mentioning a disease pair actually suggests that they are not related. In this case this information will be included in Figure <ref type="figure">2</ref>, however, is incorrect for the interconnection between the two diseases. To accommodate such scenar- ios, significant methodological developments are needed. Another significant advancement from <ref type="bibr">Jadhav et al. (2021)</ref> and some other studies is that rigorous theoretical development is conducted. This can be nontrivial, as estimation needs to accommodate the partial correctness and incorrectness of existing information. Other methodological advancements, for example, the adoption of a different base penalty, are described below. Under the Bayesian paradigm, functional clustering using external information-such as publication results and published data-as prior has been developed <ref type="bibr">(Biau et al. (2017)</ref>, <ref type="bibr">Isci et al. (2014)</ref>, <ref type="bibr">Ray and Mallick (2006)</ref>). The prior information and estimation strategies adopted in these studies are significantly different from those in this study.</p><p>Overall, the goal of this study is to conduct the clustering analysis of disease prevalence trends, with the assistance of information contained in published literature. This study advances from the existing ones in the following important aspects. First, the analysis scheme is significantly different and novel. More specifically, it differs from that limited to the prevalence of a single disease or a small number of diseases by conducting pan-disease analysis. It also differs from the HDN, pHDN, and pan-disease clinical treatment studies by analyzing prevalence. Second, the analysis technique significantly differs from the existing ones. The proposed approach is based on penalization fusion, which is relatively more recent and has notable advantages over many other functional clustering techniques. For example, it combines estimation and clustering and can conveniently determine the number of clusters. In principle, it can accommodate clusters as small as size one. Third, as described above, this study significantly advances from its closest competitor in Jadhav et al. ( <ref type="formula">2021</ref>) both methodologically and statistically. Last but not least, it delivers a new way of extracting useful information from the NHIRD and other medical claims (record) databases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data.</head><p>Two sources of data (Figure <ref type="figure">3</ref>) are collectively analyzed. The first comes from the medical claims database and generates the prevalence values. The second comes from text mining and provides information on disease interconnections reported in the literature.</p><p>2.1. NHI data. In Taiwan basic health insurance coverage is provided by NHI, which was launched on March 1, 1995. By the end of 2014, almost 99.9% of the Taiwan population were covered. As NHI is also convenient and as uninsured and commercially insured healthcare is expensive, almost all hospital/clinic-based disease treatments have been going through NHI. In 2000, Taiwan established NHIRD, which contains detailed information on diagnosis, treatment, and outcome. We refer to the literature <ref type="bibr">(Hsieh et al. (2019)</ref>) for more information on NHI and NHIRD. The NHI data has multiple unique characteristics: almost the whole population is covered; comprehensive information is available on all inpatient and outpatient treatment episodes; all data have been collected and stored under the same protocol, and extensive data processing has been conducted by NHIRD staff.</p><p>Data on one million randomly selected subjects is first extracted from NHIRD for the period of 2000-2013. Disease diagnosis and hence period prevalence information is collected from both outpatient and inpatient treatments, using the NHIRD CD (ambulatory care expenditure by visits) and DD (inpatient expenditure by admissions) files. For disease definition the ICD-9-CM code version 1992 is converted into the 2001 version. Records with the ICD-9-CM codes E and V (external causes of injury and supplemental classification), 630-679 (pregnancy, childbirth, and puerperium complications), and 760-999 (symptoms, signs, and ill-defined conditions) are excluded from analysis. The resulting dataset contains records on 986,650 patients, and disease period prevalence values are computed based on 173,355,725 outpatient and 1,381,749 inpatient treatment episodes. To avoid sparse data caused by too many diseases under ICD-9-CM, we apply the Phenome-Wide Association Study (PheWAS) vocabulary approach and group the 14,000 ICD-9-CM diseases into 1723 disease phecodes. We further select diseases based on the following considerations. We first identify diseases that have high prevalence and/or high mortality, such as diseases of the circulatory system, certain cancers, diseases of the respiratory system, and others. We then also consider diseases that have high clinical significance (e.g., those with long inpatient length of stay, no effective treatment or clear causes), such as rare cancers, certain diseases of the blood and blood-forming organs, acquired coagulation factor deficiency, and others. Overall, 405 diseases are included in analysis (detailed information provided in the Supplementary Material <ref type="bibr">(Ma et al. (2024)</ref>)), and it is noted that this number is considerably larger than in many peer studies. These diseases have different types of trends (Figure <ref type="figure">4</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Information extraction.</head><p>There are multiple ways of defining and extracting existing information. The pan-disease perspective, variable of interest, and study population make our analysis unique. As such, it is unlikely the desired information can be obtained from a single or a few publications. We take a broad search strategy via text mining. Specifically, we adopt PubMatrix (pubmatrix.grc.nia.nih.gov), a web-based text mining tool tailored to PubMed, with the understanding that alternative tools such as VxInsight, MedMiner, and UALCAN may also be applicable. PubMatrix searches PubMed and returns the cooccurrence frequency of (i.e., number of publications that simultaneously contain) any pair of two keywords (e.g., "type 2 diabetes" and "melanoma"). We refer to <ref type="bibr">Becker et al., 2003</ref> and many publications that adopt this tool for details on how it functions. When considering all possible disease pairs, we can obtain results as exemplified in Figure <ref type="figure">2</ref>. More detailed results are presented in the Supplementary Material S5. Here we note that this information extraction may be coarse. For example, different publications may call the same disease differently, and PubMatrix is not able to detect that. On the other hand, it is noted that the ICD-based disease phecode approach is standard. Our data analysis results suggest that a large number of publications can be identified using the PubMatrix-based text mining. It is possible to apply more advanced tools, which may be more complex, to refine the information; this may lead to improved estimation. As the proposed approach can accommodate partially correct information, this is not pursued. It is also noted that there are alternative ways of defining/extracting existing information on disease interconnections, for example, based on keywords of professional books, information related to clinical complications, cross-sectional epidemiological data, and so on. In Figure <ref type="figure">S2</ref> (Supplementary Material), we also present the existing information based on the calculated disease prevalence correlation coefficients, where we see some similarities but also significant differences. The proposed information extraction can have notable advantages: it is based on a huge number of published studies and likely to be comprehensive and "less biased," and it can avoid the "using the same data twice" problem. We note that, in the literature, there is no optimal way of information extraction. As this is not our focus, we do not examine further.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods.</head><p>Denote n as the number of diseases (sample size), T as the number of observations per disease, y i (t j ) as the prevalence value of disease i at the j th time point, and Y i = (y i (t 1 ), . . . , y i (t T )) as the vector of observed prevalence values for disease i. Consider the model (1)</p><p>where t j &#8712; T , a i 's represent the average levels, f i (t)'s are unknown smooth functions of t (with proper mean constraints for identifiability), and &#949; ij 's are random errors. Let f 0 i denote the true value of f i . Assume that all curves can be classified into K clusters G 1 , . . . , G K , and curves i and j belong to the same cluster if and only if</p><p>The goal is to simultaneously recover the structure of G and estimate the unknown f i 's. In what follows, we assume that normalization has been properly conducted, and a i 's are omitted.</p><p>3.1. Penalized fusion. We first briefly describe the penalized fusion analysis, as conducted in Jadhav et al. ( <ref type="formula">2021</ref>), which is a building block of the proposed analysis. Consider the basis expansion</p><p>where x 1 (t), x 2 (t), . . . , x p (t) are known basis functions and &#946; il 's are unknown regression coefficients. There have been extensive developments on choosing the form, number, and constraints of the basis functions <ref type="bibr">(Schumaker (2007)</ref>), which are also applicable to this study. We adopt the B-spline basis in all our simulation and data analysis.</p><p>,</p><p>where Y = (Y 1 , . . . , Y n ) and &#8226; is the 2 norm. We note that including covariance can in theory improve efficiency. However, our exploration suggests that, with penalization, the covariance estimates may be unsatisfactory, leading to overall inferior performance. As shown below, this loss function can lead to consistent clustering and estimation.</p><p>Consider the penalization fusion approach, which is abbreviated as "Fusion" in our numerical study. The penalized loss function is</p><p>where p(&#8226;, &#955;) is a concave penalty with a data-dependent tuning parameter &#955; &gt; 0. In our numerical study, we adopt the minimax concave penalty (MCP, Zhang, 2010), defined by p(t, &#955;) = &#955; |t| 0 (1x/(a&#955;)) + dx, and note that some other penalties are also applicable. Here (x) + = xI (x &gt; 0), and a is the regularization parameter. It is noted that Jadhav et al. ( <ref type="formula">2021</ref>) adopts Lasso, which may have inferior properties compared to MCP. Denote the minimizer of (3) as &#946;&#955; = ( &#946;&#955; 1 , . . . , &#946;&#955; n ) . Diseases i and j are concluded as in the same cluster if and only if &#946;&#955; i = &#946;&#955; j . We refer to <ref type="bibr">Ma and Huang (2017)</ref> and follow-up studies for developments on penalized fusion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.2.</head><p>A new approach to incorporate existing information. We propose a two-step approach to incorporate existing information on the interconnections among diseases:</p><p>In Step 1, consider the objective function</p><p>where &#951; &gt; 0 is a tuning parameter and W = (w ij ) n&#215;n is the weight matrix that describes existing information. In our numerical analysis, we set w ij = log(1 + count ij ), where count ij is the number of publications that simultaneously include diseases i and j . Denote the minimizer of (4) for disease i as &#946;p i , compute the predicted value as &#374; p i = X &#946;p i , and denote</p><p>. This is a weighted penalized estimation. In (4) if two diseases have more evidence of being interconnected, they are encouraged to have similar estimates. Similar strategies have been developed in the literature, although under significantly different contexts. As the goal of this step is not to generate clustering, ridge-type penalization is imposed, which is computationally much simpler than penalized fusion. As can be seen from Figure <ref type="figure">2</ref>, the distribution of the number of publications is quite skewed, which can be caused by research selection/publication bias, as opposed to the true amount of evidence. With this consideration and also to stabilize estimation, the logarithm transformation is taken.</p><p>In Step 2, we propose the objective function</p><p>where 0 &#8804; &#964; &#8804; 1 is a tuning parameter. &#955; has the same implication as in the standard penalized fusion. Simple derivation shows that the first two terms of ( <ref type="formula">5</ref>) are equal to Q(&#946;; &#7928; , X) plus a term that does not depend on &#946;, where</p><p>. With a slight abuse of notation, denote the minimizer of (5) as &#946;. The clustering structure can be fully obtained by examining &#946;.</p><p>Objective function ( <ref type="formula">5</ref>) has a very lucid interpretation. The loss balances between what is obtained from the data (the first term) and its counterpart if the existing information is credible (the second term). &#964; is introduced to balance between these two terms. Intuitively, if the information is of low quality, then with &#964; &#8594; 0, and the proposed analysis can reduce to that based on the observed data only. On the other hand, &#964; &#8594; 1 leads to the analysis heavily relying on the existing information. Assisted by this tuning, the proposed approach can flexibly accommodate a varying quality of existing information. For sparse linear regression, a related information-incorporation strategy has been developed by <ref type="bibr">Jiang, He and Zhang (2016)</ref> from which this study advances by analyzing prevalence trends, conducting functional clustering, and extracting information in a different way.</p><p>In the proposed analysis, Jiang, He and Zhang (2016) and some others, the same data is analyzed in both steps. Our numerical and theoretical developments below as well as those in the published studies suggest that this is valid and sensible. We conjecture that it is also possible to split data and use one half for each step. When the number of subjects (for generating the prevalence data) is large enough, the two approaches are expected to have minimum differences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Statistical properties.</head><p>Consider model y i (t j ) = f i (t j ) + &#949; ij for i = 1, . . . , n and j = 1, . . . , T . Let &#949; i = (&#949; i1 , . . . , &#949; iT ) . Without loss of generality, assume T = [0, 1]. With the rth order B-spline basis functions x 1 (t), . . . , x p (t), we have the approximation f i (t) &#8776; p l=1 &#946; il x l (t). Here p = m + r, and m is the number of interior knots satisfying 0 = &#954; 0 &lt; &#954; 1 &lt; &#8226; &#8226; &#8226; &lt; &#954; m &lt; &#954; m+1 = 1. By Corollary 6.21 of Schumaker ( <ref type="formula">2007</ref>) and (C1)-(C3) listed below, there exists a spline approximation</p><p>] is a qth order continuously differentiable function defined on [0, 1], and r &#8805; q.</p><p>(C2) Let &#948; = max 0&#8804;l&#8804;m (&#954; l+1&#954; l ). Assume that there exists a constant M &gt; 0 such that</p><p>), and m = o(T ).</p><p>(C3) For deterministic design points t i &#8712; [0, 1], i = 1, . . . , T , assume that there exists a distribution function G with a positive continuous design density g</p><p>(C5) &#961;(t) is a symmetric function of t and is nondecreasing and concave in t for t &#8712; [0, &#8734;) with a continuous derivative &#961; (t) on (0, &#8734;). In addition, &#961; (0+) is independent of &#955;. There exists a constant 0 &lt; a &lt; &#8734; such that &#961;(t) is a constant for all t &#8805; a&#955;.</p><p>(C1)-(C3) are standard assumptions for B-spline functions <ref type="bibr">(Zhou, Shen and Wolfe (1998)</ref>). Condition (C4) gives the boundedness condition for the error terms <ref type="bibr">(Chu, Li and Reimherr (2016)</ref>). Condition (C5) implies the choice of penalty functions and is common in the literature on high-dimensional variable selection <ref type="bibr">(Fan and Lv (2011)</ref>). Both MCP and SCAD satisfy this condition.</p><p>First, consider the oracle estimator &#946;or that incorporates the existing information,</p><p>where</p><p>Note that this term reflects the reliability of the existing information. THEOREM 3.1. Suppose that Conditions (C1)-(C4) hold. Then we have</p><p>where</p><p>The following theorem shows that the oracle estimator &#946;or is a strict local minimizer of the objective function with probability approaching one. <ref type="formula">1</ref>), where &#966; n and &#968; n are given in Theorem 3.1, then there exits a strict local minimizer &#946; of the objective function Q &#955;,&#964; (&#946;; Y , &#374; p , X) in (5) satisfying</p><p>The above theorems show that, under mild conditions (including that on the estimate generated in Step 1 and hence the existing information), the proposed approach has clustering and estimation consistency. It can "automatically" determine the number of clusters. In line with <ref type="bibr">Jiang, He and Zhang (2016)</ref>, the assumed conditions and theoretical results are more complicated with functional data. Proof is provided in the Supplementary Material S1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Computation.</head><p>In the Supplementary Material S2, we develop an effective ADMM (alternating direction method of multipliers)-based algorithm. Information on tuning parameter selection is also presented.</p><p>3.5. Simulation. In the Supplementary Material S3, we conduct comprehensive simulation, evaluate performance of the proposed approach, and compare against multiple alternatives. It is observed that, when the existing information has moderate to high quality, the proposed approach significantly outperforms with higher clustering accuracy. When the existing information is of low quality, the alternative approach that fully trusts such information (i.e., Step 1 of the proposed analysis) may have inferior performance, while the proposed approach, with its flexibility, can correct for it to a large extent.</p><p>4. Data analysis. The NHIRD disease prevalence data and existing information extracted from PubMed, as described in Section 2, are analyzed. With selected &#964; = 0.40, the proposed approach identifies 47 clusters with sizes at least 2. The nontrivial cluster sizes range from 2 to 28, with a median of 6. In addition, there are also eight diseases forming clusters with size 1. Detailed clustering information is provided in the Supplementary Material S5. For the nontrivial clusters, the unnormalized prevalence trends of their diseases are shown in Figure <ref type="figure">4</ref>, where different colors correspond to different prevalence levels.</p><p>In general, we observe increasing trends in clusters 1-20. However, different clusters have different increasing patterns. For example, cluster 19 has increasing rates higher than the other clusters. This cluster includes atopic/contact dermatitis due to other or unspecified causes. This is a relatively common skin condition that affects a large number of children and adults in industrialized countries. It is estimated that about 19.5% of the general population in North America and Western Europe are affected. However, data has been relatively lacking for Taiwan. Environmental factors, which are often time-dependent, play an important role in the development of atopic/contact dermatitis, and aeroallergens are a trigger for exacerbations. The deterioration of air quality and other environmental factors in Taiwan has been well noted, which can explain the fast increase. Other diseases also included in cluster 19 include acute sinusitis, type 2 diabetes, reflux esophagitis, and mixed hyperlipidemia, whose increases over time have been reported for Taiwan in the literature. Similar sensible findings are also made with the other clusters with increasing trends. Representative examples include periodontitis in cluster 18, dysthymic disorder in cluster 16 as well as malignant and unknown neoplasms of brain and nervous system in cluster 3. Clusters 21-24 contain 50 diseases within general decreasing trends. Among these diseases neoplasm of uncertain behavior of breast has the sharpest decreasing trend, followed by disorders of esophageal motility and umbilical hernia. Some of the observed decreases can be explained by the discovery of advanced treatments, computerized devices, and improvements in healthcare services. For example, the decrease in the prevalence of viral hepatitis, particularly in industrialized nations, can be attributable to the effort in hepatitis vaccination, screening of blood products, screening and postexposure prophylaxis of healthcare workers, and increased availability of safe injection materials. A total of 96 diseases with relatively flat prevalence and small fluctuations are included in clusters 25-30. Their incidence may not be strongly influenced by time-dependent risk factors and prevention/diagnosis/treatment interventions. A total of 36 diseases in clusters 31-36 in general show reverse "V" shapes. Their prevalence values first rise to peaks and then decrease. The causes of such shapes have also been provided in the literature. Consider, for example, HPV. Invasive cervical cancer is one of the leading causes of cancer-related death among women. HPV vaccines were licensed in 2006 and then became widely available in many counties/regions including Taiwan. As such, a decrease in the prevalence of HPV infection and/or cervical cancer is sensible after the broad availability of vaccination. Diseases in clusters 37-38 have similar trends: their prevalence values first decrease and then stay relatively flat. In contrast, the prevalence values of diseases in clusters 39-40 first increase and then stay relatively flat. A total of 21 diseases with "irregular" trends are included in clusters 41-46. The "irregularity" can be caused by disease outbreaks, changes in prevention and control measures, and interference with other related factors. The eight diseases forming their individual clusters are noninfectious gastroenteritis, gingivitis, essential hypertension, dental caries, acute gastritis, asthma with exacerbation, influenza, acute upper respiratory infections of multiple or unspecified sites. Most of these diseases are common, such as acute upper respiratory infections of multiple or unspecified sites, dental caries, and gingivitis. Their high prevalence values may "amplify" variations, making them difficult to be clustered together with other diseases.</p><p>Although the above examination of individual prevalence trends for a large number of diseases is of interest, the key advancement of this study lies in the clustering analysis of diseases. A closer examination of the clustering results suggests their sensibility, with many of the interconnections reported in the literature (although in a very scattered manner). For example, it has been suggested that, because of shared genetic and other risk factors, the trends of the following diseases tend to be similar: HIV infection and hemangioma and lymphangioma (any site), which are coclustered in cluster 10 <ref type="bibr">(Wiegand et al. (2008)</ref>), and renal failure NOS, thrombocytopenia, hypertensive heart and/or renal disease, and nephrotic syndrome without mention of glomerulonephritis, which are coclustered in cluster 13 <ref type="bibr">(Kressel et al. (1981)</ref>). Another sensible finding is that disorders of function of stomach, gastritis and duodenitis, and peptic ulcer (excluding esophageal) are coclustered in cluster 44. It has been suggested by <ref type="bibr">Sipponen and Hyv&#228;rinen (1993)</ref> that the pathogenesis of peptic ulcer and gastric cancer is closely associated with H. pylori gastritis and its subsequent atrophic sequelae (atrophic gastritis). Cellulitis is a spreading bacterial infection of the skin and tissues immediately beneath the skin <ref type="bibr">(Gabillot-Carr&#233; and Roujeau (2007)</ref>). With this cause other local infections of skin and subcutaneous tissue and cellulitis and abscess of arm/hand are clustered together in cluster 45. Acute bronchitis and bronchiolitis and acute pharyngitis are the only two diseases in cluster 36, and they have very similar prevalence trend patterns. This is because acute bronchitis and bronchiolitis often develops from other upper respiratory tract infections, such as acute pharyngitis. Beyond those with strong support from the literature, there are also new disease coclusterings that have not been suggested in the literature. For example, it is found that schizoid personality disorder and dissociative disorder share similar patterns and are coclustered in cluster 8, although published studies suggest that their interconnections are inconclusive <ref type="bibr">(Modestin, Hermann and Endrass (2007)</ref>). Another example is that cancer of lip and allergic purpura, which are coclustered in cluster 1, have similar prevalence patterns but do not have support from published literature. It is noted that the prevalence values are computed based on a large number of individuals. Their credibility is expected to be high, and as such, the observed similarity in prevalence patterns is expected to be true. The above new findings on coclusterings and those alike suggest new directions for identifying interconnections underneath diseases.</p><p>Data is also analyzed using the following alternatives, which are also considered in simulation and described in more detail in the Supplementary Material S3. [OLS] Each prevalence curve is first estimated separately. Then diseases i and j are clustered together if &#946;i&#946;j &#8804; &#954;. Here &#954; is determined in a similar way as for the proposed approach (Supplementary Material S2). <ref type="bibr">[kmeans]</ref> This approach first fits each prevalence curve separately. Then the vectors of regression coefficients are clustered using the kmeans approach. <ref type="bibr">[distK]</ref> This approach first computes distance correlation between the observed prevalence values and then use the Kmeans method to generate clusters. It is implemented using the R functions dcor and kmeans.</p><p>[funFEM] This method is based on mixture modeling and takes functions as input. As such, we first implement a smoothing method to obtain functions passing through the observed discretized points. This method is implemented using the R package funFEM.</p><p>[FClust] The FClust approach is implemented using the R package fdapace and conducts functional clustering and identification of data substructures for longitudinal and other functional data. [funHDDC] The funHDDC method is implemented using the R package funHDDC and conducts model-based clustering and identification of functional subspaces.</p><p>[waveclust] This method is based on a wavelet decomposition of signal and a mixture model that integrates random effects and implemented with the R package curvclust.</p><p>[fitfclust] This is a functional clustering method with special attention to sparsely sampled data and available through the R package funcy.</p><p>[Fusion] This is the approach described in Section 3.1. <ref type="bibr">[Prior]</ref> This approach generates estimates as described in Step 1 of the proposed approach and then conducts clustering in the same way as the OLS approach.</p><p>The clustering results using the alternatives are presented in Figures S3-S12 (Supplementary Material S4) and the Supplementary Material S5. The resulted numbers of clusters are 22 (OLS), 10 (kmeans), 10 (distK), two (funFEM), 10 (Fclust), three (funHDDC), 10 (waveclust), 10 (fitclust), 35 (Fusion), and 17 (prior). In Table <ref type="table">S5</ref> (Supplementary Material S4), we present the discrepancy (which is the normalized "clustering error") between any two methods and observe small to large discrepancy values. Here we note that, with a large number of disease pairs, a small discrepancy value can correspond to notable differences in clustering, which can be partly reflected in the number of clusters and numbers of diseases in clusters. With respect to the proposed approach, the Fusion approach leads to the most similar findings (with a discrepancy value of 0.06), and the funHDDC approach leads to the most discrepant findings (with a discrepancy value of 0.9). This is reasonable as the analysis frameworks of the proposed and Fusion approaches are closest, while funHDDC has a highly different framework.</p><p>We take a closer look at Fusion. The moderate &#964; value and relatively small discrepancy value seem to suggest a small impact of the existing information. However, a closer look suggests that Fusion and the proposed approach have significant differences in clustering structures (e.g., number of clusters, number of diseases in individual clusters, and disease memberships). A representative example is in Figure <ref type="figure">S13</ref> (Supplementary Material S4). Specifically, under the Fusion approach, eight diseases are clustered together in cluster 7. Incorporating the existing information, they belong to four different clusters-along with other diseasesunder the proposed approach (detailed information available from the authors). Another example is that the proposed approach clusters malignant and unknown neoplasms of brain and nervous system with cancer of brain in cluster 3. However, under Fusion, they belong to different clusters. One more example of a similar kind is reticulosarcoma and benign neoplasm of adrenal gland, which are coclustered in cluster 29. This interconnection can be partly explained by p53 tumor-suppressor gene and Li-Fraumeni syndrome. Germline mutations in p53 have been identified in families with the Li-Fraumeni syndrome, a rare familial cancer syndrome characterized by an unusually high incidence of multiple cancers such as sarcomas, adrenocortical carcinomas, and other diverse neoplasms. Families with Li-Fraumeni syndrome have been described as including a proband with a sarcoma diagnosed early in life <ref type="bibr">(Malkin et al. (1992)</ref>).</p><p>To gain further insights, we also conduct evaluation. First, we use the first 13 observations of all diseases for clustering analysis and estimation. Then, based on the models, the last observation of each disease is predicted. The mean squared error values are 1.29 &#215; 10 -5 (OLS), 8.32 &#215; 10 -6 (Fusion), 1.12 &#215; 10 -5 (Prior), and 7.28 &#215; 10 -6 (proposed), respectively. Here we note that the other alternatives do not generate explicit regression models and hence cannot be directly used for prediction. This evaluation demonstrates that, as it generates models, the proposed analysis can also be used for prediction. As it is not the focus of this analysis, we do not further pursue this aspect. In addition, we conduct an evaluation of the stability of clustering. Specifically, we randomly select 2/3 of the diseases and apply the proposed and alternative approaches. We then compare the clustering with the randomly sampled data against that using the whole data and compute the discrepancy value. Note that this calculation is limited to diseases that are selected. With 500 resamplings the resulted stability measure (1-discrepancy) values are 0.966 (OLS), 0.973 (Fusion), 0.962 (Prior), and 0.977 (proposed), respectively. Overall, the proposed approach has competitive prediction and stability performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion.</head><p>In this study we have conducted the functional clustering analysis of disease prevalence trends, taking advantage of the uniquely valuable NHI data. Similar analysis schemes have been limitedly pursued in the literature, and as discussed above, the findings can have high practical value. Methodologically, this study has significantly advanced from <ref type="bibr">Jadhav et al. (2021)</ref> and other functional clustering analyses by flexibly incorporating existing information obtained from mining a large number of PubMed publications. The proposed approach is intuitive and will be directly applicable to some other types of existing information. Also different from Jadhav et al. ( <ref type="formula">2021</ref>), the MCP penalization, which has been shown as more effective than Lasso, is adopted. Significant theoretical development has been conducted, which is highly nontrivial with the complex data structure, penalized fusion estimation, and existing information that is not guaranteed to be accurate. Simulation shows that when existing information has reasonable quality, the proposed approach has superior performance. When existing information is of very low quality, which is highly unlikely in practice, the proposed approach, with its great flexibility, can still have competitive performance.</p><p>In the analysis of NHI data, clustering results different from the alternatives have been generated. It is also found that incorporating the existing information has a moderate impact on the results. For a few cases, for example, the one related to HPV, we have identified solid evidence from the literature to support our findings. Here we note that although for some diseases published studies have suggested their similarity in prevalence trends and underneath interconnections, this study is among the first to systematically do so at the pan-disease level. It is also noted that a drawback of pan-disease analysis is that, with a huge number of disease pairs, it is impossible to examine the underlying causes one by one. Rather our findings can serve as the ground stone for researchers interested in individual diseases. Although with many advantages, the NHI data also has limitations. In particular, the covered population is dominatingly Chinese. In addition, with data access limitation, more recent data is not available. With the dependence of disease prevalence on population and time, the broader applicability of our findings may warrant additional scrutinization.</p><p>Software and data. The R program implementing the proposed method is available at www.github.com/shuanggema. The data that support the findings in this paper are obtained from the National Health Insurance Research Database at <ref type="url">https://nhird.nhri.org.tw/en/</ref>.</p></div></body>
		</text>
</TEI>
