<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Topological Risk-Landscape in Metric-Free Categorical Database</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10555974</idno>
					<idno type="doi">10.1109/ACCESS.2024.3398416</idno>
					<title level='j'>IEEE Access</title>
<idno>2169-3536</idno>
<biblScope unit="volume">12</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Hsieh Fushing</author><author>Hong-Wei Kao</author><author>Elizabeth P Chou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The Entropy-based Categorical Exploratory Data Analysis (CEDA) paradigm is elaborately refined to algorithmically explore the intricate high-order directional associative relational patterns within the heterogeneous chronic disease dynamics captured by the Behavioral Risk Factor Surveillance System (BRFSS) database. Operating on this imbalanced categorical dataset represented fully by its metric-free high-dimensional histogram, our algorithms conduct data-driven computations to investigate chronic disease mechanisms across four sub-populations along the age-axis, culminating in comprehensive systemic understandings. Upon this categorical data-world, CEDA first recognizes the category-oriented 1D histogram as the simplest form of a piece of explainable information. Then, utilizing Kolmogorov's randomness-proper-based reliability check, CEDA identifies and confirms collectives of 1D histograms as major feature-categories of varying orders within each sub-population. These confirmed major feature-categories' binary memberships are then arranged into a subject-vs-feature-category bipartite network heatmap, revealing serial horizontal and vertical blocks framed by clusters of similar subjects characterized by individual-risk-landscapes (IRL) against clusters of structurally dependent major feature-categories. Based on such block-series, sub-population-specific disease mechanisms emerge as collective high-order interacting effects, elucidating directional associative relationships from study subjects' topological neighborhoods to response-categories. Notably, the topological individual-risk-landscape offers profound insights into complex system dynamics and simultaneously exposes atypical subjects as explainable errors across all Machine Learning classifiers. INDEX TERMS Behavioral risk factor surveillance system (BRFSS), bipartite network heatmap, categorical exploratory data analysis (CEDA), complex system, conditional entropy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>The US Centers for Disease Control and Prevention (CDC) conducts an annual phone survey with over 400K participants to construct a yearly Behavioral Risk Factor Surveillance System (BRFSS) database. Since 1984, the primary goal of this database has been to understand the dynamic and evolving linkages between multiple chronic diseases and their potential risk factors across the 50 states of US society over many years <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>. Each yearly BRFSS database, by design, encompasses all associative relations between multiple chronic diseases and many behavioral risk factors, referred to as feature-variables here, to sustain a complex system dynamics of chronic disease in American society for the year <ref type="bibr">[4]</ref>. After 40 years, this yearly BRFSS complex system dynamics is, by and large, still a mystery, not to mention its evolution along the year-axis.</p><p>The associate editor coordinating the review of this manuscript and approving it for publication was Xianzhi Wang.</p><p>Could such a complex system dynamics of BRFSS be computationally extractable and explicitly displayable? To our limited knowledge, due to the seemingly boundless complexity and scope of such a system, comprehensive studies addressing this question have scarcely been conducted or rigorously reported in the literature. Nevertheless, positive and practical answers to this question are critical not only for the US but also for the many countries that have developed similar surveillance systems, as seen on the BRFSS website (<ref type="url">https://www.cdc.gov/brfss/index.html</ref>).</p><p>In fact, this question holds a significant degree of universality across all sciences, and the potential impacts of its answers can extend far beyond the realm of the BRFSS. 
Why have there been hardly any comprehensive studies aimed at making complex system dynamics readable and understandable? We are confident that the cause can be partly attributed to a fundamental barrier within data analysis. When handling a large system, data analysts encounter barriers stemming from two kinds of complexity embedded within data <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>. The first kind of complexity is heterogeneity: a large system contains many heterogeneous local mechanisms. As described in <ref type="bibr">[7]</ref>, heterogeneity is expected to be observed through broken-symmetry patterns across different scales and localities within almost all large complex systems. This characteristic is certainly not limited to large physical or chemical systems. Indeed, the BRFSS has been shown to embrace heterogeneity characterized by (GenHL, Age) from the perspective of Heart Disease (HD) dynamics <ref type="bibr">[8]</ref>, where GenHL stands for the feature-variable called ''general health.''</p><p>The second kind of complexity pertains to the information content contained in large databases, which goes far beyond data visualization per se, since the full information content in data (ICiD) includes all patterns of a relational nature. Discovering such relational patterns, especially high-order ones, requires a wide spectrum of genuine creativity and exploratory computing efforts. The information content channeled through high-order relational patterns is of particular scientific importance and practical interest because such directional associative relations ''from a covariate feature-set toward another response feature-set'' offer a unique window into essential mechanisms at a locality. However, such information has hardly ever been explored or even considered in data analysis because its functional form is unknown, which makes the ideas and practices of modeling unrealistic and incorrect. 
As a result, real-world associative relations of high orders are, in general, completely unknown even to domain scientists. This fact provides the brief background of this information complexity.</p><p>This information complexity is particularly evident in the BRFSS database because all its variables, from disease statuses to behavioral and demographic risk factors, are either categorical or categorized along certain axes. Thus, each yearly BRFSS database is completely represented by its high-dimensional histogram. Without losing any bit of information, this histogram forms a categorical data world of its own. Thus, the database's ICiD is perceived as consisting of all yet-to-be-discovered relational patterns of a wide spectrum of orders. In this paper, we delve deep into BRFSS's categorical data world and set as our primary goal the graphic display of BRFSS's computable and extractable information complexity. Conversely, this concrete and visible histogram would easily render man-made structures and assumptions as foreign objects. This is why almost all modeling-based results are likely unauthentic and obviously unscientific in this categorical data world.</p><p>The remainder of this Introduction section is dedicated to data descriptions, CEDA-based computations, and chief results for takeaways, in three subsections, respectively. Subsection-A provides a detailed description of the Kaggle version of the BRFSS database, which serves two roles in this paper: as an illustrative example of our algorithmic CEDA computing and simultaneously as the database of scientific interest. Subsection-B explains how the computational CEDA paradigm generates the new concept of individual risk-landscape and its implications. We briefly outline the major results achieved in this paper in Subsection-C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. KAGGLE VERSION OF 2015 BRFSS DATABASE</head><p>Each yearly BRFSS database is complicated by the prevalence and various kinds of no-responses. Across more than one hundred questions in the survey, the no-response rates vary greatly, and the types of no-responses are diverse. However, it is important to note that ''no-response'' does not equate to ''no information,'' as some subjects may choose not to answer certain questions due to sensitivity. Moreover, since many questions are highly related, such no-response data types persistently pose many difficulties and challenges when analyzing BRFSS databases.</p><p>Such difficulties and challenges pertaining to the specific 2015 BRFSS are avoided in its Kaggle version. It cleans out the majority of missing or no-response data points and significantly reduces the number of feature-variables, making it a popular database in the Machine Learning literature. This Kaggle version of the 2015 BRFSS database consists of more than 250K subjects and 21 selected feature-variables. All 21 feature-variables, including several chronic diseases such as heart disease (HD), stroke (STK), and diabetes, among others, are categorical with symbolic codes. In this paper, as detailed below, some demographic and health variables of this Kaggle version, such as Age, Income, and Mental Health, are further regrouped to reduce their relatively large numbers of categories. It is essential to note that these symbolic codes bear no sense of metric. For instance, code 1 may represent a diseased status, while code 0 indicates non-disease. Some feature-variables do bear ordinal senses among numerical codes. For example, five categories of both Age and GenHL are coded 1 to 5, where subjects with GenHL = 5 have the worst condition. However, the degree of ''difference'' between GenHL = 5 and GenHL = 4 is not necessarily equal to the difference between GenHL = 4 and GenHL = 3. 
As such, this Kaggle version categorical data is metric-free in nature.</p><p>In this paper, we continue to adopt the bivariate (GenHL, Age) as the defining axis of heterogeneity identified in <ref type="bibr">[8]</ref>. Instead of focusing on one single chronic disease, here we designate the bivariate (Stroke (STK), Heart Disease (HD)) as the response (Re)-variable, with the remaining 17 one-dimensional feature-variables as covariate (Co)-variables.</p><p>The reason behind this choice of Y = (STK, HD) as the response variable is twofold. First, it better represents the real chronic disease dynamics of the complex system of interest than any single disease alone does. Secondly, it maintains a great degree of simplicity because all chronic diseases are structurally dependent. In this fashion, we utilize this Re-Co dynamics to represent the real chronic disease dynamics embraced by the 2015 BRFSS. Furthermore, all computational developments for Y = (STK, HD) can be easily expanded for any high-dimensional Y.</p><p>To study this Re-Co dynamics, the entire 250K subjects are subdivided into 24 sub-populations with respect to 24 categories of (GenHL, Age). Of the 25 possible category pairs, (GenHL, Age) = (5,2) is empty, which leaves 24. Each sub-population is postulated to embrace homogeneous disease mechanisms of (STK, HD). Therefore, the quest of analyzing the Kaggle version of the 2015 BRFSS database is transformed into two steps: first, exploring each sub-population's ICiD thoroughly to enable a graphic display of its disease mechanisms; second, linking all 24 locally computed and fully represented disease mechanisms into a global disease dynamics. For the sake of length, we only demonstrate linkages among the 4 sub-populations with GenHL = 5 and Age = 1, 3, 4, and 5. 
This synthesized disease dynamics of the poor health population along the Age-axis is of great scientific interest in its own right, while the full global disease dynamics is separately presented in a companion report.</p><p>The signature ''imbalance phenomenon'' of the BRFSS is retained in the Kaggle version as well. Here, the ''imbalance phenomenon'' refers to the highly uneven sample sizes among response-categories. Such a phenomenon is observed with significant unevenness of sample sizes across all 24 sub-populations. Specifically, the non-diseased category of (STK, HD) = (0,0) typically has a sample size many times that of the three diseased categories (0,1), (1,0), (1,1) combined. This phenomenon is noteworthy because of its linkages to two technical fronts.</p><p>The first front pertains to recognizing why the marginal information of a feature-variable concerning a variable or a set of variables is imprecise and confusing. The second front is that this phenomenon has widely been regarded as the underlying cause of failures of many classifiers, such as various variants of Random Forest and Boosting, in the Statistics and Machine Learning (ML) literature. This phenomenon is even considered ''intrinsic'' <ref type="bibr">[13]</ref>. Numerous remedial approaches have also been proposed without guaranteed success <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>. However, it is counterintuitive that an observed pattern of sample sizes of response-categories could become an intrinsic barrier hindering all classifiers. It is equally counterintuitive to develop sampling schemes on observed data to improve the performance of classifiers per se without regard to the potential consequences of distorting ICiD. 
These two technical fronts are explicitly addressed in this paper, and their resolutions are outlined in the next two subsections, with further details provided in Section V. They indeed serve as two signatures of this paper.</p><p>At the end of this subsection, we describe our coding schemes for regrouping the following 5 variables of the Kaggle version of the dataset. 1 to 9; Physhlth-3: 10 to 29; Physhlth-4: all 30 days. This regrouping scheme is necessary for computations conducted within sub-populations defined by (GenHL, Age). By so doing, the 24 sub-populations have sizes of around several thousand. For computational simplicity, all subjects with missing values or no-responses within the Kaggle version are further excluded in this paper.</p></div>
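<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sub-population split and the ''imbalance phenomenon'' described in this subsection can be sketched in a few lines. The following is only a minimal illustration on toy records, not the authors' code: the dictionary keys GenHL, Age, STK, and HD and the helper names split_subpopulations, response_counts, and imbalance_ratio are assumptions chosen for readability and are not taken from the Kaggle file.</p>
<eg><![CDATA[
```python
from collections import Counter, defaultdict

# The four response-categories of Y = (STK, HD).
RESPONSE_CATEGORIES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def split_subpopulations(records):
    """Group subjects by their (GenHL, Age) category pair."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["GenHL"], rec["Age"])].append(rec)
    return dict(groups)


def response_counts(group):
    """Count the subjects in each (STK, HD) response-category."""
    counts = Counter((rec["STK"], rec["HD"]) for rec in group)
    return {cat: counts.get(cat, 0) for cat in RESPONSE_CATEGORIES}


def imbalance_ratio(counts):
    """Size of the non-diseased category relative to the diseased ones combined."""
    diseased = sum(n for cat, n in counts.items() if cat != (0, 0))
    return counts[(0, 0)] / diseased if diseased else float("inf")


# Toy records (hypothetical; not BRFSS data).
records = (
    [{"GenHL": 5, "Age": 1, "STK": 0, "HD": 0}] * 8
    + [{"GenHL": 5, "Age": 1, "STK": 0, "HD": 1}]
    + [{"GenHL": 5, "Age": 1, "STK": 1, "HD": 1}]
    + [{"GenHL": 1, "Age": 3, "STK": 0, "HD": 0}] * 5
)
groups = split_subpopulations(records)
counts = response_counts(groups[(5, 1)])
print(counts, imbalance_ratio(counts))
```
]]></eg>
<p>In this toy sub-population (GenHL, Age) = (5,1), the non-diseased category is four times as large as the three diseased categories combined, which mimics on a small scale the unevenness discussed above.</p></div>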
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. CEDA COMPUTING PARADIGM AND INDIVIDUAL RISK-LANDSCAPE</head><p>As mentioned earlier, the 21-dimensional histogram constructed from the Kaggle version of the BRFSS database as one whole is metric-free. Within this categorical data world, arithmetic operations and functional forms are meaningless. In this paper, we adapt the data-driven bottom-up computational paradigm, called Categorical Exploratory Data Analysis (CEDA), to effectively explore high-order associative relations that constitute and reveal the database's information complexity. These explorations are necessary due to our recognition of the 1D histogram as the simplest form of ''a piece of explainable information'' and of Kolmogorov's randomness-proper on any contingency table. Throughout this paper, a contingency table is typically constructed by arranging all categories of a covariate feature-set along the row-axis and the four categories of (STK, HD) along the column-axis. Any such contingency table is simply a projection of the 21-dimensional histogram of the whole data set.</p><p>Information extraction from a contingency table is carried out by comparing one row-vector, representing a 1D histogram of the conditional variable of Y given a covariate feature-category, with the column-sum vector, representing a 1D histogram or marginal distribution of Y. In a step-by-step manner, each individual comparison yields one piece of information regarding one aspect of the interacting effects of the covariate feature-set. Then, the collective comparisons reveal glimpses of potential interacting effects of the covariate feature-set on the response-variable Y = (STK, HD). No functional forms of interacting effects are needed in such comparisons.</p><p>In general, scientists are not capable of fully prescribing interacting effects because diverse asymmetric relational patterns are possible among all involved categories. 
Therefore, it becomes not only necessary but also critical to be able to demonstrate and confirm all individual category-specific interacting effects. From this perspective of interacting effects, three standpoints of our elaborately refined CEDA differ evidently from the original CEDA algorithms developed in a series of previous works <ref type="bibr">[8]</ref>, <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>.</p><p>The first standpoint is that a feature-set's category-specific effect takes the central role, not its marginal effect, which is calculated via a weighted sum scheme. That is, a category-specific effect is demonstrated by comparing the conditional entropy of Y given the corresponding covariate category with the entropy of Y without covariate information of any sort. This is a distinctive standpoint taken in this paper.</p><p>The second standpoint is that this comparison must be conducted under equal ''randomness'' footings. This is where Kolmogorov's randomness-proper comes in to play its essential role through a contingency table platform <ref type="bibr">[9]</ref>. Here, two versions of Kolmogorov's randomness-proper are respectively seen through the following two constructed ensembles: 1) an ensemble of mimicries of the observed contingency table, which share the same randomness embraced by the observed table; 2) another ensemble of simulated contingency tables that retain only the randomness embraced by the observed row-sum vector. Both ensembles are commonly subject to the column-sum vector, which represents the fixed sample sizes of the four categories of Y. The conceptual difference between these two ensembles rests on the fact that the first ensemble genuinely reflects the data's intrinsic randomness, while the second ensemble embraces the hypothetical randomness as if the covariate feature of the row-axis were independent of Y. 
The first ensemble gives rise to an alternative entropy distribution, while the second ensemble gives rise to a null entropy distribution. Thus, the aforementioned comparison is carried out by comparing alternative-vs-null entropy distributions, resulting in the minimum sum of Type-I and Type-II errors or the two distributions' overlapping area.</p><p>This comparison lies at the heart of this refined CEDA paradigm. Such a comparison based on Kolmogorov's randomness-proper is applied twice to select and confirm a major feature-category, instead of a major feature-variable. Its first application is to a major feature-category candidate of a given order, which is equal to the size of the covariate feature-set. This application involves all samples belonging to the sub-population. Its second application is necessarily performed when the order of the potential major feature-category candidate is larger than one, since we need to make sure that this candidate is not redundant with respect to an already confirmed major feature-category of lower order. That is, a major feature-category of high order must provide extra-information (Extra-Info) on top of what a confirmed major feature-category of lower order can provide. Hence, this application involves only the samples constituting this confirmed major feature-category of lower order. Subsequently, we build a graphic display based on a collection of selected high-order major feature-categories, which becomes the chief part of the ICiD within each sub-population.</p><p>The aforementioned feature-category based computational operations are newly developed here, in contrast with the original version, which selects major feature-variables based on marginal mutual information calculations. Such differences are especially evident and crucial when the database is subject to a high degree of ''imbalance''. 
Realistically speaking, the majority of real-world databases retain varying degrees of ''imbalance''.</p><p>The third standpoint relies on the capability of representing computed and confirmed major feature-categories through a graphic display in this new version of CEDA. As each selected major feature-category of any order has its own memberships due to its locality, each subject is described by a binary vector indicating its presence or absence with respect to all selected major feature-categories. This subject-specific binary vector sheds light on the positive and negative disease risks facing this subject. From this aspect, we term this binary vector of memberships of all selected major feature-categories the subject's individual risk-landscape (IRL).</p><p>Furthermore, based on the collective individual risk-landscapes, two significant sub-population specific characteristics can be derived. First, a topology is defined on the study-subject space with a natural choice of dissimilarity or similarity measure. The neighborhood system offered by this topological space explicitly reveals which subjects are close to which subjects, and which are far away from others. A graphic display of the entire topological space pertaining to high-order major feature-categories is, in general, very informative regarding sub-population specific chronic disease dynamics and beyond, as will be clearly seen in Section V. For instance, this topological characteristic among subjects can serve as a critical basis for matching in causality studies and for optimal selection of the highest or lowest risk subject-groups.</p><p>The second significant sub-population specific characteristic is that each cluster of subjects' individual risk-landscapes will be characterized by a horizontal series of blocks framed by a series of clusters of major feature-categories. 
When coupled with annotated response-categories, such a horizontal series of blocks provides ''readable'' and ''visible'' information defining this cluster of subjects. One piece of vital information is the explicit map of so-called ''atypical subjects''. Here, an ''atypical subject'' refers to a study subject who is encoded with one annotated response-category but is found in an individual risk-landscape neighborhood shared with several other study subjects who are encoded with very different annotated response-categories. A large collective of ''atypical subjects'' allows us to fundamentally resolve the aforementioned ''imbalance phenomenon'' issue <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>.</p><p>Putting together these two sub-population specific characteristics, we can further point to the fact that, as a byproduct, such a topological space of individual risk-landscapes is an informative platform for building variants of explainable inferential decision-making, including prediction and classification.</p></div>
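<div xmlns="http://www.tei-c.org/ns/1.0"><p>The alternative-vs-null entropy comparison described in this subsection can be illustrated with a small simulation. The sketch below is one plausible reading and not the authors' implementation: it assumes that both ensembles are drawn column by column with the column sums held fixed, that the mimicries use the observed within-column row proportions, that the null tables use the row-sum proportions, and that every column sum is positive. The function names and the toy counts are hypothetical.</p>
<eg><![CDATA[
```python
import math
import random
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a 1D histogram given as a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def simulate_table(col_sums, col_probs, rng):
    """Draw one table: column j spreads col_sums[j] subjects over the rows
    with row probabilities col_probs[j], so the column sums stay fixed."""
    n_rows = len(col_probs[0])
    table = [[0] * len(col_sums) for _ in range(n_rows)]
    for j, (n, probs) in enumerate(zip(col_sums, col_probs)):
        draws = Counter(rng.choices(range(n_rows), weights=probs, k=n))
        for i, c in draws.items():
            table[i][j] = c
    return table


def entropy_ensembles(table, row, n_sim=200, seed=0):
    """Return (alternative, null) entropy samples for one row of the table."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    col_sums = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    row_sums = [sum(r) for r in table]
    total = sum(row_sums)
    # Ensemble 1 (assumed form): mimic the observed table column by column.
    alt_probs = [[table[i][j] / col_sums[j] for i in range(n_rows)]
                 for j in range(n_cols)]
    # Ensemble 2 (assumed form): every column follows the row-sum proportions,
    # as if the row feature were independent of Y.
    null_probs = [[rs / total for rs in row_sums]] * n_cols
    alt = [entropy(simulate_table(col_sums, alt_probs, rng)[row])
           for _ in range(n_sim)]
    null = [entropy(simulate_table(col_sums, null_probs, rng)[row])
            for _ in range(n_sim)]
    return alt, null


def min_error_sum(alt, null):
    """Smallest Type-I + Type-II error over all thresholds and both directions."""
    best = 1.0
    for t in set(alt + null):
        type1 = sum(x <= t for x in null) / len(null)  # null flagged by "entropy <= t"
        type2 = sum(x > t for x in alt) / len(alt)     # alternative missed by that rule
        s = type1 + type2
        best = min(best, s, 2.0 - s)  # 2 - s is the error sum of the reversed rule
    return best


# Toy table (hypothetical counts): rows are two covariate feature-categories,
# columns are the response-categories (0,0), (0,1), (1,0), (1,1).
observed = [[900, 40, 30, 10],
            [300, 120, 90, 60]]
alt, null = entropy_ensembles(observed, row=1)
print(round(min_error_sum(alt, null), 3))
```
]]></eg>
<p>A value near zero indicates that the two entropy distributions barely overlap, so the row would be a candidate major feature-category under this reading; a value near one indicates that the row carries little information about Y beyond its marginal distribution.</p></div>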
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. MAJOR RESULTS OF THIS PAPER</head><p>With Y = (STK, HD) as the response variable, we explore the Re-Co dynamics representing the chronic disease dynamics underlying the 2015 BRFSS database. Upon the Kaggle data set, our CEDA computations first illustrate why a feature-category specific 1D histogram is the simplest form of ''a piece of information'' within this categorical data world. We then explicitly demonstrate the unsuitability of the ''marginal form of information''. This simple fact has a wide spectrum of profound impacts that would be seen not only in Data Analysis as a scientific discipline <ref type="bibr">[16]</ref>, but also in all sciences, since so far data analysis methodologies from statistics and ML primarily rely on ''operations of variables'', such as all modeling-based topics and methodologies, like variable selection among many others. That is, from the ICiD perspective, it is legitimate to seriously question the validity of methodologies developed in these two fields, since a methodology employed in any data analysis for sciences needs to pass ''the test of experience''. This is the most crucial criterion underlying any scientific discipline, as advocated by John Tukey <ref type="bibr">[16]</ref> in his 1962 paper titled ''The future of data analysis''.</p><p>Secondly, the recognition of Kolmogorov's randomness-proper on any contingency table plays an instrumental role in our refined version of CEDA. In fact, this concept is essential and fundamental in its own right in Data Analysis beyond the categorical data world. Given that a histogram can very well approximate any quantitative variable's entire empirical distribution <ref type="bibr">[17]</ref>, this concept is indeed applicable across all data types. Its importance is self-evident in its capability of facilitating the pair of alternative-vs-null entropy distributions. 
It is extremely critical that Type-I and Type-II errors can be evaluated without assuming any man-made modeling structures and distributional assumptions in any data analysis.</p><p>Thirdly, we build algorithms to conduct CEDA computing step-by-step: from identifying and confirming major 1-feature-categories, to major 2-feature-categories, to major 3-feature-categories within each sub-population. Among major 2-feature-categories, we show diverse forms of asymmetry of order-2 interacting effects across a series of feature-pairs. Such diversity of asymmetric interacting effects is meant to reiterate a key point in data analysis: invaluable knowledge of disease mechanisms is available to be discovered only if data analysts and domain scientists are willing to explore.</p><p>Fourthly, a sub-population's computed disease mechanisms are represented through the collection of all computed and confirmed order-3 interacting effects, so-called major 3-feature-categories with either positive or negative risks. The presence-absence memberships of this collection of major 3-feature-categories are compiled into a binary bipartite network matrix. Thus, this graphic display collectively reveals all involved subjects' individual risk-landscapes as series of positive or negative disease-risk exposures. After rearrangements via hierarchical clustering on the row-axis of subjects and the column-axis of major 3-feature-categories, respectively, this block-pattern sustained heatmap reveals explicit relational patterns through the authentic topology of individual risk-landscapes defined on the subject space and the complex structured dependency on the collection of major 3-feature-categories. This heatmap allows us to figure out characteristics of the sub-population specific disease mechanisms via horizontal series of blocks discovered on the scale of a category of Y and on a finer scale of clusters within a category of Y. 
That is, such a heatmap indeed sustains functionally critical and philosophically vital parts of the information content in data (ICiD) pertaining to the sub-population under study.</p><p>Fifthly, along the heterogeneity-axis of GenHL = 5 and Age = 1, 3, 4, and 5, we then patch and link all relational patterns derived from the four sub-populations into a composite complex system. Such global functionality embraced by the linked four sub-populations provides one important aspect of understanding the whole complex system. The grand global view of chronic disease dynamics underlying the 2015 BRFSS database, embracing all 24 sub-populations, is presented separately in a companion study.</p><p>Sixthly, the four sub-population specific heatmaps collectively and explicitly reveal the existence and prevalence of ''atypical subjects'' across diseased and non-diseased categories. These atypical subjects would definitely cause ''errors'' for any classifier. From this standpoint, the ''imbalance phenomenon'' is indeed not the intrinsic cause of the poor performance of classifiers. On the other hand, if there are no ''atypical subjects'' present in a sub-population, this heatmap graphic display would allow perfect classification even in the presence of a very severe ''imbalance phenomenon''. That is, the ''imbalance phenomenon'' is simply a fundamental misconception.</p><p>Finally, we conclude that the explicit demonstration of ''atypical subjects'' reflects the fact that building ICiD is indeed the ultimate goal of data analysis. Subsequently, any inferential operations must be performed strictly in accord with ICiD. This is the merit of pointing out this long-standing big mistake. 
On the other hand, we also emphasize here that a comprehensive study of complex systems must achieve ICiD.</p><p>From the technical perspective, the most far-reaching implication of our methodological developments in this paper is that this CEDA paradigm can, in fact, be at the heart of all structured data analysis of any data type, since each quantitative data set can be categorized and simultaneously retain the chief part of its ICiD. Then, from this ICiD perspective, the brand-new concept of individual risk-landscape truly provides authentic insights through its topological subject space. Through its block-sustained heatmap display, the comprehensive and vital pattern information pertaining to the nature of complex chronic disease dynamics is explicitly summarized. As such, it becomes rather unthinkable to conduct any inferential decision-making without embracing insights from the subject's individual risk-landscape and its topological neighborhood structures.</p><p>We organize the rest of this paper as follows. In Section II, we lay the foundations for the CEDA paradigm, including arguments for the simplest form of a piece of information and the directional associative relation from any feature-set toward the response variable Y = (STK, HD), as well as reliability checks based on Kolmogorov's randomness-proper. In Section III, we develop the algorithm for CEDA computing of major feature-categories of various orders and explain and visualize their rather convoluted interacting effects. Section IV is devoted to presenting diverse kinds of asymmetry found in order-2 interacting effects and their evolutions along the age-axis. In Section V, we show results centered around individual risk-landscapes and their heatmaps, along with consequent summarizing statistics. We also construct the global dynamics of Y = (STK, HD) under GenHL = 5 by combining results from the four sub-populations along the age-axis. 
In the conclusion section, we reflect on the potential impacts of our CEDA-enabled topological results and the induced issues within and beyond Data Analysis.</p></div>
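<div xmlns="http://www.tei-c.org/ns/1.0"><p>The notions of individual risk-landscape and ''atypical subject'' summarized above can be made concrete with a toy example. This is a hedged sketch only: the paper does not fix the dissimilarity measure at this point, so Hamming distance, the k-nearest-neighbour majority rule, and all variable names and codes below are illustrative assumptions rather than the authors' procedure.</p>
<eg><![CDATA[
```python
def risk_landscape(subject, feature_categories):
    """Binary membership vector of one subject over the selected feature-categories."""
    return tuple(
        int(all(subject[var] == cat for var, cat in fc.items()))
        for fc in feature_categories
    )


def hamming(u, v):
    """Number of positions where two binary landscapes differ."""
    return sum(a != b for a, b in zip(u, v))


def atypical_subjects(landscapes, labels, k=3):
    """Flag subjects whose k nearest landscapes mostly carry a different label."""
    flagged = []
    for i, u in enumerate(landscapes):
        others = [j for j in range(len(landscapes)) if j != i]
        # Sort by distance, breaking ties by index so the result is deterministic.
        others.sort(key=lambda j: (hamming(u, landscapes[j]), j))
        differing = sum(labels[j] != labels[i] for j in others[:k])
        if differing > k / 2:
            flagged.append(i)
    return flagged


# Hypothetical major 3-feature-categories (variable names and codes are illustrative).
feature_categories = [
    {"A": 1, "B": 0, "C": 2},
    {"A": 0, "B": 0, "C": 2},
]
print(risk_landscape({"A": 1, "B": 0, "C": 2}, feature_categories))  # (1, 0)

# Toy landscapes over four feature-categories; "D" = diseased, "H" = non-diseased.
landscapes = [
    (1, 1, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    (0, 0, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1),
    (1, 1, 0, 0),  # labelled "H" but sits among the "D" landscapes
]
labels = ["D", "D", "D", "H", "H", "H", "H"]
print(atypical_subjects(landscapes, labels))  # [6]
```
]]></eg>
<p>The last subject shares its landscape with diseased subjects while being annotated as non-diseased, so any classifier that relies on these memberships alone would misclassify it regardless of how balanced the response-categories are.</p></div>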
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. WHAT INFORMATION LOOKS LIKE?</head><p>Within the metric-free 21-dimensional categorical data world, the journey of data analysis naturally commences with addressing the simplest question: What does a piece of information look like? This inquiry is especially significant because such a piece of information remains invariant to all permutations along all dimensions of the histogram. Only after obtaining an answer to this question does it become feasible to address the subsequent critical and fundamental question: What is ICiD made of? In the following two subsections, we exemplify CEDA computations for selecting and confirming major feature-categories in a bottom-up data-driven fashion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. WHAT IS THE SIMPLEST FORM OF A PIECE OF INFORMATION?</head><p>Take any one of the 21 categorical feature-variables. Does one of its 1D categorical data points mean anything? The answer is apparently negative, since it is simply a label-code that can be arbitrarily encoded. A label-code only marks a ''location'' on the metric-free axis of this feature-variable. Further, an aggregation of a single label-code is also meaningless in relation to this feature-variable's multiple locations. Furthermore, missing the aggregation of any one label-code along this feature-variable's location-axis will distort the formation of information. As such, a label-code as a ''location'' acts like an element in forming a piece of information, and the simplest form of a piece of information is delivered by a 1D histogram of a 1D categorical feature-variable. This is the answer to the first question.</p><p>When two 1D categorical feature-variables are observed or derived from the same system, this bivariate feature-variable owns a 2D contingency table or histogram. All aspects of relational information content in this 2D data set are fully described by ''location-to-location'' correspondences, as visibly laid out in the 2D contingency table. The same holds for all relations involving more than two categorical feature-variables. As such, the chief mechanism of forming relational information within a categorical data world still operates via the fundamental ''location-to-location'' correspondence within their histogram, the so-called hyper-contingency table.</p><p>Likewise, a comparison of two histograms is conducted on the ''location-to-location'' basis. This comparison is not altered by any permutation applied to their common metric-free axis. 
Thus, at least ideally, the ''location-to-location'' basis still gives rise to the full and meaningful information for comparing two histograms of any dimensionality. By the same argument, one effective way of comparing multiple histograms is simply to add an extra categorical ID-variable.</p><p>In summary, we term a ''label-code'', meaning a ''location'' of any 1D data point, ''an element of information''. It is understood that such an element of information bears a feature-variable-specific ''location'' message. With this concept of ''element of information'', we clearly see that a 1D histogram is indeed the most fundamental form of ''a piece of information'' in any categorical data world. Given that the ''location-to-location'' correspondence is the most fundamental mechanism of relational information formation, our next task is to effectively extract all essential relational information content from the data's very high-dimensional histogram.</p></div>
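<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ''location-to-location'' comparison described above can be illustrated with a minimal Python sketch. The label-codes, the re-encoding map and the helper names (histogram, stack_with_id) are hypothetical and are not taken from the CEDA software; the sketch only shows that a 1D histogram is unchanged by an arbitrary re-encoding of its label-codes, and that several histograms can be compared by adding a categorical ID-variable.</p>

```python
from collections import Counter


def histogram(labels):
    """1D histogram: number of data points at each label-code (location)."""
    return Counter(labels)


def stack_with_id(samples):
    """Combine several samples into one 2D table keyed by (ID, label-code).

    samples maps a hypothetical sample ID to its list of label-codes.
    """
    table = Counter()
    for sample_id, labels in samples.items():
        for label in labels:
            table[(sample_id, label)] += 1
    return table


# Hypothetical label-codes of one categorical feature-variable.
codes = ["a", "b", "a", "c", "a", "b"]

# An arbitrary re-encoding of the label-codes (a permutation of locations).
recode = {"a": "z", "b": "x", "c": "y"}
recoded = [recode[c] for c in codes]

h_original = histogram(codes)
h_recoded = histogram(recoded)

# Location-to-location check: every bin keeps its count under the re-encoding.
same_information = all(h_original[k] == h_recoded[recode[k]] for k in h_original)

# Comparing two histograms via an extra categorical ID-variable.
table = stack_with_id({"group1": codes, "group2": ["a", "c", "c"]})
```

<p>The resulting table is itself a 2D histogram, so the comparison of the two groups reduces to the same contingency-table setting used throughout this section.</p></div>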
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. WHAT ICiD IS MADE OF?</head><p>Next we turn to the fundamental question: What is ICiD primarily made of? There are only two possible answers in sight in this categorical data world: either the marginal information of feature-variables, or the 1D histogram of a feature-category. As aforementioned, a 1D histogram is the simplest form of a piece of information, while the marginal information of any feature-variable or feature-set involves multiple 1D histograms arranged in a contingency table format. The first critical difference between these two possible answers rests on their different scales. A 1D histogram is of category-specific locality scale, while a feature-variable's marginal information is of the global scale. The second critical difference concerns their meanings. A 1D histogram is clear and explainable. In contrast, a feature-variable's marginal meaning can be confusing because of conflicting meanings derived from different localities. That is, individual category-specific meanings can be lost, even distorted, within the marginal version. Such loss and distortion will surely miss many important and essential feature-categories. We explicitly illustrate here why marginal information is not the fundamental format of information content representing categorical data. In fact, the feature-category is the right format for revealing the aforementioned ''broken symmetry'' as another key characteristic of complex systems <ref type="bibr">[7]</ref>. Here this characteristic is seen through the drastically distinct pattern information emitted from the distinct categories of a feature-variable or feature-set.</p><p>In this subsection, our illustrating example is built upon the sub-population of GenHL = 5 and Age = 1, which consists of 1776 subjects. 
As aforementioned, we focus on the Re-Co dynamics with the categorical response variable Y = (STK, HD) and the remaining 17 categorical features as covariate variables, including Diabetes, High Blood Pressure (HighBP), High Cholesterol (HighChol), etc. The response variable Y has four categories of bivariate disease-status: {(0, 0), (1, 0), (0, 1), (1, 1)}. Among the covariates, BMI, Education (EDU) and Income are encoded with three categories, while Mental Health (MentHlth) and Physical Health (PhyHlth) are encoded with four categories. The remaining 12 covariate feature-variables are all binary.</p><p>We begin by illustrating the associative relations between Y = (STK, HD) and HighBP through the 2 &#215; 4 contingency Table <ref type="table">1</ref>. This table is denoted as HCT [(STK, HD); HighBP]. The two rows of this table are two 1D histograms with 4 bins pertaining to the 1-feature-categories HighBP 0 and HighBP 1 , standing for two categories of subjects: those not having and those having high blood pressure, respectively. They are the 1D histograms of Y conditioning on HighBP = 0 and HighBP = 1, respectively. In contrast, the column-sum vector pertaining to the four response categories, which is the 1D histogram of Y, is ''constant'' with respect to all covariate variables.</p><p>By listing the three histograms within Table <ref type="table">1</ref>, we intend to compare the individual 1D histograms of Y conditioning on HighBP 0 or HighBP 1 with the marginal 1D histogram of Y. One meaningful comparison is performed by comparing the row-wise conditional (Shannon) entropy with that of the column-sum vector <ref type="bibr">[18]</ref>. The entropy reduction of CE[Y|HighBP 0 ] is attributed to the observation of relatively more subjects in the non-diseased category Y = (0, 0) and fewer subjects in the diseased categories, Y &#8712; {(0, 1), (1, 0), (1, 1)}, in comparison with the column-sum vector of proportions of Y. 
In contrast, the entropy increase of CE[Y|HighBP 1 ] is attributed to the observation that the non-diseased category Y = (0, 0) has a reduced proportion, although it still holds the majority of HighBP 1 subjects, while the proportions of the diseased categories, Y &#8712; {(0, 1), (1, 0), (1, 1)}, increase slightly against the column-sum vector of proportions of Y. In this fashion, the row vector of HighBP 1 becomes more evenly distributed among the non-diseased and diseased categories than the column-sum vector of Y. This is the phenomenon of ''imbalance'', which underlies the somewhat counterintuitive scenario in which the extra information of HighBP 1 promotes more, not less, uncertainty about Y.</p><p>As will be confirmed in the next subsection, both conditional entropies CE[Y|HighBP 0 ] and CE[Y|HighBP 1 ] pass the reliability check of being significantly different from CE[Y]. That is, both covariate categories are highly associated with Y. However, if the predictive perspective is taken as the sole focus for associative relations, then these two 1-feature-categories give rise to two rather conflicting messages. The information of HighBP 0 is good for a predictive relation with Y, while the information of HighBP 1 is not. That is, using predictive capacity as a way of quantifying the strength of the associative relationship between two variables is fundamentally improper, especially under the ''imbalance phenomenon''.</p><p>Further, from the disease dynamics perspective, it is transparent that HighBP 0 strongly points to less risk of the bivariate disease, while HighBP 1 points to higher risk. In sharp contrast, the marginal conditional entropy CE[Y|HighBP] = 0.73075, which is calculated as the weighted sum of CE[Y|HighBP 0 ] and CE[Y|HighBP 1 ] with weights 843/1776 and 933/1776, respectively, does not convey either of the two directional associative relations. 
That is, the effects on Y incurred by the feature-variable HighBP cannot be properly delivered by its marginal relationship with the response variable Y. In summary, the description of the associative relation of HighBP-to-(STK, HD) is necessarily of category-locality nature. Based on this simple example, we are confident that the 1D histogram is the answer to the question: What is ICiD made of? That is, the quest of data analysis is to extract all relevant pieces of relational information in the form of 1D histograms. We likewise conclude that all pattern information in ICiD is of locality nature. More evidence is seen through the interacting effects conveyed by categories of 1D covariate feature-pairs in Section IV. The implications of this somewhat simplistic statement are far-reaching. A major impact is that, under the shadow of the data's ''imbalance phenomenon'', all Statistics and Machine Learning topics become by and large invalid because they rely solely on marginal information of global nature. Consequently, all modeling approaches in these two fields are likely to fail.</p><p>TABLE 1. Contingency table HCT [(STK, HD); HighBP] through the perspective of heterogeneity (GenHL = 5, Age = 1).</p></div>
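<div xmlns="http://www.tei-c.org/ns/1.0"><p>The entropy comparison of this subsection can be sketched in a few lines of Python. The column-sum vector (1372, 106, 231, 67) and the row-sums (843, 933) are the margins quoted in the text; the individual cell counts below are placeholders chosen only to match those margins and are NOT the published entries of Table 1. With natural logarithms, the column-sum vector gives an entropy of approximately 0.7566, in agreement with the reported CE[Y]. The function name shannon_entropy is our own.</p>

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural logarithm) of a histogram given as bin counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Column-sum vector of Y = (STK, HD) as quoted in the text (total 1776).
y_counts = [1372, 106, 231, 67]

# PLACEHOLDER cell counts (NOT the published Table 1): chosen only so that
# the row-sums (843, 933) and the column-sums agree with the quoted margins.
rows = {
    "HighBP0": [741, 30, 56, 16],
    "HighBP1": [631, 76, 175, 51],
}

ce_y = shannon_entropy(y_counts)
ce_rows = {name: shannon_entropy(r) for name, r in rows.items()}

# Marginal conditional entropy: row entropies weighted by row proportions.
n_total = sum(y_counts)
ce_marginal = sum(sum(r) / n_total * ce_rows[name] for name, r in rows.items())
```

<p>Because the marginal value is a weighted average of the two row entropies, it necessarily lies between them, which is why it cannot convey either of the two opposite directional relations on its own.</p></div>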
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. KOLMOGOROV'S RANDOMNESS-PROPER AND RELIABILITY CHECK</head><p>Next, we discuss how to make sure that both pieces of information, the two 1D histograms of Y conditioning on HighBP 0 or HighBP 1 , are significant by passing their reliability checks. For a piece of information, its reliability check is performed by contrasting precisely two versions, alternative and null, of Kolmogorov's randomness-proper pertaining to Table 1 <ref type="bibr">[9]</ref>. The alternative version is column-wise randomness given only the corresponding column-sum. That is, each column vector is seen as a realization of Multinomial randomness given its column-sum and its observed column-specific vector of proportions. In contrast, the null version is the randomness of the row-sum vector being equally imposed onto all columns. That is, each column vector is seen as a realization of Multinomial randomness given its column-sum and the common proportion vector of row-sums. These versions cover all randomness observed within a contingency table like Table <ref type="table">1</ref>, while the null version bears no associative information regarding HighBP to Y.</p><p>Here are the technicalities of the alternative and null versions of randomness underlying the 2 &#215; 4 contingency table in Table <ref type="table">1</ref>. The 4-dim column-sum vector (1372, 106, 231, 67) is fixed with a total of 1776. Its entries present four column-wise constraints in forming the contingency Table <ref type="table">1</ref>. That is, both kinds of randomness are conditioned on this column-sum vector. Subsequently, the 2-dim row-sum vector (r 0 , r 1 ) = (843, 933) is an observed vector that is specifically subject to the randomness of the covariate feature-variable HighBP as a whole under the constraint of the total sum 1776. In other words, each row-sum's randomness is linked to the randomness of its four components under the four column-wise constraints. 
With randomness-proper tied to all observed entries of Table <ref type="table">1</ref> and $(r_0, r_1)$, we can depict all aspects of the randomness-proper associated with Table <ref type="table">1</ref> as follows.</p><p>Alternative Randomness: This randomness for the four observed columns of Table <ref type="table">1</ref> is given as $\mathrm{MN}(n_y, P^a_y)$ with $y \in \{(0,0), (1,0), (0,1), (1,1)\}$ and $P^a_y = (n_y[0]/n_y,\ n_y[1]/n_y)$, with $(n_y[0], n_y[1])^{\prime}$ being the $y$-th column vector. Such multinomial randomness protocols constitute the randomness-proper underlying the alternative setting against the null setting described below.</p><p>Null Randomness: The column-wise null randomness is given by the multinomial distribution $\mathrm{MN}(n_y, P^o)$ with $P^o = (r_0/1776,\ r_1/1776)$.</p><p>With the above alternative and null randomness specifications for Table 1, a generic form of simulated contingency table with respect to having the alternative-effect and the null-effect of HighBP, denoted by HCT [Y; HighBP] and HCT [Y; null-HighBP], respectively, is given in Table 2.</p><p>TABLE 2. Generic form of simulated contingency table of</p><p>From the two panels of Fig. <ref type="figure">1</ref>, we clearly see that both pieces of information, assessed through the conditional entropies of Y given HighBP 0 and HighBP 1 , respectively, have zero sums of Type-I and Type-II errors. That is, both are confirmed as significant. In contrast, the marginal alternative entropy distribution is expected to be centered around CE[Y|HighBP] = 0.73075, which would heavily overlap with the marginal null entropy distribution centered around CE[Y] = 0.7565562. This example illustrates why marginal evaluations of the potential effects of any feature-variable risk giving rise to misinformation. 
It is critical to perform this reliability check based on the minimum sum of Type-I and Type-II errors, or the overlapping area of the alternative and null entropy distributions, since the commonly used criterion based on the P-value is simply too optimistic in the sense that too many false-positive feature-categories are selected. This fact truly reflects the importance of Kolmogorov's randomness-proper when analyzing real-world data.</p></div>
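<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reliability check described in this subsection can be sketched with the Python standard library as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each column total is split between the two HighBP rows by independent draws (a two-category multinomial), the row entropy is recomputed for every simulated table, and the minimum sum of Type-I and Type-II errors is approximated by scanning single cut points between the two simulated entropy samples. The column-sums and row-sums are quoted from the text; the row-0 cell counts are placeholders and NOT the published Table 1. The function names, the number of simulations (100) and the seed are our own choices.</p>

```python
import math
import random


def shannon_entropy(counts):
    """Shannon entropy (natural logarithm) of a histogram given as bin counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simulate_row_entropy(col_sums, p_row0, row, n_sim, rng):
    """Entropy of one row over n_sim simulated tables.

    Each column y with total n_y is split between row 0 and row 1 by
    independent draws with probability p_row0[y] of landing in row 0.
    """
    entropies = []
    for _ in range(n_sim):
        sim_row = []
        for n_y, p0 in zip(col_sums, p_row0):
            in_row0 = sum(1 for _ in range(n_y) if p0 > rng.random())
            sim_row.append(in_row0 if row == 0 else n_y - in_row0)
        entropies.append(shannon_entropy(sim_row))
    return entropies


def min_error_sum(null, alt):
    """Smallest Type-I + Type-II error over all single cut points."""
    best = 1.0
    for cut in sorted(null + alt):
        # Rule "declare alternative if the value is at most cut".
        type1 = sum(1 for v in null if cut >= v) / len(null)
        type2 = sum(1 for v in alt if v > cut) / len(alt)
        left = type1 + type2
        # The mirrored rule "declare alternative if the value exceeds cut"
        # has error sum (1 - type1) + (1 - type2) = 2 - left.
        best = min(best, left, 2.0 - left)
    return best


rng = random.Random(0)
col_sums = [1372, 106, 231, 67]        # column-sums quoted in the text
p_null = [843 / 1776] * 4              # P^o: common row-sum proportion
row0_cells = [741, 30, 56, 16]         # PLACEHOLDER cells, not Table 1
p_alt = [c / n for c, n in zip(row0_cells, col_sums)]   # P^a_y per column

errors = {}
for row in (0, 1):
    alt = simulate_row_entropy(col_sums, p_alt, row, 100, rng)
    null = simulate_row_entropy(col_sums, p_null, row, 100, rng)
    errors[row] = min_error_sum(null, alt)
```

<p>A feature-category would then be retained when its entry in errors falls below the chosen threshold; an error sum near 1 indicates that the alternative and null entropy distributions are indistinguishable.</p></div>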
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. DATA-DRIVEN BOTTOM-UP CEDA PARADIGM</head><p>After the description of the reliability check in the previous subsection, it is essential to put in perspective the technical meaning of confirming both pieces of information, HighBP 0 and HighBP 1 , pertaining to the dynamics of Y, since these dynamics are limited to the sub-population defined by (GenHL, Age) = (5, 1). The structural dependency and de-associating operation discussed in detail in <ref type="bibr">[8]</ref> assure that these two 1-feature-categories HighBP 0 and HighBP 1 indeed provide extra information about the dynamics of Y beyond what the bivariate-category (GenHL, Age) = (5, 1) can provide. This seemingly simple de-associating operation is critical for identifying true factors underlying Y on two fronts. First, its chief merit is to identify potential major feature-categories of various orders. Secondly, it provides a way of checking whether a feature-category indeed provides extra-info, instead of piggy-backing upon an already confirmed major feature-category. Via these two fronts, we construct our CEDA bottom-up data-driven computational developments in this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. MFCI ALGORITHM</head><p>As the first phase of developing the CEDA paradigm, we build an algorithm for Major Feature-Category Identification (MFCI), from order-1 to higher orders. Before describing the MFCI algorithm, we first clarify the ''identification'' operation of the CEDA paradigm, which facilitates two types of tasks, MFC and Extra-Info, in the MFCI algorithm given below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) FOR MFC</head><p>To identify potential major k-feature-categories (MFC of order k) pertaining to a covariate feature-set A with cardinality k, we first build a hyper-contingency table HCT [Y; A] based on the entire sub-population of data points. Secondly, we perform a reliability check upon each row of HCT [Y; A]. Thirdly, a decision of identification is made with respect to a chosen threshold on the minimum sum of Type-I and Type-II errors.</p></div>
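<div xmlns="http://www.tei-c.org/ns/1.0"><p>The first step of the MFC task, building HCT [Y; A], can be sketched as below. The records are hypothetical toy data and the function name build_hct is our own; the sketch assumes each subject is a mapping from feature name to label-code. Each row of the table is the 1D histogram of Y within one category of the feature-set A, which is the object submitted to the reliability check.</p>

```python
from collections import Counter, defaultdict


def build_hct(records, response, feature_set):
    """Hyper-contingency table HCT[Y; A].

    Each key is one category of the feature-set A (a tuple of label-codes);
    each value is the 1D histogram of the response Y within that category.
    """
    table = defaultdict(Counter)
    for rec in records:
        category = tuple(rec[f] for f in feature_set)
        table[category][rec[response]] += 1
    return table


# Hypothetical toy records; the real data have 17 covariate feature-variables.
records = [
    {"HighBP": 0, "HighChol": 0, "Y": (0, 0)},
    {"HighBP": 0, "HighChol": 1, "Y": (0, 0)},
    {"HighBP": 1, "HighChol": 1, "Y": (0, 1)},
    {"HighBP": 1, "HighChol": 1, "Y": (1, 1)},
    {"HighBP": 1, "HighChol": 0, "Y": (0, 0)},
]

# Order-1 table (A = {HighBP}) and order-2 table (A = {HighBP, HighChol}).
hct_order1 = build_hct(records, "Y", ["HighBP"])
hct_order2 = build_hct(records, "Y", ["HighBP", "HighChol"])
```

<p>Keying rows by tuples of label-codes keeps the table metric-free: no ordering or distance between categories is ever used.</p></div>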
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) FOR EXTRA-INFO</head><p>To identify whether a 1-feature-category, say B b , can provide extra-info upon an identified major k-feature-category, say A a , we first build a hyper-contingency table HCT [Y; B|A a ] based on the collection of data points belonging to A a . Secondly, we perform a reliability check upon the b-th row of HCT [Y; B|A a ]. Thirdly, a decision of identification is again made with respect to a chosen threshold on the minimum sum of Type-I and Type-II errors.</p><p>Within the sub-population (GenHL, Age) = (5, 1), both tasks of identification, for MFC and for Extra-Info, are so-called de-associating operations working on two different data settings: one is the sub-population specified by (GenHL, Age) = (5, 1), and the other is specified by the targeted major feature-category, such as HighBP 0 ; see details in <ref type="bibr">[8]</ref>. We now describe the MFCI algorithm below. (Naturally, the finite sample size forces our computations to stop at Step[k+1]-1 when no more 1-feature-categories can be found to provide Extra-Info upon the identified major k-feature-categories.) The above description of the MFCI algorithm is designed to cope with finite computing resources, in particular when facing a large number of 1D covariate feature-variables in a data set. It might miss some major feature-categories of high orders whose 1D component features are not involved in the selected major 1-feature-categories. Such high-order feature-categories are relatively rare. On the other hand, if computing resources are large enough, then the order of Step[2]-1 and Step[2]-2 can be switched, and the concern of missing some order-2 interacting effects is resolved. The same applies to switching the order of Step[3]-1 and Step[3]-2. Nonetheless, if the MFCI-3 step is performed with its two steps in reversed order, then there would be $\binom{17}{3} = 680$ triplets of features and more than 5440 reliability checks. 
From the cost-benefit aspect, we do not switch the order of these two steps. Patterns reported in the next two sections seem to support this decision.</p><p>Here we report results from the first two steps of the MFCI algorithm applied to the sub-population (GenHL, Age) = (5, 1). By applying the MFCI-1 step, we found 13 major 1-feature-categories with respect to a chosen threshold value of 0.1 on the minimum sum of Type-I and Type-II errors (Error-I&amp;II for short); see Fig. <ref type="figure">2</ref>. From the column-perspective of this figure, it is worth noting that these 13 major 1-feature-categories consist of one positive and one negative disease-risk group, marked with ''+'' and ''-'' signs. Members of each group share a significant number of subjects. Such overlapping patterns clearly indicate strong structural dependency among these 1D features. From the row-perspective, it is obvious that many subjects share the same or very similar memberships across the 13 major 1-feature-categories, while they belong to very distinct response-categories. These patterns together motivate carrying out the MFCI-2 step for more informative associative patterns.</p></div>
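<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cost quoted above for reversing the two sub-steps of MFCI-3 can be checked with a short calculation. The category counts per covariate follow the description in Section II-B (12 binary features, 3 features with three categories, 2 features with four categories); the assumption that every category combination is observed, so that each yields one reliability check, is ours.</p>

```python
import math
from itertools import combinations

# Category counts per covariate as described in Section II-B:
# 12 binary features, 3 features with three categories (BMI, EDU, Income),
# and 2 features with four categories (MentHlth, PhyHlth).
n_categories = [2] * 12 + [3] * 3 + [4] * 2

# Number of feature triplets among the 17 covariates.
n_triplets = math.comb(len(n_categories), 3)

# Lower bound: every triplet has at least 2 * 2 * 2 = 8 categories.
lower_bound = n_triplets * 8

# Number of 3-feature-categories (one reliability check per row), ASSUMING
# every category combination actually occurs in the sub-population.
exact_rows = sum(a * b * c for a, b, c in combinations(n_categories, 3))
```

<p>The lower bound reproduces the figure of 5440 in the text, and the non-binary features push the actual count above it, consistent with the phrase ''more than 5440''.</p></div>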
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Upon applying Step[2]-1 of the MFCI-2 step, we found 57 candidate 1-feature-categories that can provide extra-info upon the 13 major 1-feature-categories resulting from the MFCI-1 step. Further, applying Step[2]-2 of the MFCI-2 step, we identified and confirmed 31 major 2-feature-categories; see Fig. <ref type="figure">3</ref>. Overall, the heatmap in Fig. <ref type="figure">3</ref> reveals much clearer associative patterns from major 2-feature-categories of positive and negative disease risks than that in Fig. <ref type="figure">2</ref>. In particular, it is striking to see that subject-members of the non-diseased category Y = (0, 0) are separated into two obvious groups. One group consists of member-subjects having prevalent memberships among 14 major 2-feature-categories of positive disease-risk, and the other group consists of member-subjects having prevalent memberships among 17 major 2-feature-categories of negative disease-risk. This obvious improvement from major 1-feature-categories to major 2-feature-categories naturally motivates us to further carry out the MFCI-3 step. Before we report computational results from the MFCI-3 step in Section V, we report four types of order-2 interacting effects resulting from the MFCI-2 step and illustrate their age-related evolving patterns across the four sub-populations (GenHL, Age) = (5, k) with k = 1, 3, 4, 5 in Section IV. Discoveries of interacting effects and an understanding of their evolution across the age-axis are especially important from the perspectives of societal chronic disease and individual risk dynamics. These scientific discoveries and understanding become self-evident when we face the diverse formats of asymmetry among all involved 2-feature-categories. In contrast, heatmap-based displays of individual risk-landscape topologies and their age-related evolution along the same age-axis are presented in Section V.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. DIVERSE TYPES OF ORDER-2 INTERACTING EFFECTS AND THEIR EVOLUTIONS</head><p>When a 1-feature-category is found to provide Extra-Info upon a major 1-feature-category, this fact signals the potential for discovering essential and important interacting effects of order 2. In particular, when such discoveries are arranged with respect to the age-axis, we can trace their evolutions. Such evolutions are rather interesting and critical. Here, four types of asymmetry of order-2 interacting effects are illustrated, centering on HighBP coupled with four 1D binary features. The four types of interacting effects are characterized by their diverse relations with HighBP as follows: 1) ''DiffWalk being independently equal''; 2) ''Diabetes being nearly completely dominated''; 3) ''HighChol being highly dependently equal''; and 4) ''Smoker being seemingly irrelevant, but strikingly modified at just one locality''. Each type gives rise to one format of asymmetry, displayed as a figure with double-scale panels. At the age scale, such a figure consists of four age-panels with increasing age-category. At the bivariate-category scale for patterns of interacting effects, each age-panel consists of 4 panels: (0,0), (0,1), (1,0) and (1,1). Such discovered asymmetric patterns of interacting effects in general give a clear sense of information complexity. Computationally, we will clearly see in this section the merits of (Step[2]-1, Step[2]-2) in the MFCI-2 step of the MFCI algorithm.</p><p>All interacting effects of order 2 are uploaded to GitHub, with the address listed in the caption. We also note that such explorations can and should likewise be done for interacting effects of any order. Here, it is necessary to reiterate that authentic interacting effects of order 2 and higher will contribute to our true understanding of the bivariate-disease dynamics of Y.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. TYPE-0</head><p>From Fig. <ref type="figure">4</ref>, we discuss our discoveries with respect to each of the 2 &#215; 2 panels of the bivariate-feature (DiffWalk, HighBP) across the four age-categories. The evolution of non-linear interacting effects of the bivariate-feature (DiffWalk, HighBP) across the four Age-panels is exhibited through the above computational patterns and results. Via its graphic display in Fig. <ref type="figure">4</ref>, this evolution strongly indicates that {HighBP} and {DiffWalk} play roughly equal roles along the age-axis, and both {DiffWalk = 0} and {HighBP = 0} retain dominant effects over {HighBP = 1} and {DiffWalk = 1}, respectively. This manifestation of dominance in the interacting effects of the bivariate-feature (DiffWalk, HighBP) sends multiple strong medical and scientific messages about the dynamics underlying Y = (STK, HD) with respect to Age. Here, such messages refer precisely to the potential benefits of changing status from {DiffWalk = 1} to {DiffWalk = 0} and from {HighBP = 1} to {HighBP = 0} at the individual and societal levels. Nonetheless, if changes cannot be made on both {DiffWalk = 1} and {HighBP = 1}, one change would still create a significant impact. This is one of the key merits of figuring out the interacting effects of the bivariate-feature (DiffWalk, HighBP). On the other hand, Fig. <ref type="figure">4</ref> is indeed a graphic display demonstrating the necessity of employing a data-driven bottom-up computational paradigm like CEDA to obtain the authentic information contained in data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. TYPE-I</head><p>From Fig. <ref type="figure">5</ref>, we continue discussing patterns of interacting effects in the same fashion with respect to each of the 2 &#215; 2 panels of the bivariate-feature (Diabetes, HighBP) across the four age-categories. The computed patterns here embrace some intrinsic differences from those found in the above Type-0 of bivariate-feature (DiffWalk, HighBP), in particular in the (0,1)- and (1,0)-panels. {Diabetes = 0} with Error-I&amp;II &#8804; 0.1 in all Age categories.</p><p>2. On the (0,1)-panel, {Diabetes-HighBP = [0,1]} fails to be a major 2-feature-category across all four age-categories. Nonetheless, we confirmed the directional effect: {HighBP = 1} provides Extra-Info upon {Diabetes = 0}, but not the other way around, across all four age-categories. Such Extra-Info results are reflected in the fact that the alternative entropy distribution appears on the right-hand side of the null entropy distribution, coherent with the pattern of {HighBP = 1}, only at Age = 1, while the three pairs of alternative-vs-null CE distributions are nearly completely overlapping at Age = 3, Age = 4 and Age = 5. This evolution of patterns of interacting effects of the bivariate-feature (Diabetes, HighBP) seemingly indicates that the category {Diabetes = 0} has a varying capacity for reducing the disease risk from that of {HighBP = 1} across age-categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>3. On the (1,0)-panel, again {Diabetes-HighBP = [1,0]} fails to be a major 2-feature-category across all four age-categories. We did, however, confirm the one-directional effect: {HighBP = 0} provides extra-info upon {Diabetes = 1}, but not the other way around. The alternative entropy distribution appears on the left-hand side of the null entropy distribution, the same as the pattern of {HighBP = 0} but the opposite of the pattern of {Diabetes = 1}, also across all four age-categories. Thus, patterns of interacting effects  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. TYPE-II</head><p>The three 1D features {DiffWalk}, {HighBP} and {HighChol} are key risk factors of the disease dynamics underlying Y = (STK, HD). At Age = 1, 3 and 4, these three binary factors mutually provide Extra-Info among their categories, while such mutual relations disappear at Age = 5. Nonetheless, the evolutions of patterns of order-2 interacting effects of the bivariate-features ({HighBP}, {DiffWalk}) and ({HighBP}, {HighChol}) are somewhat distinct. As will be seen below through Fig. <ref type="figure">6</ref> and the panel-based summary, the bivariate-feature ({HighBP}, {HighChol}) reveals some extent of ''asymmetry'', which is not identical to the asymmetric patterns found in the bivariate-feature ({HighBP}, {DiffWalk}).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>1. On the (0,0)- and (1,1)-panels, {HighBP-HighChol = [0,0]} and {HighBP-HighChol = [1,1]} are both confirmed as major 2-feature-categories at Age = 1, 3 and 4, but not at Age = 5.</p><p>2. On the (0,1)-panel, {HighBP-HighChol = [0,1]} is confirmed as a major 2-feature-category at Age = 3 and 4, but not at Age = 1 and 5. The alternative entropy distribution is located on the left-hand side of the null entropy distribution, with sizable overlap at Age = 1 and 5 but near-zero overlap at Age = 3 and 4. The interpretation of this evolving pattern over age is that the status {HighBP = 0} is more important than the status {HighChol = 1} in terms of a subject's disease risk.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>3. On the (1,0)-panel, {HighBP-HighChol = [1,0]} is confirmed as a major 2-feature-category only at Age = 4, but not at Age = 1, 3 and 5. The relative position of the alternative entropy distribution with respect to the null entropy distribution evolves from almost entirely overlapping, to entirely separated, and back to heavily overlapping from Age = 1 to Age = 5. This means that {HighChol = 0} is more important than {HighBP = 1} only at Age = 4.</p><p>Though {HighBP} and {HighChol} seem to play equal roles, their interacting effects revealed in the (0,1)- and (1,0)-panels are highly asymmetric across all age-categories. Such evolving asymmetry can hardly be known a priori. That is, the evolution of asymmetric order-2 interacting effects of {HighBP} and {HighChol} can only be described precisely at the category-locality scale, not at the global or marginal scale. The computational approach for patterns of such nature needs to be data-driven and bottom-up, like the CEDA paradigm. These computational and observed facts further reinforce that ICiD consists of 1D histograms of feature-categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. TYPE-III</head><p>Next, we consider the evolving order-2 interacting effects of the bivariate-feature (HighBP, Smoker) across the four age-categories. In Fig. <ref type="figure">7</ref>, we see two significant evolving patterns. The first pattern is that, through the (0,0)-, (0,1)- and (1,1)-panels, the two categories of {Smoker} have nearly zero interacting effects with the categories of {HighBP} at Age = 1, 3 and 4, while this pattern of interacting effect disappears at Age = 5 in a fashion that the alternative and null entropy distributions become heavily overlapping. The second pattern is seen through the (1,0)-panels across age-categories. The pair of alternative and null entropy distributions nearly overlap each other at Age = 1. Then, the alternative one shifts to the left of the null one at Age = 3 and 4. In the end, the alternative one shifts to the right of the null one at Age = 5. This evolving pattern means that the interacting effects of {Smoker = 0} and {HighBP = 1} are visible, but decrease to a great extent in the Age = 5 sub-population. In summary, these two evolving patterns bear significant scientific impact on understanding chronic diseases. Hence, it is worth reiterating that not only does {HighBP} play a dominant role over {Smoker} with highly asymmetric effects, but their relational patterns also change along the age-axis. In fact, as will be seen in the next section, {Smoker} does play an important role through its interacting effects with {DiffWalk}, {HighBP} and {HighChol}. This is an authentic and scientific, but very different, way of describing the effects of smoking in our society. This is indeed rather striking. 
In contrast, {Veggies} does not have similar effects at all.</p><p>Though the evolution of the above four types of order-2 interacting effects is not unthinkable from a retrospective viewpoint, the existence of such seemingly natural types of interacting effects emphasizes one simple fact: the diversity of functional forms of interacting effects can be too complex to be modeled realistically. As such, we again emphasize that these natural and explainable patterns are possible and visible only when we adopt a bottom-up data-driven computational paradigm like CEDA. This simple fact is tied to the categorical-locality nature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. TOPOLOGICAL INDIVIDUAL RISK-LANDSCAPES AND THEIR EVOLUTIONS VIA MFCI</head><p>Four heatmaps of confirmed major 1-feature-categories, one for each of the four age-categories, result from applying the MFCI-1 step of the MFCI algorithm and are shown in Fig. <ref type="figure">8</ref>. The memberships of the four sets of major 1-feature-categories (or 1-risk-factor-categories) are highly overlapping. The common risk factors along the age-axis are {DiffWalk, Diabetes, HighBP, HighChol}. Interestingly, the risk-factors {Smoker, Income} are present from Age = 1 to Age = 3, but drop out at Age = 4 and are replaced by {NoDocbcCost}. This evolution coherently confirms the expected fact that the effects of these two groups of risk-factors are highly age-dependent. There is an evident evolutionary break-down in the Age = 5 panel, which consists of only 3 major 1-feature-categories. This phenomenon is discussed in the subsection just before the Conclusions. Three heatmaps of confirmed major 2-feature-categories for the age-categories Age = 1, 3 and 4 result from applying the MFCI-2 step of the MFCI algorithm and are shown in Fig. <ref type="figure">9</ref>. At Age = 5, we do not find any 1-feature-category able to provide Extra-Info for all confirmed major 1-feature-categories found through the MFCI-1 step. However, we present those 2-feature-categories that merely satisfy the threshold Error-I &amp; II &#8804; 0.1.</p><p>Across the four heatmaps in Fig. <ref type="figure">9</ref> along the age-axis, almost all major 2-feature-categories are primary interacting pairs of major 1-feature-categories. The chief implication of such an evident pattern is that higher-order interacting effects are highly plausible, at least in the age-categories Age = 1, 3 and 4. 
On the one hand, since the 2-feature-categories narrated in the (0,1)- and (1,0)-panels in the previous section are mostly not among the major 2-feature-categories found in Fig. <ref type="figure">9</ref>, diseased-vs-non-diseased subjects belonging to these two 2-feature-categories need to be further separated by at least one more 1-feature-category. On the other hand, subjects in those confirmed major 2-feature-categories narrated in the (0,0)- and (1,1)-panels in the previous section could be further separated to achieve better diseased-vs-non-diseased separation beyond 2-feature-categories, as will be clearly seen in the next two subsections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. RESULTS OF MFCI-3 STEP AT AGE = 1</head><p>In this subsection, we report computed patterns through various heatmaps of major 3-feature-categories within the subpopulation (GenHL, Age) = (5, 1), while similar resultant patterns of the subpopulations at Age = 3 and 4 are reported in the next subsection. Here, the key idea of the individual risk-landscape is introduced, and all subjects' individual risk-landscapes are collectively displayed through three versions of heatmaps. We then construct summarizing contingency tables to confirm that such individual risk-landscape based heatmaps contain significant amounts of pattern information content in data (ICiD). At the end, we demonstrate the apparent existence of so-called ''atypical subjects'' in both diseased and non-diseased response categories.</p><p>In Step[2]-2 of the MFCI-2 step, we identified and confirmed 31 major 2-feature-categories, as seen in Fig. <ref type="figure">3</ref>. We then perform the MFCI-3 step of the MFCI algorithm. Applying Step[3]-1 of the MFCI-3 step, we found 65 major and non-major 1-feature-categories that can provide extra-info upon the 31 major 2-feature-categories resulting from the MFCI-2 step. Further, applying Step[3]-2 of the MFCI-3 step, we identified and confirmed 31 major 3-feature-categories, see Fig. <ref type="figure">10</ref>. There would be 41 confirmed major 3-feature-categories if the confirmation criterion were switched to a P-value of less than 0.05, see Fig. <ref type="figure">11</ref>. This heatmap is presented here to indicate that selection criteria based only on P-values, without involving alternative distributions, are likely to be over-optimistic. All subsequent analyses are based on the results contained in Fig. 
<ref type="figure">10</ref>.</p><p>Here, with the binary bipartite network's matrix lattice as a platform, we collect only the major 3-feature-categories and arrange them along the column-axis in Fig. <ref type="figure">10</ref>. The membership of each major 3-feature-category is represented by the corresponding binary column-vector. The column-axis is framed by a hierarchical clustering (HC) tree derived using the Euclidean distance, while the row-axis is rearranged in a response-category specific fashion. That is, subjects belonging to the same response-category are arranged by that category's own HC-tree, so subjects of different response-categories do not mix together. Such a heatmap is created purely for ease of visualization. We wish to convey that response-category specific patterns in such heatmaps can help shed light on how major 3-feature-categories collectively play their roles in the dynamics of Y.</p></div>
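<div xmlns="http://www.tei-c.org/ns/1.0"><p>The heatmap arrangement described above can be illustrated with a minimal sketch. It assumes a small toy binary membership matrix and a naive average-linkage agglomerative clustering under the Euclidean distance; the linkage rule, the function names hc_leaf_order and heatmap_orders, and the toy data are illustrative assumptions, not the authors' implementation. Columns are ordered by one HC tree, while rows are ordered by a separate HC tree within each response-category, so that response-categories stay contiguous.</p>
<p><![CDATA[
```python
import math


def hc_leaf_order(vectors):
    """Leaf order from naive average-linkage agglomerative clustering.

    The linkage rule is an assumption; the text only states that the
    HC tree is derived using the Euclidean distance.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average Euclidean distance between members of two clusters.
                total = sum(math.dist(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0] if clusters else []


def heatmap_orders(matrix, responses, category_order):
    """Order columns by one HC tree and rows by one HC tree per category."""
    columns = [list(col) for col in zip(*matrix)]
    col_order = hc_leaf_order(columns)
    row_order = []
    for cat in category_order:
        idx = [r for r, y in enumerate(responses) if y == cat]
        local = hc_leaf_order([matrix[r] for r in idx])
        row_order.extend(idx[k] for k in local)
    return row_order, col_order


# Toy data (hypothetical): 6 subjects, 4 feature-categories.
matrix = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
responses = [(0, 0), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1)]
category_order = [(0, 0), (1, 0), (0, 1), (1, 1)]

row_order, col_order = heatmap_orders(matrix, responses, category_order)
reordered = [[matrix[r][c] for c in col_order] for r in row_order]
```
]]></p>
<p>Because rows are clustered within each response-category separately, the reordered matrix keeps all subjects of one category in one horizontal band, which is what makes response-category specific patterns visible.</p></div>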
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) INDIVIDUAL RISK-LANDSCAPE INTERPRETATION</head><p>The chief merit of employing a heatmap here is to explicitly reveal the concept of individual risk-landscape and foster authentic understanding from collective patterns of such risk-landscapes. We first recall that the heatmap shown in Fig. <ref type="figure">10</ref> has the format of a 1776 &#215; 31 matrix with 31 identified and confirmed major 3-feature-categories arranged along the column-axis. The binary memberships of the 1776 subjects in each major 3-feature-category are listed as one column. As such, each subject is represented by a 31-dim binary row-vector across the 31 major 3-feature-categories. Such a 31-dim binary vector explicitly indicates what kinds of positive or negative disease risks this subject is facing simultaneously. Specifically, along the column-axis with 31 columns, each subject within this sub-population (GenHL = 5 and Age = 1) embraces potential positive disease risk via memberships of 11 major 3-feature-categories marked with ''+'' signs and potential negative disease risk via memberships of 20 major 3-feature-categories marked with ''-'' signs.</p><p>As such, a 31-dim binary vector encodes a subject's individual risk-landscape, which becomes the most critical information pertaining to this individual's health. In comparison with the 31 major 2-feature-categories presented in Fig. <ref type="figure">3</ref>, all major 3-feature-categories in Fig. <ref type="figure">10</ref> by and large have higher disease-to-non-disease odds. Such higher odds render clearer and more informative disease-related mechanistic patterns, as derived below.</p><p>All 1776 subjects' 31-dim individual risk-landscapes collectively constitute visible patterns of various scales. The heatmap shown in Fig. 
<ref type="figure">10</ref> is framed by a hierarchical clustering (HC) tree on the column-axis and 4 color-coded bivariate disease categories. At its top internal node, the HC-tree splits into left and right branches, coded as L1 vs. R1, respectively. The L1 branch consists of all 11 major 3-feature-categories of positive disease risk, while the R1 branch consists of all 20 major 3-feature-categories of negative disease risk. Branch L1 further splits into the L1L2 and L1R2 subbranches, which are color-coded gray, with 4, and green, with 7 major 3-feature-categories of positive disease risk, respectively. Likewise, R1 splits into the R1L2 and R1R2 subbranches, color-coded red, with 3, and blue, with 17 major 3-feature-categories of negative disease risk, respectively.</p><p>Each of these four subbranches has its own characteristics due to its distinct composition of major 3-feature-categories. It becomes natural to take these four characteristics into consideration when thinking about the similarity or dissimilarity among study subjects. That is, we make the membership-sums of these four subbranches into four extra feature-variables. The resulting 35-dim (31 + 4) Euclidean distance is used to build an extended version of the heatmap, shown in Fig. <ref type="figure">12</ref>. This revised heatmap reveals more evident block-patterns than the one in Fig. <ref type="figure">10</ref>. Apparently, each block is jointly framed by the membership-cluster of a subbranch of major 3-feature-categories, which are more or less constant, and a cluster of study subjects, who are rather similar in their individual risk-landscapes. These blocks collectively convey explicit pattern-dynamics underlying Y. In particular, a series of horizontally displayed blocks characterizes a cluster of study subjects with visible and explainable pattern information. 
In the next subsection, we elaborate on the merits and importance of such characterization in detail.</p><p>As each horizontal series of blocks induces a well-defined neighborhood for all study subjects participating in this block, this heatmap can be taken as an informative display of a ''topology'' defined on the collection of 1776 study subjects. In mathematical terms, this topological space is equipped with the 35-dim Euclidean distance that defines neighborhoods for all study subjects. In terms of covariate information, a subject's neighborhood is a set of subjects having very similar individual risk-landscapes. This heatmap as a topological display is one of our chief summarizing statistics. The reasons underlying this statement are as follows. The relational pattern information regarding the dynamics of Y, [response-category]-vs-[individual risk-landscape], is in full display in the heatmap. Further, the topological insights will have profound impacts on Data Analysis as a scientific discipline. In contrast, from this topological perspective, many statistical and machine learning topics, ranging from classification to clustering, are not rigorously formulated when facing real-world complex systems, also see <ref type="bibr">[8]</ref>.</p></div>
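<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the extended distance described above, the following code appends one membership-sum per subbranch to each binary risk-landscape row and then looks up a subject's nearest neighbors under the Euclidean distance. The toy rows, the column assignment of the subbranches and the use of a k-nearest rule to stand in for a neighborhood are assumptions for illustration; in the paper the rows are 31-dim, the augmented vectors are 35-dim, and neighborhoods are read off the block-patterns of the heatmap.</p>
<p><![CDATA[
```python
import math


def augment(row, subbranches):
    """Append one membership-sum per subbranch to a binary row."""
    sums = [sum(row[c] for c in cols) for cols in subbranches.values()]
    return list(row) + sums


def neighborhood(target, rows, subbranches, k):
    """Indices of the k subjects closest to `target` in the augmented space.

    Using the k nearest subjects as the neighborhood is an assumption
    made for this sketch.
    """
    aug = [augment(r, subbranches) for r in rows]
    dists = [(math.dist(aug[target], a), i)
             for i, a in enumerate(aug) if i != target]
    return [i for _, i in sorted(dists)[:k]]


# Hypothetical toy layout: 6 feature-category columns in 4 subbranches.
subbranches = {
    "L1L2": [0],
    "L1R2": [1, 2],
    "R1L2": [3],
    "R1R2": [4, 5],
}
rows = [
    [1, 1, 1, 0, 0, 0],  # positive-risk memberships only
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],  # negative-risk memberships only
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],  # mixed memberships
]

print(augment(rows[0], subbranches))
print(neighborhood(0, rows, subbranches, k=1))
print(neighborhood(2, rows, subbranches, k=1))
```
]]></p>
<p>The appended sums give extra weight to how many memberships a subject accumulates within each subbranch, so two subjects with similar counts per subbranch are pulled closer together even when their individual memberships differ.</p></div>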
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) SUMMARIZING STATISTICS BASED ON SUBJECTS' TOPOLOGY</head><p>To see the merits of the topology-bearing heatmap from another aspect of summarizing statistics, we look specifically at one positive and one negative disease risk subbranch, L1R2 and R1R2, respectively, among the four aforementioned subbranches on the column-axis. For expositional simplicity, we interpret the disease risk pertaining to Y &#8712; {(1, 0), (0, 1), (1, 1)} against Y = (0, 0) via the odds. The baseline or overall disease odds in this (GenHL = 5 and Age = 1) subpopulation is 404/1372 = 0.2944. On the L1R2 subbranch, which consists of 7 major 3-feature-categories of positive disease risk, each subject's total memberships range from 0 to 7. The odds for subjects who accumulate 4 to 7 memberships is 109/119 = 0.9160. This odds indicates that a subject having 4 or more memberships within this subbranch has a probability of nearly 0.5 of being diseased. The odds-ratio is calculated as 0.9160/0.2944 = 3.1114. This ratio indicates that these subjects have more than 3 times the odds of being diseased, with either Stroke or Heart disease, compared with subjects in the entire subpopulation in general. In sharp contrast, the odds for subjects who accumulate 1 to 3 memberships is 168/397 = 0.4232, and the odds-ratio is calculated as 0.4232/0.2944 = 1.4375. This ratio indicates that these subjects have nearly 1.5 times the disease odds of subjects in this subpopulation in general. Further, subjects who have zero memberships on this subbranch have an odds of 127/856 = 0.1484. That is, such subjects have an odds-ratio of 0.1484/0.2944 = 0.5041; in terms of odds, subjects in this sub-population in general are twice as likely to be diseased as subjects with zero memberships in this subbranch. 
These three widely spread odds-ratios indicate the informativeness of the L1R2 subbranch regarding positive disease risk.</p><p>However, the heatmap shown in Fig. <ref type="figure">12</ref> reveals much more important visual patterns beyond the above three widely spread odds-ratios and their interpretations based on results associated with the L1R2 subbranch. Here are the essences of three implications derived from the visual patterns:</p><p>1. The superficial disease-imbalance phenomenon is embedded with somewhat surprising hidden structural causes: ''atypical subjects'', as described below. Such causes render any predictive approach unsustainable, not only because of very high error-rates, but also because it is neither informative nor scientifically correct.</p><p>2. The precise and drastically distinct multi-scale block-patterns of topological individual risk-landscapes of all involved subjects are critical for understanding the dynamics of Y = (STK, HD).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>3. The high versus zero intensities of memberships within each block across positive and negative disease risks of major 3-feature-categories pave the way for distinguishing high-risk subjects from lower-risk ones.</p><p>These are the three chief findings of our CEDA-based data analysis and the chief characteristics of the resultant ICiD of locality nature.</p><p>The above three chief findings are even more evident in the results from the branch R1R2. This branch consists of 17 major 3-feature-categories of negative disease risk. There are in total 51 subjects having 8 or more memberships. Strikingly, this group of subjects has zero odds. There are 280 subjects who have at least 3, but no more than 7, memberships. This group's odds is 13/267 = 0.0487, and its odds-ratio is 0.0487/0.2944 = 0.1654. The probability of being diseased for subjects in this group is as low as 0.05, and the relative risk of this group to the whole subpopulation is less than one fifth. There are 253 subjects who have one or two memberships. This group's odds is 30/223 = 0.1345, so its subjects' probability of being diseased is less than 1/8. Its odds-ratio is less than 1/2. Finally, there are 1192 subjects who do not own any memberships among these 17 major 3-feature-categories. There are 361 diseased subjects among these 1192, that is, the odds is 361/(1192 - 361) = 361/831 = 0.4344, and the probability is 0.3029. The odds-ratio is 1.4755. That is, subjects with zero memberships on the R1R2 branch have about 1.5 times the disease odds of subjects within the sub-population in general.</p><p>The above topological risk-landscape based findings, of positive risk based on the branch L1R2 and of negative risk based on the branch R1R2, together clearly spell out the essential merits of the identified and confirmed high-order effects of feature-categories. 
It is somewhat revealing to see such significant results and informative patterns emerge from such simple computations. More revealing is that the number of branch-memberships becomes a synthesized variable. That is, L1R2 and R1R2 can be transformed into two very informative variables that prescribe positive and negative disease risks, respectively, more precisely than any feature-sets. This consequential synthesizing mechanism of risk factors is achieved without any man-made structures.</p><p>We transform the memberships in branch L1R2 and in branch R1R2 into two new variables, Syn[L1R2] and Syn[R1R2], respectively, in the following fashion. Denote a subject's total memberships on branches L1R2 and R1R2 by two variables, #[L1R2] and #[R1R2], respectively.</p><p>1. [On branch L1R2:] Syn</p><p>We then build the following odds and odds-ratio table. Let n_{4+,8-} = d_{4+,8-} + nd_{4+,8-} be the number of subjects belonging to the cell (Syn[L1R2], Syn[R1R2]) = (4+, 8-), where d_{4+,8-} is the number of diseased subjects and nd_{4+,8-} is the number of non-diseased subjects. This table reveals that the bivariate (Syn[L1R2], Syn[R1R2]) covers a wide spectrum of odds, so it is rather informative about the dynamics underlying Y. That is to say, the topological individual risk-landscape via the heatmap shown in Fig. <ref type="figure">12</ref> indeed captures the essential associative patterns regarding this Re-Co dynamics.</p><p>TABLE 3. Odds table of Syn[L1R2] [...] <ref type="table">4</ref>. The information content is relatively similar to that in Table <ref type="table">3</ref>.</p><p>TABLE 4. Odds table of Syn[L1]-vs-Syn[R1] division of the sub-population of heterogeneity (GenHL = 5, Age = 1).</p></div>
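<div xmlns="http://www.tei-c.org/ns/1.0"><p>The odds and odds-ratio arithmetic reported for the L1R2 and R1R2 branches can be re-traced with a short sketch. The (diseased, non-diseased) counts below are those quoted in the text for the (GenHL = 5, Age = 1) sub-population, taking 361 diseased subjects in the zero-membership R1R2 group as used in the odds formula; the function names and group labels are illustrative assumptions. The code works with unrounded odds, so its odds-ratios can differ from the printed values in the third decimal, since the text divides odds that were already shortened to four decimals.</p>
<p><![CDATA[
```python
def odds(diseased, non_diseased):
    """Disease odds d / nd for one group of subjects."""
    return diseased / non_diseased


def odds_table(groups, baseline):
    """Map each group label to (odds, odds-ratio against the baseline)."""
    base = odds(*baseline)
    return {label: (odds(d, nd), odds(d, nd) / base)
            for label, (d, nd) in groups.items()}


# (diseased, non-diseased) counts quoted in the text.
baseline = (404, 1372)
l1r2 = {
    "4 to 7": (109, 119),
    "1 to 3": (168, 397),
    "0": (127, 856),
}
r1r2 = {
    "8 or more": (0, 51),
    "3 to 7": (13, 267),
    "1 to 2": (30, 223),
    "0": (361, 831),
}

for name, groups in (("L1R2", l1r2), ("R1R2", r1r2)):
    for label, (o, ratio) in odds_table(groups, baseline).items():
        print(f"{name} {label:>9}: odds = {o:.4f}, odds-ratio = {ratio:.4f}")
```
]]></p>
<p>In both branches the group counts add up to the 404 diseased and 1372 non-diseased subjects of the sub-population, which is a quick consistency check on the reported figures.</p></div>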
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3) ATYPICAL SUBJECTS</head><p>With the [response-category]-vs-[individual risk-landscape] relational pattern information of the dynamics of Y in full display via the block-patterns embedded within the heatmap shown in Fig. <ref type="figure">12</ref>, we clearly see that there are 3 types of non-diseased subjects in the category Y = (0, 0). The three types are: 1) zero positive risk with prevalent memberships in the R1 branch; 2) some positive and some negative risk; 3) zero negative risk with prevalent memberships in the L1 branch across the 31 major 3-feature-categories. These three types are seemingly seen among the diseased subjects in the categories Y = {(0, 1), (1, 0), (1, 1)} as well. Immediately, we recognize contradicting mechanisms in the simultaneous presence of these three types of [response-category]-vs-[individual risk-landscape] relational patterns in both diseased and non-diseased categories.</p><p>Intuitively and ideally speaking, the category Y = (0, 0) should be full of type-1 subjects, or at most include some type-2 subjects, and the categories Y = {(0, 1), (1, 0), (1, 1)} should be full of type-3 subjects. But, apparently, this is not the case in the heatmap in Fig. <ref type="figure">12</ref>. Counter-intuitive and non-ideal situations are observed: a large group of type-3 subjects in the category Y = (0, 0) and a group of type-1 subjects in the categories Y = {(0, 1), (1, 0), (1, 1)}.</p><p>In particular, the large group of type-3 subjects in the category Y = (0, 0) are those who keep the odds-ratio from going much higher than 3 in the bivariate cell (Syn[L1R2], Syn[R1R2]) = (4, 0) in Table <ref type="table">3</ref> or (Syn[L1], Syn[R1]) = (4, 0) in Table <ref type="table">4</ref>. For this reason, we term such subjects ''atypical subjects'' in Y = (0, 0). Likewise, we have ''atypical subjects'' in the categories Y = {(0, 1), (1, 0), (1, 1)}. 
To better visualize the presence of such atypical subjects, we build another heatmap by lifting the response-category constraint, shown in Fig. <ref type="figure">13</ref>. It is clearly seen that subjects with similar individual risk-landscapes are grouped together, while their response-category marks are mixed. At the end of this subsection, we mention the following obvious implications of ''atypical subjects'' within the Y-vs-IRL topological relations. Though the nature of ''atypical subjects'' is yet to be discovered, the graphic displays that map out all such ''atypical subjects'' within the heatmap reveal the full [response-category]-vs-[individual risk-landscape] relations. Thus, Fig. <ref type="figure">12</ref> and Fig. <ref type="figure">13</ref> are essential computational results in data analysis. From the perspective of ''atypical subjects'', Fig. <ref type="figure">15</ref> and Fig. <ref type="figure">16</ref> together reveal evidence of their existence in diseased and non-diseased categories within both Age = 3 and Age = 4. The existence of ''atypical subjects'' strongly indicates the importance of recognizing the true goal of data analysis as constructing graphic displays that explicitly exhibit detailed individual risk-landscapes. With these heatmaps, we can visualize the topological relations pertaining to each subject's neighborhoods on the same platform, on which subject-specific similarity and dissimilarity become natural and obvious. 
Such topologies immediately link to many essential scientific issues, such as how to design randomized trials, how to design experiments for finding subjects of extremely high (or low) disease risk, how to properly understand causal effects in observational studies, and many others.</p><p>At the end of this subsection, we again present the synthesized bivariate (Syn[L1], Syn[R1]) as an identified informative summarizing 2D statistic for the complex disease dynamics of Y = (STK, HD) within the sub-populations Age = 3 and 4, respectively. The two corresponding contingency tables, Table <ref type="table">5</ref> for Age = 3 and Table <ref type="table">6</ref> for Age = 4, are seen to capture a wide spectrum of disease risk potentials and characteristics, from very low disease risk to rather high disease risk, in terms of cell-specific odds compared with the odds pertaining to the sub-population. It is evident that the so-called imbalance phenomenon is no longer present in these two sub-populations. However, the presence of ''atypical subjects'' remains. That is, any predictive approach will suffer very high error rates and be impractical. Once again, this presence of ''atypical subjects'' is a clear and strong reminder that some important risk-factors might still be missing in this Kaggle version of the BRFSS data set.</p><p>[...] corresponding horizontal series of blocks. This block-series collectively reveals intricate relational information of the response-variable Y = (STK, HD). Conversely, any cluster of high-order feature-categories unveils complex structural dependency among all involved feature-variables. Such complexity in structural dependency remains unexplored in the literature.</p><p>Additionally, this heatmap-based individual risk-landscape topology highlights typical subjects versus ''atypical subjects'' within each response-category. 
The contrasting presence of typical versus atypical subjects across distinct response-categories underscores the importance of computing and displaying pattern information as the primary objective of data analysis. This elucidates why errors may occur in predictive approaches within the fields of ML and Statistics.</p><p>Throughout this paper, we employ CEDA to address the extremely important and fundamental data analysis issue: What is the information content in data (ICiD)? We propose a primary approach to answer this question: a heatmap of individual risk-landscapes of high orders, which serves as the chief component of sub-population specific ICiD in this paper. We also endeavor to synthesize all findings concerning the four sub-populations of GenHL = 5 and Age = 1, 3, 4, and 5 to gain a true understanding of the joint disease dynamics of multiple chronic diseases. With such results in hand, we are confident that our computational approach is a critical method for studying the BRFSS as a complex system and its evolution over many years.</p><p>Although this paper focuses on discussions within a categorical data world, the entire computational framework is applicable to all structured databases. Any database represented in a matrix format inherently contains a categorical data world. Specifically, any quantitative variable can be categorized through its histogram as an approximately sufficient statistic. The joint high-dimensional histogram of such variables would retain almost all essential information content in data (ICiD). Thus, by accepting a slight amount of information loss when relinquishing ''smoothness'', the gains from applying CEDA are tremendous from many perspectives. The foremost perspective is the explicit interpretability of ICiD, which is completely free from all man-made structures and assumptions. Therefore, all CEDA results are authentic. 
This fact leads to another essential perspective in scientific data analysis: unlike symmetry-based correlation, which may provide a distorted marginal version of associative information, our directional associative patterns are of a local nature. These patterns are visible, explicit, realistic, intuitive, and most importantly explainable.</p><p>In conclusion, all heatmaps presented in this paper explicitly underscore the central role of classification in the study of complex systems. The scientific value of a complex system lies in comprehensive explanations of its dynamic nature, which is expressed through all study subjects. Therefore, any classification task must be realistically approached by revealing and showcasing intrinsic information related to each individual study subject. In this manner, we convincingly demonstrate in this paper that our CEDA paradigm can effectively explore complex systems and uncover their dynamics captured in ICiD. Furthermore, we illustrate that CEDA is capable of accommodating highly complex response variables and relatively small sample sizes.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2024" xml:id="foot_0"><p>The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 12, 2024</p></note>
		</body>
		</text>
</TEI>
