Title: Unraveling Hidden Major Factors by Breaking Heterogeneity into Homogeneous Parts within Many-System Problems
For a large ensemble of complex systems, a Many-System Problem (MSP) studies how heterogeneity constrains and hides structural mechanisms, and how to uncover hidden major factors from homogeneous parts. All member systems in an MSP share common governing principles of dynamics, but differ in idiosyncratic characteristics. A typical dynamic is found underlying response features with respect to covariate features of quantitative or qualitative data types. Neither all-system-as-one-whole nor individual system-specific functional structures are assumed in such response-vs-covariate (Re–Co) dynamics. We developed a computational protocol for identifying various collections of major factors of various orders underlying Re–Co dynamics. We first demonstrate the immanent effects of heterogeneity among member systems, which constrain compositions of major factors and even hide essential ones. Secondly, we show that fuller collections of major factors are discovered by breaking heterogeneity into many homogeneous parts. This process further realizes Anderson’s “More is Different” phenomenon. We employ the categorical nature of all features and develop a Categorical Exploratory Data Analysis (CEDA)-based major factor selection protocol. Information theoretical measurements (conditional mutual information and entropy) are heavily used in two selection criteria: C1 (confirmable) and C2 (irreplaceable). All conditional entropies are evaluated through contingency tables with algorithmically computed reliability against the finite sample phenomenon. We study one artificially designed MSP and then two real collectives of Major League Baseball (MLB) pitching dynamics with 62 slider pitchers and 199 fastball pitchers, respectively. Finally, our MSP data analyzing techniques are applied to resolve a scientific issue related to the Rosenberg Self-Esteem Scale.
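The abstract says conditional entropies and conditional mutual information are evaluated through contingency tables. The following is a minimal sketch of that general idea, not the authors' protocol: the function names, the use of natural logarithms, and the `[z, x, y]` indexing of the count array are all assumptions made for illustration, and the reliability check against finite-sample effects is not reproduced.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in nats) of a distribution given as raw counts."""
    counts = np.asarray(counts, dtype=float).ravel()
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log(p)).sum())

def conditional_entropy(table):
    """H(Y | X) from a 2-D contingency table: rows = X categories, columns = Y categories."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Weighted average of the row-wise entropies, weights = row proportions.
    return float(sum((row.sum() / n) * entropy(row) for row in table if row.sum() > 0))

def conditional_mutual_information(cube):
    """I(Y; X | Z) = H(Y | Z) - H(Y | X, Z) from 3-D counts indexed [z, x, y] (assumed layout)."""
    cube = np.asarray(cube, dtype=float)
    h_y_given_z = conditional_entropy(cube.sum(axis=1))                   # rows = Z
    h_y_given_xz = conditional_entropy(cube.reshape(-1, cube.shape[2]))   # rows = (Z, X) pairs
    return h_y_given_z - h_y_given_xz

# Hypothetical counts: Y copies X exactly, regardless of Z.
cube = np.zeros((2, 2, 2))
cube[:, 0, 0] = 5
cube[:, 1, 1] = 5
print(conditional_mutual_information(cube))
```

In this toy case X removes all remaining uncertainty about Y once Z is known, so the conditional mutual information equals H(Y | Z). With uniform counts it would be zero.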
Award ID(s):
1934568
PAR ID:
10349698
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Entropy
Volume:
24
Issue:
2
ISSN:
1099-4300
Page Range / eLocation ID:
170
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Without assuming any functional or distributional structure, we select collections of major factors embedded within response-versus-covariate (Re-Co) dynamics via selection criteria [C1: confirmable] and [C2: irreplaceable], which are based on information theoretic measurements. The two criteria are constructed based on the computing paradigm called Categorical Exploratory Data Analysis (CEDA) and linked to Wiener–Granger causality. All the information theoretical measurements, including conditional mutual information and entropy, are evaluated through the contingency table platform, which primarily rests on the categorical nature within all involved features of any data types: quantitative or qualitative. Our selection task identifies one chief collection, together with several secondary collections of major factors of various orders underlying the targeted Re-Co dynamics. Each selected collection is checked with algorithmically computed reliability against the finite sample phenomenon, and so is each member’s major factor individually. The developments of our selection protocol are illustrated in detail through two experimental examples: a simple one and a complex one. We then apply this protocol to two data sets pertaining to two somewhat related but distinct pitching dynamics of two pitch types: slider and fastball. In particular, we refer to a specific Major League Baseball (MLB) pitcher and we consider data of multiple seasons.
  2. We develop Categorical Exploratory Data Analysis (CEDA) with mimicking to explore and exhibit the complexity of information content that is contained within any data matrix: categorical, discrete, or continuous. Such complexity is shown through visible and explainable serial multiscale structural dependency with heterogeneity. CEDA is developed upon all features’ categorical nature via histogram and it is guided by all features’ associative patterns (order-2 dependence) in a mutual conditional entropy matrix. Higher-order structural dependency of k(≥3) features is exhibited through block patterns within heatmaps that are constructed by permuting contingency-kD-lattices of counts. By growing k, the resultant heatmap series contains global and large scales of structural dependency that constitute the data matrix’s information content. When involving continuous features, the principal component analysis (PCA) extracts fine-scale information content from each block in the final heatmap. Our mimicking protocol coherently simulates this heatmap series by preserving global-to-fine scales structural dependency. Upon every step of mimicking process, each accepted simulated heatmap is subject to constraints with respect to all of the reliable observed categorical patterns. For reliability and robustness in sciences, CEDA with mimicking enhances data visualization by revealing deterministic and stochastic structures within each scale-specific structural dependency. For inferences in Machine Learning (ML) and Statistics, it clarifies, upon which scales, which covariate feature-groups have major-vs.-minor predictive powers on response features. For the social justice of Artificial Intelligence (AI) products, it checks whether a data matrix incompletely prescribes the targeted system.
  3. All features of any data type are universally equipped with categorical nature revealed through histograms. A contingency table framed by two histograms affords directional and mutual associations based on rescaled conditional Shannon entropies for any feature-pair. The heatmap of the mutual association matrix of all features becomes a roadmap showing which features are highly associative with which features. We develop our data analysis paradigm called categorical exploratory data analysis (CEDA) with this heatmap as a foundation. CEDA is demonstrated to provide new resolutions for two topics: multiclass classification (MCC) with one single categorical response variable and response manifold analytics (RMA) with multiple response variables. We compute visible and explainable information contents with multiscale and heterogeneous deterministic and stochastic structures in both topics. MCC involves all feature-group specific mixing geometries of labeled high-dimensional point-clouds. Upon each identified feature-group, we devise an indirect distance measure, a robust label embedding tree (LET), and a series of tree-based binary competitions to discover and present asymmetric mixing geometries. Then, a chain of complementary feature-groups offers a collection of mixing geometric pattern-categories with multiple perspective views. RMA studies a system’s regulating principles via multiple dimensional manifolds jointly constituted by targeted multiple response features and selected major covariate features. This manifold is marked with categorical localities reflecting major effects. Diverse minor effects are checked and identified across all localities for heterogeneity. Both MCC and RMA information contents are computed for data’s information content with predictive inferences as by-products. We illustrate CEDA developments via Iris data and demonstrate its applications on data taken from the PITCHf/x database. 
  4. ABSTRACT Aim: Ecological theory suggests that dispersal limitation and selection by climatic factors influence bacterial community assembly at a continental scale, yet the conditions governing the relative importance of each process remain unclear. The carnivorous pitcher plant Sarracenia purpurea provides a model aquatic microecosystem to assess bacterial communities across the host plant's north–south range in North America. This study determined the relative influences of dispersal limitation and environmental selection on the assembly of bacterial communities inhabiting S. purpurea pitchers at the continental scale. Location: Eastern United States and Canada. Time Period: 2016. Major Taxa Studied: Bacteria inhabiting S. purpurea pitchers. Methods: Pitcher morphology, fluid, inquilines and prey were measured, and pitcher fluid underwent DNA sequencing for bacterial community analysis. Null modelling of β‐diversity provided estimates for the contributions of selection and dispersal limitation to community assembly, complemented by an examination of spatial clustering of individuals. Phylogenetic and ecological associations of co‐occurrence network module bacteria were determined by assessing the phylogenetic diversity and habitat preferences of member taxa. Results: Dispersal limitation was evident from between‐site variation and spatial aggregation of individual bacterial taxa in the S. purpurea pitcher system. Selection pressure was weak across the geographic range, yet network module analysis indicated environmental selection within subgroups. A group of aquatic bacteria held traits under selection in warmer, wetter climates, and midge abundance was associated with selection for traits held by a group of saprotrophs. Processes that increased pitcher fluid volume weakened selection in one module, possibly by supporting greater bacterial dispersal. Conclusion: Dispersal limitation governed bacterial community assembly in S. purpurea pitchers at a continental scale (74% of between‐site comparisons) and was significantly greater than selection across the range. Network modules showed evidence for selection, demonstrating that multiple processes acted concurrently in bacterial community assembly at the continental scale.
  5. ABSTRACT Graphical models are powerful tools to investigate complex dependency structures in high-throughput datasets. However, most existing graphical models make one of two canonical assumptions: (i) a homogeneous graph with a common network for all subjects or (ii) an assumption of normality, especially in the context of Gaussian graphical models. Both assumptions are restrictive and can fail to hold in certain applications such as proteomic networks in cancer. To this end, we propose an approach termed robust Bayesian graphical regression (rBGR) to estimate heterogeneous graphs for non-normally distributed data. rBGR is a flexible framework that accommodates non-normality through random marginal transformations and constructs covariate-dependent graphs to accommodate heterogeneity through graphical regression techniques. We formulate a new characterization of edge dependencies in such models called conditional sign independence with covariates, along with an efficient posterior sampling algorithm. In simulation studies, we demonstrate that rBGR outperforms existing graphical regression models for data generated under various levels of non-normality in both edge and covariate selection. We use rBGR to assess proteomic networks in lung and ovarian cancers to systematically investigate the effects of immunogenic heterogeneity within tumors. Our analyses reveal several important protein–protein interactions that are differentially associated with the immune cell abundance; some corroborate existing biological knowledge, whereas others are novel findings. 
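Item 3 in the list above describes a contingency table framed by two histograms, from which directional and mutual associations are derived via rescaled conditional Shannon entropies. A rough sketch of how such a computation might look is given here. The equal-width binning, the rescaling H(Y | X) / H(Y), and the averaging of the two directions into a mutual value are assumptions made for illustration; the listing does not state the exact definitions, and all function names are hypothetical.

```python
import numpy as np

def categorize(values, bins=4):
    """Map a numeric feature to bin labels 0..bins-1 via an equal-width histogram (assumed choice)."""
    values = np.asarray(values, dtype=float)
    edges = np.histogram_bin_edges(values, bins=bins)
    # Use interior edges only so the maximum value falls in the last bin.
    return np.digitize(values, edges[1:-1])

def contingency(x_labels, y_labels):
    """Count table with rows = categories of x and columns = categories of y."""
    _, xi = np.unique(x_labels, return_inverse=True)
    _, yi = np.unique(y_labels, return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)
    return table

def _entropy(counts):
    """Shannon entropy (nats) of a non-empty 1-D count vector."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def directed_association(table):
    """Assumed rescaling H(Y | X) / H(Y): near 0 when X determines Y, near 1 when unrelated."""
    h_y = _entropy(table.sum(axis=0))
    if h_y == 0:
        return 0.0
    n = table.sum()
    h_y_given_x = sum(row.sum() / n * _entropy(row) for row in table if row.sum() > 0)
    return h_y_given_x / h_y

def mutual_association(table):
    """Assumed symmetric summary: average of the two directed values."""
    return 0.5 * (directed_association(table) + directed_association(table.T))

# Hypothetical data: y is a noiseless copy of x, so the association value is 0.
x = np.arange(8.0)
table = contingency(categorize(x), categorize(x))
print(mutual_association(table))
```

Computing `mutual_association` for every feature pair would give a matrix of the kind the abstract describes as a heatmap roadmap, under the assumptions stated above.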