Title: Specific and Shared Causal Relation Modeling and Mechanism-based Clustering
State-of-the-art approaches to causal discovery usually assume a fixed underlying causal model. However, causal models often vary across domains or subjects, due to possibly omitted factors that affect the quantitative causal effects. As a typical example, causal connectivity in the brain network has been reported to vary across individuals, with significant differences across groups of people, such as individuals with autism and typical controls. In this paper, we develop a unified framework for causal discovery and mechanism-based group identification. In particular, we propose a specific and shared causal model (SSCM), which takes into account the variability of causal relations across individuals/groups and leverages their commonalities to achieve statistically reliable estimation. The learned SSCM gives the specific causal knowledge for each individual as well as the general trend over the population. In addition, the estimated model directly provides the group information of each individual. Experimental results on synthetic and real-world data demonstrate the efficacy of the proposed method.
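The record does not include the paper's implementation. As a rough, hypothetical illustration of the shared/specific decomposition and mechanism-based grouping described in the abstract, the toy sketch below fits per-subject linear causal coefficients (with the causal order assumed known, unlike in the paper), splits them into a population-shared component and subject-specific deviations, and clusters subjects by those deviations; all names, the simulated data, and the k-means step are assumptions, not the authors' SSCM algorithm.

```python
# Toy illustration (not the authors' SSCM): per-subject causal coefficients are
# decomposed into a shared component plus subject-specific deviations, and
# subjects are grouped by their deviations (mechanism-based clustering).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_subjects, n_samples = 40, 200

# Two hypothetical latent groups with a different causal strength for X1 -> X2.
true_group = rng.integers(0, 2, size=n_subjects)
coefs = []
for g in true_group:
    b12 = 0.8 if g == 0 else -0.3          # varying mechanism
    b13, b23 = 0.5, 0.4                     # shared mechanisms
    x1 = rng.normal(size=n_samples)
    x2 = b12 * x1 + 0.3 * rng.normal(size=n_samples)
    x3 = b13 * x1 + b23 * x2 + 0.3 * rng.normal(size=n_samples)
    X = np.column_stack([x1, x2, x3])
    # Per-subject coefficients by regressing each node on its parents
    # (the causal order is assumed known here, unlike in the paper).
    b12_hat = np.linalg.lstsq(X[:, [0]], X[:, 1], rcond=None)[0]
    b3_hat = np.linalg.lstsq(X[:, [0, 1]], X[:, 2], rcond=None)[0]
    coefs.append(np.concatenate([b12_hat, b3_hat]))

coefs = np.array(coefs)                     # subjects x mechanisms
shared = coefs.mean(axis=0)                 # shared causal relations
specific = coefs - shared                   # subject-specific deviations

# Mechanism-based grouping: cluster subjects by their specific deviations.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(specific)
agreement = max((labels == true_group).mean(), 1 - (labels == true_group).mean())
print("shared coefficients:", np.round(shared, 2))
print("clustering agreement with true groups (up to relabeling):", agreement)
```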
Award ID(s):
1829681
NSF-PAR ID:
10125766
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Pollard, Tom J. (Ed.)
    Modern predictive models require large amounts of data for training and evaluation, the absence of which may result in models that are specific to certain locations, their populations, and local clinical practices. Yet best practices for clinical risk prediction models have not yet considered such challenges to generalizability. Here we ask whether population- and group-level performance of mortality prediction models varies significantly when the models are applied to hospitals or geographies different from the ones in which they were developed, and what characteristics of the datasets explain the performance variation. In this multi-center cross-sectional study, we analyzed electronic health records from 179 hospitals across the US with 70,126 hospitalizations from 2014 to 2015. The generalization gap, defined as the difference in model performance metrics across hospitals, is computed for the area under the receiver operating characteristic curve (AUC) and the calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups. Data were also analyzed using the causal discovery algorithm "Fast Causal Inference," which infers paths of causal influence while identifying potential influences associated with unmeasured variables. When transferring models across hospitals, the AUC at the test hospital ranged from 0.777 to 0.832 (1st-3rd quartile or IQR; median 0.801); the calibration slope from 0.725 to 0.983 (IQR; median 0.853); and the disparity in false negative rates from 0.046 to 0.168 (IQR; median 0.092). The distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The race variable also mediated differences in the relationship between clinical variables and mortality by hospital/region. In conclusion, group-level performance should be assessed during generalizability checks to identify potential harms to the groups. Moreover, to develop methods that improve model performance in new environments, a better understanding and documentation of the provenance of data and health processes are needed to identify and mitigate sources of variation.
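    A minimal sketch of the two transfer metrics named above, the AUC generalization gap and the false-negative-rate disparity across groups, on synthetic stand-in data; it is illustrative only and is not the study's code, and the logistic-regression model and all variable names are assumptions.

```python
# Minimal sketch (not the study's code): measure the drop in AUC when a model
# trained at one hospital is applied at another, plus the gap in false negative
# rates between two groups at the external site.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_site(n, shift):
    # Synthetic stand-in for one hospital's vitals/labs and a mortality label.
    X = rng.normal(loc=shift, size=(n, 5))
    y = (X @ np.array([0.9, -0.6, 0.4, 0.0, 0.3]) + rng.normal(size=n) > 0.5)
    group = rng.integers(0, 2, size=n)      # illustrative binary group variable
    return X, y.astype(int), group

X_dev, y_dev, _ = make_site(4000, shift=0.0)
X_ext, y_ext, g_ext = make_site(4000, shift=0.4)   # distribution-shifted site

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# In-sample AUC at the development site is shown only for contrast.
auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
auc_ext = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print("AUC generalization gap:", round(auc_dev - auc_ext, 3))

pred = model.predict(X_ext)
fnr = [((pred == 0) & (y_ext == 1) & (g_ext == g)).sum()
       / max(((y_ext == 1) & (g_ext == g)).sum(), 1) for g in (0, 1)]
print("false-negative-rate disparity:", round(abs(fnr[0] - fnr[1]), 3))
```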
  2. Abstract

    Autism Spectrum Disorder (ASD) is characterized as a neurodevelopmental disorder with a heterogeneous nature, influenced by genetics and exhibiting diverse clinical presentations. In this study, we dissect ASD into its behavioral components, mirroring the diagnostic process used in clinical settings. Morphological features are extracted from the magnetic resonance imaging (MRI) scans in the publicly available ABIDE II dataset to identify the most discriminative features that differentiate ASD within various behavioral domains. Each subject is then categorized as having severe, moderate, or mild ASD, or typical neurodevelopment (TD), based on the behavioral domains of the Social Responsiveness Scale (SRS). Multiple artificial intelligence (AI) models are utilized for feature selection and for classifying each ASD severity and behavioral group. A multivariate feature selection algorithm, investigating four different classifiers with linear and non-linear hypotheses, is applied iteratively while shuffling the training-validation subjects to find the set of cortical regions with a statistically significant association with ASD. Six classifiers are then optimized and trained on the selected features using 5-fold cross-validation for severity classification within each behavioral group. Our AI-based model achieved an average accuracy of 96%, computed as the mean accuracy across the top-performing AI models for feature selection and severity classification across the different behavioral groups. The proposed model can also accurately differentiate between the functionalities of specific brain regions, such as the left and right caudal middle frontal regions. In short, we propose an AI-based model that dissects ASD into behavioral components; for each component, it identifies the brain regions associated with ASD and uses those regions for diagnosis. The proposed system can increase the speed and accuracy of the diagnostic process and lead to improved outcomes for individuals with ASD, highlighting the potential of AI in this area.
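    As a hedged illustration of the high-level pipeline described above (multivariate feature selection followed by 5-fold cross-validated severity classification), the sketch below uses recursive feature elimination with a linear SVM on synthetic stand-in data; it is not the study's model, and the chosen selector, classifier, and feature counts are assumptions.

```python
# Illustrative sketch only: select features with a multivariate selector, then
# score a multi-class severity classifier with 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in for per-region cortical morphometry; the 4 classes play the role of
# severe/moderate/mild ASD vs. typical development (labels here are synthetic).
X, y = make_classification(n_samples=400, n_features=120, n_informative=15,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LinearSVC(dual=False, max_iter=5000), n_features_to_select=20)),
    ("clf", LinearSVC(dual=False, max_iter=5000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("5-fold accuracy: %.2f ± %.2f" % (scores.mean(), scores.std()))
```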

     
  3. Explaining why animal groups vary in size is a fundamental problem in behavioral ecology. One hypothesis is that life-history differences among individuals lead to sorting of phenotypes into groups of different sizes where each individual does best. This hypothesis predicts that individuals should be relatively consistent in their use of particular group sizes across time. Little is known about whether animals' choice of group size is repeatable across their lives, especially in long-lived species. We studied consistency in choice of breeding-colony size in colonially nesting cliff swallows (Petrochelidon pyrrhonota) in western Nebraska, United States, over a 32-year period, following 6,296 birds for at least four breeding seasons. Formal repeatability of size choice for the population was about 0.41. About 45% of individuals were relatively consistent in choice of colony size, while about 40% varied widely in the colony size they occupied. Birds using the smaller and larger colonies appeared more consistent in size use than birds occupying more intermediate-sized colonies. Consistency in colony size was also influenced by whether a bird used the same physical colony site each year and whether the site had been fumigated to remove ectoparasites. The difference between the final and initial colony sizes for an individual, a measure of the net change in its colony size over its life, did not significantly depart from 0 for the dataset as a whole. However, different year-cohorts did show significant net change in colony size, both positive and negative, that may have reflected fluctuating selection on colony size among years based on climatic conditions. The results support phenotypic sorting as an explanation for group size variation, although cliff swallows also likely use past experience at a given site and the extent of ectoparasitism to select breeding colonies.
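    Repeatability in this sense is conventionally estimated as an intraclass correlation from variance components, R = V_individual / (V_individual + V_residual); the snippet below is a generic sketch of that calculation on simulated data, assuming a random-intercept mixed model, and is not the study's analysis.

```python
# Generic sketch (not the study's analysis): repeatability of colony-size
# choice as an intraclass correlation estimated from a random-intercept model
# fitted to repeated observations of the same individuals.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_birds, n_years = 200, 6
bird_effect = rng.normal(scale=0.8, size=n_birds)        # individual preference
rows = [{"bird": b, "log_colony_size": 4.0 + bird_effect[b] + rng.normal(scale=1.0)}
        for b in range(n_birds) for _ in range(n_years)]
df = pd.DataFrame(rows)

m = smf.mixedlm("log_colony_size ~ 1", df, groups=df["bird"]).fit()
v_ind = float(m.cov_re.iloc[0, 0])           # among-individual variance
v_res = float(m.scale)                        # residual (within-individual) variance
print("repeatability R =", round(v_ind / (v_ind + v_res), 2))
```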
  4. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging. But there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5 % of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80 % of those infected either are asymptomatic or have mild symptoms; a finding that implies that demand for advanced medical services might apply to only 20 % of the total infected. Of patients infected with Covid-19, about 15 % have severe illness and 5 % have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25 % to as high as 3.0 % (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (> 14 %) and those with coexisting conditions (10 % for those with cardiovascular disease and 7 % for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1 %. Public health efforts depend heavily on predicting how diseases such as those caused by Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. 
By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development of a model of Corona spread using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al., Modeling Ebola Spread and Using HPCC/KEL System, in: Big Data Technologies and Applications, 2016, pp. 347-385, Springer, Cham) to model Corona spread, with the goal of obtaining new results and helping to reduce the number of Corona patients. We closely collaborated with LexisNexis, a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a multi-level view of the pandemic with informative virus-spreading indicators in a timely manner. The system embeds the classical epidemiological SIR model together with spreading indicators based on a causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingesting and tracking various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents Covid-19 data on a daily, weekly, and cumulative basis, from the global level down to the county level. It also provides statistical analysis at each level, such as new cases per 100,000 population. The primary analyses, such as Contagion Risk and Infection State, are based on a causal model with a seven-day sliding window. Our work has been released as a publicly available website and has attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems platform, which is briefly described in the paper.
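The abstract states that the tracker embeds the classical SIR compartmental model; the sketch below is a minimal discrete-time SIR simulation for reference, not the tracker's implementation, and the parameter values are illustrative only.

```python
# Minimal SIR sketch (illustrative): susceptible/infected/recovered dynamics
# driven by a transmission rate beta and a recovery rate gamma.
import numpy as np

def sir(beta, gamma, s0, i0, days, n):
    """Discrete-time SIR; R0 = beta / gamma is the basic reproduction number."""
    s, i, r = s0, i0, n - s0 - i0
    traj = [(s, i, r)]
    for _ in range(days):
        new_inf = beta * s * i / n
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        traj.append((s, i, r))
    return np.array(traj)

# beta/gamma = 2 is roughly in line with the early R estimates cited above.
traj = sir(beta=0.4, gamma=0.2, s0=99_990, i0=10, days=120, n=100_000)
print("peak simultaneous infections:", int(traj[:, 1].max()))
```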
  5. Why, when, and how do stereotypes change? This paper develops a computational account based on the principles of structure learning: stereotypes are governed by probabilistic beliefs about the assignment of individuals to groups. Two aspects of this account are particularly important. First, groups are flexibly constructed based on the distribution of traits across individuals; groups are not fixed, nor are they assumed to map on to categories we have to provide to the model. This allows the model to explain the phenomena of group discovery and subtyping, whereby deviant individuals are segregated from a group, thus protecting the group’s stereotype. Second, groups are hierarchically structured, such that groups can be nested. This allows the model to explain the phenomenon of subgrouping, whereby a collection of deviant individuals is organized into a refinement of the superordinate group. The structure learning account also sheds light on several factors that determine stereotype change, including perceived group variability, individual typicality, cognitive load, and sample size. 
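    The abstract does not give the model's formal specification; as a rough stand-in for the idea that groups are inferred from the distribution of traits rather than fixed in advance, the sketch below fits a Dirichlet-process-style mixture that can leave surplus components unused, so the number of groups is effectively discovered from the data. The trait data and component cap are assumptions, not the paper's model.

```python
# Rough stand-in (not the paper's model): discover groups from the distribution
# of traits across individuals, allowing small "deviant" clusters to emerge.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(3)
# Two trait clusters plus a few deviant individuals (a candidate subgroup).
traits = np.vstack([
    rng.normal([0, 0], 0.3, size=(40, 2)),
    rng.normal([3, 3], 0.3, size=(40, 2)),
    rng.normal([3, 0], 0.3, size=(6, 2)),
])

mix = BayesianGaussianMixture(
    n_components=10,                        # upper bound; unused ones shrink away
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(traits)

labels = mix.predict(traits)
print("discovered groups:", sorted(set(labels)))
```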