Title: Discovering and deciphering relationships across disparate data modalities
If you want to estimate whether height is related to weight in humans, what would you do? You could measure the height and weight of a large number of people and then run a statistical test. Such 'independence tests' can be thought of as a screening procedure: if the two properties (height and weight) are not related, then there is no point in proceeding with further analyses.

Many different independence tests have been developed over the last 100 years. However, classical approaches often fail to accurately discern relationships in the large, complex datasets typical of modern biomedical research. For example, connectomics datasets include tens or hundreds of thousands of connections between neurons that collectively underlie how the brain performs certain tasks. Discovering and deciphering relationships from these data is currently the largest barrier to progress in these fields. Another drawback of currently used methods of independence testing is that they act as a 'black box', giving an answer without making it clear how it was calculated. This can make it difficult for researchers to reproduce their findings, a key part of confirming a scientific discovery. Vogelstein et al. therefore sought to develop a method of performing independence tests on large datasets that can easily be both applied and interpreted by practicing scientists.

The method developed by Vogelstein et al., called Multiscale Graph Correlation (MGC, pronounced 'magic'), combines recent developments in hypothesis testing, machine learning, and data science. As a result, MGC typically requires only one half to one third of the sample size that previously proposed methods need to analyze large, complex datasets. Moreover, MGC also indicates the nature of the relationship between different properties; for example, whether it is linear or not. Testing MGC on real biological data, including a cancer dataset and a human brain imaging dataset, revealed that it is more effective at finding possible relationships than other commonly used independence tests. MGC was also the only method that explained how it found those relationships.

MGC will enable relationships to be found in data across many fields of inquiry, and not only in biology. Scientists, policy analysts, data journalists, and corporate data scientists could all use MGC to learn about the relationships present in their data. To that end, Vogelstein et al. have made the code open source in MATLAB, R, and Python.
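As a concrete illustration, MGC is available in Python through SciPy (scipy.stats.multiscale_graphcorr) as well as through the authors' open-source packages. The minimal sketch below runs the test on simulated data; the quadratic relationship and all variable names are our own, not from the paper.

```python
# Illustrative only: MGC on a simulated nonlinear (quadratic) relationship.
# Requires SciPy >= 1.4; the simulated data are our own invention.
import numpy as np
from scipy.stats import multiscale_graphcorr

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, size=n)           # one property, e.g. "height"
y = x**2 + 0.1 * rng.normal(size=n)      # a nonlinearly related property

# reps is the number of permutations used to compute the p-value.
stat, pvalue, mgc_dict = multiscale_graphcorr(x, y, reps=1000, random_state=0)
print(f"MGC statistic = {stat:.3f}, p-value = {pvalue:.4f}")
# mgc_dict["opt_scale"] reports the optimal neighbourhood scale, which is
# what lets MGC indicate whether a relationship is global (e.g. linear)
# or only local (e.g. nonlinear).
```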
Award ID(s):
1921310 1712947
PAR ID:
10098512
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
eLife
Volume:
8
ISSN:
2050-084X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Understanding the intricacies of the brain often requires spotting and tracking specific neurons over time and across different individuals. For instance, scientists may need to precisely monitor the activity of one neuron even as the brain moves and deforms; or they may want to find universal patterns by comparing signals from the same neuron across different individuals. Both tasks require matching which neuron is which in different images and amongst a constellation of cells. This is theoretically possible in certain 'model' animals where every single neuron is known and carefully mapped out. Still, it remains challenging: neurons move relative to one another as the animal changes posture, and the position of a cell also differs slightly between individuals. Sophisticated computer algorithms are increasingly used to tackle this problem, but they are far too slow to track neural signals as real-time experiments unfold. To address this issue, Yu et al. designed a new algorithm based on the Transformer, an artificial neural network originally used to spot relationships between words in sentences. To learn relationships between neurons, the algorithm was fed hundreds of thousands of 'semi-synthetic' examples of constellations of neurons. Instead of being painstakingly collated from actual experimental data, these datasets were created by a simulator based on a few simple measurements. Testing the new algorithm on the tiny worm Caenorhabditis elegans revealed that it was faster and more accurate than existing methods, finding corresponding neurons in about 10 ms. The work by Yu et al. demonstrates the power of using simulations rather than experimental data to train artificial networks. The resulting algorithm can be used immediately to help study how the brain of C. elegans makes decisions or controls movements. Ultimately, this research could also aid the development of brain-machine interfaces.
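As a rough sketch of the idea (not the authors' actual architecture), the snippet below embeds two constellations of neuron positions with a small Transformer encoder and matches them by solving an assignment problem. In the real system the encoder is trained on the semi-synthetic examples described above; here the weights are random and only the data flow is illustrated, with all sizes chosen arbitrarily.

```python
# Illustrative sketch: matching neurons between two point clouds with a
# Transformer encoder. Sizes and the matching step are our own choices.
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class NeuronMatcher(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)   # lift 3-D positions to d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)

    def forward(self, pts):                  # pts: (batch, n_neurons, 3)
        return self.encoder(self.embed(pts)) # per-neuron embeddings

model = NeuronMatcher()
template = torch.rand(1, 50, 3)                  # "atlas" constellation
moved = template + 0.02 * torch.randn(1, 50, 3)  # deformed posture

# (In practice the encoder would be trained on semi-synthetic constellations;
# untrained random weights here merely demonstrate the pipeline.)
with torch.no_grad():
    za, zb = model(template)[0], model(moved)[0]

# Negative cosine similarity as a cost, solved as an optimal assignment.
cost = -(za @ zb.T) / (za.norm(dim=1)[:, None] * zb.norm(dim=1)[None, :])
row, col = linear_sum_assignment(cost.numpy())
print("matched pairs:", list(zip(row.tolist(), col.tolist()))[:5])
```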
  2. Summary: Chatterjee (2021) introduced a simple new rank correlation coefficient that has attracted much attention recently. The coefficient has the unusual appeal that it not only estimates a population quantity first proposed by Dette et al. (2013) that is zero if and only if the underlying pair of random variables is independent, but also is asymptotically normal under independence. This paper compares Chatterjee’s new correlation coefficient with three established rank correlations that also facilitate consistent tests of independence, namely Hoeffding’s $$D$$, Blum–Kiefer–Rosenblatt’s $$R$$, and Bergsma–Dassios–Yanagimoto’s $$\tau^*$$. We compare the computational efficiency of these rank correlation coefficients in light of recent advances, and investigate their power against local rotation and mixture alternatives. Our main results show that Chatterjee’s coefficient is unfortunately rate-suboptimal compared to $$D$$, $$R$$ and $$\tau^*$$. The situation is more subtle for a related earlier estimator of Dette et al. (2013). These results favour $$D$$, $$R$$ and $$\tau^*$$ over Chatterjee’s new correlation coefficient for the purpose of testing independence.
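For reference, Chatterjee's coefficient has a strikingly simple form in the no-ties case: sort the pairs by $$x$$, take the ranks $$r_i$$ of the corresponding $$y$$ values, and compute $$\xi_n = 1 - 3\sum_{i=1}^{n-1}|r_{i+1}-r_i|/(n^2-1)$$. A minimal NumPy implementation, on simulated data of our own choosing:

```python
# Chatterjee's rank correlation xi (no-ties case), following the formula
# in Chatterjee (2021); computable in O(n log n) via two sorts.
import numpy as np

def chatterjee_xi(x, y):
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                      # sort the pairs by x
    r = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(chatterjee_xi(x, np.sin(x)))              # large: y is a function of x
print(chatterjee_xi(x, rng.normal(size=1000)))  # near 0 under independence
```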
  3. The human brain contains billions of cells called neurons that rapidly carry information from one part of the brain to another. Progress in medical research and healthcare is hindered by the difficulty in understanding precisely which neurons are active at any given time. New brain imaging techniques and genetic tools allow researchers to track the activity of thousands of neurons in living animals over many months. However, these experiments produce large volumes of data that researchers currently have to analyze manually, which can take a long time and generate irreproducible results. There is a need to develop new computational tools to analyze such data. The new tools should run on standard computers rather than requiring specialist equipment, which would restrict their use to particularly well-funded research teams. Ideally, the tools should also operate in real time, since several experimental and therapeutic scenarios, such as the control of robotic limbs, require this. To address this need, Giovannucci et al. developed a new software package called CaImAn to analyze brain images on a large scale. First, the team developed algorithms suitable for analyzing large datasets on laptops and other standard computing equipment. These algorithms were then adapted to operate online in real time. To test how well the new software performs against manual analysis by human researchers, Giovannucci et al. asked several trained human annotators to identify active neurons that were round or donut-shaped in several sets of imaging data from mouse brains. Each set of data was independently analyzed by three or four researchers, who then discussed any neurons they disagreed on to generate a 'consensus annotation'. Giovannucci et al. then used CaImAn to analyze the same sets of data and compared the results to the consensus annotations. This demonstrated that CaImAn is nearly as good as human researchers at identifying active neurons in brain images. CaImAn provides a quicker method to analyze large sets of brain imaging data and is currently used by over a hundred laboratories across the world. The software is open source, meaning that it is freely available and that users are encouraged to customize it and collaborate with other users to develop it further.
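The consensus-based evaluation described above is easy to sketch. The code below is illustrative only: the toy masks, the IoU threshold, and the greedy matching are our own simplifications, not CaImAn's actual benchmarking code.

```python
# Illustrative: scoring detected neuron masks against a human consensus
# annotation via intersection-over-union (IoU) matching.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def match_score(detected, consensus, thresh=0.5):
    """Greedy one-to-one matching; returns precision, recall, F1."""
    used, tp = set(), 0
    for d in detected:
        best, best_j = 0.0, None
        for j, c in enumerate(consensus):
            if j not in used and iou(d, c) > best:
                best, best_j = iou(d, c), j
        if best >= thresh:
            used.add(best_j)
            tp += 1
    prec, rec = tp / len(detected), tp / len(consensus)
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1

# Toy demo: a detected mask shifted 2 pixels from the consensus mask.
fov = (50, 50)
truth = np.zeros(fov, bool); truth[10:20, 10:20] = True
found = np.zeros(fov, bool); found[12:22, 10:20] = True
print(match_score([found], [truth]))   # IoU = 0.67 -> counted as a match
```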
  4. Abstract: Heterogeneity in brain activity can give rise to heterogeneity in behavior, which in turn makes up our distinctive characteristics as individuals. Studying the path from brain to behavior, however, often requires making assumptions about how similarity in behavior scales with similarity in brain activity. Here, we expand upon recent work (Finn et al., 2020), which proposes a theoretical framework for testing the validity of such assumptions. Using intersubject representational similarity analysis in two independent movie-watching functional MRI (fMRI) datasets, we probe how brain-behavior relationships vary as a function of behavioral domain and participant sample. We find evidence that, in some cases, the neural similarity of two individuals is not correlated with behavioral similarity. Rather, individuals with higher behavioral scores are more similar to other high scorers, whereas individuals with lower behavioral scores are dissimilar from everyone else. Ultimately, our findings motivate a more extensive investigation of both the structure of brain-behavior relationships and the tacit assumption that people who behave similarly will demonstrate shared patterns of brain activity.
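A minimal sketch of intersubject representational similarity analysis may help make this concrete. The 'nearest neighbour' and 'Anna Karenina' similarity models follow the framework of Finn et al. (2020); the data, dimensions, and variable names are invented for the example.

```python
# Illustrative IS-RSA on simulated data: correlate a subject-by-subject
# neural similarity matrix with two candidate behavioural similarity models.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_sub, n_feat = 30, 200
neural = rng.normal(size=(n_sub, n_feat))  # subject x brain-feature matrix
score = rng.uniform(size=n_sub)            # one behavioural score per subject

neural_sim = np.corrcoef(neural)           # subject-by-subject similarity

# Nearest-neighbour model: similar scores imply similar brains.
nn_model = -np.abs(score[:, None] - score[None, :])
# Anna Karenina model: similarity tracks the *mean* of the two scores,
# so high scorers look alike and low scorers are dissimilar to everyone.
ak_model = (score[:, None] + score[None, :]) / 2

iu = np.triu_indices(n_sub, k=1)           # vectorise the upper triangles
for name, model in [("nearest neighbour", nn_model),
                    ("Anna Karenina", ak_model)]:
    rho, p = spearmanr(neural_sim[iu], model[iu])
    print(f"{name}: rho = {rho:.3f}")
# (In practice significance is assessed with a Mantel-style permutation test
# rather than the parametric p-value.)
```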
  5. Summary: Identifying dependency in multivariate data is a common inference task that arises in numerous applications. However, existing nonparametric independence tests typically require computation that scales at least quadratically with the sample size, making it difficult to apply them in the presence of massive sample sizes. Moreover, resampling is usually necessary to evaluate the statistical significance of the resulting test statistics at finite sample sizes, further worsening the computational burden. We introduce a scalable, resampling-free approach to testing the independence between two random vectors by breaking down the task into simple univariate tests of independence on a collection of $$2\times 2$$ contingency tables constructed through sequential coarse-to-fine discretization of the sample space, transforming the inference task into a multiple testing problem that can be completed with almost linear complexity with respect to the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. We show that our approach can achieve strong control of the level of the testing procedure at any sample size without resampling or asymptotic approximation and establish its large-sample consistency. We demonstrate through an extensive simulation study its substantial computational advantage in comparison to existing approaches while achieving robust statistical power under various dependency scenarios, and illustrate how its divide-and-conquer nature can be exploited not just to test independence, but also to learn the nature of the underlying dependency. Finally, we demonstrate the use of our method through analysing a dataset from a flow cytometry experiment.
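A toy version of the coarse-to-fine idea can be sketched in a few lines: recursively split the sample at coordinate medians, test each resulting $$2\times 2$$ contingency table, and combine the p-values by a Bonferroni correction. This simplification (chi-square tests, a fixed depth, median splits) is ours and omits the paper's adaptive procedure and finite-sample guarantees.

```python
# Illustrative only: independence testing via recursive median splits and
# 2x2 contingency tables, combined with a Bonferroni correction.
import numpy as np
from scipy.stats import chi2_contingency

def coarse_to_fine_test(x, y, depth=3, min_n=20):
    """Collect 2x2-table p-values over recursive median splits of (x, y)."""
    pvals = []

    def recurse(idx, d):
        if d == 0 or idx.size < min_n:
            return
        mx, my = np.median(x[idx]), np.median(y[idx])
        a, b = x[idx] <= mx, y[idx] <= my
        table = np.array([[np.sum(a & b), np.sum(a & ~b)],
                          [np.sum(~a & b), np.sum(~a & ~b)]])
        if table.min() > 0:
            pvals.append(chi2_contingency(table)[1])
        for sa in (a, ~a):          # refine within each quadrant
            for sb in (b, ~b):
                recurse(idx[sa & sb], d - 1)

    recurse(np.arange(len(x)), depth)
    return min(1.0, min(pvals) * len(pvals))  # Bonferroni-combined p-value

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(coarse_to_fine_test(x, np.cos(3 * x) + 0.3 * rng.normal(size=500)))
print(coarse_to_fine_test(x, rng.normal(size=500)))  # should not be small
```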