Title: Discovering and deciphering relationships across disparate data modalities
If you want to estimate whether height is related to weight in humans, what would you do? You could measure the height and weight of a large number of people, and then run a statistical test. Such ‘independence tests’ can be thought of as a screening procedure: if the two properties (height and weight) are not related, then there is no point in proceeding with further analyses.

Over the last 100 years, many different independence tests have been developed. However, classical approaches often fail to accurately discern relationships in the large, complex datasets typical of modern biomedical research. For example, connectomics datasets include tens or hundreds of thousands of connections between neurons that collectively underlie how the brain performs certain tasks. Discovering and deciphering relationships from these data is currently the largest barrier to progress in these fields. Another drawback of currently used methods of independence testing is that they act as a ‘black box’, giving an answer without making it clear how it was calculated. This can make it difficult for researchers to reproduce their findings – a key part of confirming a scientific discovery. Vogelstein et al. therefore sought to develop a method of performing independence tests on large datasets that can easily be both applied and interpreted by practicing scientists.

The method developed by Vogelstein et al., called Multiscale Graph Correlation (MGC, pronounced ‘magic’), combines recent developments in hypothesis testing, machine learning, and data science. The result is that MGC typically requires a sample size between one half and one third that of previously proposed methods for analyzing large, complex datasets. Moreover, MGC also indicates the nature of the relationship between different properties; for example, whether it is a linear relationship or not.

Testing MGC on real biological data, including a cancer dataset and a human brain imaging dataset, revealed that it is more effective at finding possible relationships than other commonly used independence methods. MGC was also the only method that explained how it found those relationships.

MGC will enable relationships to be found in data across many fields of inquiry – and not only in biology. Scientists, policy analysts, data journalists, and corporate data scientists could all use MGC to learn about the relationships present in their data. To that end, Vogelstein et al. have made the code open source in MATLAB, R, and Python.
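As a concrete illustration of how such a test is run in practice, here is a minimal Python sketch using the MGC implementation that ships with SciPy as scipy.stats.multiscale_graphcorr. It is not taken from the record above: the toy "height"/"weight" data are made up, and the exact fields of the returned result (for example the mgc_dict keys) may differ slightly between SciPy versions.

```python
import numpy as np
from scipy.stats import multiscale_graphcorr

rng = np.random.default_rng(0)
n = 100
height = rng.normal(loc=170, scale=10, size=n)              # toy "height" sample
weight = 0.9 * height + rng.normal(loc=0, scale=8, size=n)  # toy "weight", related to height

# MGC permutation test of independence between the two samples.
# reps = number of permutation replicates used to build the null distribution.
res = multiscale_graphcorr(height, weight, reps=1000, random_state=0)

print("MGC p-value:", res.pvalue)                   # small p-value -> evidence against independence
print("optimal scale:", res.mgc_dict["opt_scale"])  # estimated optimal scale (see SciPy docs for full mgc_dict contents)
```

In MGC's output, the optimal scale is what hints at the shape of the relationship: a near-global optimal scale is typical of roughly linear or monotonic dependence, while strongly local optimal scales point to nonlinear structure.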
Award ID(s):
1921310, 1712947
PAR ID:
10098512
Author(s) / Creator(s):
Date Published:
Journal Name:
eLife
Volume:
8
ISSN:
2050-084X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Understanding the intricacies of the brain often requires spotting and tracking specific neurons over time and across different individuals. For instance, scientists may need to precisely monitor the activity of one neuron even as the brain moves and deforms; or they may want to find universal patterns by comparing signals from the same neuron across different individuals. Both tasks require matching which neuron is which in different images and amongst a constellation of cells. This is theoretically possible in certain ‘model’ animals where every single neuron is known and carefully mapped out. Still, it remains challenging: neurons move relative to one another as the animal changes posture, and the position of a cell is also slightly different between individuals. Sophisticated computer algorithms are increasingly used to tackle this problem, but they are far too slow to track neural signals as real-time experiments unfold. To address this issue, Yu et al. designed a new algorithm based on the Transformer, an artificial neural network originally used to spot relationships between words in sentences. To learn relationships between neurons, the algorithm was fed hundreds of thousands of ‘semi-synthetic’ examples of constellations of neurons. Instead of being painstakingly collated from actual experimental data, these datasets were created by a simulator based on a few simple measurements. Testing the new algorithm on the tiny worm Caenorhabditis elegans revealed that it was faster and more accurate than existing approaches, finding corresponding neurons in about 10 ms. The work by Yu et al. demonstrates the power of using simulations rather than experimental data to train artificial networks. The resulting algorithm can be used immediately to help study how the brain of C. elegans makes decisions or controls movements. Ultimately, this research could allow brain-machine interfaces to be developed.
  2. Summary: Chatterjee (2021) introduced a simple new rank correlation coefficient that has attracted much attention recently. The coefficient has the unusual appeal that it not only estimates a population quantity first proposed by Dette et al. (2013) that is zero if and only if the underlying pair of random variables is independent, but also is asymptotically normal under independence. This paper compares Chatterjee’s new correlation coefficient with three established rank correlations that also facilitate consistent tests of independence, namely Hoeffding’s $D$, Blum–Kiefer–Rosenblatt’s $R$, and Bergsma–Dassios–Yanagimoto’s $\tau^*$. We compare the computational efficiency of these rank correlation coefficients in light of recent advances, and investigate their power against local rotation and mixture alternatives. Our main results show that Chatterjee’s coefficient is unfortunately rate-suboptimal compared to $D$, $R$ and $\tau^*$. The situation is more subtle for a related earlier estimator of Dette et al. (2013). These results favour $D$, $R$ and $\tau^*$ over Chatterjee’s new correlation coefficient for the purpose of testing independence.
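The coefficient itself is not defined in this summary; for orientation, the sketch below implements the commonly cited no-ties form of Chatterjee's rank correlation in Python. The toy data and variable names are illustrative only, not from the paper.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation (no-ties formula): sort the pairs by x,
    rank the y values in that order, and measure how wildly consecutive
    y-ranks jump. Values near 0 suggest independence; values near 1 suggest
    y is (close to) a noiseless function of x."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in x-sorted order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=500)
print(chatterjee_xi(x, x**2))                      # strong nonlinear dependence -> well above 0
print(chatterjee_xi(x, rng.normal(size=500)))      # independent noise -> close to 0
```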
  3. The human brain contains billions of cells called neurons that rapidly carry information from one part of the brain to another. Progress in medical research and healthcare is hindered by the difficulty in understanding precisely which neurons are active at any given time. New brain imaging techniques and genetic tools allow researchers to track the activity of thousands of neurons in living animals over many months. However, these experiments produce large volumes of data that researchers currently have to analyze manually, which can take a long time and generate irreproducible results. There is a need to develop new computational tools to analyze such data. The new tools should be able to operate on standard computers rather than just specialist equipment, as relying on specialist equipment would limit the use of the solutions to particularly well-funded research teams. Ideally, the tools should also be able to operate in real time, as several experimental and therapeutic scenarios, like the control of robotic limbs, require this. To address this need, Giovannucci et al. developed a new software package called CaImAn to analyze brain images on a large scale. First, the team developed algorithms suitable for analyzing large sets of data on laptops and other standard computing equipment. These algorithms were then adapted to operate online in real time. To test how well the new software performs against manual analysis by human researchers, Giovannucci et al. asked several trained human annotators to identify active neurons that were round or donut-shaped in several sets of imaging data from mouse brains. Each set of data was independently analyzed by three or four researchers, who then discussed any neurons they disagreed on to generate a ‘consensus annotation’. Giovannucci et al. then used CaImAn to analyze the same sets of data and compared the results to the consensus annotations. This demonstrated that CaImAn is nearly as good as human researchers at identifying active neurons in brain images. CaImAn provides a quicker method to analyze large sets of brain imaging data and is currently used by over a hundred laboratories across the world. The software is open source, meaning that it is freely available and that users are encouraged to customize it and collaborate with other users to develop it further.
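The evaluation described above (scoring an automated segmentation against a human consensus annotation) boils down to matching detected components to annotated ones and computing precision, recall, and F1. The Python sketch below is a generic illustration of that comparison; it does not use CaImAn's own API, and the overlap-matching threshold is an assumption.

```python
import numpy as np

def f1_against_consensus(detected, consensus, iou_threshold=0.5):
    """Greedily match detected cell masks to consensus masks by
    intersection-over-union (IoU), then report precision, recall, F1.
    `detected` and `consensus` are lists of boolean 2-D masks."""
    unmatched = list(range(len(consensus)))
    tp = 0
    for det in detected:
        best_j, best_iou = None, iou_threshold
        for j in unmatched:
            inter = np.logical_and(det, consensus[j]).sum()
            union = np.logical_or(det, consensus[j]).sum()
            iou = inter / union if union else 0.0
            if iou >= best_iou:               # keep the best match above threshold
                best_j, best_iou = j, iou
        if best_j is not None:
            tp += 1
            unmatched.remove(best_j)          # each consensus cell matched at most once
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(consensus) if consensus else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# toy example: one detection overlapping one consensus cell
consensus_mask = np.zeros((8, 8), bool)
consensus_mask[2:5, 2:5] = True
detected_mask = np.zeros((8, 8), bool)
detected_mask[2:5, 3:6] = True
print(f1_against_consensus([detected_mask], [consensus_mask], iou_threshold=0.3))
```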
  4. Abstract: Heterogeneity in brain activity can give rise to heterogeneity in behavior, which in turn comprises our distinctive characteristics as individuals. Studying the path from brain to behavior, however, often requires making assumptions about how similarity in behavior scales with similarity in brain activity. Here, we expand upon recent work (Finn et al., 2020) which proposes a theoretical framework for testing the validity of such assumptions. Using intersubject representational similarity analysis in two independent movie-watching functional MRI (fMRI) datasets, we probe how brain-behavior relationships vary as a function of behavioral domain and participant sample. We find evidence that, in some cases, the neural similarity of two individuals is not correlated with behavioral similarity. Rather, individuals with higher behavioral scores are more similar to other high scorers whereas individuals with lower behavioral scores are dissimilar from everyone else. Ultimately, our findings motivate a more extensive investigation of both the structure of brain-behavior relationships and the tacit assumption that people who behave similarly will demonstrate shared patterns of brain activity.
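As context for the analysis described, intersubject representational similarity analysis amounts to correlating a subject-by-subject neural similarity matrix with a behavioral similarity matrix built under a particular model of how behavior should scale with brain similarity. The sketch below is an illustrative Python outline, not the authors' code; the "nearest-neighbor" and "high scorers cluster together" models follow the framework of Finn et al. (2020), and the toy data are random.

```python
import numpy as np
from scipy.stats import spearmanr

def isrsa(neural_sim, behavior, model="nn"):
    """Spearman-correlate the upper triangles of a subject-by-subject neural
    similarity matrix and a model behavioral similarity matrix.
    model="nn": nearest-neighbor (similar scores -> similar brains), -|b_i - b_j|.
    model="ak": high scorers resemble each other, low scorers are idiosyncratic,
    modeled as the pairwise mean (b_i + b_j) / 2."""
    b = np.asarray(behavior, float)
    if model == "nn":
        behav_sim = -np.abs(b[:, None] - b[None, :])
    else:
        behav_sim = (b[:, None] + b[None, :]) / 2.0
    iu = np.triu_indices(len(b), k=1)          # off-diagonal upper triangle only
    rho, p = spearmanr(neural_sim[iu], behav_sim[iu])
    # note: with dependent pairs, a Mantel-style permutation test is typically
    # used for inference rather than this naive p-value
    return rho, p

# toy data: 30 subjects, neural similarity from random "activity" patterns
rng = np.random.default_rng(2)
activity = rng.normal(size=(30, 200))
neural_sim = np.corrcoef(activity)             # intersubject correlation matrix
scores = rng.normal(size=30)                   # one behavioral score per subject
print(isrsa(neural_sim, scores, model="nn"))
print(isrsa(neural_sim, scores, model="ak"))
```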
  5. Millions of concussions happen each year in the US alone, and a proportionally large number of them are due to high-impact sports injuries. Currently, there is no way to quickly monitor brain function and test the oculomotor function of individuals who have suffered a traumatic brain injury in order to diagnose a concussion. Concussions are presently diagnosed with a CT scan or MRI, which are lengthy procedures to schedule, set up, and conduct, and whose results take additional time to analyze before a diagnosis can be reached. This delay is inherently problematic: the longer the time between injury and diagnosis, the greater the risk of decisions and actions that worsen damage to the brain. The sooner a concussion can be diagnosed, the sooner and better treatment can be provided for recovery. To address this issue, we seek to develop a device that diagnoses concussions and monitors brain activity in a more rapid and timely manner. A literature review of the anatomy of vestibular and ocular brain functions was performed, along with research into various methods for testing and monitoring these functions. One method that has proven reliable for diagnosis is Vestibular Ocular Motor Screening (VOMS), a visual and balance test performed by a doctor with a patient. Further research was done into existing technologies that would allow the device to perform brain monitoring, visual testing, and ultimately diagnosis, namely EEG, VR, and infrared eye tracking. Currently, very few devices on the market combine these technologies for medical use. A device incorporating them would allow more consistent administration of visual tests and real-time monitoring of brain activity. Once a functional prototype is built, user testing will be performed to assess the function and viability of the device.