skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Digital medicine and the curse of dimensionality
Abstract Digital health data are multimodal and high-dimensional. A patient’s health state can be characterized by a multitude of signals including medical imaging, clinical variables, genome sequencing, conversations between clinicians and patients, and continuous signals from wearables, among others. This high volume, personalized data stream aggregated over patients’ lives has spurred interest in developing new artificial intelligence (AI) models for higher-precision diagnosis, prognosis, and tracking. While the promise of these algorithms is undeniable, their dissemination and adoption have been slow, owing partially to unpredictable AI model performance once deployed in the real world. We posit that one of the rate-limiting factors in developing algorithms that generalize to real-world scenarios is the very attribute that makes the data exciting—their high-dimensional nature. This paper considers how the large number of features in vast digital health data can challenge the development of robust AI models—a phenomenon known as “the curse of dimensionality” in statistical learning theory. We provide an overview of the curse of dimensionality in the context of digital health, demonstrate how it can negatively impact out-of-sample performance, and highlight important considerations for researchers and algorithm designers.  more » « less
Award ID(s):
2048223 2007688
PAR ID:
10353113
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
npj Digital Medicine
Volume:
4
Issue:
1
ISSN:
2398-6352
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Forecasting all components in complex systems is an open and challenging task, possibly due to high dimensionality and undesirable predictors. We bridge this gap by proposing a data-driven and model-free framework, namely, feature-and-reconstructed manifold mapping (FRMM), which is a combination of feature embedding and delay embedding. For a high-dimensional dynamical system, FRMM finds its topologically equivalent manifolds with low dimensions from feature embedding and delay embedding and then sets the low-dimensional feature manifold as a generalized predictor to achieve predictions of all components. The substantial potential of FRMM is shown for both representative models and real-world data involving Indian monsoon, electroencephalogram (EEG) signals, foreign exchange market, and traffic speed in Los Angeles Country. FRMM overcomes the curse of dimensionality and finds a generalized predictor, and thus has potential for applications in many other real-world systems. 
    more » « less
  2. Improving our understanding of how humans perceive AI teammates is an important foundation for our general understanding of human-AI teams. Extending relevant work from cognitive science, we propose a framework based on item response theory for modeling these perceptions. We apply this framework to real-world experiments, in which each participant works alongside another person or an AI agent in a question-answering setting, repeatedly assessing their teammate’s performance. Using this experimental data, we demonstrate the use of our framework for testing research questions about people’s perceptions of both AI agents and other people. We contrast mental models of AI teammates with those of human teammates as we characterize the dimensionality of these mental models, their development over time, and the influence of the participants’ own self-perception. Our results indicate that people expect AI agents’ performance to be significantly better on average than the performance of other humans, with less variation across different types of problems. We conclude with a discussion of the implications of these findings for human-AI interaction. 
    more » « less
  3. Artificial Intelligence (AI)-driven Digital Health (DH) systems are poised to play a critical role in the future of healthcare. In 2021, $57.2 billion was invested in DH systems around the world, recognizing the promise this concept holds for aiding in delivery and care management. DH systems traditionally include a blend of various technologies, AI, and physiological biomarkers and have shown a potential to provide support for individuals with various health conditions. Digital therapeutics (DTx) is a more specific set of technology-enabled interventions within the broader DH sphere intended to produce a measurable therapeutic effect. DTx tools can empower both patients and healthcare providers, informing the course of treatment through data-driven interventions while collecting data in real-time and potentially reducing the number of patient office visits needed. In particular, socially assistive robots (SARs), as a DTx tool, can be a beneficial asset to DH systems since data gathered from sensors onboard the robot can help identify in-home behaviors, activity patterns, and health status of patients remotely. Furthermore, linking the robotic sensor data to other DH system components, and enabling SAR to function as part of an Internet of Things (IoT) ecosystem, can create a broader picture of patient health outcomes. The main challenge with DTx, and DH systems in general, is that the sheer volume and limited oversight of different DH systems and DTxs is hindering validation efforts (from technical, clinical, system, and privacy standpoints) and consequently slowing widespread adoption of these treatment tools. 
    more » « less
  4. null (Ed.)
    Research and experimentation using big data sets, specifically large sets of electronic health records (EHR) and social media data, is demonstrating the potential to understand the spread of diseases and a variety of other issues. Applications of advanced algorithms, machine learning, and artificial intelligence indicate a potential for rapidly advancing improvements in public health. For example, several reports indicate that social media data can be used to predict disease outbreak and spread (Brown, 2015). Since real-world EHR data has complicated security and privacy issues preventing it from being widely used by researchers, there is a real need to synthetically generate EHR data that is realistic and representative. Current EHR generators, such as Syntheaä (Walonoski et al., 2018) only simulate and generate pure medical-related data. However, adding patients’ social media data with their simulated EHR data would make combined data more comprehensive and realistic for healthcare research. This paper presents a patients’ social media data generator that extends an EHR data generator. By adding coherent social media data to EHR data, a variety of issues can be examined for emerging interests, such as where a contagious patient may have been and others with whom they may have been in contact. Social media data, specifically Twitter data, is generated with phrases indicating the onset of symptoms corresponding to the synthetically generated EHR reports of simulated patients. This enables creation of an open data set that is scalable up to a big-data size, and is not subject to the security, privacy concerns, and restrictions of real healthcare data sets. This capability is important to the modeling and simulation community, such as scientists and epidemiologists who are developing algorithms to analyze the spread of diseases. It enables testing a variety of analytics without revealing real-world private patient information. 
    more » « less
  5. Diffusion models achieve state-of-the-art performance in various generation tasks. However, their theoretical foundations fall far behind. This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. Our result provides sample complexity bounds for distribution estimation using diffusion models. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. Further, the generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution. The convergence rate depends on subspace dimension, implying that diffusion models can circumvent the curse of data ambient dimensionality. 
    more » « less