Title: Building the Model
Context.— Machine learning (ML) allows for the analysis of massive quantities of high-dimensional clinical laboratory data, thereby revealing complex patterns and trends. Thus, ML can potentially improve the efficiency of clinical data interpretation and the practice of laboratory medicine. However, the risks of generating biased or unrepresentative models, which can lead to misleading clinical conclusions or overestimation of the model performance, should be recognized. Objectives.— To discuss the major components for creating ML models, including data collection, data preprocessing, model development, and model evaluation. We also highlight many of the challenges and pitfalls in developing ML models, which could result in misleading clinical impressions or inaccurate model performance, and provide suggestions and guidance on how to circumvent these challenges. Data Sources.— The references for this review were identified through searches of the PubMed database, US Food and Drug Administration white papers and guidelines, conference abstracts, and online preprints. Conclusions.— With the growing interest in developing and implementing ML models in clinical practice, laboratorians and clinicians need to be educated in order to collect sufficiently large and high-quality data, properly report the data set characteristics, and combine data from multiple institutions with proper normalization. They will also need to assess the reasons for missing values, determine the inclusion or exclusion of outliers, and evaluate the completeness of a data set. In addition, they require the necessary knowledge to select a suitable ML model for a specific clinical question and accurately evaluate the performance of the ML model, based on objective criteria. Domain-specific knowledge is critical in the entire workflow of developing ML models.
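Two of the preparation steps the review emphasizes, assessing missing values before imputing them and holding out data so performance is measured on records the model never saw, can be sketched in plain Python. This is a hedged toy illustration, not the article's method: the analyte values, the median-imputation choice, and the helper names (`impute_median`, `train_test_split`) are all invented for the example.

```python
import random
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of observed values.
    Whether imputation is appropriate depends on *why* values are missing,
    a point the review stresses."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def train_test_split(rows, labels, test_frac=0.3, seed=0):
    """Hold out a fraction of the records so evaluation is not performed
    on the same data used to fit the model."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

# Toy analyte measurements with missing values encoded as None.
raw = [4.1, None, 3.9, 8.2, 7.8, None, 4.3, 8.0]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
clean = impute_median(raw)
X_tr, y_tr, X_te, y_te = train_test_split(clean, labels)
```

In a real study the split would be stratified and the imputation strategy justified against the missingness mechanism; the sketch only shows where those decisions sit in the workflow.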
Award ID(s):
1750326
PAR ID:
10438317
Author(s) / Creator(s):
Date Published:
Journal Name:
Archives of Pathology & Laboratory Medicine
Volume:
147
Issue:
7
ISSN:
0003-9985
Page Range / eLocation ID:
826 to 836
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Digital pathology has transformed the traditional pathology practice of analyzing tissue under a microscope into a computer vision workflow. Whole-slide imaging allows pathologists to view and analyze microscopic images on a computer monitor, enabling computational pathology. By leveraging artificial intelligence (AI) and machine learning (ML), computational pathology has emerged as a promising field in recent years. Recently, task-specific AI/ML (eg, convolutional neural networks) has risen to the forefront, achieving above-human performance in many image-processing and computer vision tasks. The performance of task-specific AI/ML models depends on the availability of many annotated training datasets, which presents a rate-limiting factor for AI/ML development in pathology. Task-specific AI/ML models cannot benefit from multimodal data and lack generalization, eg, the AI models often struggle to generalize to new datasets or unseen variations in image acquisition, staining techniques, or tissue types. The 2020s are witnessing the rise of foundation models and generative AI. A foundation model is a large AI model trained using sizable data, which is later adapted (or fine-tuned) to perform different tasks using a modest amount of task-specific annotated data. These AI models provide in-context learning, can self-correct mistakes, and promptly adjust to user feedback. In this review, we provide a brief overview of recent advances in computational pathology enabled by task-specific AI, their challenges and limitations, and then introduce various foundation models. We propose to create a pathology-specific generative AI based on multimodal foundation models and present its potentially transformative role in digital pathology. 
We describe different use cases, delineating how it could serve as an expert companion for pathologists and help them efficiently and objectively perform routine laboratory tasks, including quantitative image analysis, pathology report generation, diagnosis, and prognosis. We also outline the potential role that foundation models and generative AI can play in standardizing the pathology laboratory workflow, education, and training.
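The adaptation pattern described above, a large frozen encoder plus a small head fit on a modest amount of task-specific labeled data, can be caricatured in a few lines. This is a deliberately simplified sketch: `frozen_embed` stands in for a real pretrained foundation-model encoder, and a nearest-centroid head replaces the fine-tuning a real system would perform; every name and value here is invented.

```python
def frozen_embed(x):
    # Stand-in for a pretrained foundation-model encoder: in practice this
    # is a large network whose weights stay frozen during adaptation.
    return [x, x * x]

def fit_head(samples, labels):
    """Fit a nearest-centroid 'head' on frozen embeddings, playing the role
    of the modest task-specific adaptation step."""
    centroids = {}
    for cls in set(labels):
        embs = [frozen_embed(s) for s, lab in zip(samples, labels) if lab == cls]
        centroids[cls] = [sum(dim) / len(dim) for dim in zip(*embs)]
    return centroids

def predict(centroids, x):
    # Classify by the nearest class centroid in the frozen embedding space.
    emb = frozen_embed(x)
    return min(centroids,
               key=lambda cls: sum((a - b) ** 2
                                   for a, b in zip(emb, centroids[cls])))

# A handful of labeled examples suffices to adapt the frozen features.
head = fit_head([0.1, 0.2, 0.9, 1.0], ["benign", "benign", "tumor", "tumor"])
```

The point of the sketch is the division of labor: the expensive representation is learned once and reused, while each new task needs only the cheap head-fitting step.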
  2. Range aggregate queries (RAQs) are an integral part of many real-world applications, where fast, approximate answers to queries are often desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML-based approaches perform well. Furthermore, because the ML approaches model the data, they fail to capitalize on any query-specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach and provide a distribution- and query-dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework for answering RAQs in practice. An extensive experimental study on real-world, TPC-benchmark, and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than the state of the art and with better accuracy.
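The shift from modeling the data to modeling the queries can be illustrated with a deliberately crude stand-in: memorize exact answers for a set of training range queries, then answer a new query from its nearest trained neighbor. NeuroSketch trains a neural network rather than this lookup, so treat the sketch as a conceptual toy; the data and all names are invented.

```python
import bisect
import random

def exact_count(sorted_data, lo, hi):
    """Ground-truth range-COUNT answer via binary search over the data."""
    return bisect.bisect_right(sorted_data, hi) - bisect.bisect_left(sorted_data, lo)

class QueryModel:
    """Toy 'model the queries' learner: store (query -> answer) pairs and
    answer a new range query from its nearest trained query. NeuroSketch
    replaces this lookup with a learned neural network."""
    def __init__(self):
        self.train_set = []

    def fit(self, queries, answers):
        self.train_set = list(zip(queries, answers))

    def predict(self, lo, hi):
        best = min(self.train_set,
                   key=lambda qa: (qa[0][0] - lo) ** 2 + (qa[0][1] - hi) ** 2)
        return best[1]

rng = random.Random(0)
data = sorted(rng.uniform(0, 100) for _ in range(1000))

# Training queries cover the domain; answers are computed exactly once.
train_queries = [(l, l + w) for l in range(0, 90, 5) for w in (5, 10)]
model = QueryModel()
model.fit(train_queries, [exact_count(data, l, h) for l, h in train_queries])
```

At query time the model never touches the underlying data, which is what makes the approach fast and what motivates the paper's query-dependent error analysis.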
  3.
    Learning is usually conceptualized as a process during which new information is processed in working memory to form knowledge structures called schemas, which are stored in long-term memory. Practice plays a critical role in developing schemas through learning-by-doing. Contemporary expertise development theories have highlighted the influence of deliberate practice (DP) on achieving exceptional performance in sports, music, and different professional fields. Concurrently, there is an emerging method for improving learning efficiency by combining deliberate practice with cognitive load theory (CLT), a cognition-architecture-based theory for instructional design. Mechanics is a foundation for most branches of engineering. It serves to develop problem-solving skills and consolidate understanding of other subjects, such as applied mathematics and physics. Mechanics has been a challenging subject. Students need to understand governing principles to gain conceptual knowledge and acquire procedural knowledge to apply these principles to solve problems. Due to the difficulty of developing conceptual and procedural knowledge, mechanics courses are among those with high DFW rates (the percentage of students receiving a grade of D or F or withdrawing from a course), and students are more likely to leave engineering after taking mechanics courses. Since deliberate practice can help novices develop good representations of the knowledge needed to produce superior problem-solving performance, this study aims to evaluate how deliberate practice helps students learn mechanics during the process of schema acquisition and consolidation. Considering cognitive capacity limitations, we will apply cognitive load theory to develop deliberate practice to help students build declarative and procedural knowledge without exceeding their working memory limitations.
We will evaluate the effects of three practice strategies based on CLT on schema acquisition and consolidation in two mechanics courses (i.e., Statics and Dynamics). Examples and assessment results will be provided to evaluate the effectiveness of the practice strategies, as well as the challenges encountered.
  4. In recent years, the pace of innovation in machine learning (ML) has accelerated, and researchers in SysML have created algorithms and systems that parallelize ML training over multiple devices or computational nodes. As ML models become more structurally complex, many systems have struggled to deliver all-round performance across a variety of models. In particular, the effort of scaling up ML is often underestimated: mapping from a model to an appropriate distribution strategy requires substantial knowledge and time. Applying parallel training systems to complex models adds nontrivial development overhead on top of model prototyping and often results in lower-than-expected performance. This tutorial identifies research and practical pain points in parallel ML training and discusses the latest developments in algorithms and systems for addressing these challenges in both usability and performance. In particular, it presents a new perspective that unifies seemingly different distributed ML training strategies and, based on it, introduces new techniques and system architectures to simplify and automate ML parallelization. The tutorial is built upon the authors' years of research and industry experience, a comprehensive literature survey, and several recent tutorials and papers published by the authors and peer researchers. It consists of four parts. The first part presents a landscape of distributed ML training techniques and systems, and highlights the major difficulties real users face when writing distributed ML code with big models or big data. The second part dives deep into the mainstream training strategies, guided by real use cases.
By developing a new, unified formulation to represent the seemingly different data- and model-parallel strategies, we describe a set of techniques and algorithms to achieve automatic ML parallelization, along with compiler system architectures for auto-generating and executing parallelization strategies based on models and clusters. The third part of the tutorial exposes a hidden layer of practical pain points in distributed ML training, hyperparameter tuning and resource allocation, and introduces techniques to improve these aspects. The fourth part is a hands-on coding session in which we walk the audience through writing distributed training programs in Python, using the various distributed ML tools and interfaces provided by the Ray ecosystem.
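The core of the data-parallel strategy discussed here, each worker computing a gradient on its own data shard and all workers applying the averaged gradient, can be simulated in a few lines of plain Python. This is a hedged sketch with simulated workers rather than the Ray APIs the tutorial uses; the linear model, learning rate, and helper names are invented for illustration.

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient of y = w*x on one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous data-parallel update: every worker computes a local
    gradient, the gradients are averaged (the all-reduce step), and all
    replicas apply the same update, keeping the model weights identical."""
    grads = [local_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Fit y = 3x from data split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)],
          [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

In a real system the averaging step is a collective communication across machines, and choosing between this strategy and model-parallel alternatives is exactly the mapping problem the tutorial's unified formulation aims to automate.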