
Title: Differences between human and machine perception in medical diagnosis

Deep neural networks (DNNs) show promise in image-based medical diagnosis, but cannot be fully trusted since they can fail for reasons unrelated to underlying pathology. Humans are less likely to make such superficial mistakes, since they use features that are grounded in medical science. It is therefore important to know whether DNNs use different features than humans. To this end, we propose a framework for comparing human and machine perception in medical diagnosis. We frame the comparison in terms of perturbation robustness, and mitigate Simpson’s paradox by performing a subgroup analysis. The framework is demonstrated with a case study in breast cancer screening, where we separately analyze microcalcifications and soft tissue lesions. While it is inconclusive whether humans and DNNs use different features to detect microcalcifications, we find that for soft tissue lesions, DNNs rely on high-frequency components ignored by radiologists. Moreover, these features are located outside of the region of the images found most suspicious by radiologists. This difference between humans and machines was only visible through subgroup analysis, which highlights the importance of incorporating medical domain knowledge into the comparison.
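
As a rough illustration of why the subgroup analysis matters, the following minimal Python sketch shows Simpson's paradox on toy counts. Every number, the condition names (`unfiltered`, `filtered`), and the helper functions are hypothetical and chosen only for illustration; none of them are results or methods from the study.

```python
# Toy illustration of Simpson's paradox. Every count below is hypothetical.
# counts[condition][subgroup] = (correct readings, total readings)
counts = {
    "unfiltered": {"microcalcifications": (81, 87), "soft_tissue": (192, 263)},
    "filtered": {"microcalcifications": (234, 270), "soft_tissue": (55, 80)},
}


def subgroup_accuracy(condition, subgroup):
    """Accuracy within a single lesion subgroup."""
    correct, total = counts[condition][subgroup]
    return correct / total


def pooled_accuracy(condition):
    """Accuracy after pooling all subgroups, ignoring lesion type."""
    correct = sum(c for c, _ in counts[condition].values())
    total = sum(t for _, t in counts[condition].values())
    return correct / total


for subgroup in ("microcalcifications", "soft_tissue"):
    print(
        subgroup,
        round(subgroup_accuracy("unfiltered", subgroup), 3),
        round(subgroup_accuracy("filtered", subgroup), 3),
    )
print(
    "pooled",
    round(pooled_accuracy("unfiltered"), 3),
    round(pooled_accuracy("filtered"), 3),
)
```

In these toy counts the unfiltered condition is more accurate within each subgroup, yet the pooled comparison points the other way because the two conditions contain different mixes of subgroups. Analyzing each subgroup separately avoids drawing the reversed conclusion.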

Journal Name: Scientific Reports
Publisher: Nature Publishing Group
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Background

    Nucleotide excision repair is the primary DNA repair mechanism that removes bulky DNA adducts such as UV-induced pyrimidine dimers. Correspondingly, genome-wide mapping of nucleotide excision repair with eXcision Repair sequencing (XR-seq) provides comprehensive profiling of DNA damage repair. A number of XR-seq experiments under a variety of conditions for different damage types revealed heterogeneous repair in the human genome. Although human repair profiles have been extensively studied, how repair maps vary between primates is yet to be investigated. Here, we characterized the genome-wide UV-induced damage repair in the gray mouse lemur, Microcebus murinus, in comparison to human.


    We derived fibroblast cell lines from mouse lemur, exposed them to UV irradiation, and analyzed the repair events genome-wide using the XR-seq protocol. Mouse lemur repair profiles were analyzed in comparison to the equivalent human fibroblast datasets. We found that overall UV sensitivity, repair efficiency, and transcription-coupled repair levels differ between the two primates. Despite this, comparative analysis of human and mouse lemur fibroblasts revealed that genome-wide repair profiles of the homologous regions are highly correlated, and this correlation is stronger for highly expressed genes. With the inclusion of an additional XR-seq sample derived from another human cell line in the analysis, we found that fibroblasts of the two primates repair UV-induced DNA lesions in a more similar pattern than two distinct human cell lines do.


    Our results suggest that mouse lemurs and humans, and possibly primates in general, share a homologous repair mechanism as well as genomic variance distribution, albeit with variable repair efficiency. This result also emphasizes the deep homologies of individual tissue types across the eukaryotic phylogeny.

  2. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have so far scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. 
Not every pathological feature is annotated, meaning excluded areas can include foci particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest-performing labels were background (97% correct identification) and artifact (76% correct identification). A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose from how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch, resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. 
The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only where the microscopic diagnosis identified them as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature or a cancerous growth, pathologists apply antigen-targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative fields that IHC slides provide could play an integral role in machine learning pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase in model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% owing to a 57% decrease in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. 
A CSV version of the annotation file is also available, which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are 16,971 annotated regions in total, with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split up into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 were allotted to each of the development and evaluation sets, while the remaining 34 were allotted to train. The remaining 222 patients were split up to preserve the overall distribution of labels within the corpus. This was done in the hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository ( -facility) are being digitized in addition to slides provided by Temple University Hospital. This data includes 18 different types of tissue including approximately 38.5% urinary tissue and 16.5% gynecological tissue. These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. 
We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnostic model could support pathologists’ workload and even help prioritize suspected cancerous cases.
ACKNOWLEDGMENTS
This material is supported by the National Science Foundation under grants nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
[1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104.
[2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.
[3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040.
[4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344.
[5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: [Accessed: 01-Aug-2021].
[6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.
[7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020.
[8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4.
[9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016.
  3. With the spread of COVID-19, significantly more patients have required medical diagnosis to determine whether they are carriers of the virus. COVID-19 can lead to the development of pneumonia in the lungs, which can be captured in X-ray and CT scans of the patient's chest. The abundance of X-ray and CT image data available can be used to develop a high-performing computer vision model able to identify and classify instances of pneumonia present in medical scans. Predictions made by these deep learning models can increase the confidence of diagnoses by analyzing minute features present in scans exhibiting COVID-19 pneumonia that are often unnoticeable to the human eye. Furthermore, rather than teaching clinicians about the mathematics behind deep learning and heat maps, we introduce novel methods of explainable artificial intelligence (XAI) with the goal of annotating instances of pneumonia in medical scans exactly as radiologists do to inform other radiologists, clinicians, and interns about patterns and findings. This project explores methods to train and optimize state-of-the-art deep learning models on COVID-19 pneumonia medical scans and to apply explainability algorithms to generate annotated explanations of model predictions that are useful to clinicians and radiologists in analyzing these images.
  4. Abstract

    High-resolution millimeter-wave imaging (HR-MMWI), with its high discrimination contrast and sufficient penetration depth, can potentially provide affordable tissue diagnostic information noninvasively. In this study, we evaluate the application of a real-time system of HR-MMWI for in-vivo skin cancer diagnosis. 136 benign and malignant skin lesions from 71 patients, including melanoma, basal cell carcinoma, squamous cell carcinoma, actinic keratosis, melanocytic nevi, angiokeratoma, dermatofibroma, solar lentigo, and seborrheic keratosis, were measured. Lesions were classified using a 3-D principal component analysis followed by five classifiers including linear discriminant analysis (LDA), K-nearest neighbor (KNN) with different K-values, linear and Gaussian support vector machine (LSVM and GSVM) with different margin factors, and multilayer perceptron (MLP). Our results suggested that the best classification was achieved by using five PCA components followed by MLP with 97% sensitivity and 98% specificity. Our findings establish that real-time millimeter-wave imaging can be used to distinguish malignant tissues from benign skin lesions with high diagnostic accuracy comparable with clinical examination and other methods.

  5. Introduction: Multi-series CT (MSCT) scans, including non-contrast CT (NCCT), CT Perfusion (CTP), and CT Angiography (CTA), are widely used in acute stroke imaging. While each scan has its advantage in disease diagnosis, the varying image resolution of different series hinders the ability of the radiologist to discern subtle suspicious findings. Besides, higher image quality requires higher radiation doses, leading to increased health risks such as cataract formation and cancer induction. Thus, it is crucial to develop an approach that improves MSCT resolution and lowers radiation exposure. Hypothesis: MSCT images of the same patient are highly correlated in structural features, so transferring and integrating the shared and complementary information from different series is beneficial for achieving high image quality. Methods: We propose TL-GAN, a learning-based method that uses Transfer Learning (TL) and a Generative Adversarial Network (GAN) to reconstruct high-quality diagnostic images. Our TL-GAN method is evaluated on 4,382 images collected from nine patients’ MSCT scans, including 415 NCCT slices, 3,696 CTP slices, and 271 CTA slices. We randomly split the nine patients into a training set (4 patients), a validation set (2 patients), and a testing set (3 patients). In preprocessing, we remove the background and skull and visualize in brain window. The low-resolution images (1/4 of the original spatial size) are simulated by bicubic down-sampling. For training without TL, we train the different series individually; for training with TL, we fine-tune following the scanning sequence (NCCT, CTP, and CTA). Results: The performance of TL-GAN is evaluated by the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index on 184 NCCT, 882 CTP, and 107 CTA test images. Figure 1 provides both visual (a-c) and quantitative (d-f) comparisons. TL-GAN performs significantly better with TL than without TL (training from scratch) for NCCT, CTP, and CTA images. The significance of these improvements is evaluated by one-tailed paired t-tests (p < 0.05). We enlarge the regions of interest for detailed visual comparison. Further, we evaluate the CTP performance by calculating the perfusion maps, including cerebral blood flow (CBF) and cerebral blood volume (CBV). The visual comparison of the perfusion maps in Figure 2 demonstrates that TL-GAN is beneficial for achieving high diagnostic image quality, comparable to the ground truth images for both CBF and CBV maps. Conclusion: TL-GAN can effectively improve the image resolution of MSCT and provide radiologists with more image detail for suspicious findings, making it a practical solution for MSCT image quality enhancement.
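
For readers unfamiliar with the PSNR metric named in the last abstract, the following is a minimal Python sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). The function name `psnr`, the nested-list image representation, and the default peak value of 255 are assumptions made for illustration; the abstract does not say how the authors computed the metric, and SSIM is not covered here.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images.

    Images are given as nested lists of pixel intensities (an assumption of
    this sketch); max_value is the largest possible pixel intensity.
    """
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("images must be non-empty and have the same height")

    squared_error = 0.0
    pixel_count = 0
    for ref_row, rec_row in zip(reference, reconstructed):
        if len(ref_row) != len(rec_row):
            raise ValueError("images must have the same width")
        for ref_pixel, rec_pixel in zip(ref_row, rec_row):
            squared_error += (ref_pixel - rec_pixel) ** 2
            pixel_count += 1
    if pixel_count == 0:
        raise ValueError("images must contain at least one pixel")

    mse = squared_error / pixel_count
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical 2x2 images, used only to exercise the function.
    original = [[100, 120], [140, 160]]
    degraded = [[110, 130], [150, 170]]
    print(round(psnr(original, degraded), 2))
```

Higher values indicate a reconstruction closer to the reference, and identical images give an infinite PSNR, which the sketch returns as `math.inf`.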