

Title: Evaluation of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System
Record linkage, often called entity resolution or de-duplication, refers to identifying the same entities across one or more databases. As the amount of generated data grows at an exponential rate, it becomes increasingly important to integrate data from several sources to perform richer analysis. In this paper, we present an open-source, comprehensive, end-to-end hybrid record linkage framework that combines automatic and manual review processes. Using this framework, we train several models based on different machine learning algorithms, including random forests (RF), linear SVM, radial SVM, and dense neural networks (DNN), and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on recall, F1-score (quality of linkages), and the number of uncertain pairs, i.e., the number of pairs that require manual review. We also apply our trained models to a new dataset to test how well each model transfers to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, a letter embedding for names, on model performance. Using n2v yields a smaller manual review set at a slightly lower F1-score. Overall, the SVM models performed best in all experiments.
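The automatic/manual split described above can be sketched as a simple probability triage, assuming the trained classifier emits a match probability per candidate record pair; the threshold values here are illustrative, not taken from the paper.

```python
# Sketch of the hybrid review step: confident predictions are handled
# automatically, ambiguous pairs are routed to human reviewers.
# The low/high thresholds are illustrative assumptions.

def triage_pairs(pair_probs, low=0.3, high=0.8):
    """Split candidate pairs into auto-link, auto-nonlink, and manual review."""
    links, nonlinks, uncertain = [], [], []
    for pair_id, p in pair_probs:
        if p >= high:
            links.append(pair_id)        # confident match: link automatically
        elif p <= low:
            nonlinks.append(pair_id)     # confident non-match: discard
        else:
            uncertain.append(pair_id)    # ambiguous: send to manual review
    return links, nonlinks, uncertain
```

Widening the uncertain band trades reviewer effort for linkage quality, which is exactly the F1-versus-review-set-size trade-off the paper measures.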
Award ID(s):
1744071
NSF-PAR ID:
10223341
Author(s) / Creator(s):
; ;
Editor(s):
Martin, Andreas; Hinkelmann, Knut; Fill, Hans-Georg; Gerber, Aurona; Lenat, Doug; Stolle, Reinhard; Harmelen, Frank van
Date Published:
Journal Name:
CEUR workshop proceedings
Volume:
2846
Issue:
4
ISSN:
1613-0073
Page Range / eLocation ID:
25
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Background

    The research gap addressed in this study is the applicability of deep neural network (NN) models on wearable sensor data to recognize different activities performed by patients with Parkinson’s Disease (PwPD) and the generalizability of these models to PwPD using labeled healthy data.

    Methods

    The experiments were carried out utilizing three datasets containing wearable motion sensor readings on common activities of daily living. The collected readings were from two accelerometer sensors. PAMAP2 and MHEALTH are publicly available datasets collected from 10 and 9 healthy, young subjects, respectively. A private dataset of a similar nature collected from 14 PwPD patients was utilized as well. Deep NN models were implemented with varying levels of complexity to investigate the impact of data augmentation, manual axis reorientation, model complexity, and domain adaptation on activity recognition performance.

    Results

    A moderately complex model trained on the augmented PAMAP2 dataset and adapted to the Parkinson domain achieved the best activity recognition performance, with an accuracy of 73.02%, significantly higher than the 63% accuracy reported in previous studies. Its F1 score of 49.79% was a significant improvement over the best cross-testing F1 scores of 33.66% with data augmentation alone and 2.88% with neither data augmentation nor domain adaptation.

    Conclusion

    These findings suggest that deep NN models trained on healthy data have the potential to accurately recognize activities performed by PwPD, and that data augmentation and domain adaptation can improve the generalizability of models in the healthy-to-PwPD transfer scenario. The simple and moderately complex architectures tested in this study generalized better to the PwPD domain when trained on a healthy dataset than the most complex architectures used. These findings could contribute to the development of accurate wearable-based activity monitoring solutions for PwPD, improving clinical decision-making and patient outcomes based on patient activity levels.
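The data augmentation the study credits for the F1 gains can be sketched minimally, assuming each accelerometer window is a list of (x, y, z) samples; the jitter and amplitude-scaling parameters below are illustrative, not the study's.

```python
import random

def augment_window(window, noise_sd=0.05, scale_range=(0.9, 1.1)):
    """Return a jittered, amplitude-scaled copy of one accelerometer window.

    `window` is a list of (x, y, z) samples; noise_sd and scale_range are
    illustrative assumptions, not values reported in the study.
    """
    scale = random.uniform(*scale_range)  # one amplitude factor per window
    return [tuple(scale * v + random.gauss(0.0, noise_sd) for v in sample)
            for sample in window]
```

Generating several such perturbed copies per labeled window enlarges the healthy training set without new data collection, which is the role augmentation plays in the healthy-to-PwPD transfer above.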

     
  2. Abstract

    Background

    Predictive models utilizing social determinants of health (SDH), demographic data, and local weather data were trained to predict missed imaging appointments (MIA) among breast imaging patients at the Boston Medical Center (BMC). Patients were characterized by many different variables, including social needs, demographics, imaging utilization, appointment features, and weather conditions on the date of the appointment.

    Methods

    This HIPAA-compliant retrospective cohort study was IRB approved, and informed consent was waived. After data preprocessing, the dataset contained 9,970 patients and 36,606 appointments from 1/1/2015 to 12/31/2019. We identified 57 potentially impactful variables used in the initial prediction model and assessed each patient for MIA. We then developed a parsimonious model via recursive feature elimination, which identified the 25 most predictive variables. We utilized linear and non-linear models, including support vector machines (SVM), logistic regression (LR), and random forest (RF), to predict MIA and compared their performance.

    Results

    The highest-performing full model is the non-linear RF, achieving the highest area under the ROC curve (AUC) of 76% and an average F1 score of 85%. Models limited to the most predictive variables attained AUC and F1 scores comparable to models with all variables included. The variables most predictive of missed appointments included timing, prior appointment history, referral department of origin, and socioeconomic factors such as household income and access to caregiving services.

    Conclusions

    Prediction of MIA with the data available is inherently limited by the complex, multifactorial nature of MIA. However, the algorithms presented achieved acceptable performance and demonstrated that socioeconomic factors were useful predictors of MIA. In contrast with non-modifiable demographic factors, SDH can be addressed to decrease the incidence of MIA.
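The recursive feature elimination used to reach the 25-variable parsimonious model can be sketched generically. Here `importance_fn` is a hypothetical callback standing in for the fitted model's feature-importance scores; in practice this is typically scikit-learn's `RFE` with an RF or LR estimator.

```python
def recursive_feature_elimination(features, importance_fn, keep=25):
    """Iteratively drop the least important feature until `keep` remain.

    `importance_fn(selected)` is an assumed callback returning a dict
    mapping each currently selected feature name to its importance score.
    """
    selected = list(features)
    while len(selected) > keep:
        scores = importance_fn(selected)                   # refit / rescore
        worst = min(selected, key=lambda f: scores[f])     # least important
        selected.remove(worst)
    return selected
```

Re-scoring inside the loop matters: importances shift as correlated features are removed, which is why elimination is done one step at a time rather than in a single ranking pass.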
  3. Due to the growing volume of remote sensing data and the low latency required for safe marine navigation, machine learning (ML) algorithms are being developed to accelerate sea ice chart generation, currently a manual interpretation task. However, the low signal-to-noise ratio of the freely available Sentinel-1 Synthetic Aperture Radar (SAR) imagery, the ambiguity of backscatter signals for ice types, and the scarcity of open-source, high-resolution labelled data make automating sea ice mapping challenging. We use Extreme Earth version 2, a high-resolution benchmark dataset generated for ML training and evaluation, to investigate the effectiveness of ML for automated sea ice mapping. Our customized pipeline combines ResNets and Atrous Spatial Pyramid Pooling for SAR image segmentation. We investigate the performance of our model for: i) binary classification of sea ice and open water in a segmentation framework; and ii) multiclass segmentation of five sea ice types. For binary ice-water classification, models trained with our largest training set have weighted F1 scores all greater than 0.95 for January and July test scenes; specifically, the median weighted F1 score was 0.98, indicating high performance for both months. By comparison, a competitive baseline U-Net has a weighted average F1 score ranging from 0.92 to 0.94 (median 0.93) for July, and from 0.97 to 0.98 (median 0.97) for January. Multiclass ice type classification is more challenging, and even though our models achieve a 2% improvement in weighted average F1 over the baseline U-Net, test weighted F1 is generally between 0.60 and 0.80. Our approach can efficiently segment full SAR scenes in one run, is faster than the baseline U-Net, retains spatial resolution and dimension, and is more robust against noise than approaches that rely on patch classification.
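The weighted F1 metric used throughout these comparisons averages per-class F1 scores weighted by class support. A minimal pure-Python version, matching scikit-learn's `average='weighted'` convention, looks like:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += f1 * support[c] / len(y_true)   # weight by class frequency
    return total
```

Weighting by support keeps the score meaningful for sea ice scenes, where class frequencies (e.g., open water versus rare ice types) are highly imbalanced.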
  4. Convolutional neural networks (CNNs) are becoming an increasingly popular approach for classification mapping of large, complex regions where manual data collection is too time consuming. Stream boundaries in hyper-arid polar regions such as the McMurdo Dry Valleys (MDVs) in Antarctica are difficult to locate because the streams have little hydraulic flow throughout the short summer months. This paper utilizes a U-Net CNN to map stream boundaries from lidar-derived rasters in Taylor Valley, located within the MDVs and covering ∼770 km². The training dataset consists of 217 (300 × 300 m²) well-distributed tiles of manually classified stream boundaries with diverse geometries (straight, sinuous, meandering, and braided) throughout the valley. The U-Net CNN is trained on elevation, slope, lidar intensity returns, and flow accumulation rasters. These features support detection of stream boundaries by providing topographic cues, such as inflection points at stream boundaries, and reflective properties of streams, such as linear patterns of wetted soil, water, or ice. Various combinations of these features were analyzed based on performance. On the test set, elevation and slope were the highest-performing features: the CNN model trained on elevation alone achieved a precision, recall, and F1 score of 0.94 ± 0.05, 0.95 ± 0.04, and 0.94 ± 0.04, respectively, while the model trained on slope alone achieved 0.96 ± 0.03, 0.93 ± 0.04, and 0.94 ± 0.04, respectively. Stream boundary prediction accuracies were higher along the coast, while inland performance varied. Meandering streams had the highest stream boundary prediction performance of the geometries tested because meandering streams are further evolved and have more distinguishable breaks in slope, indicating stream boundaries.
    These methods provide a novel approach for mapping stream boundaries semi-automatically in complex regions such as hyper-arid environments, at larger scales than is possible with current methods.
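Of the input rasters, slope is derived from elevation. A minimal sketch of that derivation via central differences, which is standard raster practice and not necessarily the paper's exact method:

```python
import math

def slope_raster(elev, cell_size=1.0):
    """Slope magnitude (degrees) via central differences on an elevation grid.

    `elev` is a 2-D list of elevations; border cells are left at zero for
    simplicity. cell_size is the raster resolution in the same units as elev.
    """
    rows, cols = len(elev), len(elev[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dzdx = (elev[r][c + 1] - elev[r][c - 1]) / (2 * cell_size)
            dzdy = (elev[r + 1][c] - elev[r - 1][c]) / (2 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dzdx, dzdy)))
    return out
```

The breaks in slope that mark meandering stream boundaries show up as sharp local changes in this derived raster, which is why slope alone performed nearly as well as elevation.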
  5. Abstract

    Pollen identification is necessary for several subfields of geology, ecology, and evolutionary biology. However, existing methods for pollen identification are laborious, time-consuming, and require highly skilled scientists. There is therefore a pressing need for an automated, accurate system for pollen identification, which would benefit both basic research and applied problems such as identifying airborne allergens. In this study, we propose a deep learning (DL) approach to classify pollen grains in the Great Basin Desert, Nevada, USA. Our dataset consisted of 10,000 images of 40 pollen species. To mitigate the limitations imposed by the small volume of our training dataset, we conducted an in-depth comparative analysis of numerous pre-trained convolutional neural network (CNN) architectures using transfer learning methodologies. In parallel, we developed and incorporated an innovative CNN model of our own to broaden our exploration and optimization of data modeling strategies. We applied different architectures of well-known pre-trained deep CNN models, including AlexNet, VGG-16, MobileNet-V2, ResNet (18, 34, 50, and 101), ResNeSt (50 and 101), SE-ResNeXt, and Vision Transformer (ViT), to uncover the most promising modeling approach for the classification of pollen grains in the Great Basin. To evaluate the performance of the pre-trained deep CNN models, we measured accuracy, precision, recall, and F1-score. Our results showed that the ResNeSt-110 model achieved the best performance, with an accuracy of 97.24%, precision of 97.89%, F1-score of 96.86%, and recall of 97.13%. Our results also revealed that transfer-learning models can deliver better and faster image classification results than traditional CNN models built from scratch. The proposed method can benefit various fields that rely on efficient pollen identification.
    This study demonstrates that DL approaches can improve the accuracy and efficiency of pollen identification, and it provides a foundation for further research in the field.
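With only 10,000 images spread across 40 species, held-out evaluation benefits from stratified splitting so every class is represented in both partitions. A minimal sketch of that step, illustrative rather than the study's exact protocol:

```python
from collections import defaultdict
import random

def stratified_split(labels, val_frac=0.2, seed=0):
    """Per-class random split of sample indices, so every pollen species
    appears in both the training and validation sets.

    `labels` is a list of class labels, one per image; returns two lists
    of indices into that list. val_frac and seed are illustrative defaults.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_frac))  # at least one per class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx
```

Without stratification, a plain random split on a small, many-class dataset can leave rare species entirely out of validation, silently inflating the reported metrics.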

     