

Search for: All records

Award ID contains: 1852498


  1. Outlier detection is critical in real-world applications. Because many outlier detection techniques exist and often return different results for the same data set, users must determine which technique is best suited to their task and tune its parameters. This is particularly challenging in the unsupervised setting, where no labels are available for the cross-validation needed for such method and parameter optimization. In this work, we propose AutoOD, which uses existing unsupervised detection techniques to automatically produce high-quality outliers without any human tuning. AutoOD's fundamentally new strategy unifies the merits of unsupervised outlier detection and supervised classification within one integrated solution. It automatically tests a diverse set of unsupervised outlier detectors on a target data set and extracts useful signals from their combined detection results to reliably capture key differences between outliers and inliers. It then uses these signals to produce a "custom outlier classifier" whose accuracy is comparable to that of supervised outlier classification models trained with ground truth labels, without having access to those labels. On a diverse set of benchmark outlier detection datasets, AutoOD consistently outperforms the best unsupervised outlier detector selected from hundreds of detectors. It also outperforms other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score. 
    Free, publicly-accessible full text available May 26, 2024
  2. Human activity recognition (HAR) is the process of using mobile sensor data to determine the physical activities performed by individuals. HAR is the backbone of many mobile healthcare applications, such as passive health monitoring systems, early diagnosis systems, and fall detection systems. Effective HAR models rely on deep learning architectures and big data in order to accurately classify activities. Unfortunately, HAR datasets are expensive to collect, are often mislabeled, and have large class imbalances. State-of-the-art approaches to address these challenges use Generative Adversarial Networks (GANs) to generate additional synthetic data along with their labels. Problematically, these HAR GANs only synthesize continuous features (features represented by real numbers) recorded from gyroscopes, accelerometers, and other sensors that produce continuous data. This is limiting, since mobile sensor data commonly has discrete features that provide additional context, such as device location and time of day, which have been shown to substantially improve HAR classification. Hence, we studied Conditional Tabular Generative Adversarial Networks (CTGANs) for synthesizing mobile sensor data containing both continuous and discrete features, a task that state-of-the-art approaches have not previously addressed. We show that HAR-CTGANs generate more realistic data, which leads to better downstream performance in HAR models, and that when state-of-the-art models are modified with HAR-CTGAN characteristics, their downstream performance also improves. 
  3. Corpuses of unstructured textual data, such as text messages between individuals, are often predictive of medical issues such as depression. The text data usually used in healthcare applications has high value and great variety, but is typically small in volume. Generating labeled unstructured text data is important to improve models by augmenting these small datasets, as well as to facilitate anonymization. While methods for labeled data generation exist, not all of them generalize well to small datasets. In this work, we thus perform a much needed systematic comparison of conditional text generation models that are promising for small datasets due to their unified architectures. We identify and implement a family of nine conditional sequence generative adversarial networks for text generation, which we collectively refer to as cSeqGAN models. These models are characterized along two orthogonal design dimensions: weighting strategies and feedback mechanisms. We conduct a comparative study evaluating the generation ability of the nine cSeqGAN models on three diverse text datasets with depression and sentiment labels. To assess the quality and realism of the generated text, we use standard machine learning metrics as well as human assessment via a user study. While the unconditioned models produced predictive text, the cSeqGAN models produced more realistic text. Our comparative study lays a solid foundation and provides important insights for further text generation research, particularly for the small datasets common within the healthcare domain. 
  4. Fair Learning To Rank (LTR) frameworks require demographic information; however, that information is often unavailable. Inference algorithms may infer the missing demographic information to supply to the fair LTR model. In this study, we analyze the effect of using a trained fair LTR model with uncertain demographic inferences. We show that inferred data results in varying levels of fairness and utility depending on inference accuracy. Specifically, less accurate inferred data adversely affects the rankings’ fairness, while more accurate inferred data creates fairer rankings. Therefore, we recommend carefully evaluating demographic inference algorithms before use. 
  5. Given that depression is one of the most prevalent mental illnesses, developing effective and unobtrusive diagnosis tools is of great importance. Recent work that screens for depression with text messages leverages models relying on lexical category features. Given the colloquial nature of text messages, the performance of these models may be limited by formal lexicons. We thus propose a strategy to automatically construct alternative lexicons that contain more relevant and colloquial terms. Specifically, we generate 36 lexicons from fiction, forum, and news corpora. These lexicons are then used to extract lexical category features from the text messages. We use machine learning models to compare the depression screening capabilities of these lexical category features. Of our 36 constructed lexicons, 14 achieved statistically significantly higher average F1 scores than the pre-existing formal lexicon and a basic bag-of-words approach. Compared with the pre-existing lexicon, our best-performing lexicon increased the average F1 score by 10%. We thus confirm our hypothesis that less formal lexicons can improve the performance of classification models that screen for depression with text messages. By providing our automatically constructed lexicons, we aid future machine learning research that leverages less formal text. 
  6. Leading 3D object detectors for automated vehicles, such as PIXOR, do not robustly account for noise and are vulnerable to adversarial attacks. Existing attack methods do not accurately simulate naturally occurring noise, as they attempt to make continuous changes on a discrete input space. In this paper, we propose a novel attack method that maximizes loss by making gradient-informed, discrete changes. A subset of points within an image moves based on a percentage change between the original and new gradient. We measure the validity of an attack based on its visual similarity to the original point cloud and on numeric metrics. 
  7. We propose a prognostic machine learning (ML) framework to support behavioural outcome prediction for cancer survivors. Specifically, our contributions are four-fold: (1) devise a data-driven, clinical-domain-guided pipeline to select the best set of predictors among cancer treatments, chronic health conditions, and socio-environmental factors for behavioural outcome prediction; (2) use a state-of-the-art two-tier ensemble-based technique to select the best set of predictors for constructing the downstream ML regressors; (3) develop a StackNet Regressor Architecture (SRA) algorithm, i.e., an intelligent meta-modeling algorithm, to dynamically and automatically build an optimized multilayer ensemble-based RA from a given set of ML regressors to predict long-term behavioural outcomes; and (4) conduct a preliminary experimental case study on our existing study data (i.e., 207 cancer survivors who suffered from either Osteogenic Sarcoma, Soft Tissue Sarcomas, or Acute Lymphoblastic Leukemia before the age of 18) collected by our investigators in a public hospital in Hong Kong. In this pilot study, we demonstrate that our approach outperforms traditional statistical and computational methods, including linear and non-linear ML regressors. 
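
The general idea described in item 1 (run several unsupervised detectors, keep the points on which they agree as pseudo-labels, then fit a classifier on those pseudo-labels) can be illustrated with a toy sketch. This is not AutoOD's actual algorithm: the three detectors, their thresholds, the unanimous-vote rule, the 1-nearest-neighbour classifier, the sample data, and all function names below are assumptions chosen only to keep the example small and dependency-free.

```python
import statistics


def zscore_flags(values, threshold=2.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [abs(v - mean) / spread > threshold for v in values]


def mad_flags(values, threshold=3.5):
    """Flag values far from the median, in units of median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [abs(v - med) / mad > threshold for v in values]


def knn_flags(values, k=2, threshold=1.0):
    """Flag values whose distance to their k-th nearest neighbour is large."""
    flags = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        flags.append(dists[k - 1] > threshold)
    return flags


def pseudo_labels(values, detectors):
    """True/False where all detectors agree, None where they disagree."""
    votes = [sum(col) for col in zip(*(d(values) for d in detectors))]
    n = len(detectors)
    return [True if v == n else False if v == 0 else None for v in votes]


def classify_1nn(values, labels):
    """Resolve ambiguous points with a 1-NN classifier fit on the pseudo-labels."""
    train = [(v, lab) for v, lab in zip(values, labels) if lab is not None]
    final = []
    for v, lab in zip(values, labels):
        if lab is None:
            lab = min(train, key=lambda pair: abs(pair[0] - v))[1]
        final.append(lab)
    return final


def detect_outliers(values, detectors=(zscore_flags, mad_flags, knn_flags)):
    """Return the values that the pseudo-label-trained classifier calls outliers."""
    labels = pseudo_labels(values, detectors)
    final = classify_1nn(values, labels)
    return [v for v, is_out in zip(values, final) if is_out]


# Hypothetical 1-D sample: ten clustered readings, two extremes, one borderline.
VALUES = [9.0, 9.5, 10.0, 10.2, 10.5, 9.8, 10.1, 9.9, 10.3, 9.7, 25.0, -8.0, 12.5]

if __name__ == "__main__":
    print(detect_outliers(VALUES))
```

In this sample the detectors disagree only on the borderline reading 12.5, so it receives no pseudo-label and is decided by the classifier rather than by any single detector's threshold.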