In many real-world classification applications such as fake news detection, the training data can be extremely imbalanced, which brings challenges to existing classifiers as the majority classes dominate the loss functions of classifiers. Oversampling techniques such as SMOTE are effective approaches to tackle the class imbalance problem by producing more synthetic minority samples. Despite their success, the majority of existing oversampling methods only consider local data distributions when generating minority samples, which can result in noisy minority samples that do not fit global data distributions or interleave with majority classes. Hence, in this paper, we study the class imbalance problem by simultaneously exploring local and global data information since: (i) the local data distribution could give detailed information for generating minority samples; and (ii) the global data distribution could provide guidance to avoid generating outliers or samples that interleave with majority classes. Specifically, we propose a novel framework GL-GAN, which leverages the SMOTE method to explore local distribution in a learned latent space and employs GAN to capture the global information, so that synthetic minority samples can be generated under even extremely imbalanced scenarios. Experimental results on diverse real data sets demonstrate the effectiveness of our GL-GAN framework in producing realistic and discriminative minority samples for improving the classification performance of various classifiers on imbalanced training data. Our code is available at https://github.com/wentao-repo/GL-GAN.
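The local interpolation step that SMOTE-style oversampling relies on can be sketched as follows. This is a minimal illustration of plain SMOTE-like interpolation in the input space, not the GL-GAN framework itself (which applies it in a learned latent space and adds a GAN); the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Assumes at least two minority points. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base point and one of its neighbours for each new sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate at a random position along the connecting segment.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])
```

Because every synthetic point lies on a segment between two existing minority points, the procedure uses only local structure, which is the limitation the abstract points to.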
Design Space Exploration in Sparse, Mixed Continuous/Discrete Spaces via Synthetically Enhanced Classification
Exploration of a design space is the first step in identifying sets of high-performing solutions to complex engineering problems. For this purpose, Bayesian network classifiers (BNCs) have been shown to be effective for mapping regions of interest in the design space, even when those regions of interest exhibit complex topologies. However, identifying sets of desirable solutions can be difficult with a BNC when attempting to map a space where high-performance designs are spread sparsely among a disproportionately large number of low-performance designs, resulting in an imbalanced classifier. In this paper, a method is presented that utilizes probabilities of class membership for known training points, combined with interpolation between those points, to generate synthetic high-performance points in a design space. By adding synthetic design points into the BNC training set, a designer can rebalance an imbalanced classifier and improve classification accuracy throughout the space. For demonstration, this approach is applied to an acoustics metamaterial design problem with a sparse design space characterized by a combination of discrete and continuous design variables.
Paper No: DETC2018-85274
- Award ID(s): 1641078
- PAR ID: 10106452
- Date Published:
- Journal Name: ASME IDETC/CIE Design Automation Conference, Quebec City, Quebec, Canada
- Page Range / eLocation ID: V02BT03A004
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Zhang, Zhaolei (Ed.) In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) on the DNA sequence is key to understanding the mechanism of translation regulation and mRNA metabolism. Although machine learning methods have been widely used to identify PAS computationally, the need for tremendous amounts of annotation data hinders the application of existing methods to species without experimental data on PAS. Therefore, cross-species PAS identification, which makes it possible to predict PAS in species not seen during training, naturally becomes a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolution Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species, build cross-species training sets from two of them, and evaluate performance on the remaining ones. Moreover, we test our method against insufficient data and imbalanced data issues and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when trained on a smaller or imbalanced training set.
-
We study the effects of injecting human-generated designs into the initial population of an evolutionary robotics experiment, where subsequent populations of robots are optimised via a Genetic Algorithm and MAP-Elites. First, human participants interact via a graphical front-end to explore a directly-parameterised legged robot design space and attempt to produce robots via a combination of intuition and trial-and-error that perform well in a range of environments. Environments are generated whose corresponding high-performance robot designs range from intuitive to complex and hard to grasp. Once the human designs have been collected, their impact on the evolutionary process is assessed by replacing a varying number of designs in the initial population with human designs and subsequently running the evolutionary algorithm. Our results suggest that a balance of random and hand-designed initial solutions provides the best performance for the problems considered, and that human designs are most valuable when the problem is intuitive. The influence of human design in an evolutionary algorithm is a highly understudied area, and the insights in this paper may be valuable to the area of AI-based design more generally.
-
This paper addresses the complex issue of resource-constrained scheduling, an NP-hard problem that spans critical areas including chip design and high-performance computing. Traditional scheduling methods often stumble over scalability and applicability challenges. We propose a novel approach using a differentiable combinatorial scheduling framework, utilizing the Gumbel-Softmax differentiable sampling technique. This new technique allows for a fully differentiable formulation of linear programming (LP) based scheduling, extending its application to a broader range of LP formulations. To encode inequality constraints for scheduling tasks, we introduce the constrained Gumbel Trick, which adeptly encodes arbitrary inequality constraints. Consequently, our method facilitates efficient and scalable scheduling via gradient descent without the need for training data. Comparative evaluations on both synthetic and real-world benchmarks highlight our capability to significantly improve the optimization efficiency of scheduling, surpassing state-of-the-art solutions offered by commercial and open-source solvers such as CPLEX, Gurobi, and CP-SAT in the majority of the designs.
-
In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to those from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes, which can be used to train less biased downstream models. We measure the performance of MCRAGE versus alternative approaches using Accuracy, F1 score and AUROC of these downstream models. We provide theoretical justification for our method in terms of recent convergence results for DDPMs.
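The Gumbel-Softmax sampling named in the scheduling abstract above can be illustrated with a short forward-pass sketch. It shows only the standard relaxation (Gumbel noise added to logits, followed by a temperature-scaled softmax), not the constrained Gumbel Trick or the scheduling formulation of that paper; the function name `gumbel_softmax` and the parameter `tau` are assumptions chosen for illustration, and NumPy is used here only to show the arithmetic, so no gradients are computed.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw one relaxed (soft) one-hot sample over the last axis.
    Hypothetical helper showing the standard Gumbel-Softmax forward pass."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise via the inverse-CDF transform of uniform samples.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax; subtract the max for numerical stability.
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Lower values of `tau` push the output closer to a one-hot vector, while higher values make it smoother. In an automatic-differentiation framework the same expression is differentiable with respect to the logits, which is what makes gradient-based optimization over discrete choices possible.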