
Title: Counterexample-guided data augmentation
We present a novel framework for augmenting data sets for machine learning based on counterexamples. Counterexamples are misclassified examples that have important properties for retraining and improving the model. Key components of our framework include a counterexample generator, which produces data items that are misclassified by the model, and error tables, a novel data structure that stores information pertaining to misclassifications. Error tables can be used to explain the model's vulnerabilities and to efficiently generate counterexamples for augmentation. We show the efficacy of the proposed framework by comparing it to classical augmentation techniques on a case study of object detection in autonomous driving based on deep neural networks.
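The interplay between a counterexample generator and an error table can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature names, the toy classifier, and every function name below are hypothetical, and the scene "features" stand in for whatever parameters a real image generator would expose. The sketch samples synthetic scenes, records the features of each misclassified scene in an error table, and then biases further sampling toward the feature values that occur most often among the errors.

```python
import random
from collections import Counter

# Hypothetical feature space for synthetic scenes (names are illustrative only).
FEATURES = {
    "brightness": ["dark", "normal", "bright"],
    "car_position": ["near", "far"],
}


def toy_model(scene):
    """Stand-in classifier (an assumption): fails on dark scenes with a far car."""
    wrong = scene["brightness"] == "dark" and scene["car_position"] == "far"
    return "no_car" if wrong else "car"


def sample_scene(rng, bias=None):
    """Draw a scene uniformly at random, then pin any features given in `bias`."""
    scene = {name: rng.choice(values) for name, values in FEATURES.items()}
    if bias:
        scene.update(bias)
    return scene


def build_error_table(model, rng, n_samples):
    """Counterexample generator: keep the features of every misclassified scene."""
    table = []
    for _ in range(n_samples):
        scene = sample_scene(rng)
        if model(scene) != "car":  # every toy scene contains a car
            table.append(scene)
    return table


def most_common_failure(table):
    """For each feature, return the value seen most often among the errors."""
    return {
        name: Counter(row[name] for row in table).most_common(1)[0][0]
        for name in FEATURES
    }


rng = random.Random(0)
error_table = build_error_table(toy_model, rng, n_samples=200)
bias = most_common_failure(error_table)

# Sample new scenes concentrated on the failure mode found in the error table.
augmentation_set = [sample_scene(rng, bias) for _ in range(10)]
```

In this toy setting the error table doubles as an explanation (which feature values coincide with failures) and as a sampling guide; a real pipeline would add the labeled counterexamples to the training set and retrain the detector.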
Award ID(s):
1645964
NSF-PAR ID:
10109211
Journal Name:
Proceedings of the ... International Conference on Artificial Intelligence
Page Range or eLocation-ID:
2071-2078
ISSN:
2521-7860
Sponsoring Org:
National Science Foundation
More Like this
  1. Tables provide valuable knowledge that can be used to verify textual statements. While a number of works have considered table-based fact verification, direct alignments of tabular data with tokens in textual statements are rarely available. Moreover, training a generalized fact verification model requires abundant labeled training data. In this paper, we propose a novel system to address these problems. Inspired by counterfactual causality, our system identifies token-level salience in the statement with probing-based salience estimation. Salience estimation allows enhanced learning of fact verification from two perspectives. From one perspective, our system conducts masked salient token prediction to enhance the model for alignment and reasoning between the table and the statement. From the other perspective, our system applies salience-aware data augmentation to generate a more diverse set of training instances by replacing non-salient terms. Experimental results on TabFact show the effective improvement by the proposed salience-aware learning techniques, leading to the new SOTA performance on the benchmark.
  2. The classification of Antibody Mediated Rejection (AMR) in kidney transplant remains challenging even for experienced nephropathologists; this is partly because histological tissue stain analysis is often characterized by low inter-observer agreement and poor reproducibility. One of the implicated causes for inter-observer disagreement is the variability of tissue stain quality between (and within) pathology labs, coupled with the gradual fading of archival sections. Variations in stain colors and intensities can make tissue evaluation difficult for pathologists, ultimately affecting their ability to describe relevant morphological features. Being able to accurately predict the AMR status based on kidney histology images is crucial for improving patient treatment and care. We propose a novel pipeline to build robust deep neural networks for AMR classification based on StyPath, a histological data augmentation technique that leverages a lightweight style-transfer algorithm as a means to reduce sample-specific bias. Each image was generated in 1.84 ± 0.03 s using a single GTX TITAN V GPU and PyTorch, making it faster than other popular histological data augmentation techniques. We evaluated our model using a Monte Carlo (MC) estimate of Bayesian performance and generated an epistemic measure of uncertainty to compare both the baseline and StyPath augmented models. We also generated Grad-CAM representations of the results which were assessed by an experienced nephropathologist; we used this qualitative analysis to elucidate the assumptions being made by each model. Our results imply that our style-transfer augmentation technique improves histological classification performance (reducing error from 14.8% to 11.5%) and generalization ability.
  3. Attributed networks are a type of graph structured data used in many real-world scenarios. Detecting anomalies on attributed networks has a wide spectrum of applications such as spammer detection and fraud detection. Although this research area has drawn increasing attention in the last few years, previous works are mostly unsupervised because of the high cost of labeling ground truth anomalies. Many recent studies have shown that different types of anomalies are often mixed together on attributed networks and that such invaluable human knowledge could provide complementary insights in advancing anomaly detection on attributed networks. To this end, we study the novel problem of modeling and integrating human knowledge of different anomaly types for attributed network anomaly detection. Specifically, we first model prior human knowledge through a novel data augmentation strategy. We then integrate the modeled knowledge in a Siamese graph neural network encoder through a well-designed contrastive loss. In the end, we train a decoder to reconstruct the original networks from the node representations learned by the encoder, and rank nodes according to their reconstruction error as the anomaly metric. Experiments on five real-world datasets demonstrate that the proposed framework outperforms the state-of-the-art anomaly detection algorithms.
  4. This work is centered on high-fidelity modeling, analysis, and rigorous experiments of vibrations and guided (Lamb) waves in a human skull in two connected tracks: (1) layered modeling of the cranial bone structure (with cortical tables and diploë) and its vibration-based elastic parameter identification (and validation); (2) transcranial leaky Lamb wave characterization experiments and radiation analyses using the identified elastic parameters in a layered semi-analytical finite element framework, followed by time transient simulations that consider the inner porosity as is. In the first track, non-contact vibration experiments are conducted to extract the first handful of modal frequencies in the auditory frequency regime, along with the associated damping ratios and mode shapes, of dry cranial bone segments extracted from the parietal and frontal regions of a human skull. Numerical models of the bone segments are built with a novel image reconstruction scheme that employs microcomputed tomographic scans to build a layered bone geometry with separate homogenized domains for the cortical tables and the diploë. These numerical models and the experimental modal frequencies are then used in an iterative parameter identification scheme that yields the cortical and diploic isotropic elastic moduli of each domain, whereas the corresponding densities are estimated using the total experimental mass and layer mass ratios obtained from the scans. With the identified elastic parameters, the average error between experimental and numerical modal frequencies is less than 1.5% and the modal assurance criterion values for most modes are above 0.90. Furthermore, the extracted parameters are in the range of the results reported in the literature.
In the second track, the focus is placed on the subject of leaky Lamb waves, which has received growing attention as a promising alternative to conventional ultrasound techniques for transcranial transmission, especially to access the brain periphery. Experiments are conducted on the same cranial bone segment set for leaky Lamb wave excitation and radiation characterization. The degassed skull bone segments are used in submersed experiments with an ultrasonic transducer and needle hydrophone setup for radiation pressure field scanning. Elastic parameters obtained from the first track are used in guided wave dispersion simulations, and the radiation angles are accurately predicted using the aforementioned layered model in the presence of fluid loading. The dominant radiation angles are shown to correspond to guided wave modes with low attenuation and a significant out-of-plane polarization. The experimental radiation spectra are finally compared against those obtained from time transient finite element simulations that leverage geometric models reconstructed from microcomputed tomographic scans.
  5. Mathelier, Anthony (Ed.)
    Motivation: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell-types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND (joint integration and discrimination for automated single-cell annotation), a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified. Results: We show on several batched datasets that the joint approach to integration and classification of JIND outperforms existing pipelines in accuracy, and a smaller fraction of cells is rejected as unlabeled as a result of the cell-specific confidence thresholds. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that they could be due to outliers in the annotated datasets or errors in the original approach used for annotation of the target batch. Availability and implementation: Implementation for JIND is available at https://github.com/mohit1997/JIND and the data underlying this article can be accessed at https://doi.org/10.5281/zenodo.6246322.
Supplementary information: Supplementary data are available at Bioinformatics online.