Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation, summarized into strengthening local decision boundaries, brute-force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and the characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation, such as consistency regularization, controllers, and offline and online augmentation pipelines, to name a few. Finally, we discuss interesting topics around Data Augmentation in NLP, such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
- Publisher / Repository: Springer Science + Business Media
- Journal Name: Journal of Big Data
- Medium: X
- Sponsoring Org: National Science Foundation
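Among the practical tools the survey previews is consistency regularization, which penalizes a model for predicting differently on an example and on its augmented copy. A minimal NumPy sketch of such a loss, with function names and probability vectors of our own invention rather than anything from the survey:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def consistency_loss(probs_original, probs_augmented):
    """Consistency regularization: the model should assign similar class
    probabilities to an input and to its augmented version."""
    return kl_divergence(probs_original, probs_augmented)

# Identical predictions incur no penalty; divergent ones are penalized.
p_clean = np.array([0.7, 0.2, 0.1])
p_aug = np.array([0.5, 0.3, 0.2])
assert consistency_loss(p_clean, p_clean) < 1e-9
assert consistency_loss(p_clean, p_aug) > 0.0
```

In an online augmentation pipeline this term is typically added to the supervised loss, so the augmented view needs no extra labels.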
More Like this
Data augmentation is a common practice to help generalization in the procedure of deep model training. In the context of physiological time series classification, previous research has primarily focused on label-invariant data augmentation methods. However, another class of augmentation techniques (i.e., Mixup) that emerged in the computer vision field has yet to be fully explored in the time series domain. In this study, we systematically review the mix-based augmentations, including mixup, cutmix, and manifold mixup, on six physiological datasets, evaluating their performance across different sensory data and classification tasks. Our results demonstrate that the three mix-based augmentations can consistently improve the performance on the six datasets. More importantly, the improvement does not rely on expert knowledge or extensive parameter tuning. Lastly, we provide an overview of the unique properties of the mix-based augmentation methods and highlight the potential benefits of using the mix-based augmentation in physiological time series data. Our code and results are available at https://github.com/comp-well-org/Mix-Augmentation-for-Physiological-Time-Series-Classification.
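As a concrete illustration, mixup forms a new training example as a convex combination of two samples and their one-hot labels, with the mixing weight drawn from a Beta(α, α) distribution; cutmix and manifold mixup follow the same recipe but mix contiguous segments or hidden representations instead. A minimal NumPy sketch on synthetic 1-D windows — names and data are ours, not the paper's code:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: interpolate two samples and their one-hot labels with a
    Beta(alpha, alpha)-distributed weight."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y, lam

# Two synthetic 1-D "physiological" windows with one-hot class labels.
x1, y1 = np.sin(np.linspace(0, 6, 100)), np.array([1.0, 0.0])
x2, y2 = np.cos(np.linspace(0, 6, 100)), np.array([0.0, 1.0])
x, y, lam = mixup(x1, y1, x2, y2)
assert 0.0 <= lam <= 1.0
assert np.isclose(y.sum(), 1.0)  # the mixed soft label still sums to 1
```

Because the label is mixed along with the input, no expert knowledge about label-preserving transformations is required, which matches the paper's observation that these methods need little tuning.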
Jovanovic, Jelena; Chounta, Irene-Angelica; Uhomoibhi, James; McLaren, Bruce (Eds.)
Computer-supported education studies can perform two important roles. They can allow researchers to gather important data about student learning processes, and they can help students learn more efficiently and effectively by providing automatic immediate feedback on what the students have done so far. The evaluation of student work required for both of these roles can be relatively easy in domains like math, where there are clear right answers. When text is involved, however, automated evaluations become more difficult. Natural Language Processing (NLP) can provide quick evaluations of student texts. However, traditional neural network approaches require a large amount of data to train models with enough accuracy to be useful in analyzing student responses. Typically, educational studies collect data but often only in small amounts and with a narrow focus on a particular topic. BERT-based neural network models have revolutionized NLP because they are pre-trained on very large corpora, developing a robust, contextualized understanding of the language. Then they can be “fine-tuned” on a much smaller set of data for a particular task. However, these models still need a certain base level of training data to be reasonably accurate, and that base level can exceed that provided by educational applications, which might contain only a few dozen examples. In other areas of artificial intelligence, such as computer vision, model performance on small data sets has been improved by “data augmentation” — adding scaled and rotated versions of the original images to the training set. This has been attempted on textual data; however, augmenting text is much more difficult than simply scaling or rotating images. The newly generated sentences may not be semantically similar to the original sentence, resulting in an improperly trained model.
In this paper, we examine a self-augmentation method that is straightforward and shows great improvements in performance with different BERT-based models in two different languages and on two different tasks that have small data sets. We also identify the limitations of the self-augmentation procedure.
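The paper's self-augmentation procedure is its own contribution; as a generic point of comparison, the simplest textual augmentations perturb a sentence with random word deletion and adjacent-word swaps (EDA-style). A sketch under our own illustrative names — this is NOT the paper's method, and its weakness is exactly the one noted above, that the variants may drift semantically:

```python
import random

def augment_sentence(sentence, p_delete=0.1, n_swaps=1, seed=0):
    """Generate a noisy variant of a sentence by randomly deleting words
    and swapping adjacent word pairs. A generic EDA-style augmentation,
    not the self-augmentation method described in the paper."""
    rng = random.Random(seed)
    words = sentence.split()
    # Randomly drop words; fall back to the original if everything is dropped.
    words = [w for w in words if rng.random() > p_delete] or words
    # Swap a few adjacent pairs to perturb word order.
    for _ in range(n_swaps):
        if len(words) > 1:
            i = rng.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

s = "the quick brown fox jumps over the lazy dog"
out = augment_sentence(s, p_delete=0.0, n_swaps=2, seed=1)
assert sorted(out.split()) == sorted(s.split())  # swaps alone preserve the word multiset
```

Each augmented variant keeps the original label, so a few dozen seed examples can be expanded several-fold before fine-tuning.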
Deep learning models tend not to be out-of-distribution robust primarily due to their reliance on spurious features to solve the task. Counterfactual data augmentations provide a general way of (approximately) achieving representations that are counterfactual-invariant to spurious features, a requirement for out-of-distribution (OOD) robustness. In this work, we show that counterfactual data augmentations may not achieve the desired counterfactual-invariance if the augmentation is performed by a context-guessing machine, an abstract machine that guesses the most-likely context of a given input. We theoretically analyze the invariance imposed by such counterfactual data augmentations and describe an exemplar NLP task where counterfactual data augmentation by a context-guessing machine does not lead to robust OOD classifiers.
Artificial intelligence (AI) can enhance teachers' capabilities by sharing control over different parts of learning activities. This is especially true for complex learning activities, such as dynamic learning transitions where students move between individual and collaborative learning in un‐planned ways, as the need arises. Yet, few initiatives have emerged considering how shared responsibility between teachers and AI can support learning and how teachers' voices might be included to inform design decisions. The goal of our article is twofold. First, we describe a secondary analysis of our co‐design process comprising six design methods to understand how teachers conceptualise sharing control with an AI co‐orchestration tool, called
Pair‐Up. We worked with 76 middle school math teachers, each taking part in one to three methods, to create a co‐orchestration tool that supports dynamic combinations of individual and collaborative learning using two AI‐based tutoring systems. We leveraged qualitative content analysis to examine teachers' views about sharing control with Pair‐Up, and we describe high‐level insights about the human‐AI interaction, including control, trust, responsibility, efficiency, and accuracy. Second, we use our results as an example showcasing how human‐centred learning analytics can be applied to the design of human‐AI technologies, and we share reflections for human‐AI technology designers regarding the methods that might be fruitful for eliciting teacher feedback and ideas. Our findings illustrate the design of a novel co‐orchestration tool to facilitate the transitions between individual and collaborative learning and highlight considerations and reflections for designers of similar systems.
Practitioner notes
What is already known about this topic:
- Artificial Intelligence (AI) can help teachers facilitate complex classroom activities, such as having students move between individual and collaborative learning in unplanned ways.
- Designers should use human‐centred design approaches to give teachers a voice in deciding what AI might do in the classroom and if or how they want to share control with it.
What this paper adds:
- Presents teacher views about how they want to share control with AI to support students moving between individual and collaborative learning.
- Describes how we adapted six design methods to design AI features.
- Illustrates a complete, iterative process to create human‐AI interactions that support teachers as they facilitate students moving from individual to collaborative learning.
Implications for practice:
- We share five implications for designers that teachers highlighted as necessary when designing AI features, including control, trust, responsibility, efficiency and accuracy.
- Our work also includes a reflection on our design process and implications for future design processes.
Deep learning approaches currently achieve the state-of-the-art results on camera-based vital signs measurement. One of the main challenges with using neural models for these applications is the lack of sufficiently large and diverse datasets. Limited data increases the chances of overfitting models to the available data, which in turn can harm generalization. In this paper, we show that the generalizability of imaging photoplethysmography models can be improved by augmenting the training set with "magnified" videos. These augmentations are specifically designed to reveal useful features for recovering the photoplethysmogram. We show that using augmentations of this form is more effective at improving model robustness than other commonly used data augmentation approaches. We show better within-dataset and especially cross-dataset performance with our proposed data augmentation approach on three publicly available datasets.
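The paper magnifies subtle variations in video; as a crude 1-D analogue of the idea, one can amplify a chosen frequency band of a signal via the FFT so that the pulse-like component stands out. All names, frequencies, and data below are our own illustration, not the paper's pipeline:

```python
import numpy as np

def magnify_band(signal, fs, f_lo, f_hi, gain=2.0):
    """Amplify the frequency components in [f_lo, f_hi] Hz — a crude 1-D
    analogue of video magnification (illustrative, not the paper's method)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[band] *= gain
    return np.fft.irfft(spectrum, n=len(signal))

fs = 30.0                      # a typical camera frame rate, in Hz
t = np.arange(300) / fs        # 10 seconds of samples
pulse = 0.1 * np.sin(2 * np.pi * 1.2 * t)        # weak ~72 bpm cardiac component
drift = 0.5 * np.sin(2 * np.pi * 0.25 * t)       # respiration-like low-frequency drift
signal = pulse + drift
magnified = magnify_band(signal, fs, f_lo=0.7, f_hi=4.0, gain=3.0)
```

Band-limiting the gain to the plausible heart-rate range (here 0.7–4 Hz) is what makes such an augmentation reveal pulse features rather than amplify everything uniformly.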