Title: Iterative Paraphrastic Augmentation with Discriminative Span Alignment
Abstract: We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing datasets or the rapid creation of new datasets using a small, manually produced seed corpus. We demonstrate our approach with experiments on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. With four days of training data collection for a span alignment model and one day of parallel compute, we automatically generate and release to the community 495,300 unique (Frame, Trigger) pairs in diverse sentential contexts, a roughly 50-fold expansion atop FrameNet v1.7. The resulting dataset is intrinsically and extrinsically evaluated in detail, showing positive results on a downstream task.
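As a rough illustration of the paraphrase-then-align loop the abstract describes, the sketch below runs a seed (Frame, Trigger, sentence) triple through a paraphraser and a span aligner. Both components are deliberately trivial stand-ins (string templates and a difflib similarity score), not the lexically constrained paraphraser or trained discriminative aligner from the paper; they only show the shape of the data flow.

```python
# Minimal sketch of an iterative paraphrase-then-align augmentation loop.
# The paraphraser and aligner below are trivial stand-ins, NOT the authors'
# models: a real system would use lexically constrained neural paraphrasing
# and a trained discriminative span aligner.
from difflib import SequenceMatcher

def paraphrase(sentence, keep_phrase):
    """Stand-in for lexically constrained paraphrasing: returns rewrites
    that are forced to retain `keep_phrase` (here, crude templates)."""
    assert keep_phrase in sentence
    return [sentence.replace("quickly", "rapidly"),
            "It is true that " + sentence[0].lower() + sentence[1:]]

def align_span(paraphrase_text, source_span):
    """Stand-in for discriminative span alignment: pick the candidate span
    in the paraphrase most similar to the source span."""
    tokens = paraphrase_text.split()
    best, best_score = None, -1.0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 4, len(tokens)) + 1):
            cand = " ".join(tokens[i:j])
            score = SequenceMatcher(None, cand, source_span).ratio()
            if score > best_score:
                best, best_score = cand, score
    return best

seed = [("Motion", "moved", "The crowd moved quickly toward the exit.")]
augmented = []
for frame, trigger, sent in seed:
    for para in paraphrase(sent, trigger):
        new_trigger = align_span(para, trigger)   # recover the trigger span
        augmented.append((frame, new_trigger, para))

for row in augmented:
    print(row)
```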
Award ID(s): 1749025
PAR ID: 10293091
Author(s) / Creator(s):
Date Published:
Journal Name: Transactions of the Association for Computational Linguistics
Volume: 9
ISSN: 2307-387X
Page Range / eLocation ID: 494 to 509
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Measuring the organization of the cellular cytoskeleton and the surrounding extracellular matrix (ECM) is currently of wide interest, as changes in both local and global alignment can highlight alterations in cellular functions and in the material properties of the extracellular environment. Different approaches have been developed to quantify these structures, typically based on fiber segmentation or on matrix representation and transformation of the image, each with its own advantages and disadvantages. Here we present AFT (Alignment by Fourier Transform), a workflow to quantify the alignment of fibrillar features in microscopy images using 2D Fast Fourier Transforms (FFTs). Using pre-existing datasets of cell and ECM images, we demonstrate our approach and compare and contrast this workflow with two other well-known ImageJ algorithms for quantifying image feature alignment. These comparisons reveal that AFT has a number of advantages owing to its grid-based FFT approach. (1) Flexibility in defining the window and neighborhood sizes allows a parameter search to determine the optimal length scale at which to compute alignment metrics; the approach can thus easily accommodate different image resolutions and biological systems. (2) The length scale of decay in alignment can be extracted by comparing neighborhood sizes, revealing the overall distance over which features remain anisotropic. (3) The approach is agnostic to the signal source, making it applicable to a wide range of imaging modalities, and it depends on fewer input parameters than segmentation methods. (4) Finally, compared with segmentation methods, the algorithm is computationally inexpensive: high-resolution images can be evaluated in less than a second on a standard desktop computer, making it feasible to screen numerous experimental perturbations or examine large images over long length scales. Implementation is made available in both MATLAB and Python for wider accessibility, with example datasets for single images and batch processing. Additionally, we include an approach to automatically search for optimal window and neighborhood sizes, as well as to measure the decay in alignment over progressively increasing length scales.
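A minimal sketch of the grid-based FFT idea behind AFT is given below: each window's dominant orientation is estimated from the second moments of its FFT power spectrum, and a global order parameter summarizes how consistent those orientations are. The window size, step, and order parameter are illustrative choices, not the released MATLAB/Python implementation.

```python
# Minimal sketch of grid-based 2D-FFT alignment measurement in the spirit of
# AFT (not the released implementation). Window/step sizes are illustrative.
import numpy as np

def window_orientation(window):
    """Dominant fiber orientation of one window from its FFT power spectrum."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(window - window.mean()))) ** 2
    h, w = spec.shape
    y, x = np.mgrid[0:h, 0:w]
    y, x = y - h // 2, x - w // 2
    # Second moments of the power spectrum; the spectrum is elongated
    # perpendicular to the real-space fiber direction.
    m = spec.sum()
    cxx = (spec * x * x).sum() / m
    cyy = (spec * y * y).sum() / m
    cxy = (spec * x * y).sum() / m
    theta_spec = 0.5 * np.arctan2(2 * cxy, cxx - cyy)
    return theta_spec + np.pi / 2  # rotate back to real-space orientation

def alignment_map(image, win=64, step=32):
    """Orientation per grid window, plus a global order parameter |<exp(2i*theta)>|."""
    thetas = []
    for r in range(0, image.shape[0] - win + 1, step):
        for c in range(0, image.shape[1] - win + 1, step):
            thetas.append(window_orientation(image[r:r + win, c:c + win]))
    thetas = np.array(thetas)
    order = np.abs(np.mean(np.exp(2j * thetas)))  # 1 = perfectly aligned
    return thetas, order

# Synthetic test: horizontal stripes should give a high order parameter.
img = np.sin(np.linspace(0, 40 * np.pi, 256))[:, None] * np.ones((1, 256))
_, order = alignment_map(img)
print(f"order parameter: {order:.2f}")
```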
  2. Recent work on Question Answering (QA) and Conversational QA (ConvQA) emphasizes the role of retrieval: a system first retrieves evidence from a large collection and then extracts answers. This open-retrieval setting typically assumes that each question is answerable by a single span of text within a particular passage (a span answer). The supervision signal is thus derived from whether or not the system can recover an exact match of this ground-truth answer span from the retrieved passages. This method is referred to as span-match weak supervision. However, information-seeking conversations are challenging for this span-match method, since long answers, especially freeform answers, are not necessarily strict spans of any passage. We therefore introduce a learned weak supervision approach that can identify a paraphrased span of the known answer in a passage. Our experiments on the QuAC and CoQA datasets show that although a span-match weak supervisor can handle conversations with span answers, it is not sufficient for freeform answers generated by people. We further demonstrate that our method is more flexible, since it can handle both span answers and freeform answers. In particular, our method outperforms the span-match method on conversations with freeform answers, and it can be more powerful when combined with the span-match method. We also conduct in-depth analyses to provide further insight into open-retrieval ConvQA under a weak supervision setting.
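The contrast between span-match weak supervision and a learned supervisor that accepts paraphrased spans can be sketched as follows. Plain token F1 stands in here for the trained matcher described in the abstract; the passage, answer, and threshold are invented for illustration.

```python
# Minimal sketch contrasting span-match weak supervision with a "learned"
# supervisor that scores paraphrased spans. Token F1 is a stand-in scorer.
def token_f1(candidate, gold):
    a, b = candidate.lower().split(), gold.lower().split()
    common = sum(min(a.count(t), b.count(t)) for t in set(a))
    if not common:
        return 0.0
    p, r = common / len(a), common / len(b)
    return 2 * p * r / (p + r)

def span_match_signal(passage, gold_answer):
    """Classic weak supervision: positive only if the exact answer appears."""
    return gold_answer.lower() in passage.lower()

def learned_signal(passage, gold_answer, width=8, threshold=0.6):
    """Stand-in learned supervisor: best-scoring candidate span vs. the answer."""
    toks = passage.split()
    best = max(
        (token_f1(" ".join(toks[i:i + w]), gold_answer)
         for i in range(len(toks)) for w in range(1, width + 1)),
        default=0.0,
    )
    return best >= threshold, best

passage = "The treaty was signed in Paris during the autumn of 1783."
freeform_answer = "It was signed in the fall of 1783 in Paris"
print(span_match_signal(passage, freeform_answer))   # False: no exact span
print(learned_signal(passage, freeform_answer))      # (True, score ~0.67)
```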
  3. Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. Existing tools for music error detection have two limitations: (1) existing approaches rely on automatic alignment and are therefore prone to errors caused by small deviations between alignment targets; (2) there is insufficient data to train music error detection models, resulting in over-reliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent-space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, unlike existing transcription methods repurposed for music error detection, our model can handle multiple instruments.
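The data-generation idea (injecting known errors into clean scores so that labels come for free) might look roughly like the sketch below. The note representation, perturbation types, and rates are illustrative assumptions, not Polytune's actual pipeline.

```python
# Minimal sketch of synthetic error injection for music error-detection data.
# The score representation and perturbation rates are illustrative only.
import random

def inject_errors(score, p_wrong_note=0.1, p_rhythm=0.1, seed=0):
    """score: list of (midi_pitch, onset_beats, duration_beats).
    Returns (corrupted_score, labels) where each note is labeled
    'correct', 'wrong_note', or 'rhythm_error'."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for pitch, onset, dur in score:
        r = rng.random()
        if r < p_wrong_note:
            # replace the pitch by a nearby wrong note
            corrupted.append((pitch + rng.choice([-2, -1, 1, 2]), onset, dur))
            labels.append("wrong_note")
        elif r < p_wrong_note + p_rhythm:
            # shift the onset to simulate a rhythmic mistake
            corrupted.append((pitch, onset + rng.choice([-0.25, 0.25]), dur))
            labels.append("rhythm_error")
        else:
            corrupted.append((pitch, onset, dur))
            labels.append("correct")
    return corrupted, labels

reference = [(60, 0.0, 1.0), (62, 1.0, 1.0), (64, 2.0, 1.0), (65, 3.0, 1.0)]
performed, labels = inject_errors(reference, p_wrong_note=0.5, seed=3)
for note, label in zip(performed, labels):
    print(note, label)
```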
  4. Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that, given a large-scale dialogue dataset in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and to eliminate the costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that this succinct representation reduces the compounding effect of translation errors without harming accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On the RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets, we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy, respectively. We present a comprehensive error analysis for all three datasets, showing that erroneous annotations can lead to misguided judgments of model quality. Finally, we present RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original, showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations. We release our datasets and software as open source.
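A small sketch of the succinct parser input described above (the formal slots and values plus only the last agent and user utterances) is shown below; the separator tokens and slot names are illustrative, not the paper's exact serialization.

```python
# Minimal sketch of a compact contextual-parsing input: formal dialogue state
# plus only the last agent and user turns. Field names/separators are assumed.
def build_parser_input(state, last_agent_utt, last_user_utt):
    # serialize the state as "domain.slot = value" pairs in a stable order
    state_str = " ; ".join(f"{d}.{s} = {v}" for (d, s), v in sorted(state.items()))
    return f"<state> {state_str} <agent> {last_agent_utt} <user> {last_user_utt}"

state = {("hotel", "area"): "centre", ("hotel", "stars"): "4"}
print(build_parser_input(
    state,
    "I found three 4-star hotels in the centre. Any price preference?",
    "Something cheap, and book it for two nights.",
))
```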
  5. In contrast to the conventional approach of directly comparing genomic sequences using sequence alignment tools, we propose a computational approach that performs comparisons between sequence generators. These sequence generators are learned via a data-driven approach that empirically computes the state machine generating the genomic sequence of interest. Because the state-machine-based generator of a sequence is independent of the sequence length, it provides an efficient way to compute the statistical distance between large sets of genomic sequences. Moreover, our technique provides a fast and efficient method to cluster large datasets of genomic sequences, characterize their temporal and spatial evolution in a continuous manner, and gain insight into locality-sensitive information about the sequences, all without any need for alignment. Furthermore, we show that the technique can detect local regions with mutation activity, which can then be used to aid alignment techniques in the fast discovery of mutations. To demonstrate the efficacy of our technique on real genomic data, we cluster different strains of SARS-CoV-2 viral sequences, characterize their evolution, and identify regions of the viral sequence with mutations.
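One simple way to picture comparison via learned generators is an order-k Markov "state machine" over k-mers, with sequences compared by a distance between their transition distributions. The sketch below uses that stand-in; the paper's actual generator learning and statistical distance may differ.

```python
# Minimal sketch of alignment-free comparison through learned generators:
# each sequence is summarized by an empirical order-k Markov model over
# k-mers, and sequences are compared by a distance between those models.
from collections import Counter

def transition_model(seq, k=3):
    """Empirical P(next_base | previous k-mer) as nested Counters."""
    model = {}
    for i in range(len(seq) - k):
        state, nxt = seq[i:i + k], seq[i + k]
        model.setdefault(state, Counter())[nxt] += 1
    return model

def model_distance(m1, m2, alphabet="ACGT"):
    """Mean total-variation distance between the two generators, averaged
    over the k-mer states observed in either sequence."""
    states = set(m1) | set(m2)
    total = 0.0
    for s in states:
        c1, c2 = m1.get(s, Counter()), m2.get(s, Counter())
        n1, n2 = sum(c1.values()) or 1, sum(c2.values()) or 1
        total += 0.5 * sum(abs(c1[b] / n1 - c2[b] / n2) for b in alphabet)
    return total / len(states)

a = "ACGTACGTTACGGACGTACGTACGAACGT" * 5
b = a.replace("GGA", "GTA")          # a small local "mutation"
c = "TTTTGGGGCCCCAAAATTTTGGGGCCCC" * 5
print(model_distance(transition_model(a), transition_model(b)))  # small
print(model_distance(transition_model(a), transition_model(c)))  # large
```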