Title: Iterative Paraphrastic Augmentation with Discriminative Span Alignment
Abstract: We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing datasets or the rapid creation of new datasets using a small, manually produced seed corpus. We demonstrate our approach with experiments on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. With four days of training data collection for a span alignment model and one day of parallel compute, we automatically generate and release to the community 495,300 unique (Frame, Trigger) pairs in diverse sentential contexts, a roughly 50-fold expansion atop FrameNet v1.7. The resulting dataset is intrinsically and extrinsically evaluated in detail, showing positive results on a downstream task.
Award ID(s):
1749025
NSF-PAR ID:
10293091
Journal Name:
Transactions of the Association for Computational Linguistics
Volume:
9
ISSN:
2307-387X
Page Range / eLocation ID:
494 to 509
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Measuring the organization of the cellular cytoskeleton and the surrounding extracellular matrix (ECM) is currently of wide interest, as changes in both local and global alignment can highlight alterations in cellular functions and material properties of the extracellular environment. Different approaches have been developed to quantify these structures, typically based on fiber segmentation or on matrix representation and transformation of the image, each with its own advantages and disadvantages. Here we present AFT (Alignment by Fourier Transform), a workflow to quantify the alignment of fibrillar features in microscopy images exploiting 2D Fast Fourier Transforms (FFT). Using pre-existing datasets of cell and ECM images, we demonstrate our approach and compare and contrast this workflow with two other well-known ImageJ algorithms to quantify image feature alignment. These comparisons reveal that AFT has a number of advantages due to its grid-based FFT approach. 1) Flexibility in defining the window and neighborhood sizes allows for performing a parameter search to determine an optimal length scale to carry out alignment metrics. This approach can thus easily accommodate different image resolutions and biological systems. 2) The length scale of decay in alignment can be extracted by comparing neighborhood sizes, revealing the overall distance that features remain anisotropic. 3) The approach is agnostic to the signal source, making it applicable to a wide range of imaging modalities, and it depends on fewer input parameters than segmentation methods. 4) Finally, compared to segmentation methods, this algorithm is computationally inexpensive, as high-resolution images can be evaluated in less than a second on a standard desktop computer. This makes it feasible to screen numerous experimental perturbations or examine large images over long length scales. 
Implementation is made available in both MATLAB and Python for wider accessibility, with example datasets for single images and batch processing. Additionally, we include an approach to automatically search parameters for optimum window and neighborhood sizes, as well as to measure the decay in alignment over progressively increasing length scales. 
  2. Modern knowledge bases have matured to the extent of being capable of complex reasoning at scale. Unfortunately, wide deployment of this technology is still hindered by the fact that specifying the requisite knowledge requires skills that most domain experts do not have, and skilled knowledge engineers are in short supply. A way around this problem could be to acquire knowledge from text. However, the current knowledge acquisition technologies for information extraction are not up to the task because logic reasoning systems are extremely sensitive to errors in the acquired knowledge, and existing techniques lack the required accuracy by too large a margin. Because of the enormous complexity of the problem, controlled natural languages (CNLs) were proposed in the past, but even they lack high enough accuracy. Instead of tackling the general problem of text understanding, our interest is in a related, but different, area of knowledge authoring: a technology designed to enable domain experts to manually create formalized knowledge using CNL. Our approach adopts and formalizes the FrameNet methodology for representing the meaning, enables incrementally-learnable and explainable semantic parsing, and harnesses rich knowledge graphs like BabelNet in the quest to obtain unique, disambiguated meaning of CNL sentences. Our experiments show that this approach is 95.6% accurate in standardizing the semantic relations extracted from CNL sentences, far superior to alternative systems. 
  3. Modern knowledge bases have matured to the extent of being capable of complex reasoning at scale. Unfortunately, wide deployment of this technology is still hindered by the fact that specifying the requisite knowledge requires skills that most domain experts do not have, and skilled knowledge engineers are in short supply. A way around this problem could be to acquire knowledge from text. However, the current knowledge acquisition technologies for information extraction are not up to the task because logic reasoning systems are extremely sensitive to errors in the acquired knowledge, and existing techniques lack the required accuracy by too large a margin. Because of the enormous complexity of the problem, controlled natural languages (CNLs) were proposed in the past, but even they lack high enough accuracy. Instead of tackling the general problem of text understanding, our interest is in a related, but different, area of knowledge authoring: a technology designed to enable domain experts to manually create formalized knowledge using CNL. Our approach adopts and formalizes the FrameNet methodology for representing the meaning, enables incrementally-learnable and explainable semantic parsing, and harnesses rich knowledge graphs like BabelNet in the quest to obtain unique, disambiguated meaning of CNL sentences. Our experiments show that this approach is 95.6% accurate in standardizing the semantic relations extracted from CNL sentences, far superior to alternative systems. 
  4. Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and eliminate costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that the succinct representation reduces the compounding effect of translation errors, without harming the accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for all three datasets showing erroneous annotations can lead to misguided judgments on the quality of the model. Finally, we present RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations. We release our datasets and software open source. 
  5. Context. Molecular filaments and hubs have received special attention recently thanks to new studies showing their key role in star formation. While the (column) density and velocity structures of both filaments and hubs have been carefully studied, their magnetic field (B-field) properties have yet to be characterized. Consequently, the role of B-fields in the formation and evolution of hub-filament systems is not well constrained. Aims. We aim to understand the role of the B-field and its interplay with turbulence and gravity in the dynamical evolution of the NGC 6334 filament network that harbours cluster-forming hubs and high-mass star formation. Methods. We present new observations of the dust polarized emission at 850 μm toward the 2 pc × 10 pc map of NGC 6334 at a spatial resolution of 0.09 pc obtained with the James Clerk Maxwell Telescope (JCMT) as part of the B-field In STar-forming Region Observations (BISTRO) survey. We study the distribution and dispersion of the polarized intensity (PI), the polarization fraction (PF), and the plane-of-the-sky B-field angle ($\chi_{B_{\mathrm{POS}}}$) toward the whole region, along the 10 pc-long ridge and along the sub-filaments connected to the ridge and the hubs. We derived the power spectra of the intensity and $\chi_{B_{\mathrm{POS}}}$ along the ridge crest and compared them with the results obtained from simulated filaments. Results. The observations span ~3 orders of magnitude in Stokes I and PI and ~2 orders of magnitude in PF (from ~0.2 to ~20%). A large scatter in PI and PF is observed for a given value of I. Our analyses show a complex B-field structure when observed over the whole region (~10 pc); however, at smaller scales (~1 pc), $\chi_{B_{\mathrm{POS}}}$ varies coherently along the crests of the filament network. The observed power spectrum of $\chi_{B_{\mathrm{POS}}}$ can be well represented with a power law function with a slope of −1.33 ± 0.23, which is ~20% shallower than that of I. 
We find that this result is compatible with the properties of simulated filaments and may indicate the physical processes at play in the formation and evolution of star-forming filaments. Along the sub-filaments, $\chi_{B_{\mathrm{POS}}}$ rotates from being mostly perpendicular or randomly oriented with respect to the crests to mostly parallel as the sub-filaments merge with the ridge and hubs. This variation of the B-field structure along the sub-filaments may be tracing local velocity flows of infalling matter in the ridge and hubs. Our analysis also suggests a variation in the energy balance along the crests of these sub-filaments, from magnetically critical or supercritical at their far ends to magnetically subcritical near the ridge and hubs. We also detect an increase in PF toward the high-column density ($N_{\mathrm{H_2}} \gtrsim 10^{23}\,\mathrm{cm^{-2}}$) star cluster-forming hubs. These latter large PF values may be explained by the increase in grain alignment efficiency due to stellar radiation from the newborn stars, combined with an ordered B-field structure. Conclusions. These observational results reveal for the first time the characteristics of the small-scale (down to ~0.1 pc) B-field structure of a 10 pc-long hub-filament system. Our analyses show variations in the polarization properties along the sub-filaments that may be tracing the evolution of their physical properties during their interaction with the ridge and hubs. We also detect an impact of feedback from young high-mass stars on the local B-field structure and the polarization properties, which could put constraints on possible models for dust grain alignment and provide important hints as to the interplay between the star formation activity and interstellar B-fields. 