Title: Cross-linguistic semantic annotation: reconciling the language-specific and the universal
Developers of cross-lingual semantic annotation schemes face a number of issues not encountered in monolingual annotation. This paper discusses four such issues, related to the establishment of annotation labels, and the treatment of languages with more fine-grained, more coarse-grained, and cross-cutting categories. We propose that a lattice-like architecture of the annotation categories can adequately handle all four issues, and at the same time remain both intuitive for annotators and faithful to typological insights. This position is supported by a brief annotation experiment.
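The lattice idea can be made concrete with a toy sketch: each language annotates at the most specific node it distinguishes, and cross-linguistic comparison backs off to the deepest category both labels share. The category names and structure below are illustrative inventions, not the paper's actual label set.

```python
# Toy lattice of aspectual categories (illustrative names only):
# edges point from a child category to its parent(s).
PARENTS = {
    "perfective": {"aspect"},
    "imperfective": {"aspect"},
    "habitual": {"imperfective"},
    "progressive": {"imperfective"},
    "aspect": set(),
}

def ancestors(label):
    """Return the label together with all of its ancestors in the lattice."""
    seen = {label}
    stack = [label]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def reconcile(label_a, label_b):
    """Deepest category that subsumes both labels."""
    shared = ancestors(label_a) & ancestors(label_b)
    # the deepest shared node is the one with the most ancestors of its own
    return max(shared, key=lambda l: len(ancestors(l)))

# A fine-grained label from one language and a different fine-grained
# label from another reconcile to the coarser category they share:
print(reconcile("habitual", "progressive"))  # imperfective
print(reconcile("habitual", "perfective"))   # aspect
```

This is what lets a coarse-grained language's annotation remain comparable to a fine-grained one's without forcing either into the other's distinctions.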
Briakou, Eleftheria; Carpuat, Marine. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on the Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence-pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.
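The ranking objective described above can be sketched with a plain margin ranking loss: a synthetically coarsened (more divergent) example should outscore a finer one by at least a margin. This is a minimal stand-in, not the authors' code; the scores would come from a multilingual BERT scorer in the actual system.

```python
def margin_ranking_loss(score_coarse, score_fine, margin=1.0):
    """Hinge loss that is zero once the coarser (more divergent) example
    outscores the finer one by at least `margin`."""
    return max(0.0, margin - (score_coarse - score_fine))

# Training pairs are ordered by divergence granularity; the loss pushes
# the model's scores to respect that ordering.
pairs = [
    (2.5, 0.4),  # well separated -> zero loss
    (1.0, 0.9),  # separated by less than the margin -> small loss
    (0.2, 1.5),  # mis-ordered -> large loss
]
losses = [margin_ranking_loss(c, f) for c, f in pairs]
```

Ranking pairs of varying granularity, rather than classifying each example in isolation, is what lets the model separate fine-grained divergences from coarse ones.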
Hou, Shudi; Xia, Yu; Chen, Muhao; Li, Sujian. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.
Traditional text classification typically categorizes texts into pre-defined coarse-grained classes, from which the produced models cannot handle the real-world scenario where finer categories emerge periodically for accurate services. In this work, we investigate the setting where fine-grained classification is done only using the annotation of coarse-grained categories and the coarse-to-fine mapping. We propose a lightweight contrastive clustering-based bootstrapping method to iteratively refine the labels of passages. During clustering, it pulls away negative passage-prototype pairs under the guidance of the mapping from both global and local perspectives. Experiments on NYT and 20News show that our method outperforms the state-of-the-art methods by a large margin.
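The core constraint in this setting can be sketched as follows: a passage's fine label is chosen as the nearest fine-class prototype, but only among fine classes admitted by the coarse-to-fine mapping for its coarse label. The mapping, prototypes, and similarity here are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def refine(passage_vec, coarse_label, prototypes, coarse_to_fine):
    """Assign the admissible fine class with the closest prototype."""
    candidates = coarse_to_fine[coarse_label]
    return max(candidates, key=lambda f: cosine(passage_vec, prototypes[f]))

# Toy mapping and prototype vectors (invented for illustration):
coarse_to_fine = {"sports": ["soccer", "tennis"], "politics": ["elections"]}
prototypes = {"soccer": [1.0, 0.1], "tennis": [0.1, 1.0], "elections": [0.5, 0.5]}

# "elections" is excluded by the mapping regardless of similarity.
print(refine([0.9, 0.2], "sports", prototypes, coarse_to_fine))  # soccer
```

In the paper this assignment step is iterated, with contrastive clustering pulling passages away from negative prototypes to sharpen the labels over rounds.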
Ge, Sijia; Wright-Bettner, Kristin; Myers, Skatje; Xue, Nianwen; Palmer, Martha. Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII).
UMR-Writer is a web-based tool for annotating semantic graphs with the Uniform Meaning Representation (UMR) scheme. UMR is a graph-based semantic representation that can be applied cross-linguistically for deep semantic analysis of texts. In this work, we implemented a new keyboard interface in UMR-Writer 2.0, which is a powerful addition to the original mouse interface, supporting faster annotation for more experienced annotators. The new interface also addresses issues with the original mouse interface. Additionally, we demonstrate an efficient workflow for annotation project management in UMR-Writer 2.0, which has been applied to many projects.
Fine-grained entity typing (FET) is the task of identifying specific entity types at a fine-grained level for entity mentions based on their contextual information. Conventional methods for FET require extensive human annotation, which is time-consuming and costly given the massive scale of data. Recent studies have been developing weakly supervised or zero-shot approaches. We study the setting of zero-shot FET where only an ontology is provided. However, most existing ontology structures lack rich supporting information and even contain ambiguous relations, making them ineffective in guiding FET. Recently developed language models, though promising in various few-shot and zero-shot NLP tasks, may face challenges in zero-shot FET due to their lack of interaction with the task-specific ontology. In this study, we propose OnEFET, where we (1) enrich each node in the ontology structure with two categories of extra information: instance information for training sample augmentation and topic information to relate types with contexts, and (2) develop a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples. Our experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods. OnEFET also enjoys strong transferability to unseen and finer-grained types. Code is available at https://github.com/ozyyshr/OnEFET.
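The coarse-to-fine control flow can be sketched in a few lines: score coarse types first, then descend only into the winning type's children, rather than comparing all fine types flatly. The ontology and the keyword-count scorer below are toy stand-ins; OnEFET uses a trained entailment model over enriched ontology nodes.

```python
# Toy two-level ontology (invented for illustration).
ONTOLOGY = {"person": ["athlete", "politician"], "location": ["city"]}

def coarse_to_fine_type(mention_context, score, ontology):
    """Pick the best coarse type, then the best fine type among its
    children only -- the coarse decision prunes the fine search."""
    coarse = max(ontology, key=lambda t: score(mention_context, t))
    fine = max(ontology[coarse], key=lambda t: score(mention_context, t))
    return coarse, fine

def toy_score(context, type_name):
    """Stand-in scorer: how often the type name appears in the context."""
    return context.count(type_name)

print(coarse_to_fine_type("the athlete, a person, won gold",
                          toy_score, ONTOLOGY))  # ('person', 'athlete')
```

Pruning by coarse type keeps the fine-grained comparison among semantically plausible siblings, which matters as ontologies grow deep.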
ABSTRACT Bacterial mobile genetic elements (MGEs) encode functional modules that perform both core and accessory functions for the element, the latter of which are often only transiently associated with the element. The presence of these accessory genes, which are often close homologs to primarily immobile genes, incurs high rates of false positives and therefore limits the usability of existing databases for MGE annotation. To overcome this limitation, we analyzed 10,776,849 protein sequences derived from eight MGE databases to compile a comprehensive set of 6,140 manually curated protein families that are linked to the "life cycle" (integration/excision, replication/recombination/repair, transfer, stability/transfer/defense, and phage-specific processes) of plasmids, phages, integrative, transposable, and conjugative elements. We overlay experimental information where available to create a tiered annotation scheme of high-quality annotations and annotations inferred exclusively through bioinformatic evidence. We additionally provide an MGE-class label for each entry (e.g., plasmid or integrative element), and assign to each entry a major and minor category. The resulting database, mobileOG-db (for mobile orthologous groups), comprises over 700,000 deduplicated sequences encompassing five major mobileOG categories and more than 50 minor categories, providing a structured language and interpretable basis for an array of MGE-centered analyses. mobileOG-db can be accessed at mobileogdb.flsi.cloud.vt.edu/, where users can select, refine, and analyze custom subsets of the dynamic mobilome.
IMPORTANCE The analysis of bacterial mobile genetic elements (MGEs) in genomic data is a critical step toward profiling the root causes of antibiotic resistance, phenotypic or metabolic diversity, and the evolution of bacterial genera. Existing methods for MGE annotation pose high barriers of biological and computational expertise to properly harness.
To bridge this gap, we systematically analyzed 10,776,849 proteins derived from eight databases of MGEs to identify 6,140 MGE protein families that can serve as candidate hallmarks, i.e., proteins that can be used as “signatures” of MGEs to aid annotation. The resulting resource, mobileOG-db, provides a multilevel classification scheme that encompasses plasmid, phage, integrative, and transposable element protein families categorized into five major mobileOG categories and more than 50 minor categories. mobileOG-db thus provides a rich resource for simple and intuitive element annotation that can be integrated seamlessly into existing MGE detection pipelines and colocalization analyses.
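The tiered, multilevel labeling scheme described above can be sketched as a simple record structure: each protein family carries an MGE class, a major and minor category, and an evidence tier. Field names and example entries here are assumptions for illustration, not mobileOG-db's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MobileOGEntry:
    family: str
    mge_class: str        # e.g., "plasmid" or "integrative element"
    major_category: str   # one of the five life-cycle categories
    minor_category: str
    evidence: str         # "experimental" or "bioinformatic"

def high_confidence(entries):
    """Keep only the top annotation tier (experimentally supported)."""
    return [e for e in entries if e.evidence == "experimental"]

# Two invented example entries, one per evidence tier.
db = [
    MobileOGEntry("intA", "integrative element", "integration/excision",
                  "tyrosine recombinase", "experimental"),
    MobileOGEntry("repX", "plasmid", "replication/recombination/repair",
                  "replication initiator", "bioinformatic"),
]
print([e.family for e in high_confidence(db)])  # ['intA']
```

Separating the evidence tiers at the schema level is what lets downstream pipelines choose between high-precision and high-recall annotation modes.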
Van Gysel, Jens E., Vigus, Meagan, Kalm, Pavlína, Lee, Sook-kyung, Regan, Michael, and Croft, William. Cross-linguistic semantic annotation: reconciling the language-specific and the universal. Proceedings of the First International Workshop on Designing Meaning Representations (DMR 2019). Retrieved from https://par.nsf.gov/biblio/10401361. doi:10.18653/v1/W19-3301.
@article{osti_10401361,
place = {Country unknown/Code not available},
title = {Cross-linguistic semantic annotation: reconciling the language-specific and the universal},
url = {https://par.nsf.gov/biblio/10401361},
DOI = {10.18653/v1/W19-3301},
abstractNote = {Developers of cross-lingual semantic annotation schemes face a number of issues not encountered in monolingual annotation. This paper discusses four such issues, related to the establishment of annotation labels, and the treatment of languages with more fine-grained, more coarse-grained, and cross-cutting categories. We propose that a lattice-like architecture of the annotation categories can adequately handle all four issues, and at the same time remain both intuitive for annotators and faithful to typological insights. This position is supported by a brief annotation experiment.},
journal = {Proceedings of the First International Workshop on Designing Meaning Representations (DMR 2019)},
author = {Van Gysel, Jens E. and Vigus, Meagan and Kalm, Pavlína and Lee, Sook-kyung and Regan, Michael and Croft, William},
editor = {Xue, Nianwen and Croft, William and Hajic, Jan and Huang, Chu-Ren and Oepen, Stephan and Palmer, Martha and Pustejovsky, James}
}