


Title: Active Learning Design Choices for NER with Transformers
We explore multiple important choices, not previously analyzed in conjunction, regarding active learning for token classification using transformer networks. These choices are: (i) how to select what to annotate, (ii) whether to annotate entire sentences or smaller sentence fragments, (iii) how to train with incomplete token-level annotations, and (iv) how to select the initial seed dataset. We examine whether annotating at the sub-sentence level can translate to improved downstream performance by considering two sub-sentence annotation strategies: (i) entity-level and (ii) token-level. These approaches leave some sentences only partially annotated. To address this issue, we introduce and evaluate multiple strategies for handling partially annotated sentences during training. We show that annotating at the sub-sentence level achieves performance comparable to or better than sentence-level annotation with a smaller number of annotated tokens. We then explore the extent to which the performance gap remains once annotation time is accounted for, and find that both annotation schemes perform similarly.
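The abstract does not spell out the training strategies for partially annotated sentences, but a common baseline in this setting is to mask unannotated tokens out of the training loss. Below is a minimal PyTorch sketch of that idea, assuming a token classifier over a made-up tag set; the -100 ignore index follows the usual convention in the transformers ecosystem, and none of this should be read as the paper's exact method.

```python
import torch
import torch.nn as nn

# Conventional ignore index: label -100 marks tokens left unannotated
# by sub-sentence (entity- or token-level) annotation.
IGNORE_INDEX = -100

def masked_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy computed over annotated tokens only.

    logits: (batch, seq_len, num_tags); labels: (batch, seq_len) with
    IGNORE_INDEX wherever no annotation was collected.
    """
    loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
    return loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

# Example: a 5-token sentence where only tokens 1-2 were annotated.
logits = torch.randn(1, 5, 9)  # 9 BIO tags, chosen for illustration
labels = torch.tensor([[IGNORE_INDEX, 3, 4, IGNORE_INDEX, IGNORE_INDEX]])
print(masked_token_loss(logits, labels))
```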
Award ID(s):
2006583
PAR ID:
10550330
Author(s) / Creator(s):
Editor(s):
Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen
Publisher / Repository:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Date Published:
Page Range / eLocation ID:
321-334
Format(s):
Medium: X
Location:
Torino, Italy
Sponsoring Org:
National Science Foundation
More Like this
  1. Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets: ELI5, WebGPT, and Natural Questions. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers and annotate 3.9k sentences in 640 answer paragraphs. Different answer collection methods manifest in different discourse structures. We further analyze model-generated answers, finding that annotators agree less with each other when annotating model-generated answers than when annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.
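The "strong classifier" above is a sentence-level role classifier. As a rough illustration only, here is a minimal Python sketch with Hugging Face transformers; the six role names are hypothetical placeholders (the paper's actual ontology labels are not listed in this abstract), and roberta-base is an assumed backbone that would still need fine-tuning on the annotated data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical stand-ins for the paper's six functional roles.
ROLES = ["answer", "example", "background", "auxiliary-info", "organization", "misc"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(ROLES)
)

sentence = "For instance, a long-form answer often opens with background."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is freshly initialized, so before fine-tuning
# this prediction is arbitrary.
print(ROLES[logits.argmax(dim=-1).item()])
```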
  2.
    Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpus analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models that learns to rank synthetic divergent examples of varying granularity. We evaluate our models on Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential to further distinguish between coarse and fine-grained divergences.
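The learning-to-rank strategy can be illustrated with a margin ranking objective: a scorer is trained to assign higher divergence scores to coarser synthetic divergences than to subtler ones. The sketch below is a hedged stand-in, using a toy MLP scorer over made-up 768-dimensional sentence-pair embeddings in place of the paper's multilingual BERT fine-tuning.

```python
import torch
import torch.nn as nn

# Toy scorer standing in for a fine-tuned multilingual BERT head.
scorer = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

coarse_pair = torch.randn(4, 768)  # embeddings of clearly divergent pairs (made up)
fine_pair = torch.randn(4, 768)    # embeddings of subtly divergent pairs (made up)

s_coarse = scorer(coarse_pair).squeeze(-1)
s_fine = scorer(fine_pair).squeeze(-1)

# target=1 asks the loss to push s_coarse above s_fine by at least the margin.
loss = nn.MarginRankingLoss(margin=1.0)(s_coarse, s_fine, torch.ones(4))
loss.backward()  # gradients would drive an optimizer step in training
```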
  3. Emotion annotations are prone to inconsistencies between annotators. The low inter-evaluator agreement arises due to the complex nature of emotions. Conventional approaches average the scores provided by multiple annotators. While this approach reduces the influence of dissident annotations, previous studies have shown the value of considering individual evaluations to better capture the underlying ground truth. One of these approaches is the qualitative agreement (QA) method, which provides an alternative framework that captures the inherent trends amongst the annotators. While previous studies have focused on using the QA method for time-continuous annotations from a fixed number of annotators, most emotional databases are annotated with attributes at the sentence level (e.g., one global score per sentence). This study proposes a novel formulation based on the QA framework to estimate reliable sentence-level annotations for preference learning. The proposed relative labels between pairs of sentences capture consistent trends across evaluators. The experimental evaluation shows that preference-learning methods that rank-order emotional attributes trained with the proposed QA-based labels achieve significantly better performance than the same algorithms trained with relative scores obtained by averaging absolute scores across annotators. These results show the benefits of QA-based labels for preference learning with sentence-level annotations.
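To make the idea of relative labels from annotator trends concrete, here is an illustrative rule (not the paper's exact QA formulation): emit a preference label for a sentence pair only when a clear majority of annotators agree on the direction of the difference, and drop inconsistent pairs. The agreement threshold is an assumed parameter.

```python
import numpy as np

def qa_preference(scores_a: np.ndarray, scores_b: np.ndarray,
                  min_agree: float = 0.75):
    """Relative label for a sentence pair from per-annotator absolute scores.

    Returns 'a>b', 'b>a', or None when annotators show no consistent trend.
    Illustrative majority-trend rule only.
    """
    diffs = scores_a - scores_b  # one entry per shared annotator
    if (diffs > 0).mean() >= min_agree:
        return "a>b"
    if (diffs < 0).mean() >= min_agree:
        return "b>a"
    return None  # inconsistent pair: excluded from preference training

# Three annotators rated an emotional attribute for two sentences (made up).
print(qa_preference(np.array([4.0, 5.0, 3.5]), np.array([2.0, 4.0, 3.0])))  # 'a>b'
```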
  4. When analyzing scRNA-seq data with clustering algorithms, annotating the clusters with cell types is an essential step toward biological interpretation of the data. Annotations can be performed manually using known cell type marker genes, or automated using knowledge-driven or data-driven machine learning algorithms. The majority of cell type annotation algorithms are designed to predict cell types for individual cells in a new dataset. Since biological interpretation of scRNA-seq data is often made on cell clusters rather than individual cells, several algorithms have been developed to annotate cell clusters. In this study, we compared five cell type annotation algorithms, Azimuth, SingleR, Garnett, scCATCH, and SCSA, which cover the spectrum of knowledge-driven and data-driven approaches to annotate either individual cells or cell clusters. We applied these five algorithms to two scRNA-seq datasets of peripheral blood mononuclear cell (PBMC) samples from COVID-19 patients and healthy controls, and evaluated their annotation performance. From this comparison, we observed that methods for annotating individual cells outperformed methods for annotating cell clusters. We applied the cell-based annotation algorithm Azimuth to the two scRNA-seq datasets to examine the immune response during COVID-19 infection. Both datasets presented significant depletion of plasmacytoid dendritic cells (pDCs), and differential expression and pathway analysis in this cell type revealed strong activation of the type I interferon signaling pathway in response to the infection.
  5. Bonial, Claire; Bonn, Julia; Hwang, Jena D (Eds.)
    We explore using LLMs, specifically GPT-4, to generate draft sentence-level Chinese Uniform Meaning Representations (UMRs) that human annotators can revise to speed up the UMR annotation process. In this study, we use few-shot learning and Think-Aloud prompting to guide GPT-4 to generate sentence-level UMR graphs. Our experimental results show that, compared with annotating UMRs from scratch, using LLMs as a preprocessing step reduces annotation time by two-thirds on average. This indicates great potential for integrating LLMs into the pipeline for complicated semantic annotation tasks.
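As a sketch of this LLM-as-preprocessor setup, the snippet below drafts a UMR graph for one sentence with a few-shot, think-step-by-step prompt via the OpenAI Python client. The exemplar graph and prompt wording are invented for illustration and do not reproduce the paper's prompts; an OPENAI_API_KEY in the environment is assumed.

```python
from openai import OpenAI  # assumes the openai Python client and an API key

client = OpenAI()

# Minimal few-shot prompt shape; the exemplar below is an invented,
# loosely AMR-style graph, not one of the paper's annotated examples.
FEW_SHOT = (
    "You draft sentence-level Uniform Meaning Representation (UMR) graphs.\n"
    "Think step by step about predicates and arguments before answering.\n\n"
    "Sentence: 他 买 了 一 本 书 。\n"
    "UMR: (b / buy-01 :ARG0 (h / he) :ARG1 (x / book :quant 1))\n"
)

def draft_umr(sentence: str) -> str:
    """Return a draft UMR graph for a human annotator to revise."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": FEW_SHOT},
            {"role": "user", "content": f"Sentence: {sentence}\nUMR:"},
        ],
    )
    return resp.choices[0].message.content
```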