Surface Statistics of an Unknown Language Indicate How to Parse It

Wang, Dingquan; Eisner, Jason

doi:10.1162/tacl_a_00248

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).

Award ID(s): 1718846

PAR ID: 10345015

Author(s) / Creator(s): Wang, Dingquan; Eisner, Jason

Date Published: 2018-12-01

Journal Name: Transactions of the Association for Computational Linguistics

Volume: 6

ISSN: 2307-387X

Page Range / eLocation ID: 667 to 685

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article: https://doi.org/10.1162/tacl_a_00248