Title: Effect of Chinese characters on machine learning for Chinese author name disambiguation: A counterfactual evaluation
Chinese author names are known to be more difficult to disambiguate than other ethnic names because they tend to share surnames and forenames, thus creating many homonyms. In this study, we demonstrate how using Chinese characters can affect machine learning for author name disambiguation. For analysis, 15K author names recorded in Chinese are transliterated into English and simplified by initialising their forenames to create counterfactual scenarios, reflecting real-world indexing practices in which Chinese characters are usually unavailable. The results show that Chinese author names that are highly ambiguous in English or with initialised forenames tend to become less confusing if their Chinese characters are included in the processing. Our findings indicate that recording Chinese author names in native script can help researchers and digital libraries enhance authority control of Chinese author names that continue to increase in size in bibliographic data.
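The counterfactual setup described in the abstract (native script, English transliteration, initialised forename) can be illustrated with a minimal sketch. It only counts how many distinct authors collapse onto one name string in each scenario; it is not the study's machine learning pipeline. The records, author IDs, hand-written pinyin forms, and the function names `name_key`, `ambiguity`, and `max_authors_per_name` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the study:
# (author_id, name in Chinese characters, romanised surname, romanised forename)
RECORDS = [
    ("a1", "张伟", "Zhang", "Wei"),
    ("a2", "张薇", "Zhang", "Wei"),
    ("a3", "张威", "Zhang", "Wei"),
    ("a4", "张文", "Zhang", "Wen"),
    ("a5", "王芳", "Wang", "Fang"),
    ("a6", "王飞", "Wang", "Fei"),
]


def name_key(record, scenario):
    """Return the name string an index would store under the given scenario."""
    _, native, surname, forename = record
    if scenario == "native":
        return native                       # Chinese characters kept
    if scenario == "english":
        return f"{surname} {forename}"      # full transliteration
    if scenario == "initialised":
        return f"{surname} {forename[0]}"   # forename reduced to its initial
    raise ValueError(f"unknown scenario: {scenario}")


def ambiguity(records, scenario):
    """Map each name string to the set of distinct authors sharing it."""
    groups = defaultdict(set)
    for record in records:
        groups[name_key(record, scenario)].add(record[0])
    return dict(groups)


def max_authors_per_name(records, scenario):
    """Size of the largest homonym group in the given scenario."""
    return max(len(ids) for ids in ambiguity(records, scenario).values())


if __name__ == "__main__":
    for scenario in ("native", "english", "initialised"):
        print(scenario, max_authors_per_name(RECORDS, scenario))
```

In this toy data the largest homonym group grows as information is removed: each native-script name is unique, three authors share one transliterated name, and four share one initialised name.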
Award ID(s):
1917663
NSF-PAR ID:
10292508
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Journal of Information Science
ISSN:
0165-5515
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. How related is skin to a quilt or door to worry? Here, we show that linguistic experience strongly informs people’s judgments of such word pairs. We asked Chinese-speakers, English-speakers, and Chinese-English bilinguals to rate semantic and visual similarity between pairs of Chinese words and of their English translation equivalents. Some pairs were unrelated, others were also unrelated but shared a radical (e.g., “expert” and “dolphin” share the radical meaning “pig”), and still others shared a radical that invokes a metaphorical relationship. For example, a quilt covers the body like skin; understand, with a sun radical, invokes understanding as illumination. Importantly, the shared radicals are not part of the pronounced word form. Chinese speakers rated word pairs with metaphorical connections as more similar than other pairs. English speakers did not, even though they were sensitive to shared radicals. Chinese-English bilinguals showed sensitivity to the metaphorical connections even when tested with English words.
  2. Abstract

    Are the brain bases of language comprehension the same across all human languages, or do these bases vary in a way that corresponds to differences in linguistic typology? English and Mandarin Chinese attest such a typological difference in the domain of relative clauses. Using functional magnetic resonance imaging with English and Chinese participants, who listened to the same translation-equivalent story, we analyzed neuroimages time aligned to object-extracted relative clauses in both languages. In a general linear model analysis of these naturalistic data, comprehension was selectively associated with increased hemodynamic activity in left posterior temporal lobe, angular gyrus, inferior frontal gyrus, precuneus, and posterior cingulate cortex in both languages. This result suggests the processing of object-extracted relative clauses is subserved by a common collection of brain regions, regardless of typology. However, there were also regions in the temporal lobe that were activated uniquely in our Chinese participants, albeit not to a significantly greater degree. These Chinese-specific results could reflect structural ambiguity-resolution work that must be done in Chinese but not English object-extracted relative clauses.

     
  3. Because Chinese reading and writing systems are not phonetic, Mandarin Chinese learners must construct six-way mental connections in order to learn new words, linking characters, meanings, and sounds. Little research has focused on the difficulties inherent to each specific component involved in this process, especially within digital learning environments. The present work examines Chinese word acquisition within ASSISTments, an online learning platform traditionally known for mathematics education. Students were randomly assigned to one of three conditions in which researchers manipulated a learning assignment to exclude one of three bi-directional connections thought to be required for Chinese language acquisition (i.e., sound-meaning and meaning-sound). Researchers then examined whether students’ performance differed significantly when the learning assignment lacked sound-character, character-meaning, or meaning-sound connection pairs, and whether certain problem types were more difficult for students than others. Assessment of problems by component type (i.e., characters, meanings, and sounds) revealed support for the relative ease of problems that provided sounds, with students exhibiting higher accuracy with fewer attempts and less need for system feedback when sounds were included. However, analysis revealed no significant differences in word acquisition by condition, as evidenced by next-day post-test scores or pre- to post-test gain scores. Implications and suggestions for future work are discussed. 
  4. Abstract

    Previous work has shown that English native speakers interpret sentences as predicted by a noisy‐channel model: They integrate both the real‐world plausibility of the meaning—the prior—and the likelihood that the intended sentence may be corrupted into the perceived sentence. In this study, we test the noisy‐channel model in Mandarin Chinese, a language taxonomically different from English. We present native Mandarin speakers sentences in a written modality (Experiment 1) and an auditory modality (Experiment 2) in three pairs of syntactic alternations. The critical materials are literally implausible but require differing numbers and types of edits in order to form more plausible sentences. Each sentence is followed by a comprehension question that allows us to infer whether the speakers interpreted the item literally, or made an inference toward a more likely meaning. Similar to previous research on related English constructions, Mandarin participants made the most inferences for implausible materials that could be inferred as plausible by deleting a single morpheme or inserting a single morpheme. Participants were less likely to infer a plausible meaning for materials that could be inferred as plausible by making an exchange across a preposition. And participants were least likely to infer a plausible meaning for materials that could be inferred as plausible by making an exchange across a main verb. Moreover, we found more inferences in written materials than spoken materials, possibly a result of a lack of word boundaries in written Chinese. Overall, the fact that the results were so similar to those found in related constructions in English suggests that the noisy‐channel proposal is robust.

     
  5. Adpositions are frequent markers of semantic relations, but they are highly ambiguous and vary significantly from language to language. Moreover, there is a dearth of annotated corpora for investigating the cross-linguistic variation of adposition semantics, or for building multilingual disambiguation systems. This paper presents a corpus in which all adpositions have been semantically annotated in Mandarin Chinese; to the best of our knowledge, this is the first Chinese corpus to be broadly annotated with adposition semantics. Our approach adapts a framework that defined a general set of supersenses according to ostensibly language-independent semantic criteria, though its development focused primarily on English prepositions (Schneider et al., 2018). We find that the supersense categories are well-suited to Chinese adpositions despite syntactic differences from English. On a Mandarin translation of The Little Prince, we achieve high inter-annotator agreement and analyze semantic correspondences of adposition tokens in bitext. 