Patton: Language Model Pretraining on Text-Rich Networks

Jin, Bowen; Zhang, Wentao; Zhang, Yu; Meng, Yu; Zhang, Xinyang; Zhu, Qi; Han, Jiawei

doi:10.18653/v1/2023.acl-long.387


A real-world text corpus sometimes comprises not only text documents, but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model texts and do not take inter-document structure information into consideration. To this end, we propose our PretrAining on TexT-Rich NetwOrk framework, PATTON. PATTON includes two pretraining strategies: network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks in five datasets from both academic and e-commerce domains, where PATTON outperforms baselines significantly and consistently.

Award ID(s): 1956151, 1741317, 1704532, 2118329, 2019897

PAR ID: 10466995

Author(s) / Creator(s): Jin, Bowen; Zhang, Wentao; Zhang, Yu; Meng, Yu; Zhang, Xinyang; Zhu, Qi; Han, Jiawei

Publisher / Repository: Association for Computational Linguistics

Date Published: 2023-07-10

Page Range / eLocation ID: 7005 to 7020

Subject(s) / Keyword(s): Language model pretraining, LLM, large language models, Text-Rich Networks

Format(s): Medium: X

Location: Toronto, Canada

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.18653/v1/2023.acl-long.387
