Title: Language-Agnostic Representation Learning of Source Code from Structure and Context
Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly on either Structure or Context. We propose a new model that jointly learns from the Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art results on monolingual code summarization for all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.
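To make the Context/Structure distinction concrete, here is a minimal, hypothetical sketch (not the model proposed above): it derives both views from a single Python function, using the token sequence as Context and language-agnostic (node type, depth) pairs from the parsed AST as Structure. All function names are illustrative.

```python
import ast
import io
import tokenize

def context_tokens(source: str):
    """Token strings of the raw source code (the Context view)."""
    return [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(source).readline)
            if tok.string.strip()]

def structure_features(source: str):
    """(node type, depth) pairs computed directly from the AST (the Structure view)."""
    feats = []
    def walk(node, depth):
        feats.append((type(node).__name__, depth))
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)
    walk(ast.parse(source), 0)
    return feats

src = "def add(a, b):\n    return a + b\n"
print(context_tokens(src))      # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
print(structure_features(src))  # [('Module', 0), ('FunctionDef', 1), ('arguments', 2), ...]
```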
Award ID(s):
2030477 1918940 1934578 1835598
NSF-PAR ID:
10300281
Author(s) / Creator(s):
Date Published:
Journal Name:
International Conference on Learning Representations (ICLR)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been addressed with this approach, with performance gradually improving over the past several years thanks to better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; for others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function) is rather similar, and in particular preserves identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for three different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.
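The data-augmentation idea in the abstract above can be illustrated with a tiny, hypothetical sketch: labeled examples from several programming languages are pooled into one training set, so high-resource languages can compensate for low-resource ones. The function and data below are illustrative, not the paper's pipeline.

```python
import random

def pool_multilingual(datasets, seed=0):
    """datasets: dict mapping language -> list of (code, summary) pairs."""
    pooled = [(lang, ex) for lang, examples in datasets.items() for ex in examples]
    random.Random(seed).shuffle(pooled)  # mix languages within the training set
    return pooled

data = {
    "ruby": [("def add(a, b) a + b end", "add two numbers")],
    "javascript": [("function add(a, b) { return a + b; }", "add two numbers")],
}
print(pool_multilingual(data)[:2])
```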
  2. Michael Pradel (Ed.)
    Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark and MBPP benchmark to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
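As a rough illustration of the benchmark-translation idea (a hypothetical sketch, not MultiPL-E's actual implementation), the snippet below turns Python-style (arguments, expected value) test cases into JavaScript assertions, so a completion generated in the target language could be checked there. Here json.dumps stands in for a real, type-aware value translator and only handles primitive values.

```python
import json

def to_js_tests(func_name, cases):
    """Render (args, expected) pairs as JavaScript assertions.
    Only JSON-serializable primitive values are handled in this sketch."""
    lines = []
    for args, expected in cases:
        arg_src = ", ".join(json.dumps(a) for a in args)
        lines.append(f"console.assert({func_name}({arg_src}) === {json.dumps(expected)});")
    return "\n".join(lines)

print(to_js_tests("add", [((1, 2), 3), ((0, 0), 0)]))
# console.assert(add(1, 2) === 3);
# console.assert(add(0, 0) === 0);
```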
  3. Modern NLP applications have enjoyed a great boost utilizing neural network models. Such deep neural models, however, are not applicable to most human languages due to the lack of annotated training data for various NLP tasks. Cross-lingual transfer learning (CLTL) is a viable method for building NLP models for a low-resource target language by leveraging labeled data from other (source) languages. In this work, we focus on the multilingual transfer setting where training data in multiple source languages is leveraged to further boost target language performance. Unlike most existing methods that rely only on language-invariant features for CLTL, our approach coherently utilizes both language-invariant and language-specific features at the instance level. Our model leverages adversarial networks to learn language-invariant features, and mixture-of-experts models to dynamically exploit the similarity between the target language and each individual source language. This enables our model to learn effectively what to share between various languages in the multilingual setup. Moreover, when coupled with unsupervised multilingual embeddings, our model can operate in a zero-resource setting where neither target language training data nor cross-lingual resources are available. Our model achieves significant performance gains over prior art, as shown in an extensive set of experiments over multiple text classification and sequence tagging tasks.
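To illustrate the mixture-of-experts part of this approach (a schematic sketch only, not the paper's architecture; all names, shapes, and the similarity-based gating below are assumptions), per-source-language expert outputs are mixed with instance-level gate weights derived from the similarity between the target-language input and each source language.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def moe_combine(instance_feat, expert_outputs, source_lang_embs):
    """instance_feat: (d,) target-language instance representation.
    expert_outputs: dict lang -> (d,) feature from that language's expert.
    source_lang_embs: dict lang -> (d,) embedding of the source language."""
    langs = list(expert_outputs)
    # Gate: instance-level similarity to each source language.
    scores = np.array([instance_feat @ source_lang_embs[l] for l in langs])
    gates = softmax(scores)
    mixed = sum(g * expert_outputs[l] for g, l in zip(gates, langs))
    return mixed, dict(zip(langs, gates))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
experts = {l: rng.normal(size=d) for l in ["en", "es", "zh"]}
lang_embs = {l: rng.normal(size=d) for l in experts}
mixed, gates = moe_combine(x, experts, lang_embs)
print(gates)  # per-source-language gate weights for this instance
```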
  4. Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3–4% with less than 1/5th the training parameters compared to other word embedding methods.
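A toy sketch of the hybrid-vocabulary idea above (illustrative only, not the SMALR implementation; the hashing scheme and class names are assumptions): most words are looked up in a fixed-size language-agnostic table, while a small set of words keeps language-specific vectors.

```python
import numpy as np

class HybridEmbedding:
    """Most words share a fixed-size language-agnostic table; a few keep language-specific vectors."""
    def __init__(self, shared_size, dim, specific_words, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(size=(shared_size, dim))                   # language-agnostic
        self.specific = {w: rng.normal(size=dim) for w in specific_words}   # language-specific

    def lookup(self, lang, word):
        if (lang, word) in self.specific:
            return self.specific[(lang, word)]
        # Hash everything else into the shared table (a stand-in for a learned
        # shared vocabulary; Python's hash is not stable across runs).
        return self.shared[hash(word) % len(self.shared)]

emb = HybridEmbedding(shared_size=1000, dim=16,
                      specific_words=[("en", "bank"), ("de", "Bank")])
print(emb.lookup("en", "bank").shape)   # language-specific vector
print(emb.lookup("fr", "chien").shape)  # shared, language-agnostic slot
```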