<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>TDLR: Top Semantic-Down Syntactic Language Representation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/02/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10429264</idno>
					<idno type="doi"></idno>
					<title level='j'>NeurIPS'22 Workshop on All Things Attention: Bridging Different Perspectives on Attention</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Vipula Rawte</author>
					<author>Kaushik Roy</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Language understanding involves processing text with both the grammatical and common-sense contexts of the text fragments. The text "I went to the grocery store and brought home a car" requires both the grammatical context (syntactic) and the common-sense context (semantic) to capture the oddity in the sentence. Contextualized text representations learned by Language Models (LMs) are expected to capture a variety of syntactic and semantic contexts from large training data corpora. Recent work such as ERNIE has shown that infusing knowledge contexts, where they are available, into LMs results in significant performance gains on General Language Understanding Evaluation (GLUE) benchmark tasks. However, to our knowledge, no knowledge-aware model has attempted to infuse knowledge through top-down semantics-driven syntactic processing (e.g., common-sense to grammatical) and to operate directly on the attention mechanism that LMs leverage to learn the data context. We propose a learning framework, Top-Down Language Representation (TDLR), to infuse common-sense semantics into LMs. In our implementation, we build on BERT for its rich syntactic knowledge and use the knowledge graphs ConceptNet and WordNet to infuse semantic knowledge.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>LMs like BERT <ref type="bibr">[1]</ref>, RoBERTa <ref type="bibr">[3]</ref>, T5 <ref type="bibr">[7]</ref>, and GPT-2 <ref type="bibr">[6]</ref> efficiently learn distributed representations for text fragments such as tokens, entities, and phrases based on statistically likely patterns (syntactic: a text fragment's language context is defined by its statistically likely neighbors). The language syntax is characterized by grammar rules and by the frequency of text fragment co-occurrences reflected in large language corpora. These models outperform human baselines on GLUE tasks <ref type="bibr">[10]</ref>. LMs implicitly model a broad notion of "common-sense" in large language corpora. This is due to the nature of pattern learning (tending to a "normal" distribution) on large data. However, the human-understandable semantics found in external knowledge sources such as ConceptNet and WordNet are not explicitly leveraged. We might explicitly leverage the knowledge graph ConceptNet <ref type="bibr">[9]</ref> to derive the commonsense conceptual knowledge that World War I and World War II are different. Distinct concepts have different neighboring contexts (graphical neighborhoods) in ConceptNet (e.g., world war one-trench warfare, world war two-radio communications). The knowledge graph WordNet <ref type="bibr">[5]</ref> gives possible word senses for words. LMs can use the word-sense knowledge from WordNet explicitly to process the equivalence between "What does eat the phone battery quickly" and "What would cause the battery on my phone to drain so quickly"; the words "eat" and "drain" carry a similar word sense in this example. There has been a growing trend of research on techniques to infuse knowledge from knowledge graphs into LMs to improve performance <ref type="bibr">[12]</ref> <ref type="bibr">[11]</ref> <ref type="bibr">[2]</ref> <ref type="bibr">[10]</ref>. 
We propose the Top-Down Language Representation (TDLR) framework: a technique to explicitly infuse, as humans do, commonsense semantics from available knowledge graphs that capture such semantics. The framework proposes a clear set of steps for top-down semantics-driven syntactic processing while providing simple mechanisms to expand the scope of the driving semantics (e.g., expanding to factual knowledge, such as the current president of a country, found in the knowledge graph WikiData).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">TDLR Learning Framework</head><p>The TDLR framework performs three simple steps:</p><p>&#8226; Construct syntactic representations of the knowledge graphs and the data (Embedding Knowledge and Data at the Syntactic Level).</p><p>&#8226; Explicitly encode the desired semantics from relevant knowledge graphs in the self-attention mechanism of LMs (Encoding Knowledge Graph Semantics).</p><p>&#8226; Train the LM as before, thus enabling desired semantics-driven processing of the syntactic information (Knowledge Graph Semantics Driven Syntax Processing).</p><p>We show how the TDLR framework processes a sentence using the running example: "The World Wars have had a significant impact on 21st-century technology. The great war introduced tanks in battle, and the second world war introduced the use of sophisticated and encrypted radio communications; the drain caused by resource-hungry tech propelled the advancement of modern transistor technology."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Embedding Knowledge and Data at the Syntactic Level</head><p>The sentence is embedded by deriving and concatenating its constituent word embeddings obtained using a word embedding model <ref type="bibr">[4]</ref>. Next, the knowledge concepts are encoded using a knowledge graph embedding technique <ref type="bibr">[8]</ref>. Finally, the word embedding and knowledge concept embedding representations are concatenated. For example, the term "War" in our running example has representations from word2vec (the word-embedding model), the ConceptNet Numberbatch embedding model, and the convAI WordNet embedding model. All three representations are concatenated to obtain the final representation of the word "war". Finally, all the individual word representations are concatenated to form the sentence representation. Thus we get a representation of the sentence that contains the syntactic information from the embedding models.</p></div>
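As a concrete illustration of this step, the concatenation can be sketched in a few lines of numpy. The lookup tables and 4-dimensional toy vectors below are hypothetical stand-ins for the word2vec, ConceptNet Numberbatch, and WordNet embedding models; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def embed_token(token, word2vec, numberbatch, wordnet_emb):
    """Concatenate the three syntactic-level views of one token."""
    parts = [word2vec[token], numberbatch[token], wordnet_emb[token]]
    return np.concatenate(parts)

def embed_sentence(tokens, word2vec, numberbatch, wordnet_emb):
    """Stack per-token concatenations into a sentence representation."""
    return np.stack([embed_token(t, word2vec, numberbatch, wordnet_emb)
                     for t in tokens])

# Toy example: random 4-dim vectors stand in for each embedding model.
rng = np.random.default_rng(0)
vocab = ["the", "great", "war"]
w2v = {t: rng.normal(size=4) for t in vocab}
cn  = {t: rng.normal(size=4) for t in vocab}
wn  = {t: rng.normal(size=4) for t in vocab}

sent = embed_sentence(["the", "great", "war"], w2v, cn, wn)
print(sent.shape)  # (3, 12): 3 tokens x (4+4+4)-dim concatenation
```

In a real setting each lookup table would be replaced by the corresponding pretrained embedding model, and the per-model dimensions would differ; only the concatenate-then-stack pattern carries over.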
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Encoding Knowledge Graph Semantics</head><p>The word "war" appears in many contexts (e.g., civil war, drug war, proxy war), and the context "world war" may not be so common in the language corpora used to train embedding models. While knowledge graphs like ConceptNet contain the concepts of civil war, drug war, and proxy war in the same graphical context, embedding models such as Numberbatch aggregate the representations of all the contexts in a given graphical neighborhood, thus losing specific meanings. Therefore we construct a knowledge graph mask that encodes the particular contexts of interest, i.e., the semantics that will drive the processing of the syntactic input and knowledge representations.</p><p>Using our running example, let e_11 refer to the word "great" and e_12 to the word "war" (see Figure <ref type="figure">1 (d)</ref>). Assuming that the word "war" has civil, drug, and proxy contexts in the data, an LM trained without explicitly encoding the semantic context "great war" might not capture this meaning. Thus we ensure that the word "war" attends to the word "great" by setting the corresponding entry in the mask to 1 while masking out the rest of the entries with 0 (see Figure <ref type="figure">1 (b)</ref>). Likewise, denoting the singleton word "war" as e_2 (see Figure <ref type="figure">1 (d)</ref>) enables knowledge graph semantics to be encoded in the corresponding mask entries for that word. In essence, using our approach, we have explicitly encoded the semantic context for the word "war" to mean itself and the accompanying word "great". 
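A minimal sketch of the mask construction described above, assuming a hypothetical list of knowledge-graph word pairs (e.g., "great"-"war" as linked in ConceptNet): entry (i, j) is set to 1 when the knowledge graph links token i to token j, and the diagonal keeps each token attending to itself.

```python
import numpy as np

def build_knowledge_mask(tokens, kg_pairs):
    """Binary mask: entry (i, j) is 1 iff token i should attend to token j
    according to a knowledge-graph context; the diagonal preserves
    self-attention. All other entries are masked out with 0."""
    n = len(tokens)
    mask = np.eye(n)  # every token attends to itself
    for a, b in kg_pairs:
        for i, t_i in enumerate(tokens):
            for j, t_j in enumerate(tokens):
                if t_i == a and t_j == b:
                    mask[i, j] = 1.0
                    mask[j, i] = 1.0  # attention context is symmetric here
    return mask

tokens = ["the", "great", "war"]
# ConceptNet treats "great war" as one concept, so "war" must attend to "great".
mask = build_knowledge_mask(tokens, [("great", "war")])
print(mask)
```

The O(n^2) scan per pair is only for clarity; a token-to-index dictionary would make this linear in the number of knowledge-graph pairs.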
After encoding the desired semantics in the mask (see Figure <ref type="figure">1</ref> (b)), we apply the mask to obtain a knowledge-semantics-encoded self-attention matrix (see Figure <ref type="figure">1 (c)</ref>).</p><p>Bayesian Perspective: A question might arise that the knowledge-semantics-encoded self-attention matrix has lost its probabilistic interpretation (the row and column sums are no longer 1). We can see the application of the mask as a natural application of the Bayes rule in Equation <ref type="formula">1</ref>.</p><formula xml:id="formula_0" n="1">A* = (K ⊙ A) / Z</formula><p>Here A is the self-attention matrix, K is the knowledge mask, and Z is the normalizing constant. The posterior in Equation 1 is A*, the likelihood is A, and the prior is K. The knowledge mask encodes a prior probability distribution (unnormalized, as its row and column sums are not 1). The self-attention matrix encodes data-likelihood probabilities. Thus we can liken the application of the mask to a likelihood-prior product that is proportional to the posterior probability.</p></div>
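The masking step and its Bayesian reading can be sketched as follows. The shapes and random inputs are toy assumptions; the elementwise product of the attention matrix (likelihood) and the knowledge mask (prior) is proportional to the posterior, up to the normalizer Z.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_masked_attention(Q, K_mat, V, kg_mask):
    """Self-attention with a knowledge mask applied elementwise.
    A plays the role of the data likelihood and kg_mask the
    (unnormalized) prior; their product is proportional to the
    posterior A* from Equation 1."""
    d = Q.shape[-1]
    A = softmax(Q @ K_mat.T / np.sqrt(d))  # ordinary scaled dot-product attention
    A_star = A * kg_mask                   # likelihood x prior, up to Z
    return A_star @ V, A_star

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K_mat = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
# Mask from the running example: "war" attends to itself and to "great".
kg_mask = np.array([[1., 0., 0.],
                    [0., 1., 1.],
                    [0., 1., 1.]])
out, A_star = knowledge_masked_attention(Q, K_mat, V, kg_mask)
# Masked-out entries become exactly zero; row sums need not be 1 (hence Z).
print(A_star[0, 1], A_star[0, 2])  # 0.0 0.0
```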
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Knowledge Graph Semantics Driven Syntax Processing</head><p>With the desired knowledge semantics encoded in the self-attention matrix, we execute the forward-backward training pass as usual in an LM (see Figure <ref type="figure">2</ref>). Expanding the scope of the knowledge semantics that drive the top-down processing in TDLR simply requires adding multiple attention masks at different layers. </p></div>
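Under the same toy assumptions as above, driving different layers with different knowledge masks amounts to a per-layer elementwise gate on each layer's attention matrix. The two masks below are hypothetical stand-ins for, say, a ConceptNet-derived mask at one layer and a WordNet-derived mask at another.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention_layer(H, W_q, W_k, W_v, kg_mask):
    """One self-attention layer whose attention matrix is gated by a
    knowledge mask before mixing the value vectors."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) * kg_mask
    return A @ V

rng = np.random.default_rng(0)
n, d = 3, 4
H = rng.normal(size=(n, d))  # toy token representations
# Hypothetical per-layer masks: an all-ones mask (no knowledge constraint)
# followed by the "great war" mask from the running example.
layer_masks = [np.ones((n, n)),
               np.array([[1., 0., 0.], [0., 1., 1.], [0., 1., 1.]])]
for kg_mask in layer_masks:
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    H = masked_self_attention_layer(H, W_q, W_k, W_v, kg_mask)
print(H.shape)  # (3, 4)
```

The training loop itself is unchanged: gradients flow through the masked attention exactly as in an ordinary LM forward-backward pass.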
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments</head><p>We test the TDLR method on GLUE benchmark tasks that require the infusion of specific knowledge semantics into the data. We build TDLR on the BERT-Base and BERT-Large models. Both models execute "normally" distributed semantics-driven syntactic processing. To infuse the semantics contained in WordNet and ConceptNet, we encode the graph information at the input (syntactic) level (see Section 2.1) and apply mask encodings that capture the semantics in these knowledge graphs (see Section 2.2). Thus TDLR executes ConceptNet and WordNet semantics-driven processing of the syntactic information in language for a series of benchmark tasks.</p><p>In Table <ref type="table">1</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and future work</head><p>We propose Top-Down Language Representations (TDLR), a method to infuse knowledge into the self-attention mechanism. TDLR enables top-level semantics-driven bottom-level language processing at a general level. We demonstrate TDLR's performance improvements using common-sense semantics from WordNet and ConceptNet built on top of BERT. In future work, we will explore extensions that use other forms of knowledge, such as factual knowledge in Wikipedia and domain-specific knowledge in the Unified Medical Language System.</p></div></body>
		</text>
</TEI>
