<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Designing a Uniform Meaning Representation for Natural Language Processing</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/30/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10288269</idno>
					<idno type="doi">10.1007/s13218-021-00722-w</idno>
					<title level='j'>KI - Künstliche Intelligenz</title>
<idno>0933-1875</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Jens E. Van Gysel</author><author>Meagan Vigus</author><author>Jayeol Chun</author><author>Kenneth Lai</author><author>Sarah Moeller</author><author>Jiarui Yao</author><author>Tim O’Gorman</author><author>Andrew Cowell</author><author>William Croft</author><author>Chu-Ren Huang</author><author>Jan Hajič</author><author>James H. Martin</author><author>Stephan Oepen</author><author>Martha Palmer</author><author>James Pustejovsky</author><author>Rosa Vallejos</author><author>Nianwen Xue</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this paper we present Uniform Meaning Representation (UMR), a meaning representation designed to annotate the semantic content of a text. UMR is primarily based on Abstract Meaning Representation (AMR), an annotation framework initially designed for English, but also draws from other meaning representations. UMR extends AMR to other languages, particularly morphologically complex, low-resource languages. UMR also adds features to AMR that are critical to semantic interpretation and enhances AMR by proposing a companion document-level representation that captures linguistic phenomena such as coreference as well as temporal and modal dependencies that potentially go beyond sentence boundaries.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>It is undeniable that neural network-based end-to-end systems have led to fundamental changes in the landscape of NLP. In areas where large training data sets exist, such as machine translation <ref type="bibr">[4]</ref> and machine reading <ref type="bibr">[63,</ref><ref type="bibr">80,</ref><ref type="bibr">83]</ref>, end-to-end systems enabled by neural network models have reduced reliance on intermediate semantic (and other) representations. The successful application of such end-to-end systems has caused many to wonder whether it is still necessary to invest time and money in building such linguistically annotated resources. We argue that the emergence of a host of new application scenarios makes the need for deep semantic analysis and representations more urgent than ever. In human-robot interactions, meaning representations are needed as the medium of communication between human users and robots. Intelligent agents in the medical field require intermediate meaning representations in order to provide interpretable background for predictions, judgments and diagnoses. Even in machine translation, while end-to-end systems have made impressive advances, especially in fluency <ref type="bibr">[10,</ref><ref type="bibr">11,</ref><ref type="bibr">18]</ref>, intermediate structures can provide "scaffolding" for the learning process <ref type="bibr">[69]</ref>, improving the faithfulness of translations <ref type="bibr">[31,</ref><ref type="bibr">82]</ref>.</p><p>This paper presents the fundamentals of Uniform Meaning Representation (UMR), a practical and crosslinguistically valid meaning representation designed to meet the needs of a wide range of NLP applications. The remainder of the paper is organized as follows. In Section 2, we lay out four desiderata that guide the design of UMR. 
In Section 3, we present an overview of Abstract Meaning Representation, which serves as the starting point of UMR. We present UMR sentence-level extensions to AMR in Section 4, and document-level extensions in Section 5. We discuss how UMR addresses cross-linguistic diversity in linguistic distinctions and in mapping words to UMR concepts in Section 6. In Section 7, we present our strategy for applying UMR to minority languages that face cultural and technological challenges and lack foundational resources. Although fully annotated UMRs are not yet available, as we are still developing tools to support UMR annotation, we present experiments in Section 8 showing that novel aspects of UMR can be annotated reliably. In Section 9, we discuss how UMR relates to existing meaning representations and, in particular, compare how each of them addresses our four desiderata. We conclude in Section 10.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Design goals for UMR</head><p>The design of UMR is guided by the following four desiderata:</p><p>-Scalability/learnability. Meaning representations are expected to be automatically reproduced by machine learning systems trained on data annotated with this representation. As such, it is important for the meaning representations to be annotated at scale on large data sets. This means that the meaning representation needs to be intuitive so that it does not put too many constraints on the pool of annotators who are capable of performing the annotation. The meaning representation also needs to be a formal object such as a tree or a graph that is easy to manipulate algorithmically. -Supporting similarity-based lexical inference. Natural languages are known to be both variable (the same meaning can be expressed through different morphosyntactic constructions) and ambiguous (the same surface string can have different meanings in different contexts). For a meaning representation to support lexical inference, different natural language expressions that have the same meaning should be expressed in the same way. This means that the meaning representation needs to abstract away from the morphosyntactic variations, disambiguate the senses of a word or phrase, and resolve references of referring expressions such as proper nouns and pronouns. -Supporting logical inference. Supporting logical inference has been the primary goal for classical meaning representations, which aim to be easily translatable to logical form -typically first-order logic. Logical systems allow new statements to be inferred from known facts, and linguistic phenomena such as quantification, negation, tense and aspect, and modality have traditionally figured prominently in logic-based meaning representations. 
First-order logic formalisms have also played a key role in grounded semantic parsing, the goal of which is to parse natural language queries into first-order logic-based meaning representations that can be executed against knowledge bases <ref type="bibr">[35,</ref><ref type="bibr">46,</ref><ref type="bibr">81,</ref><ref type="bibr">49,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">64]</ref>. It is also important to canonicalize referring expressions and the like so that they can be easily grounded in external knowledge bases. -Cross-linguistic plausibility and portability. We envision a meaning representation that is uniform across languages, so that a wide variety of languages can be annotated in a comparable way. It must thus be able to deal with variability in morphosyntax (e.g. constituent order, degree of inflectional synthesis of the verb), grammaticalization of different ways of dividing up conceptual space <ref type="bibr">[75]</ref>, and different morphosemantic mappings between concepts and words.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Overview of Abstract Meaning Representation</head><p>When designing UMR we use Abstract Meaning Representation (AMR) <ref type="bibr">[6]</ref>, a meaning representation designed for English, as a starting point, and extend it to other languages and enhance its expressiveness. AMR has attracted significant attention in recent years due to its simplicity and its focus on semantic content such as predicate-argument structure, named entities, and word senses that are key to many NLP applications. Its formal properties as a single-rooted, node-and edge-labeled directed graph also make it amenable to machine learning based parsing algorithms <ref type="bibr">[32,</ref><ref type="bibr">79,</ref><ref type="bibr">50,</ref><ref type="bibr">87,</ref><ref type="bibr">15]</ref>, adding to its attractiveness. In this sense it has already satisfied the first two of our UMR design goals. An example AMR is provided in <ref type="bibr">(1)</ref>. In this example, the AMR of the sentence "The president pardoned him for health reasons" is formally a graph where the nodes represent semantic concepts and edges represent relations. The semantic concepts can be word senses (e.g., pardon-01, cause-01) or word lemmas when the senses of a word are yet to be defined (e.g., president, he, reason, health). The concepts can also be entity types (e.g., person, date-entity), or quantity types (e.g., monetary-quantity, distance-quantity). AMR relations include participant roles that are defined for each predicate (e.g., ARG0, ARG1), as well as general semantic relations (e.g., MOD).</p><p>(1) The president pardoned him for health reasons. [Figure: AMR graph for (1); edge labels include Arg0, Arg1-of, and Mod.]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>UMR extends AMR in three core ways. First, while it has been shown that AMR can be extended to languages like Chinese, Czech, or Korean <ref type="bibr">[47,</ref><ref type="bibr">48,</ref><ref type="bibr">19]</ref> that have existing foundational resources like valency lexicons or frame files, it is not immediately clear how to extend it to language families with different morphosyntactic properties and especially to low-resource languages. Second, while existing meaning representations such as Minimal Recursion Semantics (MRS) <ref type="bibr">[21,</ref><ref type="bibr">34]</ref> and Discourse Representation Structures (DRS) <ref type="bibr">[45,</ref><ref type="bibr">13]</ref> have been designed to support logical inference, AMR lacks modal, aspectual, and scopal annotation that is crucial to logical inference. Finally, while multisentence AMR <ref type="bibr">[58]</ref> includes inter-sentential coreference that goes beyond AMR's original sentence-level focus and allows, for example, the concept he posited for the pronoun "him" in (1) to be linked to its referent in a preceding sentence, it lacks annotation of temporal and modal relations that can also go beyond sentence boundaries.</p><p>In the next four sections, we present extensions and refinements to AMR along the three lines described in the previous paragraph. In Section 4, we present our extensions and refinements to AMR at the sentence level, specifically how UMR annotates aspect (Section 4.1) and scope (Section 4.2). We also show how UMR scope annotation can be used to support conversion to first-order logical expressions. In Section 5, we discuss document-level refinements and extensions to AMR. In particular, we discuss how we add coreference and temporal and modal dependencies to the UMR document-level representation, and how sentence-level and document-level representations are combined. 
In Section 6, we refine UMR to accommodate cross-linguistic variation in semantic distinctions that are encoded in languages. Specifically, we discuss how to annotate grammatical semantic distinctions in different languages in a comparable way (Section 6.1), and cross-linguistic issues in mapping word tokens in sentences to UMR concepts (Section 6.2). Finally, in Section 7, we present a road map for annotating UMRs for low-resource languages, particularly focusing on annotating participant roles in languages without frame files.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Extensions: expanding the semantic range of AMR at the sentence level</head><p>Sentence-level representation refers to semantic categories pertaining to single events. These include the participants in the event (predicate-argument structure); semantic categories such as valency, aspect, and participant roles; and quantification and scope relations among participants. Predicate-argument structure is well represented in current AMR; in this section we describe the extension of AMR to include annotation of aspect, quantification and scope relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Aspect in UMR</head><p>The UMR aspect annotation, building on <ref type="bibr">[27,</ref><ref type="bibr">28]</ref>, marks a feature on events that captures their internal temporal and qualitative structure. The annotation values do not correspond to specific verbs or constructions in a language, but characterize the event in context.</p><p>The annotation distinguishes five base level aspectual values -State, Habitual, Activity, Endeavor, and Performance -and a range of more fine-grained and more coarse-grained values organized in a lattice format <ref type="bibr">[75]</ref> as described in Section 6.1. The State value corresponds to stative events in <ref type="bibr">[76]</ref>; no change occurs during the event. It also includes predicate nominals (be a doctor), predicate locations (be in the forest), and thetic (presentational) possession (have a cat). The Habitual value is annotated on events that occur regularly in the past or present. The Activity value indicates an event has not necessarily ended and may be ongoing at Document Creation Time (DCT). Endeavor is used for processes that end without reaching completion (i.e., termination), whereas Performance is used for processes that reach a completed result state. The Performance value corresponds to achievements and accomplishments <ref type="bibr">[76]</ref>. Event nominals are typically hard to annotate for aspect, since they lack the grammatical cues that verbs often show. Therefore, they are all annotated with the coarse-grained value Process.</p><p>The aspect annotation is implemented as an aspect feature (e.g., :aspect Performance) in UMR. For examples, see Figure <ref type="figure">1</ref>.</p></div>
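<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration only, the aspect feature described above can be sketched as a small closed vocabulary with a validation step. The value names are those listed in this section; the class and function names (Aspect, parse_aspect, aspect_attribute) are hypothetical and not part of any UMR tooling, and the sketch ignores the finer-grained and other coarse-grained lattice values.</p>
<p><![CDATA[
```python
from enum import Enum


class Aspect(Enum):
    """Aspect values named in Section 4.1 (illustrative subset of the lattice)."""

    STATE = "State"
    HABITUAL = "Habitual"
    ACTIVITY = "Activity"
    ENDEAVOR = "Endeavor"
    PERFORMANCE = "Performance"
    PROCESS = "Process"  # coarse-grained value used for event nominals


def parse_aspect(label: str) -> Aspect:
    """Return the Aspect matching a label, ignoring case and outer whitespace."""
    cleaned = label.strip().lower()
    for value in Aspect:
        if value.value.lower() == cleaned:
            return value
    raise ValueError(f"unknown aspect value: {label!r}")


def aspect_attribute(label: str) -> str:
    """Format a validated label as an ':aspect' attribute string (hypothetical helper)."""
    return f":aspect {parse_aspect(label).value}"


if __name__ == "__main__":
    print(aspect_attribute("performance"))
```
]]></p><p>A check of this kind would reject misspelled labels at annotation time instead of letting them reach the corpus.</p></div>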
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Scope in UMR</head><p>One notable shortcoming of AMR is its lack of representation for scoping relations, leading to scope ambiguity even in cases where it is not warranted, a problem when translating AMRs to first-order logic expressions. While other meaning representations, including MRS and DRS, explicitly represent scope, we want to preserve the advantages of AMR, including its relative simplicity and focus on predicate-argument structure. We therefore follow <ref type="bibr">[62]</ref> and augment predicates with an optional scope node that specifies the relative scope ordering of its arguments. For example, consider the UMR for the sentence in (2):</p><p>(2) "Someone didn't answer all the questions"</p><p>(a / answer-01 :ARG0 (p / person) :ARG1 (q / question :quant A :polarity -) :pred-of (s / scope :ARG0 p :ARG1 q))</p><p>The scope node indicates that "someone" takes wide scope over (not) "all the questions" (i.e., there exists someone who didn't answer all the questions). If the argument order were reversed (:ARG0 q and :ARG1 p), another interpretation would arise, in which some questions were not answered by anyone. Note that pred-of is an inverse relation that indicates answer-01 is a predicate under the scope node.</p><p>We adopt a continuation-passing style semantics for scope <ref type="bibr">[7]</ref>, inspired by the semantics for AMRs in <ref type="bibr">[12]</ref>. Briefly, the relative scopes of the arguments are determined by the order of evaluation. A scope node, if present, then acts as a restriction on the possible orderings. For example, a continuized representation for the above AMR, with [[someone]] evaluated before [[not all the questions]], is given below in (3a), with the corresponding first-order logic expression in (3b):</p><p>(3) a. &#955;k.[[someone]](&#955;n.[[not all the questions]]</p><p>b. 
&#8707;p(person(p) &#8743; &#172;&#8704;q(question(q) &#8594; &#8707;a(answer-01(a) &#8743; ARG1(a, q) &#8743; ARG0(a, p))))</p></div>
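<div xmlns="http://www.tei-c.org/ns/1.0"><p>The role of evaluation order can be illustrated with a minimal sketch, assuming each quantified argument is modelled as a function over continuations that returns a formula string. The helper names (some, not_all, evaluate) are hypothetical, and the code is a toy rendering of the idea, not the semantics defined in the cited work. Evaluating the ARG0 quantifier first yields the expression in (3b); reversing the order yields the other reading discussed for (2).</p>
<p><![CDATA[
```python
def some(var, restrictor):
    """[[someone]]-style quantifier: takes a continuation k and returns a formula."""
    return lambda k: f"∃{var}({restrictor}({var}) ∧ {k(var)})"


def not_all(var, restrictor):
    """[[not all the questions]]-style quantifier over continuations."""
    return lambda k: f"¬∀{var}({restrictor}({var}) → {k(var)})"


def evaluate(quantifiers, body):
    """Apply (role, quantifier) pairs in order; the first one takes widest scope."""

    def build(remaining, bound):
        if not remaining:
            return body(bound)
        role, quantifier = remaining[0]
        # The continuation binds this role's variable, then evaluates the rest.
        return quantifier(lambda v: build(remaining[1:], {**bound, role: v}))

    return build(list(quantifiers), {})


def answer(bound):
    """Predicate body for answer-01 with its two participant roles."""
    return f"∃a(answer-01(a) ∧ ARG1(a, {bound['ARG1']}) ∧ ARG0(a, {bound['ARG0']}))"


someone = ("ARG0", some("p", "person"))
questions = ("ARG1", not_all("q", "question"))

wide_someone = evaluate([someone, questions], answer)    # scope node: ARG0 p, ARG1 q
wide_questions = evaluate([questions, someone], answer)  # reversed argument order

if __name__ == "__main__":
    print(wide_someone)
    print(wide_questions)
```
]]></p></div>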
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Extensions: Expanding AMR to document-level representation</head><p>The sentence-level representation presented in Section 4 in and of itself is insufficient to properly interpret the semantic content of a text, as some semantic relations go beyond sentence boundaries. Such semantic relations include entity and event coreference, temporal relations between events, and modal dependencies. We represent semantic relations that go beyond sentence boundaries in a document-level representation that complements the sentence-level representation. It is important to note that the document-level semantic relations can but do not necessarily go beyond sentence boundaries. For instance, coreference can occur across sentence boundaries, or within the same sentence. The same is true for temporal relations and modal dependencies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Coreference</head><p>As we can see from our AMR example in (1), resolving anaphoric expressions such as pronouns is essential to the interpretation of the semantic content of a text. The UMR coreference annotation, like AMR, includes both entity and event coreference, and extends it to inter-sentential coreference. The entity annotation includes identity relations where an anaphoric expression refers to the same entity as another expression (same-entity) in a document as in (4), and subset relations where the referent of an anaphoric expression is a subset of the referents for another expression, as in <ref type="bibr">(5)</ref>.</p><p>(4) a. Edmund Pope tasted freedom today for the first time in more than eight months. b. He denied any wrongdoing.</p><p>(5) He is very possesive and controlling but he has no right to be as we are not together.</p><p>The UMR event coreference annotation includes event identity where there are multiple mentions of the same event (same-event) as in <ref type="bibr">(6)</ref>, as well as cases where one event mention is a subset of another event mention as in <ref type="bibr">(7)</ref>. The decision to annotate event coreference is partially motivated by the need to make inferences on the temporal relations between events, which we discuss in the next section. There are other types of coreference, such as bridging, that UMR does not currently consider, in order to keep UMR simple and practical. We demonstrate how coreference is annotated in UMR in Section 5.4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Temporal dependencies</head><p>UMR adopts the TimeML view that the temporal relations in a text can be interpreted in terms of relations between a time expression and an event, between two events, and between two time expressions <ref type="bibr">[61]</ref>. This is a broader temporal annotation approach than annotating just tense, which is the temporal relation between the time when an event occurs and the document creation time (DCT). There are two reasons for this. One is that for many languages (e.g., Chinese), tense is not overtly grammaticalized, and as a result, the relation is not intuitive to speakers of those languages <ref type="bibr">[84]</ref>. Another reason is that tense annotation alone is insufficient for interpreting the temporal relations in a text. Two events that are both in the past may have a temporal precedence that cannot be captured with tense alone. UMR further adopts the idea that the temporal relations in a document are hierarchically organized in a temporal dependency structure <ref type="bibr">[88]</ref>, a view that is compatible with the graph representation of the rest of the UMR annotation. For the most part, the event-time relations are annotated as part of the predicate-argument structure annotation at the sentence level. In example (9) in Section 5.4, for instance, based on the predicate-argument structure annotation at the sentence level, we know that Edmund tasted freedom "today". To properly interpret the temporal content of this sentence, however, we also need to properly interpret when "today" is. This is done in the UMR document-level annotation, which focuses on event-event and time-time relations. The UMR temporal annotation proceeds as follows. For each relative time expression (e.g., "today", "yesterday"), we identify another time expression it depends on to resolve it to an absolute time. 
For example "today" in (9) depends on the DCT of the sentence in order to be resolved. For each event, we identify another reference event with respect to which the temporal location of this event can be most specifically defined. The most specific reference event can be determined from grammatical or contextual clues. For example, from the context, we can determine that the convict event in <ref type="bibr">(10)</ref> happened before the tasted event in <ref type="bibr">(9)</ref>. The implementation of UMR document-level annotation is illustrated in Figure <ref type="figure">1</ref>.</p></div>
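<div xmlns="http://www.tei-c.org/ns/1.0"><p>The procedure above can be sketched as follows, using the :temporal triples given for sentence (9) in the integrated example of Section 5.4. The sketch assumes, for simplicity, that every node has exactly one reference node, so that the structure is a tree rooted in DCT; the function names (build_parents, path_to_root) are hypothetical.</p>
<p><![CDATA[
```python
# (child, relation, reference) triples as given for sentence (9).
TEMPORAL = [
    ("s1t2", ":before", "DCT"),
    ("s1m", ":before", "s1t2"),
    ("s1t3", ":depends-on", "DCT"),
]


def build_parents(triples):
    """Map each node to its (relation, reference) pair.

    Assumption of this sketch: one reference node per event or time expression.
    """
    parents = {}
    for child, relation, reference in triples:
        if child in parents:
            raise ValueError(f"{child} already has a reference node")
        parents[child] = (relation, reference)
    return parents


def path_to_root(node, parents, root="DCT"):
    """Return the chain of relations linking a node to the root."""
    path = []
    seen = set()
    while node != root:
        if node in seen or node not in parents:
            raise ValueError(f"{node} is not anchored to {root}")
        seen.add(node)
        relation, reference = parents[node]
        path.append((node, relation, reference))
        node = reference
    return path


if __name__ == "__main__":
    for step in path_to_root("s1m", build_parents(TEMPORAL)):
        print(step)
```
]]></p><p>Following the chain from a node to DCT makes explicit how each event or time expression is located relative to the document creation time.</p></div>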
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Modal dependencies</head><p>The UMR modal dependency captures the epistemic strength and polarity of events, as related to conceivers (or sources) <ref type="bibr">[77]</ref>. The epistemic strength and polarity relations are largely based on FactBank <ref type="bibr">[67]</ref>: full affirmative (:AFF), partial affirmative (:PARTAFF), neutral affirmative (:NEUTAFF), neutral negative (:NEUTNEG), partial negative (:PARTNEG) and negative (:NEG). Events and conceivers (sources) make up the nodes in the dependency structure and epistemic strength and polarity characterize the edges. The dependency structure parallels the annotation for temporal relations <ref type="bibr">[88]</ref>.</p><p>The dependency structure permits the representation of nested modal values, necessary to annotate certain linguistic constructions, as in <ref type="bibr">(8)</ref>, where the author of the text has only a partial affirmative commitment to the senator's belief, which in turn represents a neutral affirmative commitment to the bill's passing tomorrow.</p><p><ref type="bibr">(8)</ref> The senator probably thinks that the bill could pass tomorrow.</p><p>:modal ((pass :NEUTAFF senator) (senator :PARTAFF AUTH))</p><p>The dependency structure uses the same modal strength values for the links between two conceivers, two events, or a conceiver and an event. Scope relations between modality and negation are represented in the dependency structure.</p><p>When annotating modal dependencies, annotators do not need to construct the entire dependency structure themselves, especially at Stage 0 of the road map (see Section 7), which simplifies the annotation process. Annotators are expected to give events a MODSTR ("modal strength") value, but conceivers remain unspecified. Events under the scope of a modal predicate and events under the scope of a reporting predicate receive a special annotation: MODAL and QUOT, respectively. 
Events with a MODAL annotation do not receive a MODSTR annotation, as this can be automatically inferred from the main verb (e.g. want conveys a NEUTRAL modal strength onto its complement). Events with a QUOT annotation do receive a regular MODSTR value in addition. The MODSTR, MODAL and QUOT annotations, together with the argument structure and the lexical semantics of modal verbs, can be used to automatically create the dependency structure.</p></div>
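<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the nesting in (8) concrete, the following sketch reads the flat :modal notation used in (8) into triples and recovers the chain of commitments from the author down to an event. It assumes the six strength labels listed in this section (written without hyphens) and handles only flat triples of the form (child :STRENGTH parent), not the nested parenthesization that appears in Figure 1; the names parse_modal and commitment_chain are hypothetical.</p>
<p><![CDATA[
```python
import re

# The six epistemic strength and polarity labels listed in Section 5.3.
STRENGTHS = {":AFF", ":PARTAFF", ":NEUTAFF", ":NEUTNEG", ":PARTNEG", ":NEG"}

# Matches one flat triple such as "(pass :NEUTAFF senator)".
TRIPLE = re.compile(r"\(\s*([^\s()]+)\s+(:[A-Z]+)\s+([^\s()]+)\s*\)")


def parse_modal(text):
    """Extract (child, strength, parent) triples and validate the labels."""
    triples = TRIPLE.findall(text)
    for child, strength, parent in triples:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown modal strength {strength} on {child}")
    return triples


def commitment_chain(event, triples, root="AUTH"):
    """Return the chain from the author to the event, e.g. AUTH ... event."""
    links = {child: (strength, parent) for child, strength, parent in triples}
    chain = [event]
    node = event
    seen = set()
    while node != root:
        if node in seen or node not in links:
            raise ValueError(f"{node} is not linked to {root}")
        seen.add(node)
        strength, parent = links[node]
        chain = [parent, strength] + chain
        node = parent
    return chain


EXAMPLE_8 = ":modal ((pass :NEUTAFF senator) (senator :PARTAFF AUTH))"

if __name__ == "__main__":
    print(commitment_chain("pass", parse_modal(EXAMPLE_8)))
```
]]></p><p>Read from left to right, the resulting chain states that the author is partially committed to the senator's belief, and that the senator's belief is neutrally committed to the passing event.</p></div>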
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Integrating document-level and sentence-level annotation</head><p>We provide an integrated example for a short text in Figure <ref type="figure">1</ref> to illustrate how UMR is implemented. For each sentence, the sentence-level representation is on the left and the document-level representation is on the right <ref type="foot">1</ref> . The document-level representation makes reference to sentence-level concepts, using an ID that combines the sentence ID and the concept ID. For instance, "s1t2" refers to the concept "t2" in the sentence "s1". The document-level representation consists of coreference, temporal, and modal relations that link entity and event concepts in the current sentence to other concepts, potentially in a previous sentence.</p><p>:name (s1n2 / name . :op1 "Edmund" .</p><p>:op2 "Pope")) .</p><p>:ARG1 (s1f / free-04 :ARG1 s1p) .</p><p>:time (s1t3 / today) .</p><p>:ord (s1o3 / ordinal-entity :value 1 .</p><p>:range (s1m / more-than .</p><p>:op1 (s1t / temporal-quantity :quant 8 .</p><p>:unit (s1m2 / month)))))</p><p>. (s1 / sentence . :temporal ((s1t2 :before DCT) .</p><p>(s1m :before s1t2) .</p><p>(s1t3 :depends-on DCT)) .</p><p>:modal ((s1t2 :AFF AUTH))) <ref type="bibr">(10)</ref> Pope is the American businessman who was convicted last week on spying charges and sentenced to 20 years in a Russian prison.</p><p>. :ARG1-of (s3w / wrong-02))))</p><p>. (s3 / sentence . :temporal ((s3d :before DCT)) .</p><p>:modal ( (s3d :AFF AUTH) .</p><p>(s3d2 :NEG (s3h :AFF AUTH))) .</p><p>:coref ((s3p :same-entity s1p)))</p><p>One goal of UMR is application to as many languages as possible. Therefore, the annotation values must accommodate cross-linguistic diversity in linguistic distinctions in semantic space; and the annotation process must accommodate cross-linguistic diversity in the semantic distinctions encoded in a language, and in the mapping between words and concepts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Adapting UMR to accommodate cross-linguistic diversity in linguistic distinctions</head><p>The cross-lingual annotation is guided by two main principles. First, the default categories reflect semantic distinctions overtly expressed in as large a proportion as possible of the world's languages, so that annotators for the majority of languages need not infer meanings that are not expressed through overt forms, except for easily defined distinctions such as past vs. present. Second, the scheme is flexible enough to accommodate typological diversity in grammaticalized semantic distinctions while preserving cross-linguistic comparability of annotations. Annotation values are organized in typologically motivated lattices, as proposed in <ref type="bibr">[75]</ref>. A paradigmatic lattice organizes potentially overlapping categories of greater or lesser generality.</p><p>For instance, annotators are encouraged to use the default (bolded) level of modal annotation in Figure <ref type="figure">2</ref>. Annotators for languages with different grammaticalized distinctions may use labels from the higher and lower levels of the paradigmatic lattice for access to more coarse-grained and more fine-grained categories, respectively. The levels are connected, keeping annotations within and across languages comparable. In addition to the lattice for modal categories presented in Figure <ref type="figure">2</ref>, lattices for number, certain spatial relations, aspect and time reference can be found in <ref type="bibr">[75]</ref>. The categories represented in the lattice are based on existing typological work in <ref type="bibr">[22,</ref><ref type="bibr">24]</ref> and <ref type="bibr">[14]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Mapping between words and concepts across languages</head><p>Taking a cross-linguistic perspective on concept-word mappings raises issues that have not typically been at the forefront in computational linguistics. There is, for one, no accepted cross-linguistic definition for what constitutes a "word" across languages ( <ref type="bibr">[26]</ref>, but see <ref type="bibr">[89]</ref>). Even in closely related languages like English and German, the same concept may be considered one word (German Schifffahrten) or two words (English boat rides). Despite the typological differences, our annotations do not depend on language-internal word-hood tests, as they are based in semantic rather than formal criteria. Criteria for concepthood are equally challenging, since languages differ in how they allocate senses to words. But how many concepts are combined into a single word? For example, English cut could be broken down into concepts of causation, instrumentality (a bladelike implement) and change of state; and the Arapaho example in <ref type="bibr">(12)</ref> divides it into morphemes that way. <ref type="bibr">(12)</ref>  :Aspect Performance)</p><p>In practice, descriptive linguists and typologists use as a common denominator a set of concepts that are expressed by single morphemes in at least some languages. Typologically, the mapping between such concepts and words is highly variable. This fact has not been fully appreciated because high-resource languages are skewed towards languages with relatively little morphology. Therefore, prior work in computational linguistics has been focused on multiword expressions (MWEs) <ref type="bibr">[66]</ref>, where a single concept is expressed with multiple words. UMR builds on such previous research, allowing individual languages to propose criteria on how to map MWEs to UMR concepts for that language. 
Many low-resource languages are morphologically complex, using a single word to express concepts for which English needs multiple words. Such "multiconcept words" pose a different issue for semantic annotation. In such multiconcept words, concepts may be expressed by clearly distinct morphemes (teb 'break / remove.stick.like' and -e'ei 'head' in <ref type="bibr">(12)</ref>), a portmanteau morpheme (-o' 1S/3S in ( <ref type="formula">12</ref>)), or a single, unanalyzable morpheme (English kill encodes both the concepts die and CAUSE).</p><p>As a matter of principle, UMR does not require the decomposition of morphologically complex words into morphemes that map to UMR concepts. Instead, such words can as a whole map to multiple concepts. A number of considerations lead to this design decision. Portmanteau morphemes generally cannot be split into separate parts for each concept. Even when morphemes are separable, annotators with less linguistic training or field linguists in the early stages of analysis may not be aware of where morpheme boundaries lie. Finally, concept-word mismatches (multi-word concepts and multi-concept words) threaten the consistency of annotations across languages. Broadly, we adopt four different solutions to these issues. Each solution depends on the semantic categories involved and their behavior across languages.</p><p>First, for categories in which concepts are clearly distinct from each other semantically, we will ask annotators to identify multiple concepts in one word. For now, we apply this solution to argument indexation (often called pronominal affixation) and certain types of noun incorporation.</p><p>Many languages use verbal affixes to index arguments, which then may not be expressed elsewhere in the clause. Such indexed arguments are treated in the same way as free pronouns (i.e., identified as arguments of the verb). 
For example, the affix -o' in <ref type="bibr">(12)</ref> indicates that the verbal predicate takes a 1st person actor and a 3rd person undergoer. This portmanteau morpheme cannot be further decomposed into an actor morpheme and an undergoer morpheme. The event is annotated as having two arguments, since these are clearly identifiable.</p><p>The Arapaho example in ( <ref type="formula">12</ref>) also has implications for the annotation of pronominal references cross-linguistically. AMR currently treats person and number lexically, inserting English lexical pronoun forms as concepts in the AMR graph, such as "he" in <ref type="bibr">(1)</ref>. However, it is cross-linguistically useful for the semantic representation to be independent of whether a participant is referred to with pronouns or with morphological agreement marking <ref type="bibr">[23,</ref><ref type="bibr">40]</ref>. To make this consistent, we encode reduced reference with a named entity type (such as "person") and add referential information such as person or number through additional ":ref" attributes. Person and number lattices are being defined according to typological study <ref type="bibr">[25]</ref>, and we also use ":ref" to encode definiteness, obviation, and language-specific referential information such as grammatical gender.</p><p>There are different types of noun incorporation <ref type="bibr">[54,</ref><ref type="bibr">55]</ref>, where one word expresses both an event and a participant. In less grammaticalized constructions, as in <ref type="bibr">(12)</ref> above, the incorporated noun (-e'ei 'head') functions as an argument of the verb, meaning that no overt NP can fill this semantic role. While UMR treats the whole word teb-e'ei-s as the predicate, it does posit a separate theme concept for 'head'. 
For more grammaticalized types of noun incorporation where the incorporated noun does not replace one of the verbal arguments, such as Arapaho instrumental suffixes like -s 'by blade' in <ref type="bibr">(12)</ref>, we do not posit a separate concept for the instrument.</p><p>Second, verb forms that differ in valency, e.g., causatives, passives and applicatives, are not decomposed into multiple concepts. Instead, their semantics can be inferred from the participant role annotations associated with the verb. For example, the difference between an intransitive verb and its causative (compare intransitive nih-teb-e'ei-t [PAST-break/remove.stick.like-head-3S] 'his head broke off' to the causative in (<ref type="formula">12</ref>)) is reflected by annotating only an Undergoer participant for the former, but both an Actor (the causer) and an Undergoer for the latter.</p><p>Third, some categories expressed by either separate words or verbal affixes will not be treated as separate concepts from the verb they modify. For example, aspect, which may be expressed by verbal morphology or auxiliaries, will simply be annotated with the aspect feature, such as the Performance value annotated for the verb in <ref type="bibr">(12)</ref>; see Section 4.1. The same is true of modal and tense constructions, which will inform the temporal and modal dependency annotations (see Sections 5.2 and 5.3, respectively). Certain semi-modals, such as want, and associated motion constructions will be identified as independent predicates when there is evidence that they are interpreted as independent events (e.g. when an associated motion construction can take locative or directional NPs as its arguments, or when a desiderative construction can be modalized with a modal scoping only over the desire). 
When such evidence is lacking, they will not be identified as independent events.</p><p>Example <ref type="bibr">(13)</ref> illustrates associated motion in Sanapan&#225;: the morphemes expressing motion are suffixed to the verb root 'see'. In the English translation in <ref type="bibr">(14)</ref>, arrival and motion are expressed as separate predicates. In <ref type="bibr">(14)</ref>, arrive can be modified by a locative phrase, as in They arrived at the village.... The Sanapan&#225; associated motion construction in (13) can also occur with an NP expressing a location, but such NPs are likely better analyzed as circumstantial locatives expressing the location of the seeing-event than arguments of the arriving-event. The associated motion verb form in <ref type="bibr">(13)</ref> is therefore represented as denoting a single complex event.</p><p>(13) netamen afterwards apk-el-vet-angv-ay-akm-e' :mod (w / woman) .</p><p>:quant 1))</p><p>Fourth, certain categories are expressed across languages by either derivational verbal morphology or by separate words. This includes types of "nonverbal clauses", such as locatives, nominals, property predication, and possession. Languages differ in how the strategies they use to express these meanings package concepts into words. There are three common strategies, two of which are problematic for the predicate-argument structure of existing meaning representations such as AMR. The cross-linguistic distribution of these strategies is based on our own research, and re-interpretation of the data in <ref type="bibr">[70,</ref><ref type="bibr">71]</ref>.</p><p>In English examples, such as John has a book or John is a doctor, a predicate can easily be identified (have and be, respectively), and so can the NPs that function as its arguments. 
However, in the Kukama object predication in <ref type="bibr">(15)</ref>, the predicate does not map to a specific word: object predication is expressed through juxtaposition of two NPs, with the predicational meaning implicit, but inherent in the construction. In the Kukama thetic (presentational) possession construction in <ref type="bibr">(16)</ref>, the possessum and the possession relation are combined in a single word which functions as a predicate: something typically thought of as an "argument" is predicativized.</p><p>From a perspective of cross-linguistic portability, it is important that these different strategies are annotated in comparable ways, but from the perspective of ease of annotation, one may want to make allowances for annotators of individual languages. <ref type="bibr">(15)</ref> Different solutions are proposed for these two cases: in constructions with predicativized arguments, such as (<ref type="formula">16</ref>), a non-verbal clause function and an argument are identified and linked to the same word (since this word contains both the predication meaning and the argument-like meaning). When there is no overt predicate word, such as in <ref type="bibr">(15)</ref>, we assume that annotators will be able to recognize the type of non-verbal clause function, and use an abstract predicate from Table <ref type="table">1</ref>. The resulting annotations have a comparable structure, as seen above.</p><p>7 Strategies for applying UMR to all languages, including low-resource languages</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">A "roadmap" approach to achieve UMR annotation for low-resource languages</head><p>To allow annotation of indigenous minority languages, certain social and practical concerns must be dealt with too. Firstly, field linguists and native speaker communities must be convinced that semantic annotation can contribute to descriptive analysis, language documentation, and revitalization. It may be argued that semantic annotation can be a useful way of engaging with the meanings conventionalized in a language, and may bring up semantic questions that may not have otherwise come up. Annotated corpora can also aid in language documentation, by enriching (learners') dictionaries with frame files created during argument structure annotation.</p><p>Secondly, for many languages, annotators both highly proficient in the language and familiar with linguistic theory and computational tools may be scarce. Additionally, linguistic attitudes, such as cultural taboos against representing or sharing a language in written form, may further complicate annotation.</p><p>Thirdly, there is considerable cross-linguistic diversity in the availability of computational resources. AMR annotation of predicate-argument structure currently relies on the availability of a verbal lexicon with frame files. However, many languages do not have a standard dictionary or comprehensive grammatical description available, or even grammatical analysis or orthography.</p><p>UMR is being structured as a "road map" that (i) allows for flexibility depending on the material, technological, and linguistic resources available, and (ii) allows field linguists to base annotation on existing data collected in widely used software such as Flex and Tool-  <ref type="bibr">[59]</ref> from those data in a relatively short time frame. 
This will lower the threshold for linguists and communities to start doing semantic annotation.</p><p>We envision this road map as having two extreme points. Stage 0 will be a starting point for annotation of languages with few resources and little description. UMR annotation in this stage would be based on whole words (more precisely, stems whose inflections have been analyzed), without relying on lexical resources. Stage 1 will be a fully specified end point for annotation of languages with significant corpora, description, and other resources, taking advantage of morphological analysis and lexical resources comparable to PropBank's frame files. These should not be seen as discrete annotation stages: languages can gradually move from Stage 0 to Stage 1 as resources are developed, and annotations at later stages will be compatible with those at the earlier stages. This road map proposal is treated in more depth in <ref type="bibr">[78]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Lexical vs Non-lexical roles</head><p>UMR aims for a representation in which concepts in the graph are labeled with language-specific word senses, and where core roles of a predicate are defined using predicate-specific terms. However, we view this as a destination rather than a starting point, as many languages lack lexicons. To resolve this, we define frameworks for both non-lexicalized and lexicalized annotation of predicates and semantic roles, and propose that projects should expand lexical coverage during annotation.</p><p>For the non-lexicalized UMR predicates, the predicate is simply annotated with a lemmatized form and the arguments of that predicate are annotated using a general inventory of core participant roles given in Table 2, which extends AMR's non-core roles based on cross-linguistic argument realization patterns in ValPaL <ref type="bibr">[41]</ref>. The first row of the following Arapaho example illustrates such a non-lexicalized annotation. <ref type="bibr">(17)</ref> Transitioning from such non-lexicalized annotations to lexicalized semantic roles requires the construction of predicate-specific role definitions, as illustrated in the lower row of this example. This shift from a non-lexicalized representation to a lexicalized one is necessary both for establishing consistent argument annotations and for gradually developing the language-specific conventions regarding multi-word expressions and the decomposition of multiconcept words. We suggest that these lexical entries should be mapped to non-lexicalized roles (i.e. the general inventory of participant roles), to enable automatic or semi-automatic conversion of existing annotations to the lexicalized form. The end result will be a purely lexicalized annotation.</p><p>While the same road map approach might be adopted for entity typing and word senses, we expect that word senses will be defined for individual languages in UMR annotation. 
While the general inventory of entity types currently used in AMR can be reframed into a more cross-linguistically robust form, we leave that to future work.</p></div>
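<div xmlns="http://www.tei-c.org/ns/1.0"><p>The suggested mapping from lexical entries to the general participant roles can be sketched as a small conversion routine. This is only an illustration under stated assumptions: the frame entry break-01, the role labels in it, and the function name lexicalize are hypothetical and are not taken from any actual UMR or PropBank resource. Roles without a mapping are kept as they are and reported for manual review, which corresponds to the semi-automatic case.</p>
<eg>
```python
# Hypothetical frame entries: general participant role -> predicate-specific role.
# Both the sense label and the role names are illustrative assumptions.
FRAMES = {
    "break-01": {"actor": "ARG0", "undergoer": "ARG1"},
}


def lexicalize(lemma, sense, roles, frames=FRAMES):
    """Convert a non-lexicalized annotation into a lexicalized one.

    lemma  -- lemmatized predicate used in the non-lexicalized annotation
    sense  -- sense label chosen for the predicate
    roles  -- dict mapping general participant roles to their fillers

    Returns (predicate label, converted roles, roles needing manual review).
    """
    mapping = frames.get(sense)
    if mapping is None:
        # No frame entry yet: keep the non-lexicalized annotation untouched.
        return lemma, dict(roles), sorted(roles)
    converted, unresolved = {}, []
    for role, filler in roles.items():
        if role in mapping:
            converted[mapping[role]] = filler
        else:
            converted[role] = filler
            unresolved.append(role)
    return sense, converted, sorted(unresolved)


if __name__ == "__main__":
    annotation = {"actor": "person", "undergoer": "head", "instrument": "blade"}
    print(lexicalize("break", "break-01", annotation))
```
</eg>
<p>Keeping the unmapped roles in the output, rather than discarding them, means that a partially lexicalized annotation stays compatible with the earlier non-lexicalized one.</p></div>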
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Pilot annotation experiments</head><p>As of now, we have not yet been able to conduct significant amounts of annotation with the full UMR scheme in any language. We have, however, conducted a number of annotation experiments aimed specifically at testing the robustness of the proposed UMR annotation schemes for newly added semantic domains (temporality, modality, and aspect), and at evaluating the efficiency of lattice-style annotation schemes in maintaining cross-linguistic comparability. These experiments were all conducted in English (annotation pilots of Arapaho, Kukama, Navajo, and Sanapan&#225; are under way). As long as these new semantic domains can be reliably annotated, we are confident that UMR as a whole can be reliably annotated, as it builds on AMR, which has been successfully annotated in large-scale annotation projects <ref type="bibr">[6]</ref>, and the typological adaptations are based on large-scale cross-linguistic studies of the categories and constructions described above.</p><p>Experiments on temporal dependency annotation have been carried out in a crowd-sourcing setting with untrained annotators on Amazon Mechanical Turk <ref type="bibr">[86]</ref>. This is arguably a more challenging annotation setting, as it is impractical to ensure that the annotators have the proper linguistic background or to require that they read detailed guidelines. In our temporal dependency annotation experiments, for each relative time expression (e.g., today), the crowd-worker is asked to identify its reference time so that it can be resolved to an absolute time that can be properly interpreted. For each event, the crowd-worker is asked to identify either a reference event or a reference time and then determine the temporal relations (e.g., Before, After) between them. We measure the agreement between an expert annotator and the aggregated majority opinion of crowd-workers. 
The agreement is measured separately for events and time expressions, and we use two metrics, unlabeled and labeled F1 score. Unlabeled F1 score measures whether the annotators agree on the reference time or reference event, while labeled F1 is a more stringent metric that also measures whether the same temporal relation is identified. Assuming that the time expressions and events are properly identified, which is a reasonable assumption in UMR annotation as the sentence-level annotation is already in place, the unlabeled and labeled agreement is 0.85 and 0.77 for time expressions and 0.75 and 0.83 for events. A more detailed breakdown shows that annotators can more reliably determine temporal relations but find it more challenging to agree on the same reference time or event. Since UMR annotation is designed for "traditional" annotation approaches that assume detailed annotation guidelines and trained annotators, we expect annotation agreement on temporal dependency can only improve in such a setting.</p><p>The most extensive annotation experiment was conducted for the modal dependency annotation scheme. As reported in <ref type="bibr">[77]</ref>, six English texts were annotated by two independent expert annotators using this scheme, amounting to 377 events expressed in 108 sentences. In the first pass of annotation, the identification of events to be given a modal value, inter-annotator agreement scores were very high (precision: 0.94, recall: 0.93, F-score: 0.93). For the second annotation pass, setting up the modal superstructure (the relations between conceivers), precision was high (0.91) but recall was much lower (0.77). The F-score for this pass was 0.83. For the third pass, the attribution of modal strength values to each event, inter-annotator agreement was once again high (precision, recall, and F-score all 0.88). 
Even though there were significant and interesting genre-based differences in inter-annotator agreement (agreement was consistently much higher for events from newswire text than for events from messages on discussion forums), these results show that the annotation scheme can be successfully implemented.</p><p>Second, to test the robustness of the aspect annotation, five English texts were annotated by the same two expert annotators as in the modal annotation experiment described above. A total of 238 events were given one of six aspect labels: State, Habitual, Activity, Endeavor, Performance, or N/A (the latter was used for event nominals and future events, and changed to Process later). Inter-annotator agreement for this task was slightly lower than for the modal annotation, but still much higher than chance: Cohen's K = 0.84; Siegel &amp; Castellan's adjusted K to account for bias = 0.84; 2 * observed proportion of agreement - 1 = 0.80. Agreement was again considerably higher in news text than in narrative text; for one news text there was perfect agreement. Once again, these agreement results are encouraging. Both for the modal and the aspectual annotation, these results await confirmation from annotation in other languages.</p><p>Lastly, a short cross-linguistic annotation experiment was conducted to gauge whether the organization of semantic categories in a lattice leads to higher cross-linguistic inter-annotator agreement <ref type="bibr">[75]</ref>. Thirty-six English sentences expressing spatial relations and their Czech, Dutch, and Korean translations were annotated, each by one native speaker of the language in question. Annotators used a lattice with support, attachment, and containment as the default-level values. On the level above, non-containment grouped together support and attachment, while non-support grouped together attachment and containment. More fine-grained values were adhesion (intermediate between support and attachment, e.g. 
a band-aid on an arm), and attached containment (intermediate between attachment and containment, e.g. an apple on a tree branch, where it is also enveloped by leaves).</p><p>A fairly large proportion of sentences were annotated with an identical value across languages: for the lowest-scoring language pair (Czech and English), Cohen's K was 0.64 for exact agreement, while for the highest-scoring language pair (Czech and Dutch), Cohen's K was 0.86. However, when looking at sentences with compatible annotations, rather than identical ones (e.g. an event annotated as attachment in one language but non-containment in another), scores went up considerably: the lowest-scoring language pair was now Czech and Korean (Cohen's K = 0.79), while the highest-scoring language pair was now Dutch and Korean (Cohen's K = 0.94). Therefore, the proposed lattice architecture seems fairly successful at abstracting away from language-specific differences in category boundaries. These results also await further confirmation from cross-linguistic annotation of different semantic domains.</p></div>
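<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between exact and compatible agreement in the lattice experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table COVERS encodes just the two coarse values named in the text, the finer-grained values (adhesion, attached containment) are left out because their exact placement in the lattice is not spelled out here, and the merge step that scores compatible labels as agreement is our own simplification, not necessarily the procedure used in the experiment. All function names are hypothetical.</p>
<eg>
```python
from collections import Counter

# Coarse lattice values and the default-level values they group together.
COVERS = {
    "non-containment": {"support", "attachment"},
    "non-support": {"attachment", "containment"},
}


def compatible(a, b):
    """True if the labels are identical or one is a coarser value covering the other."""
    return a == b or b in COVERS.get(a, set()) or a in COVERS.get(b, set())


def merge(a, b):
    """Rewrite label b as a when the two are compatible (assumed normalisation step)."""
    return a if compatible(a, b) else b


def cohen_kappa(xs, ys):
    """Cohen's kappa for two equally long sequences of labels."""
    n = len(xs)
    observed = sum(x == y for x, y in zip(xs, ys)) / n
    cx, cy = Counter(xs), Counter(ys)
    expected = sum(cx[k] * cy[k] for k in cx) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def compatible_kappa(xs, ys):
    """Kappa after treating lattice-compatible labels as agreement."""
    return cohen_kappa(xs, [merge(x, y) for x, y in zip(xs, ys)])


if __name__ == "__main__":
    lang_a = ["support", "attachment", "containment", "attachment"]
    lang_b = ["support", "non-containment", "containment", "containment"]
    print("exact:", cohen_kappa(lang_a, lang_b))
    print("compatible:", compatible_kappa(lang_a, lang_b))
```
</eg>
<p>On this toy input the compatible score is higher than the exact one, because attachment and non-containment count as agreement while attachment and containment still do not.</p></div>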
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Related work</head><p>Though necessarily somewhat superficially, we discuss the design choices of UMR in relation to five existing meaning representations: Abstract Meaning Representation (AMR) <ref type="bibr">[6]</ref>, Discourse Representation Structures (DRS) <ref type="bibr">[44,</ref><ref type="bibr">13]</ref>, the tectogrammatical layer of the Prague Dependency TreeBank (PDT) <ref type="bibr">[37,</ref><ref type="bibr">38]</ref>, Minimal Recursion Semantics (MRS) <ref type="bibr">[21]</ref>, and Universal Conceptual Cognitive Annotation (UCCA) <ref type="bibr">[1]</ref>. These five meaning representations are selected as a basis for comparison because: i) they have all been deployed in large-scale annotation projects; ii) they all provide some degree of abstraction from surface text spans, and iii) they provide a complete meaning representation for at least the entire sentence if not the whole text. These requirements preclude from consideration partial meaning representations such as semantic role labeling frameworks like FrameNet <ref type="bibr">[5]</ref> and Propbank <ref type="bibr">[59]</ref> where the focus is on the argument structure of verbal and nominal predicates. Of course, the meaning representations we have chosen for comparative discussion are not intended to be a complete list of semantic representations that have been proposed over the years.</p><p>Abstract Meaning Representation (AMR) <ref type="bibr">[6]</ref>, firstly, represents sentence meanings as single-rooted, directed, and acyclic graphs in which the nodes are concepts and the edges are relations. AMR concepts include word sense-disambiguated verbal and nominal predicates, entities (e.g., "person") and relations (e.g., "have-org-role-91"), or simple lemmas. 
AMR relations include Propbank-style semantic roles (e.g., "Arg0"), general semantic relations (e.g., ":degree"), discourse relations (e.g., ":condition"), etc.</p><p>Discourse Representation Theory (DRT) <ref type="bibr">[44]</ref> proposes a discourse-level meaning representation for an entire text that can be easily translated into logical form. The Groningen Meaning Bank (GMB) <ref type="bibr">[13]</ref> is a large data set annotated with DRS that makes use of word senses from WordNet, semantic roles from VerbNet, and rhetorical relations from SDRT <ref type="bibr">[3]</ref>, and puts the theoretical foundation of DRT into practice. The GMB is produced by associating Combinatory Categorial Grammar (CCG) parses <ref type="bibr">[72]</ref> with semantic forms. Semantic forms of the entire sentence can be constructed from primitive semantic forms associated with lexical entries in the CCG lexicon. As the CCG tree of a sentence is constructed, so is its semantic representation. The Parallel Meaning Bank (PMB) <ref type="bibr">[2]</ref> is a more recent effort to extend GMB annotation to multiple languages to create a parallel corpus annotated with DRS.</p><p>The tectogrammatical layer of the Prague Dependency TreeBank (PDT) <ref type="bibr">[37,</ref><ref type="bibr">38]</ref> covers many of the same semantic distinctions covered by AMR, such as the argument structure (semantic roles, called "functors" in PDT), word senses, coreference, and intra- and inter-sentential discourse relations. Additionally, it annotates tense, modalities, and a host of other "semantic" node attributes, bridging and textual coreference, as well as topic/focus (information structure), which are not part of AMR annotation. 
PDT uses a multi-layered annotation framework where the tectogrammatical layer is explicitly linked by individual node references to the other (lower) layers of linguistic analysis.</p><p>Minimal Recursion Semantics (MRS) <ref type="bibr">[21]</ref> is a sentence-level meaning representation that also focuses on representing predicate-argument structure, sense distinctions where they are grammaticalized, logical semantic phenomena such as quantification and operator-like scopal predicates, and tense, aspect, modality etc. as determined by morpho-syntax. MRS emphasizes semantic compositionality <ref type="bibr">[20,</ref><ref type="bibr">9]</ref>, and full representations are typically derived in conjunction with grammar-based parsers, e.g. the English Resource Grammar (ERG) <ref type="bibr">[33]</ref>.</p><p>Universal Conceptual Cognitive Annotation (UCCA) <ref type="bibr">[1]</ref> has a foundational layer that focuses on the predicate-argument structure. The UCCA foundational layer views text as a collection of scenes, and each scene contains a main relation (a state or process) that is the anchor of the scene, as well as participants of the relation. As it currently stands, UCCA does not annotate word senses, named entities, or relations as AMR does, nor does it annotate tense, aspect, modality, and quantification scope like MRS. However, it has a multi-layered design like PDT that allows extensions, and there is ongoing research to add coreference annotation to UCCA <ref type="bibr">[60]</ref>.</p><p>We compare UMR with these existing meaning representations against the four desiderata we have outlined in Section 2, as well as whether they support discourse-level semantic processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.1">Scalability</head><p>Scalability is a key consideration in the design of UMR, since it needs to be applicable in large-scale annotation settings. We argue that representations that can be annotated independently of other layers of linguistic analysis (particularly syntactic annotation, as this is a highly complex task in itself) have an advantage in this regard. Such independence from syntactic layers of annotation not only improves the scalability of an annotation system within one language, but also improves cross-linguistic portability by making semantic annotation possible for languages which may not have syntactically annotated corpora, nor the resources to create them. Ultimately, we aim to train accurate semantic parsers, which is only possible given sufficient amounts of training data.</p><p>AMR relaxes the strict correspondence between the meaning representation and morphosyntax: the concepts and relations in AMR need not be linked to constituents in a morphosyntactic structure, or even sub-segments of the surface linguistic signal. This contributes to scalability by eliminating the time needed to build morphosyntactic structures preceding semantic annotation. It allows great freedom in handling syntax-semantics mismatches, such as "contentless" function words which can be left out of the meaning representation (e.g. infinitival to in English), constructs in the meaning representation which do not correspond to any words in the text but can be inferred from the context (e.g. a concept "person" can be inferred from the surface phrase "the young"), and complicated correspondences between the meaning representation and the surface syntactic structure (e.g. discontinuous constructions such as English "as</p><p>Like AMR, UCCA is also a stand-alone meaning representation that does not have to be linked to a syntactic representation. 
In addition, like AMR, it also allows the annotation of implicit arguments, that is, arguments that are not lexicalized. In this sense, it has the same advantage as AMR in terms of scalability. In fact, it has been shown that UCCA does not require linguistically trained annotators <ref type="bibr">[1]</ref>, adding to its attractiveness. However, UCCA currently lacks many of the semantic elements that other meaning representations have, including named entities and relations between them, and logical constructs such as tense, aspect, modality, polarity, and quantification scope. It therefore remains to be seen whether the simplicity (and thus scalability) of the UCCA foundational layers can be maintained if it is extended to account for these additional semantic elements.</p><p>The three other semantic representation frameworks, except in part for the PDT which keeps the layers separated but interlinked, are more tightly tied to syntactic annotation and harder to annotate independently. Minimal Recursion Semantics (MRS) <ref type="bibr">[21]</ref> and Discourse Representation Theory <ref type="bibr">[45]</ref> are not intended to be created directly through manual annotation, and they are typically produced compositionally via the mediation of syntactic structures. Examples of large-scale initiatives in producing these meaning representations include the Lingo Redwoods Initiative <ref type="bibr">[57]</ref> and the Groningen Meaning Bank (GMB) <ref type="bibr">[8]</ref>, respectively. The goal of the GMB is to generate Discourse Representation Structures (DRS), mediated by syntactic structures produced by Combinatory Categorial Grammar <ref type="bibr">[72]</ref> parsers. The Lingo Redwoods Initiative adopts a similar approach in generating MRS representations via Head-Driven Phrase Structure <ref type="bibr">[65]</ref> parses.</p><p>UMR inherits the approach of AMR by allowing meaning representation to be annotated independently of syntactic structures. 
This makes it possible to annotate the meaning of morphologically complex low-resource languages without having to tackle their morphosyntactic complexities. Semantic annotation can therefore get off the ground more quickly, which makes the approach more scalable. However, see <ref type="bibr">[9]</ref> for a different perspective on this trade-off, where it is argued that basing semantic annotation on syntactic annotation makes it more likely to be consistent. We believe that with detailed guidelines that specify when a new concept can be inferred and when a discontinuous pattern can be mapped to a single concept/relation, annotation consistency can be achieved.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.2">Supporting both lexical and logical inference</head><p>Minimal Recursion Semantics has the clear goal of supporting logical inference in reasoning-based AI systems: it is easily translatable into predicate logic. Much of its focus is on proper representation of semantic components such as quantification, negation, tense, and modality. Less emphasis is placed on the representation of lexical semantic information such as semantic roles and word senses, or entities and relations, which are prioritized in AMR.</p><p>On the other end of the spectrum are meaning representations aimed at supporting similarity-based lexical inferences. These are crucial to modern practical natural language understanding applications, and include distinguishing different senses of words (e.g. John ran a race versus John runs a company), and identifying which entities play which semantic roles in events (e.g. John is the Actor (Arg0) of both examples above). Both AMR and the Tectogrammatical (TG) annotation in the PDT family of treebanks <ref type="bibr">[52,</ref><ref type="bibr">38,</ref><ref type="bibr">39]</ref> focus on facilitating such inferences. Similarly to AMR, every verb in the TG annotation is sense-disambiguated and linked to a lexical entry in a valency lexicon of the language being annotated <ref type="bibr">[36,</ref><ref type="bibr">73]</ref>. These lexical entries are further linked to semantic classes of verbal synonyms <ref type="bibr">[74]</ref> and consequently also to entries in other lexical resources, such as FrameNet <ref type="bibr">[5]</ref>, VerbNet <ref type="bibr">[68]</ref>, Propbank <ref type="bibr">[59]</ref> and WordNet <ref type="bibr">[53]</ref>. 
Several TG-annotated corpora are available, including the parallel Prague Czech-English Dependency Treebank PTB/WSJ corpus <ref type="bibr">[37,</ref><ref type="bibr">38,</ref><ref type="bibr">39]</ref>.</p><p>UCCA does not currently annotate word senses, but it achieves "semantic stability" by annotating scenes that consist of participants in a relation. As a result, it is indifferent to variations that result from paraphrasing. In this sense, it supports lexical inference. As it currently does not annotate elements that are important to logical inference, in this respect it is more in line with AMR and the tectogrammatical layer of PDT.</p><p>Discourse Representation Theory sits somewhere in the middle of the spectrum. It has a clear focus on supporting logical inference. At the same time, it also draws from WordNet <ref type="bibr">[53]</ref> and VerbNet <ref type="bibr">[68]</ref> in that it uses the word sense information in the former and the semantic role labels in the latter.</p><p>UMR adds the annotation of quantifier scope and negation to AMR to support logical inference. <ref type="bibr">[12]</ref> has shown that with disambiguated quantifier scope, AMR can be translated into first-order logic. Apart from scope, UMR adds layers with temporal, aspectual, and modal information to AMR, as described in Section 5. Such additional layers on top of AMR expand its capability for logical inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.3">Achieving cross-linguistic uniformity</head><p>For a meaning representation to be uniform across languages, it must provide mechanisms for abstracting away from language-specific morphosyntax without losing semantic information. An advantage of AMR is that it already abstracts away from constituent order, which varies widely across languages. This has been shown to make AMR more portable across languages, for example to Chinese and Czech <ref type="bibr">[85,</ref><ref type="bibr">47,</ref><ref type="bibr">48]</ref>. UCCA, by intentionally attempting to annotate semantic relations that are indifferent to syntactic relations, has also proved to be readily portable to other languages <ref type="bibr">[43]</ref>. Similarly, DRS, MRS, and PDT have all been annotated in multilingual contexts <ref type="bibr">[2,</ref><ref type="bibr">51,</ref><ref type="bibr">39]</ref>, and have all demonstrated some level of cross-lingual validity.</p><p>To our knowledge, however, UMR is the first to explicitly take low-resource languages into account and is general enough to eventually encompass the range of variation exhibited by the world's roughly 7000 languages. Unlike Interlingua <ref type="bibr">[56,</ref><ref type="bibr">29,</ref><ref type="bibr">30,</ref><ref type="bibr">42]</ref>, UMR does not aim to build a common cross-linguistic vocabulary for all concepts. Instead, UMR aims to factor out what is common for all languages: roughly, it proposes the use of language-specific lexical databases and Propbank-style frame files, and cross-linguistically portable annotation layers for "grammatical" semantics such as aspect and modality. UMR thus uses a combination of language-specific (concrete, lexical) concepts and a shared inventory of (abstract, grammatical) concepts, features, and relations for all languages. This shared inventory is expected to have a manageable vocabulary size of a few hundred items. 
As it is based on cross-linguistic typological research from the last six decades, UMR can represent robust cross-linguistic commonalities in a uniform manner while allowing for cross-linguistic variation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.4">Document-level representation</head><p>We have shown in Section 5 that it is impossible to properly interpret the meaning of a text without representing document-level language phenomena. Existing meaning representations are either sentence-level representations (MRS) or have mostly focused on coreference and limited discourse relations across sentences (the tectogrammatical layer of PDT). For example, the Groningen Meaning Bank, based on DRS, includes coreference annotation and rhetorical relations based on Segmented Discourse Representation Theory (SDRT) <ref type="bibr">[3]</ref>. The Multi-Sentence AMRs <ref type="bibr">[58]</ref> focus on various forms of coreference annotation. UCCA annotation can cross sentence boundaries and can span several paragraphs, and there is also research on adding coreference annotation to UCCA. UMR advances the state of the art by providing a robust document-level representation that captures coreference as well as temporal and modal relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">Conclusion and future work</head><p>In this paper we presented the design of a Uniform Meaning Representation (UMR) that draws on existing meaning representations while substantially extending them. To make UMR cross-linguistically plausible, we proposed a staged strategy for annotating morphologically complex low-resource languages with UMR. We also proposed additional features for UMR, and a document-level representation that captures semantic relations that potentially go beyond sentence boundaries.</p><p>Designing a meaning representation that captures the essential semantic content of a text is crucial to making progress in natural language understanding. While each aspect of the UMR annotation has been performed individually to some extent in prior work, packing all of them into one simple (and thus practical and scalable) framework remains a formidable challenge. We believe UMR has taken a significant step in that direction and hope to produce UMR annotations for many languages. We are currently developing UMR annotation guidelines and tools and, as they become available, plan to annotate data from multiple languages with UMR.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>As can be seen from this example, the document-level representation is a list of triples in the form of &lt;dependent relation parent&gt;, and deviates from the Penman notation used for the sentence-level representation.</p></note>
		</body>
		</text>
</TEI>
