<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Accounting for sentence position and legal domain sentence embedding in learning to classify case sentences</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>December 2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10386977</idno>
					<idno type="doi"></idno>
					<title level='j'>Legal knowledge and information systems</title>
<idno>1570-3886</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Huihui Xu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this paper, we treat sentence annotation as a classification task. We employ sequence-to-sequence models to take sentence position information into account in identifying case law sentences as issues, conclusions, or reasons. We also compare the legal domain specific sentence embedding with other general purpose sentence embeddings to gauge the effect of legal domain knowledge, captured during pre-training, on text classification. We deployed the models on both summaries and full-text decisions. We found that the sentence position information is especially useful for full-text sentence classification. We also verified that legal domain specific sentence embeddings perform better, and that meta-sentence embedding can further enhance performance when sentence position information is included.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As an initial step toward automatically generating comprehensible legal summaries, we have been exploring machine learning (ML) methods for classifying sentences of legal cases in terms of the issues a court addresses, its conclusions on those issues, and its reasons for so concluding (IRCs). In previous work, we have experimented with different models, both traditional machine learning and deep learning, to identify these types of sentences in both summaries and full texts. While we demonstrated that those models can identify IRC types of sentences to some extent, the task remains challenging for machine encoding.</p><p>In this paper, we employ supervised ML based on a larger annotated dataset, 1049 pairs of full-text cases and summaries in which sentences have been manually annotated in terms of IRCs. We also explore whether two new techniques, sentence embeddings pre-trained on large quantities of legal texts and taking sentence order into account, help machine annotation of legal cases.</p><p>In attempting to leverage the power of state-of-the-art sentence embeddings pre-trained on legal texts, we hypothesize that the broader contextual information associated with the sentence embeddings will improve performance.</p><p>We also hypothesize that taking sentence ordering information into account will improve the classifier's performance. In regular meetings with our two third-year law student annotators to resolve differences concerning annotations, we noticed that they tended to rely on ordering information to mark up certain types of sentences. For example, annotators would look for conclusions following issues or at the end of a case. We wondered whether ML could also employ such position information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Extracting Issues, Reasons, and Conclusions</head><p>The ultimate goal of our work is to enable an intelligent system to help end users assess a case's potential relevance by effectively and efficiently conveying some important substantive information about the case. Human-prepared legal summaries are available through various on-line legal service providers. For example, the CanLII Connects website <ref type="foot">4</ref> of the non-profit Canadian Legal Information Institute, <ref type="foot">5</ref> features summaries of legal decisions prepared by members of Canadian legal societies.</p><p>Based on the experience of CanLII Connects, summaries as short as three sentences could be even more effective in a legal IR interface. This raises a practical question: "What can a three-sentence case summary provide?". Legal argument triples, IRCs, may be the answer. Issues, reasons, and conclusions form the skeleton of case briefs, a legal writing technique for summarizing cases that has long been taught in American law schools. Thus, the potential utility of summarizing cases in terms of issues, conclusions, and reasons seems clear.</p><p>Based on our annotation experience, the human-prepared CanLII summaries regularly include issues raised by the courts, the conclusions reached, and reasons connecting them. Those summaries also include some procedural information, descriptions of facts, statements of legal rules, case citations and explanations, and other information. Since the expert legal summarizers act as an intelligent and well-informed filter on importance, it made sense to leverage their expertise by annotating their summaries rather than the full texts. CanLII has provided 28,733 paired cases and human-prepared summaries for purposes of this research. The cases cover a variety of kinds of legal claims and issues presented before Canadian courts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Hypotheses</head><p>We try to answer two long-standing questions in the Artificial Intelligence and Law field: first, whether legal language is so unique that legally pre-trained models would assist downstream legal natural language processing tasks, and if so, which tasks; second, whether sentence position information helps a model as it appears to help human annotators.</p><p>We investigate how well the classification models perform based on sentence embeddings, and annotate full texts of cases and summaries in terms of issues, reasons, and conclusions. As noted, we examine two hypotheses in this paper:</p><p>(1) A model will perform better when incorporating sentence position information.</p><p>(2) A model will perform better when incorporating specific legal domain knowledge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Word and Sentence Embeddings</head><p>Word embeddings, dense vector representations trained with neural language models, capture some linguistic relationships between words and assist with various natural language processing tasks. See, e.g., <ref type="bibr">[1]</ref>. Researchers further explored word meta-embeddings built with operations such as concatenation, SVD, and 1toN <ref type="bibr">[2]</ref>. Experiments in <ref type="bibr">[3]</ref> showed that averaging different sources of word embeddings has effects similar to concatenating those embeddings. Researchers in <ref type="bibr">[4]</ref> used three types of autoencoders to learn meta-embeddings of words.</p><p>Similarly, a sentence embedding is a dense vector representation of a sentence; it provides information about the larger context of words. <ref type="bibr">[5]</ref> introduced Sentence-BERT in 2019; they used Siamese and triplet network structures to derive fixed-size 768-dimensional vector representations for input sentences. Google Research developed the Universal Sentence Encoder in 2018 <ref type="bibr">[6]</ref>. The encoder has two model architectures: one based on the transformer architecture and the other on Deep Averaging Networks (DAN). Both transform input sentences into fixed 512-dimensional sentence embeddings. Both Sentence-BERT and the Universal Sentence Encoder are state-of-the-art sentence embeddings.</p><p>In the legal domain, words may have different semantic meanings than in other domains. For example, 'sentence' means the judgment that a court formally pronounces after finding a criminal defendant guilty. <ref type="foot">6</ref> To address this, we employed Legal-BERT, a BERT model trained on legal domain sentences <ref type="bibr">[7]</ref>. 
Legal-BERT was pretrained on the entire Harvard Law case corpus from 1965 to present, comprising 3,446,187 legal decisions across all federal and state courts <ref type="bibr">[7]</ref>. <ref type="foot">7</ref></p></div>
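The pooling that turns a transformer's per-token vectors into a single fixed-size sentence embedding can be sketched with mean pooling, one common scheme (whether a given Legal-BERT checkpoint uses exactly this pooling is an assumption; the token vectors below are synthetic stand-ins, not outputs of an actual BERT model):

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (non-padding) tokens into
    one fixed-size sentence embedding."""
    token_vectors = np.asarray(token_vectors, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)[:, None]
    return (token_vectors * mask).sum(axis=0) / mask.sum()

# Synthetic stand-in for a transformer's output: 5 tokens x 768 dims,
# with the last 2 positions being padding.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 768))
emb = mean_pool(tokens, attention_mask=[1, 1, 1, 0, 0])
print(emb.shape)  # (768,)
```

Whatever the pooling scheme, the result is one vector per sentence, which is what the classifiers in this paper consume.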
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Argument Mining and Summarization</head><p>Extracting propositions, premises, conclusions, and nested argument structures <ref type="bibr">[8,</ref><ref type="bibr">9]</ref> is an active research topic in the legal argument mining field. Rhetorical and other roles that sentences play in legal arguments have been employed for legal argument mining <ref type="bibr">[10]</ref>. Citation information and fact patterns <ref type="bibr">[11,</ref><ref type="bibr">12]</ref> that affect the strength of a side's claim in particular legal domains are also being explored. Segmenting legal texts by function <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>, by topic <ref type="bibr">[15]</ref>, or by linguistic analysis <ref type="bibr">[16,</ref><ref type="bibr">17,</ref><ref type="bibr">18]</ref> are some initial steps for dissecting a legal document.</p><p>Researchers have applied legal argument mining to the task of summarizing legal cases. In <ref type="bibr">[19]</ref>, the authors propose an unsupervised algorithm that incorporates legal domain knowledge, such as the rhetorical roles sentences play in a legal document. <ref type="bibr">[20]</ref> summarized Japanese judgments in terms of issues, conclusions, and framings. Our legal argument triples have a similar structure but use types that are more broadly understandable than those tailored to Indian or Japanese legal judgments. In addition, our work uses a set of case summaries prepared by legal experts to extract argument triples from the full case texts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>Our type system for labeling sentences in legal cases comprises:</p><p>1. Issue - The legal question which a court addressed in the case. 2. Conclusion - The court's decision on the corresponding issue. 3. Reason - Sentences that elaborate on why the court reached the Conclusion.</p><p>We treat all non-annotated sentences as non-IRC sentences.</p><p>Two hired third-year law school students annotated sentences from the human-prepared summaries to identify the issues, reasons, and conclusions. Both students annotated 1049 randomly selected pairs from the 28,733 case/summary pairs available. The total number of sentences in the corresponding full texts is 215,080, significantly more than the 11,496 sentences in the corresponding summaries.</p><p>Both annotators followed an 8-page detailed Annotation Guide, prepared by the third author, a law professor, in order to mark up instances of the IRC sentence types in both the summaries and the full texts of cases. The annotators worked on successive batches of summaries using the Gloss annotation environment developed by the second author. After annotating each batch, the annotators resolved any annotation differences in regular Zoom meetings attended by the first and third authors.</p><p>The procedure for annotating the full texts of cases differs from annotating the summaries. The Annotation Guide instructs annotators to search the full text of the case for the sentences that are most similar to the annotated summary sentences and to assign them the same labels (i.e., Issue, Conclusion, or Reason) as in the summaries. Annotators may pick terms or phrases from the annotated summary sentences as anchors to search for corresponding sentences in the full texts. Annotators do not need to read the full text of the case once they find the corresponding sentences. The Guide warns that there may not be an exact correspondence between the annotated sentences in the summary and those in the full text of the case. This is fairly common, because human summarizers tend to edit the sentences they select from the full case texts. 
For example, a human summarizer may combine some shorter sentences into a longer one.</p><p>By using the summaries' annotations as anchors to target corresponding sentences in the full text, we attempted to leverage the summarizers' work in selecting important sentences and the annotators' work in marking up some of those full-text sentences as issues, conclusions, or reasons. We developed this strategy to expedite the full-text annotation process, since it would be much more time-consuming and costly if annotators had to read the full texts of cases. The strategy is based on the observation that sentences of summaries stem from those in the full texts. The strategy also helps us to confirm the mapping relationship between summaries and full texts, which is a step towards generating summaries automatically.</p><p>Cohen's &#954; <ref type="bibr">[21]</ref> is used to measure the degree of agreement between the two annotators after their independent annotations. The mean of Cohen's &#954; coefficients across all types for summaries is 0.734, and the mean for full texts is 0.602. Table <ref type="table">1</ref> reports descriptive statistics of the resulting dataset: basic statistics of each sentence type in both summaries and full texts, along with the lengths of the summaries and the full texts. According to <ref type="bibr">[22]</ref>, both scores indicate substantial agreement between annotators about the sentence type.</p><p>For the summary annotation, the mean Reason agreement is the lowest among the three types. Annotating Reasons is more challenging since they are entwined with case facts. The agreement scores for full texts are lower than the summaries' scores, since sentences from summaries and full texts are not in a one-to-one mapping; this increases the difficulty of full-text annotation. Figure <ref type="figure">1</ref> reports the distributions of final consensus labels from summaries and full texts. 
The most frequent label is the non-IRC label for both summaries and full texts; the second most frequent is the Reason label. This label distribution aligns with our observation that Reasons tend to be elaborated at greater length than Issues and Conclusions.</p><p>The descriptive statistics of the processed dataset are shown in Table <ref type="table">1</ref>. The average number of sentences in a full text is 205.03, while the range of full-text lengths is quite large. Comparatively, the average number of sentences in a summary is 10.96; as expected, summaries are much shorter than full texts. We also observe that the average length of Issues is the highest in both summaries and full texts.</p></div>
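The agreement measure reported above can be computed with a short, self-contained Cohen's &#954; implementation; the two label sequences below are invented toy data, not our annotators' actual labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)

# Toy example with the four labels used in this paper.
ann1 = ["Issue", "Reason", "Reason", "Conclusion", "non-IRC", "Reason"]
ann2 = ["Issue", "Reason", "non-IRC", "Conclusion", "non-IRC", "Reason"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.769
```

Values around 0.6-0.8, like the 0.734 (summaries) and 0.602 (full texts) reported above, fall in the "substantial agreement" band.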
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Models</head><p>We use Sentence-BERT <ref type="bibr">[5]</ref>, Universal Sentence Encoder (USE) <ref type="bibr">[6]</ref>, and Legal-BERT <ref type="bibr">[7]</ref> to encode sentences from summaries and full texts into a semantic space. Each sentence then becomes a fixed-size vector, and each document comprises a series of such sentence vectors.</p><p>Sentence-BERT uses two BERT models with tied weights and adds a pooling operation to the output to derive fixed-size sentence embeddings. We chose the 'all-mpnet-base-v2' model trained on a dataset of over 1 billion pairs. <ref type="foot">8</ref> This model encodes sentences into 768-dimensional vectors and has achieved competitive performance across different datasets. USE takes a tokenized string and outputs a fixed 512-dimensional vector as the sentence embedding. <ref type="foot">9</ref> Legal-BERT was trained on the entire Harvard Law case corpus. To derive a fixed-size sentence embedding, we simply keep the output of the model's last pooling layer. The dimension of the Legal-BERT sentence embedding is also 768.</p><p>The Long Short-Term Memory (LSTM) neural network <ref type="bibr">[23]</ref>, a variant of the recurrent neural network (RNN), can deal with inputs of arbitrary length. A traditional RNN does not perform well on long sequences due to the problem of vanishing gradients; LSTM tackles this problem by incorporating different gates. A bidirectional LSTM consists of two separate LSTMs: one reads the input from right to left, the other from left to right.</p><p>We also examined the effect of one of the meta-sentence embedding techniques. Averaging is a commonly used meta-embedding technique: it simply averages different sources of embeddings. 
According to <ref type="bibr">[24]</ref>, averaging performs similarly to concatenation for meta-word embeddings while requiring less time and fewer resources. We extend this idea to sentence embeddings and construct two types of meta-sentence embeddings: Legal-BERT + USE and Legal-BERT + Sentence-BERT. Both types of meta-sentence embeddings are 768-dimensional.</p></div>
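The averaging step can be sketched as follows, assuming the source embeddings have already been brought to a common dimensionality; the two 768-dimensional input vectors are random stand-ins, not actual Legal-BERT or Sentence-BERT outputs:

```python
import numpy as np

def meta_embedding(*source_embeddings):
    """Average several sentence embeddings of equal dimension
    into one meta-sentence embedding."""
    stacked = np.stack([np.asarray(e, dtype=np.float32) for e in source_embeddings])
    return stacked.mean(axis=0)

# Random 768-d stand-ins for one sentence's Legal-BERT and
# Sentence-BERT embeddings.
rng = np.random.default_rng(0)
legal_bert, sbert = rng.standard_normal((2, 768))
meta = meta_embedding(legal_bert, sbert)
print(meta.shape)  # (768,)
```

Unlike concatenation, averaging keeps the meta-embedding at the sources' dimensionality, so the downstream LSTM's input size is unchanged.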
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Design</head><p>This work employs two designs that differ in how the sentence embeddings are fed into the bidirectional LSTM model: 1) a single time step is associated with each sentence, and no sentence position information is provided; 2) fixed-size document matrices are input to the model, each time step is associated with a sentence, and sentence position information is thereby provided. We refer to <ref type="bibr">[14]</ref> for the padding procedure. For example, since the maximum length of the full texts is 2411 sentences, we transformed each full-text document into a 2411 &#215; 768 matrix when using the Sentence-BERT embedding. Shorter cases are padded to the maximum length. In this paper, we chose pre-padding over post-padding since <ref type="bibr">[25]</ref> demonstrates that pre-padding performs substantially better than post-padding for LSTMs. Figure <ref type="figure">2</ref> shows the structure of the models and the main difference between the two designs. Our rolled LSTM reads one sentence at each time step. The returning arrow (left) represents multiple time steps for "with position information"; "without position information" (right) involves only a single time step.</p><p>We split the dataset into training, validation, and test sets. The training set comprises 70% of the 1049 cases; the validation and test sets each have 15%. The data are fed into the bidirectional LSTM model with 256 units and a dropout rate of 0.2. A categorical cross-entropy loss function and the Adam optimizer are used to optimize the model. The initial learning rate is set to 1e-3 and reduced by a factor of 0.1 when the validation loss stops decreasing, with a patience of 20 epochs. Training stops when the validation accuracy has not increased for 20 epochs, and validation accuracy is used to select the best model.</p></div>
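The pre-padding step of design 2) can be sketched as below; 2411 is the full-text maximum reported above, while the 3-sentence document is a made-up example:

```python
import numpy as np

def pre_pad(doc_matrix, max_len):
    """Pre-pad a document (n_sentences x embedding_dim) with zero rows
    at the FRONT, so every document becomes a max_len x dim matrix;
    pre-padding is preferred over post-padding for LSTMs per [25]."""
    doc_matrix = np.asarray(doc_matrix, dtype=np.float32)
    n, dim = doc_matrix.shape
    pad = np.zeros((max_len - n, dim), dtype=np.float32)
    return np.vstack([pad, doc_matrix])

# A hypothetical 3-sentence document of 768-d sentence embeddings,
# padded to the full-text maximum of 2411 sentences.
doc = np.random.default_rng(0).standard_normal((3, 768))
padded = pre_pad(doc, max_len=2411)
print(padded.shape)  # (2411, 768)
```

With pre-padding, the real sentences occupy the last time steps, so the forward LSTM processes them immediately before emitting its final state.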
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>The results of the two experimental designs are shown in Table <ref type="table">2</ref>. The first 5 rows of the table show how well the model performs on summaries and full texts without position information; the next 5 rows show its performance when incorporating sentence position information. All numbers are reported as F1 scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Without Position Information vs. With Position Information</head><p>For summaries, the model identifies Reasons better with position information regardless of which sentence embedding is used. However, the Legal-BERT and Legal-BERT + USE sentence embeddings with position information do not perform better on Issue and Conclusion classification. The average F1 scores of all sentence embeddings are higher when the model also receives sentence position information.</p><p>For full-text sentence classification, the pattern is much clearer: Issues and Reasons are more easily identified by the model when sentence position information is included. The Sentence-BERT and Legal-BERT embeddings do not perform better with position information on Conclusion classification. The average F1 scores are better when the model is fed sentence position information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Domain Specific Sentence Embedding vs. General Purpose Sentence Embedding</head><p>The Legal-BERT sentence embedding achieves the best performance on Issue, Reason, and Conclusion for both summaries and full texts when the model is not fed sentence position information. It also performs better than the other sentence embeddings when the model takes sentence position information into account, except on classifying Issues in summaries.</p><p>The Sentence-BERT sentence embedding is the second-best embedding on most of the classification tasks, while USE is second best on full-text Reason classification.</p><p>Sentence position information in summaries is not as reliable as in full texts. We found that the model ignores some Issue instances that appear in the middle of summaries.</p><p>Compared to our prior work <ref type="bibr">[26]</ref>, the F1 scores across all types decrease substantially. In <ref type="bibr">[26]</ref>, we obtained F1 scores of 0.58, 0.15, and 0.53 on Issue, Reason, and Conclusion, respectively. Our expectation that performance would improve after training on more data was not confirmed. Several factors could contribute to this result: first, the initial learning rates differ, which leads to different performance; second, noisy data also increase along with the amount of data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and future work</head><p>We analyzed the effect of sentence position information and legal domain specific sentence embedding on the task of labeling case sentences in terms of legal argument triples. We found that sentence position information does help the model perform better, especially for full texts. We also verified that legal domain specific sentence embedding performed better on this legally intensive task than the other general purpose sentence embeddings. Meta-sentence embedding, which inherits benefits from both general purpose and legal sentence embeddings, can outperform its components when position information is incorporated. These results suggest a promising path toward annotating legal documents automatically. This is also a step towards automatically generating succinct legal summaries, since the model can identify the important sentences.</p><p>This work is subject to certain limitations as well. As mentioned before, paradoxically, the overall performance on full texts tended to decrease with the larger training set. For future work, we will explore more effective models to improve performance, such as by introducing additional linguistic features and their semantic values.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>huihui.xu@pitt.edu</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>jsavelka@cs.cmu.edu</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>ashley@pitt.edu</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>https://canliiconnects.org/en</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>https://www.canlii.org/en/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>https://www.law.cornell.edu/wex/sentence</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>The pre-trained Legal-BERT model can be found here: https://huggingface.co/zlucia/legalbert</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>https://huggingface.co/sentence-transformers/all-mpnet-base-v2</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8"><p>https://tfhub.dev/google/universal-sentence-encoder/4</p></note>
		</body>
		</text>
</TEI>
