This paper presents a data-driven study focusing on analyzing and predicting sentence deletion — a prevalent but understudied phenomenon in document simplification — on a large English text simplification corpus. We inspect various document and discourse factors associated with sentence deletion, using a new manually annotated sentence alignment corpus we collected. We reveal that professional editors utilize different strategies to meet readability standards of elementary and middle schools. To predict whether a sentence will be deleted during simplification to a certain level, we harness automatically aligned data to train a classification model. Evaluated on our manually annotated data, our best models reached F1 scores of 65.2 and 59.7 for this task at the levels of elementary and middle school, respectively. We find that discourse level factors contribute to the challenging task of predicting sentence deletion for simplification.
more »
« less
Controllable Text Simplification with Explicit Paraphrasing
Text Simplification improves the readability of sentences through several rewriting transformations, such as lexical paraphrasing, deletion, and splitting. Current simplification systems are predominantly sequence-to-sequence models that are trained end-to-end to perform all these operations simultaneously. However, such systems limit themselves to mostly deleting words and cannot easily adapt to the requirements of different target audiences. In this paper, we propose a novel hybrid approach that leverages linguistically-motivated rules for splitting and deletion, and couples them with a neural paraphrasing model to produce varied rewriting styles. We introduce a new data augmentation method to improve the paraphrasing capability of our model. Through automatic and manual evaluations, we show that our proposed model establishes a new state-of-the-art for the task, paraphrasing more often than the existing systems, and can control the degree of each simplification operation applied to the input texts.
more »
« less
- Award ID(s):
- 2055699
- PAR ID:
- 10270675
- Date Published:
- Journal Name:
- Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Page Range / eLocation ID:
- 3536 to 3553
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)Monolingual word alignment is important for studying fine-grained editing operations (i.e., deletion, addition, and substitution) in text-to-text generation tasks, such as paraphrase generation, text simplification, neutralizing biased language, etc. In this paper, we present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans. We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models in more realistic settings. Experimental results show that our proposed model outperforms all previous approaches for monolingual word alignment as well as a competitive QA-based baseline, which was previously only applied to bilingual data. Our model demonstrates good generalizability to three out-of-domain datasets and shows great utility in two downstream applications: automatic text simplification and sentence pair classification tasks.more » « less
-
We explore the Collatz conjecture and its variants through the lens of termination of string rewriting. We construct a rewriting system that simulates the iterated application of the Collatz function on strings corresponding to mixed binary–ternary representations of positive integers. Termination of this rewriting system is equivalent to the Collatz conjecture. To show the feasibility of our approach in proving mathematically interesting statements, we implement a minimal termination prover that uses the automated method of matrix/arctic interpretations and we perform experiments where we obtain proofs of nontrivial weakenings of the Collatz conjecture. Finally, we adapt our rewriting system to show that other open problems in mathematics can also be approached as termination problems for relatively small rewriting systems. Although we do not succeed in proving the Collatz conjecture, we believe that the ideas here represent an interesting new approach.more » « less
-
The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding crite- ria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms.more » « less
-
The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding crite- ria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms.more » « less
An official website of the United States government

