<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Tell Me A Story Like I'm Five: Story Generation via Question Answering</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10249509</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 3rd Workshop on Narrative Understanding</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Louis Castricato</author><author>Spencer Frazier</author><author>Jonathan Balloch</author><author>Mark Riedl</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Neural language model-based approaches to automated story generation suffer from two important limitations. First, language model-based story generators generally do not work toward a given goal or ending. Second, they often lose coherence as the story gets longer. We propose a novel approach to automated story generation that treats the problem as one of generative question-answering. Our proposed story generation system starts with sentences encapsulating the final event of the story. The system then iteratively (1) analyzes the text describing the most recent event, (2) generates a question about "why" a character is doing the thing they are doing in the event, and then (3) attempts to generate another, preceding event by answering this question. We show that the coherency of a story can be measured as the relative entropy over the distribution of responses to claims about said story's events. Using a within-subjects human evaluation we measure this coherency entropy over the responses to sets of True-False statements for multiple stories generated by our model and each baseline. The evaluation shows that our system generates stories that are on average 15.9% more coherent than those generated by the BART [Lewis et al., 2019] language model fine-tuned on a story corpus to generate sentences in reversed order to more closely match our process.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Consider a story, at the ending of which a princess is reunited with her lover thought to be lost at sea, a swordsman has enacted revenge on the man who killed his father, and a giant becomes a pirate. One might reasonably wonder how this situation came to pass. Aristotle writes in Poetics that the events of the story serve the plot and the end. Under this interpretation, storytelling is explanation: every event answers the question of "how did the next event come to pass?". In this paper, we propose an automated story generation system using the principles of question-answering and show how it can improve automated story generation capabilities.</p><p>Automated Story Generation is the challenge of designing an artificial intelligence system that can generate a story from a minimal number of inputs, often just a prompt and some storytelling knowledge. Symbolic story and plot generation systems have traditionally relied on planning or case-based reasoning (see <ref type="bibr">Gerv&#225;s [2009]</ref> for an overview of symbolic story generation systems). Some of these systems start with an end state (the state the fictional world should be in at the end of the story) and work backward, determining what must have happened to transform an initial world state into the goal. These systems often generate coherent stories guaranteed to end in a given state. Their drawback is that they require significant hand-authored domain knowledge.</p><p>Machine learning-based story generation systems acquire or learn story domain knowledge from data, often corpora of human-authored stories. Most machine learning-based story generation systems have relied on neural network-based language models. Auto-regressive neural language models trained on a corpus of stories learn a probability distribution over tokens p(t_n | t_{n-1}, t_{n-2}, ..., t_{n-k}) based on the tokens that occur in the training corpus. 
This distribution can then be sampled to create new texts that emulate the training corpus. Training a neural language model on story corpora results in a generative model that produces texts that look like stories <ref type="bibr">[Roemmele, 2016;</ref><ref type="bibr">Khalifa et al., 2017;</ref><ref type="bibr">Martin et al., 2018]</ref>. However, language-model-based approaches are unable to bring stories to a particular conclusion or goal state. Stories generated by language models also tend to lose coherence over time as they rely on probabilistic sampling and do not learn a richer model of the story world.</p><p>We consider how neural story generation systems can be induced to generate more coherent narratives that also end in a pre-determined, desirable way. Narratives are perceived to be coherent when events are related to each other in a way that is comprehensible by the reader <ref type="bibr">[Trabasso and Van Den Broek, 1985;</ref><ref type="bibr">Graesser et al., 1991]</ref>. Many relations between events fit this need; the most important are (1) causal relations, in which one event cannot happen if another event had not happened prior to it, and (2) character goal hierarchies, in which an action is in service of a goal or of another action that is in service of a goal.</p><p>Our insight is that if each event in the story is generated to explicitly answer the question of "why" the next event in the story happens, then readers will perceive the story as more coherent. To generate a story that will be perceived as coherent and that builds up to a pre-determined ending, we propose to generate the story backward. This is achieved by starting from a textual description of the final event; each event added is the one that best answers the question of what must have preceded it. Our system, EDGAR, repeats this process for a specified number of iterations. 
Questions are generated using a commonsense inference model, Para-COMET <ref type="bibr">[Gabriel et al., 2020],</ref> to predict what readers are likely to believe about a story event; the inferences are transformed into questions using templates. EDGAR then attempts to answer each question using a generative question-answering model. We evaluate our system against a baseline neural transformer-based language model approach that is fine-tuned to generate story events backward, matching the backward process of EDGAR. We measure story coherence with two human-participant studies. In the first, perceived coherence is measured as the entropy in participant responses to true/false questions about the story; a story that is more comprehensible results in less random guessing by human readers. We find that EDGAR generates more coherent stories than the baseline: answers about stories generated by EDGAR had 15.9% lower entropy than answers about stories generated by the baseline. The second evaluation is subjective: we measure coherence via a subjective questionnaire. Participants chose stories written by EDGAR as the more coherent about twice as often as those written by the baseline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Gerv&#225;s <ref type="bibr">[2009]</ref> overviews early symbolic story generation systems. Story generation systems that use symbolic story planners utilize logic-like domain representations that provide knowledge about available actions, their preconditions, and their effects. A search process, such as that by <ref type="bibr">Riedl and Young [2010]</ref>, selects a goal condition or a precondition of an action in the plan and attempts to find another, preceding action that has an effect that establishes the condition. This process iterates, creating chains of preconditions and effects until everything is grounded in the initial world state. However, the chaining can be done forward from the initial state to the goal as well <ref type="bibr">[Ware and Young, 2010]</ref>.</p><p>Neural networks have the potential to generate a greater range of stories by learning a model for how to tell stories from a corpus of exemplar stories.</p><p>Neural language models learn the probability that one or more tokens will occur given a history of one or more prior tokens, P_&#952;(t_{n+1}, ..., t_{n+m} | t_{n-k}, ..., t_{n-1}, t_n), according to token occurrence patterns in a corpus. Neural language models can be induced to generate text that can be read as a story by sampling from the learned distribution over tokens and appending them to a prompt. Some neural language model based story generation techniques include <ref type="bibr">[Roemmele, 2016;</ref><ref type="bibr">Martin et al., 2018;</ref><ref type="bibr">Khalifa et al., 2017]</ref>. However, a neural language model alone is incapable of achieving a specific end state or event. Sampling from a distribution over tokens only considers the most likely successive tokens given a window of prior tokens. Neural language models also tend to lose story coherence over time. 
This is because a language model only models a distribution over tokens in the training set. Additionally, the hidden parameters of current neural networks are unlikely to encode the state of a fictional world as human readers would understand it. <ref type="bibr">Tambwekar et al. [2018]</ref> attempt to train a neural language model to generate toward a given goal. They fine-tune a neural language model with a policy-gradient reinforcement learning technique that rewards the language model for generating events progressively closer to the goal event. This has the benefit of improving readers' perceptions of coherence but, being based on a language model, does not ensure that any transition from one event to the next will always be perceived as related.</p><p>Other neural approaches to story generation use hierarchical conditioning, in which a high-level guidance specification is given either periodically or per sentence in the story <ref type="bibr">[Fan et al., 2018;</ref><ref type="bibr">Yao et al., 2019;</ref><ref type="bibr">Rashkin et al., 2020;</ref><ref type="bibr">Ammanabrolu et al., 2020b]</ref>. These high-level guidance specifications turn the generation problem into a supervised learning problem. We do not consider these approaches further in this paper because we do not assume the existence of a guidance specification.</p><p>One approach to automated story generation that uses neural networks that are not based on language modeling is <ref type="bibr">C2PO [Ammanabrolu et al., 2020a]</ref>, which uses the <ref type="bibr">COMET [Bosselut et al., 2019]</ref> commonsense inference engine to generate successor and predecessor events, performing a bi-directional search from a given start event and a given end event. It is relevant to our work in that it does partially chain backward from a given end event, and also uses a commonsense inference engine. 
However, C2PO generates plots made up of short statements of character intentions, whereas our system generates stories that have more descriptive detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The EDGAR System</head><p>The Explanatory Drama Generation And Recall (EDGAR) system constructs a story backwards from a given sentence describing the end of the story. The system contains three major components. The first component is a question generator. Given a story context (the sequence of text describing the earliest event in the ending context), a set of questions about the event is generated. Second, a question answering component attempts to generate text describing one or more events that answer each question. A number of candidate answers are generated for each question. Finally, the answers are iteratively prepended to the context and a ranker chooses the best sequence. The best sequence is added to the story and the process iterates. See the pipeline in Figure <ref type="figure">1</ref>.</p></div>
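<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three-component loop described above can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the function name generate_backward, its parameters, and the toy stand-ins for the question generator, the answer generator, and the ranker are all hypothetical, and for simplicity the whole story so far is passed to the question generator rather than only its earliest event.</p>
<p><![CDATA[
```python
from typing import Callable, List


def generate_backward(
    ending: str,
    make_questions: Callable[[str], List[str]],
    answer: Callable[[str], List[str]],
    score: Callable[[str], float],
    iterations: int = 3,
) -> str:
    """Grow a story backward from `ending`; a lower score means a better story."""
    story = ending
    for _ in range(iterations):
        # Every answer to every question is a candidate preceding event.
        candidates = [
            f"{candidate} {story}"
            for question in make_questions(story)
            for candidate in answer(question)
        ]
        if not candidates:
            break
        # The ranker keeps the single best-scoring extended story.
        story = min(candidates, key=score)
    return story


# Toy stand-ins (hypothetical), used only to show the control flow.
def toy_questions(story: str) -> List[str]:
    return ["Why does Hansel open the door?"]


def toy_answers(question: str) -> List[str]:
    return ["He wants to go home.", "Because."]


# Here the "ranker" simply prefers the shortest text.
result = generate_backward(
    "Hansel opens the door.", toy_questions, toy_answers, score=len, iterations=2
)
print(result)
```
]]></p>
<p>Because each candidate is prepended to the story, the given ending is always preserved as the final sentences of the result, whichever candidates the ranker selects.</p></div>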
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Question Generation</head><p>We use Para-COMET <ref type="bibr">[Gabriel et al., 2020]</ref>  </p><p>Figure <ref type="figure">1</ref>: EDGAR generates stories backward. Given the end of the story where S0 is the earliest event sequence and S is the remainder, Para-COMET generates a set of n inferences. Each inference is converted into a question and the ELI5 QA model generates k + 1 answers. The answers are concatenated to the beginning of the story and the ranker selects the best scoring story. This process is repeated.</p><p>actions in the sentences. These correspond to goal relations in reader comprehension <ref type="bibr">[Graesser et al., 1991]</ref>. xNeed explains what a character might have needed to perform any actions in the sentences. These provide precondition-like inferences, corresponding to causal relations in reader comprehension <ref type="bibr">[Trabasso and Van Den Broek, 1985]</ref>. We discard all other relation types.</p><p>Because Para-COMET works on multi-sentence sequences, we extract a rolling window of the last 5 xIntent and xNeed inferences. However, Para-COMET does not identify which character is associated with each xIntent and xNeed, which is problematic for stories with more than one character. To associate the xNeed and xIntent clauses with a character, we generate the following templates:</p><p>&#8226; "Who needs to xIntent"</p><p>&#8226; "Who needs xNeed"</p><p>filling in the details of the inferences. These filled templates are provided as input to RoBERTa <ref type="bibr">[Liu et al., 2019]</ref>, a question-answering model. The outputs are the names of the characters most likely to have had these needs and intents.</p><p>Finally, we use a second set of templates to assemble the final set of questions:</p><p>&#8226; "Why does character do xIntent?"</p><p>&#8226; "What does character do to need xNeed?"</p><p>This process generates a total of 8 questions.</p></div>
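<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two template stages above amount to simple string filling, sketched below. The template wording is taken from the text; the dictionary and function names, and the example inference and character name, are hypothetical. The character name would in practice come from the question-answering model, which is not called here.</p>
<p><![CDATA[
```python
# Stage 1: templates used to ask which character an inference belongs to.
WHO_TEMPLATES = {
    "xIntent": "Who needs to {inference}",
    "xNeed": "Who needs {inference}",
}

# Stage 2: templates used to build the final question about that character.
FINAL_TEMPLATES = {
    "xIntent": "Why does {character} do {inference}?",
    "xNeed": "What does {character} do to need {inference}?",
}


def who_query(relation: str, inference: str) -> str:
    """Fill the character-identification template for one inference."""
    return WHO_TEMPLATES[relation].format(inference=inference)


def final_question(relation: str, character: str, inference: str) -> str:
    """Fill the final question template once the character is known."""
    return FINAL_TEMPLATES[relation].format(character=character, inference=inference)


# Hypothetical inference and character, for illustration only.
print(who_query("xNeed", "a key"))
print(final_question("xNeed", "Hansel", "a key"))
```
]]></p></div>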
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Question Answering</head><p>Once we have a set of questions, EDGAR generates candidate answers, such that each candidate can be added to the beginning of the story context so far. To generate sentences describing the preceding event that answers the questions generated, we feed the questions into the ELI5 QA model <ref type="bibr">[Fan et al., 2019]</ref>. The ELI5 QA model is a long-form question-answering model trained on the Explain Like I'm Five Reddit corpus (<ref type="url">https://www.reddit.com/r/explainlikeimfive/</ref>), in which people give long yet easily comprehensible answers to open-ended questions, as one might give to a five-year-old. ELI5 QA requires a reference document from which to abstract answers. The reference document is the source material (in this case a story) that ELI5 QA uses to generate an answer. Because EDGAR is an unsupervised technique designed to generate novel stories, there is no one reference document that should be used; using a single reference document would run the risk of accidentally recreating a human-written story. For every iteration we randomly select a reference document from the Flash Fiction Online repository. <ref type="foot">2</ref> The question templates above were constructed to induce relatively short answers from ELI5, which has a tendency to generate very long explanations.</p><p>We use beam search to generate 15 candidate answers for each question. As another measure to prevent ELI5 from providing overly verbose explanations, we have accumulated a list of over 700 unique banned phrases, which occur when ELI5 commentators point out "facts" or liken a character's action to mental disability. This blocked-phrase list was accumulated iteratively, by rerunning the model repeatedly and adding any toxic phrases to the list. 
The result is n &#215; k story continuations, where n is the number of questions and k is the number of beams per question on ELI5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Ranking</head><p>Once EDGAR has generated a set of candidates, the final step in the process is to select the best candidate for prepending to the context (the end of the story). We prepend each answer to the context and rate each resulting text sequence using <ref type="bibr">GPT-2 [Radford et al., 2019]</ref> to assess the probability of the sequence. GPT-2 was fine-tuned on the science fiction summary corpus <ref type="bibr">[Ammanabrolu et al., 2020b]</ref>, which consists of 2,276 high-quality plot summaries from science fiction TV and movie wikis. We fine-tune on the science fiction summary corpus because wiki plots do not include descriptive details or dialogue; our ranker thus prefers more plot-like narrative content. Candidates are ranked by perplexity under the GPT-2 model. The normalized perplexity distribution over the beams output by ELI5 corresponds to one minus the probability that a body of text belongs to the distribution of science fiction summaries.</p><p>Ranking is an important step because of the numerous processes involved; as a consequence, the ranking in the ELI5 beam distribution does not necessarily correlate with the final ranking, which roughly measures fluency. The best scoring candidate is added to the overall story. The process repeats with the new, longer story, attempting to determine what happened just before the new context.</p></div>
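<div xmlns="http://www.tei-c.org/ns/1.0"><p>Ranking by perplexity can be illustrated with a small sketch. It assumes that per-token natural-log probabilities for each candidate sequence have already been obtained from a language model; no model is called here, and the function names and the numbers in the example are hypothetical. Perplexity is computed as the exponential of the negative mean token log-probability, so a lower value means the model finds the text more likely.</p>
<p><![CDATA[
```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_candidates(candidates: Dict[str, List[float]]) -> List[str]:
    """Order candidate texts from lowest (best) to highest perplexity."""
    return sorted(candidates, key=lambda text: perplexity(candidates[text]))


# Hypothetical log-probabilities standing in for language model output.
example = {
    "candidate A": [math.log(0.5), math.log(0.5)],
    "candidate B": [math.log(0.25), math.log(0.25)],
}
ranked = rank_candidates(example)
print(ranked)
```
]]></p>
<p>Averaging over tokens keeps longer candidates from being penalized purely for their length, which matters when answers of different lengths are prepended to the same context.</p></div>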
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Objective Evaluation</head><p>We hypothesize that EDGAR, by virtue of question-answering, can generate more coherent stories than a pure language modeling technique. We define coherence as any perceivable relationship between events in a story. Research on reading comprehension <ref type="bibr">[Trabasso and Van Den Broek, 1985;</ref><ref type="bibr">Graesser et al., 1991]</ref> suggests that causal and goal relationships are particularly important.</p><p>Common automated evaluation metrics for story generation such as perplexity and BLEU are insufficient as they only measure whether a generator can recreate the ground truth corpus. A story may deviate from the ground truth and be considered a good story; indeed, this is a desirable property of an automated story generator. Furthermore, systems such as ours may be unsupervised and have many components that intentionally push a language model away from any one corpus, thus making perplexity less meaningful. For these reasons, story generation research often relies on human participant studies with subjective questions.</p><p>We assert human participant studies are the best way to assess the coherence of generated stories. Question-answering protocols, wherein questions are asked about a story, have been proposed as a means to make human-participant evaluations more objective <ref type="bibr">[Riedl and Young, 2010;</ref><ref type="bibr">Cardona-Rivera et al., 2016]</ref>. We conduct a human-participant evaluation using a new metric based on question-answering protocols, Entropy Index, which is an objective measure of story coherence based on human question-answering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Baselines</head><p>The BART <ref type="bibr">[Lewis et al., 2019]</ref> neural language model was used as a baseline, but fine-tuned to generate events backward to conform to EDGAR and guarantee the presence of a given end event. The dataset used to fine-tune consisted of 2276 narratives from a science fiction summary corpus <ref type="bibr">[Ammanabrolu et al., 2020b]</ref>. The narratives are preprocessed to create our dataset. From every narrative, 2 + 2k sequential sentences are obtained, where k is a random integer less than 5. The 2 + 2k sentences are split into 2 sentences and 2k sentences, creating the source and target of the dataset respectively. Within each 2 + 2k sentence chunk, the 2 sentences always precede the 2k sentences, establishing a relationship between sequential sentences. We preprocess the data into this format because, in most narrative summaries within our dataset, the sentences preceding any given sentence give some notion of causality. BART utilizes seq2seq as its translation architecture. As a consequence of the input data format, our fine-tuned Backward-BART, which we refer to as bBART, can generate narratives backwards by assessing the causality between sequential sentences.</p><p>Human-written stories from the ROCStories corpus <ref type="bibr">[Mostafazadeh et al., 2016]</ref> were also included in our evaluation as a point of comparison. These stories have a definitive causality between sequential sentences.</p></div>
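<div xmlns="http://www.tei-c.org/ns/1.0"><p>The chunking step described above can be sketched as follows. This is an illustrative reading of the text, not the authors' preprocessing code. Two details are assumptions: k is drawn uniformly from 1 to 4 (the text says only that k is a random integer less than 5, and k = 0 would leave the second part empty), and the chunk start position is chosen uniformly at random. The function name make_pair is hypothetical, and the sketch returns the two parts in narrative order without deciding which one is fed to the model as input.</p>
<p><![CDATA[
```python
import random
from typing import List, Optional, Tuple


def make_pair(
    sentences: List[str], rng: random.Random
) -> Optional[Tuple[List[str], List[str]]]:
    """Cut a chunk of 2 + 2k consecutive sentences and split it into (2, 2k)."""
    k = rng.randint(1, 4)  # assumption: 1 <= k <= 4
    length = 2 + 2 * k
    if len(sentences) < length:
        return None  # narrative too short for this k
    start = rng.randrange(len(sentences) - length + 1)
    chunk = sentences[start:start + length]
    # The 2-sentence part always precedes the 2k-sentence part.
    return chunk[:2], chunk[2:]


# A hypothetical 12-sentence narrative, long enough for any allowed k.
narrative = [f"Sentence {i}." for i in range(12)]
first_two, following = make_pair(narrative, random.Random(0))
print(len(first_two), len(following))
```
]]></p></div>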
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Method</head><p>To evaluate the objective coherence of stories, we turn to cognitive psychology. Cognitive psychology research suggests that recall is strongly correlated with narrative causal coherence <ref type="bibr">[Trabasso and Van Den Broek, 1985]</ref>. The cognitive load of inferring entailments about a story is strongly correlated with how well the story conveys information about its fabula<ref type="foot">3</ref> <ref type="bibr">[Carney, 2019]</ref>. We devise a new evaluation methodology wherein we ask participants to read stories and then answer true/false questions about how the events of the story relate to each other. We measure the amount of agreement between readers' answers in terms of entropy. If the story is coherent, readers will come to the same conclusions about the truth or falsity of the questions, and entropy will be low. If the story is incoherent, readers, forced to choose between true and false, will choose more randomly, resulting in higher entropy. We do not require a ground truth "correct" answer to each question in order to compute the entropy; this is a desirable property of our methodology given that (1) there are no algorithmically produced ground truth answers to the true/false questions and (2) obtaining a ground truth answer from humans can be noisy. Our index method is inspired by the evaluation used in <ref type="bibr">Li et al. [2012]</ref>, where human participants were asked to choose event orderings and participant agreement was assessed as entropy.</p><p>We generated 11 stories using EDGAR, 11 stories using backward-BART, and randomly selected 11 stories from the ROCStories corpus. Stories were generated by running the respective systems for 3 iterations. Stories ranged from 5 sentences to 20 sentences in length. See Table <ref type="table">1</ref> for examples from EDGAR and bBART. The Appendix gives the entire set of stories used in the evaluation. 
For the 33 stories, we produced 7 true/false questions for each story using the technique described in Section 4.3. To avoid bias, a non-computer-science graduate student not affiliated with the research project was paid to write the questions using the templates given in that section.</p><p>We recruited 180 human-subject participants from Mechanical Turk. Participants were recruited from countries where English is the national language. Each participant was asked to read 3 stories and answer the 7 true/false questions after each story. Participants were paid $7 for a 15 minute task. The first story is a "screener" story, an uncommon fable that is easy to understand. If a participant did not answer the questions as expected, we eliminated the participant from the pool. Participants were also eliminated from the pool if they resorted to marking all questions true or false or marked questions in some otherwise visually obvious repeating pattern; we eliminated 26 participants.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Computing the Entropy Index</head><p>We define the entropy index of each story as follows. For a given story generation system, we randomly selected 11 generated stories. For each story, we then produced 7 entailment questions. Entailment questions take the form of implications. By asking the reader to answer true or false we are asking the reader to prove or disprove the statement within the realm of what has been presented about the story world.</p><p>Table <ref type="table">1</ref>: Stories generated by EDGAR and stories generated by bBART.</p><p>He needs to find a way out of the house. If he wants to go to the kitchen, he must first find a way to escape from the house. This means that he must either run away or fight his way past the lock. He goes to the bathroom. The only way to get to the kitchen is through a locked door, which Hansel has never been able to do before. However, when he opens the door, it doesn't take him very long at all. Hansel's hand still trembles as he pushes open the twice-cooked door. The last time he saw the house he was glancing back over his shoulder as he and his sister fled into the trees.</p><p>A young boy named Hansel has been walking through town recently looking at some strange things that are going on outside their walls. In fact, they were just playing cards before anyone noticed them. A young boy called Hansel has Been walking through town lately looking at some odd things that are gone missing outside their walls -maybe even someone noticing them. In reality, they were probably playing cards before everyone noticed them. He wants to learn more about it though, but can always tell for sure. He wants to know more about it, but can never tell for sure. After all, this is a city of humans who have no interest in living here. He wants to know what happened to him, but can only tell for sure if this is a City of Humans who have little interest in living there. Hansel's hand still feels like an old man with no interest whatsoever. His mind wanders frantically around trying to figure out how to live here. Hansel's hand still trembles as he pushes open the twice-cooked door. The last time he saw the house he was glancing back over his shoulder as he and his sister fled into the trees.</p><p>If it does not have a pulse then it is dead. The heart rate of an animal tells them if it is alive or dead. There is no such thing as absolute death. He is able to tell the difference between something being alive and something being dead, so when he looks at the house, he feels like he's seeing things that aren't there. This makes him feel uncomfortable because he doesn't want to be in that situation. It's similar to how people can see ghosts or monsters from inside their head but they don't know what those things are. Hansel's hand still trembles as he pushes open the twice-cooked door. The last time he saw the house he was glancing back over his shoulder as he and his sister fled into the trees.</p><p>In order to ensure our questions were not biased, we provide annotators the following templates, two of which are given as examples here:</p><p>&#8226; E_i depends on E_j</p><p>&#8226; E_i could be removed and the story would still make sense.</p><p>Here i &lt; j and E refers to an event within the story. The full set of templates can be found in the appendix. The questions themselves were manually written to ensure grammatical correctness and readability.</p><p>The answers to the entailment questions give us a measure of entropy. The degree to which participants disagree indicates how ambiguous their model of the story world is, such that they must rely heavily on external bias.</p><p>Consider that we have some story, S, composed of an event chain E = {E_i}_n. An event chain is a sequence of events discussed in a story, one path in a fabula. Generate two events: one that could be inserted into E while preserving coherence, and its negation. 
We'll refer to these events as A and B. Refer to their insertions as E_A and E_B. Assume that we had some function f(&#183;) that could take either E_A or E_B and rank all of the explanations for A and B respectively by mental load induced on the reader. Then, if E_B is coherent, consider what mental leaps are required by the reader for justification. Let D(A) and D(B) refer to these normalized distributions respectively. Measure the following:</p><p>Where KL is Kullback-Leibler divergence and U is a uniform distribution. Inductively, if A and B are in direct contradiction of each other, we can collapse the above statement to</p><p>In this case, since U is of dimension two, simplify the above to entropy. We can conclude that measuring the coherence of such an insertion is equivalent to measuring the entropy over the answers to a similarly constructed T/F statement about a causal relationship within a story. Over a large number of questions and stories per model, the above serves as a sound proxy for coherence. Consider a coherent story and a set of T/F questions concerning this story. It is often easier to disprove a statement about a coherent story than it is to prove a statement about an incoherent story <ref type="bibr">[O'Brien and Albrecht, 1992;</ref><ref type="bibr">Albrecht and O'Brien, 1993]</ref>. By utilizing the format of T/F questions, the above will tend to converge to zero on a coherent story as there will always be one option that is disprovable. To get a large enough sample, we used 77 questions per model over 11 stories.</p></div>
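<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a concrete illustration, the entropy over participants' true/false answers to a single question can be computed as sketched below. Two choices are assumptions rather than details given in the text: entropy is measured in bits (base-2 logarithm), so it ranges from 0 for full agreement to 1 for an even split, and per-question entropies are aggregated by their median. The function names and the example responses are hypothetical. For a two-outcome distribution D and a uniform distribution U, the standard identity KL(D || U) = log 2 - H(D) links the divergence from uniform to the entropy H(D).</p>
<p><![CDATA[
```python
import math
from statistics import median
from typing import List


def binary_entropy(answers: List[bool]) -> float:
    """Entropy in bits of a set of true/false answers to one question."""
    if not answers:
        raise ValueError("at least one answer is required")
    p_true = sum(answers) / len(answers)
    if p_true in (0.0, 1.0):
        return 0.0  # unanimous answers carry no uncertainty
    p_false = 1.0 - p_true
    return -(p_true * math.log2(p_true) + p_false * math.log2(p_false))


def median_entropy(per_question_answers: List[List[bool]]) -> float:
    """Median of the per-question entropies (aggregation is an assumption)."""
    return median(binary_entropy(answers) for answers in per_question_answers)


# Hypothetical responses from four readers to three questions.
responses = [
    [True, True, True, True],     # full agreement
    [True, False, True, False],   # even split
    [True, True, True, False],    # mostly agree
]
print(median_entropy(responses))
```
]]></p>
<p>No ground-truth answer is needed: only the split between "true" and "false" responses enters the computation, which matches the property of the methodology noted in Section 4.2.</p></div>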
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Results</head><p>The evaluation results are plotted in Figure <ref type="figure">2</ref>. The evaluation shows that EDGAR scores a median of 0.427 on the entropy index, compared to bBART's median of 0.508. Human-written stories from the ROCStories corpus scored a median entropy of 0.26. From these results we can draw a number of conclusions. First, relative to the median entropy of human-authored stories, the median entropy of bBART is over 95% higher and that of EDGAR is over 63% higher. This implies that human-authored stories are much more coherent than computer-generated stories according to our Entropy Index metric. This is the expected result and shows that our Entropy Index metric is operating as expected. The human story entropy index is a lower bound. Importantly, the median entropy of EDGAR is 15.9% lower than that of the bBART baseline, indicating that our technique has improved the coherence of generated stories when generating backwards in order to ensure a given ending.</p></div>
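<div xmlns="http://www.tei-c.org/ns/1.0"><p>The percentages quoted above follow directly from the three reported medians. The short check below reproduces them; the helper names are hypothetical, and the only inputs are the median values stated in the text.</p>
<p><![CDATA[
```python
EDGAR_MEDIAN = 0.427
BBART_MEDIAN = 0.508
HUMAN_MEDIAN = 0.26


def relative_reduction(baseline: float, value: float) -> float:
    """How much lower `value` is than `baseline`, as a fraction of `baseline`."""
    return (baseline - value) / baseline


def relative_increase(reference: float, value: float) -> float:
    """How much higher `value` is than `reference`, as a fraction of `reference`."""
    return (value - reference) / reference


edgar_vs_bbart = relative_reduction(BBART_MEDIAN, EDGAR_MEDIAN)
bbart_vs_human = relative_increase(HUMAN_MEDIAN, BBART_MEDIAN)
edgar_vs_human = relative_increase(HUMAN_MEDIAN, EDGAR_MEDIAN)

print(f"EDGAR below bBART: {edgar_vs_bbart:.1%}")
print(f"bBART above human: {bbart_vs_human:.1%}")
print(f"EDGAR above human: {edgar_vs_human:.1%}")
```
]]></p></div>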
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Subjective Evaluation</head><p>We conducted a second human-participant evaluation in which participants read stories and answered subjective questions about the coherence of the stories. We would expect the results of this experiment to concur with the results of the previous experiment. <ref type="bibr">Purdy et al. [2018]</ref> propose a number of questions to be used to evaluate story generation systems. They have been used in a number of story generation system evaluations (cf. <ref type="bibr">[Tambwekar et al., 2018;</ref><ref type="bibr">Ammanabrolu et al., 2020b;</ref><ref type="bibr">Ammanabrolu et al., 2020a]</ref>). We use a subset of the questions and adapt them to rank-order choice between stories from two systems:</p><p>&#8226; Which story's events occur in a more PLAUSIBLE ORDER?</p><p>&#8226; Which story's sentences MAKE MORE SENSE given sentences before and after them?</p><p>&#8226; Which story better follows a SINGLE PLOT?</p><p>&#8226; Which story is of HIGHER QUALITY?</p><p>&#8226; Which story is more ENJOYABLE?</p><p>The first three questions ask about different aspects of perceived story coherence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Method</head><p>We used the same stories from the first evaluation and the same baselines. Participants read two stories from two different sources back-to-back. Then for that pair of stories, the participant was asked to answer the subjective questions above, picking between the two stories. We recruited 48 human-subject participants from Mechanical Turk. Participants were recruited from countries where English is the national language. Each participant was asked to read 4 stories, presented in pairs of two, and answer the 5 questions after each pair. Participants were paid $5 for a 10 minute task. We screened participants by asking them similar questions about human-written stories but inserted the answers to the questions in the directions, to determine their attentiveness. Participants who were considered inattentive were disqualified.</p><p>Table <ref type="table">2</ref>: Total counts, per question in the subjective evaluation, of the times participants selected a story generated by each system. P-tests were used to check whether the chance of EDGAR winning a pairing was greater than 50/50.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results</head><p>The results are summarized in Table <ref type="table">2</ref>, which shows the number of times, per question, a participant selected the story from each system. When forced to pick between stories generated by EDGAR and stories generated by Backward-BART, participants chose stories generated by EDGAR twice as often for every question asked. A one-tailed binomial p-test for the results of each question determines that EDGAR was significantly preferred over the baseline for every dimension at p &lt;= 0.013 except the "Makes sense" dimension, which reached p = 0.052. These results suggest that EDGAR generates more coherent and overall better quality stories than Backward-BART. These results are consistent with the Entropy Index results, supporting the claim that the metric also measures coherence.</p></div>
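<div xmlns="http://www.tei-c.org/ns/1.0"><p>A one-tailed binomial test of the kind described above asks how likely it would be to see at least the observed number of wins if each pairing were a fair coin flip. The sketch below computes that exact tail probability with the Python standard library. The function name is hypothetical, and the counts in the example are made up for illustration; they are not the counts from Table 2.</p>
<p><![CDATA[
```python
from math import comb


def binomial_p_one_tailed(wins: int, trials: int, p_null: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1.0 - p_null) ** (trials - k)
        for k in range(wins, trials + 1)
    )


# Made-up example: one system is chosen in 8 of 10 pairings.
p_value = binomial_p_one_tailed(8, 10)
print(round(p_value, 4))
```
]]></p>
<p>With these made-up counts the tail probability is 56/1024, or roughly 0.055, which shows how a two-to-one preference can still sit near the conventional 0.05 threshold when the number of pairings is small.</p></div>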
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>We propose a new approach to neural story generation that treats story generation as a question-answering problem: given an ending, the story must answer the question of how the ending comes about. Our proposed EDGAR system generates backward from the ending event to ensure the presence of the desired ending. It decomposes the generation process into distinct processes for using human commonsense to produce questions and then to answer them. These processes are grounded in reader narrative comprehension. We show that stories generated by EDGAR are more coherent than stories generated by a more conventional language modeling approach, based on subjective and objective measures of perceived coherence. The EDGAR technique is a significant departure from techniques that sample from a language model, and it opens up new avenues for improving neural story generation in ways that are inspired by the comprehension needs of the human reader.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>https://www.flashfictiononline.com</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>A story's fabula denotes the chronological sequence of events in a narrative.</p></note>
		</body>
		</text>
</TEI>
