<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Medical Question Understanding and Answering with Knowledge Grounding and Semantic Self-Supervision</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>October 2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10400307</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 29th International Conference on Computational Linguistics</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Khalil Mrini</author><author>Harpreet Singh</author><author>Franck Dernoncourt</author><author>Seunghyun Yoon</author><author>Trung Bui</author><author>Walter Chang</author><author>Emilia Farcas</author><author>Ndapa Nakashole</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Current medical question answering systems have difficulty processing long, detailed and informally worded questions submitted by patients, called Consumer Health Questions (CHQs). To address this issue, we introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision. Our system is a pipeline that first summarizes a long, medical, user-written question, using a supervised summarization loss. Then, our system performs a two-step retrieval to return answers. The system first matches the summarized user question with an FAQ from a trusted medical knowledge base, and then retrieves a fixed number of relevant sentences from the corresponding answer document. In the absence of labels for question matching or answer relevance, we design 3 novel, self-supervised and semantically-guided losses. We evaluate our model against two strong retrieval-based question answering baselines. Evaluators ask their own questions and rate the answers retrieved by our baselines and our own system according to their relevance. They find that our system retrieves more relevant answers, while achieving speeds 20 times faster. Our self-supervised losses also help the summarizer achieve higher scores in ROUGE, as well as in human evaluation metrics. We release our code to encourage further research.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Motivation. Users of medical question answering systems often write long questions, called Consumer Health Questions (CHQs). Several aspects of CHQs hinder the capacity of current question answering (QA) systems to process them: long medical questions may contain peripheral information, like patient history <ref type="bibr">(Roberts and Demner-Fushman, 2016)</ref>, that is not necessary to retrieve relevant answers. Consumer health questions may also be worded informally. As part of their participation in this task, <ref type="bibr">Yang et al. (2017)</ref> find that online search engine queries introduce noise in performance, and that even collected and curated medical knowledge available offline can fare better. Contributions. To enable the use of a curated medical knowledge base for answering long user questions, we introduce a novel, knowledge-grounded and semantically self-supervised system for Consumer Health Question Understanding and Answering (CHQUA). We tackle a challenging aspect of CHQUA: providing answers when no relevance labels are available. Our contributions are as follows:</p><p>(1) We propose an end-to-end pipeline, as shown in Figure <ref type="figure">1</ref>, that takes as input a consumer health question, and trains a summarizer model to generate a short, formally worded question. We optimize a summarization training objective using the medical question summarization datasets.</p><p>(2) The medical knowledge base we use is separate from the question summarization datasets, and therefore we have no labels to indicate which knowledge base question matches a given consumer health question. We design a novel, semantically-guided self-supervised loss function to ground the generated summary with knowledge base FAQs, using semantic similarity as a proxy for question matching. 
The Matching FAQ similarity loss helps the encoder pick the most semantically similar knowledge base question.</p><p>(3) The large medical knowledge base we use has no answer sentence relevance labels. We adapt to this scenario by designing two complementary self-supervised losses on the same encoder, and by considering semantic similarity as a proxy for relevance. The Answer Similarity loss pushes the model to distinguish between relevant and irrelevant answer sentences, whereas the Answer Selection loss works in a complementary way to push the model to select a given number of sentences.</p><p>Finally, we conduct an evaluation to compare the relevance of our system's answers with those of two strong retrieval-based question answering baselines. We ask evaluators to write their own questions, and then perform a blind evaluation of the answers retrieved by each system. Seven evaluators find that our system retrieves more relevant answers compared to the two baselines, while achieving significantly faster processing speeds. We also find that the self-supervised losses help achieve better scores in ROUGE and human evaluation metrics. However, we find that the task remains challenging, with room for improvement. We release our code, model, and matched datasets to encourage further research in consumer health question understanding and answering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Consumer Health Question Answering. <ref type="bibr">Ben Abacha et al. (2017)</ref> introduce the Medical QA shared task at TREC 2017 LiveQA, where the goal is to develop a consumer health question answering system. The training data is comprised of question-answer pairs. The questions are informally worded CHQs received by the U.S. National Library of Medicine (NLM). The answers are formally worded, and either come from websites of the U.S. National Institutes of Health or are manually collected by librarians. The evaluation scores are given by humans, using a test set of CHQs and reference answers.</p><p>Many participating teams adopt a question matching approach, and train their models on question similarity datasets like the Quora question pair dataset <ref type="bibr">(Iyer et al., 2017)</ref>, or other datasets collected from community question answering websites. In the MEDIQA 2019 shared task, Ben Abacha and Demner-Fushman (2019a) introduce a differently defined consumer health question answering task. Here, the goal is to rank a given list of answers according to their relevance with regard to a CHQ. <ref type="bibr">He et al. (2020)</ref> introduce a new disease knowledge infusion training procedure for BERT <ref type="bibr">(Devlin et al., 2019)</ref> that scores well in this task. Medical Question Answering. 
Medical QA approaches include translating questions to SPARQL queries <ref type="bibr">(Ben Abacha and Zweigenbaum, 2012)</ref>, semantic similarity between questions and candidate answers <ref type="bibr">(Hao et al., 2019)</ref>, knowledge representations <ref type="bibr">(Terol et al., 2007;</ref><ref type="bibr">Goodwin and Harabagiu, 2017)</ref>, ranking candidate answers <ref type="bibr">(Ben Abacha et al., 2017</ref><ref type="bibr">, 2019)</ref>, summarization of questions and/or answers <ref type="bibr">(Ben Abacha et al., 2021;</ref><ref type="bibr">Mrini et al., 2021d,b,c)</ref>, and medical entity linking <ref type="bibr">(Basaldella et al., 2020;</ref><ref type="bibr">Mrini et al., 2022)</ref>.</p><p>There is a variety of definitions for the task of medical QA and related sub-tasks in the literature. <ref type="bibr">Hao et al. (2019)</ref> define medical QA as the task of finding the correct answer from a set of candidates and a body of evidence documents. They propose to work on two datasets: the National Medical Licensing Examination of China (NMLEC) <ref type="bibr">(Shen et al., 2020)</ref>, and Clinical Diagnosis based on Electronic Medical Records (CD-EMR), where the goal is to predict the correct diagnosis based on patient history. <ref type="bibr">Sharma et al. (2018)</ref> propose to tackle three kinds of medical questions found in the BioASQ challenge <ref type="bibr">(Balikas et al., 2015)</ref>: factoid questions where answers are single entities, list-type questions where answers are a set of entities, and yes/no questions.</p><p>Retrieval-based Question Answering. Recent methods for retrieval-based QA systems use contextual text embeddings to evaluate a candidate answer's relevance to a given question. <ref type="bibr">Tay et al. (2018)</ref> propose to use Multi-Cast Attention Networks (MCAN), a new attention mechanism, to model question-answer pairs. <ref type="bibr">Mrini et al. 
(2021e)</ref> introduce a recursive, tree-structured model that represents sentences according to their syntactic tree. Their results show that tree structure sets a new state of the art in conventional, formally worded QA benchmarks like TrecQA and WikiQA <ref type="bibr">(Yang et al., 2015)</ref>, but does not fare well in informally worded, user-written datasets. <ref type="bibr">Karpukhin et al. (2020)</ref> introduce Dense Passage Retrieval (DPR): a dual-encoder based on BERT <ref type="bibr">(Devlin et al., 2019)</ref>, that predicts relevance scores of passages with regard to a question. DPR encoders are trained on the relevance of passages from datasets containing such labels, using a supervised negative log-likelihood loss based on the semantic similarity of questions and relevant passages. <ref type="bibr">Mao et al. (2021)</ref> modify the query part of retrieval-based QA: they propose to use language models to generate context for queries. They then feed the extended queries to retrieval systems, such as DPR or BM25.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Definition</head><p>We define knowledge-grounded Consumer Health Question Understanding and Answering (CHQUA) as the problem of retrieving a fixed number of answer sentences from a medical knowledge base that are the most relevant given a long and informal user question, called a Consumer Health Question (CHQ). There are three steps in CHQUA: question summarization, matching the summarized user question with a relevant FAQ from the knowledge base, and retrieval of the relevant answer sentences from the corresponding answer document.</p><p>Knowledge-grounded CHQUA is comprised of three elements used for training. First, the CHQ is the input of the task. Second, the Reference FAQ (Frequently Asked Question) is the gold, expert-written summary corresponding to the CHQ. Whereas the CHQ is a long and informally worded question, the reference FAQ is the corresponding short, one-sentence, formally worded question. At inference time, the reference FAQ is not available, and we therefore use a summary generated by the model. Third, the medical knowledge base is comprised of FAQs, where each FAQ has a corresponding answer document with at least one sentence. FAQs in the knowledge base are also short, one-sentence, formally worded questions.</p><p>The goal of knowledge-grounded CHQUA is to find a set R of n relevant answer sentences, from a document comprised of answer sentences A_i, such that A_i corresponds to question q_i from the knowledge base. We call q_i the retrieved or matching FAQ, such that q_i is the most similar question to the user's summarized question q_u:</p><p>q_i = arg max_{q ∈ Q} f(q_u, q)   (1)</p><p>where Q is the set of questions (FAQs) in the knowledge base, and f is a given similarity scoring function. q_u is the reference FAQ (during training) or a generated summary (during inference). 
We find the set R of n relevant answer sentences such that it maximizes the relevance score with the user's summarized question q_u:</p><p>R = arg max_{R' ⊆ A_i, |R'| = n} Σ_{a ∈ R'} g(q_u, a)   (2)</p><p>where a is an answer sentence, and g is a given relevance scoring function.</p></div>
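The two-step objective above can be made concrete with a minimal sketch. This is an illustration, not the paper's implementation: the similarity function f and relevance function g are left abstract in the text, so both are stood in for by a toy word-overlap score, and the knowledge base is a hypothetical two-entry dictionary.

```python
def token_overlap(a: str, b: str) -> float:
    """Toy stand-in for the abstract scoring functions f and g:
    Jaccard overlap of lower-cased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def match_faq(q_u: str, knowledge_base: dict) -> str:
    """Step 1: q_i = argmax over FAQs q of f(q_u, q)."""
    return max(knowledge_base, key=lambda q: token_overlap(q_u, q))

def retrieve_answers(q_u: str, knowledge_base: dict, n: int = 2) -> list:
    """Step 2: the n sentences of the matching FAQ's answer document
    that maximize g(q_u, a), returned in document order."""
    sentences = knowledge_base[match_faq(q_u, knowledge_base)]
    top = sorted(sentences, key=lambda a: token_overlap(q_u, a), reverse=True)[:n]
    return [a for a in sentences if a in top]

# Hypothetical two-FAQ knowledge base for illustration.
kb = {
    "What are the treatments for hairy cell leukemia?": [
        "Hairy cell leukemia is a rare blood cancer.",
        "Treatments include chemotherapy and targeted therapy.",
        "Watchful waiting may be recommended for some patients.",
    ],
    "What are the symptoms of anemia?": [
        "Symptoms include fatigue and pale skin.",
    ],
}
print(retrieve_answers("treatments for hairy cell leukemia", kb, n=2))
```

Note that the retrieved sentences are emitted in the order they appear in the answer document, not in score order, mirroring the output behavior described later in Section 4.3.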
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Our Pipeline</head><p>Our proposed pipeline for Consumer Health Question Understanding and Answering has three main components.</p><p>In the first step, our approach learns to understand the intent of user questions (CHQs) by summarizing them. We use an encoder-decoder-based summarization model for this step.</p><p>The second step is question matching, or the retrieval of the relevant FAQ from the knowledge base: we ground the generated summary to a medical knowledge base of FAQs and corresponding answer documents. As there are no question matching labels, we consider semantic similarity as a proxy for question matching, and we optimize a self-supervised similarity loss. The third step is the retrieval of the relevant answer sentences: our model learns to select the top-k most relevant answer sentences from the matching answer document. To achieve this task in the absence of answer relevance labels, we consider semantic similarity as a proxy for relevance, and we optimize two novel, semantically-guided, and self-supervised loss functions. The first pushes the model to discriminate between relevant and irrelevant sentences, and the other pushes the model to consider only a fixed number of sentences as relevant.</p><p>We show an overview of the model and learning objectives in Figure <ref type="figure">1</ref>. The entire pipeline is trained together, as the summarizer encoder is re-used to encode the questions and answer sentences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Consumer Health Question (Figure 2 example):</head><p>Asking about Hairy cell leukemia. I get report for my father from hospital it is saying that he have Hairy cell leukemia i am here to ask if this dissease dangerous and there is treatment for it Also if The one who have it will live for long or not? My father age is 55 We discover the dissease by blood test.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Generated Summary (encoder-decoder output):</head><p>What are the treatments for hairy cell leukemia and how long does it live?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reference FAQ (Summarization):</head><p>Where can I find information on hairy cell leukemia, including treatment and prognosis?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Question Understanding through Summarization</head><p>Our work aims to shift the burden of question understanding onto the question answering model. Instead of asking the user to shorten or reformulate their question, we train an encoder-decoder abstractive summarizer to shorten user questions. Figure <ref type="figure">2</ref> illustrates this part of the model.</p><p>At training time, we input a Consumer Health Question (CHQ) to the summarization model. The reference Frequently Asked Question (FAQ) is the corresponding shorter and formal question. Given a CHQ embedding x and the corresponding reference FAQ embedding y_ref, the summarization loss is defined as the following negative log-likelihood objective:</p><p>L_sum = −log p(y_ref | x; θ)   (3)</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Question Matching through Self-Supervised Knowledge Grounding</head><p>In the next step, we match the summarized user question with the most relevant FAQ from the medical knowledge base. We use semantic similarity as a proxy for question matching, in the absence of such labels. The knowledge-grounding process is comprised of two steps. First, we use TF-IDF-weighted bag-of-words and n-gram vectors to get the top k most relevant FAQs from the knowledge base. This first step acts as a fast filter to extract a small subset of candidate FAQs. Our retrieval approach follows the retrieval methods commonly used in question answering systems <ref type="bibr">(Chen et al., 2017;</ref><ref type="bibr">Dinan et al., 2018)</ref>. <ref type="bibr">Dinan et al. (2018)</ref> note that the retriever is a potentially learnable part of the model. In our case, using TF-IDF retrieval is computationally optimal and scalable given a large knowledge base with thousands of FAQs. 
We use a TF-IDF embedder fitted on all the FAQs of the knowledge base, as well as reference FAQs from the training set of the question summarization dataset.</p><p>The second step of knowledge-grounding is to rank the top k FAQs using semantic similarity. To get semantic embeddings of the generated summary and the corresponding top k most relevant FAQs from the knowledge base, we use the encoder of the summarization model. We take inspiration from the precision formula of BERTScore <ref type="bibr">(Zhang et al., 2019)</ref>, and compute the weighted semantic similarity score as follows:</p><p>Sim(q_u, q_i; θ) = (Σ_{w ∈ W_u} idf(w) · max_{w' ∈ W_i} CosSim(w, w')) / (Σ_{w ∈ W_u} idf(w))   (4)</p><p>where q_u is the reference FAQ (during training) or the generated summary (during inference), q_i is the i-th question from the top k most relevant FAQs, W_u and W_i are the corresponding sets of words, CosSim is the cosine similarity function, and idf(w) is the inverse document frequency of the word w.</p><p>The matching FAQ is the knowledge base FAQ with the highest similarity score with q_u, as shown in the example in Figure <ref type="figure">2</ref>. During training, the summarization model may produce low-quality or degenerate summaries. For this reason, at training time, we choose to use the reference FAQ instead to compute the semantic similarity scores and find the matching FAQ. At test time, we only use the generated summary.</p><p>Since we are using different datasets for the question summarization and for the knowledge base, we have to reconcile the questions from the knowledge base and the reference questions. We propose to force the model to learn a representation space that does not distinguish between the reference FAQ and the most similar knowledge base FAQ.</p><p>To accomplish this, we compute the matching FAQ similarity loss. Given the embedding of a summarization reference FAQ q_sum and the embedding of a matching FAQ q_mat, the matching FAQ similarity loss is defined as:</p><p>L_mat = 1 − ReLU(Sim(q_sum, q_mat; θ))   (5)</p></div>
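The idf-weighted, BERTScore-precision-style score described above can be sketched as follows. This is an illustration only: the word embeddings here are toy 2-dimensional vectors and the idf table is invented for the example, whereas the paper uses the summarizer encoder's contextual embeddings and corpus-level idf values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def weighted_similarity(words_u, words_i, embed, idf):
    """Sketch of the weighted similarity: each word of the summarized user
    question is aligned to its most cosine-similar word in the candidate
    FAQ, and the alignments are idf-weighted and normalized."""
    num = sum(idf[w] * max(cosine(embed[w], embed[w2]) for w2 in words_i)
              for w in words_u)
    den = sum(idf[w] for w in words_u)
    return num / den

# Toy embeddings and idf values, made up for the example.
embed = {"treatment": [1.0, 0.0], "therapy": [1.0, 0.0], "cause": [0.0, 1.0]}
idf = {"treatment": 2.0, "cause": 1.0}
score = weighted_similarity(["treatment", "cause"], ["therapy"], embed, idf)
# "treatment" aligns perfectly with "therapy", "cause" does not align at all,
# so the score is (2*1 + 1*0) / (2 + 1).
```

Each question word is greedily aligned to its closest candidate word, so rare (high-idf) words dominate the match, which is the point of the idf weighting.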
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Answer Retrieval through Self-Supervised Similarity and Selection Losses</head><p>After summarizing the user question and retrieving a relevant FAQ from the knowledge base, the next step is to retrieve relevant sentences from the corresponding answer document. In our setting, we need to retrieve a fixed number of sentences relevant to the user question. However, we have no labels for the answer sentences indicating relevance to the user question. We propose two complementary self-supervised learning objectives that use semantic similarity as a proxy for relevance scoring, and satisfy the constraint of selecting a fixed number of answer sentences.</p><p>We show an overview of our answer retrieval approach in Figure <ref type="figure">3</ref>. For simplicity, the figure shows a relatively short answer document with four sentences, from which the model chooses the two most relevant ones. In practice, answer documents contain close to ten sentences.</p><p>We compute semantic similarity scores between the generated summary (for inference) or the reference FAQ (for training), and each of the sentences of the retrieved answer document. We obtain the semantic embeddings of each sentence using the encoder of the summarization model. We then compute semantic similarity scores as shown in equation 4. Cosine similarity scores have values in the [-1, 1] range. For a pair of sentences, a cosine similarity value closer to -1 means that the corresponding sentence embeddings are negatively correlated, or that the sentences have opposite meanings. A value closer to 0 means that the embeddings are not correlated, and that there is no particular semantic relation between the sentences. A value closer to 1 means that the sentence embeddings are positively correlated, and the sentences are close semantically. We consider that a sentence is relevant when the values are closer to 1, and irrelevant otherwise. 
For this reason, we apply a ReLU activation on the cosine similarity scores before feeding them to the loss functions.</p><p>We propose two learning objectives to achieve the self-supervised selection of relevant answer sentences. The semantic similarity loss pushes the model to increase its confidence in the relevance of answer sentences, whereas the answer selection loss pushes the model to select only a fixed number of sentences. The intuition for sharing the encoder with the summarization model is that these two losses will enable the summarizer to absorb notions of relevance and semantic similarity.</p><p>Given the summarization reference FAQ q_sum and the i-th sentence of the retrieved answer document a_i, we compute the ReLU-activated semantic similarity score as follows:</p><p>S(q_sum, a_i; θ) = ReLU(Sim(q_sum, a_i; θ))   (6)</p><p>We then define the semantic similarity loss L_sim and the answer selection loss L_sel as follows:</p><p>L_sim = (1/|A|) Σ_{a_i ∈ A} S(q_sum, a_i; θ) · (1 − S(q_sum, a_i; θ))   (7)</p><p>L_sel = (1/|A|) |n − Σ_{a_i ∈ A} S(q_sum, a_i; θ)|   (8)</p><p>where A is the set of sentences in the retrieved answer document, and n is the fixed number of sentences to be retrieved. The semantic similarity loss L_sim pushes the semantic similarity values to be either 1 (relevant) or 0 (irrelevant). In combination with L_sim, the answer selection loss pushes the model to only select up to n sentences to have semantic similarity values close to 1. Our system then outputs the sentences with the highest semantic similarity values in the order in which they appear in the answer document. Therefore, the particular semantic similarity ranking of the relevant sentences does not matter; it only matters that relevant sentences have the n highest values.</p><p>Finally, the learning objective L is as follows:</p><p>L = L_sum + λ L_mat + γ (L_sim + L_sel)   (9)</p><p>where λ and γ are hyperparameters. We use only one weight for L_sim and L_sel as these two losses are complementary.</p></div>
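A small numerical sketch of how the two losses interact. The functional forms below are one plausible instantiation of the behavior described above (a similarity loss minimized when every ReLU-activated score sits at 0 or 1, and a selection loss minimized when the scores sum to the budget n); they are illustrative, not necessarily the exact forms used by the system.

```python
def relu(x):
    return max(0.0, x)

def answer_losses(sims, n):
    """sims: cosine similarities between the (reference) summarized question
    and each sentence of the retrieved answer document; n: number of
    sentences to select. Hypothetical loss forms: the similarity term is
    zero when every ReLU-activated score is 0 or 1, and the selection term
    is zero when the scores sum to exactly n."""
    S = [relu(s) for s in sims]
    l_sim = sum(s * (1.0 - s) for s in S) / len(S)
    l_sel = abs(sum(S) - n) / len(S)
    return l_sim, l_sel

# A confident, exact selection of n = 2 out of 4 sentences incurs zero loss;
# hesitant mid-range scores are penalized by both terms.
confident = answer_losses([1.0, -0.3, 1.0, 0.0], n=2)
unsure = answer_losses([0.6, 0.6, 0.6, 0.6], n=2)
```

Saturating the scores at {0, 1} with exactly n ones is the joint minimum, which matches the described selection behavior: only the n highest-scoring sentences matter, not their internal ranking.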
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments and Results</head><p>In this section, we evaluate our proposed pipeline for Consumer Health Question Understanding and Answering, and compare it against two strong baselines. Seven medical experts judge the performance of our system and the baselines by asking their own questions, and rating the relevance of the retrieved answers. Then, we analyze the results through the lens of summarization metrics, human evaluation, and computational speed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Datasets</head><p>We use one medical knowledge base, MedQuAD (Ben Abacha and Demner-Fushman, 2019b), and two medical question summarization datasets: MeQSum (Ben Abacha and Demner-Fushman, 2019a) and HealthCareMagic <ref type="bibr">(Zeng et al., 2020)</ref>. All datasets are in English. We show dataset statistics in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1">Dataset Details</head><p>MedQuAD is a large-scale Medical Question Answering Dataset. Ben Abacha and Demner-Fushman (2019b) collect trusted medical question-answer pairs by crawling them from 12 websites of the U.S. National Institutes of Health (NIH). Each web page contains information about a health-related topic, like a disease or a drug. The authors automatically collect the question-answer pairs by composing handcrafted patterns adapted to each website based on document structure and section titles. They manually evaluate 1,721 CHQs to come up with automatic wording patterns for each of 36 question types. Therefore, even though answers are curated and written by medical experts, questions are automatically formulated and may have some noise.</p><p>We collect the publicly available (i.e., not copyrighted) question-answer pairs from the MedQuAD dataset<ref type="foot">foot_0</ref>. We then use the NLTK sentence tokenizer <ref type="bibr">(Bird, 2006)</ref> to split answer documents into sentences. We get 16,423 questions and 157,592 answer sentences, making for an average of 9.6 answer sentences per question.</p><p>MeQSum (Ben Abacha and Demner-Fushman, 2019a) is a medical question summarization dataset released by the U.S. National Institutes of Health (NIH). It contains 1,000 consumer health questions summarized into FAQ-style single-sentence questions by medical experts.</p><p>HealthCareMagic is a medical dialogue dataset released as part of the MedDialog dataset <ref type="bibr">(Zeng et al., 2020)</ref><ref type="foot">3</ref>. It is crawled from HealthCareMagic.com, an online healthcare service platform. This dataset first includes a formally worded, one-sentence question describing the intent of the patient question, followed by two long utterances: a CHQ from the patient that includes a description of the problem and a question, and then an answer from the doctor. 
To form a medical question summarization dataset, we consider the single-sentence descriptions as summaries of the patient's CHQ. We collect 226,405 question pairs.</p></div>
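As a quick arithmetic check on the MedQuAD preprocessing statistics reported above:

```python
# 157,592 answer sentences over 16,423 MedQuAD questions
# reproduces the stated average of 9.6 answer sentences per question.
questions = 16_423
answer_sentences = 157_592
print(round(answer_sentences / questions, 1))  # 9.6
```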
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2">Knowledge-based Filtering of Datasets</head><p>We conduct experiments for each of the two question summarization datasets, and we use MedQuAD as the underlying knowledge base in all experiments. For this reason, we decide to filter each of the question summarization datasets to reconcile their differences with MedQuAD.</p><p>We first fit a TF-IDF embedding model, similar to the one used by <ref type="bibr">(Dinan et al., 2018)</ref>, on the reference FAQs of each question summarization dataset and the questions of MedQuAD. We then compute the dot products of the TF-IDF-weighted vectors for all possible pairs of summarization FAQs and MedQuAD questions. We assign a matching score m(q_sum) to each summarization reference FAQ:</p><p>m(q_sum) = max_{q ∈ Q} TFIDF(q_sum) · TFIDF(q)   (10)</p><p>We manually evaluate the matching scores for each summarization dataset to set a cutoff matching score for filtering. This way, we obtain question summarization datasets where reference FAQs have matches in the medical knowledge base. Finally, we perform a random, approximately 80/10/10 split for the train/dev/test sets. The dataset statistics are in the main paper.</p></div>
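The filtering step above can be sketched as follows. The TF-IDF weighting here is a toy raw-count times smoothed-idf variant, and the function names are ours; the actual embedder and the manually chosen cutoff are as described in the text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF (raw counts x smoothed idf), enough to rank matches."""
    df = Counter(w for d in docs for w in set(d.split()))
    n = len(docs)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in Counter(d.split()).items()} for d in docs]

def dot(u, v):
    """Dot product of two sparse (dict) vectors."""
    return sum(u[w] * v.get(w, 0.0) for w in u)

def matching_scores(summ_faqs, kb_questions):
    """m(q_sum): for each summarization reference FAQ, the maximum TF-IDF
    dot product over all knowledge base questions. Reference FAQs whose
    score falls below a manually chosen cutoff would be filtered out."""
    vecs = tfidf_vectors(summ_faqs + kb_questions)
    sv, kv = vecs[:len(summ_faqs)], vecs[len(summ_faqs):]
    return [max(dot(s, k) for k in kv) for s in sv]
```

Fitting the TF-IDF weights on the union of summarization FAQs and knowledge base questions, as the text describes, keeps the two vector spaces comparable so the dot products are meaningful.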
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Training Settings</head><p>We adopt the BART encoder-decoder model <ref type="bibr">(Lewis et al., 2020)</ref>, as it set the state of the art on abstractive summarization benchmarks. We train our model using the HuggingFace implementation <ref type="bibr">(Wolf et al., 2020)</ref>, with a learning rate of 2 × 10^-6. The question matching pool retrieved by TF-IDF is comprised of k = 32 knowledge base FAQs. Our answer selection loss L_sel is optimized to select up to n = 3 sentences. We use λ = 0.01 and γ = 0.01 as weights for the self-supervised losses. The BART encoder is used for embedding sentences for question matching and answer selection.</p><p>We train for 50 epochs for MeQSum, and 20 epochs for HealthCareMagic. Each training epoch takes about 10 minutes for MeQSum, and about 35 minutes for HealthCareMagic. Inference takes 1 minute for the MeQSum test set and 3 minutes for the HealthCareMagic test set. The best checkpoint is selected based on the lowest loss value L on the dev set.</p><p>We use BART Large pre-trained on the CNN/DailyMail dataset; each BART Large model contains 406 million parameters, as per the HuggingFace implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Baselines</head><p>We propose the following two baselines in retrieval-based question answering: Dense Passage Retrieval (DPR) <ref type="bibr">(Karpukhin et al., 2020)</ref>, and Generation-Augmented Retrieval (GAR) <ref type="bibr">(Mao et al., 2021)</ref>. We adapt these two baselines to our case, and adopt BART-based pre-trained encoders.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell>System</cell><cell>MeQSum</cell><cell>HealthCareMagic</cell><cell>Time/Query</cell></row><row><cell>DPR <ref type="bibr">(Karpukhin et al., 2020)</ref></cell><cell>1.42</cell><cell>1.73</cell><cell>47 seconds</cell></row><row><cell>GAR <ref type="bibr">(Mao et al., 2021)</ref></cell><cell>1.40</cell><cell>1.64</cell><cell>48 seconds</cell></row><row><cell>Ours</cell><cell>2.13</cell><cell>2.35</cell><cell>2 seconds</cell></row></table><p>Table <ref type="table">2</ref>: Evaluation of the relevance (out of 5) of answers retrieved by our proposed system and two strong baselines for questions asked by seven evaluators. The systems trained on MeQSum are evaluated on 60 questions by 3 evaluators, and the ones trained on the larger HealthCareMagic dataset are evaluated on 80 questions by 4 evaluators. The column on the right shows the number of seconds it takes for a loaded system to retrieve the answer to a query.</p><p>Similarly to our own pipeline, we create a two-stage retrieval to get answers. The first stage encodes questions from the knowledge base, and retrieves the question that is most relevant to the query. The second stage encodes the corresponding answer document, and retrieves the three sentences that are most relevant to the query.</p><p>For DPR, the query is simply the user question. For GAR, we need to generate a context to add to the user question: we choose to add the summary of the user question as the context. We train a BART encoder to summarize user questions, using the question summarization datasets.</p><p>Whereas our system's retrieval encoder is trained on our proposed self-supervised objectives, the retrieval encoders of the baselines are trained on Wikipedia for the task of retrieval-based question answering.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Do we retrieve relevant answers?</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.1">Evaluation Strategy</head><p>We hire seven annotators: four of whom are medical doctors, and the remaining three hold degrees related to healthcare or immunology.</p><p>We ask the evaluators to first write user questions, and then evaluate the answers retrieved by our system and the two existing systems. Given that our medical knowledge base contains a limited set of questions, we ask the evaluators to limit their questions to the topics covered by the nine sources from which the knowledge base was extracted.</p><p>Then, we ask the evaluators to rate the relevance of the answers retrieved by each system independently, on a scale of 1 (not relevant) to 5 (relevant). The full description of the scores given to the annotators is in the Appendix.</p><p>Each of the seven annotators wrote 20 questions, and each question gets three answers (one per system). We assign three annotators to the models trained on MeQSum, and four to the models trained on HealthCareMagic. The annotators rate answers only for the questions that they wrote themselves.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.2">Results and Discussion</head><p>We show the results of the evaluations in Table <ref type="table">2</ref>. The first three columns show the averages of the relevance scores given by annotators for all systems.</p><p>The results show that the evaluators preferred our system's answers over the answers retrieved by the two baselines. Our system gets relevance scores that are 0.6 to 0.7 points higher, out of 5 on the relevance scale. An annotator commented that they find our system to be "more organized and to-the-point than the rest of systems."<ref type="foot">4</ref> The two baselines seem to perform similarly to each other. This is likely because the two systems differ only in their queries: the query is generation-augmented for GAR, whereas it is simply the user question for DPR.</p><p>Overall, the relevance scores are on the lower side, as no system exceeds an average score of 2.5/5. This shows that consumer health question understanding and answering is a challenging task, especially since there are no labels to indicate whether an answer is relevant to a particular question, or which FAQ matches the user's intent.</p><p>The challenges of the task are also due to the limitations of the knowledge base. Some annotators noted that the retrieved answers were often not appropriate, or close to the topic but not answering the question. This is because MedQuAD does not cover all possible illnesses and medical conditions that users could ask about. Whereas a larger database would potentially solve coverage problems, it could come at the expense of the quality or verifiability of the answers. The MedQuAD dataset is at times noisy, and contains generic sentences that may not answer any question, or generic templates related to percentages of symptoms and how frequent they are.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Computational Speed</head><p>We run our system on a single 11GB GPU, whereas the two baselines are each run on four 16GB GPUs. We show the average duration required to retrieve answers for a single query in the right column of Table <ref type="table">2</ref>. In addition to achieving higher relevance scores, our system is significantly faster: more than 20 times faster than the two baselines. This is largely because we limit to 32 the number of knowledge base questions that we encode and compare the query embedding to. In contrast, DPR and GAR encode all questions in the knowledge base. This encoding is done once when the models are loaded, but the query similarity computation is done at each run, thereby lengthening the processing time.</p></div>
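The speed gap described above can be illustrated with a minimal sketch: per-query retrieval cost grows with the number of candidate embeddings the query is compared against, so scoring 32 candidates is far cheaper than scoring the full knowledge base. All names, embedding dimensions, and candidate counts below are illustrative assumptions, not taken from the paper's implementation.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query_emb, faq_embs):
    """Index of the FAQ embedding most cosine-similar to the query embedding."""
    return max(range(len(faq_embs)), key=lambda i: cosine(query_emb, faq_embs[i]))

# Hypothetical embeddings standing in for encoder outputs.
random.seed(0)
dim = 64
query = [random.gauss(0, 1) for _ in range(dim)]
# Our system compares the query against only 32 candidate FAQ embeddings;
# a DPR/GAR-style system scores the query against every question in the KB.
candidates = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(32)]
best = most_similar(query, candidates)
```

The similarity work per query is proportional to the number of candidates times the embedding dimension, which is why capping the candidate set at 32 yields a large constant-factor speedup over exhaustive scoring.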
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Analysis of Question Understanding</head><p>An additional way that our system outperforms the two baselines could be through summarization. We evaluate the summarization of consumer health questions using the ROUGE metric <ref type="bibr">(Lin, 2004)</ref>. Our GAR baseline uses a BART model trained on the summarization loss only. We show the results in Table <ref type="table">4</ref>. We notice that sharing encoder parameters between the summarization loss and our proposed self-supervised losses generally increases ROUGE F1 scores across both datasets. For HealthCareMagic, score increases exceed 2 points in ROUGE-1 and ROUGE-L.</p><p>Given that ROUGE is notoriously unreliable, we hire two additional annotators on Upwork who are healthcare workers to judge the fluency, coherence, informativeness and correctness of generated summaries. We show the annotators the consumer health question (source text), the reference FAQ (target text) and two generated summaries. The annotators do not know which system generated which summary. We show the evaluation scores in Table <ref type="table">3</ref>. We remove repetitions of reference FAQs in the test sets put up for evaluation. The results confirm that our self-supervised losses increase the quality of generated summaries. Summaries generated with our model win more often than they lose on all four metrics, and win more often than they tie with the summarization-only baseline on HealthCareMagic.</p></div>
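The ROUGE-1 F1 score reported above is, at its core, unigram overlap between a generated summary and its reference. The sketch below is a simplified illustration of that computation (no stemming, stopword removal, or multi-reference handling), not the official ROUGE implementation used for the paper's numbers.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("what causes migraines", "what causes migraines")` returns 1.0, while two summaries sharing no words score 0.0; ROUGE-2 and ROUGE-L follow the same precision/recall scheme over bigrams and longest common subsequences, respectively.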
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>We introduce an end-to-end pipeline for knowledge-grounded consumer health question understanding and answering (CHQUA). Our challenge is that we have no labels for question matching or answer relevance. We propose to use semantic similarity as a proxy for those labels, and we design three novel self-supervised losses: one matches the user's summarized question to a knowledge base question, and the other two work complementarily to teach our model to select a fixed number of relevant answer sentences. We compare our proposed system against two strong retrieval-based question answering baselines. We hire seven medical experts to ask their own questions, and they find that our system provides more relevant answers. Our system also achieves processing times that are more than 20 times faster. Finally, we find that our proposed self-supervised losses enable the summarizer model to achieve higher scores in ROUGE and human evaluation metrics, compared to a summarization-only baseline. However, we find that this task remains challenging and that there is still room for improvement. We release our code and model to encourage further research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ethical Considerations</head><p>Our model is built for medical question answering, but it should be used with caution, as it does not claim to provide medical advice. Potential users of our system should be warned not to blindly trust the answers given to their medical questions, and should always consult their physician for medical advice.</p><p>Each of our annotators spent between two and four hours on the task we gave them. Each annotator was compensated fairly for their work. We answered all of the annotators' questions about the task before they started. The hiring platform Upwork guarantees the payment, fair treatment and informed consent of our nine hired annotators through a mutually agreed-upon contract. The platform fee for Upwork was paid by us, and not deducted from the compensation of the annotators.</p><p>&#8226; Score of 3/5: The system's answer mentions one or more words or concepts from the question, but does not actually answer the question.</p><p>&#8226; Score of 4/5: The system's answer partially answers the question: it mentions one or more words or concepts from the question, but does not fully answer it.</p><p>&#8226; Score of 5/5: The system's answer fully answers the question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Question Understanding</head><p>For question summarization, we evaluate the generated summaries on 4 criteria. We define these criteria for the two healthcare worker annotators as follows:</p><p>&#8226; Fluency: which generated FAQ is more grammatically correct, and easier to read and to understand?</p><p>&#8226; Coherence: which generated FAQ is better structured and more organized?</p><p>&#8226; Informativeness: which generated FAQ captures more of the concern of the patient who wrote the CHQ?</p><p>&#8226; Correctness: which generated FAQ is more factually correct given the CHQ?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 Upwork</head><p>We ask annotators to work on Google Docs that we share with them. We show in Figure <ref type="figure">4</ref> an example of a Google Doc shared with an annotator (a medical doctor), in which they ask their own question and evaluate the answers we pasted for them.</p><p>Figure <ref type="figure">4</ref>: Example of a Google document, where a hired annotator (medical doctor) asks a question, and rates the answers that we pasted once retrieved by our system and the two baselines.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>https://github.com/abachaa/MedQuAD</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>https://github.com/UCSD-AI4H/ Medical-Dialogue-System</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>Annotators were not told that either system was ours or not. The systems were simply numbered for a blind evaluation.</p></note>
		</body>
		</text>
</TEI>
