Title: COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval
We present a large, challenging dataset, COUGH, for COVID-19 FAQ retrieval. Similar to a standard FAQ dataset, COUGH consists of three parts: FAQ Bank, Query Bank and Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce Query Bank and Relevance Set, where the former contains 1,236 human-paraphrased queries while the latter contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing different FAQ retrieval models built on top of BM25 and BERT, among which the best model achieves 48.8 under P@5, indicating a great challenge presented by COUGH and encouraging future research for further improvement. Our COUGH dataset is available at https://github.com/sunlab-osu/covid-faq.
Award ID(s):
1942980 1815674
NSF-PAR ID:
10334272
Journal Name:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Query by Example is a well-known information retrieval task in which a document is chosen by the user as the search query and the goal is to retrieve relevant documents from a large collection. However, a document often covers multiple aspects of a topic. To address this scenario, we introduce the task of faceted Query by Example, in which users can also specify a finer-grained aspect in addition to the input query document. We focus on the application of this task in scientific literature search. We envision models which are able to retrieve scientific papers analogous to a query scientific paper along specifically chosen rhetorical structure elements as one solution to this problem. In this work, the rhetorical structure elements, which we refer to as facets, indicate objectives, methods, or results of a scientific paper. We introduce and describe an expert-annotated test collection to evaluate models trained to perform this task. Our test collection consists of a diverse set of 50 query documents in English, drawn from computational linguistics and machine learning venues. We carefully follow the annotation guideline used by TREC for depth-k pooling (k = 100 or 250) and the resulting data collection consists of graded relevance scores with high annotation agreement. State-of-the-art models evaluated on our dataset show a significant gap to be closed in further work. Our dataset may be accessed here: https://github.com/iesl/CSFCube.
  2. We introduce the concept of expected exposure as the average attention ranked items receive from users over repeated samples of the same query. Furthermore, we advocate for the adoption of the principle of equal expected exposure: given a fixed information need, no item should receive more or less expected exposure than any other item of the same relevance grade. We argue that this principle is desirable for many retrieval objectives and scenarios, including topical diversity and fair ranking. Leveraging user models from existing retrieval metrics, we propose a general evaluation methodology based on expected exposure and draw connections to related metrics in information retrieval evaluation. Importantly, this methodology relaxes classic information retrieval assumptions, allowing a system, in response to a query, to produce a distribution over rankings instead of a single fixed ranking. We study the behavior of the expected exposure metric and stochastic rankers across a variety of information access conditions, including ad hoc retrieval and recommendation. We believe that measuring and optimizing expected exposure metrics using randomization opens a new area for retrieval algorithm development and progress.
  3. Cough detection can provide an important marker to monitor chronic respiratory conditions. However, manual techniques which require human expertise to count coughs are both expensive and time-consuming. Recent Automatic Cough Detection Algorithms (ACDAs) have shown promise to meet clinical monitoring requirements, but only in recent years have they made their way to non-clinical settings, due to the required portability of sensing technologies and the extended duration of data recording. More precisely, these ACDAs operate at high sampling frequencies, which leads to high power consumption and computing requirements, making them difficult to implement on a wearable device. Additionally, reproducibility of their performance is essential. Unfortunately, as the majority of ACDAs were developed using private clinical data, it is difficult to reproduce their results. We present an ACDA that meets clinical monitoring requirements and reliably operates at a low sampling frequency. This ACDA is implemented using a convolutional neural network (CNN) and publicly available data. It achieves a sensitivity of 92.7%, a specificity of 92.3%, and an accuracy of 92.5% using a sampling frequency of just 750 Hz. We also show that a low sampling frequency allows us to preserve patients’ privacy by obfuscating their speech, and we analyze the trade-off between speech obfuscation for privacy and cough detection accuracy. Clinical relevance: This paper presents a new cough detection technique and preliminary analysis on the trade-off between detection accuracy and obfuscation of speech for privacy. These findings indicate that, using a publicly available dataset, we can sample signals at 750 Hz while still maintaining a sensitivity above 90%, a level suggested to be sufficient for clinical monitoring [1].
  4. Considering the widespread use of mobile and voice search, answer passage retrieval for non-factoid questions plays a critical role in modern information retrieval systems. Despite the importance of the task, the community still lacks large-scale non-factoid question answering collections with real questions and comprehensive relevance judgments. In this paper, we develop and release a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset, called ANTIQUE, contains 34k manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing. To facilitate further research, we also include a brief analysis of the data as well as baseline results on both classical and recently developed neural IR models.
  5. A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match the description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which considers not only the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval tasks as well.