NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).


ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance

Sun, Liwen; Agarwal, Abhineet; Kornblith, Aaron; Yu, Bin; Xiong, Chenyan (May 2024, ICML)

Full Text Available
ClueWeb22: 10 Billion Web Documents with Rich Information

https://doi.org/10.1145/3477495.3536321

Overwijk, Arnold; Xiong, Chenyan; Callan, Jamie (July 2022, ACM)

ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and has higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, for example, visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search. This talk shares the design and construction of ClueWeb22, and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning.
Full Text Available
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

https://doi.org/10.1145/3459637.3482124

Yu, HongChien; Xiong, Chenyan; Callan, Jamie (October 2021, CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management)

Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Their effectiveness depends on encoded embeddings to capture the semantics of queries and documents, a challenging task due to the shortness and ambiguity of search queries. This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels. It also keeps the document index unchanged to reduce overhead. ANCE-PRF significantly outperforms ANCE and other recent dense retrieval systems on several datasets. Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
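The retrieve, re-encode, retrieve-again loop described in this abstract can be illustrated with a minimal sketch. It is not ANCE-PRF itself: the learned BERT query encoder is replaced here by a simple Rocchio-style average of the query and the top-k feedback documents, and the names `search`, `prf_query`, `alpha`, and the toy vectors are all assumptions made for illustration. What it does show is that the document index is left untouched and only the query embedding changes.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity, the 'simple similarity metric' of dense retrieval."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec: Vector, index: List[Vector], k: int) -> List[int]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(range(len(index)), key=lambda i: dot(query_vec, index[i]), reverse=True)
    return ranked[:k]


def prf_query(query_vec: Vector, index: List[Vector], k: int, alpha: float = 0.5) -> List[float]:
    """Build a new query embedding from the query and its top-k feedback documents.

    Assumption: a fixed interpolation with the feedback centroid stands in for
    the learned PRF encoder described in the abstract.
    """
    top = search(query_vec, index, k)
    dim = len(query_vec)
    centroid = [sum(index[i][d] for i in top) / len(top) for d in range(dim)]
    return [alpha * q + (1 - alpha) * c for q, c in zip(query_vec, centroid)]


# Toy document index (hypothetical embeddings); it is never modified.
index = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
query = [1.0, 0.2]

first_pass = search(query, index, k=2)       # feedback documents
new_query = prf_query(query, index, k=2)     # refined query embedding
second_pass = search(new_query, index, k=4)  # re-rank against the same index
```

Because only the query vector is recomputed, the second search reuses the existing index as-is, which is the overhead saving the abstract points to.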
Full Text Available
Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval

https://doi.org/10.18653/v1/2021.naacl-main.368

Zhao, Chen; Xiong, Chenyan; Boyd-Graber, Jordan; Daumé III, Hal (January 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming text corpora are semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BEAMDR) that iteratively forms an evidence chain through beam search in dense representations. When evaluated on multi-hop question answering, BEAMDR is competitive with state-of-the-art systems, without using any semi-structured information. Through query composition in dense space, BEAMDR captures the implicit relationships between evidence in the reasoning chain. The code is available at https://github.com/henryzhao5852/BeamDR.
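A minimal sketch of beam search over dense representations, under stated assumptions, may help make the idea concrete. It is not the BEAMDR implementation: query composition is approximated by adding the retrieved passage's vector to the current query vector, chains are scored by summed dot products, and `beam_retrieve`, `hops`, `beam_size`, and the toy vectors are hypothetical names and data.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def beam_retrieve(
    query_vec: Vector,
    passages: List[Vector],
    hops: int = 2,
    beam_size: int = 2,
) -> List[Tuple[List[int], float]]:
    """Grow evidence chains hop by hop, keeping the best `beam_size` chains."""
    # Each beam entry: (cumulative score, passage ids in the chain, current query vector).
    beams = [(0.0, [], list(query_vec))]
    for _ in range(hops):
        candidates = []
        for score, chain, q in beams:
            for idx, p in enumerate(passages):
                if idx in chain:
                    continue  # do not revisit a passage within one chain
                # Assumed composition: the next-hop query is the sum of query and passage.
                composed = [a + b for a, b in zip(q, p)]
                candidates.append((score + dot(q, p), chain + [idx], composed))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return [(chain, score) for score, chain, _ in beams]


# Toy data: passage 0 mentions the query topic and a second topic,
# and passage 2 is only about that second topic.
passages = [[2.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 1.0]]
query = [2.0, 0.0, 0.0]

chains = beam_retrieve(query, passages, hops=2, beam_size=2)
```

In this toy setup passage 2 has zero similarity to the original query and only becomes reachable after the query is composed with passage 0, which is the kind of implicit link between evidence pieces the abstract describes.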
Full Text Available
Complex Factoid Question Answering with a Free-Text Knowledge Graph

https://doi.org/10.1145/3366423.3380197

Zhao, Chen; Xiong, Chenyan; Qian, Xin; Boyd-Graber, Jordan (April 2020, WWW '20: Proceedings of The Web Conference 2020)
We introduce DELFT, a factoid question answering system which combines the nuance and depth of knowledge graph question answering approaches with the broader coverage of free-text. DELFT builds a free-text knowledge graph from Wikipedia, with entities as nodes and sentences in which entities co-occur as edges. For each question, DELFT finds the subgraph linking question entity nodes to candidates using text sentences as edges, creating a dense and high coverage semantic graph. A novel graph neural network reasons over the free-text graph, combining evidence on the nodes via information along edge sentences, to select a final answer. Experiments on three question answering datasets show DELFT can answer entity-rich questions better than machine reading based models, BERT-based answer ranking and memory networks. DELFT's advantage comes from both the high coverage of its free-text knowledge graph (more than double that of DBpedia relations) and the novel graph neural network which reasons on the rich but noisy free-text evidence.
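The graph construction step described in this abstract (entities as nodes, co-occurrence sentences as edges) can be sketched as follows. This is an illustrative sketch only: entity linking is assumed to have been done already, the graph neural network is not shown, and the function names and example sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def build_free_text_graph(sentences: List[Tuple[str, List[str]]]) -> Dict[Edge, List[str]]:
    """Map each unordered entity pair to the sentences in which both entities occur.

    `sentences` is a list of (sentence text, entities mentioned in it).
    """
    edges: Dict[Edge, List[str]] = defaultdict(list)
    for text, entities in sentences:
        # Sorting makes (a, b) and (b, a) the same undirected edge key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)].append(text)
    return dict(edges)


def evidence_between(edges: Dict[Edge, List[str]], a: str, b: str) -> List[str]:
    """Return the edge sentences linking two entity nodes, if any."""
    key = (a, b) if a <= b else (b, a)
    return edges.get(key, [])


# Hypothetical, already entity-linked sentences.
sentences = [
    ("Paris is the capital of France.", ["Paris", "France"]),
    ("The Seine flows through Paris.", ["Seine", "Paris"]),
    ("France borders Spain.", ["France", "Spain"]),
]

graph = build_free_text_graph(sentences)
paris_france = evidence_between(graph, "Paris", "France")
```

Keeping the raw sentence on each edge, rather than a fixed relation label, is what lets a downstream model read the evidence text directly.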
Full Text Available
TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network

https://doi.org/10.1145/3366423.3380132

Shen, Jiaming; Shen, Zhihong; Xiong, Chenyan; Wang, Chi; Wang, Kuansan; Han, Jiawei (April 2020, WWW '20: The Web Conference 2020)

Taxonomies consist of machine-interpretable semantics and provide valuable knowledge for many web applications. For example, online retailers (e.g., Amazon and eBay) use taxonomies for product recommendation, and web search engines (e.g., Google and Bing) leverage taxonomies to enhance query understanding. Enormous efforts have been made on constructing taxonomies either manually or semi-automatically. However, with the fast-growing volume of web content, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand. In this paper, we study how to expand an existing taxonomy by adding a set of new concepts. We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of ⟨query concept, anchor concept⟩ pairs from the existing taxonomy as training data. Using such self-supervision data, TaxoExpan learns a model to predict whether a query concept is the direct hyponym of an anchor concept. We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data. Extensive experiments on three large-scale datasets from different domains demonstrate both the effectiveness and the efficiency of TaxoExpan for taxonomy expansion.
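The self-supervision idea, generating ⟨query concept, anchor concept⟩ pairs from the existing taxonomy, can be sketched in a few lines. This is a simplified illustration rather than the TaxoExpan pipeline: the negative-sampling scheme, the `make_pairs` name, the `num_negatives` parameter, and the toy taxonomy are assumptions, and neither the graph neural network nor the noise-robust objective is shown.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (query concept, anchor concept, label)


def make_pairs(parent_of: Dict[str, str], num_negatives: int = 1, seed: int = 0) -> List[Pair]:
    """Turn existing child -> parent links into labeled training pairs.

    Each existing link yields one positive pair (label 1). For each positive,
    `num_negatives` other concepts are sampled as wrong anchors (label 0).
    """
    rng = random.Random(seed)
    concepts = sorted(set(parent_of) | set(parent_of.values()))
    pairs: List[Pair] = []
    for query, anchor in sorted(parent_of.items()):
        pairs.append((query, anchor, 1))
        wrong_anchors = [c for c in concepts if c not in (query, anchor)]
        for neg in rng.sample(wrong_anchors, min(num_negatives, len(wrong_anchors))):
            pairs.append((query, neg, 0))
    return pairs


# Hypothetical toy taxonomy, stored as child -> direct parent.
taxonomy = {"dog": "mammal", "cat": "mammal", "mammal": "animal"}

pairs = make_pairs(taxonomy, num_negatives=1)
positives = [(q, a) for q, a, label in pairs if label == 1]
```

A sampled negative anchor can still be a plausible but non-direct ancestor (for example "animal" for "dog"), which is one source of the label noise the abstract's noise-robust objective is meant to handle.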
Full Text Available
Text Classification Using Label Names Only: A Language Model Self-Training Approach

https://doi.org/10.18653/v1/2020.emnlp-main.724

Meng, Yu; Zhang, Yunyi; Huang, Jiaxin; Xiong, Chenyan; Ji, Heng; Zhang, Chao; Han, Jiawei (January 2020, EMNLP'20: 2020 Conf. on Empirical Methods in Natural Language Processing, Nov. 2020)
Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification without using any labeled documents but learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.
Full Text Available
