NSF PAR Search | NSF Public Access Repository

FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Bayat, Farima Fatahi; Zhang, Lechen; Munir, Sheza; Wang, Lu (July 2025, Association for Computational Linguistics)

Free, publicly-accessible full text available July 1, 2026
Evaluating Design Choices in Verifiable Generation with Open-source Models

Cao, Shuyang; Wang, Lu (May 2025, Association for Computational Linguistics)

Free, publicly-accessible full text available May 1, 2026
Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding

https://doi.org/10.18653/v1/2024.emnlp-main.583

Liu, Xin; Fatahi Bayat, Farima; Wang, Lu (November 2024, Association for Computational Linguistics)

Full Text Available
Verifiable Generation with Subsentence-Level Fine-Grained Citations

Cao, Shuyang; Wang, Lu (August 2024, Findings of the Annual Meeting of the Association for Computational Linguistics (Findings of ACL))

Verifiable generation requires large language models (LLMs) to cite source documents supporting their outputs, thereby improving output transparency and trustworthiness. Yet previous work mainly targets the generation of sentence-level citations, lacking specificity about which part of the sentence is backed by which cited source. This work studies verifiable generation with subsentence-level fine-grained citations to locate more precisely the generated content that is supported by the cited sources. We first present a dataset, SCIFI, comprising 10K Wikipedia paragraphs with subsentence-level citations. Each paragraph in SCIFI is paired with a set of candidate source documents for citation and a query that triggers the generation of the paragraph content. On SCIFI, we then evaluate the performance of state-of-the-art LLMs and strategies for processing long documents designed for these models. Our experimental results reveal key factors that can enhance the quality of citations, including expanding the source-document context accessible to the models and applying specialized model tuning.
Full Text Available
AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content

Cao, Shuyang; Wang, Lu (June 2024, Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL))

Long document summarization systems are critical for domains with lengthy and jargon-laden text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attention mechanisms or divide-and-conquer strategies. The former reduce theoretical time complexity but are still memory-heavy. The latter sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, meeting transcripts, screenplays, scientific papers, and novels, show that AWESOME produces summaries with better informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a smaller GPU memory footprint.
Full Text Available
LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

Liu, Xin; Khalifa, Muhammad; Wang, Lu (May 2024, Proceedings of the International Conference on Learning Representations (ICLR))

A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations of LMs as well as building more trustworthy models. However, standard calibration techniques may not be suited for LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. On the other hand, training-based methods require fine-tuning the entire model, which is impractical for LMs of large scale. We present LITCAB, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LITCAB improves model calibration while adding only < 2% of the original model parameters. For evaluation, we construct CAT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LITCAB with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-sourced LMs from GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on short generation tasks, but not necessarily on longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups for calibrating LMs.
Full Text Available
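
The LitCab abstract above reports calibration in terms of ECE (expected calibration error). As a rough illustration of that metric only, the following is a minimal sketch of the standard equal-width-bin ECE; the function name, the bin count, and the toy inputs are assumptions made here for illustration and are not taken from the paper or its benchmark.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted mean of
    |accuracy - average confidence| over the non-empty bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin by its confidence in [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data (made up): a model that is 90% confident but right 3 times out of 4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means the stated confidences track the observed accuracy more closely; a perfectly calibrated set of predictions scores 0.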
BOLT: Fast Energy-based Controlled Text Generation with Tunable Biases

Liu, Xin; Khalifa, Muhammad; Wang, Lu (January 2023, Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL))

Full Text Available
HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization

https://doi.org/10.18653/v1/2022.acl-long.58

Cao, Shuyang; Wang, Lu (July 2022, Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL))

Document structure is critical for efficient information consumption. However, it is challenging to encode it efficiently into the modern Transformer architecture. In this work, we present HIBRIDS, which injects Hierarchical Biases foR Incorporating Document Structure into the calculation of attention scores. We further present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries, where each follow-up question inquires about the content of its parent question-summary pair. We also annotate a new dataset with 6,153 question-summary hierarchies labeled on long government reports. Experiment results show that our model produces better question-summary hierarchies than comparisons on both hierarchy quality and content coverage, a finding also echoed by human judges. Additionally, our model improves the generation of long-form summaries from lengthy government reports and Wikipedia articles, as measured by ROUGE scores.
Full Text Available
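
The HIBRIDS abstract above describes injecting biases into the calculation of attention scores. The sketch below shows only the general idea of adding a bias term to scaled dot-product scores before the softmax. It is not the paper's formulation: the relation names, the bias values in `RELATION_BIAS`, and the function names are hypothetical, and in a trained model such biases would be learned rather than fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(query, keys, biases):
    """Scaled dot-product attention weights with one additive bias per key."""
    if len(keys) != len(biases):
        raise ValueError("need exactly one bias per key")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale + bias
        for key, bias in zip(keys, biases)
    ]
    return softmax(scores)


# Hypothetical biases keyed by the structural relation between the query
# token's section and each key token's section (illustrative values only).
RELATION_BIAS = {"same_section": 1.0, "parent_section": 0.5, "unrelated": 0.0}

query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical content on purpose
relations = ["same_section", "parent_section", "unrelated"]
weights = biased_attention_weights(query, keys, [RELATION_BIAS[r] for r in relations])
```

Because the three keys are identical, any difference in the resulting weights comes from the structural bias alone, which is the effect the sketch is meant to isolate.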
Time-aware Prompting for Text Generation

Cao, Shuyang; Wang, Lu (January 2022, Findings of the Conference on Empirical Methods in Natural Language Processing)

Full Text Available
CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization

https://doi.org/10.18653/v1/2021.emnlp-main.532

Cao, Shuyang; Wang, Lu (November 2021, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP))

We study generating abstractive summaries that are faithful and factually consistent with the given articles. A novel contrastive learning formulation is presented, which leverages both reference summaries, as positive training data, and automatically generated erroneous summaries, as negative training data, to train summarization systems that are better at distinguishing between them. We further design four types of strategies for creating negative samples, to resemble errors made commonly by two state-of-the-art models, BART and PEGASUS, found in our new human annotations of summary errors. Experiments on XSum and CNN/Daily Mail show that our contrastive learning framework is robust across datasets and models. It consistently produces more factual summaries than strong comparisons with post error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation. Human judges echo the observation and find that our model summaries correct more errors.
Full Text Available
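
The CLIFF abstract above describes training with reference summaries as positives and erroneous summaries as negatives. As a loose illustration of that idea, here is a generic InfoNCE-style contrastive loss over vector representations. It is not the paper's exact objective: the cosine similarity, the temperature parameter, the function names, and the toy vectors are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def contrastive_loss(anchor, positives, negatives, temperature=1.0):
    """Mean over positives of -log(exp(s_pos) / (exp(s_pos) + sum exp(s_neg))),
    where each s is a cosine similarity to the anchor divided by temperature."""
    if not positives:
        raise ValueError("at least one positive example is required")
    neg_terms = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    losses = []
    for p in positives:
        pos_term = math.exp(cosine(anchor, p) / temperature)
        losses.append(-math.log(pos_term / (pos_term + sum(neg_terms))))
    return sum(losses) / len(losses)


# Toy vectors (made up): the "faithful" representation matches the anchor,
# the "erroneous" one points the opposite way.
anchor = [1.0, 0.0]
faithful = [1.0, 0.0]
erroneous = [-1.0, 0.0]
good = contrastive_loss(anchor, [faithful], [erroneous])
bad = contrastive_loss(anchor, [erroneous], [faithful])
```

The loss is small when positives sit closer to the anchor than negatives do, and large when the roles are swapped, which is the property a contrastive objective of this kind rewards.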
