Language models (LMs) often struggle to pay enough attention to the input context, and generate texts that are unfaithful or contain hallucinations. To mitigate this issue, we present context-aware decoding (CAD), which follows a contrastive output distribution that amplifies the difference between the output probabilities when a model is used with and without context. Our experiments show that CAD, without additional training, significantly improves the faithfulness of different LM families, including OPT, GPT, LLaMA, and FLAN-T5 for summarization tasks (e.g., 14.3{\%} gain for LLaMA in factuality metrics). Furthermore, CAD is particularly effective in overriding a model{'}s prior knowledge when it contradicts the provided context, leading to substantial improvements in tasks where resolving the knowledge conflict is essential. 
                        more » 
                        « less   
                    
                            
                            How Language Model Hallucinations Can Snowball
                        
                    
    
            A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, respectively. We refer to this phenomenon as hallucination snowballing: an LM over-commits to early mistakes, leading to more mistakes that it otherwise would not make. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1922658
- PAR ID:
- 10535879
- Publisher / Repository:
- International Conference on Machine Learning (ICML) 2024
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations of LMs as well as building more trustworthy models. However, standard calibration techniques may not be suited for LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. On the other hand, training-based methods require fine-tuning the entire model, which is impractical for LMs of large scale. We present LITCAB, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LITCAB improves model calibration by only adding < 2% of the original model parameters. For evaluation, we construct CAT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LITCAB with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as large as 30%. We further conduct a comprehensive evaluation with multiple popular open-sourced LMs from GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on tasks with short generation tasks, but not necessarily for longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having much fewer parameters. (iii) Fine-tuning pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups for calibrating LMs.more » « less
- 
            Abstract In the face of climate change, climate literacy is becoming increasingly important. With wide access to generative AI tools, such as OpenAI’s ChatGPT, we explore the potential of AI platforms for ordinary citizens asking climate literacy questions. Here, we focus on a global scale and collect responses from ChatGPT (GPT-3.5 and GPT-4) on climate change-related hazard prompts over multiple iterations by utilizing the OpenAI’s API and comparing the results with credible hazard risk indices. We find a general sense of agreement in comparisons and consistency in ChatGPT over the iterations. GPT-4 displayed fewer errors than GPT-3.5. Generative AI tools may be used in climate literacy, a timely topic of importance, but must be scrutinized for potential biases and inaccuracies moving forward and considered in a social context. Future work should identify and disseminate best practices for optimal use across various generative AI tools.more » « less
- 
            The recent public releases of AI tools such as ChatGPT have forced computer science educators to reconsider how they teach. These tools have demonstrated considerable ability to generate code and answer conceptual questions, rendering them incredibly useful for completing CS coursework. While overreliance on AI tools could hinder students’ learning, we believe they have the potential to be a helpful resource for both students and instructors alike. We propose a novel system for instructor-mediated GPT interaction in a class discussion board. By automatically generating draft responses to student forum posts, GPT can help Teaching Assistants (TAs) respond to student questions in a more timely manner, giving students an avenue to receive fast, quality feedback on their solutions without turning to ChatGPT directly. Additionally, since they are involved in the process, instructors can ensure that the information students receive is accurate, and can provide students with incremental hints that encourage them to engage critically with the material, rather than just copying an AI-generated snippet of code. We utilize Piazza—a popular educational forum where TAs help students via text exchanges—as a venue for GPT-assisted TA responses to student questions. These student questions are sent to GPT-4 alongside assignment instructions and a customizable prompt, both of which are stored in editable instructor-only Piazza posts. We demonstrate an initial implementation of this system, and provide examples of student questions that highlight its benefits.more » « less
- 
            The development and measurable improvements in performance of large language models on natural language tasks opens the opportunity to utilize large language models in an educational setting to replicate human tutoring, which is often costly and inaccessible. We are particularly interested in large language models from the GPT series, created by OpenAI. In the original study we found that the quality of explanations generated with GPT-3.5 was poor, where two different approaches to generating explanations resulted in a 43% and 10% successrate. In a replication study, we were interested in whether the measurable improvements in GPT-4 performance led to a higher rate of success for generating valid explanations compared to GPT-3.5. A replication of the original study was conducted by using GPT-4 to generate explanations for the same problems given to GPT-3.5. Using GPT-4, explanation correctness dramatically improved to a success rate of 94%. We were further interested in evaluating if GPT-4 explanations were positively perceived compared to human-written explanations. A preregistered, follow-up study was implemented where 10 evaluators were asked to rate the quality of randomized GPT-4 and teacher-created explanations. Even with 4% of problems containing some amount of incorrect content, GPT-4 explanations were preferred over human explanations.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    