An interesting behavior in large language models (LLMs) is prompt sensitivity: when provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty expressed in a model's output distribution for one prompt may not capture the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributable to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
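The decomposition idea can be made concrete with the simpler entropy-based baseline that the paper's metric improves upon. Below is a minimal sketch, not the paper's method: `sample_answer` is a hypothetical black-box LLM call, and the gap between the pooled answer entropy and the mean per-prompt entropy is read as the share of uncertainty attributable to prompt sensitivity.

```python
# Entropy-based baseline for uncertainty decomposition across paraphrases.
# `sample_answer(prompt)` is a hypothetical black-box call returning one
# sampled answer string; the paper's own metric refines this baseline.
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of an answer Counter."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def decompose_uncertainty(paraphrases, sample_answer, n_samples=20):
    """Entropy of the pooled (paraphrase-ensemble) answer distribution,
    split into mean within-prompt entropy and a prompt-sensitivity residual."""
    per_prompt = [Counter(sample_answer(p) for _ in range(n_samples))
                  for p in paraphrases]
    pooled = sum(per_prompt, Counter())
    total = entropy(pooled)
    within = sum(entropy(c) for c in per_prompt) / len(per_prompt)
    return {"total": total, "within_prompt": within,
            "prompt_sensitivity": total - within}
```

The pooled distribution is also what paraphrase-sampling calibration answers from: the ensemble over phrasings rather than any single phrasing.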
COSTELLO: Contrastive Testing for Embedding-Based Large Language Model as a Service Embeddings
Large language models have gained significant popularity and are often provided as a service (i.e., LLMaaS). Companies like OpenAI and Google provide online APIs that allow downstream users to build innovative applications on top of their LLMs. Despite this popularity, LLM safety and quality assurance are well-recognized real-world concerns, requiring extra effort to test these models. Unfortunately, while end-to-end services like ChatGPT have garnered rising attention in terms of testing, LLMaaS embeddings have received comparatively little scrutiny. We argue for the importance of testing and uncovering problematic individual embeddings without considering downstream applications. The abstraction and non-interpretability of embedding vectors, combined with the black-box inaccessibility of LLMaaS, make testing a challenging puzzle. This paper proposes COSTELLO, a black-box approach that reveals potential defects in abstract embedding vectors from LLMaaS via contrastive testing. Our intuition is that high-quality LLMs adequately capture the semantic relationships of input texts and properly represent those relationships in the high-dimensional space. Given an LLMaaS interface and seed inputs, COSTELLO can automatically generate test suites and output words with potentially problematic embeddings. The idea is to synthesize contrastive samples, including positive and negative samples, by mutating seed inputs under guidance. Our synthesis guidance leverages task-specific properties to control the mutation procedure and generate samples with known partial relationships in the high-dimensional space. We can thus compare the expected relationship (oracle) with the embedding distance (output of LLMs) to locate potentially buggy cases. We evaluate COSTELLO on 42 open-source (encoder-based) language models and two real-world commercial LLMaaS. Experimental results show that COSTELLO can effectively detect semantic violations, where more than 62% of violations on average result in erroneous behaviors (e.g., unfairness) of downstream applications.
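A minimal sketch of the contrastive-testing idea, under stated assumptions rather than the authors' implementation: `embed` stands in for any LLMaaS embedding endpoint, and the two mutation functions are illustrative. A case is flagged when a semantics-preserving mutation (positive) is embedded farther from the seed than a semantics-changing one (negative).

```python
# Hedged sketch of contrastive testing for embeddings; `embed` is a stand-in
# for a real LLMaaS embedding API call (string -> vector).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_test(seed, make_positive, make_negative, embed, margin=0.0):
    """Oracle: embed(seed) should be closer to the positive mutation than
    to the negative one; otherwise report a potential semantic violation."""
    anchor = embed(seed)
    sim_pos = cosine(anchor, embed(make_positive(seed)))
    sim_neg = cosine(anchor, embed(make_negative(seed)))
    return {"seed": seed, "sim_pos": sim_pos, "sim_neg": sim_neg,
            "violation": sim_pos < sim_neg + margin}

# Illustrative mutations: a synonym swap should stay close in embedding
# space; a sentiment flip should not.
report = contrastive_test(
    "the service was excellent",
    make_positive=lambda s: s.replace("excellent", "great"),
    make_negative=lambda s: s.replace("excellent", "terrible"),
    embed=lambda s: np.random.default_rng(abs(hash(s)) % 2**32).normal(size=16),
)
```

The final `embed` lambda is only a stand-in so the snippet runs offline; in practice it would wrap the service's API.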
- Award ID(s): 2342250
- PAR ID: 10598651
- Publisher / Repository: ACM
- Date Published:
- Journal Name: Proceedings of the ACM on Software Engineering
- Volume: 1
- Issue: FSE
- ISSN: 2994-970X
- Page Range / eLocation ID: 906 to 928
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing work mostly focuses on improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance given often vast generation demands and real-time requirements, has surprisingly received little attention. In this article, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property in LLMs that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our interesting observation is that the output length, rather than the input, determines the computation efficiency of LLMs, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations, and a runtime-generated end-of-sentence (EOS) token. Our key motivation is to generate test inputs that sufficiently delay the generation of EOS so that LLMs must run through enough iterations to satisfy the pre-configured threshold. We present LLMEffiChecker, which can work in both white-box and black-box settings. In the white-box scenario, LLMEffiChecker develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box scenario, LLMEffiChecker employs a causal-inference-based approach to find critical tokens and similarly applies three levels of imperceptible perturbation to them. Both settings effectively delay the appearance of EOS, compelling these inputs to reach the naturally unreachable threshold (a black-box sketch of this delay attack appears after this list). To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT, and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption on average by 325% to 3,244% and 344% to 3,616%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power on real-world mobile devices (i.e., draining more than 30 times the battery power of normal inputs).
-
Learning sentence representations that capture rich semantic meaning is crucial for many NLP tasks. Pre-trained language models such as BERT have achieved great success in NLP, but sentence embeddings extracted directly from these models do not perform well without fine-tuning. We propose Contrastive Learning of Sentence Representations (CLSR), a novel approach that applies contrastive learning to learn universal sentence representations on top of pre-trained language models. CLSR uses the semantic similarity of two sentences to construct positive instances for contrastive learning. Semantic information captured by the pre-trained models is preserved by extracting sentence embeddings from them with a proper pooling strategy. An encoder followed by a linear projection takes these embeddings as inputs and is trained under a contrastive objective (see the contrastive-objective sketch after this list). To evaluate the performance of CLSR, we run experiments on a range of pre-trained language models and their variants on a series of Semantic Contextual Similarity tasks. Results show that CLSR gains significant performance improvements over existing SOTA language models.
-
Word embeddings are commonly used to measure word-level semantic similarity in text, especially in direct word-to-word comparisons. However, the relationships between words in the embedding space are often viewed as approximately linear, and concepts composed of multiple words are treated as a sort of linear combination. In this paper, we demonstrate that this is not generally true and show how the relationships can be better captured by leveraging the topology of the embedding space. We propose a technique for directly computing new vectors representing multiple words in a way that naturally combines them into a new, more consistent space where distance correlates better with similarity (a sketch contrasting this with the naive linear baseline appears after this list). We show that this technique works well for natural language, even when a concept comprises multiple words, on a simple task derived from WordNet synset descriptions and examples of words. The generated vectors thus better represent complex concepts in the word embedding space.
-
Speech is a natural channel for human-computer interaction in robotics and consumer applications. Natural language understanding pipelines that start with speech can have trouble recovering from speech recognition errors. Black-box automatic speech recognition (ASR) systems, built for general-purpose use, are unable to take advantage of in-domain language models that could otherwise ameliorate these errors. In this work, we present a method for re-ranking black-box ASR hypotheses using an in-domain language model and a semantic parser trained for a particular task (a sketch of the re-ranking step appears after this list). Our re-ranking method significantly improves both transcription accuracy and semantic understanding over a state-of-the-art ASR system's vanilla output.
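The black-box sketch promised in the LLMEffiChecker item above: a greedy search for a one-character substitution that maximizes output length, a proxy for delayed EOS. `generate_len` is a stand-in for a call returning how many tokens the target LLM emits; the actual tool also perturbs at the token and structure levels and uses causal inference to pick critical tokens.

```python
# Toy black-box efficiency test: find the single-character substitution that
# forces the longest generation. `generate_len(text)` is a hypothetical call
# returning the number of tokens a target LLM emits for `text`.
import string

def most_costly_variant(sentence, generate_len):
    best_text, best_len = sentence, generate_len(sentence)
    for i in range(len(sentence)):
        for ch in string.ascii_lowercase:
            candidate = sentence[:i] + ch + sentence[i + 1:]
            n = generate_len(candidate)
            if n > best_len:
                best_text, best_len = candidate, n
    return best_text, best_len
```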
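For the CLSR item above, a minimal sketch of the contrastive objective, assuming pooled embeddings from a frozen pre-trained model and in-batch negatives (an InfoNCE-style loss); the names and dimensions are illustrative, not the paper's code.

```python
# InfoNCE-style contrastive loss over sentence-embedding pairs: z1[i], z2[i]
# are projections of a positive pair; other batch rows act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.05):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # pairwise cosine similarities
    labels = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

projector = torch.nn.Linear(768, 256)       # the linear projection head
z1 = projector(torch.randn(8, 768))         # stand-ins for pooled embeddings
z2 = projector(torch.randn(8, 768))
loss = info_nce(z1, z2)
```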
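For the word-embedding item above: the abstract does not give the exact construction, so this sketch only contrasts the naive linear baseline it argues against with one simple topology-aware alternative (weighting each word by the spread of its local neighborhood); the alternative is illustrative, not the paper's technique.

```python
# Naive linear combination of word vectors versus a simple
# neighborhood-weighted alternative; `vocab_matrix` holds all word vectors.
import numpy as np

def linear_combo(vectors):
    """Common baseline: the concept vector is the mean of its word vectors."""
    return np.mean(vectors, axis=0)

def neighborhood_weighted_combo(vectors, vocab_matrix, k=10):
    """Weight each word by its mean distance to its k nearest vocabulary
    neighbors, so words sitting in dense, generic regions contribute less."""
    weights = []
    for v in vectors:
        dists = np.sort(np.linalg.norm(vocab_matrix - v, axis=1))
        weights.append(dists[1:k + 1].mean())   # index 0 is the self-distance
    w = np.asarray(weights) / np.sum(weights)
    return (w[:, None] * np.asarray(vectors)).sum(axis=0)
```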
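For the re-ranking item above, a minimal sketch of the scoring step: the black-box ASR's n-best list is re-scored by interpolating its own log-probability with an in-domain language-model score and a semantic-parser confidence. `lm_logprob` and `parser_conf` are hypothetical task-trained scorers, and the weights would be tuned on held-out data.

```python
# Re-rank an ASR n-best list with in-domain scores; `nbest` is a list of
# (transcript, asr_logprob) pairs from the black-box recognizer.
def rerank(nbest, lm_logprob, parser_conf, alpha=0.6, beta=0.3):
    def score(item):
        text, asr_lp = item
        return ((1 - alpha - beta) * asr_lp
                + alpha * lm_logprob(text)      # in-domain LM score
                + beta * parser_conf(text))     # semantic-parser confidence
    return max(nbest, key=score)
```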