Title: A Suite of LMs Comprehend Puzzle Statements as Well or Better Than Humans
Abstract: This paper reexamines a recent claim that large language models (LLMs) lag behind humans in language comprehension on what were described as minimally complex statements. We argue that human performance was overestimated and LM performance underestimated. Moreover, both people and lower-performing LMs are disproportionately challenged by queries involving potentially appropriate inferences, suggesting shared pragmatic sensitivity rather than model-specific deficits. Analysis of the more sensitive log probabilities of Llama-2-70B demonstrates ceiling-level accuracy and pragmatic sensitivity. A separate set of LM grammaticality judgments previously characterized as incorrect is shown to correlate with human judgments, while certain reasoning models approximate idealized judgments when prompted to respond as an expert generative syntactician. Overall, the findings suggest that apparent deficits in LM performance may reflect task design, evaluation choices, and assumptions about human performance rather than deficiencies in current models.
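The log-probability analysis described in the abstract can be illustrated with a minimal sketch: rather than parsing a model's free-form reply, each candidate answer is scored by its length-normalized log probability under the model, and the higher-scoring candidate is taken as the model's answer. The numbers and the `answer_score` helper below are hypothetical, for illustration only; they are not the paper's code or data.

```python
def answer_score(token_logprobs):
    # Length-normalized log probability: average the per-token log
    # probabilities so longer answers are not penalized for length alone.
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log probabilities for two candidate answers
# to a single puzzle query (illustrative numbers only).
yes_logprobs = [-0.1, -0.3]
no_logprobs = [-1.2, -0.9, -0.7]

choice = "yes" if answer_score(yes_logprobs) > answer_score(no_logprobs) else "no"
```

Because this comparison reads probabilities directly, it can detect a correct preference even when the model's sampled text would have been scored as a wrong answer.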
Award ID(s):
2339729
PAR ID:
10675137
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
MIT Press
Date Published:
Journal Name:
Open Mind
Volume:
10
ISSN:
2470-2986
Page Range / eLocation ID:
431 to 440
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract: In recent years, large language models (LLMs) and vision language models (VLMs) have excelled at tasks requiring human-like reasoning, inspiring researchers in engineering design to use language models (LMs) as surrogate evaluators of design concepts. But do these models actually evaluate designs like humans? While recent work has shown that LM evaluations sometimes fall within human variance on Likert-scale grading tasks, those tasks often obscure the reasoning and biases behind the scores. To address this limitation, we compare LM word embeddings (trained to capture semantic similarity) with human-rated similarity embeddings derived from triplet comparisons (“is A closer to B than C?”) on a dataset of design sketches and descriptions. We assess alignment via local tripletwise similarity and embedding distances, allowing for deeper insights than raw Likert-scale scores provide. We also explore whether describing the designs to LMs through text or images improves alignment with human judgments. Our findings suggest that text alone may not fully capture the nuances humans pick up on, yet text-based embeddings outperform their multimodal counterparts on satisfying local triplets. On the basis of these insights, we offer recommendations for effectively integrating LMs into design evaluation tasks.
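The tripletwise alignment metric used in the abstract above can be sketched directly: for every triplet (a, b, c), check whether two embedding spaces agree on the question "is a closer to b than to c?". The function below is a generic illustration of that idea, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def triplet_agreement(emb_a, emb_b):
    # Fraction of triplets (i, j, k) on which the two embedding spaces
    # give the same answer to "is i closer to j than to k?".
    triplets = list(combinations(range(emb_a.shape[0]), 3))
    agree = 0
    for i, j, k in triplets:
        in_a = np.linalg.norm(emb_a[i] - emb_a[j]) < np.linalg.norm(emb_a[i] - emb_a[k])
        in_b = np.linalg.norm(emb_b[i] - emb_b[j]) < np.linalg.norm(emb_b[i] - emb_b[k])
        agree += int(in_a == in_b)
    return agree / len(triplets)
```

An agreement of 1.0 means the two spaces rank all local similarities identically; chance level for random, unrelated spaces is near 0.5.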
  2. We present a game-theoretic model of pragmatics that we call ReCo (for Regularized Conventions). This model formulates pragmatic communication as a game in which players are rewarded for communicating successfully and penalized for deviating from a shared, “default” semantics. As a result, players assign utterances context-dependent meanings that jointly optimize communicative success and naturalness with respect to speakers’ and listeners’ background knowledge of language. By using established game-theoretic tools to compute equilibrium strategies for this game, we obtain principled pragmatic language generation procedures with formal guarantees of communicative success. Across several datasets capturing real and idealized human judgments about pragmatic implicature, ReCo matches, or slightly improves upon, predictions made by Iterated Best Response and Rational Speech Acts models of language understanding. 
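As a point of reference for the comparison above, the Rational Speech Acts (RSA) baseline can be written in a few lines: a literal listener normalizes a truth-conditional lexicon, a pragmatic speaker normalizes over utterances, and a pragmatic listener inverts the speaker. The two-utterance scalar-implicature lexicon below is a standard toy example (with a uniform prior and rationality parameter of 1), not data from the paper.

```python
import numpy as np

# Toy lexicon: rows = utterances ("some", "all"),
# cols = meanings (SOME-but-not-all, ALL); 1.0 = utterance true of meaning.
lexicon = np.array([[1.0, 1.0],
                    [0.0, 1.0]])

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

L0 = normalize(lexicon, axis=1)  # literal listener   P(meaning | utterance)
S1 = normalize(L0, axis=0)       # pragmatic speaker  P(utterance | meaning)
L1 = normalize(S1, axis=1)       # pragmatic listener P(meaning | utterance)

# L1 derives the scalar implicature: hearing "some", the pragmatic
# listener favors the SOME-but-not-all meaning (probability 0.75 here).
```

ReCo's regularization toward a shared default semantics would enter this picture as a penalty term in the speaker's objective; the unregularized chain above is the baseline it is evaluated against.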
  3. Despite the growing success of diffusion models in continuous-valued domains (e.g., images), similar efforts for discrete domains such as text have yet to match the performance of autoregressive language models. In this work, we present SSD-LM—a diffusion-based language model with two key design choices. First, SSD-LM is semi-autoregressive, iteratively generating blocks of text, allowing for flexible output length at decoding time while enabling local bidirectional context updates. Second, it is simplex-based, performing diffusion on the natural vocabulary space rather than a learned latent space, allowing us to incorporate classifier guidance and modular control using off-the-shelf classifiers without any adaptation. We evaluate SSD-LM on unconstrained text generation benchmarks, and show that it matches or outperforms strong autoregressive GPT-2 models across standard quality and diversity metrics, while vastly outperforming diffusion-based baselines. On controlled text generation, SSD-LM also outperforms competitive baselines, with an extra advantage in modularity. 
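The semi-autoregressive decoding scheme described above can be sketched abstractly: text is produced block by block, each block generated by an iterative denoiser conditioned on everything decoded so far. The `denoise_block` stub below stands in for SSD-LM's simplex diffusion step; only the control flow is illustrated.

```python
def generate(denoise_block, n_blocks, block_len):
    # Semi-autoregressive loop: blocks are produced left to right
    # (autoregressive across blocks), while within each block the
    # denoiser may revise all positions jointly (bidirectional).
    tokens = []
    for _ in range(n_blocks):
        block = denoise_block(tuple(tokens), block_len)
        tokens.extend(block)
    return tokens

# Stub denoiser: emits consecutive token ids starting after the context.
stub = lambda context, n: list(range(len(context), len(context) + n))
```

Varying `n_blocks` at decoding time is what gives the flexible output length the abstract mentions, without retraining.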
  4. Abstract: Liquid metal (LM) exhibits a distinct combination of high electrical conductivity comparable to that of metals and exceptional deformability derived from its liquid state, and is thus considered a promising material for high-performance soft electronics. However, rapidly patterning LM to achieve a sensory system with high sensitivity remains a challenge, mainly attributable to its poor rheological properties and wettability. Here, we report a rheological modification strategy for LM and strain redistribution mechanics that simultaneously simplify the scalable manufacturing process and significantly enhance the sensitivity of LM sensors. By incorporating SiO2 particles into LM, the modulus, yield stress, and viscosity of the LM-SiO2 composite are drastically enhanced, enabling 3D printability on soft materials for stretchable electronics. Sensors based on the printed LM-SiO2 composite show excellent mechanical flexibility, robustness, and strain- and pressure-sensing performance. Such sensors are integrated onto different locations of the human body for wearable applications. Furthermore, when integrated onto a tactile glove, the synergistic effect of strain and pressure sensing can decode clenching posture and hitting strength in boxing training. Assisted by a deep-learning algorithm, this tactile glove can recognize the technical execution of boxing punches, such as the jab, swing, uppercut, and combination punches, with 90.5% accuracy. This integrated multifunctional sensory system can find wide application in smart sports training, intelligent soft robotics, and human-machine interfaces.
  5. Martelli, Pier Luigi (Ed.)
    Abstract
    Motivation: The identification and understanding of drug–target interactions (DTIs) play a pivotal role in the drug discovery and development process. Sequence representations of drugs and proteins in computational models offer advantages such as widespread availability, easier input quality control, and reduced computational resource requirements. These make them efficient and accessible tools for various computational biology and drug discovery applications. Many sequence-based DTI prediction methods have been developed over the years. Despite advances in methodology, cold start DTI prediction involving an unknown drug or protein remains a challenging task, particularly for sequence-based models. We introduce DTI-LM, a novel framework that leverages advanced pretrained language models, harnessing their exceptional context-capturing abilities along with neighborhood information to predict DTIs. DTI-LM is specifically designed to rely solely on sequence representations of drugs and proteins, aiming to bridge the gap between warm start and cold start predictions.
    Results: Large-scale experiments on four datasets show that DTI-LM can achieve state-of-the-art performance on DTI predictions. Notably, it excels in overcoming the common challenges faced by sequence-based models in cold start predictions for proteins, yielding impressive results. The incorporation of neighborhood information through a graph attention network further enhances prediction accuracy. Nevertheless, a disparity persists between cold start predictions for proteins and for drugs. A detailed examination of DTI-LM reveals that language models exhibit contrasting capabilities in capturing similarities between drugs and between proteins.
    Availability and implementation: Source code is available at: https://github.com/compbiolabucf/DTI-LM.
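The combination of sequence embeddings with neighborhood information via graph attention, as described in the abstract above, can be sketched in miniature. The functions and dimensions below are hypothetical, intended only to show the shape of the computation, not DTI-LM's actual architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_neighbors(center, neighbors):
    # Single-head attention over graph neighbors: weight each neighbor
    # embedding by its dot-product similarity to the center node, then
    # average the attended neighborhood with the center's own embedding.
    weights = softmax(neighbors @ center)
    pooled = weights @ neighbors
    return (center + pooled) / 2.0

def dti_score(drug_emb, protein_emb):
    # Cosine similarity as a stand-in for the learned interaction head.
    return float(drug_emb @ protein_emb /
                 (np.linalg.norm(drug_emb) * np.linalg.norm(protein_emb)))
```

In a cold start setting, `attend_neighbors` is what lets an unseen entity borrow signal from known neighbors before scoring against a candidate partner.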