
Title: CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations
Recent work has aimed to capture nuances of human behavior by using LLMs to simulate responses from particular demographics in settings like social science experiments and public opinion surveys. However, there are currently no established ways to discuss or evaluate the quality of such LLM simulations. Moreover, there is growing concern that these LLM simulations are flattened caricatures of the personas that they aim to simulate, failing to capture the multidimensionality of people and perpetuating stereotypes. To bridge these gaps, we present CoMPosT, a framework to characterize LLM simulations using four dimensions: Context, Model, Persona, and Topic. We use this framework to measure open-ended LLM simulations’ susceptibility to caricature, defined via two criteria: individuation and exaggeration. We evaluate the level of caricature in scenarios from existing work on LLM simulations. We find that for GPT-4, simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial) are highly susceptible to caricature.
Publisher / Repository:
Association for Computational Linguistics
Journal Name:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Page Range / eLocation ID:
10853 to 10875
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Practitioners frequently take multiple samples from large language models (LLMs) to explore the distribution of completions induced by a given prompt. While individual samples can give high-quality results for given tasks, collectively there are no guarantees of the distribution over these samples induced by the generating LLM. In this paper, we empirically evaluate LLMs’ capabilities as distribution samplers. We identify core concepts and metrics underlying LLM-based sampling, including different sampling methodologies and prompting strategies. Using a set of controlled domains we evaluate the error and variance of the distributions induced by the LLM. We find that LLMs struggle to induce reasonable distributions over generated elements, suggesting that practitioners should more carefully consider the semantics and methodologies of sampling from LLMs. 
  2. Understanding propagation of scintillation light is critical for maximizing the discovery potential of next-generation liquid xenon detectors that use dual-phase time projection chamber technology. This work describes a detailed optical simulation of the DARWIN detector implemented using Chroma, a GPU-based photon tracking framework. To evaluate the framework and to explore ways of maximizing efficiency and minimizing the time of light collection, we simulate several variations of the conventional detector design. Results of these selected studies are presented. More generally, we conclude that the approach used in this work allows one to investigate alternative designs faster and in more detail than using conventional Geant4 optical simulations, making it an attractive tool to guide the development of the ultimate liquid xenon observatory.
  3. Recent innovation in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment make it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
  4. Abstract

    Kilometer-scale climate model simulations are useful tools to investigate past and future changes in extreme precipitation, particularly in mountain regions, where convection is influenced by complex topography and land–atmosphere interactions. In this study, we evaluate simulations of a flood-producing mesoscale convective system (MCS) downstream of the Tibetan Plateau (TP) in the Sichuan basin from a kilometer-scale multimodel and multiphysics ensemble. The aim is to better understand the physical processes that need to be correctly simulated for successfully capturing downstream MCS formation. We assess how the ensemble members simulate these processes and how sensitive the simulations are to different model configurations. The preceding vortex evolution over the TP, its interaction with the jet stream, and water vapor advection into the basin are identified as key processes for the MCS formation. Most modeling systems struggle to capture the interaction between the vortex and jet stream, and perturbing the model physics has little impact, while constraining the large-scale flow by spectral nudging improves the simulation. This suggests that an accurate representation of the large-scale forcing is crucial to correctly simulate the MCS and associated precipitation. To verify whether the identified shortcomings systematically affect the MCS climatology in longer-term simulations, we evaluate a 1-yr WRF simulation and find that the seasonal cycle and spatial distribution of MCSs are reasonably well captured and not improved by spectral nudging. While the simulations of the MCS case highlight challenges in extreme precipitation forecasting, we conclude that these challenges do not systematically affect simulated climatological MCS characteristics.

    Significance Statement

    Convective storm systems in mountain regions are not well understood, because the spatial resolution in conventional regional climate models is too coarse to resolve relevant processes. Here, we evaluate high-resolution climate model simulations of a storm system on the downwind side of the Tibetan Plateau. Understanding which models and model setups work well to represent this type of storm system is important because high-resolution models can help us understand mechanisms of storm formation in mountain regions and how climate change will affect these. A key finding is that most of the models struggle to capture the selected storm case, while a 1-yr simulation shows that the general statistics of storm systems around the Tibetan Plateau are still reasonably well captured.

  5. With recent advancements, large language models (LLMs) such as ChatGPT and Bard have shown the potential to disrupt many industries, from customer service to healthcare. Traditionally, humans interact with geospatial data through software (e.g., ArcGIS 10.3) and programming languages (e.g., Python). As a pioneer study, we explore the possibility of using an LLM as an interface to interact with geospatial datasets through natural language. To achieve this, we also propose a framework to (1) train an LLM to understand the datasets, (2) generate geospatial SQL queries based on a natural language question, (3) send the SQL query to the backend database, and (4) parse the database response back to human language. As a proof of concept, a case study was conducted on real-world data to evaluate its performance on various queries. The results show that LLMs can be accurate in generating SQL code for most cases, including spatial joins, although there is still room for improvement. As all geospatial data can be stored in a spatial database, we hope that this framework can serve as a proxy to improve the efficiency of spatial data analyses and unlock the possibility of automated geospatial analytics.
