

Title: Reconciling the contrasting narratives on the environmental impact of large language models
Abstract

The recent proliferation of large language models (LLMs) has led to divergent narratives about their environmental impacts. Some studies highlight the substantial carbon footprint of training and using LLMs, while others argue that LLMs can lead to more sustainable alternatives to current practices. We reconcile these narratives by presenting a comparative assessment of the environmental impact of LLMs vs. human labor, examining their relative efficiency across energy consumption, carbon emissions, water usage, and cost. Our findings reveal that, while LLMs have substantial environmental impacts, their relative impacts can be dramatically lower than human labor in the U.S. for the same output, with human-to-LLM ratios ranging from 40 to 150 for a typical LLM (Llama-3-70B) and from 1200 to 4400 for a lightweight LLM (Gemma-2B-it). While the human-to-LLM ratios are smaller with regard to human labor in India, these ratios are still between 3.4 and 16 for a typical LLM and between 130 and 1100 for a lightweight LLM. Despite the potential benefit of switching from humans to LLMs, economic factors may cause widespread adoption to lead to a new combination of human and LLM-driven work, rather than a simple substitution. Moreover, the growing size of LLMs may substantially increase their energy consumption and lower the human-to-LLM ratios, highlighting the need for further research to ensure the sustainability and efficiency of LLMs.
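
To make the headline figures concrete, here is a back-of-the-envelope sketch of the kind of ratio the abstract reports. All numbers in it are hypothetical placeholders chosen for illustration, not values from the paper.

    # Hypothetical human-to-LLM energy ratio; both figures below are
    # illustrative placeholders, not the paper's measurements.
    llm_kwh_per_page = 0.002    # assumed energy for one LLM-written page
    human_kwh_per_page = 0.2    # assumed pro-rated footprint of the human
                                # labor needed to write the same page
    ratio = human_kwh_per_page / llm_kwh_per_page
    print(f"human-to-LLM energy ratio: {ratio:.0f}x")  # 100x with these inputs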

 
Award ID(s): 2121572
PAR ID: 10553329
Author(s) / Creator(s):
Publisher / Repository: Nature Publishing Group
Date Published:
Journal Name: Scientific Reports
Volume: 14
Issue: 1
ISSN: 2045-2322
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Recent innovation in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations, and identify the differences between training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
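
    The headroom arithmetic that motivates oversubscription can be sketched in a few lines. This is an illustration of the provisioning argument only, not POLCA itself; every wattage and budget below is invented for the example.

        # Illustrative oversubscription arithmetic (hypothetical numbers, not
        # POLCA): if inference servers rarely draw their rated peak, more of
        # them can share a fixed power budget, with power capping as a safety
        # valve for the rare intervals above the observed p99 draw.
        RATED_PEAK_W = 700       # nameplate peak draw per GPU server
        OBSERVED_P99_W = 500     # assumed 99th-percentile draw under inference
        BUDGET_W = 70_000        # facility power budget for the cluster

        conservative = BUDGET_W // RATED_PEAK_W      # provision for nameplate peak
        oversubscribed = BUDGET_W // OBSERVED_P99_W  # provision for observed p99
        gain = 100 * (oversubscribed - conservative) / conservative
        print(f"{conservative} vs. {oversubscribed} servers (+{gain:.0f}%)")
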
  2. Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. In this work, we present KnowNo, a framework for measuring and aligning the uncertainty of LLM-based planners, such that they know when they don't know and ask for help when needed. KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion while minimizing human help in complex multi-step planning settings. Experiments across a variety of simulated and real robot setups that involve tasks with different modes of ambiguity (e.g., from spatial to numeric uncertainties, from human preferences to Winograd schemas) show that KnowNo performs favorably over modern baselines (which may involve ensembles or extensive prompt tuning) in terms of improving efficiency and autonomy, while providing formal assurances. KnowNo can be used with LLMs out of the box without model finetuning, and suggests a promising lightweight approach to modeling uncertainty that can complement and scale with the growing capabilities of foundation models.
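
    A minimal sketch of the split conformal prediction recipe behind this idea (calibrate a threshold, form a prediction set, ask for help when the set is ambiguous) might look as follows. The scores and calibration data are synthetic, and this is not the authors' implementation.

        import numpy as np

        # Minimal sketch of split conformal prediction for a planner that
        # "asks for help", in the spirit of KnowNo but not the paper's code.
        # All scores here are synthetic stand-ins for LLM option likelihoods.
        rng = np.random.default_rng(0)

        def calibrate_threshold(scores_of_true_option, alpha=0.1):
            # Nonconformity = 1 - score of the correct option; its adjusted
            # (1 - alpha) quantile gives ~90% coverage on new tasks.
            nonconformity = 1.0 - np.asarray(scores_of_true_option)
            n = len(nonconformity)
            q = np.ceil((n + 1) * (1 - alpha)) / n
            return np.quantile(nonconformity, min(q, 1.0))

        def prediction_set(option_scores, qhat):
            # Keep every candidate plan whose nonconformity clears the bar.
            return [i for i, s in enumerate(option_scores) if 1.0 - s <= qhat]

        qhat = calibrate_threshold(rng.beta(8, 2, size=500))  # synthetic calibration
        scores = [0.70, 0.66, 0.05, 0.02]  # two near-tied plans: an ambiguous task
        pset = prediction_set(scores, qhat)
        print("execute autonomously" if len(pset) == 1
              else f"ambiguous {pset}; ask a human for help")
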
  3. While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning. 
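
    An evaluation loop for a benchmark of this shape could look like the sketch below; the field names, prompt template, and stub model are placeholders rather than the MuSR release's actual interface.

        # Sketch of an evaluation loop for a MuSR-style benchmark. The field
        # names, prompt template, and stub model are placeholders, not the
        # dataset's actual interface.
        def stub_llm(prompt: str) -> str:
            return "A"  # stand-in for a real chain-of-thought LLM call

        examples = [
            {"narrative": "(long murder-mystery text)",
             "choices": ["A", "B"], "answer": "A"},
        ]
        template = "{narrative}\n\nThink step by step, then answer with one of: {choices}.\n"

        correct = 0
        for ex in examples:
            prompt = template.format(narrative=ex["narrative"],
                                     choices=", ".join(ex["choices"]))
            correct += stub_llm(prompt).strip() == ex["answer"]
        print(f"accuracy: {correct / len(examples):.2f}")
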
  4. Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we introduce a method that bridges the natural language understanding capability of LLMs with the symbolic reasoning capability of action languages, formal languages for reasoning about actions. Our approach, termed LLM+AL, leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's expertise in automated reasoning based on encoded knowledge. We compare LLM+AL against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that while all methods exhibit various errors, LLM+AL, with relatively simple human corrections, consistently leads to correct answers, whereas using LLMs alone does not yield improvements even after human intervention. LLM+AL also contributes to automated generation of action languages.
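
    The division of labor described here, the LLM as semantic parser and the action language as reasoner, can be caricatured in a few lines. Everything below (the stub parser and the toy rule format) is invented for illustration and is not the paper's system.

        # Caricature of the LLM + action-language split (not the paper's
        # system): the LLM does semantic parsing into a formal rule, and a
        # symbolic step does the reasoning from the encoded knowledge.
        def stub_llm_parse(sentence: str) -> dict:
            # Stand-in for LLM semantic parsing of natural language into an
            # action-language-style causal rule.
            return {"action": "load", "causes": {"loaded": True}}

        def apply_action(state: dict, rule: dict) -> dict:
            # The symbolic part: apply the action's encoded effects to a state.
            new_state = dict(state)
            new_state.update(rule["causes"])
            return new_state

        rule = stub_llm_parse("Loading the gun causes it to be loaded.")
        print(apply_action({"loaded": False}, rule))  # {'loaded': True}
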
  5. Despite their extensive application in language understanding tasks, large language models (LLMs) still encounter challenges, including hallucinations (occasional fabrication of information) and alignment issues (lack of association with human-curated world models, e.g., intuitive physics or common-sense knowledge). Moreover, the black-box nature of LLMs presents significant obstacles to training them effectively to achieve desired behaviors. In particular, modifying the concept embedding spaces of LLMs can be highly intractable, since it requires analyzing the implicit impact of such adjustments on the myriad parameters within LLMs and the resulting inductive biases. We propose a novel architecture that wraps a powerful function-approximation architecture within an outer, interpretable read-out layer. This read-out layer can be scrutinized to explicitly observe the effects of concept modeling during LLM training. Our method stands in contrast with gradient-based implicit mechanisms, which depend solely on adjustments to the LLM parameters and thus evade scrutiny. By conducting extensive experiments across both generative and discriminative language modeling tasks, we evaluate the capabilities of our proposed architecture relative to state-of-the-art LLMs of similar sizes. Additionally, we offer a qualitative examination of the interpretable read-out layer and visualize the concepts it captures. The results demonstrate the potential of our approach for effectively controlling LLM hallucinations and enhancing alignment with human expectations.
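
    One way to picture such a read-out layer is sketched below in PyTorch. The concept names, layer sizes, and toy backbone are all assumptions made up for the illustration, not the architecture from the paper.

        import torch
        import torch.nn as nn

        # Sketch of an interpretable read-out layer over named concepts (an
        # illustration of the idea, not the authors' architecture).
        CONCEPTS = ["gravity", "containment", "animacy"]  # hypothetical names

        class ReadOutModel(nn.Module):
            def __init__(self, vocab=100, dim=32):
                super().__init__()
                self.encoder = nn.Sequential(  # stand-in for a black-box backbone
                    nn.Embedding(vocab, dim), nn.Flatten(1),
                    nn.LazyLinear(len(CONCEPTS))
                )
                # The one layer we can inspect directly: concepts -> output.
                self.read_out = nn.Linear(len(CONCEPTS), 2, bias=False)

            def forward(self, tokens):
                concept_scores = self.encoder(tokens)  # per-concept activations
                return self.read_out(concept_scores), concept_scores

        model = ReadOutModel()
        logits, concepts = model(torch.randint(0, 100, (1, 8)))
        for name, w in zip(CONCEPTS, model.read_out.weight.t()):
            print(f"{name}: {w.tolist()}")  # each output's weight on this concept
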

     