
Title: Zero-Shot Multi-Label Topic Inference with Sentence Encoders and LLMs
In this paper, we conducted a comprehensive study with the latest Sentence Encoders and Large Language Models (LLMs) on the challenging task of “definition-wild zero-shot topic inference”, where users define or provide the topics of interest in real-time. Through extensive experimentation on seven diverse data sets, we observed that LLMs, such as ChatGPT-3.5 and PaLM, demonstrated superior generality compared to other LLMs, e.g., BLOOM and GPT-NeoX. Furthermore, Sentence-BERT, a BERT-based classical sentence encoder, outperformed PaLM and achieved performance comparable to ChatGPT-3.5.
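The abstract does not include implementation details; as a minimal sketch, definition-wild zero-shot topic inference with a sentence encoder can be framed as thresholded cosine similarity between the document embedding and the embeddings of the user-provided topic definitions. The embeddings, topic names, and threshold below are illustrative, not from the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def infer_topics(doc_emb, topic_defs, threshold=0.5):
    """Multi-label zero-shot inference: assign every user-defined topic
    whose definition embedding is similar enough to the document.

    topic_defs maps a topic name to the embedding of the definition the
    user provided for it in real time.
    """
    return [t for t, emb in topic_defs.items()
            if cosine(doc_emb, emb) >= threshold]
```

Because each topic is scored independently against its own threshold, a document can receive zero, one, or several labels, which is what distinguishes this multi-label setting from ordinary single-label zero-shot classification.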
Award ID(s):
2302974
PAR ID:
10524586
Author(s) / Creator(s):
; ;
Editor(s):
Bouamor, Houda; Pino, Juan; Bali, Kalika
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Page Range / eLocation ID:
16218 to 16233
Format(s):
Medium: X
Location:
Singapore
Sponsoring Org:
National Science Foundation
More Like this
  1. The advanced capabilities of Large Language Models (LLMs) have made them invaluable across various applications, from conversational agents and content creation to data analysis, research, and innovation. However, their effectiveness and accessibility also render them susceptible to abuse for generating malicious content, including phishing attacks. This study explores the potential of using four popular commercially available LLMs, i.e., ChatGPT (GPT 3.5 Turbo), GPT 4, Claude, and Bard, to generate functional phishing attacks using a series of malicious prompts. We discover that these LLMs can generate both phishing websites and emails that can convincingly imitate well-known brands and also deploy a range of evasive tactics that are used to elude detection mechanisms employed by anti-phishing systems. These attacks can be generated using unmodified or "vanilla" versions of these LLMs without requiring any prior adversarial exploits such as jailbreaking. We evaluate the performance of the LLMs towards generating these attacks and find that they can also be utilized to create malicious prompts that, in turn, can be fed back to the model to generate phishing scams - thus massively reducing the prompt-engineering effort required by attackers to scale these threats. As a countermeasure, we build a BERT-based automated detection tool that can be used for the early detection of malicious prompts to prevent LLMs from generating phishing content. Our model is transferable across all four commercial LLMs, attaining an average accuracy of 96% for phishing website prompts and 94% for phishing email prompts. We also disclose these vulnerabilities to the vendors of the concerned LLMs, with Google acknowledging it as a severe issue. Our detection model is available for use at Hugging Face, as well as a ChatGPT Actions plugin.
  2. Human-conducted rating tasks are resource-intensive and demand significant time and financial commitments. As Large Language Models (LLMs) like GPT emerge and exhibit prowess across various domains, their potential in automating such evaluation tasks becomes evident. In this research, we leveraged four prominent LLMs: GPT-4, GPT-3.5, Vicuna, and PaLM 2, to scrutinize their aptitude in evaluating teacher-authored mathematical explanations. We utilized a detailed rubric that encompassed accuracy, explanation clarity, the correctness of mathematical notation, and the efficacy of problem-solving strategies. During our investigation, we unexpectedly discerned the influence of HTML formatting on these evaluations. Notably, GPT-4 consistently favored explanations formatted with HTML, whereas the other models displayed mixed inclinations. When gauging Inter-Rater Reliability (IRR) among these models, only Vicuna and PaLM 2 demonstrated high IRR using the conventional Cohen’s Kappa metric for explanations formatted with HTML. Intriguingly, when a more relaxed version of the metric was applied, all model pairings showcased robust agreement. These revelations not only underscore the potential of LLMs in providing feedback on student-generated content but also illuminate new avenues, such as reinforcement learning, which can harness the consistent feedback from these models. 
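Cohen's Kappa, the Inter-Rater Reliability metric discussed above, corrects raw agreement between two raters for the agreement expected by chance. A self-contained sketch for two raters over categorical ratings (the toy ratings in the test are illustrative, not from the study):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa between two raters: (p_o - p_e) / (1 - p_e), where
    p_o is observed agreement and p_e is chance agreement implied by
    each rater's marginal label distribution."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(count_a) | set(count_b)
    expected = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

A kappa of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement; the "relaxed version" mentioned in the abstract would presumably loosen the exact-match criterion in the observed-agreement term.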
  3. Abstract Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts. 
  4. Introduction: Because developing integrated computer science (CS) curriculum is a resource-intensive process, there is interest in leveraging the capabilities of AI tools, including large language models (LLMs), to streamline this task. However, given the novelty of LLMs, little is known about their ability to generate appropriate curriculum content. Research Question: How do current LLMs perform on the task of creating appropriate learning activities for integrated computer science education? Methods: We tested two LLMs (Claude 3.5 Sonnet and ChatGPT 4-o) by providing them with a subset of national learning standards for both CS and language arts and asking them to generate a high-level description of learning activities that met standards for both disciplines. Four humans rated the LLM output – using an aggregate rating approach – in terms of (1) whether it met the CS learning standard, (2) whether it met the language arts learning standard, (3) whether it was equitable, and (4) its overall quality. Results: For Claude AI, 52% of the activities met language arts standards, 64% met CS standards, and the average quality rating was middling. For ChatGPT, 75% of the activities met language arts standards, 63% met CS standards, and the average quality rating was low. Virtually all activities from both LLMs were rated as neither actively promoting nor inhibiting equitable instruction. Discussion: Our results suggest that LLMs are not (yet) able to create appropriate learning activities from learning standards. The activities were generally not usable by classroom teachers without further elaboration and/or modification. There were also grammatical errors in the output, something not common with LLM-produced text. Further, standards in one or both disciplines were often not addressed, and the quality of the activities was often low. We conclude with recommendations for the use of LLMs in curriculum development in light of these findings. 
  5. Simulations are widely used to teach science in grade schools. These simulations are often augmented with a conversational artificial intelligence (AI) agent to provide real-time scaffolding support for students conducting experiments using the simulations. AI agents are highly tailored for each simulation, with a predesigned set of Instructional Goals (IGs). This makes it difficult for teachers to adjust IGs as the agent may no longer align with the revised IGs. Additionally, teachers are hesitant to adopt new third-party simulations for the same reasons. In this research, we introduce SimPal, a Large Language Model (LLM) based meta-conversational agent, to solve this misalignment issue between a pre-trained conversational AI agent and the constantly evolving pedagogy of instructors. Through natural conversation with SimPal, teachers first explain their desired IGs, based on which SimPal identifies a set of relevant physical variables and their relationships to create symbolic representations of the desired IGs. The symbolic representations can then be leveraged to design prompts for the original AI agent to yield better alignment with the desired IGs. We empirically evaluated SimPal using two LLMs, ChatGPT-3.5 and PaLM 2, on 63 Physics simulations from PhET and Golabz. Additionally, we examined the impact of different prompting techniques on LLM performance by utilizing the TELeR taxonomy to identify relevant physical variables for the IGs. Our findings showed that SimPal can do this task with a high degree of accuracy when provided with a well-defined prompt.
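The abstract does not specify the concrete form of SimPal's symbolic representations; one plausible minimal encoding, with entirely hypothetical class and field names, pairs the physical variables identified from the teacher's stated IG with the relationship that links them, then renders a prompt for the simulation's conversational agent:

```python
from dataclasses import dataclass

@dataclass
class SymbolicIG:
    """Hypothetical symbolic form of an Instructional Goal: the physical
    variables the student should explore and the relationship linking them."""
    variables: tuple
    relation: str

    def to_prompt(self) -> str:
        # Turn the symbolic goal into an instruction that can be fed to
        # the simulation's pre-trained conversational agent.
        joined = " and ".join(self.variables)
        return (f"Guide the student to explore how {joined} are related, "
                f"using the relationship: {self.relation}.")
```

Keeping the IG in a structured form like this (rather than free text) is what would let a meta-agent regenerate aligned prompts whenever the teacher revises the goal.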