This content will become publicly available on May 23, 2026

Title: InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions
Abstract: The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data-driven insights, yet significant challenges persist in accurately interpreting users' analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error-prone, and time-intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM-driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.
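The workflow described above pairs an intent-inference agent with a visualization-generation agent. The Python sketch below is only an illustrative outline of how such a pipeline could combine a direct-manipulation selection with a natural-language utterance; the function names, the `call_llm` stub, and the JSON intent schema are assumptions made for illustration, not InterChat's actual API.

```python
import json
from dataclasses import dataclass, field

# Hypothetical sketch of a two-agent generative visual analytics step:
# (1) infer a structured intent from text plus direct manipulation,
# (2) turn that intent into a visualization specification.

@dataclass
class InteractionContext:
    utterance: str                                         # natural-language input
    selected_points: list = field(default_factory=list)    # direct manipulation
    active_chart: str = "scatter"                          # currently displayed chart

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would query a model here."""
    return json.dumps({"action": "filter_and_compare",
                       "fields": ["region", "sales"],
                       "chart": "bar"})

def infer_intent(ctx: InteractionContext) -> dict:
    # Contextual linking: the prompt interleaves the utterance with the
    # manipulated visual elements so the model sees both modalities.
    prompt = (f"User said: {ctx.utterance}\n"
              f"User selected points: {ctx.selected_points}\n"
              f"Current chart: {ctx.active_chart}\n"
              "Return the analysis intent as JSON.")
    return json.loads(call_llm(prompt))

def generate_visualization(intent: dict) -> dict:
    # A second agent would map the intent to a concrete chart spec
    # (e.g., a Vega-Lite-style dictionary).
    return {"mark": intent["chart"],
            "encoding": {"x": intent["fields"][0], "y": intent["fields"][1]}}

if __name__ == "__main__":
    ctx = InteractionContext("Compare sales for these regions",
                             selected_points=[3, 7, 12])
    print(generate_visualization(infer_intent(ctx)))
```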
Award ID(s):
2427770
PAR ID:
10592712
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
Computer Graphics Forum
ISSN:
0167-7055
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract: Researchers collect large amounts of user interaction data with the goal of mapping users' workflows and behaviors to their high-level motivations, intuitions, and goals. Although the visual analytics community has proposed numerous taxonomies to facilitate this mapping process, no formal methods exist for systematically applying these existing theories to user interaction logs. This paper seeks to bridge the gap between visualization task taxonomies and interaction log data by making the taxonomies more actionable for interaction log analysis. To achieve this, we leverage structural parallels between how people express themselves through interactions and language by reformulating existing theories as regular grammars. We represent interactions as terminals within a regular grammar, similar to the role of individual words in a language, and patterns of interactions, or non-terminals, as regular expressions over these terminals to capture common language patterns. To demonstrate our approach, we generate regular grammars for seven existing visualization taxonomies and develop code to apply them to three public interaction log datasets. In analyzing these regular grammars, we find that the taxonomies at the low level (i.e., terminals) show mixed results in expressing multiple interaction log datasets, and taxonomies at the high level (i.e., regular expressions) have limited expressiveness, due primarily to two challenges: inconsistencies in interaction log dataset granularity and structure, and under-expressiveness of certain terminals. Based on our findings, we suggest new research directions for the visualization community to augment existing taxonomies, develop new ones, and build better interaction log recording processes to facilitate the data-driven development of user behavior taxonomies.
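As a rough illustration of that reformulation, the sketch below maps low-level interaction events to single-character terminals and matches a regular expression over the resulting string. The event names, terminal alphabet, and the "drill-down" pattern are invented for the example; they are not taken from the seven taxonomies the paper encodes.

```python
import re

# Illustrative only: map logged interaction events (terminals) to symbols,
# then describe a higher-level behavior as a regular expression over them.
TERMINALS = {"hover": "h", "click": "c", "zoom": "z", "filter": "f"}

# Hypothetical "drill-down" pattern: one or more hovers, a click,
# then a zoom or a filter.
DRILL_DOWN = re.compile(r"h+c(?:z|f)")

def encode(log):
    """Encode an interaction log as a string of terminal symbols."""
    return "".join(TERMINALS[event] for event in log)

log = ["hover", "hover", "click", "zoom", "filter", "click"]
encoded = encode(log)                    # -> "hhczfc"
print(DRILL_DOWN.findall(encoded))       # non-empty list => pattern observed
print(bool(DRILL_DOWN.search(encoded)))  # True
```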
  2. Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce COLLABLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning on these rewards, COLLABLLM goes beyond responding to user requests: it actively uncovers user intent and offers insightful suggestions, a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks, such as document creation. COLLABLLM significantly outperforms our baselines, with an average of 18.5% higher task performance and 46.3% better interactivity as rated by LLM judges. Finally, we conduct a large user study with 201 judges, in which COLLABLLM increases user satisfaction by 17.6% and reduces users' time spent by 10.4%.
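A minimal sketch of the reward idea described above: score a candidate response not only on the current turn but on simulated future turns of the conversation. The user simulator, turn scorer, rollout count, and discount factor below are placeholders chosen for illustration, not the COLLABLLM implementation.

```python
import random

# Hypothetical stand-ins for a user simulator and a turn-level scorer.
def simulate_user_reply(conversation):
    return "simulated user follow-up"

def turn_score(conversation):
    # A real system would score task progress / user satisfaction here.
    return random.random()

def multiturn_aware_reward(conversation, candidate, rollouts=4, horizon=3, gamma=0.9):
    """Estimate the long-term value of `candidate` by averaging scored rollouts."""
    total = 0.0
    for _ in range(rollouts):
        convo = conversation + [("assistant", candidate)]
        value, discount = turn_score(convo), gamma
        for _ in range(horizon):
            convo = convo + [("user", simulate_user_reply(convo))]
            convo = convo + [("assistant", "simulated assistant turn")]
            value += discount * turn_score(convo)
            discount *= gamma
        total += value
    return total / rollouts

print(multiturn_aware_reward([("user", "Help me draft a report")],
                             "Sure, what is the report about?"))
```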
  3. Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive web corpora can memorize and disclose individuals' confidential and private data, raising legal and ethical concerns. While many previous works have addressed this issue in LLMs via machine unlearning, it remains largely unexplored for MLLMs. To tackle this challenge, we introduce the Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles of public celebrities, with each profile featuring over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. The benchmark is divided into four sets to assess unlearning algorithms in terms of efficacy, generalizability, and model utility. Finally, we provide baseline results using existing generative model unlearning algorithms. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation tasks, while multimodal unlearning approaches perform better in classification with multimodal inputs.
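The profile-and-question structure described above lends itself to a simple data model. The sketch below is a guess at how such records might be organized; the field names and the example entry are invented and do not reflect the released MLLMU-Bench schema.

```python
from dataclasses import dataclass, field

# Illustrative data model for a multimodal unlearning benchmark record.
@dataclass
class QAPair:
    question: str
    answer: str
    modality: str          # "image+text" (multimodal) or "text" (unimodal)

@dataclass
class Profile:
    name: str
    is_fictitious: bool    # fictitious person vs. public celebrity
    image_path: str        # portrait used for multimodal questions
    qa_pairs: list[QAPair] = field(default_factory=list)

profile = Profile(
    name="Alex Rivera",
    is_fictitious=True,
    image_path="profiles/alex_rivera.png",
    qa_pairs=[QAPair("Where does this person live?", "Lisbon", "image+text"),
              QAPair("What is Alex Rivera's profession?", "Architect", "text")],
)
print(len(profile.qa_pairs))
```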
  4. In the field of visualization, understanding users' analytical reasoning is important for evaluating the effectiveness of visualization applications. Several studies have been conducted to capture and analyze user interactions to comprehend this reasoning process. However, few have successfully linked these interactions to users' reasoning processes. This paper introduces an approach that addresses this limitation by correlating semantic user interactions with analysis decisions using an interactive wire transaction analysis system and a visual state transition matrix, both designed as visual analytics applications. The system enables interactive analysis for evaluating financial fraud in wire transactions. It also allows mapping captured user interactions and analytical decisions back onto the visualization to reveal differences in users' decisions. The visual state transition matrix further aids in understanding users' analytical flows, revealing their decision-making processes. Classification machine learning algorithms are applied to evaluate how well the approach captures users' analytical reasoning by connecting the captured semantic user interactions to their decisions (i.e., suspicious, not suspicious, and inconclusive) on wire transactions. With these algorithms, classifying the semantic user interactions achieves an average accuracy of 72%. For classifying individual decisions, the average accuracy is 70%; notably, accuracy for classifying 'inconclusive' decisions reaches 83%. Overall, the proposed approach improves the understanding of users' analytical decisions and provides a robust method for evaluating user interactions in visualization tools.
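As a loose illustration of that classification step, the sketch below treats each session's semantic interaction sequence as a bag of tokens and trains a classifier to predict the analyst's decision. The interaction vocabulary, labels, and tiny synthetic dataset are made up for the example; the 72%/70%/83% figures come from the paper's study, not from this code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Synthetic, illustrative sessions: each string lists the semantic interactions
# recorded during one wire-transaction review.
sessions = [
    "filter_amount sort_date inspect_account flag_transfer",
    "inspect_account inspect_account compare_accounts flag_transfer",
    "filter_amount sort_date clear_filter close_case",
    "sort_date inspect_account close_case",
    "filter_amount compare_accounts defer review_later",
    "inspect_account defer review_later",
]
decisions = ["suspicious", "suspicious", "not_suspicious",
             "not_suspicious", "inconclusive", "inconclusive"]

model = make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0))
model.fit(sessions, decisions)

# Predict the decision for a new interaction sequence.
print(model.predict(["filter_amount inspect_account flag_transfer"]))
```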
  5. Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems that integrate data from images, text, and other sources using modality fusion techniques. This introduces new challenges for the existing LLM-based recommendation paradigm, which relies solely on text-modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multi-modal inputs have emerged, how to equip MLLMs with multi-modal recommendation capabilities remains largely unexplored. To this end, in this paper we propose the Multimodal Large Language Model-enhanced Sequential Multimodal Recommendation (MLLM-MSR) model. To capture dynamic user preferences, we design a two-stage user preference summarization method. Specifically, we first utilize an MLLM-based item-summarizer to extract image features from an item and convert the image into text. Then, we employ a recurrent user preference summarization paradigm to capture the dynamic changes in user preferences based on an LLM-based user-summarizer. Finally, to enable the MLLM for the multi-modal recommendation task, we propose to fine-tune an MLLM-based recommender using Supervised Fine-Tuning (SFT) techniques. Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.
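The two-stage summarization described above can be pictured as a simple fold over a user's interaction history. The sketch below uses stub functions in place of the MLLM item-summarizer and LLM user-summarizer; the function names, prompt handling, and example items are assumptions for illustration, not the MLLM-MSR code.

```python
# Stubs standing in for model calls; a real pipeline would query an MLLM/LLM.
def mllm_summarize_item(image_path: str, title: str) -> str:
    """Stage 1: convert an item's image (plus its title) into a text summary."""
    return f"{title}: item described from {image_path}"

def llm_update_preference(previous_summary: str, item_summary: str) -> str:
    """Stage 2: recurrently fold the next item into the user-preference summary."""
    return f"{previous_summary} | likes {item_summary}"

def summarize_user(interaction_history):
    """Recurrent preference summarization over a chronologically ordered history."""
    summary = "user preferences:"
    for image_path, title in interaction_history:
        item_text = mllm_summarize_item(image_path, title)   # stage 1
        summary = llm_update_preference(summary, item_text)  # stage 2
    return summary

history = [("img/shoe.jpg", "trail running shoes"),
           ("img/watch.jpg", "GPS sports watch")]
print(summarize_user(history))
# The resulting summary would then feed supervised fine-tuning (SFT)
# of an MLLM-based recommender.
```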