skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Zhang, Zhihan"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs’ capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps compared to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at https://github.com/Zivenzhu/Multi-chart-QA. 
    more » « less
    Free, publicly-accessible full text available April 27, 2026
  2. Midcircuit measurements (MCMs) are crucial ingredients in the development of fault-tolerant quantum computation. While there have been rapid experimental progresses in realizing MCMs, a systematic method for characterizing noisy MCMs is still under exploration. In this work, we develop a cycle benchmarking (CB)-type algorithm to characterize noisy MCMs. The key idea is to use a joint Fourier transform on the classical and quantum registers and then estimate parameters in the Fourier space, analogous to Pauli fidelities used in CB-type algorithms for characterizing the Pauli-noise channel of Clifford gates. Furthermore, we develop a theory of the noise learnability of MCMs, which determines what information can be learned about the noise model (in the presence of state preparation and terminating measurement noise) and what cannot, which shows that all learnable information can be learned using our algorithm. As an application, we show how to use the learned information to test the independence between measurement noise and state-preparation noise in an MCM. Finally, we conduct numerical simulations to illustrate the practical applicability of the algorithm. Similar to other CB-type algorithms, we expect the algorithm to provide a useful toolkit that is of experimental interest. Published by the American Physical Society2025 
    more » « less
  3. Climate change demands urgent action, yet understanding the environmental impact (EI) of everyday objects and activities remains challenging for the general public. While Life Cycle Assessment (LCA) offers a comprehensive framework for EI analysis, its traditional implementation requires extensive domain expertise, structured input data, and significant time investment, creating barriers for non-experts seeking real-time sustainability insights. Here we present the first autonomous sustainability assessment tool that bridges this gap by transforming unstructured natural language descriptions into in-context, interactive EI visualizations. Our approach combines language modeling and AI agents, and achieves >97% accuracy in transforming natural language into a data abstraction designed for simplified LCA modeling. The system employs a non-parametric datastore to integrate proprietary LCA databases while maintaining data source attribution and allowing personalized source management. We demonstrate through case studies that our system achieves results within 11% of traditional full LCA, while accelerating from hours of expert time to real-time. We conducted a formative elicitation study (N=6) to inform the design objectives of such EI communication augmentation tools. We implemented and deployed the tool as a Chromium browser extension and further evaluated it through a user study (N=12). This work represents a significant step toward democratizing access to environmental impact information for the general public with zero LCA expertise. 
    more » « less
    Free, publicly-accessible full text available September 3, 2026
  4. Free, publicly-accessible full text available February 12, 2026
  5. Chiruzzo, Luis; Ritter, Alan; Wang, Lu (Ed.)
    The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models’ ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs. 
    more » « less
    Free, publicly-accessible full text available April 27, 2026
  6. Free, publicly-accessible full text available April 25, 2026
  7. Instruction tuning has remarkably advanced large language models (LLMs) in understand- ing and responding to diverse human instruc- tions. Despite the success in high-resource lan- guages, its application in lower-resource ones faces challenges due to the imbalanced foun- dational abilities of LLMs across different lan- guages, stemming from the uneven language distribution in their pre-training data. To tackle this issue, we propose pivot language guided generation (PLUG), an approach that utilizes a high-resource language, primarily English, as the pivot to enhance instruction tuning in lower-resource languages. It trains the model to first process instructions in the pivot language, and then produce responses in the target lan- guage. To evaluate our approach, we introduce a benchmark, X-AlpacaEval, of instructions in 4 languages (Chinese, Korean, Italian, and Spanish), each annotated by professional trans- lators. Our approach demonstrates a significant improvement in the instruction-following abili- ties of LLMs by 29% on average, compared to directly responding in the target language alone. Further experiments validate the versatility of our approach by employing alternative pivot languages beyond English to assist languages where LLMs exhibit lower proficiency. 
    more » « less