Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs’ capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps compared to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at https://github.com/Zivenzhu/Multi-chart-QA.
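To make the four task formats concrete, here is a minimal, hypothetical sketch of what a multi-chart item and an exact-match accuracy loop could look like; the field names and the `ask_model` callable are illustrative assumptions, not the benchmark's actual schema or API.

```python
# Hypothetical multi-chart QA item; the real MultiChartQA schema may differ.
item = {
    "charts": ["chart_a.png", "chart_b.png"],  # several charts feed one question
    "task": "comparative",  # one of: direct, parallel, comparative, sequential
    "question": "Which chart reports the higher 2020 value, and by how much?",
    "answer": "Chart B, by 12",
}

def accuracy(items, ask_model):
    """Exact-match accuracy; `ask_model(charts, question)` is an assumed
    stand-in for a call to the MLLM under evaluation."""
    hits = sum(
        ask_model(it["charts"], it["question"]).strip() == it["answer"]
        for it in items
    )
    return hits / len(items)
```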
Plot Twist: Multimodal Models Don’t Comprehend Simple Chart Details
Recent advances in multimodal models show remarkable performance on real-world benchmarks for chart and figure understanding, such as ChartQA, that involve interpreting trends, comparing data points, and extracting insights from visuals. In this paper, we investigate the extent to which these models truly comprehend the underlying information in charts by posing direct, elementary questions about simple features such as axis ranges and values, to examine their fundamental visual understanding abilities in the context of charts. Our questions are applied to two sets of figures: synthetic and real-world. The empirical evaluation of 5 popular multimodal models on our dataset reveals shortfalls in understanding charts and figures, contrary to what their performance on complex benchmarks might suggest. For instance, Gemini Pro Vision achieves only 57.9% accuracy on our elementary set of questions on real-world plots, while other popular multimodal models show similar or lower performance. This work highlights an important limitation of current multimodal models and cautions against overly optimistic interpretations of their abilities based on the results of canonical evaluations.
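As a rough illustration of the elementary-probe setup described above, the following sketch builds a synthetic chart whose ground truth is known by construction; the specific question templates are assumptions for the example, not the paper's actual templates.

```python
import matplotlib.pyplot as plt

# Render a minimal synthetic bar chart with ground truth known by construction.
values = {"A": 3, "B": 7, "C": 5}
fig, ax = plt.subplots()
ax.bar(list(values.keys()), list(values.values()))
ax.set_ylim(0, 10)
fig.savefig("synthetic_chart.png")

# Elementary probes about simple features such as axis ranges and bar values,
# paired with answers that follow directly from how the figure was drawn.
probes = [
    ("What is the maximum value shown on the y-axis?", "10"),
    ("What is the value of bar B?", "7"),
]
```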
- Award ID(s): 2046873
- PAR ID: 10635514
- Publisher / Repository: Association for Computational Linguistics
- Date Published:
- Page Range / eLocation ID: 5922 to 5937
- Format(s): Medium: X
- Location: Miami, Florida, USA
- Sponsoring Org: National Science Foundation
More Like this
-
Introduction: Informational graphics and data representations (e.g., charts and figures) are critical for accessing educational content. Novel technologies, such as the multimodal touchscreen, which displays audio, haptic, and visual information, are promising platforms for diverse means of accessing digital content. This work evaluated educational graphics rendered on a touchscreen compared to the current standard for accessing graphical content. Method: Three bar charts and geometry figures were evaluated on student (N = 20) ability to orient to and extract information from the touchscreen and print. Participants explored the graphics and then were administered a set of questions (11–12 depending on graphic group). In addition, participants' attitudes toward using the mediums were assessed. Results: Participants performed statistically significantly better on questions assessing information orientation using the touchscreen than print for both bar charts and geometry figures. No statistically significant difference in information extraction ability was found between mediums on either graphic type. Participants responded significantly more favorably to the touchscreen than the print graphics, rating them as more helpful, interesting, and fun, and less confusing. Discussion: Participants were highly successful in accessing and orienting to information using the touchscreen, which was the preferred means of accessing graphical information compared to the print image for both geometry figures and bar charts. This study highlights challenges in presenting graphics both on touchscreens and in print. Implications for Practitioners: This study offers preliminary support for the use of multimodal touchscreen tablets as educational tools. Student ability using touchscreen-based graphics seems comparable to that with traditional types of graphics (large print and embossed, tactile graphics), although further investigation may be necessary for tactile graphic users. In summary, educators of students with blindness and visual impairments should consider ways to utilize new technologies, such as touchscreens, to provide more diverse access to graphical information.
-
Krause, Andreas; Brunskill, Emma Patricia; Cho, Kyunghyun; Engelhardt, Barbara Elizabeth; Sabato, Sivan; Scarlett, Jonathan (Ed.) Real-world data can be multimodally distributed, e.g., data describing the opinion divergence in a community, the interspike interval distribution of neurons, and the oscillators' natural frequencies. Generating multimodally distributed real-world data has become a challenge for existing generative adversarial networks (GANs). For example, Neural SDEs have demonstrated successful performance mainly in generating unimodal time series datasets. In this paper, we propose a novel time series generator, named directed chain GANs (DC-GANs), which inserts a time series dataset (called a neighborhood process of the directed chain, or input) into the drift and diffusion coefficients of the directed chain SDEs with distributional constraints. DC-GANs can generate new time series with the same distribution as the neighborhood process, and the neighborhood process provides the key step in learning and generating multimodally distributed time series. The proposed DC-GANs are examined on four datasets, including two stochastic models from social sciences and computational neuroscience, and two real-world datasets on stock prices and energy consumption. To the best of our knowledge, DC-GANs are the first work that can generate multimodal time series data, and they consistently outperform state-of-the-art benchmarks with respect to measures of distribution, data similarity, and predictive ability.
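A minimal sketch of the dynamics this describes, with the notation assumed here rather than quoted from the paper: the generated process X solves a directed chain SDE whose coefficients see the input neighborhood process X-tilde.

```latex
% Directed chain SDE with the neighborhood process \tilde{X} entering the
% drift b and diffusion \sigma (notation assumed for illustration):
\[
  \mathrm{d}X_t = b\bigl(t, X_t, \tilde{X}_t\bigr)\,\mathrm{d}t
                + \sigma\bigl(t, X_t, \tilde{X}_t\bigr)\,\mathrm{d}W_t,
  \qquad \operatorname{Law}(X) = \operatorname{Law}(\tilde{X}).
\]
```

The distributional constraint ties the law of the generated series to the law of its input neighborhood process, which is what lets the generator reproduce multimodal target distributions rather than collapsing to a single mode.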
-
Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications mandates that they exhibit the ability to compose their capabilities, generating and reasoning over entirely novel samples never seen in the training distribution. Prior work demonstrates that recent vision diffusion models exhibit intriguing compositional generalization abilities, but also fail rather unpredictably. What are the reasons underlying this behavior? Which concepts does the model generally find difficult to compose to form novel data? To address these questions, we perform a controlled study of compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model's ability to generate samples out-of-distribution. Our results show that: (i) the compositional structure of the data-generating process governs the order in which capabilities and an ability to compose them emerge; (ii) learning individual concepts impacts performance on compositional tasks, multiplicatively explaining sudden emergence; and (iii) learning and composing capabilities is difficult under correlations. We hope our study inspires further grounded research on understanding capabilities and compositionality in generative models from a data-centric perspective.
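As a hypothetical illustration of the kind of controlled split such a study implies (the attribute names here are invented for the example, not taken from the paper):

```python
from itertools import product

# Illustrative concept grid for a synthetic compositional-generalization study.
shapes = ["circle", "square"]
colors = ["red", "blue"]
sizes = ["small", "large"]
all_combos = set(product(shapes, colors, sizes))

# Hold out one never-seen attribute combination; train on the rest, then test
# whether the model can generate the held-out composition out-of-distribution.
held_out = {("square", "blue", "large")}
train_combos = all_combos - held_out
```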