Benchmarking large and small MLLMs

Feng, Xuelu; Li, Yunsheng; Chen, Dongdong; Gao, Mei; Liu, Mengchen; Yuan, Junsong; Qiao, Chunming

doi:10.1007/s00138-025-01762-0

This content will become publicly available on November 1, 2026

Abstract Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve performance comparable to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.

Award ID(s): 2120369

PAR ID: 10646599

Author(s) / Creator(s): Feng, Xuelu; Li, Yunsheng; Chen, Dongdong; Gao, Mei; Liu, Mengchen; Yuan, Junsong; Qiao, Chunming

Publisher / Repository: Springer Science+Business Media

Date Published: 2025-11-01

Journal Name: Machine Vision and Applications

Volume: 36

Issue: 6

ISSN: 0932-8092

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
This content will become publicly available on November 1, 2026
Journal Article: https://doi.org/10.1007/s00138-025-01762-0
