The ability to accurately interpret complex visual information is crucial for multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, systematic comparisons and detailed ablation studies are lacking for critical aspects such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
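The token-concatenation idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes that every encoder yields the same number of visual tokens and that features are joined per token along the channel dimension; the abstract does not state the axis, and the function name `concat_visual_tokens` and the sample values are hypothetical.

```python
from typing import List, Sequence

# One row per visual token, one column per feature channel.
Tokens = List[List[float]]


def concat_visual_tokens(encoder_outputs: Sequence[Tokens]) -> Tokens:
    """Join per-token features from several encoders along the channel axis.

    Assumption (not from the abstract): all encoders emit the same token count.
    """
    if not encoder_outputs:
        raise ValueError("need at least one encoder output")
    num_tokens = len(encoder_outputs[0])
    if any(len(out) != num_tokens for out in encoder_outputs):
        raise ValueError("all encoders must yield the same number of tokens")
    fused: Tokens = []
    for i in range(num_tokens):
        row: List[float] = []
        for out in encoder_outputs:
            row.extend(out[i])  # append this encoder's channels for token i
        fused.append(row)
    return fused


# Hypothetical outputs: 2 tokens each, with 3 and 2 channels respectively.
encoder_a = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
encoder_b = [[1.0, 2.0], [3.0, 4.0]]
fused = concat_visual_tokens([encoder_a, encoder_b])
```

Each fused token here carries 3 + 2 = 5 channels, which a projection layer would typically map into the language model's embedding space.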
This content will become publicly available on August 17, 2026
Can Multimodal Large Language Models Be Guided to Improve Industrial Anomaly Detection?
Abstract Industrial environments demand accurate detection of anomalies to maintain product quality and ensure operational safety. Traditional industrial anomaly detection (IAD) methods often lack the flexibility and adaptability needed in dynamic production settings, where new defect types and operational changes continually emerge. Recent advancements in multimodal large language models (MLLMs) have shown promise by combining visual and textual processing capabilities, yet they are often limited by their lack of domain-specific expertise, particularly regarding industry-standard defect tolerances. To overcome these limitations, we introduce Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD. Echo integrates four specialized modules: the Reference Extractor retrieves similar normal images to establish contextual baselines; the Knowledge Guide provides critical, industry-specific insights; the Reasoning Expert enables structured, stepwise analysis for complex queries; and the Decision Maker synthesizes information from the preceding modules to deliver precise, context-aware responses. Evaluations on the MMAD benchmark reveal that Echo significantly improves adaptability, precision, and robustness compared to conventional approaches. Our results demonstrate that guided MLLMs, when augmented with expert modules, can effectively bridge the gap between general visual understanding and the specialized requirements of industrial anomaly detection, paving the way for more reliable and interpretable inspection systems.
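The four-module flow described in this abstract can be pictured with a rough sketch. This is not Echo's implementation: the strictly sequential ordering, the `InspectionContext` structure, the function names, and the placeholder lookup tables are all assumptions made for illustration, and simple dictionary lookups stand in for the retrieval and MLLM calls a real system would make.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InspectionContext:
    """Shared state passed between the four modules (hypothetical structure)."""
    category: str
    question: str
    references: List[str] = field(default_factory=list)
    knowledge: str = ""
    reasoning: List[str] = field(default_factory=list)
    answer: str = ""


# Placeholder data standing in for a retrieval index and a domain knowledge base.
NORMAL_IMAGES: Dict[str, List[str]] = {"bottle": ["bottle_ok_01.png", "bottle_ok_02.png"]}
DEFECT_NOTES: Dict[str, str] = {"bottle": "Scratches on the label area are tolerated."}


def reference_extractor(ctx: InspectionContext) -> InspectionContext:
    # Stand-in for retrieving similar normal images as a baseline.
    ctx.references = NORMAL_IMAGES.get(ctx.category, [])
    return ctx


def knowledge_guide(ctx: InspectionContext) -> InspectionContext:
    # Stand-in for supplying industry-specific guidance.
    ctx.knowledge = DEFECT_NOTES.get(ctx.category, "")
    return ctx


def reasoning_expert(ctx: InspectionContext) -> InspectionContext:
    # Stand-in for structured, stepwise analysis.
    ctx.reasoning = [
        f"Compare the query image with {len(ctx.references)} normal reference(s).",
        f"Apply domain note: {ctx.knowledge or 'none available'}",
    ]
    return ctx


def decision_maker(ctx: InspectionContext) -> InspectionContext:
    # Stand-in for synthesizing the preceding modules' outputs into a response.
    ctx.answer = f"Q: {ctx.question} | evidence steps: {len(ctx.reasoning)}"
    return ctx


def run_pipeline(ctx: InspectionContext) -> InspectionContext:
    for module in (reference_extractor, knowledge_guide, reasoning_expert, decision_maker):
        ctx = module(ctx)
    return ctx


result = run_pipeline(InspectionContext(category="bottle", question="Is there a defect?"))
```

Passing one context object through every stage keeps each module's contribution inspectable, which matches the abstract's emphasis on interpretable inspection.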
- Award ID(s): 2434519
- PAR ID: 10650988
- Publisher / Repository: American Society of Mechanical Engineers
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- This paper examines the performance of Multimodal LLMs (MLLMs) in skilled production work, with a focus on welding. Using a novel data set of real-world and online weld images, annotated by a domain expert, we evaluate the performance of two state-of-the-art MLLMs in assessing weld acceptability across three contexts: RV & Marine, Aeronautical, and Farming. While both models perform better on online images, likely due to prior exposure or memorization, they also perform relatively well on unseen, real-world weld images. Additionally, we introduce WeldPrompt, a prompting strategy that combines Chain-of-Thought generation with in-context learning to mitigate hallucinations and improve reasoning. WeldPrompt improves model recall in certain contexts but exhibits inconsistent performance across others. These results underscore the limitations and potential of MLLMs in high-stakes technical domains and highlight the importance of fine-tuning, domain-specific data, and more sophisticated prompting strategies to improve model reliability. The study opens avenues for further research into multimodal learning in industry applications.
- Abstract Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve comparable performance to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.
- Abstract With the continuous modernization of water plants, the risk of cyberattacks on them potentially endangers public health and the economic efficiency of water treatment and distribution. This article underscores the importance of developing improved techniques to support cyber risk management for critical water infrastructure, given an evolving threat environment. In particular, we propose a method that uniquely combines machine learning, the theory of belief functions, operational performance metrics, and dynamic visualization to provide the required granularity for attack inference, localization, and impact estimation. We illustrate how the focus on visual domain‐aware anomaly exploration leads to performance improvement, more precise anomaly localization, and effective risk prioritization. Proposed elements of the method can be used independently, supporting the exploration of various anomaly detection methods. It thus can facilitate the effective management of operational risk by providing rich context information and bridging the interpretation gap.
- Abstract This review paper examines the application and challenges of machine learning (ML) in intelligent welding processes within the automotive industry, focusing on resistance spot welding (RSW) and laser welding. RSW is predominant in body-in-white assembly, while laser welding is critical for electric vehicle battery packs due to its precision and compatibility with dissimilar materials. The paper categorizes ML applications into three key areas: sensing, in-process decision-making, and post-process optimization. It reviews supervised learning models for defect detection and weld quality prediction, unsupervised learning for feature extraction and data clustering, and emerging generalizable ML approaches like transfer learning and federated learning that enhance adaptability across different manufacturing conditions. Additionally, the paper highlights the limitations of current ML models, particularly regarding generalizability when moving from lab environments to real-world production, and discusses the importance of adaptive learning techniques to address dynamically changing conditions. Case studies such as virtual sensing, defect detection in RSW, and optimization in laser welding illustrate practical applications. The paper concludes by identifying future research directions to improve ML adaptability and robustness in high-variability manufacturing environments, aiming to bridge the gap between experimental ML models and real-world implementation in automotive welding.
