The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse.This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset’s comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model’s training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions.Our artifact is available at https://github.com/PurdueDualityLab/PeaTMOSS-Artifact. Our dataset is available at https://transfer.rcac.purdue.edu/file-manager?origin_id=ff978999-16c2-4b50-ac7a-947ffdc3eb1d&origin_path=%2F. 
                        more » 
                        « less   
                    
                            
                            An Empirical Study of Artifacts and Security Risks in the Pre-trained Model Supply Chain
                        
                    
    
            Deep neural networks achieve state-of-the-art performance on many tasks, but require increasingly complex architectures and costly training procedures. Engineers can reduce costs by reusing a pre-trained model (PTM) and fine-tuning it for their own tasks. To facilitate software reuse, engineers collaborate around model hubs, collections of PTMs and datasets organized by problem domain. Although model hubs are now comparable in popularity and size to other software ecosystems, the associated PTM supply chain has not yet been examined from a software engineering perspective. We present an empirical study of artifacts and security features in 8 model hubs. We indicate the potential threat models and show that the existing defenses are insufficient for ensuring the security of PTMs. We compare PTM and traditional supply chains, and propose directions for further measurements and tools to increase the reliability of the PTM supply chain. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10427457
- Date Published:
- Journal Name:
- Proceedings of the 1st ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (SCORED
- Page Range / eLocation ID:
- 105 to 114
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse. This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset’s comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model’s training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions. Our artifact is available at https://github.com/PurdueDualityLab/PeaTMOSS-Artifact. Our dataset is available at https://transfer.rcac.purdue.edu/file-manager?origin_id=ff978999-16c2-4b50-ac7a-947ffdc3eb1d&origin_path=%2F.more » « less
- 
            Abstract As innovation in deep learning continues, many engineers are incorporating Pre-Trained Models (PTMs) as components in computer systems. Some PTMs are foundation models, and others are fine-tuned variations adapted to different needs. When these PTMs are named well, it facilitates model discovery and reuse. However, prior research has shown that model names are not always well chosen and can sometimes be inaccurate and misleading. The naming practices for PTM packages have not been systematically studied, which hampers engineers’ ability to efficiently search for and reliably reuse these models. In this paper, we conduct the first empirical investigation of PTM naming practices in the Hugging Face PTM registry. We begin by reporting on a survey of 108 Hugging Face users, highlighting differences from traditional software package naming and presenting findings on PTM naming practices. The survey results indicate a mismatch between engineers’ preferences and current practices in PTM naming. We then introduce DARA, the first automatedDNNARchitectureAssessment technique designed to detect PTM naming inconsistencies. Our results demonstrate that architectural information alone is sufficient to detect these inconsistencies, achieving an accuracy of 94% in identifying model types and promising performance (over 70%) in other architectural metadata as well. We also highlight potential use cases for automated naming tools, such as model validation, PTM metadata generation and verification, and plagiarism detection. Our study provides a foundation for automating naming inconsistency detection. Finally, we envision future work focusing on automated tools for standardizing package naming, improving model selection and reuse, and strengthening the security of the PTM supply chain.“The main idea is to treat a program as a piece of literature, addressed to human beings rather than to a computer”—D. Knuthmore » « less
- 
            Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and packages, and they streamline pack- age management. Much engineering work focuses on synthesizing packages from SPRs into a downstream project. Prior work has thoroughly characterized the SPRs associated with traditional soft- ware, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: A growing body of empirical research has examined PTM reg- istries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes them to provide a systematic understanding of the current knowledge. Furthermore, much of the existing research includes unsupported qualitative claims and lacks sufficient quantitative analysis. Our research aims to fill these gaps by providing a thorough knowledge synthesis and use it to inform further quantitative analysis. Methods: To consolidate existing knowledge on PTM reuse, we first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative and lack quantitative evi- dence. We identify quantifiable metrics assoiated with those claims, and measure in order to substantiate these claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our two most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: Our findings validate several qual- itative research claims with concrete metrics, confirming prior qualitative and case study research. Our measures show further dynamics of PTM reuse, motivating further research infrastructure and new kinds of measurements.more » « less
- 
            Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and packages, and they streamline pack- age management. Much engineering work focuses on synthesizing packages from SPRs into a downstream project. Prior work has thoroughly characterized the SPRs associated with traditional soft- ware, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: A growing body of empirical research has examined PTM registries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes them to provide a systematic understanding of the current knowledge. Furthermore, much of the existing research includes unsupported qualitative claims and lacks sufficient quantitative analysis. Our research aims to fill these gaps by providing a thorough knowledge synthesis and use it to inform further quantitative analysis. Methods: To consolidate existing knowledge on PTM reuse, we first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative and lack quantitative evidence. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our two most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: Our findings validate several qual- stative research claims with concrete metrics, confirming prior qualitative and case study research. Our measures show further dynamics of PTM reuse, motivating further research infrastructure and new kinds of measurements.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    