NSF PAR Search | NSF Public Access Repository

What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims

Jones, J; Jiang, W; Synovic, N; Thiruvathukal, GK; Davis, JC (October 2024, Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024)

Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and packages, and they streamline package management. Much engineering work focuses on synthesizing packages from SPRs into a downstream project. Prior work has thoroughly characterized the SPRs associated with traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: A growing body of empirical research has examined PTM registries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes these studies to provide a systematic understanding of the current knowledge. Furthermore, much of the existing research includes unsupported qualitative claims and lacks sufficient quantitative analysis. Our research aims to fill these gaps by providing a thorough knowledge synthesis and using it to inform further quantitative analysis. Methods: To consolidate existing knowledge on PTM reuse, we first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative and lack quantitative evidence. We identify quantifiable metrics associated with those claims and measure them in order to substantiate the claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our two most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) there is a strong correlation between documentation quality and PTM popularity. Conclusions: Our findings validate several qualitative research claims with concrete metrics, confirming prior qualitative and case study research. Our measures show further dynamics of PTM reuse, motivating further research infrastructure and new kinds of measurements.
Full Text Available
PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software

Jiang, W; Yasmin, J; Jones, J; Synovic, N; Kuo, J; Bielanski, N; Tian, Y; Thiruvathukal, G K; Davis, J C (May 2024, 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR))

The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse. This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset’s comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model’s training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions. Our artifact is available at https://github.com/PurdueDualityLab/PeaTMOSS-Artifact. Our dataset is available at https://transfer.rcac.purdue.edu/file-manager?origin_id=ff978999-16c2-4b50-ac7a-947ffdc3eb1d&origin_path=%2F.
Full Text Available
A Signal Injection Attack Against Zero Involvement Pairing and Authentication for the Internet of Things

https://doi.org/10.1109/DESTION62938.2024.00008

Ahlgren, Isaac; West, Jack; Lee, Kyuin; Thiruvathukal, George; Klingensmith, Neil (May 2024, DESTION 2024)

Zero Involvement Pairing and Authentication (ZIPA) is a promising technique for autoprovisioning large networks of Internet-of-Things (IoT) devices. In this work, we present the first successful signal injection attack on a ZIPA system. Most existing ZIPA systems assume there is a negligible amount of influence from the unsecured outside space on the secured inside space. In reality, environmental signals do leak from adjacent unsecured spaces and influence the environment of the secured space. Our attack takes advantage of this fact to perform a signal injection attack on the popular Schurmann & Sigg algorithm. The keys generated by the adversary with a signal injection attack at 95 dBA are within the standard error of the legitimate device.
Full Text Available
An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions

https://doi.org/10.1109/ASP-DAC58780.2024.10473884

Tung, Caleb; Eliopoulos, Nicholas; Jajal, Purvish; Ramshankar, Gowri; Yang, Cheng-Yun; Synovic, Nicholas; Zhang, Xuecen; Chaudhary, Vipin; Thiruvathukal, George K; Lu, Yung-Hsiang (January 2024, Asia and South Pacific Design Automation Conference (ASP-DAC))

Computer vision often uses highly accurate Convolutional Neural Networks (CNNs), but these deep learning models are associated with ever-increasing energy and computation requirements. Producing more energy-efficient CNNs often requires model training, which can be cost-prohibitive. We propose a novel, automated method to make a pretrained CNN more energy-efficient without re-training. Given a pretrained CNN, we insert a threshold layer that filters activations from the preceding layers to identify regions of the image that are irrelevant, i.e., can be ignored by the following layers while maintaining accuracy. Our modified focused convolution operation saves inference latency (by up to 25%) and energy costs (by up to 22%) on various popular pretrained CNNs, with little to no loss in accuracy.
Full Text Available
Evolution of Winning Solutions in the 2021 Low-Power Computer Vision Challenge

https://doi.org/10.1109/MC.2023.3250246

Hu, Xiao; Jiao, Ziteng; Kocher, Ayden; Wu, Zhenyu; Liu, Junjie; Davis, James C; Thiruvathukal, George K; Lu, Yung-Hsiang (August 2023, Computer)

Full Text Available
Reusing Deep Learning Models: Challenges and Directions in Software Engineering

https://doi.org/10.1109/JVA60410.2023.00015

Davis, James C; Jajal, Purvish; Jiang, Wenxin; Schorlemmer, Taylor R; Synovic, Nicholas; Thiruvathukal, George K (July 2023, IEEE)

Full Text Available
An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry

https://doi.org/10.1109/ICSE48619.2023.00206

Jiang, Wenxin; Synovic, Nicholas; Hyatt, Matt; Schorlemmer, Taylor R; Sethi, Rohan; Lu, Yung-Hsiang; Thiruvathukal, George K; Davis, James C (May 2023, IEEE)
PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages

https://doi.org/10.1109/MSR59073.2023.00021

Jiang, Wenxin; Synovic, Nicholas; Jajal, Purvish; Schorlemmer, Taylor R; Tewari, Arav; Pareek, Bhavesh; Thiruvathukal, George K; Davis, James C (May 2023, IEEE)
An Empirical Study of Artifacts and Security Risks in the Pre-trained Model Supply Chain

https://doi.org/10.1145/3560835.3564547

Jiang, Wenxin; Synovic, Nicholas; Sethi, Rohan; Indarapu, Aryan; Hyatt, Matt; Schorlemmer, Taylor R.; Thiruvathukal, George K.; Davis, James C. (November 2022, Proceedings of the 1st ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (SCORED))

Deep neural networks achieve state-of-the-art performance on many tasks, but require increasingly complex architectures and costly training procedures. Engineers can reduce costs by reusing a pre-trained model (PTM) and fine-tuning it for their own tasks. To facilitate software reuse, engineers collaborate around model hubs, collections of PTMs and datasets organized by problem domain. Although model hubs are now comparable in popularity and size to other software ecosystems, the associated PTM supply chain has not yet been examined from a software engineering perspective. We present an empirical study of artifacts and security features in 8 model hubs. We indicate the potential threat models and show that the existing defenses are insufficient for ensuring the security of PTMs. We compare PTM and traditional supply chains, and propose directions for further measurements and tools to increase the reliability of the PTM supply chain.
Full Text Available
Snapshot Metrics Are Not Enough: Analyzing Software Repositories with Longitudinal Metrics

https://doi.org/10.1145/3551349.3559517

Synovic, Nicholas M.; Hyatt, Matt; Sethi, Rohan; Thota, Sohini; Shilpika; Miller, Allan J.; Jiang, Wenxin; Amobi, Emmanuel S.; Pinderski, Austin; Läufer, Konstantin; et al (October 2022, Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering)

Software metrics capture information about software development processes and products. These metrics support decision-making, e.g., in team management or dependency selection. However, existing metrics tools measure only a snapshot of a software project. Little attention has been given to enabling engineers to reason about metric trends over time: longitudinal metrics that give insight about process, not just product. In this work, we present PRIME (PRocess MEtrics), a tool to compute and visualize process metrics. The currently-supported metrics include productivity, issue density, issue spoilage, and bus factor. We illustrate the value of longitudinal data and conclude with a research agenda. The tool’s demo video can be watched at https://bit.ly/ase2022-prime. Source code can be found at https://github.com/SoftwareSystemsLaboratory/prime.
Full Text Available
