skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, October 10 until 2:00 AM ET on Friday, October 11 due to maintenance. We apologize for the inconvenience.


Title: MalwareBench: Malware Samples are Not Enough
The prevalent use of third-party components in modern software development, coupled with rapid modernization and digitization, has significantly amplified the risk of software supply chain security attacks. Popular large registries like npm and PyPI are highly targeted malware distribution channels for attackers due to heavy growth and dependence on third-party components. Industry and academia are working towards building tools to detect malware in the software supply chain. However, a lack of benchmark datasets containing both malware and neutral packages hampers the evaluation of the performance of these malware detection tools. The goal of our study is to aid researchers and tool developers in evaluating and improving malware detection tools by contributing a benchmark dataset built by systematically collecting malicious and neutral packages from the npm and PyPI ecosystems. We present MalwareBench, a labeled dataset of 20,534 packages (of which 6,475 are malicious) of npm and PyPI ecosystems. We constructed the benchmark dataset by incorporating pre-existing malware datasets with the Socket internal benchmark data and including popular and newly released npm and PyPI packages. The ground truth labels of these packages were determined using the Socket AI Scanner and manual inspection.  more » « less
Award ID(s):
2207008
NSF-PAR ID:
10516042
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
IEEE/ACM
Date Published:
Journal Name:
Proceedings of the IEEE/ACM International Conference on Mining Software Repositories (MSR)
Format(s):
Medium: X
Location:
Lisbon, Portugal
Sponsoring Org:
National Science Foundation
More Like this
  1. The large amount of third-party packages available in fast-moving software ecosystems, such as Node.js/npm, enables attackers to compromise applications by pushing malicious updates to their package dependencies. Studying the npm repository, we observed that many packages in the npm repository that are used in Node.js applications perform only simple computations and do not need access to filesystem or network APIs. This offers the opportunity to enforce least-privilege design per package, protecting applications and package dependencies from malicious updates. We propose a lightweight permission system that protects Node.js applications by enforcing package permissions at runtime. We discuss the design space of solutions and show that our system makes a large number of packages much harder to be exploited, almost for free. 
    more » « less
  2. Background: As software development becomes more interdependent, unique relationships among software packages arise and form complex software ecosystems. Aim: We aim to understand the behavior of these ecosystems better through the lens of software supply chains and model how the effects of software dependency network affect the change in downloads of Javascript packages. Method: We analyzed 12,999 popular packages in NPM, between 01-December-2017 and 15-March-2018, using Linear Regression and Random Forest models and examined the effects of predictors representing different aspects of the software dependency supply chain on changes in numbers of downloads for a package. Result: Preliminary results suggest that the count and downloads of upstream and downstream runtime dependencies have a strong effect on the change in downloads, with packages having fewer, more popular packages as dependencies (upstream or downstream) likely to see an increase in downloads. This suggests that in order to interpret the number of downloads for a package properly, it is necessary to take into account the peculiarities of the supply chain (both upstream and downstream) of that package. Conclusion: Future work is needed to identify the effects of added, deleted, and unchanged dependencies for different types of packages, e.g. build tools, test tools. 
    more » « less
  3. Software supply chain attacks occur during the processes of producing software is compromised, resulting in vulnerabilities that target downstream customers. While the number of successful exploits is limited, the impact of these attacks is significant. Despite increased awareness and research into software supply chain attacks, there is limited information available on mitigating or architecting for these risks, and existing information is focused on singular and independent elements of the supply chain. In this paper, we extensively review software supply chain security using software development tools and infrastructure. We investigate the path that attackers find is least resistant followed by adapting and finding the next best way to complete an attack. We also provide a thorough discussion on how common software supply chain attacks can be prevented, preventing malicious hackers from gaining access to an organization's development tools and infrastructure including the development environment. We considered various SSC attacks on stolen code-sign certificates by malicious attackers and prevented unnoticed malware from passing by security scanners. We are aiming to extend our research to contribute to preventing software supply chain attacks by proposing novel techniques and frameworks. 
    more » « less
  4. Software supply chain attacks occur during the processes of producing software is compromised, resulting in vulnerabilities that target downstream customers. While the number of successful exploits is limited, the impact of these attacks is significant. Despite increased awareness and research into software supply chain attacks, there is limited information available on mitigating or architecting for these risks, and existing information is focused on singular and independent elements of the supply chain. In this paper, we extensively review software supply chain security using software development tools and infrastructure. We investigate the path that attackers find is least resistant followed by adapting and finding the next best way to complete an attack. We also provide a thorough discussion on how common software supply chain attacks can be prevented, preventing malicious hackers from gaining access to an organization’s development tools and infrastructure including the development environment. We considered various SSC attacks on stolen codesign certificates by malicious attackers and prevented unnoticed malware from passing by security scanners. We are aiming to extend our research to contribute to preventing software supply chain attacks by proposing novel techniques and frameworks. 
    more » « less
  5. Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against it. Open source components depend on each other directly and transitively, and evidence suggests that more effort is required for reporting and resolving the issues reported further upstream in this supply chain. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators is obtained and analyzed and dependencies on NPM packages are identified. Fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives a AUC-ROC value of 0.68, and variables representing aggregate participation patterns proved to be the important predictors. Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggests the need to increase visibility further 
    more » « less