NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software Artifacts

https://doi.org/10.1109/ICSE55347.2025.00067

Basak, Setu Kumar; English, K Virgil; Ogura, Ken; Kambara, Vitesh; Reaves, Bradley; Williams, Laurie (April 2025, IEEE)

GitGuardian monitored secrets exposure in public GitHub repositories and reported that developers leaked over 12 million secrets (database and other credentials) in 2023, indicating a 113% surge from 2021. Despite the availability of secret detection tools, developers ignore the tools' reported warnings because of false positives (25%−99%). However, each secret protects assets of different values accessible through asset identifiers (a DNS name and a public or private IP address). The asset information for a secret can aid developers in filtering false positives and prioritizing secret removal from the source code. However, existing secret detection tools do not provide the asset information, thus presenting difficulty to developers in filtering secrets only by looking at the secret value or finding the assets manually for each reported secret. The goal of our study is to aid software practitioners in prioritizing secrets removal by providing the assets information protected by the secrets through our novel static analysis tool. We present AssetHarvester, a static analysis tool to detect secret-asset pairs in a repository. Since the location of the asset can be distant from where the secret is defined, we investigated secret-asset co-location patterns and found four patterns. To identify the secret-asset pairs of the four patterns, we utilized three approaches (pattern matching, data flow analysis, and fast-approximation heuristics). We curated a benchmark of 1,791 secret-asset pairs of four database types extracted from 188 public GitHub repositories to evaluate the performance of AssetHarvester. AssetHarvester demonstrates precision of (97%), recall (90 %), and F1-score (94 %) in detecting secret-asset pairs. Our findings indicate that data flow analysis employed in AssetHarvester detects secret-asset pairs with 0 % false positives and aids in improving the recall of secret detection tools. Additionally, AssetHarvester shows 43 % increase in precision for database secret detection compared to existing detection tools through the detection of assets, thus reducing developer's alert fatigue.
more » « less
Free, publicly-accessible full text available April 26, 2026
A Comparative Study of Software Secrets Reporting by Secret Detection Tools

https://doi.org/10.1109/ESEM56168.2023.10304853

Basak, Setu Kumar; Cox, Jamison; Reaves, Bradley; Williams, Laurie (October 2023, IEEE)

According to GitGuardian’s monitoring of public GitHub repositories, secrets sprawl continued accelerating in 2022 by 67% compared to 2021, exposing over 10 million secrets (API keys and other credentials). Though many open-source and proprietary secret detection tools are available, these tools output many false positives, making it difficult for developers to take action and teams to choose one tool out of many. To our knowledge, the secret detection tools are not yet compared and evaluated. Aims: The goal of our study is to aid developers in choosing a secret detection tool to reduce the exposure of secrets through an empirical investigation of existing secret detection tools. Method: We present an evaluation of five opensource and four proprietary tools against a benchmark dataset. Results: The top three tools based on precision are: GitHub Secret Scanner (75%), Gitleaks (46%), and Commercial X (25%), and based on recall are: Gitleaks (88%), SpectralOps (67%) and TruffleHog (52%). Our manual analysis of reported secrets reveals that false positives are due to employing generic regular expressions and ineffective entropy calculation. In contrast, false negatives are due to faulty regular expressions, skipping specific file types, and insufficient rulesets. Conclusions: We recommend developers choose tools based on secret types present in their projects to prevent missing secrets. In addition, we recommend tool vendors update detection rules periodically and correctly employ secret verification mechanisms by collaborating with API vendors to improve accuracy.
more » « less
Full Text Available
SecretBench: A Dataset of Software Secrets

https://doi.org/10.1109/MSR59073.2023.00053

Basak, Setu Kumar; Neil, Lorenzo; Reaves, Bradley; Williams, Laurie (May 2023, IEEE)

According to GitGuardian’s monitoring of public GitHub repositories, the exposure of secrets (API keys and other credentials) increased two-fold in 2021 compared to 2020, totaling more than six million secrets. However, no benchmark dataset is publicly available for researchers and tool developers to evaluate secret detection tools that produce many false positive warnings. The goal of our paper is to aid researchers and tool developers in evaluating and improving secret detection tools by curating a benchmark dataset of secrets through a systematic collection of secrets from open-source repositories. We present a labeled dataset of source codes containing 97,479 secrets (of which 15,084 are true secrets) of various secret types extracted from 818 public GitHub repositories. The dataset covers 49 programming languages and 311 file types.
more » « less
Full Text Available
What Challenges Do Developers Face About Checked-in Secrets in Software Artifacts?

https://doi.org/10.1109/ICSE48619.2023.00141

Basak, Setu Kumar; Neil, Lorenzo; Reaves, Bradley; Williams, Laurie (May 2023, IEEE)

Throughout 2021, GitGuardian’s monitoring of public GitHub repositories revealed a two-fold increase in the number of secrets (database credentials, API keys, and other credentials) exposed compared to 2020, accumulating more than six million secrets. To our knowledge, the challenges developers face to avoid checked-in secrets are not yet characterized. The goal of our paper is to aid researchers and tool developers in understanding and prioritizing opportunities for future research and tool automation for mitigating checked-in secrets through an empirical investigation of challenges and solutions related to checked-in secrets. We extract 779 questions related to checkedin secrets on Stack Exchange and apply qualitative analysis to determine the challenges and the solutions posed by others for each of the challenges. We identify 27 challenges and 13 solutions. The four most common challenges, in ranked order, are: (i) store/version of secrets during deployment; (ii) store/version of secrets in source code; (iii) ignore/hide of secrets in source code; and (iv) sanitize VCS history. The three most common solutions, in ranked order, are: (i) move secrets out of source code/version control and use template config file; (ii) secret management in deployment; and (iii) use local environment variables. Our findings indicate that the same solution has been mentioned to mitigate multiple challenges. However, our findings also identify an increasing trend in questions lacking accepted solutions substantiating the need for future research and tool automation on managing secrets.
more » « less
Full Text Available
What are the Practices for Secret Management in Software Artifacts?

https://doi.org/10.1109/SecDev53368.2022.00026

Basak, Setu Kumar; Neil, Lorenzo; Reaves, Bradley; Williams, Laurie (October 2022, 2022 IEEE Secure Development Conference (SecDev))

hroughout 2021, GitGuardian's monitoring of public GitHub repositories revealed a two-fold increase in the number of secrets (database credentials, API keys, and other credentials) exposed compared to 2020, accumulating more than six million secrets. A systematic derivation of practices for managing secrets can help practitioners in secure development. The goal of our paper is to aid practitioners in avoiding the exposure of secrets by identifying secret management practices in software artifacts through a systematic derivation of practices disseminated in Internet artifacts. We conduct a grey literature review of Internet artifacts, such as blog articles and question and answer posts. We identify 24 practices grouped in six categories comprised of developer and organizational practices. Our findings indicate that using local environment variables and external secret management services are the most recommended practices to move secrets out of source code and to securely store secrets. We also observe that using version control system scanning tools and employing short-lived secrets are the most recommended practices to avoid accidentally committing secrets and limit secret exposure, respectively.
more » « less
Full Text Available
Characterizing the Security of Github {CI} Workflows

Koishybayev, Igibek; Nahapetyan, Aleksandr; Zachariah, Raima; Muralee, Siddharth; Reaves, Bradley; Kapravelos, Alexandros; Machiry, Aravind (August 2022, 31st USENIX Security Symposium (USENIX Security 22))

Continuous integration and deployment (CI/CD) has revolutionized software development and maintenance. Commercial CI/CD platforms provide services for specifying and running CI/CD actions. However, they present a security risk in their own right, given their privileged access to secrets, infrastructure, and ability to fetch and execute arbitrary code. In this paper, we study the security of the newly popular GitHub CI platform. We first identify four fundamental security properties that must hold for any CI/CD system: Admittance Control, Execution Control, Code Control, and Access to Secrets. We then examine if GitHub CI enforces these properties in comparison with the other five popular CI/CD platforms. We perform a comprehensive analysis of 447,238 workflows spanning 213,854 GitHub repositories. We made several disturbing observations. Our analysis shows that 99.8% of workflows are overprivileged and have read-write access (instead of read-only) to the repository. In addition, 23.7% of workflows are triggerable by a pull_request and use code from the underlying repository. An attacker can exploit these workflows and execute arbitrary code as part of the workflow. Due to the modular nature of workflows, we find that 99.7% of repositories in our dataset execute some externally developed plugin, called "Actions" , for various purposes. We found that 97% of repositories execute at least one Action that does not originate with a verified creator, and 18% of repositories in our dataset execute at least one Action with missing security updates. These represent potential attack vectors that can be used to compromise the execution of workflows, consequently leading to supply chain attacks. This work highlights the systemic risks inherent in CI/CD platforms like GitHub CI; we also present our own Github action, GWChecker, which functions as an early warning system for bad practices that violate the identified security properties.
more » « less
Full Text Available
Why secret detection tools are not enough: It’s not just about false positives - An industrial case study

https://doi.org/10.1007/s10664-021-10109-y

Rahman, Md Rayhanur; Imtiaz, Nasif; Storey, Margaret-Anne; Williams, Laurie (May 2022, Empirical Software Engineering)

Full Text Available

Search for: All records