Title: Cryptocurrency Fraud and Code Sharing Data Set and Analysis Code
This release covers the state of the data and associated analysis code for determining code sharing between cryptocurrency codebases, funded through the end of the original NSF CRII award. This material is based on work supported by the National Science Foundation under Grant CNS-1849729.
Award ID(s):
1849729
PAR ID:
10377942
Author(s) / Creator(s):
Publisher / Repository:
Zenodo
Date Published:
Edition / Version:
v1.0
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Big data, the “new oil” of the modern data science era, has attracted much attention in the GIScience community. However, we have ignored the role of code in enabling the big data revolution in this modern gold rush. Instead, what attention code has received has focused on computational efficiency and scalability issues. In contrast, we have missed the opportunities that the more transformative aspects of code afford as ways to organize our science. These “big code” practices hold the potential for addressing some ill effects of big data that have been rightly criticized, such as algorithmic bias, lack of representation, gatekeeping, and issues of power imbalances in our communities. In this article, I consider areas where lessons from the open source community can help us evolve a more inclusive, generative, and expansive GIScience. These concern best practices for codes of conduct, data pipelines and reproducibility, refactoring our attribution and reward systems, and a reinvention of our pedagogy. 
  2. Background: Previous work has shown that students can understand more complicated pieces of code through the use of common software development tools (code execution, debuggers) than they can without them. Objectives: Given that tools can enable novice programmers to understand more complex code, we believe that students should be explicitly taught to do so, to facilitate their plan acquisition and development as independent programmers. In order to do so, this paper seeks to understand: (1) the relative utility of these tools, (2) the thought process students use to choose a tool, and (3) the degree to which students can choose an appropriate tool to understand a given piece of code. Method: We used a mixed-methods approach. To explore the relative effectiveness of the tools, we used a randomized controlled trial (𝑁 = 421) to observe student performance with each tool in understanding a range of different code snippets. To explore tool selection, we used a series of think-aloud interviews (𝑁 = 18) where students were presented with a range of code snippets to understand and were allowed to choose which tool they wanted to use. Findings: Overall, novices were more often successful in comprehending code when provided with access to code execution, perhaps because it was easier to test a larger set of inputs than with the debugger. As code complexity increased (as indicated by cyclomatic complexity), students became more successful with the debugger. We found that novices preferred code execution for simpler or familiar code, to quickly verify their understanding, and used the debugger on more complex or unfamiliar code, or when they were confused about a small subset of the code. High-performing novices were adept at switching between tools, alternating from a detail-oriented to a broader perspective of the code and vice versa, when necessary. Novices who were unsuccessful tended to be overconfident in their incorrect understanding or did not display a willingness to double-check their answers using a debugger. Implications: We can likely teach novices to independently understand code they do not recognize by utilizing code execution and debuggers. Instructors should teach students to recognize when code is complex (e.g., a large number of nested loops), and to carefully step through these loops using debuggers. We should additionally teach students to double-check their understanding of the code and to self-assess whether they are familiar with the code. They can also be encouraged to strategically switch between execution and debuggers to manage cognitive load, thus maximizing their problem-solving capabilities.
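    The complexity indicator used here, cyclomatic complexity, is roughly one plus the number of decision points in the code. A minimal sketch of that approximation in Python follows; it illustrates the metric only and is not the instrument used in the study.

    # Rough illustration only: approximate McCabe cyclomatic complexity
    # as 1 + the number of decision points found in a snippet's AST.
    import ast

    DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                      ast.ExceptHandler, ast.BoolOp, ast.comprehension)

    def approx_cyclomatic_complexity(source: str) -> int:
        decisions = sum(isinstance(node, DECISION_NODES)
                        for node in ast.walk(ast.parse(source)))
        return 1 + decisions

    snippet = "for row in grid:\n    for x in row:\n        if x > 0:\n            total += x\n"
    print(approx_cyclomatic_complexity(snippet))  # two loops + one branch -> 4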
  3. With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding and software development, safety and security concerns, such as generating or executing malicious code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests. RedCode consists of two parts to evaluate agents’ safety in unsafe code execution and generation: (1) RedCode-Exec provides challenging code prompts in Python as inputs, aiming to evaluate code agents’ ability to recognize and handle unsafe code. We then map the Python code to other programming languages (e.g., Bash) and natural text summaries or descriptions for evaluation, leading to a total of over 4,000 testing instances. We provide 25 types of critical vulnerabilities spanning various domains, such as websites, file systems, and operating systems. We provide a Docker sandbox environment to evaluate the execution capabilities of code agents and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents’ vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing unsafe operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen reveal that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are publicly available at https://github.com/AI-secure/RedCode. 
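    The execution side of such an evaluation hinges on isolating whatever the agent decides to run. As a minimal sketch only (the image, resource limits, and helper below are assumptions, not the RedCode harness), a candidate Python snippet can be executed in a throwaway Docker container via the Docker SDK:

    # Minimal sandbox sketch (assumptions, not the RedCode harness):
    # run a candidate snippet in a disposable container so unsafe
    # operations cannot touch the host system or the network.
    import docker

    client = docker.from_env()

    def run_in_sandbox(snippet: str, timeout: int = 10) -> str:
        container = client.containers.run(
            image="python:3.11-slim",     # assumed base image
            command=["python", "-c", snippet],
            network_disabled=True,        # no outbound traffic
            mem_limit="256m",
            detach=True,
        )
        try:
            container.wait(timeout=timeout)
            return container.logs().decode()
        finally:
            container.remove(force=True)

    print(run_in_sandbox("print(sum(range(10)))"))  # prints 45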
  4. To accelerate software development, much research has been performed to help people understand and reuse the huge amount of available code resources. Two important tasks have been widely studied: code retrieval, which aims to retrieve code snippets relevant to a given natural language query from a code base, and code annotation, where the goal is to annotate a code snippet with a natural language description. Despite their advancement in recent years, the two tasks are mostly explored separately. In this work, we investigate a novel perspective of Code annotation for Code retrieval (hence called “CoaCor”), where a code annotation model is trained to generate a natural language annotation that can represent the semantic meaning of a given code snippet and can be leveraged by a code retrieval model to better distinguish relevant code snippets from others. To this end, we propose an effective framework based on reinforcement learning, which explicitly encourages the code annotation model to generate annotations that can be used for the retrieval task. Through extensive experiments, we show that code annotations generated by our framework are much more detailed and more useful for code retrieval, and they can further improve the performance of existing code retrieval models significantly. 
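    The key idea is that the annotation model's reward comes from retrieval: an annotation is good if a retrieval model can use it to pick out the snippet it describes. A toy sketch of such a reward signal follows; the lexical scorer and reciprocal-rank reward are illustrative stand-ins, not CoaCor's actual models or training objective.

    # Toy sketch (not CoaCor's models): reward a generated annotation by
    # how highly a retrieval scorer ranks the code it describes among
    # distractors, using reciprocal rank as the RL reward signal.
    def lexical_score(query: str, code: str) -> float:
        q, c = set(query.lower().split()), set(code.lower().split())
        return len(q & c) / (len(q) or 1)   # stand-in for a retrieval model

    def retrieval_reward(annotation: str, target: str, code_base: list) -> float:
        ranked = sorted(code_base, reverse=True,
                        key=lambda code: lexical_score(annotation, code))
        return 1.0 / (1 + ranked.index(target))   # MRR-style reward

    target = "def add(a, b): return a + b"
    pool = [target, "def read_file(path): return open(path).read()"]
    print(retrieval_reward("add two numbers a and b", target, pool))  # 1.0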
  5. Dynamic code, i.e., code that is created or modified at runtime, is ubiquitous in today’s world. The behavior of dynamic code can depend on the logic of the dynamic code generator in subtle and non-obvious ways, e.g., JIT compiler bugs can lead to exploitable vulnerabilities in the resulting JIT-compiled code. Existing approaches to program analysis do not provide adequate support for reasoning about such behavioral relationships. This paper takes a first step in addressing this problem by describing a program representation and a new notion of dependency that allows us to reason about dependency and information flow relationships between the dynamic code generator and the generated dynamic code. Experimental results show that analyses based on these concepts are able to capture properties of dynamic code that cannot be identified using traditional program analyses.
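    The dependency in question runs from decisions made inside the code generator to the behavior of the code it emits. As a toy illustration only (not the paper's program representation or analysis), a small Python generator that conditionally omits a guard shows how a generator-side choice becomes a latent fault in the generated function:

    # Toy illustration (not the paper's representation): the generated
    # function's behavior depends on a decision taken in the generator,
    # so a generator-side bug surfaces only in the emitted code.
    def generate_scaler(fast_path: bool) -> str:
        lines = ["def scale(total, count):"]
        if not fast_path:                       # generator-side decision
            lines.append("    if count == 0: return 0.0")
        lines.append("    return total / count")
        return "\n".join(lines)

    src = generate_scaler(fast_path=True)       # guard silently omitted
    namespace = {}
    exec(src, namespace)                        # "compile" the dynamic code
    print(namespace["scale"](10, 2))            # 5.0
    # namespace["scale"](10, 0) now raises ZeroDivisionError; the failure
    # traces back to the generator's fast_path choice, not the call site.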