Programming languages are essential tools for developers, and their evolution plays a crucial role in supporting the activities of developers. One instance of programming language evolution is the introduction of syntactic sugars, which are additional syntax elements that provide alternative, more readable code constructs. However, the process of designing and evolving a programming language has traditionally been guided by anecdotal experiences and intuition. Recent advances in tools and methodologies for mining open-source repositories have enabled developers to make data-driven software engineering decisions. In light of this, this paper proposes an approach for motivating data-driven programming evolution by applying frequent subgraph mining techniques to a large dataset of 166,827,154 open-source Java methods. The dataset is mined by generalizing Java control-flow graphs to capture broad programming language usages and instances of duplication. Frequent subgraphs are then extracted to identify potentially impactful opportunities for new syntactic sugars. Our diverse results demonstrate the benefits of the proposed technique by identifying new syntactic sugars involving a variety of programming constructs that could be implemented in Java, thus simplifying frequent code idioms. This approach can potentially provide valuable insights for Java language designers, and serve as a proof-of-concept for data-driven programming language design and evolution.
more »
« less
A randomized controlled trial on the effects of embedded computer language switching
Polyglot programming, the use of multiple programming languages during the development process, is common practice in modern software development. This study investigates this practice through a randomized controlled trial conducted under the context of database programming. Participants in the study were given coding tasks written in Java and one of three SQL-like embedded languages. One was plain SQL in strings, one was in Java only, and the third was a hybrid embedded language that was closer to the host language. We recorded 109 valid data points. Results showed significant differences in how developers of different experience levels code using polyglot techniques. Notably, less experienced programmers wrote correct programs faster in the hybrid condition (frequent, but less severe, switches), while more experienced developers that already knew both languages performed better in traditional SQL (less frequent, but more complete, switches). The results indicate that the productivity impact of polyglot programming is complex and experience level dependent.
more »
« less
- PAR ID:
- 10205025
- Date Published:
- Journal Name:
- Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engi- neering (ESEC/FSE ’20)
- Page Range / eLocation ID:
- 410 to 420
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract Source code is a form of human communication, albeit one where the information shared between the programmers reading and writing the code is constrained by the requirement that the code executes correctly. Programming languages are more syntactically constrained than natural languages, but they are also very expressive, allowing a great many different ways to express even very simple computations. Still, code written by developers is highly predictable, and many programming tools have taken advantage of this phenomenon, relying on language modelsurprisalas a guiding mechanism. While surprisal has been validated as a measure of cognitive load in natural language, its relation to human cognitive processes in code is still poorly understood. In this paper, we explore the relationship between surprisal and programmer preference at a small granularity—do programmers prefer more predictable expressions in code? Usingmeaning‐preserving transformations, we produce equivalent alternatives to developer‐written code expressions and run a corpus study on Java and Python projects. In general, language models rate the code expressions developerschooseto write as more predictable than these transformed alternatives. Then, we perform two human subject studies asking participants to choose between two equivalent snippets of Java code with different surprisal scores (one original and transformed). We find that programmersdoprefer more predictable variants, and that stronger language models like the transformer align more often and more consistently with these preferences.more » « less
-
Despite the best efforts of the security community, security vulnerabilities in software are still prevalent, with new vulnerabilities reported daily and older ones stubbornly repeating themselves. One potential source of these vulnerabilities is shortcomings in the used language and library APIs. Developers tend to trust APIs, but can misunderstand or misuse them, introducing vulnerabilities. We call the causes of such misuse blindspots. In this paper, we study API blindspots from the developers' perspective to: (1) determine the extent to which developers can detect API blindspots in code and (2) examine the extent to which developer characteristics (i.e., perception of code correctness, familiarity with code, confidence, professional experience, cognitive function, and personality) affect this capability. We conducted a study with 109 developers from four countries solving programming puzzles that involve Java APIs known to contain blindspots. We find that (1) The presence of blindspots correlated negatively with the developers' accuracy in answering implicit security questions and the developers' ability to identify potential security concerns in the code. This effect was more pronounced for I/O-related APIs and for puzzles with higher cyclomatic complexity. (2) Higher cognitive functioning and more programming experience did not predict better ability to detect API blindspots. (3) Developers exhibiting greater openness as a personality trait were more likely to detect API blindspots. This study has the potential to advance API security in (1) design, implementation, and testing of new APIs; (2) addressing blindspots in legacy APIs; (3) development of novel methods for developer recruitment and training based on cognitive and personality assessments; and (4) improvement of software development processes (e.g., establishment of security and functionality teams).more » « less
-
Static Analysis (SA) in Cybersecurity is a practice aimed at detecting vulnerabilities within the source code of a program. Modern SA applications, though highly sophisticated, lack programming language agnostic generalization, instead requiring codebase specific implementations for each programming language. The manner in which SA is implemented today, though functional, requires significant man hours to develop and maintain, higher costs due to custom applications for each language, and creates inconsistencies in implementation from SA-tool to SA-tool. One promising source of programming language generalization occurs within the compilers used to compile code for programming languages like C, C++, and Java. During the compilation process, source code of varying languages moves through several validation passes before being converted into a grammatically consistent Intermediate Representation (IR). The grammatical consistencies provided by IRs allow the same program derived from different programming languages to be represented uniformly and thus analyzed for vulnerabilities. By using IRs of compiled programming languages as the codebase of SA practices, multiple programming languages can be encompassed by a single SA tool. To begin understanding the possibilities the combination of SA and IRs may reveal, this research presents the following outcomes: 1) a systematic literature search, 2) a literature review, and 3) the classification of existing work pertaining to SA practices using IRs. The results of the study indicate that generalized Static Analysis using IRs is already a common practice in all compilers, but that the extended use of IRs in Cybersecurity SA practices aimed at finding vulnerabilities in source code remains underdeveloped.more » « less
-
The programming language Julia is designed to solve the 'two language problem', where developers who write scientific software can achieve desired performance, without sacrificing productivity. Since its inception in 2012, developers who have been using other programming languages have transitioned to Julia. A systematic investigation of the questions that developers ask about Julia can help in understanding the challenges that developers face while using Julia. Such understanding can be helpful (i) for toolsmiths who can construct tools so that developers can maximize their experience of using Julia, and (ii) for Julia language maintainers with empirical evidence on areas to improve the language as well as the Julia ecosystem. We conduct an empirical study with 3,093 Stack Overflow posts where we identify 13 categories of questions related to Julia-based software development. We observe developers to ask about a diverse set of topics, such as GC, Julia's garbage collector, JuMP, a domain-specific language constructed using Julia, and symbols, a metaprogramming utility in Julia. Based on our emerging results, we recommend enhancing support for developers with Julia-based tools and techniques for cross language transfer, type-related assistance, and package resolution.more » « less
An official website of the United States government

