skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Experiences on Clustering High-Dimensional Data using pbdR
Motivation: Software engineering for High Performace Computing (HPC) environments in general [1] and for big data in particular [5] faces a set of unique challenges including high complexity of middleware and of computing environments. Tools that make it easier for scientists to utilize HPC are, therefore, of paramount importance. We provide an experience report of using one of such highly effective middleware pbdR [9] that allow the scientist to use R programming language without, at least nominally, having to master many layers of HPC infrastructure, such as OpenMPI [4] and ScalaPACK [2]. Objective: to evaluate the extent to which middleware helps improve scientist productivity, we use pbdR to solve a real problem that we, as scientists, are investigating. Our big data comes from the commits on GitHub and other project hosting sites and we are trying to cluster developers based on the text of these commit messages. Context: We need to be able to identify developer for every commit and to identify commits for a single developer. Developer identifiers in the commits, such as login, email, and name are often spelled in multiple ways since that information may come from different version control systems (Git, Mercurial, SVN, ...) and may depend on which computer is used (what is specified in .git/config of the home folder). Method: We train Doc2Vec [7] model where existing credentials are used as a document identifier and then use the resulting 200-dimensional vectors for the 2.3M identifiers to cluster these identifiers so that each cluster represents a specific individual. The distance matrix occupies 32TB and, therefore, is a good target for HPC in general and pbdR in particular. pbdR allows data to be distributed over computing nodes and even has implemented K-means and mixture-model clustering techniques in the package pmclust. Results: We used strategic prototyping [3] to evaluate the capabilities of pbdR and discovered that a) the use of middleware required extensive understanding of its inner workings thus negating many of the expected benefits; b) the implemented algorithms were not suitable for the particular combination of n, p, and k (sample size, data dimension, and the number of clusters); c) the development environment based on batch jobs increases development time substantially. Conclusions: In addition to finding from Basili et al., we find that the quality of the implementation of HPC infrastructure and its development environment has a tremendous effect on development productivity.  more » « less
Award ID(s):
1633437
PAR ID:
10106645
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational and Data-enabled Science & Engineering
Page Range / eLocation ID:
9 to 12
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Transparent environments and social-coding platforms asGitHub help developers to stay abreast of changes during the development and maintenance phase of a project. Especially, notification feeds can help developers to learn about relevant changes in other projects. Unfortunately, transparent environments can quickly overwhelm developers with too many notifications, such that they lose the important ones in a sea of noise. Complementing existing prioritization and filtering strategies based on binary compatibility and code ownership, we develop an anomaly detection mechanism to identify unusual commits in a repository, which stand out with respect to other changes in the same repository or by the same developer. Among others, we detect exceptionally large commits, commits at unusual times, and commits touching rarely changed file types given the characteristics of a particular repository or developer. We automatically flag unusual commits on GitHub through a browser plug-in. In an interactive survey with 173 active GitHub users, rating commits in a project of their interest, we found that, although our unusual score is only a weak predictor of whether developers want to be notified about a commit, information about unusual characteristics of a commit changes how developers regard commits. Our anomaly detection mechanism is a building block for scaling transparent environments. 
    more » « less
  2. Evidence shows that developer reputation is extremely important when accepting pull requests or resolving reported issues. It is particularly salient in Free/Libre Open Source Software since the developers are distributed around the world, do not work for the same organization and, in most cases, never meet face to face. The existing solutions to expose developer reputation tend to be forge specific (GitHub), focus on activity instead of impact, do not leverage social or technical networks, and do not correct often misspelled developer identities. We aim to remedy this by amalgamating data from all public Git repositories, measuring the impact of developer work, expose developer's collaborators, and correct notoriously problematic developer identity data. We leverage World of Code (WoC), a collection of an almost complete (and continuously updated) set of Git repositories by first allowing developers to select which of the 34 million(M) Git commit author IDs belong to them and then generating their profiles by treating the selected collection of IDs as that single developer. As a side-effect, these selections serve as a training set for a supervised learning algorithm that merges multiple identity strings belonging to a single individual. As we evaluate the tool and the proposed impact measure, we expect to build on these findings to develop reputation badges that could be associated with pull requests and commits so developers could easier trust and prioritize them. 
    more » « less
  3. Critical open-source projects form the basis of many large software systems. They provide trusted and extensible implementations of important functionality for cryptography, compatibility, and security. Verifying commit authorship authenticity in open-source projects is essential and challenging. Git users can freely configure author details such as names and email addresses. Platforms like GitHub use such information to generate profile links to user accounts. We demonstrate three attack scenarios malicious actors can use to manipulate projects and profiles on GitHub to appear trustworthy. We designed a mixed-research study to assess the effect on critical open-source software projects and evaluated countermeasures. First, we conducted a large-scale measurement among 50,328 critical open-source projects on GitHub and demonstrated that contribution workflows can be abused in 85.9% of the projects. We identified 573,043 email addresses that a malicious actor can claim to hijack historic contributions and improve the trustworthiness of their accounts. When looking at commit signing as a countermeasure, we found that the majority of users (95.4%) never signed a commit, and for the majority of projects (72.1%), no commit was ever signed. In contrast, only 2.0% of the users signed all their commits, and for 0.2% of the projects all commits were signed. Commit signing is not associated with projects’ programming languages, topics, or other security measures. Second, we analyzed online security advice to explore the awareness of contributor spoofing and identify recommended countermeasures. Most documents exhibit awareness of the simple spoofing technique via Git commits but no awareness of problems with GitHub’s handling of email addresses. 
    more » « less
  4. Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the ommits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created. 
    more » « less
  5. Middleware is required to support and interface multi-modal Dynamic Data Driven Application Systems (DDDAS) with back-end and other computing facilities. Middleware is also needed to support distributed simulations and emulations needed in earlier phases of system development. This work describes the Green Runtime Infrastructure (G-RTI), an energy-efficient client server based middleware developed to support distributed DDDAS simulation, emulation and deployment. G-RTI eases and accelerates the development and testing of multi-modal studies, testbeds and DDDAS systems. It serves as a platform for research in energy reduction techniques for middleware services. The services implemented by G-RTI are described and results of benchmarking studies are reported. Its application is demonstrated through a use-case for an end-to-end implementation of a connected vehicle application. G-RTI is open source. 
    more » « less