

Title: A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits
In order to understand the state and evolution of the entirety of open source software we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that makes it easy to create clones, resulting in numerous repositories that are almost entirely based on some parent repository from which they were cloned. Git commits are unlikely to be produced independently and therefore represent a way to group cloned repositories. We use the World of Code infrastructure, containing approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories. We expect that the resulting map of related projects, as well as the tools and methods to handle the very large graph, will serve as a reference set for mining software projects and other applications. Further work is needed to determine different types of relationships among projects induced by shared commits and other relationships, for example, by shared source code or similar filenames.
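As an illustration of the clustering step described above, the following minimal sketch (not the authors' implementation) builds a small bipartite graph linking hypothetical commits to the repositories that contain them and runs the Louvain algorithm over it. The repository names, commit IDs, and the use of networkx are assumptions for illustration only.

    import networkx as nx
    from networkx.algorithms.community import louvain_communities

    # Hypothetical mapping: commit hash -> repositories containing that commit.
    commit_to_repos = {
        "c1": ["repoA", "repoA-fork"],
        "c2": ["repoA", "repoA-fork", "repoB"],
        "c3": ["repoC"],
    }

    # Build the bipartite commit-project graph: an edge means "commit appears in repo".
    G = nx.Graph()
    for commit, repos in commit_to_repos.items():
        G.add_node(commit, kind="commit")
        for repo in repos:
            G.add_node(repo, kind="repo")
            G.add_edge(commit, repo)

    # Louvain community detection; each community groups commits with the repos
    # that share them, approximating a cluster of related repositories.
    for i, group in enumerate(louvain_communities(G, seed=0)):
        repos = sorted(n for n in group if G.nodes[n]["kind"] == "repo")
        print(f"community {i}: {repos}")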
Award ID(s):
1925615 1901102 1633437
NSF-PAR ID:
10189286
Journal Name:
IEEE International Working Conference on Mining Software Repositories
ISSN:
2160-1860
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events table, which has the following schema (a query sketch follows the field descriptions below).

    CREATE TABLE events (
        version INTEGER,
        timestamp TEXT,
        provider TEXT,
        spec TEXT,
        origin TEXT,
        ref TEXT,
        guessed_ref TEXT
    );
    CREATE INDEX idx_timestamp ON events(timestamp);
    • version indicates the version of the record as assigned by Binder. The origin field became available with version 3, and the ref field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.
    • timestamp is the ISO timestamp of the launch.
    • provider gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub; other providers may differ.
    • spec gives the particular branch/release/commit being built. It consists of <github-id>/<repo>/<branch>.
    • origin indicates which backend was used. Each has its own storage, compute, etc., so this information might be important for evaluating caching and performance. Note that only recent records include this field. May be null.
    • ref specifies the git commit that was actually used, rather than the named branch referenced by spec. Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.
    • For records where ref is not available, we attempted to clone the named reference given by spec rather than the specific commit (see below). The guessed_ref field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-null ref) will exclude these guessed commits. May be null.
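    A minimal usage sketch (not part of the dataset itself), assuming binder.sqlite is in the working directory: count launches per provider among version-4 records where ref is populated.

    import sqlite3

    con = sqlite3.connect("binder.sqlite")
    rows = con.execute(
        """
        SELECT provider, COUNT(*) AS launches
        FROM events
        WHERE version >= 4 AND ref IS NOT NULL
        GROUP BY provider
        ORDER BY launches DESC
        """
    ).fetchall()
    for provider, launches in rows:
        print(provider, launches)  # e.g. GitHub is by far the most common
    con.close()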

    The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both ref and guessed_ref in the events table). The schema is as follows.

    CREATE TABLE spec_files (
        ref TEXT NOT NULL PRIMARY KEY,
        ls TEXT,
        runtime BLOB,
        apt BLOB,
        conda BLOB,
        pip BLOB,
        pipfile BLOB,
        julia BLOB,
        r BLOB,
        nix BLOB,
        docker BLOB,
        setup BLOB,
        postbuild BLOB,
        start BLOB
    );

    Here ref corresponds to ref and/or guessed_ref from the events table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The records in the database are simply the verbatim file contents, with no parsing or further processing performed (see the retrieval sketch after the list below).

    • runtime: runtime.txt
    • apt: apt.txt
    • conda: environment.yml
    • pip: requirements.txt
    • pipfile: Pipfile.lock or Pipfile
    • julia: Project.toml or REQUIRE
    • r: install.R
    • nix: default.nix
    • docker: Dockerfile
    • setup: setup.py
    • postbuild: postBuild
    • start: start
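    An illustrative retrieval sketch (the commit hash is a placeholder, not a real ref from the dataset): fetch the stored environment.yml contents for one commit from the conda column.

    import sqlite3

    con = sqlite3.connect("binder.sqlite")
    ref = "PLACEHOLDER_COMMIT_HASH"  # hypothetical; substitute a real ref/guessed_ref
    row = con.execute("SELECT conda FROM spec_files WHERE ref = ?", (ref,)).fetchone()
    if row is None or row[0] is None:
        print("no environment.yml stored for this ref")
    else:
        content = row[0]
        # contents are stored verbatim; decode if SQLite returns the BLOB as bytes
        text = content.decode("utf-8", errors="replace") if isinstance(content, bytes) else content
        print(text)
    con.close()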

    The ls field gives a metadata listing of the repo contents (excluding the .git directory). This field is JSON encoded with the following structure based on JSON types (see the traversal sketch after this list):

    • Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
    • String: symlink. The string value gives the link target.
    • Number: regular file. The number value gives the file size in bytes.
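    A minimal traversal sketch for the ls tree, following the typing rules above (directory = object, symlink = string, regular file = number); it sums regular-file sizes for a few repos.

    import json
    import sqlite3

    def total_size(node):
        """Sum regular-file sizes (bytes) in an ls tree."""
        if isinstance(node, dict):              # directory: recurse into its entries
            return sum(total_size(child) for child in node.values())
        if isinstance(node, (int, float)):      # regular file: value is its size in bytes
            return node
        return 0                                # string: symlink, contributes no size

    con = sqlite3.connect("binder.sqlite")
    for ref, ls in con.execute("SELECT ref, ls FROM spec_files WHERE ls IS NOT NULL LIMIT 5"):
        print(ref, total_size(json.loads(ls)), "bytes in regular files")
    con.close()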
    CREATE TABLE clean_specs (
        ref TEXT NOT NULL PRIMARY KEY,
        conda_channels TEXT,
        conda_packages TEXT,
        pip_packages TEXT,
        apt_packages TEXT
    );

    The clean_specs table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using the conda.models.match_spec.MatchSpec class included with the library form of Conda (distinct from the command line tool). Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.
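    A brief sketch of using the parsers mentioned above, assuming the stored entries are plain requirement strings; the conda parsing is left as a comment because it requires the conda library (not just the CLI) to be importable.

    import json
    import sqlite3
    from pkg_resources import Requirement  # shipped with setuptools

    con = sqlite3.connect("binder.sqlite")
    row = con.execute(
        "SELECT pip_packages FROM clean_specs WHERE pip_packages IS NOT NULL LIMIT 1"
    ).fetchone()
    if row:
        for spec in json.loads(row[0]):          # JSON encoded list of requirements
            req = Requirement.parse(spec)
            print(req.project_name, req.specs)   # normalized name and version constraints
    con.close()

    # Conda specs can be handled analogously with the library form of Conda:
    # from conda.models.match_spec import MatchSpec
    # MatchSpec("python=3.9").name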

    The missing table gives the repos that were not accessible, and event_logs records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.

     
  2. Evidence shows that developer reputation is extremely important when accepting pull requests or resolving reported issues. It is particularly salient in Free/Libre Open Source Software since the developers are distributed around the world, do not work for the same organization and, in most cases, never meet face to face. The existing solutions to expose developer reputation tend to be forge specific (GitHub), focus on activity instead of impact, do not leverage social or technical networks, and do not correct often misspelled developer identities. We aim to remedy this by amalgamating data from all public Git repositories, measuring the impact of developer work, exposing developers' collaborators, and correcting notoriously problematic developer identity data. We leverage World of Code (WoC), a collection of an almost complete (and continuously updated) set of Git repositories, by first allowing developers to select which of the 34 million (M) Git commit author IDs belong to them and then generating their profiles by treating the selected collection of IDs as that single developer. As a side effect, these selections serve as a training set for a supervised learning algorithm that merges multiple identity strings belonging to a single individual. As we evaluate the tool and the proposed impact measure, we expect to build on these findings to develop reputation badges that could be associated with pull requests and commits so developers could more easily trust and prioritize them.
  3. Open Source Software (OSS) forms an infrastructure on which numerous (often critical) software applications are based. Substantial research was done to investigate central projects such as the Linux kernel, but we have only a limited understanding of how the periphery of the larger OSS ecosystem is interconnected through technical dependencies, code sharing, and knowledge flows. We aim to close this gap by a) creating a nearly complete and rapidly updateable collection of version control data for FLOSS projects; b) cleaning, correcting, and augmenting the data to measure several types of dependencies among code, developers, and projects; and c) creating models that rely on the resulting supply chains to investigate structural and dynamic properties of the entire OSS. The current implementation is capable of being updated each month and occupies over 300Tb of disk space, with 1.5B commits and 12B Git objects. Highly accurate algorithms to correct identity data and extract dependencies from the source code are used to characterize the current structure of OSS and the way it has evolved. In particular, models of technology spread demonstrate the implicit factors developers use when choosing software components. We expect the resulting research platform will spur investigations into how the huge periphery in OSS both sustains and is sustained by the central OSS projects and, as a result, will increase the resiliency and effectiveness of the OSS.
  4.
    The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish another major goal, supporting modern search and browse capabilities for a large collection of tweets from the Twitter social media platform, web pages, and electronic theses and dissertations (ETDs). The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization is accomplished with containers in a pipelined fashion, whether in the cluster or on virtual machines, for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system supports text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by five teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. The teams on this project include three collection management groups -- Electronic Theses and Dissertations (ETD), Tweets (TWT), and Web-Pages (WP) -- as well as the Front-end (FE) group and the Integration (INT) group to help provide the overarching structure for the application. This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. Each container is a customized application environment that is specific to the needs of the corresponding team. Each team will have several of these containers set up in a pipeline formation to allow scaling and extension of the current system. The INT team also contributes to a cross-team effort for exploring the use of Elasticsearch and its internally associated database. The INT team administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and the Ceph filesystem. During formative stages of development, the INT team also has a role in guiding team evaluations of prospective container components and workflows. The INT team is responsible for the overall project architecture and facilitating the tools and tutorials that assist the other teams in deploying containers in a development environment according to mutual specifications agreed upon with each team. The INT team maintains the status of the Kubernetes cluster, deploying new containers and pods as needed by the collection management teams as they expand their workflows. This team is responsible for utilizing a continuous integration process to update existing containers. During the development stage the INT team collaborates specifically with the collection management teams to create the pipeline for the ingestion and processing of new collection documents, crossing services between those teams as needed. 
The INT team develops a reasoner engine to construct workflows with an information goal as input, which are then programmatically authored, scheduled, and monitored using Apache Airflow. The INT team is responsible for the flow, management, and logging of system performance data and making any adjustments necessary based on the analysis of testing results. The INT team has established a Gitlab repository for archival code related to the entire project and has provided the other groups with the documentation to deposit their code in the repository. This repository will be expanded using Gitlab CI in order to provide continuous integration and testing once it is available. Finally, the INT team will provide a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. The INT team will archive this distribution on the Virginia Tech Docker Container Registry and deploy it on the Virginia Tech CS Cloud. The INT-2020 team owes a sincere debt of gratitude to the work of the INT-2019 team. This is a very large undertaking and the wrangling of all of the products and processes would not have been possible without their guidance in both direct and written form. We have relied heavily on the foundation they and their predecessors have provided for us. We continue their work with systematic improvements, but also want to acknowledge their efforts. Without them, our progress to date would not have been possible.
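A minimal, hypothetical sketch of the kind of Airflow pipeline described above; the DAG id, task names, and callables are illustrative placeholders, not the team's actual code.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest_documents():
        print("fetch and preprocess new collection documents")  # placeholder step

    def index_documents():
        print("push processed documents into Elasticsearch")    # placeholder step

    with DAG(
        dag_id="collection_ingest_pipeline",  # hypothetical name
        start_date=datetime(2020, 1, 1),
        schedule_interval=None,               # assume the reasoner engine triggers runs
        catchup=False,
    ) as dag:
        ingest = PythonOperator(task_id="ingest", python_callable=ingest_documents)
        index = PythonOperator(task_id="index", python_callable=index_documents)
        ingest >> index                       # ingestion must finish before indexing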
  5. Designing and maintaining a software system’s architecture typically involve making numerous design decisions, each potentially affecting the system’s functional and nonfunctional properties. Understanding these design decisions can help inform future decisions and implementation choices and can avoid introducing regressions and architectural inefficiencies later. Unfortunately, design decisions are rarely well documented and are typically a lost artifact of the architecture creation and maintenance process. The loss of this information can thus hurt development. To address this shortcoming, we develop RecovAr, a technique for automatically recovering design decisions from the project’s readily available history artifacts, such as an issue tracker and version control repository. RecovAr uses state-of-the-art architectural recovery techniques on a series of version control commits and maps those commits to issues to identify decisions that affect system architecture. While some decisions can still be lost through this process, our evaluation on Hadoop and Struts, two large open-source systems with over 8 years of development each and, on average, more than 1 million lines of code, shows that RecovAr has a recall of 75% and a precision of 77%. Our work formally defines architectural design decisions and develops an approach for tracing such decisions in project histories. Additionally, the work introduces methods to classify whether decisions are architectural and to map decisions to code elements. Finally, our work contributes a methodology engineers can follow to preserve design-decision knowledge in their projects.