skip to main content

Title: Detecting and Characterizing Bots that Commit Code
Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the ommits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created.  more » « less
Award ID(s):
1633437 1901102 1925615
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
IEEE International Working Conference on Mining Software Repositories
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Bots help automate many of the tasks performed by software developers and are widely used to commit code in various social coding platforms. At present, it is not clear what types of activities these bots perform and understanding it may help design better bots, and find application areas which might benefit from bot adoption. Aim: We aim to categorize the Bot Commits by the type of change (files added, deleted, or modified), find the more commonly changed file types, and identify the groups of file types that tend to get updated together. Method: 12,326,137 commits made by 461 popular bots (that made at least 1000 commits) were examined to identify the frequency and the type of files added/ deleted/ modified by the commits, and association rule mining was used to identify the types of files modified together. Result: Majority of the bot commits modify an existing file, a few of them add new files, while deletion of a file is very rare. Commits involving more than one type of operation are even rarer. Files containing data, configuration, and documentation are most frequently updated, while HTML is the most common type in terms of the number of files added, deleted, and modified. Files of the type "Markdown", "Ignore List", "YAML", "JSON" were the types that are updated together with other types of files most frequently. Conclusion: We observe that majority of bot commits involve single file modifications, and bots primarily work with data, configuration, and documentation files. A better understanding if this is a limitation of the bots and, if overcome, would lead to different kinds of bots remains an open question. 
    more » « less
  2. Transparent environments and social-coding platforms asGitHub help developers to stay abreast of changes during the development and maintenance phase of a project. Especially, notification feeds can help developers to learn about relevant changes in other projects. Unfortunately, transparent environments can quickly overwhelm developers with too many notifications, such that they lose the important ones in a sea of noise. Complementing existing prioritization and filtering strategies based on binary compatibility and code ownership, we develop an anomaly detection mechanism to identify unusual commits in a repository, which stand out with respect to other changes in the same repository or by the same developer. Among others, we detect exceptionally large commits, commits at unusual times, and commits touching rarely changed file types given the characteristics of a particular repository or developer. We automatically flag unusual commits on GitHub through a browser plug-in. In an interactive survey with 173 active GitHub users, rating commits in a project of their interest, we found that, although our unusual score is only a weak predictor of whether developers want to be notified about a commit, information about unusual characteristics of a commit changes how developers regard commits. Our anomaly detection mechanism is a building block for scaling transparent environments. 
    more » « less
  3. Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events table, which has the following schema.

    CREATE TABLE events( version INTEGER, timestamp TEXT, provider TEXT, spec TEXT, origin TEXT, ref TEXT, guessed_ref TEXT ); CREATE INDEX idx_timestamp ON events(timestamp);
    • version indicates the version of the record as assigned by Binder. The origin field became available with version 3, and the ref field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.
    • timestamp is the ISO timestamp of the launch
    • provider gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub, other providers may differ.
    • spec gives the particular branch/release/commit being built. It consists of <github-id>/<repo>/<branch>.
    • origin indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance. Note that only recent records include this field. May be null.
    • ref specifies the git commit that was actually used, rather than the named branch referenced by spec. Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.
    • For records where ref is not available, we attempted to clone the named reference given by spec rather than the specific commit (see below). The guessed_ref field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-null ref) will exclude these guessed commits. May be null.

    The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both ref and guessed_ref in the events table). The schema is as follows.

    CREATE TABLE spec_files ( ref TEXT NOT NULL PRIMARY KEY, ls TEXT, runtime BLOB, apt BLOB, conda BLOB, pip BLOB, pipfile BLOB, julia BLOB, r BLOB, nix BLOB, docker BLOB, setup BLOB, postbuild BLOB, start BLOB );

    Here ref corresponds to ref and/or guessed_ref from the events table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The records in the database are simply the verbatim file contents, with no parsing or further processing performed.

    • runtime: runtime.txt
    • apt: apt.txt
    • conda: environment.yml
    • pip: requirements.txt
    • pipfile: Pipfile.lock or Pipfile
    • julia: Project.toml or REQUIRE
    • r: install.R
    • nix: default.nix
    • docker: Dockerfile
    • setup:
    • postbuild: postBuild
    • start: start

    The ls field gives a metadata listing of the repo contents (excluding the .git directory). This field is JSON encoded with the following structure based on JSON types:

    • Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
    • String: symlink. The string value gives the link target.
    • Number: regular file. The number value gives the file size in bytes.
    CREATE TABLE clean_specs ( ref TEXT NOT NULL PRIMARY KEY, conda_channels TEXT, conda_packages TEXT, pip_packages TEXT, apt_packages TEXT );

    The clean_specs table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using the conda.models.match_spec.MatchSpec class included with the library form of Conda (distinct from the command line tool). Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.

    The missing table gives the repos that were not accessible, and event_logs records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.

    more » « less
  4. Code changes are often reviewed before they are deployed. Popular source control systems aid code review by presenting textual differences between old and new versions of the code, leaving developers with the difficult task of determining whether the differences actually produced the desired behavior. Fortunately, we can mine such information from code repositories. We propose aiding code review with inter-version semantic differential analysis. During review of a new commit, a developer is presented with summaries of both code differences and behavioral differences, which are expressed as diffs of likely invariants extracted by running the system's test cases. As a result, developers can more easily determine that the code changes produced the desired effect. We created an invariant-mining tool chain, GETTY, to support our concept of semantically-assisted code review. To validate our approach, 1) we applied GETTY to the commits of 6 popular open source projects, 2) we assessed the performance and cost of running GETTY in different configurations, and 3) we performed a comparative user study with 18 developers. Our results demonstrate that semantically-assisted code review is feasible, effective, and that real programmers can leverage it to improve the quality of their reviews. 
    more » « less
  5. Nigel Bosch ; Antonija Mitrovic ; Agathe Merceron (Ed.)
    Demand for education in Computer Science has increased markedly in recent years. With increased demand has come to an increased need for student support, especially for courses with large programming projects. Instructors commonly provide online post forums or office hours to address this massive demand for help requests. Identifying what types of questions students are asking in those interactions and what triggers their help requests can in turn assist instructors in better managing limited help-providing resources. In this study, we aim to explore students’ help-seeking actions from the two separate approaches we mentioned before and investigate their coding actions before help requests to understand better what motivates students to seek help in programming projects. We collected students’ help request data and commit logs from two Fall offerings of a CS2 course. In our analysis, we first believe that different types of questions should be related to different behavioral patterns. Therefore, we first categorized students’ help requests based on their content (e.g., Implementation, General Debugging, or Addressing Teaching Staff (TS) Test Failures). We found that General Debugging is the most frequently asked question. Then we analyzed how the popularity of each type of request changed over time. Our results suggest that implementation is more popular in the early stage of the project cycle, and it changes to General Debugging and Addressing TS Failures in the later stage. We also calculated the accuracy of students’ commit frequency one hour before their help requests; the results show that before Implementation requests, the commit frequency is significantly lower, and before TS failure requests, the frequency is significantly higher. Moreover, we checked before any help request whether students changed their source code or test code. The results show implementation requests related to higher chances of source code changes and coverage questions related to more test code changes. Moreover, we use a Markov Chain model to show students’ action sequences before, during, and after the requests. And finally, we explored students’ progress after the office hours interaction and found that over half of the students improved the correctness of their code after 20 minutes of their office hours interaction addressing TS failures ends. 
    more » « less