Title: Assessing Perceived Sentiment in Pull Requests with Emoji: Evidence from Tools and Developer Eye Movements
The paper presents an eye-tracking pilot study of how developers read and assess sentiment in twenty-four GitHub pull requests containing emoji, randomly selected from five different open source applications. Gaze data was collected on various elements of the pull request page in Google Chrome while the developers determined perceived sentiment. The developer-perceived sentiment was compared with the output of five state-of-the-art sentiment analysis tools. SentiStrength-SE had the highest performance, with study participants agreeing with 55.56% of its predictions. Stanford CoreNLP fared the worst, with only 5.56% of its predictions matching those of the participants. Gaze data shows that the three areas developers looked at most were the comment body, added lines of code, and username (the person writing the comment). The results also show that emoji in the pull request comment body received high attention compared to the rest of the comment text. These results can help provide additional guidelines on the pull request review process.
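The agreement figures in the abstract are the share of a tool's predictions that match the sentiment perceived by participants. A minimal Python sketch of that calculation follows, assuming one label per item from each source. The function name `agreement_rate` and the labels are hypothetical and are not the study's data or code.

```python
def agreement_rate(tool_labels, human_labels):
    """Return the percentage of items where the tool's label equals the human label."""
    if len(tool_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not tool_labels:
        raise ValueError("label lists must not be empty")
    # Count positions where both sources assigned the same sentiment label.
    matches = sum(t == h for t, h in zip(tool_labels, human_labels))
    return 100.0 * matches / len(tool_labels)


# Hypothetical labels for illustration only (not the study's data).
tool = ["positive", "neutral", "negative", "positive"]
human = ["positive", "positive", "negative", "neutral"]
print(f"{agreement_rate(tool, human):.2f}%")  # prints 50.00%
```

Exact label matching is the simplest possible agreement measure; it does not correct for agreement expected by chance.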
Award ID(s):
1855756
NSF-PAR ID:
10267633
Journal Name:
2021 IEEE/ACM Sixth International Workshop on Emotion Awareness in Software Engineering (SEmotion)
Page Range / eLocation ID:
1 to 6
Sponsoring Org:
National Science Foundation
More Like this
  1. Developers in open source projects must make decisions on contributions from other community members, such as whether or not to accept a pull request. However, secondary factors beyond the code itself can influence those decisions. For example, signals from GitHub profiles, such as number of followers, activity, names, or gender, can also be considered when developers make decisions. In this paper, we examine how developers use these signals (or not) when making decisions about code contributions. To address this question, we evaluate how signals related to perceived gender identity and code quality influenced decisions on accepting pull requests. Unlike previous work, we analyze this decision process with data collected from an eye tracker. We analyzed differences between the signals developers said are important for themselves and the signals they actually used to make decisions about others. We found that after the code snippet (x=57%), the place where programmers spent the second-most time fixating was supplemental technical signals (x=32%), such as previous contributions and popular repositories. Diverging from what participants self-reported, we also found that programmers fixated on social signals more than they recalled.
  2. An eye-tracking study of 18 developers reading and summarizing Java methods is presented. The developers provide a written summary for methods assigned to them. In total, 63 methods are used from five different systems. Previous studies on this topic use only short methods presented in isolation usually as images. In contrast, this work presents the study in the Eclipse IDE allowing access to all the source code in the system. The developer can navigate via scrolling and switching files while writing the summary. New eye-tracking infrastructure allows for this improvement in the study environment. Data collected includes eye gazes on source code, written summaries, and time to complete each summary. Unlike prior work that concluded developers focus on the signature the most, these results indicate that they tend to focus on the method body more than the signature. Moreover, both experts and novices tend to revisit control flow terms rather than reading them for a long period. They also spend a significant amount of gaze time and have higher gaze visits when they read call terms. Experts tend to revisit the body of the method significantly more frequently than its signature as the size of the method increases. Moreover, experts tend to write their summaries from source code lines that they read the most. 
  3. Unit testing focuses on verifying the functions of individual units of a software system. It is challenging due to the high interdependencies among software units. Developers address this by mocking: replacing the dependency with a “fake” object. Despite the existence of powerful, dedicated mocking frameworks, developers often turn to a “hand-rolled” approach based on inheritance. That is, they create a subclass of the dependent class and mock its behavior through method overriding. However, this requires tedious implementation and compromises the design quality of unit tests. This work contributes a fully automated refactoring framework to identify and replace the usage of inheritance by using Mockito, a well-received mocking framework. Our approach is built upon the empirical experience from five open source projects that use inheritance for mocking. We evaluate our approach on nine other projects. Results show that our framework is efficient, generally applicable to new datasets, mostly preserves test case behaviors in detecting defects (in the form of mutants), and decouples test code from production code. The qualitative evaluation by experienced developers suggests that the auto-refactoring solutions generated by our framework improve the quality of the unit test cases in various aspects, such as making test conditions more explicit and improving the cohesion, readability, understandability, and maintainability of test cases. Finally, we submit 23 pull requests containing our refactoring solutions to the open source projects. Of these, 9 requests are accepted/merged, 6 requests are rejected, and the remaining requests are pending (5 requests), have unexpected exceptions (2 requests), or are undecided (1 request). In particular, among the 21 open source developers involved in the reviewing process, 81% give positive votes. This indicates that our refactoring solutions are quite well received by the open source projects and developers.
  4. Background: Multiple strategies can be used when self-monitoring diet, physical activity, and perceived stress, but no gold standards are available. Although self-monitoring is a core element of self-management and behavior change, the success of mHealth behavioral tools depends on their validity and reliability, which lack evidence. African American and Latina mothers in the United States are high-priority populations for apps that can be used for self-monitoring of diet, physical activity, and stress because the body mass index (BMI) of mothers typically increases for several years after childbirth and the risks of obesity and its sequelae are elevated among minority populations. Objective: To examine the intermethod reliability and concurrent validity of smartphone-based self-monitoring via ecological momentary assessments (EMAs) and use of daily diaries for diet, stress, and physical activity compared with brief recall measures, anthropometric biomeasures, and bloodspot biomarkers. Methods: A purposive sample (n=42) of primarily African American (16/42, 39%) and Latina (18/42, 44%) mothers was assigned Android smartphones for using Ohmage apps to self-monitor diet, perceived stress, and physical activity over 6 months. Participants were assessed at 3- and 6-month follow-ups. Recall measures included brief food frequency screeners, physical activity assessments adapted from the National Health and Nutrition Examination Survey, and the nine-item psychological stress measure. Anthropometric biomeasures included BMI, body fat, waist circumference, and blood pressure. Bloodspot assays for Epstein–Barr virus and C-reactive protein were used as systemic load and stress biomarkers. EMAs and daily diary questions assessed perceived quality and quantity of meals, perceived stress levels, and moderate, vigorous, and light physical activity. Units of analysis were follow-up assessments (n=29 to n=45 depending on the domain) of the participants (n=29 with sufficient data for analyses). Correlations, R² statistics, and multivariate linear regressions were used to assess the strength of associations between variables. Results: Almost all participants (39/42, 93%) completed the study. Intermethod reliability between smartphone-based EMAs and diary reports and their corresponding recall reports was highest for stress and diet; correlations ranged from .27 to .52 (P<.05). However, it was unexpectedly low for physical activity; no significant associations were observed. Concurrent validity was demonstrated for diet EMAs and diary reports on systolic blood pressure (r=−.32), C-reactive protein level (r=−.34), and moderate and vigorous physical activity recalls (r=.35 to .48), suggesting a covariation between healthy diet and physical activity behaviors. EMAs and diary reports on stress were not associated with Epstein–Barr virus and C-reactive protein level. Diary reports on moderate and vigorous physical activity were negatively associated with BMI and body fat (r=−.35 to −.44, P<.05). Conclusions: Brief smartphone-based EMA use may be valid and reliable for long-term self-monitoring of diet, stress, and physical activity. Lack of intermethod reliability for physical activity measures is consistent with prior research, warranting more research on the efficacy of smartphone-based self-monitoring of self-management and behavior change support.
  5. Abstract: Observational data collection is extremely hazardous in supercell storm environments, which makes for a scarcity of data used for evaluating the storm-scale guidance from convection-allowing models (CAMs) like the National Oceanic and Atmospheric Administration (NOAA) Warn-on-Forecast System (WoFS). The Targeted Observations with UAS and Radar of Supercells (TORUS) 2019 field mission provided a rare opportunity to not only collect these observations, but to do so with advanced technology: vertically pointing Doppler lidar. One standing question for WoFS is how the system forecasts the feedback between supercells and their near-storm environment. The lidar can observe vertical profiles of wind over time, creating unique datasets to compare to WoFS kinematic predictions in rapidly evolving severe weather environments. Mobile radiosonde data are also presented to provide a thermodynamic comparison. The five lidar deployments analyzed (three of which observed tornadic supercells) show that WoFS accurately predicted general kinematic trends in the inflow environment; however, the predicted feedback between the supercell and its environment, which resulted in enhanced inflow and larger storm-relative helicity (SRH), was muted relative to observations. The radiosonde observations reveal an overprediction of CAPE in WoFS forecasts, both in the near and far field, with an inverse relationship between the CAPE errors and distance from the storm. Significance Statement: It is difficult to evaluate the accuracy of weather prediction model forecasts of severe thunderstorms because observations are rarely available near the storms. However, the TORUS 2019 field experiment collected multiple specialized observations in the near-storm environment of supercells, which are compared to the same near-storm environments predicted by the National Oceanic and Atmospheric Administration (NOAA) Warn-on-Forecast System (WoFS) to gauge its performance. Unique to this study is the use of mobile Doppler lidar observations in the evaluation; lidar can retrieve the horizontal winds in the few kilometers above ground on time scales of a few minutes. Using lidar and radiosonde observations in the near-storm environment of three tornadic supercells, we find that WoFS generally predicts the expected trends in the evolution of the near-storm wind profile, but the response is muted compared to observations. We also find an inverse relationship between errors in instability and distance from the storm. These results can aid model developers in refining model physics to better predict severe storms.
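Item 3 above contrasts "hand-rolled" mocking through inheritance with mocking through a dedicated framework. The following is a minimal sketch of that contrast, not the refactoring framework from that paper. It assumes Python's standard `unittest.mock` as a stand-in for Mockito (a Java library), and the classes `PaymentGateway`, `Checkout`, and `FakeGateway` are hypothetical.

```python
from unittest.mock import Mock


class PaymentGateway:
    """Hypothetical dependency that would call a remote service in production."""

    def charge(self, amount):
        raise RuntimeError("network call not available in unit tests")


class Checkout:
    """Hypothetical unit under test; it depends on a PaymentGateway."""

    def __init__(self, gateway):
        self.gateway = gateway

    def pay(self, amount):
        return "paid" if self.gateway.charge(amount) else "declined"


# Hand-rolled approach: subclass the dependency and override its method.
class FakeGateway(PaymentGateway):
    def charge(self, amount):
        return True


# Framework approach: the mocking library builds the stand-in object,
# restricted by `spec` to the attributes PaymentGateway actually has.
mock_gateway = Mock(spec=PaymentGateway)
mock_gateway.charge.return_value = True

hand_rolled_result = Checkout(FakeGateway()).pay(10)
framework_result = Checkout(mock_gateway).pay(10)

# The framework mock also records calls, so the interaction can be verified.
mock_gateway.charge.assert_called_once_with(10)
```

Both variants yield the same result for the unit under test. The framework version needs no extra subclass in the test code and states the stubbed behavior and the expected interaction explicitly.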