NSF PAR Search | NSF Public Access Repository

A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners

https://doi.org/10.1109/CAIN58948.2023.00034

Nahar, Nadia; Zhang, Haoran; Lewis, Grace; Zhou, Shurui; Kästner, Christian (May 2023, 2023 IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN))

Incorporating machine learning (ML) components into software products raises new software-engineering challenges and exacerbates existing ones. Many researchers have invested significant effort in understanding the challenges that industry practitioners face when building products with ML components, through interviews and surveys with practitioners. To aggregate and present their collective findings, we conduct a meta-summary study: Using guidelines for systematic literature reviews, we collect 50 relevant papers that together interacted with over 4758 practitioners. We then collect, group, and organize the more than 500 mentions of challenges within those papers. We highlight the most commonly reported challenges and hope this meta-summary will be a useful resource for the research community to prioritize research and education in this field.
Full Text Available
Elevating Jupyter Notebook Maintenance Tooling by Identifying and Extracting Notebook Structures

https://doi.org/10.1109/ICSME55016.2022.00047

Jiang, Yuan; Kastner, Christian; Zhou, Shurui (October 2022, 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME))

Data analysis is an exploratory, interactive, and often collaborative process. Computational notebooks have become a popular tool to support this process, among other reasons because of their ability to interleave code, narrative text, and results. However, notebooks in practice are often criticized as hard to maintain and of low code quality, with problems such as unused or duplicated code and out-of-order code execution. Data scientists can benefit from better tool support when maintaining and evolving notebooks. We argue that central to such tool support is identifying the structure of notebooks. We present a lightweight and accurate approach to extract notebook structure and outline several ways such structure can be used to improve maintenance tooling for notebooks, including navigation and finding alternatives.
Full Text Available
Aspirations and Practice of ML Model Documentation: Moving the Needle with Nudging and Traceability

https://doi.org/10.1145/3544548.3581518

Bhat, Avinash; Coursey, Austin; Hu, Grace; Li, Sixian; Nahar, Nadia; Zhou, Shurui; Kästner, Christian; Guo, Jin L.C. (April 2023, CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems)

The documentation practice for machine-learned (ML) models often falls short of established practices for traditional software, which impedes model accountability and inadvertently abets inappropriate use or misuse of models. Recently, model cards, a proposal for model documentation, have attracted notable attention, but their impact on actual practice is unclear. In this work, we systematically study model documentation in the field and investigate how to encourage more responsible and accountable documentation practice. Our analysis of publicly available model cards reveals a substantial gap between the proposal and the practice. We then design a tool named DocML aiming to (1) nudge data scientists to comply with the model cards proposal during model development, especially the sections related to ethics, and (2) assess and manage the documentation quality. A lab study reveals the benefit of our tool towards long-term documentation quality and accountability.
Full Text Available
Collaboration challenges in building ML-enabled systems: communication, documentation, engineering, and process

https://doi.org/10.1145/3510003.3510209

Nahar, Nadia; Zhou, Shurui; Lewis, Grace; Kästner, Christian (May 2022, ICSE '22: Proceedings of the 44th International Conference on Software Engineering)

The introduction of machine learning (ML) components in software projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration can always be challenging, ML introduces additional challenges with its exploratory model development process, additional skills and knowledge needed, difficulties testing ML systems, need for continuous evolution and monitoring, and non-traditional quality requirements such as fairness and explainability. Through interviews with 45 practitioners from 28 organizations, we identified key collaboration challenges that teams face when building and deploying ML systems into production. We report on common collaboration points in the development of production ML systems for requirements, data, and integration, as well as corresponding team patterns and challenges. We find that most of these challenges center around communication, documentation, engineering, and process, and collect recommendations to address these challenges.
Full Text Available
Subtle Bugs Everywhere: Generating Documentation for Data Wrangling Code

https://doi.org/10.1109/ASE51524.2021.9678520

Yang, Chenyang; Zhou, Shurui; Guo, Jin L.C.; Kästner, Christian (November 2021, Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Los Alamitos, CA: IEEE Computer Society)

Data scientists reportedly spend 60 to 80 percent of their time in their daily routines on data wrangling, i.e., cleaning data and extracting features. However, data wrangling code is often repetitive and error-prone to write. Moreover, it is easy to introduce subtle bugs when reusing and adopting existing code, bugs that do not cause crashes but reduce model quality. To support data scientists with data wrangling, we present a technique to generate interactive documentation for data wrangling code. We use (1) program synthesis techniques to automatically summarize data transformations and (2) test case selection techniques to purposefully select representative examples from the data based on execution information collected with tailored dynamic program analysis. We demonstrate that a JupyterLab extension with our technique can provide documentation for many cells in popular notebooks and find in a user study that users with our plugin are faster and more effective at finding realistic bugs in data wrangling code.
Full Text Available
How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub

https://doi.org/10.1145/3377811.3380412

Zhou, Shurui; Vasilescu, Bogdan; Kaestner, Christian (July 2020, Proceedings of the 42nd International Conference on Software Engineering (ICSE))

The notion of forking has changed with the rise of distributed version control systems and social coding environments, like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call hard forks); research on hard forks, conducted mostly in pre-GitHub days, showed that hard forks were often viewed critically because they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call social forks), which may have also influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among other things, hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions of hard forks have indeed changed dramatically, with hard forks now often seen as a positive, noncompetitive alternative to the original project.
Full Text Available
What the fork: a study of inefficient and efficient forking practices in social coding

https://doi.org/10.1145/3338906.3338918

Zhou, Shurui; Vasilescu, Bogdan; Kästner, Christian (September 2019, Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering)

Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanism, giving developers the flexibility to modify their own fork without affecting others before attempting to contribute back. However, not all projects use forks efficiently; many experience lost and duplicate contributions and fragmented communities. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. First, we observed that different communities experience these inefficiencies to widely different degrees and interviewed practitioners to understand why. Then, using multiple regression modeling, we analyzed which context factors correlate with fewer inefficiencies. We found that better modularity and centralized management are associated with more contributions and a higher fraction of accepted pull requests, suggesting specific best practices that project maintainers can adopt to reduce forking-related inefficiencies in their communities.
Full Text Available
Identifying Redundancies in Fork-based Development

https://doi.org/10.1109/SANER.2019.8668023

Ren, Luyao; Zhou, Shurui; Kastner, Christian; Wasowski, Andrzej (February 2019, 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER))

Fork-based development is popular and easy to use, but it makes it difficult to maintain an overview of the whole community as the number of forks increases. This may lead to redundant development, where multiple developers solve the same problem in parallel without being aware of each other. Redundant development wastes effort for both maintainers and developers. In this paper, we designed an approach to identify redundant code changes in forks as early as possible by extracting clues indicating similarities between code changes and building a machine learning model to predict redundancies. We evaluated the effectiveness from both the maintainer's and the developer's perspectives. The results show that we achieve 57-83% precision for detecting duplicate code changes from the maintainer's perspective, and that we could save developers 1.9-3.0 commits of effort on average. We also show that our approach significantly outperforms the existing state of the art.
Full Text Available
Perceptions of open‐source software developers on collaborations: An interview and survey study

https://doi.org/10.1002/smr.2393

Constantino, Kattiana; Souza, Mauricio; Zhou, Shurui; Figueiredo, Eduardo; Kästner, Christian (October 2021, Journal of Software: Evolution and Process)

With the emergence of social coding platforms, collaboration has become a key and dynamic aspect of the success of software projects. On such platforms, developers have to collaborate and deal with issues of collaboration in open-source software development. Although collaboration is challenging, collaborative development produces better software systems than any developer could produce alone. Several approaches have investigated collaboration challenges, for instance, by proposing or evaluating models and tools to support collaborative work. Despite the undeniable importance of the existing efforts in this direction, there are few works on collaboration from the perspective of developers. In this work, we aim to investigate the perceptions of open-source software developers on collaborations, such as motivations, techniques, and tools to support global, productive, and collaborative development. Following an ad hoc literature review and an exploratory interview study with 12 open-source software developers from GitHub, our novel approach to this problem also relies on an extensive survey with 121 developers to confirm or refute the interview results. We found different collaborative contributions, such as managing change requests. In addition, we observed that most collaborators prefer to collaborate with the core team instead of their peers. We also found that most collaboration happens in software development (60%) and maintenance (47%) tasks. Furthermore, despite personal preferences to work independently, developers still consider collaborating with others in specific task categories, for instance, software development. Finally, developers also expressed the importance of social coding platforms, such as GitHub, in supporting maintainers and contributors in making decisions and carrying out project tasks. Therefore, these findings may help project leaders optimize the collaborations among developers and reduce entry barriers. Moreover, these findings may support the project collaborators in understanding the collaboration process and engaging others in the project.
Full Text Available
Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem

https://doi.org/10.1145/3180155.3180209

Trockman, Asher; Zhou, Shurui; Kästner, Christian; Vasilescu, Bogdan (January 2018, International Conference on Software Engineering)

Full Text Available
