Open-source software (OSS) has become essential to knowledge production and innovation in both the academic and business sectors around the globe. OSS is developed by a variety of entities and is considered a “unique scholarly activity” due to the complexity of scientific computational tasks and the necessity of cooperation and transparency in research methodology. While the developers of OSS are thought to be widely distributed, many questions remain about who these contributors are, which countries, sectors, and organizations contribute the most, and how they influence each other. Using data collected on Python and R packages from GitHub, we leverage fractional-counting methods to measure the exact contribution of each developer and use weighted counting based on the lines of code added by each developer to accurately sum the contribution of countries. We find that for both Python and R, developers from a small group of top countries account for a considerable share of code additions. Developers from the top 10 countries, which include the United States, Germany, the United Kingdom, France, and China, account for 76.1% of the total R repositories and 66.6% of Python repositories. Next, we use the dependency relationships between packages and study the pairwise connections between countries to measure their respective impact, finding that the packages attributed to the United States are most frequently reused by packages from Germany, Spain, Italy, Australia, and the United Kingdom, based on the total dependency fractions. In parallel, the United States mostly uses packages from Germany, France, and Denmark. Influential contributors to OSS can strongly shape the priorities and practices of scientific research when their work is widely used or built upon by other researchers. In this context, studying the global distribution, collaboration, and impact of the contributors is important to understanding the landscape of innovation in scientific research.
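The fractional counting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each repository carries one unit of credit, split among developers in proportion to lines of code added and then summed by country. The function names, the toy data, and the "Unknown" bucket are hypothetical, and the product rule used for the pairwise dependency fractions is an assumption made for illustration only.

```python
from collections import defaultdict


def country_shares(lines_added_by_dev, country_of_dev):
    """Split one repository's unit credit across countries.

    Each developer's fraction is lines added / total lines added (an assumed
    weighting); developers with no known country go to "Unknown".
    """
    total = sum(lines_added_by_dev.values())
    if total == 0:
        return {}
    shares = defaultdict(float)
    for dev, lines in lines_added_by_dev.items():
        shares[country_of_dev.get(dev, "Unknown")] += lines / total
    return dict(shares)


def aggregate_by_country(repos, country_of_dev):
    """Sum fractional repository credit per country over all repositories."""
    totals = defaultdict(float)
    for lines_added in repos.values():
        for country, share in country_shares(lines_added, country_of_dev).items():
            totals[country] += share
    return dict(totals)


def dependency_fractions(repos, dependencies, country_of_dev):
    """Pairwise (upstream country, downstream country) dependency fractions.

    Assumption: one dependency edge carries one unit, split as the product of
    the upstream and downstream packages' country shares.
    """
    flows = defaultdict(float)
    for downstream, upstream in dependencies:
        up = country_shares(repos[upstream], country_of_dev)
        down = country_shares(repos[downstream], country_of_dev)
        for c_up, s_up in up.items():
            for c_down, s_down in down.items():
                flows[(c_up, c_down)] += s_up * s_down
    return dict(flows)


# Hypothetical toy data: lines of code added per developer, per repository.
repos = {
    "pkg_a": {"alice": 300, "bob": 100},
    "pkg_b": {"carol": 50, "alice": 50},
}
country_of_dev = {"alice": "US", "bob": "DE", "carol": "FR"}
dependencies = [("pkg_b", "pkg_a")]  # pkg_b depends on pkg_a

totals = aggregate_by_country(repos, country_of_dev)
flows = dependency_fractions(repos, dependencies, country_of_dev)

if __name__ == "__main__":
    print(totals)
    print(flows)
```

Because every repository contributes exactly one unit, the country totals sum to the number of repositories, so each country's share of the ecosystem is its total divided by that count.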
                            Attributing credit and measuring impact of open source software using fractional counting
                        
                    
    
                            - Award ID(s):
- 2306160
- PAR ID:
- 10528107
- Publisher / Repository:
- Symposium on Data Science and Statistics (SDSS) 2024
- Date Published:
- Format(s):
- Medium: X
- Institution:
- American Statistical Association
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            Open source software (OSS) is ubiquitous, serving as specialized applications nurtured by devoted user communities, and as digital infrastructure underlying platforms used by millions of people. OSS is developed, maintained, and extended through the contributions of independent developers as well as people from businesses, universities, government research institutions, and nonprofits. Despite its prevalence, the scope and impact of OSS are not currently well-measured. Recent policies of the U.S. Federal Government promote sharing of software code developed by or for the Federal Government. While the policy to promote reusing and sharing of software created with public funding is relatively new, public funding plays an important and not fully accounted-for role in the creation of OSS. This paper aims to measure the scope and value of OSS development in the U.S. Federal Government. We collect data from Code.gov, the government’s platform for sharing OSS projects, and study the contributions of agencies. The dataset contains 17K repositories from 21 agencies, with the majority of contributions originating from the DOE, NASA, and GSA. In addition, we collect data on the development activity (e.g., lines of code, contributors) of the repositories on GitHub, the largest hosting facility worldwide. Adopting a cost estimation model from software engineering, we generate estimates of investment in OSS that are consistent with the U.S. national accounting methods used for measuring software investment. Finally, we generate and analyze a collaboration network resulting from cross-agency contributions to repositories and explore the centrality of agencies in the network.
- 
            The analysis of gender dynamics in scientific research and its outputs is crucial for ensuring that science policy is inclusive and equitable. Similar to other research outputs such as publications and patents, open source software (OSS) projects are developed by contributors from universities, government research institutions, and nonprofits, in addition to businesses. Despite the reach and continued rapid growth of OSS, reliable and comprehensive survey data on it do not exist, limiting insights into contributions by gender and policymakers’ ability to assess trends in gender representation. As in scientific research, the inclusion of diverse perspectives in software development enhances creativity and problem-solving. Using GitHub data, researchers have found positive correlations between the gender diversity of an OSS development team and its productivity (Vasilescu et al., 2015; Ortu et al., 2017). Yet there is evidence of gender bias, with women facing higher standards to have their contributions accepted (Terrell et al., 2017; Imtiaz et al., 2019). This exploratory study aims to quantify gender differences in the development and use (impact) of OSS using publicly available information collected from GitHub. We focus on software packages developed for the R programming language, with the majority of contributors from academia. The paper asks: (1) What are the gender differences in the volume of contributions? (2) Has gender representation shifted over time? (3) Is there a correlation between the gender of contributors and the impact of a package?
- 
            Who creates the most innovative open-source software projects? And what fate do these projects tend to have? Building on a long history of research to understand innovation in business and other domains, as well as recent advances towards modeling innovation in scientific research from the science of science field, in this paper we adopt the analogy of innovation as emerging from the novel recombination of existing bits of knowledge. As such, we consider as innovative the software projects that recombine existing software libraries in novel ways, i.e., those built on top of atypical combinations of packages as extracted from import statements. We then report on a large-scale quantitative study of innovation in the Python open-source software ecosystem. Our results show that higher levels of innovativeness are statistically associated with higher GitHub star counts, i.e., novelty begets popularity. At the same time, we find that controlling for project size, the more innovative projects tend to involve smaller teams of contributors, as well as be at higher risk of becoming abandoned in the long term. We conclude that innovation and open source sustainability are closely related and, to some extent, antagonistic.