Abstract This manuscript shares the lessons learned from providing scientific computing support to over 600 researchers and discipline experts, helping them develop reproducible and scalable analytical workflows to process large amounts of heterogeneous data. When providing scientific computing support, focus is first placed on how to foster the collaborative aspects of multidisciplinary projects on the technological side by providing virtual spaces to communicate and share documents. Then insights on data management planning and how to implement a centralized data management workflow for data‐driven projects are provided. Developing reproducible workflows requires the development of code. We describe tools and practices that have been successful in fostering collaborative coding and scaling on remote servers, enabling teams to iterate more efficiently. We have found short training sessions combined with on‐demand specialized support to be the most impactful combination in helping scientists develop their technical skills. Here we share our experiences in enabling researchers to do science more collaboratively and more reproducibly beyond any specific project, with long‐lasting effects on the way researchers conduct science. We hope that other groups supporting team‐ and data‐driven science (in environmental science and beyond) will benefit from the lessons we have learned over the years through trial and error.
                            Assessing the state of research data publication in hydrology: A perspective from the Consortium of Universities for the Advancement of Hydrologic Science, Incorporated
                        
                    
    
            Abstract Many have argued that datasets resulting from scientific research should be part of the scholarly record as first class research products. Data sharing mandates from funding agencies and scientific journal publishers along with calls from the scientific community to better support transparency and reproducibility of scientific research have increased demand for tools and support for publishing datasets. Hydrology domain‐specific data publication services have been developed alongside more general purpose and even commercial data repositories. Prominent among these are the Hydrologic Information System (HIS) and HydroShare repositories developed by the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI). More broadly, however, multiple organizations have been involved in the practice of data publication in the hydrology domain, each having different roles that have shaped data publication and reuse. Bibliographic and archival approaches to data publication have been advanced, but both have limitations with respect to hydrologic data. Specific recommendations for improving data publication infrastructure, support, and practices to move beyond existing limitations and enable more effective data publication in support of scientific research in the hydrology domain include: improving support for journal article‐based data access and data citation, considering the workflow for data publication, enhancing support for reproducible science, encouraging publication of curated reference data collections, advancing interoperability standards for sharing data and metadata among repositories, developing partnerships with university libraries offering data services, and developing more specific data management plans. While presented in the context of CUAHSI's data repositories and experience, these recommendations are broadly applicable to other domains. This article is categorized under: Science of Water > Methods
- PAR ID: 10360637
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: WIREs Water
- Volume: 7
- Issue: 3
- ISSN: 2049-1948
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
            Abstract Contributory science—including citizen and community science—allows scientists to leverage participant‐generated data while providing an opportunity for engaging with local community members. Data yielded by participant‐generated biodiversity platforms allow professional scientists to answer ecological and evolutionary questions across both geographic and temporal scales, which is incredibly valuable for conservation efforts. The data reported to contributory biodiversity platforms, such as eBird and iNaturalist, can be driven by social and ecological variables, leading to biased data. Though empirical work has highlighted the biases in contributory data, little work has articulated how biases arise in contributory data and the societal consequences of these biases. We present a conceptual framework illustrating how social and ecological variables create bias in contributory science data. In this framework, we present four filters—participation, detectability, sampling and preference—that ultimately shape the type and location of contributory biodiversity data. We leverage this framework to examine data from the largest contributory science platforms—eBird and iNaturalist—in St. Louis, Missouri, the United States, and discuss the potential consequences of biased data. Lastly, we conclude by providing several recommendations for researchers and institutions to move towards a more inclusive field. With these recommendations, we provide opportunities to ameliorate biases in contributory data and an opportunity to practice equitable biodiversity conservation. Read the free Plain Language Summary for this article on the Journal blog.
- 
            Abstract The hydrologic community has experienced a surge in interest in machine learning in recent years. This interest is primarily driven by rapidly growing hydrologic data repositories, as well as success of machine learning in various academic and commercial applications, now possible due to increasing accessibility to enabling hardware and software. This overview is intended for readers new to the field of machine learning. It provides a non‐technical introduction, placed within a historical context, to commonly used machine learning algorithms and deep learning architectures. Applications in hydrologic sciences are summarized next, with a focus on recent studies. They include the detection of patterns and events such as land use change, approximation of hydrologic variables and processes such as rainfall‐runoff modeling, and mining relationships among variables for identifying controlling factors. The use of machine learning is also discussed in the context of integration with process‐based modeling for parameterization, surrogate modeling, and bias correction. Finally, the article highlights challenges of extrapolating robustness, physical interpretability, and small sample size in hydrologic applications. This article is categorized under: Science of Water
- 
            Abstract ChemML is an open machine learning (ML) and informatics program suite that is designed to support and advance the data‐driven research paradigm that is currently emerging in the chemical and materials domain. ChemML allows its users to perform various data science tasks and execute ML workflows that are adapted specifically for the chemical and materials context. Key features are automation, general‐purpose utility, versatility, and user‐friendliness in order to make the application of modern data science a viable and widely accessible proposition in the broader chemistry and materials community. ChemML is also designed to facilitate methodological innovation, and it is one of the cornerstones of the software ecosystem for data‐driven in silico research. This article is categorized under: Software > Simulation Methods; Computer and Information Science > Chemoinformatics; Structure and Mechanism > Computational Materials Science; Software > Molecular Modeling
- 
            Phenotypic, especially morphological, data are highly useful in systematics, taxonomy, and phylogenetics. Despite the increased use of genetic information, phenotypic data are necessary when researching the fossil record and remain useful for living taxa by providing independent evidence for testing molecular clades. MorphoBank is a FAIR (Findable, Accessible, Interoperable, and Reusable) database providing open biodiversity data in the form of morphological characters (O'Leary and Kaufman 2011, O'Leary and Kaufman 2012), a similar concept to GenBank for open access sequence data. MorphoBank enables scientists to share morphological character data associated with their peer-reviewed publications in the form of phylogenetic matrices as Tree analysis using New Technology (TNT) or NEXUS files. MorphoBank hosts 1,738 publicly accessible projects (each MorphoBank project is issued a unique identifier (ID) beginning with the letter P followed by a number) with 173,559 images and 1,138 matrices as of July 2024. These data can be downloaded by the public, researchers, and students in the scientific community, where the data can be used for educational purposes or reused in additional phylogenetic analyses. MorphoBank encourages scientists to add content in numerous ways throughout the research process, including while actively working on a morphological matrix or in conjunction with a paper to be published that has a morphological matrix. For example, some large projects, such as P773, represent collaborative research that contains a matrix with 4,541 characters and over 12,000 annotated images. Researchers looking to replicate or utilize the data from this study, a task that would normally be extremely time and labor intensive, are able to quickly and easily download and work with the data in their own analyses. MorphoBank has a team of part-time curators and interns who also add content post-publication.
Between 2018 and 2023, MorphoBank staff accounted for 25% of project creation and 41% of project publication. The MorphoBank community members created more projects but published fewer of them in the same time frame. The MorphoBank curation team strives to add the matrices to make the data FAIR. A majority of the data are associated with publications in journals that require a subscription; MorphoBank makes the matrix data available with its complete metadata without a financial access barrier. Data standards for morphological character matrices include scored taxa, full taxonomic names, and complete character names with character state descriptions. Since NEXUS files have varying standardization and syntax (Maddison et al. 1997, Vos et al. 2012), importing a matrix can lead to data errors, which MorphoBank does not accept due to its mission to provide complete and reproducible datasets. Hence, users often add incomplete data as file attachments. To help ensure that full data are uploaded, MorphoBank has partnered with journals to ensure instructions to authors or emails to authors of accepted manuscripts make clear the need to upload data matrices to MorphoBank. MorphoBank has been cited over 1,500 times, with increasing citations each year (Fig. 1). We examined the use and impact of MorphoBank data on systematic and phylogenetic research and found that most data are used in phylogenetic analyses, describing new species, and examining diversification of taxonomic groups, which span a wide range of organisms from vertebrates such as dinosaurs, reptiles, and mammals (including studies of human evolution) to plants, invertebrates, and micro-organisms. MorphoBank has developed and implemented an internship program for undergraduate biology students focused on training in phylogenetic data, curation, research writing, and conference presenting.
Part of this internship program involves utilizing Artificial Intelligence (AI) to increase efficiency by automating the extraction of character name and state data from published articles and integrating them into NEXUS files. Three additional activities help raise awareness and increase community contributions to MorphoBank: A partnership with the American Museum of Natural History (AMNH) was established in Summer 2024 to train volunteer curators. MorphoBank workshops have been developed for in-person (i.e., 12th North American Paleontological Convention in Ann Arbor, Michigan) and virtual (i.e., 3rd Joint Congress on Evolutionary Biology supported by the Society of Systematic Biologists) conferences. Virtual workshops will be offered quarterly to educate the scientific community on ways to add their own phylogenetic data to MorphoBank.
The long-term sustainability of MorphoBank depends on success in three areas: Financial: MorphoBank is currently supported by membership fees from academic institutions and museums; institutional support from the non-profit organization Phoenix Bioinformatics; and grants from the United States National Science Foundation. Its future depends on continued support and growth in membership. Technical: The over 20-year-old MorphoBank codebase is being completely overhauled to provide better performance, add longer term software stability, and enable easier addition of new features. Scientific: The outreach efforts to increase community awareness and contributions aim to ensure the continued relevance and utility of the resource. Growth in data depth and breadth feeds into making MorphoBank indispensable for research in this scientific domain.
 An official website of the United States government
