
Title: Repositories You Shouldn't be Living Without
Over the last few years, a number of repositories of information relevant to the computing education community have come online, each with different content and purpose. In this special session, we present an overview of these repositories and the content that each provides. We will demonstrate the functionality of the repositories, and we encourage attendees who already use them to come with questions and suggestions for improvement.
Award ID(s):
1757402
PAR ID:
10058213
Journal Name:
SIGCSE '18 Proceedings of the 49th ACM Technical Symposium on Computer Science Education
Page Range / eLocation ID:
920-921
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Repositories You Shouldn't be Living Without (this paper; abstract shown above).
  2. Incomplete and inconsistent connections between institutional repository holdings and the global data infrastructure inhibit research data discovery and reusability. Preventing metadata loss on the path from institutional repositories to the global research infrastructure can substantially improve research data reusability. The Realities of Academic Data Sharing (RADS) Initiative, funded by the National Science Foundation, is investigating institutional processes for improving research data FAIRness. The focal points of the RADS inquiry are to understand where researchers are sharing their data and to assess metadata quality, i.e., completeness, at six Data Curation Network (DCN) academic institutions: Cornell University, Duke University, University of Michigan, University of Minnesota, Washington University in St. Louis, and Virginia Tech. RADS is examining where researchers store their data, considering both local institutional repositories and other popular repositories, and analyzing the completeness of the research data metadata stored in them. Metadata FAIRness (Findable, Accessible, Interoperable, Reusable) is used as the metric for assessing metadata completeness. Research findings show significant content loss when metadata from local institutional repositories are compared to metadata found in DataCite. After examining the factors contributing to this metadata loss, RADS investigators are developing a set of recommended best practices for institutions to increase the quality of their scholarly metadata. Further, documentation such as README files is of particular importance, not only for data reuse but also as a source of valuable metadata such as Persistent Identifiers (PIDs). DOIs and related PIDs such as ORCID and ROR are still rarely used in institutional repositories; more frequent use would have a positive effect on discoverability, interoperability, and reusability, especially when transferring metadata to the global infrastructure.
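The completeness comparison described above can be sketched as a simple field-counting check. This is a minimal illustration: the field names, the two records, and the scoring rule are assumptions for demonstration, not the actual RADS rubric or the DataCite schema.

```python
# Hypothetical sketch of metadata-completeness scoring, in the spirit of
# RADS comparing institutional-repository records to DataCite records.
# REQUIRED_FIELDS is an illustrative list, not an official FAIR checklist.
REQUIRED_FIELDS = ["title", "creator", "publisher", "publication_year",
                   "resource_type", "doi", "creator_orcid", "affiliation_ror"]

def completeness(record: dict) -> float:
    """Fraction of required metadata fields that are present and non-empty."""
    present = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return present / len(REQUIRED_FIELDS)

# Toy records: a sparse local-repository record vs. a richer DataCite record.
local = {"title": "Survey data", "creator": "J. Doe", "publication_year": 2022}
datacite = dict(local, publisher="Univ. Repo", resource_type="Dataset",
                doi="10.1234/abcd")

gap = completeness(datacite) - completeness(local)  # the "content loss" signal
```

A positive `gap` flags records whose institutional copy is missing fields that the DataCite copy carries (or vice versa if negative), which is the kind of signal the study aggregates across repositories.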
  3. Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored across geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA's GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that can address these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) in data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as federating content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with a preliminary evaluation showing a sixfold speed-up in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN by anyone interested in those datasets. Finally, we discuss our continued effort to integrate NDN with cloud computing platforms such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN's properties that can benefit it. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
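The name-based retrieval and in-network caching that the abstract describes can be illustrated with a toy two-node model. The content name and the topology below are invented for illustration; real NDN forwarders (e.g. NFD) implement content stores and Interest forwarding at the network layer, not in application code.

```python
# Toy model of NDN-style retrieval: requests carry a content name, not a host
# address, and any node holding the named data in its content store can answer.
class NDNNode:
    def __init__(self, store=None):
        self.content_store = dict(store or {})  # name -> data (the cache)
        self.next_hop = None                    # node toward the data producer

    def get(self, name: str) -> bytes:
        if name in self.content_store:          # cache hit: serve locally
            return self.content_store[name]
        data = self.next_hop.get(name)          # forward the Interest upstream
        self.content_store[name] = data         # cache on the return path
        return data

# Illustrative name following a filepath-like scheme (not the community scheme
# from Section 4 of the paper).
NAME = "/genomics/ncbi/sra/SRR000001"
producer = NDNNode({NAME: b"ACGT..."})
edge = NDNNode()
edge.next_hop = producer

first = edge.get(NAME)    # travels to the producer, then is cached at the edge
second = edge.get(NAME)   # served from the edge cache; no upstream traffic
```

The second request never reaches the producer, which is the mechanism behind the "in-network caching of popular datasets" speed-up claimed above.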
  4. Open Educational Resources (OER) are widely used instructional materials that are freely available and promote equitable access. OER research at the undergraduate level largely focuses on measuring student experiences with using the low-cost resources, and on instructor awareness of resources and perceived barriers to use. Little is known about how instructors work with materials based on their unique teaching context. To explore how instructors engage with OER, we surveyed users of CourseSource, an open-access, peer-reviewed journal that publishes lessons primarily for undergraduate biology courses. We asked questions aligned with the OER life cycle, a framework that includes the phases Search, Evaluation, Adaptation, Use, and Share. The results show that OER users come from a variety of institution types and positions, generally hold positions that focus more on teaching than research, and use scientific teaching practices. To determine how instructors engage throughout the OER life cycle, we examined the frequency of survey responses. Notable trends include that instructors search for and evaluate OER based on alignment with course needs, quality of the materials, and ease of implementation. In addition, instructors frequently modify the published materials for their classroom context and use them in a variety of course environments. The results of this work can help developers design current and future OER repositories to better match undergraduate instructor needs and aid content producers in creating materials that encourage implementation by their colleagues.
  5. In order to understand the state and evolution of the entirety of open source software, we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that makes it easy to create clones, resulting in numerous repositories that are almost entirely based on some parent repository from which they were cloned. Because Git commits are effectively unique, shared commits provide a way to group cloned repositories. We use the World of Code infrastructure, containing approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph of links between commits and projects. The approach successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories. We expect that the resulting map of related projects, as well as the tools and methods for handling the very large graph, will serve as a reference set for mining software projects and other applications. Further work is needed to determine different types of relationships among projects induced by shared commits and other relationships, for example by shared source code or similar filenames.
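The way shared commits transitively link repositories into one giant group can be sketched with plain union-find on toy data. The repository names and commit IDs are invented; note this naive transitive grouping is exactly what produces the 14M-repository megacluster, which is why the paper instead applies Louvain community detection to discount spurious links.

```python
# Union-find over repositories: any two repositories sharing a commit are
# merged into the same group, and merging is transitive.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Toy data: commit id -> repositories that contain it.
commit_repos = {
    "c1": ["repo/a", "repo/a-fork"],       # a-fork cloned from a
    "c2": ["repo/a-fork", "repo/a-fork2"], # a-fork2 cloned from a-fork
    "c3": ["repo/b"],                      # unrelated project
}
for repos in commit_repos.values():
    for r in repos[1:]:
        union(repos[0], r)

# Collect the resulting groups of (transitively) related repositories.
groups = {}
for r in {r for rs in commit_repos.values() for r in rs}:
    groups.setdefault(find(r), set()).add(r)
```

Here `repo/a`, `repo/a-fork`, and `repo/a-fork2` collapse into one group even though `repo/a` and `repo/a-fork2` share no commit directly; a single stray pushed object between two unrelated projects would merge their groups the same way.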