Millions of in situ ocean temperature profiles have been collected historically using various instrument types with varying sensor accuracy and then assembled into global databases. These are essential to our current understanding of the changing state of the oceans, sea level, Earth's climate, marine ecosystems and fisheries, and to constraining model projections of future change that underpin mitigation and adaptation solutions. Profiles distributed shortly after collection are also widely used in operational applications such as real-time monitoring and forecasting of the ocean state and weather prediction. Before use in scientific or societal service applications, quality control (QC) procedures need to be applied to flag and ultimately remove erroneous data. Automatic QC (AQC) checks are vital to the timeliness of operational applications and for reducing the volume of dubious data which later require QC processing by a human for delayed mode applications. Despite the large suite of evolving AQC checks developed by institutions worldwide, the most effective set of AQC checks had not been established. We have developed a framework to assess the performance of AQC checks, under the auspices of the International Quality Controlled Ocean Database (IQuOD) project. The IQuOD-AQC framework is an open-source collaborative software infrastructure built in Python (available from https://github.com/IQuOD). Sixty AQC checks have been implemented in this framework. Their performance was benchmarked against three reference datasets which contained a spectrum of instrument types and error modes flagged in their profiles. One of these (a subset of the Quality-controlled Ocean Temperature Archive (QuOTA) dataset that had been manually inspected for quality issues by its creators) was also used to identify optimal sets of AQC checks. Results suggest that the AQC checks are effective for most historical data, but less so in the case of data from Mechanical Bathythermographs (MBTs), and much less effective for Argo data. The optimal AQC sets will be applied to generate quality flags for the next release of the IQuOD dataset. This will further elevate the quality and historical value of the millions of temperature profiles that have already been improved by IQuOD intelligent metadata and observational uncertainty information (https://doi.org/10.7289/v51r6nsf).
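The benchmarking idea can be sketched in a few lines of Python. This is an illustrative outline only, not the IQuOD-AQC code: the function names, the metrics, and the greedy selection strategy are assumptions, and the project's repositories on GitHub should be consulted for the real procedure.

```python
# Illustrative sketch: score one AQC check against a reference dataset in which
# each profile carries an expert "bad" flag, then greedily assemble a check set.
import numpy as np

def benchmark_check(check_flags: np.ndarray, reference_bad: np.ndarray) -> dict:
    """check_flags, reference_bad: boolean arrays, one entry per profile."""
    tp = np.sum(check_flags & reference_bad)    # bad profiles correctly flagged
    fp = np.sum(check_flags & ~reference_bad)   # good profiles wrongly flagged
    return {
        "tpr": tp / max(reference_bad.sum(), 1),      # detection rate
        "fpr": fp / max((~reference_bad).sum(), 1),   # false-alarm rate
    }

def greedy_check_set(results: dict[str, np.ndarray],
                     reference_bad: np.ndarray,
                     max_fpr: float = 0.05) -> list[str]:
    """Add checks (OR-combined) in order of detection rate while the combined
    false-alarm rate stays below max_fpr; a stand-in for the optimisation the
    paper performs on the QuOTA subset."""
    ranked = sorted(results,
                    key=lambda n: -benchmark_check(results[n], reference_bad)["tpr"])
    combined = np.zeros_like(reference_bad, dtype=bool)
    chosen = []
    for name in ranked:
        trial = combined | results[name]
        if benchmark_check(trial, reference_bad)["fpr"] <= max_fpr:
            combined, chosen = trial, chosen + [name]
    return chosen
```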
DC_OCEAN: an open-source algorithm for identification of duplicates in ocean databases
A high-quality hydrographic observational database is essential for ocean and climate studies and operational applications. Because there are numerous global and regional ocean databases, duplicate data continue to be an issue in data management, data processing, and database merging, posing a challenge to the effective and accurate use of oceanographic data for deriving robust statistics and reliable data products. This study aims to provide algorithms to identify duplicates and assign labels to them. We propose, first, a set of criteria to define duplicate data and, second, an open-source, semi-automatic system to detect duplicate data and erroneous metadata. This system includes several algorithms for automatic checks using statistical methods (such as Principal Component Analysis and entropy weighting) and an additional expert (manual) check. The robustness of the system is then evaluated with a subset of the World Ocean Database (WOD18) containing over 600,000 in situ temperature and salinity profiles. The system is distributed as an open-source Python package (named DC_OCEAN), and users can customize its settings. The application results from the WOD18 subset also form a benchmark dataset, which is available to support future studies on duplicate checks, metadata error identification, and machine learning applications. This duplicate checking system will be incorporated into the International Quality-controlled Ocean Database (IQuOD) data quality control system to guarantee the uniqueness of ocean observation data in this product.
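As a rough illustration of the automatic stage (the actual DC_OCEAN criteria, feature weights, and thresholds differ), one could summarise each profile by a small metadata/measurement feature vector, compress it with PCA, and flag near-coincident profiles as duplicate candidates for the expert check:

```python
# Illustrative sketch only, not the DC_OCEAN algorithm: find pairs of profiles
# whose summary features are nearly identical in a reduced (PCA) space.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

def duplicate_candidates(features: np.ndarray, tol: float = 1e-3):
    """features: (n_profiles, n_features) array, e.g. [lat, lon, time,
    max depth, mean T, mean S]. Returns index pairs of candidate duplicates."""
    # Standardise so no single feature dominates (a simple stand-in for the
    # entropy weighting described in the paper).
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    scores = PCA(n_components=min(3, z.shape[1])).fit_transform(z)
    tree = cKDTree(scores)
    pairs = tree.query_pairs(r=tol)   # profiles nearly coincident in PCA space
    return sorted(pairs)              # hand these to the manual/expert check
```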
- Award ID(s): 1840868
- PAR ID: 10556097
- Publisher / Repository: Frontiers
- Date Published:
- Journal Name: Frontiers in Marine Science
- Volume: 11
- ISSN: 2296-7745
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Motivation: Environmental DNA (eDNA), as a rapidly expanding research field, stands to benefit from shared resources including sampling protocols, study designs, discovered sequences, and taxonomic assignments to sequences. High-quality, community-shareable eDNA resources rely heavily on comprehensive metadata documentation that captures the complex workflows covering field sampling, molecular biology lab work, and bioinformatic analyses. Few sources document database development around comprehensive eDNA metadata and these workflows, and no open-source software exists. Results: We present medna-metadata, an open-source, modular system that aligns with the Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles that support scholarly data reuse, providing the database and application development of a standardized metadata collection structure that encapsulates critical aspects of field data collection, wet lab processing, and bioinformatic analysis. Medna-metadata is showcased with metabarcoding data from the Gulf of Maine (Polinski et al., 2019). Availability and implementation: The source code of the medna-metadata web application is hosted on GitHub (https://github.com/Maine-eDNA/medna-metadata). Medna-metadata is a docker-compose installable package. Documentation can be found at https://medna-metadata.readthedocs.io/en/latest/?badge=latest. The application is implemented in Python, PostgreSQL and PostGIS, RabbitMQ, and NGINX, with all major browsers supported. A demo can be found at https://demo.metadata.maine-edna.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
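A toy sketch of the kind of end-to-end record such a schema standardises is below. All class and field names are hypothetical; the real medna-metadata data model is defined by its own (Django) models and documentation.

```python
# Hypothetical sketch of an eDNA record spanning field sampling, wet lab work,
# and bioinformatics; not the actual medna-metadata models.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class FieldSample:
    site_id: str
    collected_at: datetime
    latitude: float
    longitude: float
    water_depth_m: float

@dataclass
class WetLabPrep:
    extraction_method: str      # e.g. extraction kit used
    target_gene: str            # e.g. "12S", "16S", "COI"
    primer_set: str

@dataclass
class BioinformaticsRun:
    pipeline: str               # denoising / taxonomy workflow used
    reference_database: str
    taxa: list[str] = field(default_factory=list)

@dataclass
class EDnaRecord:
    sample: FieldSample
    wet_lab: WetLabPrep
    analysis: BioinformaticsRun
```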
-
Abstract Since the mid-2000s, the Argo oceanographic observational network has provided near-real-time four-dimensional data for the global ocean for the first time in history. Internet (i.e., the “web”) applications that handle the more than two million Argo profiles of ocean temperature, salinity, and pressure are an active area of development. This paper introduces a new and efficient interactive Argo data visualization and delivery web application named Argovis that is built on a classic three-tier design consisting of a front end, back end, and database. Together these components allow users to navigate 4D data on a world map of Argo floats, with the option to select a custom region, depth range, and time period. Argovis’s back end sends data to users in a simple format, and the front end quickly renders web-quality figures. More advanced applications query Argovis from other programming environments, such as Python, R, and MATLAB. Our Argovis architecture allows expert data users to build their own functionality for specific applications, such as the creation of spatially gridded data for a given time and advanced time–frequency analysis for a space–time selection. Argovis is aimed at both scientists and the public, with tutorials and examples available on the website describing how to use the Argovis data delivery system, for example, how to plot profiles in a region over time or to monitor profile metadata.
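For example, a script might request profiles for a region and time period over HTTP. The endpoint path and parameter names below are assumptions inferred from this description; the Argovis website and tutorials document the actual API.

```python
# Hedged sketch of querying Argovis from Python; endpoint and parameters are
# assumptions, not a documented interface.
import json
import requests

BASE_URL = "https://argovis.colorado.edu"   # assumed host

def profiles_in_region(shape, start_date, end_date, pres_range=None):
    """Fetch Argo profiles inside a polygon (list of [lon, lat] vertices)
    between two dates (YYYY-MM-DD strings); returns the parsed JSON payload."""
    params = {"shape": json.dumps(shape),
              "startDate": start_date,
              "endDate": end_date}
    if pres_range is not None:
        params["presRange"] = json.dumps(pres_range)   # e.g. [0, 100] dbar
    resp = requests.get(f"{BASE_URL}/selection/profiles", params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage (small box in the subtropical North Atlantic):
# profiles_in_region([[-60, 30], [-55, 30], [-55, 35], [-60, 35], [-60, 30]],
#                    "2020-01-01", "2020-01-10", pres_range=[0, 100])
```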
-
In this paper, we outline the need for a coordinated international effort toward the building of an open-access Global Ocean Oxygen Database and ATlas (GO2DAT) complying with the FAIR principles (Findable, Accessible, Interoperable, and Reusable). GO2DAT will combine data from the coastal and open ocean, as measured by the chemical Winkler titration method or by sensors (e.g., optodes, electrodes) from Eulerian and Lagrangian platforms (e.g., ships, moorings, profiling floats, gliders, ships of opportunity, marine mammals, cabled observatories). GO2DAT will further adopt a community-agreed, fully documented metadata format and a consistent quality control (QC) procedure and quality flagging (QF) system. GO2DAT will serve to support the development of advanced data analysis and biogeochemical models for improving our mapping, understanding and forecasting capabilities for ocean O2 changes and deoxygenation trends. It will offer the opportunity to develop quality-controlled data synthesis products with unprecedented spatial (vertical and horizontal) and temporal (sub-seasonal to multi-decadal) resolution. These products will support model assessment, improvement and evaluation as well as the development of climate and ocean health indicators. They will further support the decision-making processes associated with the emerging blue economy, the conservation of marine resources and their associated ecosystem services and the development of management tools required by a diverse community of users (e.g., environmental agencies, aquaculture, and fishing sectors). A better knowledge base of the spatial and temporal variations of marine O2 will improve our understanding of the ocean O2 budget, and allow better quantification of the Earth’s carbon and heat budgets. With the ever-increasing need to protect and sustainably manage ocean services, GO2DAT will allow scientists to fully harness the increasing volumes of O2 data already delivered by the expanding global ocean observing system and enable smooth incorporation of much higher quantities of data from autonomous platforms in the open ocean and coastal areas into comprehensive data products in the years to come. This paper aims at engaging the community (e.g., scientists, data managers, policy makers, service users) toward the development of GO2DAT within the framework of the UN Global Ocean Oxygen Decade (GOOD) program recently endorsed by IOC-UNESCO. A roadmap toward GO2DAT is proposed, highlighting the efforts needed (e.g., in terms of human resources).
-
Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter‐university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine‐readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot‐based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.
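The pattern can be sketched as: a language model turns the user's question into a graph query, which is then run against the metadata graph. Everything below (the schema, the prompt, and the stubbed ask_llm helper) is a hypothetical stand-in, not DataChat's actual components or APIs.

```python
# Hypothetical sketch of the chatbot-to-graph-query pattern described above.

PROMPT_TEMPLATE = (
    "You translate questions about research data into Cypher queries.\n"
    "Graph schema: (Dataset)-[:HAS_TOPIC]->(Topic), "
    "(Dataset)-[:ARCHIVED_BY]->(Repository).\n"
    "Question: {question}\nCypher:"
)

def ask_llm(prompt: str) -> str:
    """Stand-in for a large language model call (e.g., via an API client);
    returns a canned query here so the sketch runs without a model."""
    return ("MATCH (d:Dataset)-[:HAS_TOPIC]->(:Topic {name: 'voting behavior'}) "
            "RETURN d.title, d.doi LIMIT 5")

def question_to_graph_query(question: str) -> str:
    """LLM step: turn a user's question into a graph query, which would then be
    executed against the metadata graph (e.g., with a Neo4j driver)."""
    return ask_llm(PROMPT_TEMPLATE.format(question=question))

if __name__ == "__main__":
    print(question_to_graph_query("Which datasets cover voting behavior?"))
```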