Millions of in situ ocean temperature profiles have been collected historically using various instrument types with varying sensor accuracy and then assembled into global databases. These are essential to our current understanding of the changing state of the oceans, sea level, Earth's climate, marine ecosystems and fisheries, and to constraining model projections of future change that underpin mitigation and adaptation solutions. Profiles distributed shortly after collection are also widely used in operational applications such as real-time monitoring and forecasting of the ocean state and weather prediction. Before use in scientific or societal service applications, quality control (QC) procedures need to be applied to flag and ultimately remove erroneous data. Automatic QC (AQC) checks are vital to the timeliness of operational applications and for reducing the volume of dubious data which later require QC processing by a human for delayed mode applications. Despite the large suite of evolving AQC checks developed by institutions worldwide, the most effective set of AQC checks had not been established. We have developed a framework to assess the performance of AQC checks, under the auspices of the International Quality Controlled Ocean Database (IQuOD) project. The IQuOD-AQC framework is an open-source collaborative software infrastructure built in Python (available from https://github.com/IQuOD). Sixty AQC checks have been implemented in this framework. Their performance was benchmarked against three reference datasets which contained a spectrum of instrument types and error modes flagged in their profiles. One of these (a subset of the Quality-controlled Ocean Temperature Archive (QuOTA) dataset that had been manually inspected for quality issues by its creators) was also used to identify optimal sets of AQC checks. Results suggest that the AQC checks are effective for most historical data, but less so in the case of data from Mechanical Bathythermographs (MBTs), and much less effective for Argo data. The optimal AQC sets will be applied to generate quality flags for the next release of the IQuOD dataset. This will further elevate the quality and historical value of the millions of temperature profiles that have already been improved by IQuOD intelligent metadata and observational uncertainty information (https://doi.org/10.7289/v51r6nsf).
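The benchmarking idea can be sketched in a few lines of Python. This is an illustrative outline only, not the IQuOD-AQC code: the function names, the metrics, and the greedy selection strategy are assumptions, and the project's repositories on GitHub should be consulted for the real procedure.

```python
# Illustrative sketch: score one AQC check against a reference dataset in which
# each profile carries an expert "bad" flag, then greedily assemble a check set.
import numpy as np

def benchmark_check(check_flags: np.ndarray, reference_bad: np.ndarray) -> dict:
    """check_flags, reference_bad: boolean arrays, one entry per profile."""
    tp = np.sum(check_flags & reference_bad)    # bad profiles correctly flagged
    fp = np.sum(check_flags & ~reference_bad)   # good profiles wrongly flagged
    return {
        "tpr": tp / max(reference_bad.sum(), 1),      # detection rate
        "fpr": fp / max((~reference_bad).sum(), 1),   # false-alarm rate
    }

def greedy_check_set(results: dict[str, np.ndarray],
                     reference_bad: np.ndarray,
                     max_fpr: float = 0.05) -> list[str]:
    """Add checks (OR-combined) in order of detection rate while the combined
    false-alarm rate stays below max_fpr; a stand-in for the optimisation the
    paper performs on the QuOTA subset."""
    ranked = sorted(results,
                    key=lambda n: -benchmark_check(results[n], reference_bad)["tpr"])
    combined = np.zeros_like(reference_bad, dtype=bool)
    chosen = []
    for name in ranked:
        trial = combined | results[name]
        if benchmark_check(trial, reference_bad)["fpr"] <= max_fpr:
            combined, chosen = trial, chosen + [name]
    return chosen
```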
DC_OCEAN: an open-source algorithm for identification of duplicates in ocean databases
A high-quality hydrographic observational database is essential for ocean and climate studies and operational applications. Because there are numerous global and regional ocean databases, duplicate data continue to be an issue in data management, data processing, and database merging, posing a challenge to the effective and accurate use of oceanographic data for deriving robust statistics and reliable data products. This study aims to provide algorithms to identify duplicates and assign labels to them. We propose, first, a set of criteria to define duplicate data and, second, an open-source, semi-automatic system to detect duplicate data and erroneous metadata. This system includes several algorithms for automatic checks using statistical methods (such as Principal Component Analysis and entropy weighting) and an additional expert (manual) check. The robustness of the system is then evaluated with a subset of the World Ocean Database (WOD18) containing over 600,000 in situ temperature and salinity profiles. The system is distributed as an open-source Python package (named DC_OCEAN), and users can customize its settings. The application results from the WOD18 subset also form a benchmark dataset, which is available to support future studies on duplicate checks, metadata error identification, and machine learning applications. This duplicate checking system will be incorporated into the International Quality-controlled Ocean Database (IQuOD) data quality control system to guarantee the uniqueness of ocean observation data in this product.
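As a rough illustration of the automatic stage (the actual DC_OCEAN criteria, feature weights, and thresholds differ), one could summarise each profile by a small metadata/measurement feature vector, compress it with PCA, and flag near-coincident profiles as duplicate candidates for the expert check:

```python
# Illustrative sketch only, not the DC_OCEAN algorithm: find pairs of profiles
# whose summary features are nearly identical in a reduced (PCA) space.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

def duplicate_candidates(features: np.ndarray, tol: float = 1e-3):
    """features: (n_profiles, n_features) array, e.g. [lat, lon, time,
    max depth, mean T, mean S]. Returns index pairs of candidate duplicates."""
    # Standardise so no single feature dominates (a simple stand-in for the
    # entropy weighting described in the paper).
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    scores = PCA(n_components=min(3, z.shape[1])).fit_transform(z)
    tree = cKDTree(scores)
    pairs = tree.query_pairs(r=tol)   # profiles nearly coincident in PCA space
    return sorted(pairs)              # hand these to the manual/expert check
```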
- Award ID(s): 1840868
- PAR ID: 10556097
- Publisher / Repository: Frontiers
- Date Published:
- Journal Name: Frontiers in Marine Science
- Volume: 11
- ISSN: 2296-7745
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Motivation: Environmental DNA (eDNA), as a rapidly expanding research field, stands to benefit from shared resources including sampling protocols, study designs, discovered sequences, and taxonomic assignments to sequences. High-quality, community-shareable eDNA resources rely heavily on comprehensive metadata documentation that captures the complex workflows covering field sampling, molecular biology lab work, and bioinformatic analyses. Few sources document database development around comprehensive eDNA metadata and these workflows, and no open-source software exists. Results: We present medna-metadata, an open-source, modular system that aligns with the Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles that support scholarly data reuse, providing the database and application development of a standardized metadata collection structure that encapsulates critical aspects of field data collection, wet lab processing, and bioinformatic analysis. Medna-metadata is showcased with metabarcoding data from the Gulf of Maine (Polinski et al., 2019). Availability and implementation: The source code of the medna-metadata web application is hosted on GitHub (https://github.com/Maine-eDNA/medna-metadata). Medna-metadata is a docker-compose installable package. Documentation can be found at https://medna-metadata.readthedocs.io/en/latest/?badge=latest. The application is implemented in Python, PostgreSQL and PostGIS, RabbitMQ, and NGINX, with all major browsers supported. A demo can be found at https://demo.metadata.maine-edna.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
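A toy sketch of the kind of end-to-end record such a schema standardises is below. All class and field names are hypothetical; the real medna-metadata data model is defined by its own (Django) models and documentation.

```python
# Hypothetical sketch of an eDNA record spanning field sampling, wet lab work,
# and bioinformatics; not the actual medna-metadata models.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class FieldSample:
    site_id: str
    collected_at: datetime
    latitude: float
    longitude: float
    water_depth_m: float

@dataclass
class WetLabPrep:
    extraction_method: str      # e.g. extraction kit used
    target_gene: str            # e.g. "12S", "16S", "COI"
    primer_set: str

@dataclass
class BioinformaticsRun:
    pipeline: str               # denoising / taxonomy workflow used
    reference_database: str
    taxa: list[str] = field(default_factory=list)

@dataclass
class EDnaRecord:
    sample: FieldSample
    wet_lab: WetLabPrep
    analysis: BioinformaticsRun
```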
-
Abstract Since the mid-2000s, the Argo oceanographic observational network has provided near-real-time four-dimensional data for the global ocean for the first time in history. Internet (i.e., the “web”) applications that handle the more than two million Argo profiles of ocean temperature, salinity, and pressure are an active area of development. This paper introduces a new and efficient interactive Argo data visualization and delivery web application named Argovis that is built on a classic three-tier design consisting of a front end, back end, and database. Together these components allow users to navigate 4D data on a world map of Argo floats, with the option to select a custom region, depth range, and time period. Argovis’s back end sends data to users in a simple format, and the front end quickly renders web-quality figures. More advanced applications query Argovis from other programming environments, such as Python, R, and MATLAB. Our Argovis architecture allows expert data users to build their own functionality for specific applications, such as the creation of spatially gridded data for a given time and advanced time–frequency analysis for a space–time selection. Argovis is aimed at both scientists and the public, with tutorials and examples available on the website describing how to use the Argovis data delivery system, for example, how to plot profiles in a region over time or to monitor profile metadata.
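For example, a script might request profiles for a region and time period over HTTP. The endpoint path and parameter names below are assumptions inferred from this description; the Argovis website and tutorials document the actual API.

```python
# Hedged sketch of querying Argovis from Python; endpoint and parameters are
# assumptions, not a documented interface.
import json
import requests

BASE_URL = "https://argovis.colorado.edu"   # assumed host

def profiles_in_region(shape, start_date, end_date, pres_range=None):
    """Fetch Argo profiles inside a polygon (list of [lon, lat] vertices)
    between two dates (YYYY-MM-DD strings); returns the parsed JSON payload."""
    params = {"shape": json.dumps(shape),
              "startDate": start_date,
              "endDate": end_date}
    if pres_range is not None:
        params["presRange"] = json.dumps(pres_range)   # e.g. [0, 100] dbar
    resp = requests.get(f"{BASE_URL}/selection/profiles", params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage (small box in the subtropical North Atlantic):
# profiles_in_region([[-60, 30], [-55, 30], [-55, 35], [-60, 35], [-60, 30]],
#                    "2020-01-01", "2020-01-10", pres_range=[0, 100])
```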
-
In this paper, we outline the need for a coordinated international effort toward the building of an open-access Global Ocean Oxygen Database and ATlas (GO2DAT) complying with the FAIR principles (Findable, Accessible, Interoperable, and Reusable). GO2DAT will combine data from the coastal and open ocean, as measured by the chemical Winkler titration method or by sensors (e.g., optodes, electrodes) from Eulerian and Lagrangian platforms (e.g., ships, moorings, profiling floats, gliders, ships of opportunity, marine mammals, cabled observatories). GO2DAT will further adopt a community-agreed, fully documented metadata format and a consistent quality control (QC) procedure and quality flagging (QF) system. GO2DAT will serve to support the development of advanced data analysis and biogeochemical models for improving our mapping, understanding and forecasting capabilities for ocean O2 changes and deoxygenation trends. It will offer the opportunity to develop quality-controlled data synthesis products with unprecedented spatial (vertical and horizontal) and temporal (sub-seasonal to multi-decadal) resolution. These products will support model assessment, improvement and evaluation as well as the development of climate and ocean health indicators. They will further support the decision-making processes associated with the emerging blue economy, the conservation of marine resources and their associated ecosystem services and the development of management tools required by a diverse community of users (e.g., environmental agencies, aquaculture, and fishing sectors). A better knowledge base of the spatial and temporal variations of marine O2 will improve our understanding of the ocean O2 budget, and allow better quantification of the Earth’s carbon and heat budgets. With the ever-increasing need to protect and sustainably manage ocean services, GO2DAT will allow scientists to fully harness the increasing volumes of O2 data already delivered by the expanding global ocean observing system and enable smooth incorporation of much higher quantities of data from autonomous platforms in the open ocean and coastal areas into comprehensive data products in the years to come. This paper aims at engaging the community (e.g., scientists, data managers, policy makers, service users) toward the development of GO2DAT within the framework of the UN Global Ocean Oxygen Decade (GOOD) program recently endorsed by IOC-UNESCO. A roadmap toward GO2DAT is proposed, highlighting the efforts needed (e.g., in terms of human resources).
-
Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter‐university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine‐readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot‐based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.
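The pattern can be sketched as: a language model turns the user's question into a graph query, which is then run against the metadata graph. Everything below (the schema, the prompt, and the stubbed ask_llm helper) is a hypothetical stand-in, not DataChat's actual components or APIs.

```python
# Hypothetical sketch of the chatbot-to-graph-query pattern described above.

PROMPT_TEMPLATE = (
    "You translate questions about research data into Cypher queries.\n"
    "Graph schema: (Dataset)-[:HAS_TOPIC]->(Topic), "
    "(Dataset)-[:ARCHIVED_BY]->(Repository).\n"
    "Question: {question}\nCypher:"
)

def ask_llm(prompt: str) -> str:
    """Stand-in for a large language model call (e.g., via an API client);
    returns a canned query here so the sketch runs without a model."""
    return ("MATCH (d:Dataset)-[:HAS_TOPIC]->(:Topic {name: 'voting behavior'}) "
            "RETURN d.title, d.doi LIMIT 5")

def question_to_graph_query(question: str) -> str:
    """LLM step: turn a user's question into a graph query, which would then be
    executed against the metadata graph (e.g., with a Neo4j driver)."""
    return ask_llm(PROMPT_TEMPLATE.format(question=question))

if __name__ == "__main__":
    print(question_to_graph_query("Which datasets cover voting behavior?"))
```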