Title: Synergies between centralized and federated approaches to data quality: a report from the national COVID cohort collaborative
Abstract
Objective
In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history, with over 6.4 million patients, and is a testament to a partnership of over 100 organizations.
Materials and Methods
We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements.
Results
Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback.
Discussion
We encountered site-to-site differences in DQ that would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate.
Conclusion
By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.
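Automated DQ heuristics of the kind described above can be sketched as simple per-site checks run centrally, with failing sites receiving feedback. The sketch below flags sites whose demographic fields fall below a completeness threshold; the site names, field names, and 95% cutoff are illustrative assumptions, not N3C's actual heuristics.

```python
# Sketch of one centralized data-quality heuristic: per-site demographic
# completeness checking. Sites, fields, and the threshold are invented.

def completeness(records, field):
    """Fraction of records with a non-missing value for `field`.
    Treats None, empty string, and "Unknown" as missing."""
    filled = sum(1 for r in records if r.get(field) not in (None, "", "Unknown"))
    return filled / len(records) if records else 0.0

def flag_sites(site_records, fields, threshold=0.95):
    """Return {site: [fields below threshold]} for feedback to data partners."""
    flags = {}
    for site, records in site_records.items():
        low = [f for f in fields if completeness(records, f) < threshold]
        if low:
            flags[site] = low
    return flags

site_records = {
    "site_a": [{"sex": "F", "race": "White"}, {"sex": "M", "race": ""}],
    "site_b": [{"sex": "F", "race": "Asian"}, {"sex": "M", "race": "Black"}],
}
print(flag_sites(site_records, ["sex", "race"]))  # site_a flagged on "race"
```

A central pipeline can run such checks on every ingest, which is what makes site-to-site benchmarking possible in a way purely federated checks are not.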
Peng, Le; Luo, Gaoxiang; Walker, Andrew; Zaiman, Zachary; Jones, Emma K.; Gupta, Hemant; Kersten, Kristopher; Burns, John L.; Harle, Christopher A.; Magoc, Tanja; et al. (Journal of the American Medical Informatics Association)
Abstract
Objective
Federated learning (FL) allows multiple distributed data holders to collaboratively learn a shared model without data sharing. However, individual health system data are heterogeneous. “Personalized” FL variations have been developed to counter data heterogeneity, but few have been evaluated using real-world healthcare data. The purpose of this study is to investigate the performance of a single-site versus a 3-client federated model using a previously described Coronavirus Disease 2019 (COVID-19) diagnostic model. Additionally, to investigate the effect of system heterogeneity, we evaluate the performance of 4 FL variations.
Materials and methods
We leverage an FL healthcare collaborative including data from 5 international healthcare systems (US and Europe) encompassing 42 hospitals. We implemented a COVID-19 computer vision diagnosis system using the Federated Averaging (FedAvg) algorithm on Clara Train SDK 4.0. To study the effect of data heterogeneity, training data were pooled from 3 systems locally and federation was simulated. We compared a centralized/pooled model against FedAvg and 3 personalized FL variations (FedProx, FedBN, and FedAMP).
Results
We observed comparable model performance with respect to internal validation (local model: AUROC 0.94 vs FedAvg: 0.95, P = .5) and improved model generalizability with the FedAvg model (P < .05). When investigating the effects of model heterogeneity, we observed poor performance with FedAvg on internal validation as compared to personalized FL algorithms. FedAvg did have improved generalizability compared to personalized FL algorithms. On average, FedBN had the best rank performance on internal and external validation.
Conclusion
FedAvg can significantly improve the generalizability of the model compared to personalized FL algorithms; however, this comes at the cost of poorer internal validity. Personalized FL may offer an opportunity to develop both internally and externally validated algorithms.
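The core of the FedAvg algorithm evaluated above is a server-side weighted average of client model parameters, weighted by each client's local sample count. The sketch below shows only that aggregation step on plain parameter vectors; the local training loop, and anything specific to Clara Train SDK, is omitted, and the client weights and sizes are invented.

```python
# Minimal sketch of the FedAvg aggregation step: the server averages
# client parameter vectors, weighted by local dataset sizes.

def fedavg_round(client_weights, client_sizes):
    """Weighted average of client parameter vectors (lists of floats)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Three simulated clients (as in the 3-client federation above) with
# heterogeneous data volumes:
weights = [[0.2, 1.0], [0.4, 0.8], [0.6, 0.6]]
sizes = [100, 300, 600]
print(fedavg_round(weights, sizes))  # approximately [0.5, 0.7]
```

Personalized variants differ mainly in what is aggregated: FedProx adds a proximal term during local training, FedBN excludes batch-normalization parameters from this average, and FedAMP replaces the single global average with per-client attentive aggregation.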
Primary biodiversity data records that are open access and available in a standardised format are essential for conservation planning and research on policy-relevant time-scales. We created a dataset to document all known occurrence data for the Federally Endangered Poweshiek skipperling butterfly [Oarisma poweshiek (Parker, 1870); Lepidoptera: Hesperiidae]. The Poweshiek skipperling was a historically common species in prairie systems across the upper Midwest, United States and Manitoba, Canada. Rapid declines have reduced the number of verified extant sites to six. Aggregating and curating Poweshiek skipperling occurrence records documents and preserves all known distributional data, which can be used to address questions related to Poweshiek skipperling conservation, ecology and biogeography. Over 3500 occurrence records were aggregated over a temporal coverage from 1872 to present. Occurrence records were obtained from 37 data providers in the conservation and natural history collection community using both “HumanObservation” and “PreservedSpecimen” as an acceptable basisOfRecord. Data were obtained in different formats and with differing degrees of quality control. During the data aggregation and cleaning process, we transcribed specimen label data, georeferenced occurrences, adopted a controlled vocabulary, removed duplicates and standardised formatting. We examined the dataset for inconsistencies with known Poweshiek skipperling biogeography and phenology, and we verified or removed inconsistencies by working with the original data providers. In total, 12 occurrence records were removed because we identified them to be the western congener Oarisma garita (Reakirt, 1866). The resulting dataset enhances the permanency of Poweshiek skipperling occurrence data in a standardised format. This is a validated and comprehensive dataset of occurrence records for the Poweshiek skipperling (Oarisma poweshiek) utilising both observation- and specimen-based records.
Occurrence data are preserved and available for continued research and conservation projects using standardised Darwin Core formatting where possible. Prior to this project, many of these occurrence records were not mobilised and were stored in individual institutional databases, researcher datasets and personal records. This dataset aggregates presence data from state conservation agencies, natural heritage programmes, natural history collections, citizen scientists, researchers and the U.S. Fish & Wildlife Service. The data include opportunistic observations and collections, research vouchers, observations collected for population monitoring and observations collected using standardised research methodologies. The aggregated occurrence records underwent cleaning efforts that improved data interoperability, removed transcription errors and verified or removed uncertain data. This dataset enhances available information on the spatiotemporal distribution of this Federally Endangered species. As part of this aggregation process, we discovered and verified Poweshiek skipperling occurrence records from two previously unknown states, Nebraska and Ohio.
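Two of the cleaning steps described above, restricting records to an accepted basisOfRecord vocabulary and removing duplicates, can be sketched as a small filter over Darwin Core fields. The field names below follow Darwin Core; the sample records are invented for illustration and the duplicate key is a simplifying assumption (real curation also compares collectors, catalog numbers, and georeferencing uncertainty).

```python
# Sketch of occurrence-record cleaning: keep only accepted Darwin Core
# basisOfRecord values, then drop exact duplicates on a simple key.

ACCEPTED_BASIS = {"HumanObservation", "PreservedSpecimen"}

def clean_occurrences(records):
    """Filter to accepted basisOfRecord and drop duplicate occurrences."""
    seen, cleaned = set(), []
    for r in records:
        if r["basisOfRecord"] not in ACCEPTED_BASIS:
            continue
        key = (r["scientificName"], r["eventDate"],
               r["decimalLatitude"], r["decimalLongitude"],
               r["basisOfRecord"])
        if key not in seen:
            seen.add(key)
            cleaned.append(r)
    return cleaned

records = [
    {"scientificName": "Oarisma poweshiek", "eventDate": "1998-07-02",
     "decimalLatitude": 43.6, "decimalLongitude": -96.1,
     "basisOfRecord": "PreservedSpecimen"},
    {"scientificName": "Oarisma poweshiek", "eventDate": "1998-07-02",
     "decimalLatitude": 43.6, "decimalLongitude": -96.1,
     "basisOfRecord": "PreservedSpecimen"},  # exact duplicate, dropped
    {"scientificName": "Oarisma poweshiek", "eventDate": "2004-06-30",
     "decimalLatitude": 46.8, "decimalLongitude": -95.8,
     "basisOfRecord": "FossilSpecimen"},  # not an accepted basisOfRecord
]
print(len(clean_occurrences(records)))  # prints 1
```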
Owen, Christopher L.; Marshall, David C.; Wade, Elizabeth J.; Meister, Russ; Goemans, Geert; Kunte, Krushnamegh; Moulds, Max; Hill, Kathy; Villet, M.; Pham, Thai-Hong; et al. (Systematic Biology)
Abstract
Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness, nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted.
One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini are reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and by developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds, and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]
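The contamination signal described above, a bimodal distribution of patristic distances with one mode near zero, can be sketched as a crude two-mode test over a taxon pair's per-locus distances. The near-zero cutoff, mode-occupancy fraction, and separation factor below are illustrative assumptions, not the pipeline's actual parameters, and the distance values are invented.

```python
# Sketch of the bimodality heuristic: a contaminated sample compared to
# its contaminant taxon shows patristic distances split between a mode
# near zero (contaminated loci) and a mode at normal divergence.

def looks_contaminated(distances, zero_cutoff=0.01, min_frac=0.1):
    """Flag a taxon pair if its per-locus patristic distances split into
    a near-zero mode and a clearly separated larger mode."""
    near_zero = [d for d in distances if d <= zero_cutoff]
    far = [d for d in distances if d > zero_cutoff]
    if not near_zero or not far:
        return False  # only one mode occupied: no bimodality
    frac = len(near_zero) / len(distances)
    # require both modes to be non-trivially occupied and well separated
    return min_frac <= frac <= 1 - min_frac and min(far) > 10 * zero_cutoff

clean = [0.41, 0.38, 0.45, 0.40, 0.39, 0.43, 0.42, 0.44, 0.37, 0.46]
mixed = [0.0, 0.004, 0.41, 0.38, 0.45, 0.40, 0.39, 0.43, 0.42, 0.44]
print(looks_contaminated(clean), looks_contaminated(mixed))  # False True
```

In the flagged case, the loci contributing the near-zero distances are the candidates for exclusion before tree inference.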
To remain competitive in the global economy, the United States needs skilled technical workers in occupations requiring a high level of domain-specific technical knowledge to meet the country’s anticipated shortage of 5 million technically-credentialed workers. The changing demographics of the country are of increasing importance to addressing this workforce challenge. According to federal data, half the students earning a certificate in 2016-17 received credentials from community colleges, where the percent enrollment of Latinx (a gender-neutral term referencing Latin American cultural or racial identity) students (56%) exceeds that of other post-secondary sectors. If this enrollment rate persists, then by 2050 over 25% of all students enrolled in higher education will be Latinx. Hispanic Serving Institutions (HSIs) are essential points of access as they enroll 64% of all Latinx college students, and nearly 50% of all HSIs are 2-year institutions. Census estimates predict Latinxs are the fastest-growing segment, reaching 30% of the U.S. population while becoming the youngest group, comprising 33.5% of those under 18 years, by 2060. The demand for skilled workers in STEM fields will be met when workers reflect the diversity of the population; therefore, more students—of all ages and backgrounds—must be brought into community colleges and supported through graduation: a central focus of community colleges everywhere. While Latinx students are as likely as white students to major in STEM, their completion numbers drop dramatically: Latinx students often have distinct needs that evolved from a history of discrimination in the educational system.
HSI ATE Hub is a three-year collaborative research project funded by the National Science Foundation Advanced Technological Education Program (NSF ATE) being implemented by Florence Darlington Technical College and Science Foundation Arizona Center for STEM at Arizona State University to address the imperative that 2-year Hispanic Serving Institutions (HSIs) develop and improve engineering technology and related technician education programs in a way that is culturally inclusive. Interventions focus on strengthening grant-writing skills among CC HSIs to fund advancements in technician education and connecting 2-year HSIs with resources for faculty development and program improvement. A mixed methods approach will explore the following research questions: 1) What are the unique barriers and challenges for 2-year HSIs related to STEM program development and grant-writing endeavors? 2) How do we build capacity at 2-year HSIs to address these barriers and challenges? 3) How do mentoring efforts/styles need to differ? 4) How do existing ATE resources need to be augmented to better serve 2-year HSIs? 5) How do proposal submission and success rates compare for 2-year HSIs that have gone through the KS STEM planning process but not M-C, through the M-C cohort mentoring process but not KS, and through both interventions? The project will identify HSI-relevant resources, augment existing ATE resources, and create new ones to support 2-year HSI faculty as potential ATE grantees. To address the distinct needs of Latinx students in STEM, resources representing best practices and frameworks for cultural inclusivity, as well as faculty development will be included. Throughout, the community-based tradition of the ATE Program is being fostered with particular emphasis on forming, nurturing, and serving participating 2-year HSIs. 
This paper will discuss the need, baseline data, and early results for the three-year program, setting the stage for a series of annual papers that report new findings.
@article{osti_10349490,
place = {Country unknown/Code not available},
title = {Synergies between centralized and federated approaches to data quality: a report from the national COVID cohort collaborative},
url = {https://par.nsf.gov/biblio/10349490},
DOI = {10.1093/jamia/ocab217},
abstractNote = {Abstract Objective In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations. Materials and Methods We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements. Results Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 sites (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback. Discussion We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate. Conclusion By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.},
journal = {Journal of the American Medical Informatics Association},
volume = {29},
number = {4},
author = {Pfaff, Emily R and Girvin, Andrew T and Gabriel, Davera L and Kostka, Kristin and Morris, Michele and Palchuk, Matvey B and Lehmann, Harold P and Amor, Benjamin and Bissell, Mark and Bradwell, Katie R and Gold, Sigfried and Hong, Stephanie S and Loomba, Johanna and Manna, Amin and McMurry, Julie A and Niehaus, Emily and Qureshi, Nabeel and Walden, Anita and Zhang, Xiaohan Tanner and Zhu, Richard L and Moffitt, Richard A and Haendel, Melissa A and Chute, Christopher G and Adams, William G and Al-Shukri, Shaymaa and Anzalone, Alfred and Baghal, Ahmad and Bennett, Tellen D and Bernstam, Elmer V and Bernstam, Elmer V and Bissell, Mark M and Bush, Brian and Campion, Thomas R and Castro, Victor and Chang, Jack and Chaudhari, Deepa D and Chen, Wenjin and Chu, San and Cimino, James J and Crandall, Keith A and Crooks, Mark and Davies, Sara J and DiPalazzo, John and Dorr, David and Eckrich, Dan and Eltinge, Sarah E and Fort, Daniel G and Golovko, George and Gupta, Snehil and Haendel, Melissa A and Hajagos, Janos G and Hanauer, David A and Harnett, Brett M and Horswell, Ronald and Huang, Nancy and Johnson, Steven G and Kahn, Michael and Khanipov, Kamil and Kieler, Curtis and Luzuriaga, Katherine Ruiz and Maidlow, Sarah and Martinez, Ashley and Mathew, Jomol and McClay, James C and McMahan, Gabriel and Melancon, Brian and Meystre, Stephane and Miele, Lucio and Morizono, Hiroki and Pablo, Ray and Patel, Lav and Phuong, Jimmy and Popham, Daniel J and Pulgarin, Claudia and Santos, Carlos and Sarkar, Indra Neil and Sazo, Nancy and Setoguchi, Soko and Soby, Selvin and Surampalli, Sirisha and Suver, Christine and Vangala, Uma Maheswara and Visweswaran, Shyam and Oehsen, James von and Walters, Kellie M and Wiley, Laura and Williams, David A and Zai, Adrian},
}