Title: Using Remote Sensing and Machine Learning to Locate Groundwater Discharge to Salmon-Bearing Streams
We hypothesized that topographic features alone could be used to locate groundwater discharge, but only where diagnostic topographic signatures could first be identified through the use of limited field observations and geologic data. We built a geodatabase from geologic and topographic data, with the geologic data covering only ~40% of the study area and topographic data derived from airborne LiDAR covering the entire study area. We identified two types of groundwater discharge: shallow hillslope groundwater discharge, commonly manifested as diffuse seeps, and aquifer-outcrop groundwater discharge, commonly manifested as springs. We developed multistep manual procedures that allowed us to accurately predict the locations of both types of groundwater discharge in 93% of cases, though only where geologic data were available. However, field verification suggested that both types of groundwater discharge could be identified by specific combinations of topographic variables alone. We then applied maximum entropy modeling, a machine learning technique, to predict the prevalence of both types of groundwater discharge using six topographic variables: profile curvature range, with a permutation importance of 43.2%, followed by distance to flowlines, elevation, topographic roughness index, flow-weighted slope, and planform curvature, with permutation importances of 20.8%, 18.5%, 15.2%, 1.8%, and 0.5%, respectively. The AUC values for the model were 0.95 for training data and 0.91 for testing data, indicating outstanding model performance.
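As an illustration of the AUC statistic quoted in the abstract, the sketch below computes AUC as the probability that a randomly chosen discharge (presence) site receives a higher model score than a randomly chosen background site, with ties counted as one half. The function name and the score values are hypothetical and are not taken from the study; this is a minimal sketch of the general statistic, not the authors' workflow.

```python
def auc(presence_scores, background_scores):
    """Rank-based AUC: fraction of (presence, background) pairs in which
    the presence site has the higher score; ties count as 0.5."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Hypothetical model scores, for illustration only.
presence = [0.9, 0.8, 0.35]
background = [0.4, 0.2, 0.1, 0.8]
print(auc(presence, background))
```

A value of 0.5 corresponds to a model that ranks sites no better than chance, and 1.0 to perfect separation of presence and background sites.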
Award ID(s):
1702029 1702996 1930451
NSF-PAR ID:
10342679
Journal Name:
Remote Sensing
Volume:
14
Issue:
1
Page Range or eLocation-ID:
63
ISSN:
2072-4292
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state of the art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. 
Not every pathological feature is annotated, meaning excluded areas can include focuses particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest performing labels were background, 97% correct identification, and artifact, 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose with how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. 
The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only in situations where a micropathologist diagnosed it as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists employ antigen targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative fields IHC slides provide could play an integral role in machine model pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase of model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to the decrease of 57% in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. 
A CSV version of the annotation file is also available which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 annotated regions with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split up into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 were allotted to each of the development and evaluation sets, while the remaining 34 were allotted to train. The remaining 222 patients were split up to preserve the overall distribution of labels within the corpus. This was done in hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository-facility) are being digitized in addition to slides provided by Temple University Hospital. These data include 18 different types of tissue including approximately 38.5% urinary tissue and 16.5% gynecological tissue. 
These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnoses model could support pathologists’ workload and even help prioritize suspected cancerous cases. ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grants nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. REFERENCES [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104. https://www.springer.com/gp/book/9783030368432. [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.isip.piconepress.com/projects/nsf_dpath/. 
[3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040. https://doi.org/10.21437/interspeech.2020-3015. [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201. [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021]. [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/. [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/. [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859. [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. 
https://doi.org/10.5858/arpa.2015-0238-OA.
  2. Abstract. Numerical simulations of the Greenland Ice Sheet (GrIS) over geologic timescales can greatly improve our knowledge of the critical factors driving GrIS demise during climatically warm periods, which has clear relevance for better predicting GrIS behavior over the upcoming centuries. To assess the fidelity of these modeling efforts, however, observational constraints of past ice sheet change are needed. Across southwestern Greenland, geologic records detail Holocene ice retreat across both terrestrial-based and marine-terminating environments, providing an ideal opportunity to rigorously benchmark model simulations against geologic reconstructions of ice sheet change. Here, we present regional ice sheet modeling results using the Ice-sheet and Sea-level System Model (ISSM) of Holocene ice sheet history across an extensive fjord region in southwestern Greenland covering the landscape around the Kangiata Nunaata Sermia (KNS) glacier and extending outward along the 200 km Nuup Kangerula (Godthåbsfjord). Our simulations, forced by reconstructions of Holocene climate and recently implemented calving laws, assess the sensitivity of ice retreat across the KNS region to atmospheric and oceanic forcing. Our simulations reveal that the geologically reconstructed ice retreat across the terrestrial landscape in the study area was likely driven by fluctuations in surface mass balance in response to Early Holocene warming – and was likely not influenced significantly by the response of adjacent outlet glaciers to calving and ocean-induced melting. The impact of ice calving within fjords, however, plays a significant role by enhancing ice discharge at the terminus, leading to interior thinning up to the ice divide that is consistent with reconstructed magnitudes of Early Holocene ice thinning. 
Our results, benchmarked against geologic constraints of past ice-margin change, suggest that while calving did not strongly influence Holocene ice-margin migration across terrestrial portions of the KNS forefield, it strongly impacted regional mass loss. While these results imply that the implementation and resolution of ice calving in paleo-ice-flow models is important towards making more robust estimations of past ice mass change, they also illustrate the importance these processes have on contemporary and future long-term ice mass change across similar fjord-dominated regions of the GrIS.
  3. Abstract
    Excessive phosphorus (P) applications to croplands can contribute to eutrophication of surface waters through surface runoff and subsurface (leaching) losses. We analyzed leaching losses of total dissolved P (TDP) from no-till corn, hybrid poplar (Populus nigra X P. maximowiczii), switchgrass (Panicum virgatum), miscanthus (Miscanthus giganteus), native grasses, and restored prairie, all planted in 2008 on former cropland in Michigan, USA. All crops except corn (13 kg P ha−1 year−1) were grown without P fertilization. Biomass was harvested at the end of each growing season except for poplar. Soil water at 1.2 m depth was sampled weekly to biweekly for TDP determination during March–November 2009–2016 using tension lysimeters. Soil test P (STP; 0–25 cm depth) was measured every autumn. Soil water TDP concentrations were usually below levels where eutrophication of surface waters is frequently observed (> 0.02 mg L−1) but often higher than in deep groundwater or nearby streams and lakes. Rates of P leaching, estimated from measured concentrations and modeled drainage, did not differ statistically among cropping systems across years; 7-year cropping system means ranged from 0.035 to 0.072 kg P ha−1 year−1 with large interannual variation. Leached P was positively related to STP, which decreased over the 7 years in all systems. These results indicate that both P-fertilized and unfertilized cropping systems may …
  4. Understanding relationships among tree species, or between tree diversity, distribution, and underlying environmental gradients, is a central concern for forest ecologists, managers, and management agencies. The spatial processes underlying observed spatial patterns of trees or edaphic variables often are complex and violate two fundamental assumptions, isotropy and stationarity, of spatial statistics. Codispersion analysis is a new statistical method developed to assess spatial covariation between two spatial processes that may not be isotropic or stationary. Its application to data from large forest plots has provided new insights into mechanisms underlying observed patterns of species distributions and the relationship between individual species and underlying edaphic and topographic gradients. However, these data are not collected without error, and the performance of the codispersion coefficient when there is noise or measurement error (“contamination”) in the data heretofore has been addressed only theoretically. Here, we use Monte Carlo simulations and real datasets to investigate the sensitivity of codispersion to four types of contamination commonly seen in many forest datasets. Three of these involved comparing codispersion of a spatial dataset with a contaminated version of itself. The fourth examined differences in codispersion between tree species and soil variables, where the estimates of soil characteristics were based on complete or thinned datasets. In all cases, we found that estimates of codispersion were robust when contamination was relatively low (<15%), but were sensitive to larger percentages of contamination. We also present a useful method for imputing missing spatial data and discuss several aspects of the codispersion coefficient when applied to noisy data to gain more insight about the performance of codispersion in practice.
  5. ABSTRACT Cyanobacteria are prokaryotes capable of oxygenic photosynthesis, and frequently, nitrogen fixation as well. As a result, they contribute substantially to global primary production and nitrogen cycles. Furthermore, the multicellular filamentous cyanobacteria in taxonomic subsections IV and V are developmentally complex, exhibiting an array of differentiated cell types and filaments, including motile hormogonia, making them valuable model organisms for studying development. To investigate the role of sigma factors in the gene regulatory network (GRN) controlling hormogonium development, a combination of genetic, immunological, and time-resolved transcriptomic analyses were conducted in the model filamentous cyanobacterium Nostoc punctiforme, which, unlike other common model cyanobacteria, retains the developmental complexity of field isolates. The results support a model where the hormogonium GRN is driven by a hierarchal sigma factor cascade, with sigJ activating the expression of both sigC and sigF, as well as a substantial portion of additional hormogonium-specific genes, including those driving changes to cellular architecture. In turn, sigC regulates smaller subsets of genes for several processes, plays a dominant role in promoting reductive cell division, and may also both positively and negatively regulate sigJ to reinforce the developmental program and coordinate the timing of gene expression, respectively. In contrast, the sigF regulon is extremely limited. Among genes with characterized roles in hormogonium development, only pilA shows stringent sigF dependence. For sigJ-dependent genes, a putative consensus promoter was also identified, consisting primarily of a highly conserved extended −10 region, here designated a J-Box, which is widely distributed among diverse members of the cyanobacterial lineage. 
IMPORTANCE Cyanobacteria are integral to global carbon and nitrogen cycles, and their metabolic capacity coupled with their ease of genetic manipulation make them attractive platforms for applications such as biomaterial and biofertilizer production. Achieving these goals will likely require a detailed understanding and precise rewiring of these organisms’ GRNs. The complex phenotypic plasticity of filamentous cyanobacteria has also made them valuable models of prokaryotic development. However, current research has been limited by focusing primarily on a handful of model strains which fail to reflect the phenotypes of field counterparts, potentially limiting biotechnological advances and a more comprehensive understanding of developmental complexity. Here, using Nostoc punctiforme, a model filamentous cyanobacterium that retains the developmental range of wild isolates, we define previously unknown definitive roles for a trio of sigma factors during hormogonium development. These findings substantially advance our understanding of cyanobacterial development and gene regulation and could be leveraged for future applications.