
Title: Characterizing sediment sources by non-negative matrix factorization of detrital geochronological data
This paper explores an inverse approach to characterizing the age distributions of sediment sources (“source” samples) from samples collected at a particular depocenter (“sink” samples) using non-negative matrix factorization (NMF). It also outlines a method to determine the optimal number of sources to factorize from a set of sink samples (i.e., the optimum factorization rank). We demonstrate the power of this method by generating sink samples as random mixtures of known sources, factorizing them, and recovering the number of known sources, their age distributions, and the weighting functions used to generate the sink samples. Sensitivity testing indicates that similarity between factorized and known sources is positively correlated with 1) the number of sink samples, 2) the dissimilarity among sink samples, and 3) sink sample size. Specifically, the algorithm yields consistent, close similarity between factorized and known sources when the number of sink samples is more than ∼3 times the number of source samples, sink data sets are internally dissimilar (cross-correlation coefficient range >0.3, Kuiper V value range >0.35), and sink samples are well-characterized (>150–225 data points). However, similarity between known and factorized sources can be maintained while decreasing some of these variables if other variables are increased. Factorization of three empirical detrital zircon U–Pb data sets from the Book Cliffs, the Grand Canyon, and the Gulf of Mexico yields plausible source age distributions and weights. Factorization of the Book Cliffs data set yields five sources very similar to those recently independently proposed as the primary sources for Book Cliffs strata, confirming the utility of the NMF approach. The Grand Canyon data set exemplifies two general considerations when applying the NMF algorithm. 
First, although the NMF algorithm is able to identify source age distributions, additional geological details are required to discriminate between primary and recycled sources. Second, the NMF algorithm will identify the most basic elements of the mixed sink samples and so may subdivide sources that are themselves heterogeneous mixtures of more basic elements into those basic elements. Finally, application to a large Gulf of Mexico data set highlights the increased contribution from Appalachian sources during Cretaceous and Holocene time, potentially attributable to drainage reorganization. Although the algorithm reproduces known sources and yields reasonable sources for empirical data sets, inversions are inherently non-unique. Consequently, the results of NMF and their interpretations should be evaluated in light of independent geological evidence. The NMF algorithm is provided both as MATLAB code and a stand-alone graphical user interface for Windows and macOS (.exe and .app) along with all data sets discussed in this contribution.
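The synthetic test described in the abstract (mix known sources into sink samples, then factorize) can be sketched in a few lines. The following is a minimal illustration only, not the authors' MATLAB implementation: it uses the standard Lee and Seung multiplicative updates for the Frobenius norm, and the two source distributions, the eight random mixing weights, and all function names are made-up assumptions for demonstration.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for rv, rwh in zip(V, WH)
               for v, wh in zip(rv, rwh)) ** 0.5


def nmf(V, rank, iterations=500, seed=0, eps=1e-12):
    """Factorize V (sinks x age bins) into W (weights) and H (source distributions)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    # Strictly positive random start so that the multiplicative updates stay non-negative.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Hypothetical known sources: two age distributions over six age bins.
sources = [
    [0.6, 0.3, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.2, 0.3, 0.4],
]
# Eight synthetic sink samples, each a random mixture of the two sources.
mix_rng = random.Random(1)
weights = []
for _ in range(8):
    w = mix_rng.random()
    weights.append([w, 1.0 - w])
sinks = matmul(weights, sources)

W0, H0 = nmf(sinks, rank=2, iterations=0)  # the random starting point
W, H = nmf(sinks, rank=2)                  # after 500 update steps
start_error = frobenius_error(sinks, W0, H0)
final_error = frobenius_error(sinks, W, H)
print(start_error, final_error)
```

In this layout each row of `H` plays the role of a factorized source and each row of `W` holds the weights for one sink sample. Because the inversion is non-unique, the recovered rows may be rescaled or reordered relative to the known sources, so any comparison should use a similarity measure rather than an exact match.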
Journal Name:
Earth and planetary science letters
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Detrital zircon (DZ) U–Pb geochronology has improved the way geologists approach questions of sediment provenance and stratigraphic age. However, there is debate about what constitutes an appropriate sample size (i.e., the number of dates in a DZ sample, n), which depends on project objectives, sample complexity, and, critically, analytical budget. Additionally, there is ongoing concern about bias introduced by zircon grain size. We tested a recently developed rapid (3 s/analysis) data acquisition method by multicollector laser ablation‐inductively coupled plasma‐mass spectrometry (LA‐ICP‐MS) that incorporates an automated selection routine and calculates two‐dimensional grain geometry from polished sample surfaces. Eleven samples were analysed from below and above the Late Cretaceous (Campanian) basal Castlegate unconformity of the Book Cliffs, Utah, in a down‐depositional‐dip transect including Price, Horse, Tusher, and Thompson canyons. In total, 12,448 new concordant dates were generated during two measurement sessions. Results are consistent with recent studies suggesting there is no major provenance change and little time (1–2 Myr) represented across the unconformity. Grain size and sample size both exert a strong control on sample dissimilarity. Age distributions constructed from subsamples of large grains are systematically less similar to whole samples; age distributions composed of small grains are overall more similar to whole samples. As such, North American sediment sources that produce large grains such as the Grenville and Yavapai‐Mazatzal belts can bias age distributions if only large grains are analysed. A sample size of n = 100 is inadequate for characterizing age distributions as complex as those of the Book Cliffs, whereas a sample size of n = 300 provides good characterization. 
A sample size of n ≈ 1000 or more is unnecessary unless project objectives include scanning for subordinate age groups, such as when identifying the youngest grains for calculating a maximum depositional age (MDA). Dates used in MDA calculations acquired with rapid acquisition are best re‐analysed with longer LA‐ICP‐MS acquisition methods or isotope dilution thermal ionization mass spectrometry for increased accuracy and precision. We include new MATLAB code and open‐source software programs, DZpick and DZmda, for automated spot picking and calculating MDAs.
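As a rough illustration of what an MDA calculation can involve, the sketch below takes the inverse-variance weighted mean of the k youngest dates. This is only one simplified member of a family of MDA estimators and is not claimed to be the method implemented in DZmda; the function names and the example dates are hypothetical, and a real workflow would also check that the chosen dates overlap within uncertainty.

```python
import math


def weighted_mean_age(ages, sigmas):
    """Inverse-variance weighted mean and its 1-sigma uncertainty."""
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * a for w, a in zip(weights, ages)) / total
    return mean, math.sqrt(1.0 / total)


def youngest_cluster_mda(dates, k=3):
    """dates: list of (age_Ma, sigma_Ma) tuples. Uses the k youngest dates."""
    youngest = sorted(dates)[:k]
    ages = [age for age, _ in youngest]
    sigmas = [sigma for _, sigma in youngest]
    return weighted_mean_age(ages, sigmas)


# Made-up dates in Ma with 1-sigma uncertainties, for illustration only.
dates = [(82.0, 2.0), (80.0, 2.0), (78.0, 2.0), (95.0, 3.0), (1100.0, 20.0)]
mda, mda_sigma = youngest_cluster_mda(dates, k=3)
print(mda, mda_sigma)
```

Because the uncertainty of the weighted mean shrinks as more overlapping young dates are included, large-n data sets help MDA work mainly by increasing the chance of capturing several grains from the youngest age group.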

  2.
    Abstract Orthoquartzite detrital source regions in the Cordilleran interior yield clast populations with distinct spectra of paleomagnetic inclinations and detrital zircon ages that can be used to trace the provenance of gravels deposited along the western margin of the Cordilleran orogen. An inventory of characteristic remanent magnetizations (CRMs) from >700 sample cores from orthoquartzite source regions defines a low-inclination population of Neoproterozoic–Paleozoic age in the Mojave Desert–Death Valley region (and in correlative strata in Sonora, Mexico) and a moderate- to high-inclination population in the 1.1 Ga Shinumo Formation in eastern Grand Canyon. Detrital zircon ages can be used to distinguish Paleoproterozoic to mid-Mesoproterozoic (1.84–1.20 Ga) clasts derived from the central Arizona highlands region from clasts derived from younger sources that contain late Mesoproterozoic zircons (1.20–1.00 Ga). Characteristic paleomagnetic magnetizations were measured in 44 densely cemented orthoquartzite clasts, sampled from lower Miocene portions of the Sespe Formation in the Santa Monica and Santa Ana mountains and from a middle Eocene section in Simi Valley. Miocene Sespe clast inclinations define a bimodal population with modes near 15° and 45°. Eight samples from the steeper Miocene mode for which detrital zircon spectra were obtained all have spectra with peaks at 1.2, 1.4, and 1.7 Ga. One contains Paleozoic and Mesozoic peaks and is probably Jurassic. The remaining seven define a population of clasts with the distinctive combination of moderate to high inclination and a cosmopolitan age spectrum with abundant grains younger than 1.2 Ga. The moderate to high inclinations rule out a Mojave Desert–Death Valley or Sonoran region source population, and the cosmopolitan detrital zircon spectra rule out a central Arizona highlands source population. 
The Shinumo Formation, presently exposed only within a few hundred meters elevation of the bottom of eastern Grand Canyon, thus remains the only plausible, known source for the moderate- to high-inclination clast population. If so, then the Upper Granite Gorge of the eastern Grand Canyon had been eroded to within a few hundred meters of its current depth by early Miocene time (ca. 20 Ma). Such an unroofing event in the eastern Grand Canyon region is independently confirmed by (U-Th)/He thermochronology. Inclusion of the eastern Grand Canyon region in the Sespe drainage system is also independently supported by detrital zircon age spectra of Sespe sandstones. Collectively, these data define a mid-Tertiary, SW-flowing “Arizona River” drainage system between the rapidly eroding eastern Grand Canyon region and coastal California. 
  3. Abstract. End-member mixing analysis (EMMA) is a method of interpreting stream water chemistry variations and is widely used for chemical hydrograph separation. It is based on the assumption that stream water is a conservative mixture of varying contributions from well-characterized source solutions (end-members). These end-members are typically identified by collecting samples of potential end-member source waters from within the watershed and comparing these to the observations. Here we introduce a complementary data-driven method (convex hull end-member mixing analysis – CHEMMA) to infer the end-member compositions and their associated uncertainties from the stream water observations alone. The method involves two steps. The first uses convex hull nonnegative matrix factorization (CH-NMF) to infer possible end-member compositions by searching for a simplex that optimally encloses the stream water observations. The second step uses constrained K-means clustering (COP-KMEANS) to classify the results from repeated applications of CH-NMF and analyzes the uncertainty associated with the algorithm. In an example application utilizing the 1986 to 1988 Panola Mountain Research Watershed dataset, CHEMMA is able to robustly reproduce the three field-measured end-members found in previous research using only the stream water chemical observations. CHEMMA also suggests that a fourth and a fifth end-member can be (less robustly) identified. We examine uncertainties in end-member identification arising from non-uniqueness, which is related to the data structure, of the CH-NMF solutions, and from the number of samples using both real and synthetic data. The results suggest that the mixing space can be identified robustly when the dataset includes samples that contain extremely small contributions of one end-member, i.e., samples containing extremely large contributions from one end-member are not necessary but do reduce uncertainty about the end-member composition. 
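The conservative-mixing assumption behind EMMA can be made concrete with a small worked example. The sketch below is not CH-NMF or COP-KMEANS; it only shows the forward mixing relationship those methods build on: with three end-members and two tracers, the mixing fractions of a sample follow from two tracer balances plus the constraint that fractions sum to one. The end-member and sample values and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def mixing_fractions(end_members, sample):
    """Solve for the fractions of three end-members in a two-tracer sample.

    end_members: three (tracer1, tracer2) tuples; sample: one (tracer1, tracer2) tuple.
    """
    A = [
        [em[0] for em in end_members],  # tracer 1 mass balance
        [em[1] for em in end_members],  # tracer 2 mass balance
        [1.0, 1.0, 1.0],                # fractions sum to one
    ]
    b = [sample[0], sample[1], 1.0]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("end-members are collinear in tracer space")
    fractions = []
    for col in range(3):
        # Cramer's rule: replace one column of A by the right-hand side.
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][col] = b[r]
        fractions.append(det3(Ai) / d)
    return fractions


# Made-up tracer concentrations, for illustration only.
end_members = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
sample = (2.0, 3.0)
fractions = mixing_fractions(end_members, sample)
# A sample lies inside the mixing simplex when every fraction is non-negative.
inside_hull = all(f >= 0.0 for f in fractions)
print(fractions, inside_hull)
```

A negative fraction means the sample falls outside the simplex spanned by the candidate end-members, which is the geometric condition that a simplex-enclosing search tries to avoid for as many observations as possible.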
  4.
    Abstract The provocative hypothesis that the Shinumo Sandstone in the depths of Grand Canyon was the source for clasts of orthoquartzite in conglomerate of the Sespe Formation of coastal California, if verified, would indicate that a major river system flowed southwest from the Colorado Plateau to the Pacific Ocean prior to opening of the Gulf of California, and would imply that Grand Canyon had been carved to within a few hundred meters of its modern depth at the time of this drainage connection. The proposed Eocene Shinumo-Sespe connection, however, is supported by neither detrital zircon nor paleomagnetic-inclination data and is refuted by thermochronology that shows that the Shinumo Sandstone of eastern Grand Canyon was >60 °C (∼1.8 km deep) and hence not incised at this time. A proposed 20 Ma (Miocene) Shinumo-Sespe drainage connection based on clasts in the Sespe Formation is also refuted. We point out numerous caveats and non-unique interpretations of paleomagnetic data from clasts. Further, our detrital zircon analysis requires diverse sources for Sespe clasts, with better statistical matches for the four “most-Shinumo-like” Sespe clasts with quartzites of the Big Bear Group and Ontario Ridge metasedimentary succession of the Transverse Ranges, Horse Thief Springs Formation from Death Valley, and Troy Quartzite of central Arizona. Diverse thermochronologic and geologic data also refute a Miocene river pathway through western Grand Canyon and Grand Wash trough. Thus, Sespe clasts do not require a drainage connection from Grand Canyon or the Colorado Plateau and provide no constraints for the history of carving of Grand Canyon. Instead, abundant evidence refutes the “old” (70–17 Ma) Grand Canyon models and supports a <6 Ma Grand Canyon. 
  5. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state of the art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. 
Not every pathological feature is annotated, meaning excluded areas can include focuses particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest performing labels were background, 97% correct identification, and artifact, 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose with how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. 
The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only in situations where a micropathologist diagnosed it as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists employ antigen targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative fields IHC slides provide could play an integral role in machine model pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase of model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to the decrease of 57% in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. 
A CSV version of the annotation file is also available which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 annotated regions with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split up into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 were allotted to each of the development and evaluation sets, while the remaining 34 were allotted to train. The remaining 222 patients were split up to preserve the overall distribution of labels within the corpus. This was done in hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository ( -facility) are being digitized in addition to slides provided by Temple University Hospital. This data includes 18 different types of tissue including approximately 38.5% urinary tissue and 16.5% gynecological tissue. These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. 
We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnosis model could ease pathologists’ workload and even help prioritize suspected cancerous cases.
    ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grants nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
    REFERENCES
    [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104.
    [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.
    [3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040.
    [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344.
    [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: [Accessed: 01-Aug-2021].
    [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.
    [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020.
    [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4.
    [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016.
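The balanced training set described in the poster abstract (a fixed number of patches drawn per label) can be sketched as follows. This is an illustrative assumption about how such sampling might be done, not the project's actual tooling; the function name, the toy patch identifiers, and the per-label count of 5 (standing in for the 1,000 used in the pilot study) are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(patches, per_label, seed=0):
    """Draw at most per_label patch ids for every label.

    patches: list of (patch_id, label) tuples.
    Returns a dict mapping each label to its sampled patch ids.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for patch_id, label in patches:
        by_label[label].append(patch_id)
    selected = {}
    for label, ids in sorted(by_label.items()):
        # Labels with fewer patches than requested contribute all they have.
        count = min(per_label, len(ids))
        selected[label] = rng.sample(ids, count)
    return selected


# Toy patch list using three label codes that appear in the abstract.
patches = (
    [(f"norm_{i}", "norm") for i in range(50)]
    + [(f"indc_{i}", "indc") for i in range(8)]
    + [(f"susp_{i}", "susp") for i in range(3)]
)
selected = balanced_sample(patches, per_label=5)
print({label: len(ids) for label, ids in selected.items()})
```

Capping each label at the same count keeps abundant classes such as background from dominating training, at the cost of discarding data; rare labels with fewer patches than the cap remain under-represented, which is consistent with the abstract's observation that labels with more development patches performed better.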