Title: MyCeno
Summary: MyCeno (Mycology of the Cenozoic) 1.0 contains 4,209 records of fossil fungi from around the world, spanning the Cenozoic era (66 million years ago to present). The dataset consists of records of palynomorphs (spores and other microscopic fossils) as well as fungal macrofossils. Every record comes with information about the fossil's location, estimated age range, and geology. This includes latitude and longitude coordinates, names or descriptions of the fungal fossil found, the technique used for dating the fossil, a grade for the level of dating uncertainty, and full citations for the primary source and any supporting literature. Additionally, 90% of records have a recorded sediment type, 72% have geological formation/member/bed names, and 83% have a DOI or hyperlink to the primary source. 86% of records have a current valid scientific name attributed to the fossil, with name authors and synonyms listed. For these records, the higher classification (i.e. the closest higher taxonomic classification that the identified fungus belongs to, from family level upwards) is also recorded, as well as whether or not the genus is extant. Nearest living relatives have been identified for 20% of records. Fossil ages in the dataset concentrate around the Miocene, but cover different epochs across the Cenozoic.

Usage & Applications: This dataset was designed to be easy to use. Each variable has its own column, and the table is uploaded as a comma-separated values (CSV) file so that it can be opened in various programmes, depending on user preference. For example, it can be opened in Microsoft Excel, or viewed and manipulated with code, for instance in RStudio. This dataset will prove valuable to people interested in studying ancient fungal diversity, understanding the evolution of fungi, or reconstructing palaeoecology, palaeoenvironments or palaeoclimates.
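As a minimal sketch of what working with a one-column-per-variable CSV table in code might look like, the Python snippet below parses a small in-memory table and counts records per epoch. The column names (`fossil_name`, `epoch`, `latitude`, `longitude`) and the sample rows are hypothetical placeholders invented for illustration; they are not the actual MyCeno headers or data.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a CSV file: the column names and rows below are
# invented for illustration and are NOT taken from the MyCeno dataset.
SAMPLE_CSV = """fossil_name,epoch,latitude,longitude
Example spore A,Miocene,10.5,100.2
Example spore B,Miocene,48.7,19.1
Example macrofossil C,Eocene,-33.9,18.4
"""


def load_records(handle):
    """Read every row of a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def count_by_column(records, column):
    """Count how many records share each value in the given column."""
    return Counter(row[column] for row in records)


records = load_records(io.StringIO(SAMPLE_CSV))
epoch_counts = count_by_column(records, "epoch")
print(epoch_counts)
```

To read a real file instead of the in-memory sample, the same `load_records` function can be given a handle from `open(path, newline="", encoding="utf-8")`, with the column name passed to `count_by_column` replaced by one that actually appears in the file's header row.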
Award ID(s):
2015813
PAR ID:
10536706
Publisher / Repository:
Zenodo
Subject(s) / Keyword(s):
palynology, palaeontology, geology, palaeomycology, fungi, fungal spores, palaeoecology, palaeoclimates, biodiversity, taxonomy
Format(s):
Medium: X
Right(s):
Creative Commons Attribution Share Alike 4.0 International
Sponsoring Org:
National Science Foundation
More Like this
  1. The middle Miocene Climate Optimum (MMCO) was the warmest interval of the last 23 million years and is one of the best analogs for proposed future climate change scenarios. Fungi play a key role in the terrestrial carbon cycle as dominant decomposers of plant debris, and through their interactions with plants and other organisms as symbionts, parasites, and endobionts. Thus, their study in the fossil record, especially during the MMCO, is essential to better understand biodiversity changes and terrestrial carbon cycle dynamics in past analogous environments, as well as to model future ecological and climatic scenarios. The fossil record also offers a unique long-term, large-scale dataset to evaluate fungal assemblage dynamics across long temporal and spatial scales, providing a better understanding of how ecological factors influenced assemblage development through time. In this study, we assessed the fungal diversity and community composition recorded in two geological sections from the middle Miocene from the coal mines of Thailand and Slovakia. We used presence-absence data to quantify the fungal diversity of each locality. Spores and other fungal remains were identified to modern taxa whenever possible; laboratory codes and fossil names were used when this correlation was not possible. This study represents the first of its kind for Thailand, and it expands existing work from Slovakia. Our results indicate a total of 281 morphotaxa. This work will allow us to use modern ecological data to make inferences about ecosystem characteristics and community dynamics for the studied regions. It opens new horizons for the study of past fungal diversity based on modern fungal ecological analyses. It also sheds light on how global variations in fungal species richness and community composition were affected by different climatic conditions and under rapid increases of temperature in the past to make inferences for the near climatic future. 
  2. Typosquatting (the practice of registering a domain name similar to another, usually well-known, domain name) is typically intended to drive traffic to a website for malicious or profit-driven purposes. In this paper we assess the current state of typosquatting, both broadly (across a wide variety of techniques) and deeply (using an extensive and novel dataset). Our breadth derives from the application of eight different candidate-generation techniques to a selection of the most popular domain names. Our depth derives from probing the resulting name set via a unique corpus comprising over 3.3B Domain Name System (DNS) records. We find that over 2.3M potential typosquatting names have been registered that resolve to an IP address. We then assess those names using a framework focused on identifying the intent of the domain from the perspectives of DNS and webpage clustering. Using the DNS information, HTTP responses, and Google SafeBrowsing, we classify the candidate typosquatting names as resolved to private IP, malicious, defensive, parked, legitimate, or unknown intents. Our findings provide the largest-scale and most-comprehensive perspective to date on typosquatting, exposing potential risks to users. Further, our methodology provides a blueprint for tracking and classifying typosquatting on an ongoing basis.
  3. Typosquatting (the practice of registering a domain name similar to another, usually well-known, domain name) is typically intended to drive traffic to a website for malicious or profit-driven purposes. In this paper we assess the current state of typosquatting, both broadly (across a wide variety of techniques) and deeply (using an extensive and novel dataset). Our breadth derives from the application of eight different candidate-generation techniques to a selection of the most popular domain names. Our depth derives from probing the resulting name set via a unique corpus comprising over 3.3B Domain Name System (DNS) records. We find that over 2.3M potential typosquatting names have been registered that resolve to an IP address. We then assess those names using a framework focused on identifying the intent of the domain from the perspectives of DNS and webpage clustering. Using the DNS information, HTTP responses, and Google SafeBrowsing, we classify the candidate typosquatting names as resolved to private IP, malicious, defensive, parked, legitimate, or unknown intents. Our findings provide the largest-scale and most-comprehensive perspective to date on typosquatting, exposing potential risks to users. Further, our methodology provides a blueprint for tracking and classifying typosquatting on an ongoing basis.
  4. We report a dataset of all known and published occurrence records of animals of the phylum Rotifera, including Bdelloidea, Monogononta, and Seisonacea (with the exclusion of Acanthocephala), for Africa and surrounding islands and archipelagos. The dataset includes 27,225 records of 957 taxa (subspecies: 39; species: 819; genus: 81; family: 17; group: 1), gathered from 706 published papers. The published literature spans from 1854 to 2022, with the highest number of records in the decades 1990-1999 and 2010-2019. A further 230 records of "species inquirendae", "nomina nuda", and "genera inquirenda" found in the published literature were not included in the dataset. Almost 90% of the data are georeferenced.
     The African countries with the highest number of taxa are Nigeria, Algeria, South Africa, and Democratic Republic of the Congo, whereas no records are yet available for a dozen countries. The number of species known from each country can be explained mostly by sampling efforts, measured as the number of papers published for each country up to October 2022.
     This detailed literature search increased the number of known rotifer taxa at species, subspecies, form and variety level reported in previous reviews, which were 639 in 1986 (De Ridder, 1986) and 765 in 2022 (Smolak et al., 2022). Of the taxa reported in the current dataset, 167 (18%) are Bdelloidea, 665 (69.8%) Ploima, 97 (10%) Flosculariaceae, 27 (3%) Collothecacea and one representative of Seisonacea, the marine epizoic rotifer Seison africanus Sørensen, Segers & Funch, 2005, described and recorded only from coastal waters of Kenya (Sørensen et al., 2005).
     The data were structured based on the Darwin Core standard (Wieczorek et al., 2012). Each row of the dataset holds one record of a rotifer taxon from a sample from Africa and surrounding islands, as cited in the literature. The columns report the original and updated taxon name, additional taxonomic information, the origin of the data, and habitat.
     Invalid names (i.e. at the level of species inquirenda, nomen nudum, genus inquirendum) were not included in the records uploaded to GBIF. All names were also checked against the backbone of GBIF.
  5. Abstract
     Purpose: The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in the spelling of individual scholars' names further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors containing the same surname and first initial of their first name. We illustrate our approach using three case studies.
     Design/methodology/approach: The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net: a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his or her first name (e.g., a search for ‘John Doe’ would assume the form: ‘Doe, J’). Results for this search typically contain all of the scholarship legitimately belonging to this author in the given database (i.e., all of his or her true positives), along with a large amount of noise, or scholarship not belonging to this author (i.e., a large number of false positives). From this corpus we proceed to iteratively weed out false positives and retain true positives. Author identifiers provide a good starting point: if ‘Doe, J’ and ‘Doe, John’ share the same author identifier, this would be sufficient for us to conclude these are one and the same individual. We find email addresses similarly adequate: if two author names which share the same surname and same first initial have an email address in common, we conclude these authors are the same person. Author identifier and email address data is not always available, however. When this occurs, other fields are used to address the author uncertainty problem. Commonalities among author data other than unique identifiers and email addresses are less conclusive for name consolidation purposes. For example, if ‘Doe, John’ and ‘Doe, J’ have an affiliation in common, do we conclude that these names belong to the same person? They may or may not; affiliations have employed two or more faculty members sharing the same last name and first initial. Similarly, it’s conceivable that two individuals with the same last name and first initial publish in the same journal, publish with the same co-authors, and/or cite the same references. Should we then ignore commonalities among these fields and conclude they’re too imprecise for name consolidation purposes? It is our position that such commonalities are indeed valuable for addressing the author uncertainty problem, but more so when used in combination. Our approach makes use of automation as well as manual inspection, relying initially on author identifiers, then commonalities among fielded data other than author identifiers, and finally manual verification. To achieve name consolidation independent of author identifier matches, we have developed a procedure that is used with bibliometric software called VantagePoint (see www.thevantagepoint.com). While the application of our technique does not exclusively depend on VantagePoint, it is the software we find most efficient in this study. The script we developed implements our name disambiguation procedure in a way that significantly reduces manual effort on the user’s part. Those who seek to replicate our procedure independent of VantagePoint can do so by manually following the method we outline, but we note that the manual application of our procedure takes a significant amount of time and effort, especially when working with larger datasets. Our script begins by prompting the user for a surname and a first initial (for any author of interest). It then prompts the user to select a WOS field on which to consolidate author names. After this the user is prompted to point to the name of the authors field, and finally asked to identify a specific author name (referred to by the script as the primary author) within this field whom the user knows to be a true positive (a suggested approach is to point to an author name associated with one of the records that has the author’s ORCID iD or email address attached to it). The script proceeds to identify and combine all author names sharing the primary author’s surname and first initial of his or her first name who share commonalities in the WOS field on which the user was prompted to consolidate author names. This typically results in significant reduction in the initial dataset size. After the procedure completes, the user is usually left with a much smaller (and more manageable) dataset to manually inspect (and/or apply additional name disambiguation techniques to).
     Research limitations: Match field coverage can be an issue. When field coverage is paltry, dataset reduction is not as significant, which results in more manual inspection on the user’s part. Our procedure doesn’t lend itself to scholars who have had a legal family name change (after marriage, for example). Moreover, the technique we advance is (sometimes, but not always) likely to have a difficult time dealing with scholars who have changed careers or fields dramatically, as well as scholars whose work is highly interdisciplinary.
     Practical implications: The procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research, especially when the name under consideration is a more common family name. It is more effective when match field coverage is high and a number of match fields exist.
     Originality/value: Once again, the procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research. It combines preexisting with more recent approaches, harnessing the benefits of both.
     Findings: Our study applies the name disambiguation procedure we advance to three case studies. Ideal match fields are not the same for each of our case studies. We find that match field effectiveness is in large part a function of field coverage. Comparing original dataset size, the timeframe analyzed for each case study is not the same, nor are the subject areas in which they publish. Our procedure is more effective when applied to our third case study, both in terms of list reduction and 100% retention of true positives. We attribute this to excellent match field coverage, especially in more specific match fields, as well as having a more modest/manageable number of publications. While machine learning is considered authoritative by many, we do not see it as practical or replicable. The procedure advanced herein is practical, replicable, and relatively user friendly. It might be categorized into a space between ORCID and machine learning. Machine learning approaches typically look for commonalities among citation data, which is not always available, structured or easy to work with. The procedure we advance is intended to be applied across numerous fields in a dataset of interest (e.g. emails, coauthors, affiliations, etc.), resulting in multiple rounds of reduction. Results indicate that effective match fields include author identifiers, emails, source titles, co-authors and ISSNs. While the script we present is not likely to result in a dataset consisting solely of true positives (at least for more common surnames), it does significantly reduce manual effort on the user’s part. Dataset reduction (after our procedure is applied) is in large part a function of (a) field availability and (b) field coverage.
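The consolidation step described in item 5 can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' VantagePoint script: the record layout (dicts with an `author` string and a list-valued match field such as `email`), the function names, the sample records, and the choice to repeat the matching until no further names are added are all assumptions made for illustration.

```python
def name_key(author):
    """Return (surname, first initial) for a name written as 'Surname, Given'."""
    surname, _, given = author.partition(",")
    return surname.strip().lower(), given.strip()[:1].lower()


def consolidate(records, primary_author, match_field):
    """Collect name variants that share a match-field value with the primary author."""
    key = name_key(primary_author)
    # Keep only records whose author has the same surname and first initial.
    candidates = [r for r in records if name_key(r["author"]) == key]
    merged_names = {primary_author}
    changed = True
    while changed:  # assumption: repeat until no new name variant is merged
        changed = False
        # Match-field values seen so far for the names already merged.
        known_values = {
            value
            for r in candidates
            if r["author"] in merged_names
            for value in r.get(match_field, [])
        }
        for r in candidates:
            shares_value = known_values & set(r.get(match_field, []))
            if r["author"] not in merged_names and shares_value:
                merged_names.add(r["author"])
                changed = True
    return merged_names


# Hypothetical sample records, invented for illustration only.
records = [
    {"author": "Doe, John", "email": ["jdoe@example.org"]},
    {"author": "Doe, J", "email": ["jdoe@example.org"]},
    {"author": "Doe, Jane", "email": ["jane.doe@example.org"]},
    {"author": "Doe, J.", "email": []},
    {"author": "Dole, J", "email": ["jdoe@example.org"]},
]
merged = consolidate(records, "Doe, John", "email")
print(sorted(merged))
```

In this sketch, names that match on surname and first initial but share no value in the chosen field (here `Doe, Jane` and `Doe, J.`) are left unmerged, which corresponds to the smaller remainder that the abstract says still needs manual inspection or a further round on another match field.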