Title: Data Twinning
Abstract: In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.
Award ID(s):
1921873
PAR ID:
10444431
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume:
15
Issue:
5
ISSN:
1932-1864
Format(s):
Medium: X Size: p. 598-610
Sponsoring Org:
National Science Foundation
More Like this
  1. Premise: The ability to sequence genome‐scale data from herbarium specimens would allow for the economical development of data sets with broad taxonomic and geographic sampling that would otherwise not be possible. Here, we evaluate the utility of a basic double‐digest restriction site–associated DNA sequencing (ddRADseq) protocol using DNAs from four genera extracted from both silica‐dried and herbarium tissue. Methods: DNAs from Draba, Boechera, Solidago, and Ilex were processed with a ddRADseq protocol. The effects of DNA degradation, taxon, and specimen age were assessed. Results: Although taxon, preservation method, and specimen age affected data recovery, large phylogenetically informative data sets were obtained from the majority of samples. Discussion: These results suggest that herbarium samples can be incorporated into ddRADseq project designs, and that specimen age can be used as a rapid on‐site guide for sample choice. The detailed protocol we provide will allow users to pursue herbarium‐based ddRADseq projects that minimize the expenses associated with fieldwork and sample evaluation.
  2. Abstract: We propose new tests for assessing whether covariates in a treatment group and matched control group are balanced in observational studies. The tests exhibit high power under a wide range of multivariate alternatives, some of which existing tests have little power for. The asymptotic permutation null distributions of the proposed tests are studied and the P‐values calculated through the asymptotic results work well in simulation studies, facilitating the application of the test to large data sets. The tests are illustrated in a study of the effect of smoking on blood lead levels. The proposed tests are implemented in an R package BalanceCheck.
  3. Premise: Phylogenetic trees of bryophytes provide important evolutionary context for land plants. However, published inferences of overall embryophyte relationships vary considerably. We performed phylogenomic analyses of bryophytes and relatives using both mitochondrial and plastid gene sets, and investigated bryophyte plastome evolution. Methods: We employed diverse likelihood‐based analyses to infer large‐scale bryophyte phylogeny for mitochondrial and plastid data sets. We tested for changes in purifying selection in plastid genes of a mycoheterotrophic liverwort (Aneura mirabilis) and a putatively mycoheterotrophic moss (Buxbaumia), and compared 15 bryophyte plastomes for major structural rearrangements. Results: Overall land‐plant relationships conflict across analyses, generally weakly. However, an underlying (unrooted) four‐taxon tree is consistent across most analyses and published studies. Despite gene coverage patchiness, relationships within mosses, liverworts, and hornworts are largely congruent with previous studies, with plastid results generally better supported. Exclusion of RNA edit sites restores cases of unexpected non‐monophyly to monophyly for Takakia and two hornwort genera. Relaxed purifying selection affects multiple plastid genes in mycoheterotrophic Aneura but not Buxbaumia. Plastid genome structure is nearly invariant across bryophytes, but the tufA locus, presumed lost in embryophytes, is unexpectedly retained in several mosses. Conclusions: A common unrooted tree underlies embryophyte phylogeny, [(liverworts, mosses), (hornworts, vascular plants)]; rooting inconsistency across studies likely reflects substantial distance to algal outgroups. Analyses combining genomic and transcriptomic data may be misled locally for heavily RNA‐edited taxa. The Buxbaumia plastome lacks hallmarks of relaxed selection found in mycoheterotrophic Aneura. Autotrophic bryophyte plastomes, including Buxbaumia, hardly vary in overall structure.
  4. Abstract: Pleistocene diversity was much higher than today; for example, there were three distinct wolf morphotypes (dire, gray, Beringian) in North America versus one today (gray). Previous fossil evidence suggested that these three groups overlapped ecologically, but split the landscape geographically. The Natural Trap Cave (NTC) fossil site in Wyoming, USA is an ideally placed late Pleistocene site to study the geographical movement of species from northern to middle North America before, during, and after the last glacial maximum. Until now, it has been unclear what type of wolf was present at NTC. We analyzed morphometrics of three wolf groups (dire, extant North American gray, Alaskan Beringian) to determine which wolves were present at NTC and what this indicates about wolf diversity and migration in Pleistocene North America. Results show NTC wolves group with Alaskan Beringian wolves. This provides the first morphological evidence for Beringian wolves in mid‐continental North America. Their location at NTC and their radiocarbon ages suggest that they followed a temporary channel through the glaciers. Results suggest high levels of competition and diversity in Pleistocene North American wolves. The presence of mid‐continental Beringian morphotypes adds important data for untangling the history of immigration and evolution of Canis in North America.
  5. Abstract: We begin with a treatment of the Caputo time‐fractional diffusion equation, by using the Laplace transform, to obtain a Volterra integro‐differential equation. We derive and utilize a numerical scheme that is derived in parallel to the L1‐method for the time variable and a standard fourth‐order approximation in the spatial variable. The main method derived in this article has a rate of convergence of $O(k^{\alpha} + h^{4})$ for $u(x,t) \in C^{\alpha}([0,T]; C^{6}(\Omega))$, $0 < \alpha < 1$, which improves previous regularity assumptions that require $C^{2}[0,T]$ regularity in the time variable. We also present a novel alternative method for a first‐order approximation in time, under a regularity assumption of $u(x,t) \in C^{1}([0,T]; C^{6}(\Omega))$, while exhibiting order of convergence slightly more than $O(k)$ in time. This allows for a much wider class of functions to be analyzed which was previously not possible under the L1‐method. We present numerical examples demonstrating these results and discuss future improvements and implications by using these techniques.