

Title: Data Twinning
Abstract

In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.

 
Award ID(s):
1921873
NSF-PAR ID:
10444430
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume:
15
Issue:
5
ISSN:
1932-1864
Page Range / eLocation ID:
p. 598-610
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    We present a phylogenetic analysis of spiders using a dataset of 932 spider species, representing 115 families (only the family Synaphridae is unrepresented), 700 known genera, and additional representatives of 26 unidentified or undescribed genera. Eleven genera of the orders Amblypygi, Palpigradi, Schizomida and Uropygi are included as outgroups. The dataset includes six markers from the mitochondrial (12S, 16S, COI) and nuclear (histone H3, 18S, 28S) genomes, and was analysed by multiple methods, including constrained analyses using a highly supported backbone tree from transcriptomic data. We recover most of the higher‐level structure of the spider tree with good support, including Mesothelae, Opisthothelae, Mygalomorphae and Araneomorphae. Several of our analyses recover Hypochilidae and Filistatidae as sister groups, as suggested by previous transcriptomic analyses. The Synspermiata are robustly supported, and the families Trogloraptoridae and Caponiidae are found as sister to the Dysderoidea. Our results support the Lost Tracheae clade, including Pholcidae, Tetrablemmidae, Diguetidae, Plectreuridae and the family Pacullidae (restored status) separate from Tetrablemmidae. The Scytodoidea include Ochyroceratidae along with Sicariidae, Scytodidae, Drymusidae and Periegopidae; our results are inconclusive about the separation of these last two families. We did not recover monophyletic Austrochiloidea and Leptonetidae, but our data suggest that both groups are more closely related to the Cylindrical Gland Spigot clade rather than to Synspermiata. Palpimanoidea is not recovered by our analyses, but also not strongly contradicted. We find support for Entelegynae and Oecobioidea (Oecobiidae plus Hersiliidae), and ambiguous placement of cribellate orb‐weavers, compatible with their non‐monophyly. Nicodamoidea (Nicodamidae plus Megadictynidae) and Araneoidea composition and relationships are consistent with recent analyses.
We did not obtain resolution for the titanoecoids (Titanoecidae and Phyxelididae), but the Retrolateral Tibial Apophysis clade is well supported. Penestomidae, and probably Homalonychidae, are part of Zodarioidea, although the latter family was set apart by recent transcriptomic analyses. Our data support a large group that we call the marronoid clade (including the families Amaurobiidae, Desidae, Dictynidae, Hahniidae, Stiphidiidae, Agelenidae and Toxopidae). The circumscription of most marronoid families is redefined here. Amaurobiidae include the Amaurobiinae and provisionally Macrobuninae. We transfer Malenellinae (Malenella, from Anyphaenidae), Chummidae (Chumma) (new syn.) and Tasmarubriinae (Tasmarubrius, Tasmabrochus and Teeatta, from Amphinectidae) to Macrobuninae. Cybaeidae are redefined to include Calymmaria, Cryphoeca, Ethobuella and Willisius (transferred from Hahniidae), and Blabomma and Yorima (transferred from Dictynidae). Cycloctenidae are redefined to include Orepukia (transferred from Agelenidae) and Pakeha and Paravoca (transferred from Amaurobiidae). Desidae are redefined to include five subfamilies: Amphinectinae, with Amphinecta, Mamoea, Maniho, Paramamoea and Rangitata (transferred from Amphinectidae); Ischaleinae, with Bakala and Manjala (transferred from Amaurobiidae) and Ischalea (transferred from Stiphidiidae); Metaltellinae, with Austmusia, Buyina, Calacadia, Cunnawarra, Jalkaraburra, Keera, Magua, Metaltella, Penaoola and Quemusia; Porteriinae (new rank), with Baiami, Cambridgea, Corasoides and Nanocambridgea (transferred from Stiphidiidae); and Desinae, with Desis, and provisionally Poaka (transferred from Amaurobiidae) and Barahna (transferred from Stiphidiidae). Argyroneta is transferred from Cybaeidae to Dictynidae. Cicurina is transferred from Dictynidae to Hahniidae. The genera Neoramia (from Agelenidae) and Aorangia, Marplesia and Neolana (from Amphinectidae) are transferred to Stiphidiidae.
The family Toxopidae (restored status) includes two subfamilies: Myroinae, with Gasparia, Gohia, Hulua, Neomyro, Myro, Ommatauxesis and Otagoa (transferred from Desidae); and Toxopinae, with Midgee and Jamara, formerly Midgeeinae, new syn. (transferred from Amaurobiidae) and Hapona, Laestrygones, Lamina, Toxops and Toxopsoides (transferred from Desidae). We obtain a monophyletic Oval Calamistrum clade and Dionycha; Sparassidae, however, are not dionychans, but probably the sister group of those two clades. The composition of the Oval Calamistrum clade is confirmed (including Zoropsidae, Udubidae, Ctenidae, Oxyopidae, Senoculidae, Pisauridae, Trechaleidae, Lycosidae, Psechridae and Thomisidae), affirming previous findings on the uncertain relationships of the “ctenids” Ancylometes and Cupiennius, although a core group of Ctenidae are well supported. Our data were ambiguous as to the monophyly of Oxyopidae. In Dionycha, we found a first split of core Prodidomidae, excluding the Australian Molycriinae, which fall distantly from core prodidomids, among gnaphosoids. The rest of the dionychans form two main groups, Dionycha part A and part B. The former includes much of the Oblique Median Tapetum clade (Trochanteriidae, Gnaphosidae, Gallieniellidae, Phrurolithidae, Trachelidae, Gnaphosidae, Ammoxenidae, Lamponidae and the Molycriinae), and also Anyphaenidae and Clubionidae. Orthobula is transferred from Phrurolithidae to Trachelidae. Our data did not allow for complete resolution for the gnaphosoid families. Dionycha part B includes the families Salticidae, Eutichuridae, Miturgidae, Philodromidae, Viridasiidae, Selenopidae, Corinnidae and Xenoctenidae (new fam., including Xenoctenus, Paravulsor and Odo, transferred from Miturgidae, as well as Incasoctenus from Ctenidae). We confirm the inclusion of Zora (formerly Zoridae) within Miturgidae.

     
  2. Premise

    The ability to sequence genome‐scale data from herbarium specimens would allow for the economical development of data sets with broad taxonomic and geographic sampling that would otherwise not be possible. Here, we evaluate the utility of a basic double‐digest restriction site–associated DNA sequencing (ddRADseq) protocol using DNAs from four genera extracted from both silica‐dried and herbarium tissue.

    Methods

    DNAs from Draba, Boechera, Solidago, and Ilex were processed with a ddRADseq protocol. The effects of DNA degradation, taxon, and specimen age were assessed.

    Results

    Although taxon, preservation method, and specimen age affected data recovery, large phylogenetically informative data sets were obtained from the majority of samples.

    Discussion

    These results suggest that herbarium samples can be incorporated into ddRADseq project designs, and that specimen age can be used as a rapid on‐site guide for sample choice. The detailed protocol we provide will allow users to pursue herbarium‐based ddRADseq projects that minimize the expenses associated with fieldwork and sample evaluation.

     
  3. Abstract

    Molecular ecologists seek to genotype hundreds to thousands of loci from hundreds to thousands of individuals at minimal cost per sample. Current methods, such as restriction‐site‐associated DNA sequencing (RADseq) and sequence capture, are constrained by costs associated with inefficient use of sequencing data and sample preparation. Here, we introduce RADcap, an approach that combines the major benefits of RADseq (low cost with specific start positions) with those of sequence capture (repeatable sequencing of specific loci) to significantly increase efficiency and reduce costs relative to current approaches. RADcap uses a new version of dual‐digest RADseq (3RAD) to identify candidate SNP loci for capture bait design and subsequently uses custom sequence capture baits to consistently enrich candidate SNP loci across many individuals. We combined this approach with a new library preparation method for identifying and removing PCR duplicates from 3RAD libraries, which allows researchers to process RADseq data using traditional pipelines, and we tested the RADcap method by genotyping sets of 96–384 Wisteria plants. Our results demonstrate that our RADcap method: (i) methodologically reduces (to <5%) and allows computational removal of PCR duplicate reads from data, (ii) achieves 80–90% reads on target in 11 of 12 enrichments, (iii) returns consistent coverage (≥4×) across >90% of individuals at up to 99.8% of the targeted loci, (iv) produces consistently high occupancy matrices of genotypes across hundreds of individuals and (v) costs significantly less than current approaches.

     
  4. Abstract

    The Cyclophyllidea is the most diverse order of tapeworms, encompassing species that infect all classes of terrestrial tetrapods including humans and domesticated animals. Available phylogenetic reconstructions based either on morphology or molecular data lack the resolution to allow scientists to either propose a solid taxonomy or infer evolutionary associations. Molecular markers available for the Cyclophyllidea mostly include ribosomal DNA and mitochondrial loci. In this study, we identified 3641 single‐copy nuclear coding loci by comparing the genomes of Hymenolepis microstoma, Echinococcus granulosus and Taenia solium. We designed RNA baits based on the sequence of H. microstoma, and applied target enrichment and Illumina sequencing to test the utility of those baits to recover loci useful for phylogenetic analyses. We captured DNA from five species of tapeworms representing two families of cyclophyllideans. We obtained an average of 3284 (90%) of the targets from the test samples and then used captured sequences (2 181 361 bp in total; fragment size ranging from 301 to 6969 bp) to reconstruct a phylogeny for the five test species plus the three species for which genomic data are available. The results were consistent with the current consensus regarding cyclophyllidean relationships. To assess the potential for our method to yield informative genetic variation at intraspecific scales, we extracted 14 074 single nucleotide polymorphisms (SNPs) from alignments of four Arostrilepis macrocirrosa and two A. cooki and successfully inferred their relationships. The results showed that our target gene tools yield data sets that provide robust inferences at a range of taxonomic scales in the Cyclophyllidea.

     
  5. Abstract

    A whole‐genome duplication (WGD) doubles the entire genomic content of a species and is thought to have catalysed adaptive radiation in some polyploid‐origin lineages. However, little is known about general consequences of a WGD because gene duplicates (i.e., paralogs) are commonly filtered in genomic studies; such filtering may remove substantial portions of the genome in data sets from polyploid‐origin species. We demonstrate a new method that enables genome‐wide scans for signatures of selection at both nonduplicated and duplicated loci by taking locus‐specific copy number into account. We apply this method to RAD sequence data from different ecotypes of a polyploid‐origin salmonid (Oncorhynchus nerka) and reveal signatures of divergent selection that would have been missed if duplicated loci were filtered. We also find conserved signatures of elevated divergence at pairs of homeologous chromosomes with residual tetrasomic inheritance, suggesting that joint evolution of some nondiverged gene duplicates may affect the adaptive potential of these genes. These findings illustrate that including duplicated loci in genomic analyses enables novel insights into the evolutionary consequences of WGDs and local segmental gene duplications.

     