<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A New Pipeline for Removing Paralogs in Target Enrichment Data</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/19/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10274850</idno>
					<idno type="doi">10.1093/sysbio/syab044</idno>
					<title level='j'>Systematic Biology</title>
<idno>1063-5157</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Wenbin Zhou</author><author>John Soghigian</author><author>Qiu-Yun (Jenny) Xiang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Abstract            Target enrichment (such as Hyb-Seq) is a well-established high throughput sequencing method that has been increasingly used for phylogenomic studies. Unfortunately, current widely used pipelines for analysis of target enrichment data do not have a vigorous procedure to remove paralogs in target enrichment data. In this study, we develop a pipeline we call Putative Paralogs Detection (PPD) to better address putative paralogs from enrichment data. The new pipeline is an add-on to the existing HybPiper pipeline, and the entire pipeline applies criteria in both sequence similarity and heterozygous sites at each locus in the identification of paralogs. Users may adjust the thresholds of sequence identity and heterozygous sites to identify and remove paralogs according to the level of phylogenetic divergence of their group of interest. The new pipeline also removes highly polymorphic sites attributed to errors in sequence assembly and gappy regions in the alignment. We demonstrated the value of the new pipeline using empirical data generated from Hyb-Seq and the Angiosperm 353 kit for two woody genera Castanea (Fagaceae, Fagales) and Hamamelis (Hamamelidaceae, Saxifragales). Comparisons of datasets showed that the PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed evident differences between data from HybPiper and data from our new PPD pipeline. We further evaluated the accuracy and error rates of PPD by BLAST mapping of putative paralogous and orthologous sequences to a reference genome sequence of Castanea mollissima. Compared to HybPiper alone, PPD identified substantially more paralogous gene sequences that mapped to multiple regions of the reference genome (31 genes for PPD compared with 4 genes for HybPiper alone). In conjunction with HybPiper, paralogous genes identified by both pipelines can be removed resulting in the construction of more robust orthologous gene datasets for phylogenomic and divergence time analyses. Our study demonstrates the value of Hyb-Seq with data derived from the Angiosperm 353 probe set for elucidating species relationships within a genus, and argues for the importance of additional steps to filter paralogous genes and poorly aligned regions (e.g., as occur through assembly errors), such as our new PPD pipeline described in this study.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Abstract.-Target enrichment (such as Hyb-Seq) is a well-established high throughput sequencing method that has been increasingly used for phylogenomic studies. Unfortunately, current widely used pipelines for analysis of target enrichment data do not have a vigorous procedure to remove paralogs in target enrichment data. In this study, we develop a pipeline we call Putative Paralogs Detection (PPD) to better address putative paralogs from enrichment data.</p><p>The new pipeline is an add-on to the existing HybPiper pipeline, and the entire pipeline applies criteria in both sequence similarity and heterozygous sites at each locus in the identification of paralogs. Users may adjust the thresholds of sequence identity and heterozygous sites to identify and remove paralogs according to the level of phylogenetic divergence of their group of interest.</p><p>The new pipeline also removes highly polymorphic sites attributed to errors in sequence assembly and gappy regions in the alignment. We demonstrated the value of the new pipeline using empirical data generated from Hyb-Seq and the Angiosperm 353 kit for two woody genera Castanea (Fagaceae, Fagales) and Hamamelis (Hamamelidaceae, Saxifragales). Comparisons of datasets showed that the PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed evident differences between data from HybPiper and data from our new PPD pipeline. We further evaluated the accuracy and error rates of PPD by BLAST mapping of putative paralogous and orthologous sequences to a reference genome sequence of Castanea mollissima. Compared to HybPiper alone, PPD identified substantially more paralogous gene sequences that mapped to multiple regions of the reference genome (31 genes for PPD compared with 4 genes for HybPiper alone).</p><p>In conjunction with HybPiper, paralogous genes identified by both pipelines can be removed resulting in the construction of more robust orthologous gene datasets for phylogenomic and divergence time analyses. Our study demonstrates the value of Hyb-Seq with data derived from Downloaded from <ref type="url">https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425</ref> by guest on 13 July 2021</p><p>High throughput sequencing (HTS) technologies, such as those associated with amplicon sequencing, restriction site digestion, target enrichment, and transcriptome sequencing, have empowered systematists and evolutionary biologists to infer phylogeny with genome-wide molecular markers for a better understanding of species relationships and to answer evolutionary questions with new perspectives that were not possible in the past (e.g. <ref type="bibr">Pais et al. 2017</ref><ref type="bibr">Pais et al. , 2018;;</ref><ref type="bibr">Dong et al. 2019;</ref><ref type="bibr">Fu et al. 2019;</ref><ref type="bibr">One Thousand Plant Transcriptomes Initiative 2019;</ref><ref type="bibr">Du et al. 2020;</ref><ref type="bibr">Gaynor et al. 2020;</ref><ref type="bibr">Zhou et al. 2020;</ref><ref type="bibr">Thomas et al. 2021</ref>; see reviews in <ref type="bibr">Lemmon and Lemmon 2013;</ref><ref type="bibr">Dodsworth et al. 2019</ref>). Among these HTS technologies, target enrichment (Hyb-Seq in plants or sequence capture - <ref type="bibr">Weitemier et al. 2014;</ref><ref type="bibr">and ultraconserved elements, UCEs, in animals -Faircloth et al. 2012</ref>) is highly promising and increasingly used for phylogenomic studies of lineages across different evolutionary timescales (e.g. <ref type="bibr">Faircloth et al. 2013;</ref><ref type="bibr">McCormack et al. 2013;</ref><ref type="bibr">Leache et al. 2015;</ref><ref type="bibr">L&#233;veill&#233;-Bourret et al. 2018;</ref><ref type="bibr">Gaynor et al. 2020)</ref>.</p><p>The target enrichment method produces data from a targeted set of highly conserved genomic regions (and their flanking areas), often protein coding genes, using probes designed from prior knowledge of target sequences, either from the organism of interest, or a closely related species.</p><p>The method is highly valued for its repeatability between experiments and between labs if the same probes are used <ref type="bibr">(Harvey et al. 2016)</ref>, and for generating a lasting and amplifiable resource for comparative studies at multiple taxonomic scales. Data from target enrichment have been shown to be suitable to phylogenomic studies of both deep and shallow phylogenetic divergence, depending on the probes used, because the data contain both conserved coding sequences and their flanking variable sequences <ref type="bibr">(Lemmon et al. 2012;</ref><ref type="bibr">Faircloth et al. 2013;</ref><ref type="bibr">McCormack et al. 2013;</ref><ref type="bibr">Leache et al. 2015;</ref><ref type="bibr">Barrow et al. 2018;</ref><ref type="bibr">L&#233;veill&#233;-Bourret et al. 2018;</ref><ref type="bibr">Banker et al. 2020;</ref><ref type="bibr">Gaynor et al. 2020)</ref>.</p><p>Downloaded from <ref type="url">https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425</ref> by guest on 13 July 2021</p><p>The development of the Angiosperm 353 kit <ref type="bibr">(Johnson et al. 2019)</ref>, which captures 353 low copy nuclear genes across angiosperms, has enabled phylogenomic studies across angiosperm lineages from family to genus (e.g., <ref type="bibr">Gaynor et al. 2020</ref> for Diapensaceae; <ref type="bibr">Larridon et al. 2020</ref> for Cyperaceae; <ref type="bibr">Murphy et al. 2020</ref> for Nepenthes in Nepenthaceae; <ref type="bibr">Shee et al. 2020</ref> for Scheffera in Araliaceae). An explosion of phylogenomic studies using the Angiosperm 353 probes is expected in the plant systematics community in the coming years. This endeavor will result in combinable datasets for building the "tree of life" of angiosperms through global-scale analysis <ref type="bibr">(Dodsworth et al. 2019;</ref><ref type="bibr">Johnson et al. 2019)</ref>. However, the universal probe kit has a disadvantage compared to taxon-specific kits in that the 353 target genes may or may not all be single copy across all species on which the kit is used, and probe binding affinity may cause probes to target unintended paralogous sequences <ref type="bibr">(McCartney et al. 2016)</ref>. In other words, the potential high divergence of some of the 353 target genes among the diverse angiosperm genomes poses a concern on possible prevalence of paralogs in the Hyb-Seq data. It is unknown if current bioinformatic pipelines developed for analyses of target enrichment data can reliably exclude paralogous gene copies in data derived from the Angiosperm 353 probe kit.</p><p>Orthologs are genes related by descent from a common ancestor (due to a speciation event) and their evolutionary history tracks the phylogeny of species, while paralogs are products of gene duplication events. Theoretically, comparisons of paralogous copies of genes among species compromise phylogenetic inferences because the gene trees do not track speciation events, and hence, do not depict the true species relationships <ref type="bibr">(Altenhoff et al. 2019;</ref><ref type="bibr">Fig. 1a</ref>). In the Hyb-Seq data or target enrichment data, in general, the paralogous genes might be "over-lumped" by assembly methods which use sequence similarity thresholds to define homology. The overlumping of paralogs leads to inflation of sequence variation at those loci which may or may not Downloaded from <ref type="url">https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425</ref> by guest on 13 July 2021 affect the inference of species relationships, but is expected to result in misestimation of branch lengths (and thus misestimates of divergence times). Therefore, excluding paralogs in phylogenetic studies using this type of data is pivotal, although paralogous gene sequences have value in other areas of comparative genomics <ref type="bibr">(Madlung 2013;</ref><ref type="bibr">Limborg et al. 2016;</ref><ref type="bibr">McKinney et al. 2017)</ref>. However, in Hyb-Seq data, orthologs and paralogs are often difficult to distinguish due to their high similarity in sequence identity <ref type="bibr">(Altenhoff et al. 2019)</ref>. All current pipelines for target enrichment data, including HybPiper <ref type="bibr">(Johnson et al. 2016)</ref>, PHYLUCE <ref type="bibr">(Faircloth 2016)</ref>, and SECAPR <ref type="bibr">(Andermann et al. 2018)</ref>, merely consider the sequence similarity in detecting paralogs.</p><p>A sequence similarity-based approach for calling paralogous genes may be sufficient for phylogenetic studies using custom designed probes based on orthologous sequences encompassing a closely related study group. However, for studies leveraging probes built from evolutionarily distant taxa from the focal group of investigation, especially in groups where gene and genome duplication are thought to be common such as plants, sequence similarity between contig and target genes alone may not be sufficient for removing all paralogs. Additional analyses of the sequence data may be needed to remove the potential paralogous sequences before performing phylogenetic analyses. In this study we propose supplementary criteria to sequence similarity for detecting and removing problematic paralogous gene data from Hyb-Seq by examining heterozygous sites within and among individuals in the aligned sequences. Low rates of shared heterozygous sites across all samples in a species-level dataset is expected under the assumption that polymorphisms among species are more likely to be fixed differences between paralogs over deep divergences <ref type="bibr">(Eaton 2014, Eaton and</ref><ref type="bibr">Overcast 2020)</ref>. Even at the shallow level of phylogenetic divergence (e.g. population genetics), high shared heterozygosity Downloaded from <ref type="url">https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425</ref> by guest on 13 July 2021 across all samples within a locus may also be attributed to paralogs <ref type="bibr">(Hohenlohe et al. 2011;</ref><ref type="bibr">Harvey et al. 2016;</ref><ref type="bibr">McKinney et al. 2017)</ref>. Additionally, a high number of heterozygous sites of a locus within an individual may be considered an indicator of gene duplication events or previously undetected polyploidy <ref type="bibr">(Medina et al. 2019)</ref>. Therefore, a high level of shared heterozygosity at a site across individuals and high number of heterozygous sites within a locus in an individual are both indicative of paralogy of the aligned gene sequences. In pipelines developed for analyses of target enrichment data, usually an arbitrary cut-off of sequence identity value between the contigs of a putative Hyb-Seq locus and the reference target gene is used to determine if the locus contains paralogous sequences in an individual. Currently, HybPiper <ref type="bibr">(Johnson et al. 2016)</ref> uses BWA <ref type="bibr">(Li and Durbin 2009)</ref> or BLASTx <ref type="bibr">(Altschul et al. 1990</ref>) to classify the raw reads into individual gene locus, followed by SPAdes <ref type="bibr">(Bankevich et al. 2012)</ref> to assemble the reads in a given individual into contigs (Fig. <ref type="figure">1b</ref>). If multiple contigs with a &gt;10x coverage depth in an individual mapped to the same target gene with &gt;85% sequence identity, this target gene is marked for presence of paralogs in the individual (Fig. <ref type="figure">1b</ref>), which can be eliminated or addressed separately by investigators to determine its orthology to sequences of the same locus of other individuals in subsequent analyses.</p><p>In addition, most pipelines for enrichment data implementing popular assemblers such as SPAdes <ref type="bibr">(Bankevich et al. 2012)</ref> and Abyss <ref type="bibr">(Simpson et al. 2009)</ref> for sequence assembly can only construct a single consensus sequence for a given locus in each individual (multiploidy) that represents the most frequent base of each site among read variants. This approach loses all information from heterozygous sites for identification of potential paralogs, which may result in data containing phylogenetic noises from paralogous genes that can mislead the inference of species relationship. Although SECAPR and scripts from <ref type="bibr">Kates et al. (2018)</ref> can perform allele phasing, all of the presently widely used pipelines for handling target enrichment data do not make use of the information from heterozygous sites to detect paralogous sequences. To make use of heterozygous sites, such as to detect paralogs or for phylogenetic inference, modification of existing pipelines for enrichment data is needed. In this study, we developed a new pipeline that generates degenerate coded sequences (retaining information of heterozygous sites) from Hyb-seq reads and uses criteria from both sequence similarity and quantity and distribution pattern of heterozygous sites for detection and cleaning of putative paralogs for downstream enrichment data analyses, which we call the Putative Paralogs Detection (PPD) pipeline (available on Github: <ref type="url">https://github.com/Bean061/putative_paralog</ref>). We developed PPD by modifying HybPiper (see Methods) to code heterozygous sites in assemblies with IUPAC ambiguity codes, and to leverage these heterozygous sites for further filtering of putative paralogs (see details in Methods). In order to demonstrate the value of the new pipeline, we compared the number of putative paralogous loci detected by PPD and HybPiper and evaluated the influence of paralogs on phylogenetic and divergence time dating analyses using Hyb-Seq data from the Angiosperm 353 kit we generated for two diploid genera: Castanea (Chestnuts of Fagaceae) and Hamamelis (Witch-hazel of Hamamelidaceae). We further validated the paralogy of putative paralogous loci identified by PPD using a genome reference available for Castanea.</p><p>The chestnut genus Castanea Miller (Fagaceae) includes seven tree species, each restricted to eastern Asia (EA), eastern North America (ENA), or Europe. The species were divided into three sections <ref type="bibr">(Dode 1908)</ref>  <ref type="bibr">Johnson (1988)</ref> and <ref type="bibr">Nixon (1997)</ref>, C. pumila var. pumila in the southeastern United States and C. pumila var. ozarkensis (Ashe) A.E. Murray limited to the Ozark mountains.</p><p>Phylogenetic studies of Castanea were previously conducted using data from six chloroplast regions in <ref type="bibr">Lang et al. (2006</ref><ref type="bibr">Lang et al. ( , 2007))</ref>. The studies found that sect. Eucastanon is paraphyletic. The witch-hazel genus Hamamelis L. (Hamamelidaceae) is also a small woody genus consisting of six species of shrubs and small trees, isolated in EA and ENA. The EA species include H. mollis Oliv. from eastern and southern China <ref type="bibr">(Chang 1979;</ref><ref type="bibr">Zhang and Lu 1995)</ref> and H. japonica Siebold &amp; Zucc. from Japan <ref type="bibr">(Sargent 1890;</ref><ref type="bibr">Ohwi 1978)</ref>. The ENA species include H. virginiana L., that is widely distributed from Canada to the Gulf coast <ref type="bibr">(Bradford and Marsh 1977)</ref>, H. vernalis Sarg., a species endemic to the Ozark Mountains in Arkansas, Missouri, and eastern Oklahoma <ref type="bibr">(Bradford and Marsh 1977)</ref>, H. ovalis S.W. Leonard that is restricted to a small area of Mississippi <ref type="bibr">(Leonard 2006)</ref>, and H. mexicana Standl. endemic to northeastern Mexico <ref type="bibr">(Standley 1937)</ref>, which is also known as Hamamelis virginiana var. mexicana (Standl.) C.Lane.</p><p>A few phylogenetic studies of Hamamelis were previously conducted using data from ITS, ETS, waxy gene, and several plastid genes <ref type="bibr">(Wen and Shi 1999;</ref><ref type="bibr">Li et al. 2000;</ref><ref type="bibr">Xie et al. 2010)</ref>.</p><p>However, the species relationships within Hamamelis have remained uncertain due to low nodal support values and short internal branches, especially regarding the relationships within the ENA clade. Therefore, results from the study also allow us to evaluate the previous phylogenetic hypotheses and further resolve the species relationships within these two genera. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Generation</head><p>Preparation of DNA samples.-We generated data from 15 samples of Castanea, seven samples of Hamamelis, and three samples of outgroups (Supplementary Table <ref type="table">S1</ref>, available on Dryad), which covers all species of the two genera. Outgroup species were chosen based on their phylogenetic positions in Fagaceae and Hamamelidaceae, respectively inferred by <ref type="bibr">Lang et al. (2006)</ref> and <ref type="bibr">Xie et al. (2010)</ref>. Fothergilla and Parrotiopsis were used as the outgroups of Hamamelis while Quercus was used as the outgroup of Castanea. Leaf samples were collected from the field or plants grown in arboreta or botanical gardens (Supplementary Table <ref type="table">S1</ref>, available on Dryad). Fresh leaves were stored in silica gel to dry. The dry leaves were stored at -20 &#176;C until they were used for the DNA extraction.</p><p>Total genomic DNAs were extracted from leaf samples using the CTAB protocol <ref type="bibr">(Doyle 1991)</ref> with modification described in <ref type="bibr">Cullings (1992)</ref> and <ref type="bibr">Xiang et al. (1998)</ref>. For leaf samples of Castanea that are rich in secondary compounds, they were washed five times with 0.8 mL of a washing buffer containing 10% polyethylene glycol, 0.35 M sorbitol, 50 mM Tris-HCl, 0.1% bovine serum albumin and 0.1% &#946;-mercaptoethanol <ref type="bibr">(Sakaguchi et al. 2018;</ref><ref type="bibr">Zhou et al. 2020)</ref> prior to DNA extraction with the modified CTAB method. The quality and quantity of DNA samples were first checked by 1% agarose gel electrophoresis followed by measurement on a Nanodrop spectrophotometer (ThermoFisher) and with a PicoGreen fluorescent dye assay (Life Technologies, ThermoFisher). Florida, USA) for Hyb-Seq library reconstruction and sequencing. The DNA samples were pooled for hybridization to biotinylated probes using the Angiosperm-353 v. 1 target capture kit <ref type="bibr">(Johnson et al. 2019</ref>) available from Arbor Biosciences (Arbor Biosciences, Ann Arbor, Michigan, USA). Sequencing of DNAs pulled from the hybridization experiment was performed with Illumina MiSeq (Illumina, San-Diego, California. USA) to produce 2 x 150 bp paired end reads, as described in <ref type="bibr">Gaynor et al. (2020)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Library preparation of</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Locus Data Assembly and MSA Generation</head><p>All samples were demultiplexed using Illumina's BCLtofastq by Arbor Biosciences. Raw sequencing reads were then cleaned and trimmed by Trimmomatic v.0.38 <ref type="bibr">(Bolger et al. 2014)</ref> using parameters MAXINFO:100:0.5 and TRAILING:20. Subsequently, the HybPiper pipeline v.</p><p>3 <ref type="bibr">(Johnson et al. 2016</ref>) was used to recover both coding sequences (CDS) and their flanking intron/non-coding regions. The process includes three major steps: using the nuclear sequences of Angiosperm 353 genes <ref type="bibr">(Johnson et al. 2019)</ref> as the references to capture all the reads from sequenced accessions via the BWA option with default seed length k=19 <ref type="bibr">(Li and Durbin 2009)</ref>, applying the SPAdes <ref type="bibr">(Bankevich et al. 2012)</ref> to assemble reads into long contigs, and implementing the intronerate.py module to recover "intron" and "supercontig" (CDS + intron fragments) sequences. Then, we used our PPD to generate multiple matrices to compare with those generated from HybPiper (see details below). To assess the phylogenetic and divergence time dating effects of paralogous genes we generated matrices consisting of supercontig sequences of three gene groups trimmed with PPD: orthologous loci, paralogous loci, or all loci.</p><p>The supercontig matrices contained sequences of both coding and their flanking regions of the three respective gene groups. The original supercontig matrices derived from HybPiper were retained for comparison. To build the matrices of orthologous genes, the paralogs called from HybPiper and PPD were manually removed from each all-gene matrix, while the matrices of paralogous genes included the paralogs detected by HybPiper and paralogs detected by PPD.</p><p>Specifically, for the genes with paralog warning from HybPiper, we considered only those loci with warnings for at least two individuals as paralogs. This conservative approach followed <ref type="bibr">Murphy et al. (2020)</ref>, and was based on 1) the fact that the reference sequences for the Angiosperm 353 kit were putative single copy genes from diverse, evolutionarily distant taxa, and 2) the observation of a dissimilar sequence in one individual alone could be a random event or due to errors in sequencing or sequence assembly in that individual, rather than true paralogy.</p><p>To allow different comparisons between the "consensus" matrices from HybPiper and "degenerated" matrices from PPD, we generated "consensus" matrices without (default of HybPiper), with gappy trimming (s6 of part 2 of PPD), and with all PPD trimming steps (all steps of part 2 of PPD, see details below). All data matrices and the relevant information are listed in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pipeline Description</head><p>The putative paralogs pipeline (PPD) includes two major parts: first, generating "degenerated" matrices, and second, trimming highly heterozygous sites, misaligned regions, and particularly gappy columns and detection of paralogous genes (Fig. <ref type="figure">2</ref>).</p><p>In the first half of the PPD pipeline, the "degenerated" sequences are built for HybPiper derived supercontig or exon sequences (if the intron sequences were not captured or absent) of each locus using a bash script following <ref type="bibr">Kates et al. (2018)</ref> (available on Github: <ref type="url">https://github.com/Bean061/putative_paralogs</ref>). This involves using the "consensus" sequences Downloaded from <ref type="url">https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425</ref> by guest on 13 July 2021 from HybPiper (Fig. <ref type="figure">1b</ref>) as the references and mapping the raw reads back to the references in BWA with customized seed length according to the sequence length. As a higher seed length (BWA -k) value improves mapping quality <ref type="bibr">(Robinson et al., 2017)</ref>, we applied high seed length to ensure high quality mapping. Our sequencing method produced sequences of 150 bp for each read, we used a minimum seed length (-k) of 100 bp, instead of the default "-k" (19 bp). After mapping, the mapped duplicate reads are discarded using picard (<ref type="url">https://broadinstitute.github.io/picard/</ref>). The program GATK <ref type="bibr">(McKenna et al. 2010;</ref><ref type="bibr">DePristo et al. 2011</ref>) is then used to identify the variable sites using the HaplotypeCaller, with "-ploidy 2" parameter for diploid species, and SelectVariants functions. Finally, we use the FastaAlternateReferenceMaker function in GATK to convert the variable sites into the IUPAC coding to produce the "degenerated" (IUPAC) sequences for each gene.</p><p>The second half of the PPD pipeline trims alignments and detects paralogs, and includes 8 steps: s1) Resort gene files: Use all "degenerated" sequence files from every individual as the input, and then sort the degenerated sequences orthologous to the 353 reference genes in each sample into individual locus files according to gene names. s2) Sequence filtering: Filter the sequences with more than 5% (default) heterozygosity according to the percentage information of heterozygous sites in every sequence because a sequence with a high percentage of heterozygous sites may indicate sequencing or assembly errors of the particular locus. This setting can be changed by users with "-he" parameter. s3-s5) MSA generating: To obtain a better alignment result, the reference sequence of each locus is added for alignment using MAFFT (-adjustdirection -maxiterate 1000 --globalpair) <ref type="bibr">(Katoh and Standley 2013)</ref>. The reference sequences are removed before trimming of the aligned sequences in s6 and s7. s6) MSA trimming: Remove the gappy sites (i.e., sites missing in 50% or more individuals) using TrimAl (default "-gt 0.51") <ref type="bibr">(Capella-Gutierrez et al. 2009</ref>), a threshold based on the simulation study by <ref type="bibr">Wiens and Morrill (2011)</ref> which showed that adding a set of characters with data for 50% of the species is either beneficial or harmless for phylogenetic study. We found the gappy regions were extensive and mostly at two ends spanning the intron/flanking regions of the gene sequence alignment in a locus, which might be attributed to erroneous assembly with a small number of raw reads in a few individuals. Therefore, we excluded these regions from the alignment to remove the influence of the gappy sites in phylogenetic analyses. The "-gt" parameter can also be customized (this parameter is identical to the "-gt" in TrimAl). s7) MSA further trimming:</p><p>Detect and trim the hypervariable sites or regions using a sliding window method. The polymorphic sites in the ingroups meeting the requirement in each window were marked and then removed from all individuals (including the sites in outgroup species) by TrimAl. The maximum number of sites in a sliding window can be modified by the "-mi" parameter and sliding window length can be modified by the "-w" parameter in PPD. The default values for "mi" and "-w" are 4 and 20, respectively, which represent if there are more than 4 polymorphic sites (not counting sites with heterozygous bases/degenerated sites) in a 20 bp sliding window (representing &gt;25% variable sites) all of the polymorphic sites will be marked and removed by TrimAl. For polymorphic sites attributed solely to differences in sequences of the outgroups and meeting the requirement of more than 8 polymorphic sites (changeable via "-mo" parameter in PPD, default is 8) present in each 20 bp window, they are marked and replaced by a dash "-" in the sequence of the outgroup and the sites are not removed from any individuals to retain information likely phylogenetically informative among ingroup taxa. These criteria should be adjusted according to observation of the non-trimmed taxa MSA. We used the &gt;25% cutoff for our data based on the assumption that such high rates of sequence variation in the 353 genes and their flanking regions among our study ingroup species is unlikely true and may represent alignment ambiguity due to errors from sequence assembly. Our visual inspection of the BWA mapping result found the hypervariable sites had extremely low mapping quality, e.g., low depth of mapped reads (less than 5 reads) and many wrongly mapped reads. Including these sites would inflate sequence variation, thus, the branch length in phylogenetic inferences. s8) Paralog identification: Consider a locus as a paralog if it contains one or more heterozygous site(s) that are shared by 50% (default) or more individuals. The threshold of shared percentage and the number of heterozygous sites can be adjusted by the user using the "-hs" parameter and "-nh" parameter, respectively. For example, in Figure <ref type="figure">2</ref>, a hypothetical MSA of a locus/gene (on the left side) shows sequence with high heterozygous sites (Sp1), a polymorphic site that is heterozygous in &gt;50% samples/individuals of a diploid organism (labeled as polymorphic site 2), and a sequence containing a region with apparent alignment ambiguity due to error in contig assembly (shown as hypervariable sites compared to the rest). Identical heterozygous site(s)</p><p>shared by over 50% individuals (Polymorphic site 2) in the MSA is used as the indication of presence of paralogs in the locus and is the criterion for calling putative paralogs in the PPD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Phylogenetic Analyses</head><p>Concatenation-based tree.-Phylogenetic analyses of the concatenated Hyb-Seq data were performed for the supercontig data matrices of the three gene groups generated from PPD as well as supercontig data matrices of the orthologs derived from HybPiper listed in Table <ref type="table">1</ref> using a maximum likelihood method implemented in IQ-TREE v. 1.6.12 <ref type="bibr">(Nguyen et al. 2015)</ref> partitioned by genes. All analyses used the TESTNEW option to obtain the best molecular model per partition. UF bootstrap was applied to evaluate the topology <ref type="bibr">(Hoang et al. 2018)</ref>.To test the congruence among different partition methods and phylogenetic methods, we also ran a phylogenetic analysis with the best merged partitions suggested by ModelFinder using MFP-MERGE in IQ-TREE <ref type="bibr">(Lanfear et al. 2012</ref>) and conducted analyses without any partition using RAxML <ref type="bibr">(Stamatakis 2014)</ref> and MrBayes <ref type="bibr">(Ronquist and Huelsenbeck 2003)</ref> for the degenerated orthologous data matrices derived from PPD pipeline (for details, see Supplementary</p><p>Information, available on Dryad). The RAxML and MrBayes analyses above were all conducted on the CIPRES Science Gateway Portal <ref type="bibr">(Miller et al. 2010)</ref>.</p><p>Coalescent-based species tree.-We used both ASTRAL-III <ref type="bibr">(Zhang et al. 2018)</ref> and SVDQuartets <ref type="bibr">(Chifman and Kubatko 2014)</ref> to generate the coalescent-based species trees. For the analyses with ASTRAL-III, we used gene trees from IQ-TREE for both genera as the inputs and ran ASTRAL-III with the default parameters. For the analyses with SVDQuartets, we used concatenated multilocus data as the input. Then, PAUP* v4.0a166 <ref type="bibr">(Swofford 2003)</ref> was used to generate a total of 100,000 quartets with 100 bootstrap replicates and then the quartet assembly method QFM was used to produce a summary tree <ref type="bibr">(Reaz et al. 2014)</ref> Illustrator 2020 (Adobe Systems, San Francisco, CA, USA).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Divergence Time Analyses</head><p>We employed BEAST2 2.6.2 <ref type="bibr">(Bouckaert et al. 2014)</ref>  Hamamelis, the best models for each on the BIC values from jModelTest <ref type="bibr">(Darriba et al. 2012</ref>).</p><p>An uncorrelated lognormal relaxed clock <ref type="bibr">(Drummond et al., 2006)</ref> and the birth-death process model <ref type="bibr">(Stadler, 2010)</ref> were implemented in the analyses. To account for the fact that our sampling in Castanea contained two samples per species, which violates assumptions of the BD model, we performed an additional analysis of the orthologous gene data by using a single sample per species to evaluate the impact of this violation. To facilitate comparisons among datasets and between undated and dated phylogenies, we included the original sampling of Castanea in divergence time analyses of all datasets. We run our analyses as a single concatenated supermatrix, as divergence time analyses using concatenated unpartitioned supermatrices compared with gene partitioned matrices of genomic data results in similar divergence times, but the concatenated data sets were more efficient than the partitioned datasets in attaining suitable effective sample sizes <ref type="bibr">(Voloch and Schrago 2012)</ref>. We set the mean GrowthRate (net diversification rate) to have a uniform distribution with a range of 0-100, with an initial value of 0.0, and the relative Death Rate (extinction rate/speciation rate) to have a 0-1 range, with an initial value of 0.5. These values were chosen based on the estimated average net diversification rate and extinction rate in plants <ref type="bibr">(De Vos et al., 2015)</ref>. Because constraints on node times can interact with constraints on other nodes and can also impact the divergence times of nodes that are elsewhere on the tree, we ran "empty" Markov Chain Monte Carlo analyses by adopting the prior settings but without using the sequence data to determine if the marginal densities of calibrated nodes matched the calibration densities, a desired property of a calibrated tree prior <ref type="bibr">(Heled and Drummond 2012)</ref>. These analyses yielded approximations to the prior distributions. To ensure that the prior distributions were well approximated, these "empty" MCMC runs all had effective sample sizes that exceeded 200. We found congruence between the priors and their approximations. Then, we ran the analyses with data for 200 million generations, with sampling of trees every 10,000 generations. Quality of the runs and parameter convergence were assessed using Tracer v.1.6.0 <ref type="bibr">(Rambaut et al. 2018</ref>). The maximum credibility tree of median heights was then constructed using TreeAnnotator after discarding 20% trees as burn-in.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assessment of PPD Success Rate on Paralogs Identification</head><p>To test whether the putative paralogs detected by PPD were true paralogs and assess the false positive and false negative rates of PPD in identifying paralogs, we conducted nucleotide BLAST <ref type="bibr">(Altschul et al. 1990)</ref>  in identification of paralogous genes only in Castanea samples using the Castanea reference genome. We considered a locus to be confirmed as paralogous when its sequence from any Castanea sample had two or more BLAST hits on the reference genome and/or had a BLAST hit to a genome location different from that of other samples with 90 percent of identity with at least 500 bp mapping length in the separate regions of the reference genome. We calculated the success rate of PPD in paralogous gene identification as the number paralogous loci confirmed by the BLAST mapping analyses divided by the total number of paralogous loci identified by PPD. We also assessed the failure rate of PPD in calling paralogous genes by mapping the pooled sequences of all species of a putative orthologous locus to the reference genome. If sequences of a gene locus are mapped to more than one region in the reference genome, we recorded it as a case of false orthology. We also evaluated if false orthology and false paralogs influenced our phylogenetic analyses by repeating IQ-TREE analyses described above on a matrix that contained PPD orthologs and putative false paralogs but excluding any putative false orthology from BLAST results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>R ESULTS</head><p>The number of loci, alignment length, average length per locus, total hypervariable sites removed, number of segregating sites, and number of parsimony informative sites varied among the three gene groups and between genera (Table <ref type="table">1</ref>). We found no sequences with excess heterozygosity and thus no sequences were removed from our data due to the presence of excess within-individual heterozygosity (5% or more). In the consensus matrices generated from</p><p>HybPiper, approximately an average of 1120 bp in Castanea and 446 bp in Hamamelis were removed from each locus through the gap-trimming step in PPD. Through the PPD sliding  <ref type="table">1</ref>). We found 31 (77.5%) out of 40 paralogs from PPD had multiple hits to the Castanea reference genome, while 9 (22.5%) paralogs had one hit based on the BLAST results (Table <ref type="table">2</ref>; Supplementary Tables <ref type="table">S2</ref> and<ref type="table">S3</ref>, available on Dryad). In orthologous genes detected by PPD, we found 255 (83.9%) out of 304 orthologs had only a single hit (i.e. all samples mapped only to a single region of the genome), while 46 (15.1%) putative orthologs had multiple hits (Table <ref type="table">2</ref>; Supplementary Tables <ref type="table">S4</ref> and<ref type="table">S5</ref>, available on Dryad). Phylogenetic analyses that also excluded orthologs with multiple BLAST hits and included paralogs with single BLAST hits were qualitatively the same as all other PPD analyses described below (Supplementary Fig. <ref type="figure">S1</ref>, available on Dryad). As a comparison, we found 11 paralogs by HybPiper, eight of which differed from paralogs from PPD. Four (36.4%) out of 11 paralogs from HybPiper had multiple hits to the Castanea reference genome, while six (54.5%) paralogs had one hit and one putative paralog had no hits (Table <ref type="table">2</ref>). Among the 333 orthologous genes from HybPiper, 258 (77.5%) had single hit, 73 (21.9%) had multiple hits, and 2 (0.6%) had no hits (Table <ref type="table">2</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Phylogenetic Analyses of Orthologous Gene Data</head><p>The phylogenetic analyses of the orthologous gene data from PPD using IQ-TREE (with gene partition and best merged partition), RAxML, and MrBayes resulted in the same tree topologies with strong nodal support in both Castanea (Fig. <ref type="figure">3a</ref>; Supplementary Figs. <ref type="figure">S2-S4</ref> available in Dryad) and Hamamelis (Fig. <ref type="figure">3b</ref> and Supplementary Figs. S5 -S7, available on Dryad). The coalescent-based species trees reconstructed from ASTRAL and SVDQuartets for each genus also had the same topology identical to the concatenation-based tree (Fig. <ref type="figure">4</ref>). In</p><p>Castanea, the reciprocal monophyly of species from EA and ENA were recovered for each region, and the European species C. sativa was placed as the sister to the American clade (Fig. <ref type="figure">4a</ref>). In Hamamelis, species from ENA form a monophyletic group sister to H. japonica with H. mollis diverging out first, sister to the remaining species. However, the node connecting the ENA clade and H. japonica was not well supported in ASTRAL (0.59) but well supported in SVDQuartets (90) (Fig. <ref type="figure">4b</ref>).</p><p>Phylogenetic analysis of the orthologous gene data from HybPiper alone with and without PPD trimming steps resulted in different results in the two genera considered. In Castanea, the same topology was recovered from orthologous gene data for HybPiper matrices with and without PPD trimming, and this topology was the same as the topology recovered from the full PPD pipeline (Compare Figs. <ref type="figure">3a, 3c,</ref> and<ref type="figure">3e</ref>). In Hamamelis, the analysis of the untrimmed matrix resulted in a tree with a topology different from the tree from the PPD and trimmed HybPiper data <ref type="bibr">(Compare Figs. 3b,</ref><ref type="bibr">3d,</ref><ref type="bibr">and 3f)</ref>. In both genera, the branch lengths in HybPiper data-based trees were substantially longer than trees based on the PPD data, especially in the trees from the untrimmed HybPiper consensus data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Divergence Time Analyses of Orthologous Genes</head><p>Castanea.-Divergence time analyses of the PPD-derived data including all samples (i.e., degenerated supercontigs of orthologous genes) estimated the crown age of the genus (splitting of the EA and ENA clades) as the early Miocene (17.9 Ma, 95% HPD: 14.3-21.8 Ma). Within the genus, other divergence occurred in the mid-Miocene and late Miocene (Fig. <ref type="figure">5a</ref> and Supplementary Table <ref type="table">S6</ref>, available on Dryad). The European chestnut (C. sativa) diverged from the two ENA species in the mid-Miocene (13.6 Ma, 95% HPD: 10.9-16.7 Ma). The divergence times estimated from analysis with one sample per species were highly similar to those based on full sampling (two samples per species) for Castanea, with differences of median values &lt; 1 million years (Supplementary Fig. <ref type="figure">S8</ref>, available on Dryad).</p><p>Divergence times (median values) estimated from the HybPiper-derived data were approximately 11 million years (untrimmed) and two million years (trimmed) older, respectively, for all the nodes (Figs. <ref type="figure">5a,</ref><ref type="figure">5c</ref>, and 5g and Supplementary Table <ref type="table">S6</ref>, available on Dryad). The divergence times estimated from the paralogous genes were two to a few million years older than the estimates based on the orthologous gene data (Fig. <ref type="figure">5e</ref>, and Supplementary Table <ref type="table">S6</ref> late Miocene for H. virginiana and the Pliocene for the other species (Fig. <ref type="figure">5b</ref> and Supplementary Table <ref type="table">S6</ref>, available on Dryad). Similarly, the divergence times estimated from HybPiper-derived data were approximately 6 to 10 million years (untrimmed) and up to three million years (trimmed) older, respectively, for all nodes (Fig. <ref type="figure">5d</ref> and 5h; Supplementary Table <ref type="table">S6</ref>, available on Dryad).</p><p>Divergence time analyses of the paralogous gene data detected by PPD showed the median were highly similar at some nodes but younger or older at other nodes with differences within four or five million years, compared to the estimates from the "degenerated" orthologous gene data (Fig. <ref type="figure">5f</ref> and Supplementary Table <ref type="table">S6</ref>, available on Dryad). However, the 95% HPD were much higher at all nodes, indicating greater uncertainty.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D ISCUSSION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impacts of Paralogs and the Value of the PPD</head><p>Our results showed that our new pipeline (PPD) identified many more putative paralogs than HybPiper. Although the "consensus" sequence data generated from HybPiper may produce the phylogenetic tree with the same topology as the tree from the "degenerated" sequence data derived from PPD, the HybPiper data contained many more "false" phylogenetic informative sites (due to the presence of paralogous genes and consensus coding of the sequences), resulting in longer branches affecting divergence time estimation (Figs. <ref type="figure">3</ref> and<ref type="figure">5</ref>; Supplementary Table <ref type="table">S6</ref>, available on Dryad). The sequence data with better cleaning of paralogs and coded with the "degenerated" method are advantageous for phylogenomic studies, as they contain more accurate information for phylogenetic and divergence time estimations. Comparisons of the PPD data with "consensus" data with and without PPD trimming steps (Figs. <ref type="figure">3</ref> and<ref type="figure">5</ref>) indicated that the observed differences in branch lengths and divergence times cannot be explained by differences in trimming alone and sequence coding and paralogs also affected branch lengths and divergence time estimation. Furthermore, our phylogenetic analyses of the loci containing paralogous genes often resulted in phylogenies different from those inferred from data of the orthologous genes in Hamamelis (Supplementary Fig. <ref type="figure">S9</ref>, available on Dryad). The divergence times estimated from data including potential paralogous genes (i.e., the "consensus" data matrices from HybPiper) or from the paralogous genes identified from PPD are older and have larger HPDs, likely due to the additional variable sites introduced by gene paralogy (Fig. <ref type="figure">5</ref> and Supplementary Table <ref type="table">S6</ref>, available on Dryad). Our results clearly highlighted the negative impacts of paralogous gene content in phylogenetic analyses and that paralogous gene content either inflates estimates of divergence time or increases uncertainty of divergence time estimation in Castanea and Hamamelis (Fig. <ref type="figure">5</ref> and Supplementary Table <ref type="table">S6</ref>, available on Dryad). Comparisons of the PPD data with the "consensus" and "untrimmed" consensus data from HybPiper further indicated that the effects of sequence trimming on branch lengths and divergence time estimation were major, greater than the influences of sequence coding and paralogs in our case (Figs. <ref type="figure">3</ref> and<ref type="figure">5</ref>). These results together strongly support that additional steps following HybPiper to "polish" data from Hyb-Seq of Angiosperm 353 probe kit are necessary before phylogenetic and downstream analyses. Moreover, we show that the PPD pipeline can effectively clean alignments with user-defined trimming and identify paralogs in these alignments to produce higher quality data for phylogenetic and divergence time dating analyses.</p><p>The "degenerated" matrix generated from the PPD using the IUPAC ambiguity codes are suitable for a wide range of modern phylogenetic tools for phylogeny and divergence time estimation, including RAxML <ref type="bibr">(Stamatakis 2014)</ref>, IQ-TREE <ref type="bibr">(Nguyen et al. 2015)</ref>, SVDQuartets (Chifman and Kubatko 2014), BEAST2 <ref type="bibr">(Bouckaert et al. 2014</ref>) that has an option to treat ambiguity-coded positions as informative.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Accuracy Rate and Caveats of PPD in Paralogs Identification</head><p>Through BLAST mapping analysis with the Castanea mollissima genome, the paralogy of most of the PPD identified paralogous loci (31 out of 40 at a rate of 77.5%) were confirmed by two or more hits. The remaining nine paralogs each had a single hit in the genome (i.e. all samples mapped only to a single region), which represented false-positive paralogs, may be explained by loss of the duplicated paralogous loci in the reference genome and/or incompleteness of the reference genome. Additional Castanea genomes that may become available in the future will help further test this hypothesis. Alternatively, small-scale duplication events (e.g., <ref type="bibr">Hudson et al. 2011;</ref><ref type="bibr">Carretero-Paulet and Fares 2012;</ref><ref type="bibr">Rensing 2014</ref>) that are prevalent in Castanea plants may be missed based on the settings we used for BLAST (such as a 500 bp length), leading to the false classification of a putative paralog as having only a single hit.</p><p>We found that five out of these nine loci have only one heterozygous site shared by &gt;50%</p><p>individuals. The single shared heterozygous site in these five paralogs could be a result of occurrence by chance or sequencing errors. If users want to minimize such potentially false identification of paralogs and they can use a more conservative approach by increasing the number of heterozygous sites shared by &gt;50% individuals. However, this may result in the potential of missing true paralogous loci. If no reference genome is available for verification of paralogy of loci, and given that sequences for numerous loci are available from Hyb-Seq for phylogenetic analyses, we recommend a more aggressive approach to removing paralogs, such as the one adopted in our study. Our mapping analysis also indicated that PPD outperformed HybPiper alone at identifying true orthologs. We found 255 out 303 (83.9%) orthologous genes identified by PPD were true orthologs (evidenced by a single hit in the BLAST analysis), compared with only 77.5% from HybPiper alone. Additionally, 46 of the orthologous genes from PPD had two hits (15.1%), indicating paralogy of these loci according to our mapping criterion, while 73 (21.9%) of putative orthologs from HybPiper alone had multiple hits. This may indicate that both HybPiper and PPD do not remove all potential paralogs, but with only a single reference genome available, it is also possible these putative orthologs mapping multiple times could reflect errors in reference genome assemblies. Regardless of the origin of these putative paralogs missed by PPD, excluding them from phylogenetic analyses did not result in substantial differences in phylogenetic results between the original orthologous PPD matrix and one without these genes, indicating that a small percentage of "false" orthologs is tolerable. However, researchers may choose to validate the PPD identified paralogs and orthologs for their taxa with reference genome available and further refine the data, as done with Castanea in our study Overall, compared to HybPiper, PPD generated more accurate orthologous gene data for phylogenetic and downstream analyses (Table <ref type="table">1</ref> and Supplementary Table <ref type="table">S7</ref>, available on Dryad).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Taxonomy and Relationships within Castanea and Hamamelis</head><p>Our phylogenetic data do not agree with the morphology-based classification scheme of three sections in Castanea (Sect. Eucastanon, Sect. Balanocastanon, and Sect. Hypocastanon) ENA clade is sister to European C. sativa with high support value (Figs. <ref type="figure">3a</ref> and<ref type="figure">4a</ref>). The taxonomic status of the Allegheny chinkapin (C. pumila) and the Ozark chinkapin has been disputed <ref type="bibr">(Johnson 1988;</ref><ref type="bibr">Nixon 1997)</ref>. <ref type="bibr">Johnson (1988)</ref> considered the Ozark chinkapin as a variety of C. pumila, while Nixon (1997) regarded it as a separate species C. ozarkensis. In our study, all individuals representing C. pumila including the Ozark chinkapins formed a monophyletic group sister to C. dentata with strong support. Therefore, our phylogenomic study does not support the recognition of C. ozarkensis as a distinct species. However, the hypothesis should be further tested with population level sampling of related taxa.</p><p>In Hamamelis, our result suggested a similar topology with previous phylogenies using data from ITS, ETS, waxy gene, and several plastid genes <ref type="bibr">(Wen and Shi 1999;</ref><ref type="bibr">Li et al. 2000;</ref><ref type="bibr">Xie et al. 2010)</ref>, which showed H. mollis diverged first, followed by the divergence between H.</p><p>japonica and ENA clade. The ENA clade was a well-supported monophyletic clade. Different from previous studies, our concatenation-based tree showed a well resolved relationship among ENA clade using nuclear gene data, indicating H. virginiana is the first diverged species, followed by the divergence of H. vernalis, and H. mexicana is sister to H. ovalis (Fig. <ref type="figure">3b</ref>).</p><p>However, our coalescent-based species tree showed a different topology within the ENA clade, uniting H. ovalis and H. vernalis as the sister group but with low support values (Fig. <ref type="figure">4b</ref>). This conflict suggests there might be incomplete lineage sorting or gene flow among these three taxa in North America. The node connecting H. japonica and ENA clade is also relatively low in the species tree reconstructed with ASTRAL (0.59; Fig. <ref type="figure">4b</ref>), indicating another phylogenetic conflict among gene trees and the possibility of ancient gene flow or incomplete lineage sorting.</p><p>In conclusion, PPD, the pipeline we have described here, improves the quality of data obtained from Hyb-Seq for phylogenomic analyses through detection of additional paralogous Notes: The "untrimmed consensus matrices" were the orthologs directly generated by HybPiper, the "gappy trimmed consensus matrices" were the orthologs generated by HybPiper trimmed by s6 of part 2 of PPD. the "consensus matrices" were generated with HybPiper with all the PPD trimming steps, and the "degenerated" datasets were all generated with all steps of PPD. The "consensus" matrix contains sequences with heterozygous sites represented by the most frequent base; "degenerated" matrix contains sequences with heterozygous sites represented by the IUPAC ambiguity codes.         </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425 by guest on 13 July 2021</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Europe)); Top 2: Castanea, ((EA, Europe), (ENA)); Top 3: Hamamelis, ((H. mollis, (H. japonica, ENA)); Top 4: Hamamelis, ((H. japonica, H. mollis), ENA); Top 5: Hamamelis, (H. japonica, (H. mollis, ENA)). MSA: Multiple Sequence Alignment. The matrices indicated by an asterisk were used for divergence time dating comparisons. Dash line indicates data not examined. Bolded font indicates the tree topologies are concordant in concatenation-based tree and species trees. Downloaded from https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syab044/6306425 by guest on 13 July 2021</p></note>
		</body>
		</text>
</TEI>
