<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A global dataset of tree hydraulic and structural traits imputed from phylogenetic relationships</title></titleStmt>
			<publicationStmt>
				<publisher>Knighton, J., Sanchez-Martinez, P. &amp; Anderegg, L. A Globally Comprehensive Database of Tree Hydraulic and Structural Traits Imputed from Phylogenetic Relationships. Zenodo https://doi.org/10.5281/zenodo.15009207 (2025).</publisher>
				<date>12/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10625664</idno>
					<idno type="doi">10.1038/s41597-024-04254-4</idno>
					<title level='j'>Scientific Data</title>
<idno>2052-4463</idno>
<biblScope unit="volume">11</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>James Knighton</author><author>Pablo Sanchez-Martinez</author><author>Leander Anderegg</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We present a dataset of plant hydraulic and structural traits imputed for 55,779 tree species based on TRY plant trait dataset observations and phylogenetic relationships. We collected plant trait values for maximum stomatal conductance (gs MAX), xylem pressure at 12%, 50%, and 88% conductance loss (P12, P50, P88), maximum observed rooting depth (rd MAX), photosynthetic Water Use Efficiency (WUE), maximum plant height (height), Specific Leaf Area (SLA), and leaf Nitrogen content (LeafN). We demonstrated that each of these traits exhibited remarkably large phylogenetic signals across all land plants. Based on the strength of this signal we then developed random forest (RF) models trained on TRY trait data to impute the traits of previously unstudied tree species using Phylogenetic Eigenvector Maps. We quantified imputed trait uncertainty by fitting RF model test dataset residuals to skew exponential power distributions accounting for heteroscedasticity, demonstrating an encouraging lack of bias in the imputed dataset. The resulting dataset of imputed trait values can support global analyses of plant trait variations and species-level parameterization of earth systems models.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background &amp; Summary</head><p>Hydraulic and structural traits define how plants take up and transpire water from soils and groundwater, influencing ecosystem productivity, ecosystem resilience, and drought-induced mortality <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref>. The traits of the plant species that cover landscapes determine the land surface energy balance, hydrologic partitioning (i.e., infiltration of precipitation versus surface runoff), and the degree to which subsurface water pools are connected to the atmosphere through transpiration <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref>. Advances in process-based ecosystem modelling allow for the detailed representation of plant hydraulics in order to resolve the soil-plant-atmosphere continuum, which connects ecosystem water, nutrient, and energy fluxes with primary productivity <ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref>. These ecosystem models provide the opportunity to forecast earth system responses to both atmospheric and biological change <ref type="bibr">12</ref>.</p><p>While the importance of these plant traits is well understood <ref type="bibr">13</ref>, we lack trait measurements for most known tree species. A lack of direct trait observations to inform model parameterization has been part of the motivation for the compilation of global plant trait databases, such as the TRY Global Trait Database <ref type="bibr">14,</ref><ref type="bibr">15</ref>. More than a decade into these efforts, a few traits are now reasonably well sampled globally, principally traits related to leaf economics such as leaf mass per area and leaf nitrogen content. However, even these few well-sampled traits have never been measured for the vast majority of species globally (e.g., 
specific leaf area or SLA values exist for ~16,000 of Earth's approximately &#189; million land plant species in TRY <ref type="bibr">14</ref>). Observations of multiple traits in the same species are extremely rare when viewed against the backdrop of global plant diversity, even for the simplest traits such as plant height and growth form <ref type="bibr">14</ref>. For physiological traits that are more difficult to measure, such as hydraulic traits, this data scarcity is even more dire. Models frequently forgo this complexity by representing vegetation with a small number of plant functional types, and therefore may be limited in their capacity to forecast earth systems processes <ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref>. As a result, there has been a call for creative efforts to parameterize the 'functional types' (discrete parameter sets that represent functional diversity in vegetation models), for example using evolutionary lineages to help guide the aggregation of trait values <ref type="bibr">16</ref>.</p><p>Alternative methods exist for estimating plant traits beyond direct measurement in the field; however, each carries limitations. Remote sensing products can support estimating ecosystem-scale hydraulic traits <ref type="bibr">20</ref>, with some advancement towards retrieving functional trait diversity from spectral signals <ref type="bibr">21</ref>. Plant traits can also be inversely estimated through process-based ecosystem model fitting to species-level empirical field datasets (e.g., sapflux, xylem water isotopic compositions) <ref type="bibr">22,</ref><ref type="bibr">23</ref>; however, these measurements are resource intensive to collect and infrequently available. Given the limitations of current inverse approaches for estimating species-level hydraulic traits, a broad first order approximation of plant trait values could substantially advance ecosystem and earth systems modelling. 
Missing values in trait datasets can be imputed via methods such as Bayesian hierarchical probabilistic matrix factorization which can leverage the statistical structure of trait values, correlations among traits, and taxonomic relationships <ref type="bibr">24,</ref><ref type="bibr">25</ref> ; however these approaches have been tested primarily for highly sampled traits and rely on existing parallel measurements of other correlated traits. These approaches therefore may not satisfy the need for a tool that extrapolates to previously unstudied species.</p><p>Plant traits typically exhibit strong phylogenetic signals (i.e., more closely related species exhibit more similar trait syndromes than distantly related species) <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref> , providing the opportunity to impute traits for previously unstudied species based on the relationship between functional traits and widely available phylogenetic data. We first performed a series of significance tests for phylogenetic signals in the hydraulic traits maximum stomatal conductance (gs MAX ), xylem pressure at 12% (P12), 50% (P50), and 88% (P88) reduction in branch conductance, maximum rooting depth (rd MAX ), water use efficiency (WUE), as well as the structural traits maximum plant height, specific leaf area (SLA), and leaf nitrogen composition per unit leaf mass (LeafN). We then imputed trait values for 55 K tree species based only on phylogenetic relationships and the TRY plant trait database <ref type="bibr">15</ref> . This dataset of imputed values will support species-level ecosystem modelling and investigations of relationships between plant traits and environmental boundary conditions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methods</head><p>Plant trait phylogenetic signals. We collected plant trait values for maximum stomatal conductance (gs MAX), xylem pressure at 12%, 50%, and 88% conductance loss (P12, P50, and P88, respectively), maximum observed rooting depth (rd MAX), photosynthetic water use efficiency (assimilation/transpiration, or WUE), maximum plant height (height), Specific Leaf Area (SLA), and leaf nitrogen content per unit mass (LeafN) from the TRY database <ref type="bibr">15</ref>. Plant trait records were filtered to remove (i) values with a TRY ErrorRisk greater than 5, where ErrorRisk estimates were present (indicating that the value is more than five standard deviations from the species mean, genus mean, family mean, or mean of all data for that trait, and is likely a data error); (ii) unflagged values that were likely data entry errors (e.g., negative stomatal conductance); and (iii) the over-representation of two crops (Coffea arabica and Glycine max). Documentation of TRY database filtering is provided in publicly available code attached to this work. Where multiple records existed for a single species, we computed the species median trait value. We validated each record name against World Flora Online (WFO), a comprehensive list of plant species <ref type="bibr">31</ref>, with the R package 'WorldFlora'. TRY species names that did not match WFO were corrected. Where corrections were not possible, observations were discarded. Validated plant species were mapped to a phylogeny using the R package 'V.PhyloMaker2' <ref type="bibr">32,</ref><ref type="bibr">33</ref>. Species not present in the backbone phylogeny were bound using 'V.PhyloMaker2' under scenario 3, which is the most commonly used approach. Scenario 3 binds any new genus to an intermediate point of its family branch length and any species of an existing genus to the basal node of its genus. 
It differs from scenarios 1 and 2, which bind any new tip to the genus or family basal node and to a random node within the genus or family, respectively <ref type="bibr">33</ref>. The three scenarios have been compared in previous work, which showed that scenarios 1 and 3 perform better and give similar results <ref type="bibr">32</ref>. Therefore, we opted to use scenario 3. The resulting phylogenies contained the following numbers of unique species: gs MAX (n = 2,377), P12 (n = 387), P50 (n = 682), P88 (n = 436), rd MAX (n = 1,498), WUE (n = 317), height (n = 5,775), SLA (n = 12,595), and LeafN (n = 5,141).</p><p>Imputing plant traits using phylogenetic relationships requires first establishing that traits exhibit phylogenetic signals. We tested the hypothesis that each trait exhibited a significant phylogenetic signal with Pagel's &#955;, which can be interpreted as a measure of the amount of variance explained by phylogenetic distances between species (ranging between 0 and 1) <ref type="bibr">34</ref>, using 100 iterations as implemented in the R package 'phytools' <ref type="bibr">35</ref>. For this and all subsequent hypothesis tests we compared our p-values to &#945; thresholds of 0.1, 0.05, and 0.01. We also computed the fractions of trait variance explained by the phylogeny, Var Phylo, and their associated p-values <ref type="bibr">30</ref>.</p><p>We acknowledge that phylogenies may contain larger inaccuracies at the species level than deeper in the phylogenetic tree, especially when representing tropical taxa <ref type="bibr">36</ref>. To assess the potential impact of such topological inaccuracies, we repeated the Pagel's &#955; analysis for TRY traits at the genus level, pruning the species-level phylogeny to keep one species per genus (equivalent to a genus-level phylogeny). 
As will be demonstrated, phylogenetic signals maintained their significance, showing how most of the phylogenetic variance was explained by deep evolutionary divergences representing distances between well resolved high taxonomic ranks, in line with coarser taxonomic decomposition analyses of these same traits <ref type="bibr">16</ref> . This verified that species-level phylogenetic patterns are not strongly affected by the phylogenetic distances within genera, which can contain a higher amount of error.</p><p>Estimation of species-level hydraulic and structural traits. To facilitate prediction of species-level hydraulic and structural traits, we repeated the above analysis; however, we retained all individual trait observation values (rather than collapsing all observations of each species to one median trait value). Phylogenies were constructed following the same approach. We then reduced these phylogenies to Phylogenetic Eigenvector Maps (PEM) which characterize the distances between species <ref type="bibr">37</ref> . The original TRY trait observations were then joined to PEMs which could then serve as predictors of trait values.</p><p>We constructed all Random Forest (RF) models to predict trait values from PEMs with the R package 'h2o' <ref type="bibr">38</ref> . We then compared two methods for RF feature selection. First, using gs MAX , we trained the RF model on all PEMs. We then iteratively dropped the single PEM predictor with the lowest variable importance score and retrained the model. This process was repeated until RF performance significantly decreased when additional columns were removed. Second, we used a filter-based approach where we retained PEM predictors for model training that exhibited the strongest Spearman's rank correlations with the observed trait values. RF model tests suggested that performance for the Spearman-based approach was similar for models retaining between 25 and 75 columns. 
We therefore used the 50 strongest rank-correlated columns. The two approaches to feature selection yielded similar RF performance. We selected the simpler filter-selection approach for imputing all plant traits.</p><p>RF model parameters included 300 trees, a maximum depth of 50, and 8-fold cross validation. To estimate RF prediction uncertainty, the database was divided into training, validation, and test datasets based on 70%:15%:15% splits. The stopping condition used for training was Mean Squared Error. To estimate trait prediction performance, splits were developed by randomly sampling subsets of species such that all records of each species occur in only one of the training, validation, or test datasets. We present four RF test dataset objective function values for each trait: Mean Absolute Scaled Error (MASE), Mean Absolute Error (MAE), R&#178;, and Percent Bias (P-bias). All model metrics are computed only for the 15% of observations that were not used in model training/validation.</p><p>RF models using all TRY records for training and validation (i.e., no test hold out) were used to impute the trait values for tree species listed in the BGCI GlobalTreeSearch dataset of 57,922 named species <ref type="bibr">39</ref>. Validating and correcting tree species names in this list against WFO yielded 55,779 species names. TRY observations exist for the following fractions of species in the global tree list: gs MAX (2.07%), P12 (0.52%), P50 (0.94%), P88 (0.60%), rd MAX (0.73%), WUE (0.33%), height (2.52%), SLA (10.19%), and LeafN (9.22%).</p><p>We compared the above approach to several parallel methodologies for imputing traits to provide context for the final dataset. We first compared using PEMs to Principal Coordinate Analysis (PCoA) as implemented in the R 'ape' package <ref type="bibr">40</ref>. 
Next, we repeated the PEM-based analysis for P12, P50, and P88 records in the xylem functional trait database <ref type="bibr">41</ref> to test whether more curated (but smaller) hydraulic datasets yielded similar results. This dataset was filtered to include only stem samples from adult trees with S-shaped PLC curves.</p><p>Imputed trait residual characteristics and uncertainty bound estimation. The accuracy of imputed hydraulic and structural traits was quantified with RF test dataset residuals (i.e., e = predicted trait value - observed trait value). It was possible that RF trait residuals would be larger for tree species with greater documented within-species trait variations and for trees with fewer closely related species contained in the TRY database. We therefore hypothesized that RF residuals for all test datasets would exhibit significant phylogenetic signals. We tested for significant phylogenetic signals in model residuals with Pagel's &#955; as described above. As will be demonstrated, model residuals were not significantly related to species identity or phylogenetic relatedness for any trait. We therefore did not consider species identity in constructing statistical models of RF residuals.</p><p>Table 1. Genus-level phylogenetic analysis of the TRY database showing the number of genera, Pagel's &#955; (&#955;), and p-values (P).</p><p>Uncertainty bound estimates for each trait prediction were developed by fitting RF trait residual datasets to Skew Exponential Power (SEP) distributions with standard deviations accounting for residual heteroscedasticity <ref type="bibr">42</ref>. Best-fit SEP parameters describing residual kurtosis, skew, and variance were estimated through Maximum Likelihood Estimation via 1e6 Monte Carlo simulations for each set of trait residuals. 
Fitted SEP distributions were then used to construct 50% confidence intervals for each imputed trait for ease of use, though we note that the provided SEP parameter values and code support construction of any confidence interval as well as Monte Carlo sampling of trait uncertainty.</p></div>
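<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the species-level data splitting described above, the following minimal Python sketch assigns whole species to the training, validation, and test partitions so that no species contributes records to more than one partition. It is not the published workflow, which was implemented in R with 'h2o'. The function name split_by_species, the toy records, and the choice to apply the 70%:15%:15% fractions to species counts rather than record counts are all assumptions made for illustration.</p>
<ab><![CDATA[
```python
import random
from collections import defaultdict


def split_by_species(records, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole species to train/validation/test partitions.

    `records` is a list of (species_name, trait_value) tuples. Every record
    of a given species lands in exactly one partition.
    """
    by_species = defaultdict(list)
    for species, value in records:
        by_species[species].append((species, value))

    # Shuffle the sorted species names reproducibly before partitioning.
    species_names = sorted(by_species)
    random.Random(seed).shuffle(species_names)

    # Assumption: the fractions are applied to the number of species.
    n = len(species_names)
    n_train = round(fractions[0] * n)
    n_valid = round(fractions[1] * n)
    groups = {
        "train": species_names[:n_train],
        "validation": species_names[n_train:n_train + n_valid],
        "test": species_names[n_train + n_valid:],
    }
    return {
        name: [rec for sp in members for rec in by_species[sp]]
        for name, members in groups.items()
    }


# Hypothetical toy data: 20 species with 3 observations each.
toy_records = [(f"species_{i:02d}", float(i + j)) for i in range(20) for j in range(3)]
parts = split_by_species(toy_records)
```
]]></ab>
<p>Because whole species are held out, scores computed on the test partition estimate predictive skill for species that were absent from training, which is the situation faced when imputing traits for previously unstudied species.</p></div>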
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Records</head><p>The global imputed trait dataset is publicly available on Zenodo <ref type="bibr">43</ref>. The dataset consists of an R Data Serialization (RDS) file for the R scripting language, a Matlab MAT-file object, and an Excel spreadsheet (GlobalTrees_Traits_Median.xlsx), each containing median estimated trait values. The provided Skew Exponential Power (SEP) distribution parameters (Table 2) and median imputed trait values support the generation of random permutations of plant trait values for Monte Carlo simulations (e.g., for parameter sensitivity analyses or forecast uncertainty using process-based vegetation models). Code to generate random permutations of plant traits from median values and SEP distribution parameters is available (see Code Availability).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Technical Validation</head><p>Plant trait phylogenetic signals. All median plant hydraulic, economic and structural trait values exhibited significant phylogenetic signals based on Pagel's &#955; and Var Phylo at the &#945; &lt; 0.01 threshold (Fig. <ref type="figure">1</ref>). The phylogenetic dendrograms for maximum plant height, SLA, and LeafN are shown in Fig. <ref type="figure">2</ref>. Genus-level analysis of phylogenetic signals yielded a similar result (Table <ref type="table">1</ref>). This result largely agrees with prior research demonstrating strong phylogenetic signals in plant hydraulic and structural traits <ref type="bibr">26,</ref><ref type="bibr">27,</ref><ref type="bibr">29</ref>. The phylogenetic signal in all tested traits was highly statistically significant (based on both &#955; and Var Phylo). Phylogenetic variance was generally quite high (&gt;65%) for all traits, with the exception of gs MAX and rd MAX.</p><p>Estimation of species-level hydraulic and structural traits. Predicted plant hydraulic traits for the test datasets using PEMs demonstrated reasonable predictive skill of the underlying RF models (Fig. <ref type="figure">3</ref>). 
Mean Absolute Scaled Error (MASE) values for all test datasets were less than 1, indicating that the RF models substantially outperformed the mean of the TRY database for each trait. Observed P-bias scores, with the exception of WUE, were all close to 0%, indicating that the RF models were mostly unbiased predictors of trait values. There was also no obvious dichotomy, in either observed phylogenetic signal or RF model skill, between the more classic leaf economics traits (SLA, LeafN) and less well-sampled water use traits (P50, WUE), potentially supporting similar levels of phylogenetic conservatism among the traits that dictate carbon, water and nutrient strategies.</p><p>Trait values for P12 (Fig. <ref type="figure">3b</ref>) were somewhat more poorly predicted than all other traits as measured by RF model R&#178; scores, despite this trait exhibiting a strong phylogenetic signal within TRY (Fig. <ref type="figure">1</ref>). Imputed P12 values for some species are more negative than the predicted P50 value (Fig. <ref type="figure">4a</ref>), which is physically inconsistent because 12% conductance loss should occur at a less negative pressure than 50% loss; this inconsistency is largely absent between P50 and P88 (Fig. <ref type="figure">4b</ref>). This further suggested high uncertainty in imputed P12 values relative to P50 and P88. Prior studies have noted that xylem pressures at turgor loss (often similar in magnitude and potentially mechanistically related to P12) can exhibit a weaker phylogenetic signal than P50 <ref type="bibr">27</ref>, which may explain the reduction in predictive skill. Alternatively, the substantial methodological uncertainty of hydraulic vulnerability curve measurements may make P12 or Pe (the point of initial air entry into xylem, often assumed to be near P12) inherently more difficult to measure than P50 across different methods. Finally, P12 predictions may be negatively influenced by the composition of the TRY database. There is a disproportionate representation of conifers within TRY, though this is also true for P50 and P88 (Fig. 
<ref type="figure">1</ref>). The distribution and small number of observed species for P12 in TRY may prevent the computed PEMs from fully characterizing trait variations across the phylogeny.</p><p>We demonstrate that the PEM approach yields similar test dataset objective function values to an RF model trained on Principal Coordinate Analysis (PCoA) (Fig. <ref type="figure">5</ref>) as implemented in the R 'ape' package <ref type="bibr">40</ref>. RF model performance based on records in the xylem functional traits database, which is more curated and more easily screened but smaller than the TRY database, showed slightly improved prediction scores relative to TRY for P12, P50, and P88 (Fig. <ref type="figure">6</ref>). Though this dataset shows promise for future use, we did not consider it further due to the small dataset size.</p><p>The intention of this dataset is to support global trait analyses and earth systems model forecasts that are by necessity climatic and ecological extrapolations. Our methodology intentionally excluded local environmental conditions from training, despite the promise that approaches including such conditions have shown as hindcasting tools. By excluding this information, we produced a dataset of imputed traits and their associated uncertainties that reflects the broadest range of environmental conditions possible. The trait dataset conditioned only on phylogenies is therefore more robust with respect to the broad need for ecosystem model parameterizations that are climate-transferable <ref type="bibr">19,</ref><ref type="bibr">44,</ref><ref type="bibr">45</ref>.</p><p>Imputed trait residual characteristics and uncertainty bound estimation. RF model residuals did not exhibit significant phylogenetic signals for any trait at the &#945; &lt; 0.1 threshold (Fig. <ref type="figure">7</ref>). Residuals for WUE showed a high &#955; value, but the result was not significant, possibly due to the relatively small dataset size. 
We expected that issues of data sparsity, non-random sampling of the phylogeny for some traits, and other issues with the training data would result in phylogenetically structured model errors. However, the RF models apparently captured the phylogenetic structure of the data extremely well for all traits. This result suggested that RF performance did not vary significantly with tree species identity. We therefore did not consider species identity in constructing statistical models of plant trait residuals.</p><p>RF model residuals were well described by Skew Exponential Power (SEP) distributions accounting for heteroscedasticity (Figs. <ref type="figure">8</ref>, <ref type="figure">9</ref>; Table <ref type="table">2</ref>). All trait residuals exhibited very limited skew (consistent with P-bias scores near 0%), further demonstrating that the RF models were unbiased predictors. All traits exhibited some degree of heteroscedasticity, with residual variance increasing with the magnitude of the trait being predicted (Fig. <ref type="figure">8</ref>, Table <ref type="table">2</ref>).</p><p>The observed residual heteroscedasticity could potentially be explained by trait measurement errors within TRY, where the magnitude of measurement biases scales with the measurement being taken. For example, tree height measurement uncertainties are often expressed as a percentage <ref type="bibr">46</ref>, implying that height uncertainty increases linearly as a function of height. Another possibility is that plants may tend to evolve similar strategies for survival <ref type="bibr">27</ref>, resulting in few plant records within TRY that represent extreme trait values. The underrepresentation of extreme trait values in the training datasets may have limited the ability of the RF models to learn where large magnitude trait values are likely to occur across the phylogeny, resulting in residuals that scale in magnitude with trait values. 
Given that underlying traits exhibited strong phylogenetic signals (Fig. <ref type="figure">1</ref>, Table <ref type="table">1</ref>) but that test dataset residuals did not exhibit significant phylogenetic signals (Fig. <ref type="figure">7</ref>) this explanation may be less likely.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Scientific Data | (2024) 11:1336 | https://doi.org/10.1038/s41597-024-04254-4</p></note>
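<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four test-dataset scores discussed in the Technical Validation can be written out explicitly. The Python sketch below is a hedged illustration rather than the published code. It assumes that MASE is the model MAE divided by the MAE of a naive predictor that always returns the mean of the supplied observations (the text states only that a MASE below 1 indicates outperforming the TRY mean), and that P-bias is 100 times the sum of residuals divided by the sum of observations, with residuals defined as predicted minus observed. The function name evaluate and the example values are hypothetical.</p>
<ab><![CDATA[
```python
def evaluate(observed, predicted):
    """Return MAE, MASE, R^2, and percent bias for paired test-set values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual convention from the text: e = predicted - observed.
    residuals = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in residuals) / n
    # Assumed scaling: MAE of a naive model that always predicts the mean.
    mae_naive = sum(abs(o - mean_obs) for o in observed) / n
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)

    return {
        "MAE": mae,
        "MASE": mae / mae_naive,
        "R2": 1.0 - ss_res / ss_tot,
        # Assumed definition: total residual as a percentage of total observed.
        "P-bias": 100.0 * sum(residuals) / sum(observed),
    }


# Hypothetical observed and predicted trait values for five test species.
obs = [2.0, 4.0, 6.0, 8.0, 10.0]
pred = [3.0, 4.0, 5.0, 8.0, 10.0]
scores = evaluate(obs, pred)
```
]]></ab>
<p>For these toy values the MASE is about 0.17 and the P-bias is 0%, because the two nonzero residuals cancel. Under this assumed P-bias definition the sign is reversed for traits whose values are negative, such as P50, so the sign convention should be checked before comparing scores across traits.</p></div>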
		</body>
		</text>
</TEI>
