<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Efficient and Robust Bayesian Selection of Hyperparameters in Dimension Reduction for Visualization</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Neural Information Processing Systems (NeurIPS)</publisher>
				<date>07/26/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10526897</idno>
					<idno type="doi"></idno>
					
					<author>Yin-Ting Liao</author><author>Hengrui Luo</author><author>Anna Ma</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce an efficient and robust auto-tuning framework for hyperparameter selection in dimension reduction (DR) algorithms, focusing on large-scale datasets and arbitrary performance metrics. By leveraging Bayesian optimization (BO) with a surrogate model, our approach enables efficient hyperparameter selection with multi-objective trade-offs and allows us to perform data-driven sensitivity analysis. By incorporating normalization and subsampling, the proposed framework demonstrates versatility and efficiency, as shown in applications to visualization techniques such as t-SNE and UMAP. We evaluate our results on various synthetic and real-world datasets using multiple quality metrics, providing a robust and efficient solution for hyperparameter selection in DR algorithms.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Problem and Background</head><p>The problem of data visualization in the big-data era is one of the central topics in modern data analysis and graphical statistics. As a crucial part of exploratory data analysis, it has been both a methodological and engineering challenge to visualize patterns in large-scale high-dimensional data. Recent research attempts have been extended into scalable methodologies for large-scale datasets with highdimensionality <ref type="bibr">(Wilkinson and Luo, 2022)</ref>, taking a generic regime consisting of dimension reduction (DR), subsampling, and sketching <ref type="bibr">(Murray et al., 2023)</ref>.</p><p>One practical challenge in implementing DR methods for visualization on large datasets is hyperparameter selection. Novel DR methods, like those accommodating high-dimensional structures (e.g., <ref type="bibr">Luo et al. (2022)</ref>, <ref type="bibr">Dover et al. (2022)</ref>), introduce more hyperparameters and in many cases, selection is crucial but expert knowledge is limited. For instance, <ref type="bibr">Szubert et al. (2019)</ref> used large neural networks for DR, adding hundreds of hyperparameters. Efficient hyperparameter selection is vital as it impacts reduced dataset quality and subsequent analyses. However, trying all configurations of hyperparameters (i.e., try-and-err) is unrealistic due to the computational cost of DRs.</p><p>We focus on hyperparameter selection for established DR methods for visualization and restrict to case studies using t-Stochastic Neighbor Embedding (t-SNE) (Van der <ref type="bibr">Maaten and Hinton, 2008)</ref> and Uniform Manifold Approximation by Projection (UMAP) <ref type="bibr">(McInnes et al., 2018)</ref>. However, as pointed out by <ref type="bibr">Coenen and Pearce (2022)</ref> and <ref type="bibr">Wattenberg et al. 
(2016)</ref>, selecting appropriate hyperparameters is crucial in obtaining meaningful and interpretable results. As shown by recent works <ref type="bibr">(Kobak and Linderman, 2021;</ref><ref type="bibr">Damrich et al., 2022;</ref><ref type="bibr">B&#246;hm et al., 2022)</ref>, the choice of hyperparameters for t-SNE and UMAP impacts the quality of the data visualization. Here, we briefly review t-SNE and UMAP and refer the reader to <ref type="bibr">Van Der Maaten (2009)</ref> and <ref type="bibr">McInnes et al. (2018)</ref> for a more in-depth exposition of these methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">t-SNE and UMAP</head><p>The t-SNE algorithm is one of the most popular DR techniques for visualization for high-dimensional data with many applications, including single-cell multiomics <ref type="bibr">(Kobak and Berens, 2019)</ref>, bioinformatics <ref type="bibr">(Li et al., 2017)</ref>, and structural engineering <ref type="bibr">(Hajibabaee et al., 2021)</ref>. Informally, t-SNE finds a lower dimensional visualization of high dimensional data by preserving probability distributions over neighborhoods of points in high dimensions.</p><p>Until recently, data-dependent hyperparameter selection for t-SNE remains relatively unexplored <ref type="bibr">(Vu et al., 2021)</ref>. The algorithm has two primary hyperparameters: perplexity, a proxy for the number of nearest neighbors, typically ranging from 5 to 50; and an early exaggeration factor, affecting the tightness of natural clusters in visualization, with a default value of 12 in standard implementations. The performance of t-SNE heavily depends on these hyperparameters.</p><p>UMAP <ref type="bibr">(McInnes et al., 2018</ref>) is a versatile DR technique for visualization that is often used in bioinformatics <ref type="bibr">(Becht et al., 2019)</ref>, natural language processing <ref type="bibr">(Pealat et al., 2021)</ref>, and image analysis <ref type="bibr">(Allaoui et al., 2020)</ref>. UMAP builds upon the idea that data points lie on a low-dimensional manifold embedded within the high-dimensional space. The algorithm employs Riemannian geometry and topological data analysis for manifold structure learning and lower-dimensional projection.</p><p>Several hyperparameters influence UMAP's low-dimensional representation. We focus on optimally selecting the number of nearest neighbors and minimum distance. The number of neighbors balances local and global structure preservation, with smaller values emphasizing local structure. 
Minimum distance controls clustering compactness in lower-dimensional space, where smaller values create tighter clusters and larger values spread points. Typical choices for nearest neighbors and minimum distance are 2 to 100 and the interval [0, 1], respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Previous Work</head><p>Previous works used manual selection, trial-and-error, and neural networks to tune hyperparameters for t-SNE and UMAP <ref type="bibr">(Wattenberg et al., 2016;</ref><ref type="bibr">Du et al., 2022;</ref><ref type="bibr">Gove et al., 2022;</ref><ref type="bibr">Belkina et al., 2019</ref><ref type="bibr">Belkina et al., , 2017;;</ref><ref type="bibr">Cao and Wang, 2017)</ref>. Recent approaches have focused on tuning a single hyperparameter, such as exaggeration rate <ref type="bibr">(Damrich et al., 2022;</ref><ref type="bibr">Cai and Ma, 2021;</ref><ref type="bibr">B&#246;hm et al., 2022;</ref><ref type="bibr">Kobak and Linderman, 2021)</ref>. In contrast, our framework efficiently handles multiple hyperparameters through Bayesian selection for DR techniques, providing a comprehensive and general approach.</p><p>Most related to our work is the work of <ref type="bibr">Vu et al. (2021)</ref>, where the authors devised f score for evaluating global quality for user-constrained DR method, along with Bayesian optimization to optimize the score. Their investigation is limited to the f score metric, and requires pairwise user constraints. Moreover, the performance of DR methods rely heavily on the initialization, and DR on large datasets (i.e., n &gt; 3, 000) remain challenging to tune efficiently <ref type="bibr">(Bibal and Fr&#233;nay, 2019)</ref>. Unlike <ref type="bibr">(Vu et al., 2021)</ref>, our approach incorporates subsampling and works for arbitrary metrics. 
This advantage of applicability for any metric offers a comprehensive, interpretable solution for DR hyperparameter selection with multi-objective trade-offs (for multi-objective tuning involving more than one performance metrics) and sensitivity analysis (for multivariate hyperparameters).</p><p>Furthermore, by normalizing the parameter space <ref type="bibr">(Xiao et al., 2022)</ref> and taking empirical averages (or quantiles) of repeated metrics from different intializations, we showed that the optimal hyperparameters learned from the subsample can be transferred back to the large datasets with good visualization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Method Framework</head><p>We store data as an n &#215; d matrix X, where n is the sample size and d is the dimensionality. We leverage subsampling to reduce n to n &#8242; &#8810; n and DR to reduce d to d &#8242; &#8810; d, resulting in a reduced n &#8242; &#215; d &#8242; matrix. Our goal is to minimize the loss of original structure when reducing size and dimensionality, making n &#8242; and d &#8242; suitable for tuning existing DR methods for visualization.</p><p>Besides expert evaluation, there are existing performance metrics that allows us to quantify the quality of DR results for visualization. Let D &#947; : R n&#215;d &#8594; R n&#215;d &#8242; denote the mapping of an input dataset X to a lower dimensional representation via algorithm D that requires a pre-selected hyperparameter &#947;. For example, D can be the t-SNE algorithm and &#947; the chosen perplexity. We denote the output as X * = D &#947; (X). For visualization purposes <ref type="bibr">(Wilkinson, 2012)</ref>, we set d &#8242; = 2 in the DR method but our frameworks works for any d &#8242; and any metric in a scalable way.</p><p>Select a DR method Attempt the DR method on N0 sets of hyper-parameters on the full dataset Determine optimal hyperparameters for the chosen DR method from trials on the full dataset Perform the DR on the full dataset with optimal hyperparameters Select a DR method Provide N1 sets of hyper-parameters on the subsampled dataset Determine optimal hyperparameters for the chosen DR method from surrogate Perform the DR on the full dataset with optimal hyperparameters Fit surrogate model with tried hyperparameters. 
Choose the next hyperparameter to try on the subsampled dataset Update surrogate for the performance model on subsample dataset Loop N2 times Assume the same budget for the number of hyper-parameters: N0=N1+N2 Regular grid try-and-err search for optimal hyper-parameters Surrogate-based search for optimal hyper-parameters We can compute a performance metric (e.g., coranking scores), &#964; : R n&#215;d &#215; R n&#215;d &#8242; &#8594; R, between the original data X and processed data X * , where &#964; describes one aspect of the quality of the reduced data <ref type="bibr">(Lee and Verleysen, 2007)</ref>. The benefit of adopting a performance metric rather than expert evaluation is that we can quantify and model it as a function of its hyperparameters<ref type="foot">foot_0</ref> .</p><p>There are two categories of operational performance metrics we consider: task-independent and taskdependent performance metrics. As pointed out by <ref type="bibr">Lee and Verleysen (2007)</ref>, the primitive task-independent metrics will only depend on the original data matrix X and reduced data matrix X * . However, frequently we conduct subsequent analyses on reduced dataset X * , using task performance to characterize DR quality using task-dependent metrics.</p><p>After fixing the DR method D, our problem of hyperparameter selection can be interpreted as optimizing a performance function F (&#947;) := &#964; (X, D &#947; (X)) over a hyperparameter &#947;. While this formulation simplifies the problem, closedform expressions of F (&#947;) are not typically known, and classical (gradient-based) optimization methods are not applicable. 
To overcome this obstacle, we take the Bayesian optimization (BO) framework <ref type="bibr">(Snoek et al., 2012;</ref><ref type="bibr">Luo et al., 2021</ref><ref type="bibr">Luo et al., , 2022) )</ref> select a surrogate model S, and sequentially update the model to approximate F (&#947;).</p><p>In sharp contrast to previous work <ref type="bibr">(Vu et al., 2021)</ref>, we proposed to tune empirical means of several repeats EF (&#947;) instead of F (&#947;) directly due to the observation that metrics computed by DR methods are highly variable with different initializations <ref type="bibr">(Coenen and Pearce, 2022)</ref>. Intuitively, as the sequential samples of (&#947;, F (&#947;)) accumulate, the surrogate model targeting EF (&#947;) &#8776; EF (&#947;) will be a better approximation to EF (&#947;). In conjunction with the heuristic sequential selection of candidate &#947;'s, we can obtain reasonable optimal &#947; that is sufficiently close to the true minimizer of EF (&#947;). This procedure is also referred to as surrogate-based hyperparameter tuning in applications like machine learning <ref type="bibr">(Snoek et al., 2012;</ref><ref type="bibr">Shahriari et al., 2015)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Hyperparameter Tuning</head><p>We present a framework for tuning experiments for DR methods with single and multiple hyperparameters, outlined in Figure <ref type="figure">1</ref>, where we split the budget for pilot N 1 and sequential N 2 samples in our surrogate-based approach<ref type="foot">foot_1</ref> . Central trade-offs in our method are: (1) Computational Efficiency: Balancing accuracy and efficiency hinges on surrogate model and optimization strategy choices, trading off between expense and optimization quality. (2) Metric Robustness: Maximizing quality involves stable performance and mitigating randomness via averaging multiple-initialization metrics, measured by chosen metrics &#964; , which further raises the computation complexity.</p><p>Averaging metrics is necessary to address the multiple possible sources of randomness in DR approaches. For example, one kind of randomness comes from the (random) initialization of the low-dimensional representation. Another type of randomness can be caused by subroutines within the algorithm (e.g., nearest neighbor search in UMAP). Given the fact we want to approximate EF (&#947;) from multiple possible sources of randomness, we propose multiple-initialization averaged metrics for more stable and robust performance. In our experiments, we perform the DR for N repeat = 10 times using the same hyperparameter &#947; and take the average metric to obtain EF (&#947;).</p><p>Our methodology consists of two main components: the objective function and the surrogate model S. For the objective function, we need to match the parameter space of the reduced and the original dataset due to the lack of distance preservation in DR methods. There are several surrogate models S, including the cheapest random selection, and the most expensive optimal benchmark, grid exhaustive search. 
We examine the following approaches, grid exhaustive search (N 0 = total sample size), random forest regression, and Gaussian processes (GP-EI, GP-PI, GP-LCB <ref type="bibr">(Louppe, 2017)</ref>) as Bayesian surrogates that can incorporate prior information with a total budget N 0 = N 1 + N 2 , where N 1 is the budget for pilots and N 2 is the budget for the surrogate-based tuning. The sampling regime, normalization procedure and budget allocation are determining factors for surrogate quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Multi-objective Trade-offs</head><p>Surrogate-based tuning faces challenges in balancing and optimizing competing objectives, especially as the number of goal metrics increases. This emphasizes the need for appropriate surrogate models and optimization strategies that efficiently handle multi-objective problems, balancing accuracy and computational efficiency. Rather than focusing on optimizing metrics, we construct sensitivity analyses based on surrogates to detect sensitive hyperparameters (SM C). With multiple goal metrics (e.g., AUC and Q-global in Figure <ref type="figure">2</ref>), conventional single-objective optimization techniques are insufficient due to conflicting objectives. We propose the adoption of multi-objective optimization algorithms, such as Pareto-based approaches, to identify non-dominated solutions representing optimal trade-offs.</p><p>Pareto fronts <ref type="bibr">(Miettinen, 1999)</ref> provide non-dominated solutions, balancing conflicting metrics. Multi-objective optimization algorithms, such as NSGA-II <ref type="bibr">(Deb et al., 2002)</ref>, efficiently search for Pareto optimal solutions. In our approach, we construct the Pareto front using samples from the surrogate model, noting that it can also be elicited directly from the fitted surrogates for two metrics F 1 (&#947;), F 2 (&#947;) where the perplexity &#947; could affect both metrics simultaneously. After obtaining the Pareto front, decision-makers analyze trade-offs and select suitable solutions based on requirements or preferences, possibly choosing a point with the best balance or using additional criteria like the knee point.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Numerical Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Datasets</head><p>We consider the following four synthetic and real-world datasets, and provide experiments on additional datasets in SM D and E. The Unbalanced 2-cluster dataset is a small-scale, synthetic dataset that captures the performance of our framework when applied to a small class unbalanced dataset. It consists of 60 points, 10 in one class and 50 in another class. The COIL <ref type="bibr">(Nene et al., 1996)</ref> is a real-world dataset of images of physical objects taken from different angles. We take n = 360 from 5 classes in this data. Lastly, we consider two real-world, large-scale datasets. Reuters-English <ref type="bibr">(Amini et al., 2009)</ref> for which we use n &#8776; 14, 000 data points and Single cell transcriptomics with n &#8776; 23, 000 <ref type="bibr">(Tasic et al., 2016)</ref>. These real-world datasets exceed the maximum n &#8776; 3, 000 in previous work <ref type="bibr">(Vu et al., 2021)</ref>.</p><p>Using the unbalanced 2-cluster dataset, we illustrate how the surrogate-based tuning methodology works when the evaluation metrics can be directly computed on the full dataset. We show that our approach can find a set of hyperparameters that optimize the chosen metrics. Using the COIL dataset, we demonstrate that subsampling will speed up the tuning process without significantly impacting the outcome of hyperparameter optimization. The trend (hence the optimal hyperparameters) of each metric computed on subsamples is preserved when compared to the metric computed on the full dataset at the cost of increased uncertainty. Lastly, we apply our entire pipeline of the subsample-and-tune method and carry out corresponding analyses on the Reuters-English dataset and the single cell transcriptomics dataset to demonstrate efficacy on large-scale data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Evaluation Metrics</head><p>We employ multiple quality metrics to evaluate t-SNE and UMAP. The metrics we consider fall into two categories: task-independent and task-dependent. Task-independent metrics describe data properties, e.g., distance preservation, while task-dependent metrics focus on downstream tasks' performance, reflecting real-world utility. Combining both types enables a comprehensive evaluation, emphasizing strengths and weaknesses of DR techniques <ref type="bibr">(Bibal and Fr&#233;nay, 2019)</ref>.</p><p>For task-independent metrics, we investigate Q-global, average pairwise distance ratio, and Q-local metrics from <ref type="bibr">Lee and Verleysen (2009)</ref> and Area Under the Curve (AUC) from <ref type="bibr">(Lee et al., 2015)</ref>. Q-global and average pairwise distance ratio (between the pairwise distances of the original and reduced dataset) evaluate distributional pairwise distance preservation, while Q-local focuses on local neighborhood structures. These metrics determine DR effectiveness in preserving data relationships and are used to assess DR quality. The AUC metric is also used to evaluate the quality of DR techniques. The AUC quantifies the area under their proposed functional metric R N X (K), curve (varying scale K), providing a scalar-valued metric that allows comparison of the average (across all possible scales) quality of the resulting DR embedding compared to a random embedding. A higher AUC indicates better performance of the model. For task-dependent metrics, we evaluate subsequent task performance on reduced datasets, expecting minimal performance loss if the DR method preserves essential information. Normalized Mutual Information (NMI, <ref type="bibr">(Strehl and Ghosh, 2002)</ref>) evaluates data structure preservation and is suitable for unsupervised tasks. 
For supervised tasks like classification, we assess misclassification rates using a simple logistic regression model <ref type="bibr">(McCullagh, 2019)</ref>. The classifier uses the 80% of the data for training and the 20% for testing, enabling DR method comparisons by evaluating the preservation of the original classification structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Numerical Results</head><p>In this section, we provide empirical results using our framework to choose t-SNE's perplexity parameters on the four datasets discussed in Subsection 3.1. To ensure the parameter spaces are the same for full and subsampled datasets and facilitate the comparison of perplexities between the original and subsampled data, we use normalized perplexity, calculated by dividing the perplexity by the number of data points. This rough perplexity normalization is intuitively consistent with more principled perplexity normalization using asymptotics <ref type="bibr">(Xiao et al., 2022)</ref>. We perform each surrogate-based tuning for 20 batches and provide the budget for pilot N 1 = 5 and the budget for tuning N 2 &#8712; {15, 20} depending on the tasks.</p><p>We also include experiments on tuning for UMAP focusing on two hyperparameters -the number of nearest neighbors (n_neighbor), between 2 and 100 as suggested in the UMAP documentation, and minimum distance (min_distance), in the interval [0,1]. Due to this larger dimension of the hyperparameter space, we assign a higher budget N 2 &#8712; {20, 40}. In order to account for the difference in sizes between the original and subsampled data, we use a normalized n_neighbor, which is similar to the normalized perplexity used for t-SNE. For the real-world dataset, we reduce the parameter space of min_dist from the interval [0, 1] to points {0, 0.2, 0.4, 0.6, 0.8, 1}, since the sensitivity of min_dist is smaller compared to that of n_neighbor as is observed in SM C.</p><p>Figure <ref type="figure">2</ref>: Boxplots for the performance of t-SNE using AUC, average ratio, and NMI for all possible normalized perplexities, where the medians are marked by green triangles.</p><p>Example 1. Unbalanced 2cluster dataset. In our first experiment, we demonstrate our framework on parameter selection for t-SNE applied to the unbalanced 2-cluster dataset. 
Since perplexity serves as a proxy for the number of nearest neighbors, we intuitively expect to obtain optimal results when the perplexity does not exceed the size of a single cluster. We focus on three performance metrics: AUC, average ratio and NMI, to analyze the influence of t-SNE's initialization on the resulting lower-dimensional dataset.</p><p>We begin by considering a simple grid search over all possible normalized perplexity values. Our experiments suggest that most metrics are sensitive to initialization, and averaging the results over multiple initializations can help stabilize the outcome. We demonstrate this phenomenon with the 2-cluster dataset, and the results are shown in Figure <ref type="figure">2</ref>. We observe that 1 -AUC is minimized when the normalized perplexity is approximately 0.15 (perplexity is 9 = 0.15 &#215; 60). The clustering performance under t-SNE, as evaluated by NMI, is also best when the normalized perplexity is around 0.15. For unbalanced datasets, based on a grid search the optimized normalized perplexity is 0.16, where the performance is optimal in terms of both the AUC (task-independent) and NMI (task-dependent). Performance is not as favorable when the perplexity falls below or exceeds this value. On the other hand, given the average ratio (task-independent), the optimal normalized perplexity is at 0.75 (or perplexity 45 = 0.75 &#215; 60).</p><p>Next, we use the NMI for surrogate-based perplexity tuning. The result is given in Figure <ref type="figure">3</ref>. Among the 20 batches, the minimum of 1 -NMI converges within 15 iterations. In grid search, we use the budget of N 0 = 60 and obtain the minimal 1 -NMI of 0.541, whereas using the surrogate-based tuning obtained a minimal 1 -NMI of 0.537, using a lower budget of N 0 = 20. 
We compare our results with <ref type="bibr">(Vu et al., 2021)</ref> and observe that our approach effectively reduces the variance and enhances the robustness of the tuning process, particularly when the metric under consideration are sensitive to initialization (e.g., AUC and NMI in Figure <ref type="figure">2</ref>). Finally, we include Pareto fronts using samples from the surrogate model. Figure <ref type="figure">4</ref> includes samples from tuning samples based on AUC and average ratio. We include both samples to balance the two tuning processes, as those circle markers based on AUC appear mostly on the left. We observe that the Pareto fronts for AUC versus average ratio form a surface rather than a singular point. Hence, to address more than one objective and account for the trade-off between different metrics, our method recommends blue points in Figure <ref type="figure">4</ref>. Example 2. COIL dataset. In this experiment, we demonstrate how subsampling techniques effectively enhance the efficiency of DR methods. We subsample a third of the original dataset under two different sampling schemes: random sampling and leverage score sampling <ref type="bibr">(Ma et al., 2014)</ref>. Figure <ref type="figure">6</ref> shows we obtain similar results for the two sampling schemes (See also SM D). Under leverage score sampling, the variation for the performance metrics is smaller compared to that under random sampling. Moreover, using the subsampled dataset, the landscape of the performance is similar to the original dataset. Thus, to find the optimal perplexity for the full dataset, it suffices to obtain the optimal normalized perplexity for the subsampled dataset. In Figure <ref type="figure">6</ref>, we see that the classification performs relatively well when the perplexity is small. While the AUC is minimized when the normalized perplexity is around 0.12, the Q-global is minimized at 0.9. 
We plot the visualization corresponding to these two optimal perplexities in SM B. Finally, we run the surrogatebased tuning for the subsampled data to find the optimal perplexity according to AUC and assign the budget of N 0 = 20 (as opposed to a sparse grid search, which requires the budget N 0 = 60). The result is shown in Figure <ref type="figure">7</ref>. We see that within 5 iterations, the average converges to the minimum 1-AUC. This experiment demonstrates that the subsampled data is a good proxy for the original data in tuning for optimal perplexity. The surrogate-based tuning in the subsampled data is efficient and accurate. (a) Tuning for t-SNE (b) Tuning for UMAP Our framework is not limited to one hyperparameter or a single DR approach. We present how subsampling techniques enhance the efficiency of tuning UMAP. We plot the heatmaps for the mean 1 -AUC in Figure <ref type="figure">5</ref> and see that the optimal 1 -AUC is obtained at normalized n_neighbor 0.1 and min_distance 0.25 for both the subsampled and full dataset. This justifies the use of a subsampled dataset as a proxy of the full dataset when tuning hyperparameters in UMAP. Finally, we run the surrogate-based tuning using the subsampled data, as shown in Figure <ref type="figure">7b</ref>. Amongst the 20 batches, we obtain the minimum mean 1 -AUC of 0.3414, which is comparable to that of 0.3511 for a sparse grid search with budget N 0 = 150.</p><p>Example 3. Reuters English dataset. This experiment demonstrates the subsample-and-tune pipeline, which overcomes the sample size bottleneck in the simple BO method without subsampling by <ref type="bibr">Vu et al. (2021)</ref>. We subsample a tenth of the full dataset, obtain 1,426 data points using random sampling, and run the surrogate-based tuning for t-SNE with the performance metric NMI. 
Figure <ref type="figure">9</ref>: Visualization of the subsampled data (left) and full data (middle) at the optimal normalized perplexity, and the full data for the default perplexity, 30 (right).</p><p>The results for different surrogates are plotted in Figure <ref type="figure">8a</ref>. The budget for grid search is N 0 = 1, 426 while the budget for the surrogate-based tuning is N 0 = 25. We observe that for all 4 kinds of surrogates, the tuning processes converge to the minimal 1 -NMI within 10 iterations. Moreover, the optimal normalized perplexities obtained from grid search and surrogate-based tuning are 0.47 and 0.45, respectively. We next take the normalized perplexity from tuning using subsampled data and obtain the t-SNE visualization of the full data in Figure <ref type="figure">9</ref>. We observe that at the same normalized perplexity, the visualizations of the subsampled and full data are similar. Moreover, we compare the optimal normalized perplexity with default perplexity. Since the optimum is tuned under NMI, we see that the clustering result in the optimal perplexity is better than that for the default perplexity and leads to a better t-SNE visualization using fine-tuned perplexity as shown in Figure <ref type="figure">2</ref>(f) in <ref type="bibr">Liu et al. (2022)</ref>.</p><p>In addition, we present the convergence plot for the subsample-and-tune pipeline when applied to UMAP in Figure <ref type="figure">8b</ref>. The tuning processes for different surrogates, N 0 = 45 converge within 10 iterations and the optimal values obtained is closed to the grid search, which uses the budget N 0 = 600.</p><p>Example 4. Single cell transcriptomics dataset. We subsample a tenth from the full data and obtain 2,282 data and perform surrogate-based tuning for both t-SNE and UMAP using NMI on the subsampled data. The results are shown in Figure <ref type="figure">10</ref>. 
With much lower budgets, N 0 = 25, we obtain the optimum values comparable to those found using grid search, N 0 = 2, 280 and 600 for t-SNE and UMAP, respectively. We next use the obtained hyperparameters to recover the 2-dimensional visualisations for the full data, as shown in Figure <ref type="figure">11</ref>. The visualisations of subsampled data with the optimal hyperparameters is given in the left column, and the middle column showcases the visualisation for the full data with the corresponding optimal normalized hyperparameters.</p><p>Comparing the visualization using our method to choose perplexity and the default perplexity, we highlight that our method is better able to cluster like-data points. For example, the red clusters in our visualization have a linear structure but appear close together where using the default perplexity, the red clusters have a linear structure but do not appear together in the visualization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>We present an efficient and robust framework for selecting hyperparameters for DR methods, and we use DR techniques for data visualization, such as t-SNE and UMAP, to demonstrate the efficacy of our method. This work extends the applicability of previous work <ref type="bibr">(Vu et al., 2021)</ref> on this topic to large-scale data by leveraging subsampling and normalization of parameter spaces. We carry out both multi-objective tuning and sensitivity analyses for DR, addressing questions such as how much the sampling scheme affects the quality of DR and what the trade-offs are between computational complexity and performance.</p><p>Considering EF(&#947;) instead of F(&#947;) improves stability and robustness. We leave for future work level-set estimation (instead of maximization of metrics) to gain more robustness, as well as graph-theoretic guarantees <ref type="bibr">(Wilkinson et al., 2005)</ref> (e.g., Scagnostics-preserving subsampling in our pipeline <ref type="bibr">(Wilkinson and Wills, 2008)</ref>), answering the question of hyperparameter selection for large datasets raised in <ref type="bibr">Gove et al. (2022)</ref>. Subsampling on the normalized parameter space enhances efficiency in evaluating metrics. We have not observed significant variations among different sampling regimes on the datasets and DR methods we investigated, but for highly structured <ref type="bibr">(Hie et al., 2019)</ref> or ultra-high-dimensional (i.e., n &lt; d, as for micro-arrays <ref type="bibr">(Wilkinson and Luo, 2022)</ref>) DR tasks, we expect that nontrivial modifications of surrogate-based tuning methods need to be developed.</p><p>Yang, Z., I. <ref type="bibr">King, Z. Xu, and E. Oja (2009)</ref>. Heavy-tailed symmetric stochastic neighbor embedding. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta (Eds.), Advances in Neural Information Processing Systems, Volume 22. Curran Associates, Inc.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Performance Metrics and Tuning Algorithm</head><p>Table 1: The performance metrics considered for evaluating DR methods for hyperparameter selection. Task-independent: the Pearson correlation coefficient between the pairwise-distance vectors of X and X*; coranking measures<ref type="foot">foot_2</ref> (e.g., AUC, Q-global, Q-local) based on the ranks of pairwise distances between X and X*. Task-dependent: mis-classification rates for classification on X* using labels defined on X; clustering score metrics<ref type="foot">4</ref> (e.g., mutual information indices) for clustering on X* using features of X*.</p><p>Input: n &#215; d data matrix X, subsample size n&#8242;, reduced dimension d&#8242;, DR method D and associated hyperparameter space &#947; &#8712; &#915;, performance metric &#964;, surrogate model S, pilot budget N_1, sampling budget N_2 (usually N_1 &#8810; N_2), and (optional) the number of repeats N_repeat. Result: optimal hyperparameter &#947; &#8712; &#915;.</p><p>Initialize with a pilot design; then, for each iteration: fit S using &#915;_m as inputs and T_m as responses; optimize the acquisition function of S to obtain a new &#947;&#9839;; apply D at &#947;&#9839; to X&#9839; to get X* under different initializations (each with a different random seed); compute the mean ET of the resulting list T of metric values as the next objective; append &#947;&#9839; to &#915;_m to get &#915;_{m+1}, and append ET to T_m to get T_{m+1}.</p><p>Algorithm 1: Subsampling Surrogate-based Hyperparameter Tuning Algorithm.</p><p>This algorithm shares similarities with regular BO, as both methods employ a surrogate model to approximate the objective function and use an acquisition function to guide the search for optimal hyperparameters. 
However, there are some novel aspects of this algorithm that differentiate it from standard BO:</p><p>&#8226; Subsampling: The algorithm focuses on finding the optimal hyperparameters using a subsample of the dataset, aiming to maintain efficiency in evaluating the metrics. This is particularly beneficial when dealing with large datasets, as it significantly reduces computational costs.</p><p>&#8226; Dimension reduction: The algorithm is tailored for optimizing hyperparameters of dimensionality reduction (DR) methods, a specific application that requires unique consideration of the performance metrics and DR goals.</p><p>&#8226; Normalization and adaptability: The algorithm normalizes the hyperparameter spaces to ensure that the DR methods have the same parameter spaces for both the subsample and the full sample. This is a crucial aspect, as it allows the algorithm to maintain consistency between the subsample and the full sample, ensuring that the optimal hyperparameters found on the subsample generalize well to the full dataset. Furthermore, this normalization allows the algorithm to be adapted to various DR methods, performance metrics, or dimensionality reduction goals, depending on specific applications or dataset characteristics, making it more versatile in handling different problems.</p><p>&#8226; Task-independent and task-dependent metrics: The algorithm considers the expected performance metric EF(&#947;) instead of the actual metric F(&#947;), which increases its stability, robustness, and applicability to various task-independent and task-dependent metrics.</p><p>While the algorithm builds upon the foundation of BO, these novel aspects make it more suitable for optimizing hyperparameters in dimensionality reduction tasks and provide unique advantages over regular BO in this context. The algorithm could also include a (manual or automatic) procedure for determining the next objective based on the variability of specific metrics. 
These metrics are calculated by running the dimension reduction (DR) method multiple times with different initializations and computing a score t&#9839; = &#964;(X&#9839;, X*) for each run. The aim here is to capture a representative value that indicates the quality of the DR with the current hyperparameters.</p><p>For metrics that exhibit low variability (smaller variance during the N_repeat repeats), the algorithm returns the mean of the computed scores as the next objective. This is because, in such cases, the mean is a reliable measure of central tendency.</p><p>However, for metrics with high variability (larger variance during the N_repeat repeats), the empirical mean might not be an adequate representation due to the influence of potential outliers. In such cases, the algorithm proposes two alternatives. The first is to return a specific quantile (for example, the median) of the computed scores as the next objective; the median is a robust measure of central tendency unaffected by outliers. The second alternative is to return the mean of the computed scores minus a multiple of the variance. This adjustment accounts for the variability of the metrics and makes the selected next objective more conservative. The choice between these two alternatives can be based on the specific characteristics of the dataset and the DR method in use.</p></div>
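Putting Algorithm 1 and the repeat-and-aggregate rule above together, here is a minimal, self-contained sketch in plain NumPy. The Gaussian-process surrogate, the expected-improvement acquisition, the variance threshold, and the toy objective standing in for a subsampled DR metric are all illustrative choices of ours, not the paper's implementation (which uses scikit-optimize):

```python
import numpy as np
from math import erf

def next_objective(scores, var_threshold=0.01):
    """Aggregate N_repeat metric scores: mean when repeats agree,
    median (robust to outlier runs) when variability is high."""
    scores = np.asarray(scores)
    return scores.mean() if scores.var() <= var_threshold else np.median(scores)

def gp_posterior(x_tr, y_tr, x_q, ls=0.2, noise=1e-4):
    """Posterior mean/std of a GP with an RBF kernel (mean-centered targets)."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)
    K_inv = np.linalg.inv(k(x_tr, x_tr) + noise * np.eye(len(x_tr)))
    ym = y_tr.mean()
    Kq = k(x_q, x_tr)
    mu = ym + Kq @ K_inv @ (y_tr - ym)
    var = 1.0 - np.einsum('ij,jk,ik->i', Kq, K_inv, Kq)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sd, best):
    """EI acquisition for minimization."""
    z = (best - mu) / sd
    cdf = 0.5 * (1.0 + np.array([erf(v / np.sqrt(2)) for v in z]))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return (best - mu) * cdf + sd * pdf

def tune(metric, n_pilot=5, n_steps=15, n_repeat=3, seed=0):
    """Algorithm 1 sketch: pilot design, then GP-EI refinement over gamma in [0, 1]."""
    rng = np.random.default_rng(seed)
    gammas = list(rng.uniform(size=n_pilot))
    objs = [next_objective([metric(g, rng) for _ in range(n_repeat)]) for g in gammas]
    grid = np.linspace(0.0, 1.0, 201)
    for _ in range(n_steps):
        mu, sd = gp_posterior(np.array(gammas), np.array(objs), grid)
        g_new = grid[np.argmax(expected_improvement(mu, sd, min(objs)))]
        gammas.append(g_new)
        objs.append(next_objective([metric(g_new, rng) for _ in range(n_repeat)]))
    return gammas[int(np.argmin(objs))]

# Toy stand-in for a (subsampled, noisy, normalized) DR metric with optimum at 0.45.
toy_metric = lambda g, rng: (g - 0.45) ** 2 + 0.001 * rng.standard_normal()
best_gamma = tune(toy_metric)
```

In practice `metric` would run the DR method on the subsample X♯ at the candidate hyperparameter and evaluate the chosen normalized performance metric, exactly as in the repeat-and-average step described above.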
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Visualization for COIL Dataset</head><p>In Figure <ref type="figure">12</ref>, we compare the performance of t-SNE when different metrics are minimized. When the perplexity is set to 44 (normalized perplexity 0.19), we observe a distinct separation of the data points from different classes, and at this perplexity both 1 &#8722; AUC and the classification error rate are minimized. Moreover, the metric 1 &#8722; Q-global attains its minimum when the perplexity is set to 324 (normalized perplexity 0.9), indicating that t-SNE better preserves the global structure at this perplexity.</p><p>Figure <ref type="figure">12</ref>: 2-dimensional visualization for 5 of the objects from the COIL dataset using t-SNE with different perplexities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Surrogate-based Sensitivity Analysis for UMAP</head><p>When there is more than one hyperparameter to select, as in UMAP, we can perform surrogate-based sensitivity analyses for the hyperparameters with respect to the chosen performance metric (e.g., Sobol and Shapley analyses <ref type="bibr">(Owen, 2014)</ref>). The Sobol sensitivity analysis conducted on the fitted surrogate model for EF(&#947;) provides valuable insights into the influence of the hyperparameters (the number of nearest neighbors (n_neighbor) and the minimum distance (min_distance)) on the variance of the chosen performance metric. A comparison of the first-order (S1) and total-order (ST) sensitivity indices reveals distinct effects of the parameters on the model output, i.e., the chosen metric 1 &#8722; AUC. This analysis helps us understand how each hyperparameter affects the chosen performance metric.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sobol Sensitivity Indices</head><p>Table <ref type="table">2</ref>: Sobol sensitivity indices based on the surrogate for the tuning samples (&#947;, F(&#947;)) and their confidence intervals for the UMAP experiment on the unbalanced 2-cluster dataset.</p><p>Parameter | S1 | S1 Conf. | ST | ST Conf.
n_neighbor | 4.89e-04 | 4.70e-04 | 1.00049 | 9.61e-01
min_distance | 4.89e-04 | 9.29e+24 | 1.00049 | 1.22e+25</p><p>In Table <ref type="table">2</ref> for the unbalanced 2-cluster dataset, the first-order sensitivity indices (S1) and their respective confidence intervals (S1 Conf.) are presented, indicating the individual contribution of each input parameter to the performance-metric variance. The total-order sensitivity indices (ST) and their confidence intervals (ST Conf.) are also displayed, representing the combined effect of each parameter, accounting for both its individual contribution and its interactions with other parameters.</p><p>Both parameters have similar S1 and ST values, suggesting that they have comparable individual and overall impacts on the performance-metric variance. However, the confidence intervals for min_distance are significantly larger, particularly for S1, which indicates higher uncertainty in the sensitivity-index estimates for min_distance. This is intuitive, since both hyperparameters can affect the quality of DR on unbalanced clusters.</p><p>Full:
Parameter | S1 | S1 Conf. | ST | ST Conf.
n_neighbor | 0.005881 | 0.002237 | 1.003019 | 0.002075
min_distance | 0.000307 | 0.019511 | 1.896585 | 3.475031
Subsampled (Uniform):
Parameter | S1 | S1 Conf. | ST | ST Conf.
n_neighbor | 0.013529 | 0.001716 | 1.008246 | 0.004326
min_distance | 0.004862 | 0.007099 | 1.510901 | 0.429790</p><p>Table <ref type="table">3</ref>: Sobol sensitivity indices based on the surrogate for the tuning samples (&#947;, F(&#947;)) and their confidence intervals for the UMAP experiment on the COIL dataset.</p><p>Table <ref type="table">3</ref> reports the individual contribution of each input parameter to the performance-metric variance. In both analyses, whether the full sample or a subsample is used during tuning, n_neighbor has a higher first-order sensitivity index (S1) than min_distance, indicating that n_neighbor has a larger individual contribution to the metric variance. The total-order sensitivity index (ST) values for both n_neighbor and min_distance show differences between the two analyses, partially due to the increased uncertainty introduced by subsampling. In the analysis based on the full sample (no subsampling during tuning), min_distance has a much larger ST value, suggesting that min_distance has a more substantial overall impact on the metric variance when accounting for interactions between these two parameters on the COIL dataset; this is supported by the analysis on the subsampled dataset as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Sine Dataset</head><p>This is a small-scale, synthetic dataset that allows us to investigate the framework on data with underlying geometric structure. The full dataset is {x_i}_{i=1}^{n} &#8834; R^4, where x_i = [z(i), sin(z(i)), sin(2z(i)), sin(3z(i))] and z(i) = 2&#960;(i &#8722; 1)/(n &#8722; 1) &#8722; &#960;. In other words, the dataset is the set of tuples (z, sin(z), sin(2z), sin(3z)) evaluated at n = 200 equally spaced points z &#8712; [&#8722;&#960;, &#960;].</p><p>Since our tuning is surrogate-based, we can also apply other supervised-model analysis tools such as Shapley values <ref type="bibr">(Owen, 2014)</ref> or LIME <ref type="bibr">(Ribeiro et al., 2016)</ref>.</p><p>For our next experiment, we apply our framework to the Sine dataset. In this experiment, we consider the following metrics: AUC (for local structure), Q-global (for global structure), and the error rate for classification performance.</p><p>Figure <ref type="figure">13</ref>: Boxplots for the performance of t-SNE for the Sine dataset.</p><p>In Figure <ref type="figure">13</ref>, we observe that the choice of subsampling regime can influence the discrepancy between the subsample-based metric S_&#947;(X&#9839;) and the true performance metric function S_&#947;(X). Random subsampling, for instance, may not capture the full diversity of the data, leading to the loss of local details. Additionally, the selected performance metric plays a crucial role: if it does not adequately encapsulate the aspects of the data sensitive to the hyperparameters, the resulting function may deviate from the true relationship. The complexity and structure of the full dataset pose challenges to the subsampling <ref type="bibr">(Hie et al., 2019)</ref> and surrogate modeling <ref type="bibr">(Luo et al., 2022)</ref>, possibly causing the surrogate to miss key interaction effects and diverge from the true functional form. 
Additionally, the choice of surrogate model, the optimization procedure, and inherent data variability are crucial in reducing the randomness. For instance, an inflexible surrogate model might fail to capture the true relationship, and an inadequate optimization procedure might not fully explore the parameter space. High data heterogeneity could further influence the function shape derived from a subsample, which makes it all the more important to examine multiple metrics simultaneously, as our framework suggests.</p><p>From Figure <ref type="figure">13</ref>, the optimal perplexity for the full data, according to the average ratio, is around 160 (normalized perplexity = 0.8), while that for the subsampled data is near 50 (normalized perplexity = 0.8). In contrast, according to AUC, Q-global, or Q-local, the optimal perplexity for the original data is close to 100 (normalized perplexity = 0.5), and that for the subsampled data is close to 33 (normalized perplexity = 0.5). Under the normalized perplexity, the performance on the original and subsampled data is similar. This suggests that using the same normalized perplexity for t-SNE on the subsampled data yields results comparable to applying t-SNE on the full data and then subsampling the reduced data.</p><p>Finally, we include the surrogate-based tuning for both the original data and the subsampled data. In contrast with Example 1, where we use the metric AUC, in this example we choose a different metric, Q-global, which has a much flatter landscape, as shown in Figure 13. We use GP-EI for the surrogate-based tuning, assign the budget N_0 = N_1 + N_2 = 5 + 15, and repeat the tuning 20 times.</p><p>The convergence plots for both cases are given in Figure <ref type="figure">14</ref>. 
We see from both plots that although the landscape for Q-global is flat, the surrogate-based tuning converges to the minimum within a few iterations; thus, to find the optimal perplexity for the original data, we can more efficiently tune on the subsampled data instead.</p></div>
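The Sine dataset used in this appendix can be generated directly from the formula above; this is a plain transcription of the definition, with n as stated in the text:

```python
import numpy as np

n = 200
z = np.linspace(-np.pi, np.pi, n)   # n equally spaced points in [-pi, pi]
# Each row is (z, sin(z), sin(2z), sin(3z)), giving the n x 4 data matrix.
X = np.column_stack([z, np.sin(z), np.sin(2 * z), np.sin(3 * z)])
```

The intrinsic dimension is 1 (a curve embedded in R^4), which is what makes the dataset a useful probe of how well a DR method and its hyperparameters preserve geometric structure.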
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Vehicle Dataset</head><p>We consider the Vehicle dataset from the LIBSVM repository. The Vehicle dataset consists of features extracted from the silhouettes of vehicles, comprising 4 classes and 846 data points with 18 features each. The objective is to classify a given silhouette as one of four types of vehicles: 'Opel', 'Saab', 'Bus', or 'Van'. The features measure various properties of the silhouette, such as compactness, circularity, and the distance between the center of the vehicle and a line through the center of the vehicle and the rear axle.</p><p>We sample a third of the full data and plot the performance metrics at each perplexity in Figure <ref type="figure">15</ref>. We next include surrogate-based tuning using AUC on the subsampled data as a proxy for the full data. The optimal normalized perplexity obtained from tuning on the subsampled data is 0.06. We then perform t-SNE on the full dataset with perplexity 51 &#8776; 846 &#215; 0.06. In this example, we are able to find the optimal perplexity under AUC. However, without preprocessing the dataset, we are not able to use t-SNE to distinguish the different classes of data. A similar observation can be found in Figure <ref type="figure">3</ref> of <ref type="bibr">(Yang et al., 2009)</ref>.</p><p>Additionally, the convergence plot in Figure <ref type="figure">16</ref> shows the typical "flat" behavior that arises when the performance-metric landscape is flat. As future work, we may want to include early-stopping techniques from BO <ref type="bibr">(Swersky et al., 2014)</ref> to terminate the hyperparameter tuning. On one hand, this saves the computational cost of further evaluations; on the other hand, it could serve as a flag for the potential inappropriateness of applying a certain DR method to the given dataset.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>F Reuters Dataset</head><p>In Figure 17, we highlight that the visualization produced via our framework provides a more informative view of the data. 
In this figure, we highlight individual classes in the Reuters dataset. We see that C15 and CCAT are better separated from GCAT when the perplexity is chosen using the surrogate model than with the default t-SNE perplexity parameter; this is a reasonable expectation, as C15 can be considered a subclass of CCAT in the Reuters dataset. Moreover, at the optimal perplexity, the visualization performs better in classification and clustering tasks, as demonstrated by the evaluation metrics, namely the classification error rate (0.3102 vs. 0.3242) and NMI (0.2429 vs. 0.2054).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Based on different definitions of performance metrics, metrics may have different ranges. We normalize each of these metrics to be in [0, 1] and enforce that smaller values of metrics indicate better DR results.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>We did not explicitly specify a subsampling scheme in this flowchart for simplicity. We use scikit-optimize for basic experimental results, but also point out that<ref type="bibr">Cho et al. (2021)</ref> provides a transfer-learning-oriented pipeline that records learned surrogates and supports the framework described in this paper.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Coranking scores are introduced by<ref type="bibr">Lee and Verleysen (2009)</ref> and implemented in R by<ref type="bibr">Kraemer et al. (2018)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>We refer to<ref type="bibr">Ben-David and Ackerman (2008)</ref> for detailed definition for clustering scores.</p></note>
		</body>
		</text>
</TEI>
