<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Efficient Task Transfer for HLS DSE</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date when="2024-10-27">27 October 2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10647938</idno>
					<idno type="doi">10.1145/3676536.3676723</idno>
					
					<author>Zijian Ding</author><author>Atefeh Sohrabizadeh</author><author>Weikai Li</author><author>Zongyue Qin</author><author>Yizhou Sun</author><author>Jason Cong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Several recent works have proposed model-based optimization methods to improve the productivity of using high-level synthesis (HLS) to design domain-specific architectures. These methods replace the time-consuming performance estimation or simulation of a design with a proxy model and automatically insert pragmas to guide hardware optimizations. In this work, we address the challenges that the evolving landscape of HLS tools poses for HLS design space exploration (DSE). As these tools develop, the quality of results (QoR) from synthesis can vary significantly, complicating the maintenance of optimal design strategies across different toolchains. We introduce Active-CEM, a task transfer learning scheme that leverages a model-based explorer designed to adapt efficiently to changes in toolchains. This approach improves sample efficiency by identifying high-quality design configurations under a new toolchain without requiring extensive re-evaluation. We further refine our methodology by incorporating toolchain-invariant modeling, which allows us to predict QoR changes more accurately despite shifts in the black-box implementation of the toolchains. Experimental results on the HLSyn benchmark when transitioning to a new toolchain show an average performance improvement of 2.38× over AutoDSE and 1.2× over HARP, while increasing sample efficiency by 5.75× and reducing runtime by 2.7×.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>General-purpose computers are widely employed across diverse domains owing to their ease of programming. However, they incur significant overheads from the extended instruction pipelines needed to support this generality. This has given rise to domain-specific accelerators (DSAs) <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">14]</ref>. However, DSAs pose challenges in programming, restricting their user base to hardware experts <ref type="bibr">[8]</ref>.</p><p>Figure <ref type="figure">1</ref> (a) V20&amp;V21, (b) V21&amp;V23: The distance of the best design found by AutoDSE <ref type="bibr">[28]</ref> between different toolchains. The horizontal axis represents different programs, and the vertical axis denotes the distance between designs. V20: Vitis HLS 2020.2, V21: Vitis HLS 2021.1, V23: Vitis HLS 2023.2. The distance is calculated by summing up the discrete code difference between the two designs over each pragma.</p><p>High-level synthesis (HLS) <ref type="bibr">[11]</ref> was introduced to simplify the design process by raising the abstraction level from the register-transfer level (RTL) to C/C++. Designers use compiler directives, in the form of pragmas, to describe the microarchitecture. While HLS reduces design turn-around times, not every HLS design yields a high-performance microarchitecture <ref type="bibr">[28]</ref>. As a result, a new line of research aims to explore the solution space defined by pragmas more efficiently. Because evaluating each design candidate with HLS tools is time-consuming, recent studies propose supervised learning models as proxies of HLS tools, which predict the quality of results (QoR) to expedite exploration <ref type="bibr">[6,</ref><ref type="bibr">26,</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref>. 
Training such models involves collecting extensive datasets of diverse designs synthesized by HLS tools. A significant issue arises from the continuous evolution of HLS tools, which can significantly impact QoR values <ref type="bibr">[27]</ref>. To gain insight into this issue, we utilized AutoDSE <ref type="bibr">[28]</ref> to gather data on different Vitis HLS toolchains: 2020.2 (V20) <ref type="bibr">[1]</ref>, 2021.1 (V21) <ref type="bibr">[2]</ref> and 2023.2 (V23) <ref type="bibr">[3]</ref>. Figure <ref type="figure">2</ref> illustrates how the performance and resource utilization of each design may vary when changing the toolchain. Additionally, as shown in Table <ref type="table">1</ref>, the validity of a design may also fluctuate. During HLS design synthesis, encountering errors is common. For instance, the HLS tool may refuse to synthesize a design with high parallelization factors <ref type="bibr">[26]</ref>, or it may time out during synthesis. It is worth noting that we only consider designs within the intersection of those explored by AutoDSE, and more significant changes may occur when evaluating designs across the entire design space. One question that arises is whether changes in performance/resource labels and validity lead to alterations in the optimal design scheme. For instance, a design that previously performed well with one toolchain may become invalid with another. To understand how changes in HLS tools affect the design scheme, we run AutoDSE on these toolchains and compare the best design it finds.</p><p>Table <ref type="table">1</ref>: The validity changes when altering the toolchain. We compare two pairs of commercial HLS tools: V20&amp;V21 and V21&amp;V23. "V1" and "V0" stand for a valid or invalid design in the corresponding toolchain. (a) V20&amp;V21, rows V20 and columns V23 counterpart V21: V1/V1 1462, V1/V0 295, V0/V1 269, V0/V0 348. (b) V21&amp;V23, rows V21 and columns V23: V1/V1 3690, V1/V0 120, V0/V1 72, V0/V0 1814.</p><p>
AutoDSE optimizes design performance based on feedback from the toolchain and can naturally adapt to new toolchains. Therefore, any change in the best design found by AutoDSE is attributed to the change in the downstream toolchain. Upon applying AutoDSE to the new toolchain, we observed an average performance improvement of 2×. Furthermore, we noted that the change in the performance of the best design arises from a modification in the design scheme. Essentially, AutoDSE can discover new and improved designs on the new toolchain. Figure <ref type="figure">1</ref> depicts the edit distance between the best designs found by AutoDSE on the new and old toolchains. The pragma edit distance is calculated by summing up the differences between each pragma. For example, if a pragma P_1 can take values from [1, 2, 5], and one design takes value 5 while the other takes value 1, then their discretized values (indices in the design space) are 3 and 1 respectively. This results in a contribution of 3 − 1 = 2 to the total edit distance. A larger edit distance indicates a greater shift in the design scheme.</p><p>It is therefore critical to design a task transfer learning scheme that can efficiently adapt to the evolving toolchain. From the previous observations, we identify two major challenges. (1) The first is to sample designs from the design space efficiently, so that a model transferred with those designs can effectively identify high-quality designs. (2) The second is to adapt the model to the large label shift. Overall, we want to efficiently uncover good designs on the new toolchain when the label shift is unknown and the design space can be as large as 10^13, as in the HLSyn benchmark <ref type="bibr">[5]</ref>.</p><p>Model-based optimization is a widely used technique for efficiently sampling high-quality designs. 
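<![CDATA[
As a concrete illustration of the pragma edit distance described above, the following minimal Python sketch maps each pragma value to its index in the pragma's option list and sums absolute index differences; the `design_space` dictionary and pragma names are hypothetical, not from the paper's artifact.

```python
# Hypothetical sketch of the pragma edit distance used in Figure 1:
# each pragma value is mapped to its index in that pragma's option list,
# and absolute index differences are summed over all pragmas.
# (The paper counts indices 1-based; 0-based indexing below gives the
# same difference.)

def pragma_edit_distance(design_a, design_b, design_space):
    """Sum of discretized (index) differences over each pragma."""
    dist = 0
    for pragma, options in design_space.items():
        ia = options.index(design_a[pragma])
        ib = options.index(design_b[pragma])
        dist += abs(ia - ib)
    return dist

# The paper's example: a pragma with options [1, 2, 5]; one design takes
# value 5 and the other takes 1, contributing 2 to the total distance.
space = {"P1": [1, 2, 5]}
print(pragma_edit_distance({"P1": 5}, {"P1": 1}, space))  # -> 2
```
]]>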
Previous work has utilized Bayesian Optimization <ref type="bibr">[7,</ref><ref type="bibr">29]</ref>, where sampling is guided by an acquisition function. However, optimizing the acquisition function in a large discrete space, such as that of HLS designs, remains challenging. Other studies, for example <ref type="bibr">[20]</ref>, have adapted active learning to achieve better sample efficiency when transferring to a new technology node, proposing data selection algorithms that select diverse and informative data from existing datasets. Our approach is not limited to selecting data from an existing unlabeled data pool and can navigate the whole design space. In <ref type="bibr">[26,</ref><ref type="bibr">27,</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref>, optimizing HLS designs involves training a proxy model on an offline dataset and then conducting optimization with this proxy model. This offline approach requires time-consuming data collection and suffers from generalizability issues when applied to new domains and tasks.</p><p>Transfer learning investigates the generalizability of models to unseen domains and tasks, with representation learning playing a critical role in domain transferability <ref type="bibr">[6,</ref><ref type="bibr">13,</ref><ref type="bibr">17,</ref><ref type="bibr">[33]</ref><ref type="bibr">[34]</ref><ref type="bibr">[35]</ref><ref type="bibr">[36]</ref>. Some studies propose learned design space pruning for domain transfer in Bayesian Optimization <ref type="bibr">[4,</ref><ref type="bibr">23]</ref>. Unlike domain transfer, which assumes a data distribution shift, we focus on task transfer, assuming a tractable change in the black-box function. 
Previous works <ref type="bibr">[20,</ref><ref type="bibr">27]</ref> have shown better task transfer accuracy by freezing some model parameters.</p><p>To enable efficient sampling on an unseen toolchain, we propose a novel model-based explorer designed for discrete optimization spaces. We jointly optimize the model prediction and the importance sampling distribution, allowing the sampling process to leverage knowledge from the previous toolchain and achieve better sample efficiency. Moreover, the sampling distribution can guide model updates, enabling the model to minimize its error on a subset of the design space rather than striving for perfect accuracy across the entire space. We also explore learning task-generalizable representations of HLS designs. Our key observation is that the same input often represents similar microarchitectures, despite potential differences in their implementations across toolchains. Thus, we aim to learn representations of HLS designs that capture their invariance across toolchains.</p><p>Overall, we make the following contributions:</p><list><item>We propose a model-based explorer that efficiently searches the whole design space, with better sample efficiency than the SoTA approach.</item><item>We introduce a novel modeling of HLS designs to capture the invariance between toolchains.</item><item>Evaluation on 13 programs transferring to V21 shows that we can improve design performance by 2.38× and 1.2× on average, compared with AutoDSE and HARP, while improving sample efficiency by 5.75× and runtime by 2.7×.</item><item>Evaluation on the latest toolchain reveals that our design outperforms the best of AutoDSE and HARP by 1.27×.</item></list></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PRELIMINARIES 2.1 HLS Design Space and Pragmas</head><p>As with AutoDSE and HARP, we utilize the open-source AMD/Xilinx Merlin Compiler <ref type="bibr">[10]</ref> to streamline DSE for HLS. It supports three types of pragmas: PIPELINE, PARALLEL, and TILE. Taking designs with these pragmas as input, it automates the synthesis of on-chip buffers and memory interfaces. The PIPELINE, PARALLEL, and TILE pragmas handle loop pipelining, double buffering, loop unrolling, and loop tiling for parallelization and latency hiding. The Merlin Compiler supports the standard pipelining and parallelization of HLS, but also offers additional automation for pragmas like "array_partition", which can be optimally solved given the parallel and tiling factors. This compiler provides a more compact design space and is used in this work. However, our approach can be directly applied to other HLS tools as well.</p></div>
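<![CDATA[
To make the pragma design space concrete, the following Python sketch represents a design point as one value per pragma, so the design space is the Cartesian product of each pragma's options. The pragma names and option lists are invented for illustration and are not the Merlin Compiler's actual interface.

```python
# Illustrative sketch (not the Merlin Compiler API): a design point is an
# assignment of one value per pragma; "off"/1 disables a pragma.
from itertools import product

design_space = {
    "PIPELINE_L1": ["off", "flatten", "cg"],   # assumed option names
    "PARALLEL_L2": [1, 2, 4, 8, 16],           # unroll factors
    "TILE_L2":     [1, 2, 4],                  # tiling factors
}

def enumerate_designs(space):
    """Yield every design point as a pragma-to-value dict."""
    names = list(space)
    for combo in product(*space.values()):
        yield dict(zip(names, combo))

size = 1
for options in design_space.values():
    size *= len(options)
print(size)  # 3 * 5 * 3 = 45 candidate designs
```

Real kernels have many pragmas with many factors each, which is how the space grows to the 10^13 designs mentioned in the introduction.
]]>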
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">HARP</head><p>The HARP framework <ref type="bibr">[27]</ref> is the state-of-the-art model-based approach for predicting the performance and resource utilization of HLS designs. It introduces a hierarchical graph representation to mitigate the over-smoothing issue encountered in graph neural networks. By employing a hierarchical graph, the shortest paths in the graph can be substantially reduced, which enables a shallow GNN to capture more global information. Furthermore, HARP proposes to model the program and the pragma transformations separately, treating pragmas as transformations modeled by Multi-Layer Perceptrons (MLPs). This model design aligns with the nature of HLS designs and results in improved accuracy and DSE performance. HARP trains a classification model to predict the validity and a regression model to predict the QoR of a design. We follow this convention and denote the classification model as R_θc and the regression model as R_θr.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">METHODOLOGY</head><p>Our primary objective is for the learned model to adapt to a new toolchain while identifying Pareto-optimal points and maximizing sample efficiency. Our approach focuses on employing importance sampling to select high-quality designs and integrating model prediction within the importance sampling algorithm to improve sample efficiency. Figure <ref type="figure">3</ref> illustrates an overview of our proposed method, which consists of two main components: the DS Sampler and Reward Model Transfer Learning. Using the Cross-Entropy Method (CEM), the DS Sampler performs importance sampling based on the design performance. It iteratively optimizes a probability distribution across the entire design space, concentrating higher density on designs with high rewards. This approach ensures that high-reward designs are sampled more frequently, thereby increasing the likelihood of identifying optimal designs. To further boost sample efficiency, we introduce a reward model that predicts the validity and QoR of HLS designs. This model is adapted from previous toolchains, leveraging data collected from earlier implementations to enhance prediction accuracy. To address the potential label shift due to toolchain changes and the distribution shift from the importance sampling, we integrate an active learning subroutine within the CEM algorithm. In summary, we facilitate an efficient transfer to a new toolchain through a model-based method, which involves two iterative steps: fitting the model to the current sampling distribution and updating the sampling distribution based on the model's recommendations.</p><p>In Section 3.1, we describe the toolchain-invariant modeling that efficiently adapts the model to the label shift. In Section 4, we explain the model-based importance sampling algorithm based on CEM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Toolchain-invariant modeling</head><p>A critical problem in training a prediction model for HLS designs is how to train a robust model when the labeled data is scarce compared to the size of the design space. One way to mitigate the insufficiency of labeled data is to transfer knowledge from all existing data, possibly collected from different toolchains. Our intuition is to model a robust representation of HLS designs that contains the invariant information between different toolchains. In practice, such invariance is abundant. For example, the same input program with the same pragmas represents similar microarchitectures, and similar microarchitectures will not result in completely different implementations. Following this intuition, we consider learning invariant embeddings for task transfer on HLS designs by training a single model with data from different toolchains.</p><p>One problem to tackle is training a model that is aware of which toolchain version produced the data. A simple solution is to include the version tag in the input data. However, when switching to a new version, we may need a completely new tag, and that will cause a huge shift in the input data. Another approach is to have a shared encoder but a different decoder for each task. Each decoder is trained on data from its corresponding toolchain, allowing the unique parameters of the decoders to capture the specific characteristics of each toolchain. This approach follows the intuition that the same input HLS design represents a similar microarchitecture for different toolchains, and can be represented as the same embedding in the latent space. However, we argue that it is not enough to just model the shared information between toolchains. A shared embedding may not contain enough information to address the variance between toolchains. In HLS design, the information that remains completely invariant is the program. 
But for pragma transformations, it remains unknown exactly which parts of the implementation stay the same and which change. Following this intuition, we propose the following modeling scheme:</p><p>Separate embeddings for shared and private information. Fig. <ref type="figure">4</ref> shows the model architecture. For each input design, two separate embeddings of the design are learned. One embedding is modeled by a set of parameters shared among all toolchains, and the other is modeled with a set of parameters private to each toolchain. We then concatenate the two embeddings and forward the result to a different decoder for each toolchain.</p><p>Another benefit of the proposed model architecture is that it can be trained jointly with data from multiple versions. We optimize the mean-squared error (MSE) averaged over all versions of the data. Intuitively, training with more data helps the model obtain a stronger representation of the invariant part. Moreover, once diverse data from different toolchains have been gathered, the invariant part can be pretrained, and we are able to freeze the invariant part and fine-tune only the private part when switching to a new toolchain. This results in fewer parameters to fine-tune when transferring to a new toolchain and better efficiency.</p></div>
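<![CDATA[
A minimal numpy sketch of the shared/private parameter layout, assuming toy dimensions and linear-plus-tanh encoders; the real models are GNN-based (as in HARP), so this only illustrates how the shared encoder, per-toolchain private encoders, and per-toolchain decoders fit together.

```python
# Sketch of toolchain-invariant modeling: a shared encoder (one weight
# matrix for all toolchains), a private encoder per toolchain, embedding
# concatenation, then a per-toolchain decoder. Dimensions and random
# initialization are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4                                    # input and embedding sizes
W_shared = rng.normal(size=(D, H))             # shared across toolchains
W_private = {"v20": rng.normal(size=(D, H)),   # one encoder per toolchain
             "v21": rng.normal(size=(D, H))}
W_dec = {"v20": rng.normal(size=(2 * H,)),     # one decoder per toolchain
         "v21": rng.normal(size=(2 * H,))}

def predict_qor(x, toolchain):
    z_shared = np.tanh(x @ W_shared)               # invariant embedding
    z_private = np.tanh(x @ W_private[toolchain])  # toolchain-specific
    z = np.concatenate([z_shared, z_private])      # concatenated embedding
    return float(z @ W_dec[toolchain])

x = rng.normal(size=D)
print(predict_qor(x, "v20"), predict_qor(x, "v21"))
```

When transferring to a new toolchain, `W_shared` would be frozen and only a new private encoder and decoder fine-tuned, mirroring the pretrain-then-fine-tune recipe above.
]]>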
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DS SAMPLER 4.1 Optimization with Cross-Entropy Method</head><p>We now describe the original, model-free version of the Cross-Entropy Method (CEM) <ref type="bibr">[15]</ref> and how it can be applied to optimize HLS designs. CEM was originally designed for rare-event simulation but can be applied to global optimization, and in theory it converges to the global optimum <ref type="bibr">[22,</ref><ref type="bibr">24]</ref>. While the cross-entropy method is treated as a model-based technique in some work <ref type="bibr">[12]</ref> because it maintains a model of the distribution of rare events (good designs), we treat the original version of the algorithm as model-free since it does not contain a learned model of the reward function. We will show that the cross-entropy method is a strong candidate for efficient global optimization.</p><p>Algorithm 1 outlines the application of the Cross-Entropy Method (CEM) for optimizing black-box functions. The process begins by initializing the sampling distribution p(x) to a random distribution (line 3). In each iteration, N samples are drawn from the current distribution and evaluated using the ground truth function R (lines 6, 7). Based on these evaluations, the distribution is then updated to concentrate more on samples that exceed the (1 − ρ) quantile q of the N rewards, i.e., focusing on the top N·ρ performing samples (lines 9-11): a new distribution p̂(x) is fitted to these elite samples and mixed into the current one as p(x) ← (1 − α)p(x) + α·p̂(x). To maintain randomness and prevent overfitting, the step size α typically ranges from 0.7 to 0.9. 
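<![CDATA[
The model-free CEM loop just described can be sketched in Python as follows; the toy reward function, hyperparameter defaults, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Model-free CEM sketch (Algorithm 1) for a toy discrete reward. Each
# pragma i has its own categorical distribution p[i]; each iteration
# samples a population, keeps the top-rho elites, refits p_hat, and
# smooths with step size alpha.
import random

def cem(num_options, reward, n_iters=10, pop=200, rho=0.1, alpha=0.8, seed=0):
    rng = random.Random(seed)
    # p[i][j]: probability that pragma i takes its j-th option
    p = [[1.0 / k] * k for k in num_options]
    best, best_r = None, float("-inf")
    for _ in range(n_iters):
        samples = [[rng.choices(range(len(pi)), pi)[0] for pi in p]
                   for _ in range(pop)]
        rewards = [reward(s) for s in samples]
        # (1 - rho) quantile: reward of the (pop * rho)-th best sample
        q = sorted(rewards, reverse=True)[max(1, int(pop * rho)) - 1]
        elite = [s for s, r in zip(samples, rewards) if r >= q]
        for i, pi in enumerate(p):
            counts = [0] * len(pi)
            for s in elite:
                counts[s[i]] += 1
            p_hat = [c / len(elite) for c in counts]
            # smoothing: p <- (1 - alpha) * p + alpha * p_hat
            p[i] = [(1 - alpha) * a + alpha * b for a, b in zip(pi, p_hat)]
        top = max(zip(rewards, samples))
        if top[0] > best_r:
            best_r, best = top
    return best, best_r

# Toy reward: prefer the largest option index for every pragma.
best, r = cem([3, 5, 2], reward=sum)
print(best, r)  # with this seed, converges to [2, 4, 1] (reward 7)
```
]]>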
This algorithm is noted for its rapid convergence rate and robustness to different hyperparameters, as introduced in <ref type="bibr">[15,</ref><ref type="bibr">21]</ref>.</p><p>To tailor Algorithm 1 for HLS design optimization, the key tasks involve modeling the distribution of design configurations and implementing updates to this distribution. In HLS, each design is characterized by a combination of different pragma values. Each pragma within the program must either be assigned a specific value or be set to a default value that effectively disables it, thereby completing the design configuration. We model the distribution over all possible designs by assigning a probability vector p_i(x) to each pragma i, resulting in p(x) = Π_i p_i(x_i). The independent modeling reduces the likelihood of overfitting when the number of data points available for distribution calculation is limited. We will demonstrate that this method of distribution modeling not only simplifies sampling distribution updates but also yields effective optimization results for HLS designs.</p><p>Figure <ref type="figure">6</ref> (a) CEM-5000, (b) CEM-10000: The convergence rate of CEM-K. The horizontal axis shows the number of iterations, and the vertical axis depicts the logarithmic speedup. The different lines illustrate different kernels. The performance typically reaches a plateau in less than 10 iterations.</p><p>To implement the distribution update, we first get the one-hot embedding of each design. Specifically, for each design x, we use a one-hot vector x_i for each pragma, where x_ij = 1 if the discretized value of pragma i is j, and x_ij = 0 otherwise. 
For example, consider a design space containing two pragmas P_a and P_b, and suppose P_a can take values from [1, 2, 5] and P_b can take values from [1, 2]. Then the design x = {P_a = 5, P_b = 2} will be encoded with x_a = [0, 0, 1] and x_b = [0, 1].</p><p>Then, at each iteration, the distribution p̂ in Algorithm 1 can be calculated by separately calculating the probability vector p_i of each pragma i, with the one-hot vector x_i: p̂_i = (Σ_{x ∈ D_s^t} 1[R(x) ≥ q] · x_i) / (Σ_{x ∈ D_s^t} 1[R(x) ≥ q]), (1) where R is the ground truth reward function, D_s^t denotes the N samples generated by the previous distribution p at iteration t, and q is the (1 − ρ) quantile of R_s^t.</p><p>When implementing the algorithm, we also memorize the best ρ·N designs over all previous iterations and consider them when calculating the quantile and optimizing the distribution, which improves performance. This detail is omitted in the algorithm description for simplicity.</p><p>We first conduct experiments to validate that CEM is well suited for global optimization, comparing it with other algorithms including breadth-first search (BFS), simulated annealing, and a genetic algorithm. Due to the long runtime of evaluating HLS designs, we use the model prediction as the ground truth function. Even though high predicted performance may not translate into high actual performance, it is a good indicator of the ability of the optimizer. We evaluate 5 programs with a large design space and calculate the geomean of the top-1 latency under resource constraints. We also test CEM and genetic search with different population sizes. Fig. <ref type="figure">5</ref> displays the results normalized with respect to BFS. Gen-K and CEM-K stand for the genetic algorithm and CEM under different population sizes. 
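<![CDATA[
The one-hot encoding and the elite-averaged distribution update above can be sketched as follows; the sampled designs, rewards, and quantile value are made up for illustration.

```python
# One-hot encoding of the paper's P_a/P_b example, and the Equation 1
# update: p_hat_i is the average of the one-hot vectors x_i over designs
# whose reward reaches the (1 - rho) quantile q.
def one_hot(value, options):
    return [1 if v == value else 0 for v in options]

space = {"Pa": [1, 2, 5], "Pb": [1, 2]}
design = {"Pa": 5, "Pb": 2}
encoded = {i: one_hot(design[i], opts) for i, opts in space.items()}
print(encoded)  # {'Pa': [0, 0, 1], 'Pb': [0, 1]}

def update_distribution(designs, rewards, q, space):
    """Equation 1: average the one-hot vectors of the elite designs."""
    elite = [d for d, r in zip(designs, rewards) if r >= q]
    p_hat = {}
    for i, opts in space.items():
        cols = [one_hot(d[i], opts) for d in elite]
        p_hat[i] = [sum(c) / len(elite) for c in zip(*cols)]
    return p_hat

designs = [{"Pa": 5, "Pb": 2}, {"Pa": 1, "Pb": 2}, {"Pa": 5, "Pb": 1}]
print(update_distribution(designs, [3.0, 1.0, 2.0], q=2.0, space=space))
```
]]>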
Under the same runtime budget, the CEM explorer performs the best among all baselines, and the genetic search also achieves comparable performance. When the population size is larger, the genetic search runs fewer iterations under the same time limit, resulting in the performance degradation from Gen-500 to Gen-5000. On the other hand, CEM can uncover better designs when increasing the population size, due to its fast convergence rate. As illustrated in Fig. <ref type="figure">6</ref>, the performance typically converges after 8 iterations. However, the results also show that optimization with CEM is not sample-efficient. It requires a very large population size, typically larger than 1000, to reach stable performance. When the population size N is small, the distribution may quickly overfit to a specific design. Evaluating more than 1000 designs using HLS is very time-consuming, even if a dozen designs can be evaluated in parallel. Therefore, in the next section we consider inserting model predictions into the loop to improve the sample efficiency of CEM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Model-based CEM</head><p>One simple way to achieve sample efficiency is to replace the evaluation on the true reward function R with a proxy model R_θ, pick the designs with the top prediction values, and validate them with R at the last iteration. However, such an approach may cause severe performance degradation, because the model R_θ is not accurate under the label shift caused by changing the toolchain and the distribution shift caused by the sampling distribution updates. In Section 1, we illustrated how the model can be vulnerable to the label shift. Another factor to consider is the distribution shift. Even if the model has already adapted to the label shift with data from the new version, it may have low accuracy on some designs visited by the explorer, because of the mismatch between the training data distribution induced by the data gathering heuristic and the testing distribution induced by the explorer. Therefore, it is possible for the explorer to find a design with a high predicted performance but a low actual performance. Such a phenomenon is constantly observed when trying to rely solely on the proxy model to do the optimization <ref type="bibr">[18,</ref><ref type="bibr">19]</ref>. Intuitively, a powerful optimization algorithm will put pressure on the model to be perfectly accurate on the whole design space, which is usually hard to achieve given limited labeled data.</p><p>Therefore, it is critical to first update the model to have a reasonably accurate prediction under the current function R(x) and the current data distribution p(x). Typically, we want to optimize the following objective, where D is the whole design space: min_θ Σ_{x ∈ D} p(x) · (R_θr(x) − R(x))².</p><p>When the sampling distribution updates, the objective becomes a weighted error, assigning greater weight to regions that potentially contain high-quality designs. 
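<![CDATA[
The weighted objective can be illustrated with a small numeric sketch; all predictions, labels, and weights below are invented.

```python
# p(x)-weighted mean-squared error: the model only needs to be accurate
# where the sampling distribution places mass.
def weighted_mse(preds, labels, weights):
    total = sum(weights)
    return sum(w * (y_hat - y) ** 2
               for y_hat, y, w in zip(preds, labels, weights)) / total

preds  = [1.0, 2.0, 8.0]
labels = [1.0, 2.0, 4.0]
# A uniform p(x) penalizes the large error on the third design ...
print(weighted_mse(preds, labels, [1.0, 1.0, 1.0]))   # ~5.33
# ... but if p(x) concentrates on the first two designs, the error vanishes.
print(weighted_mse(preds, labels, [0.5, 0.5, 0.0]))   # 0.0
```
]]>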
Although it is still sample-inefficient to minimize the error of the model under an arbitrary data distribution over the whole design space, we find that it is possible to achieve better sample efficiency by integrating the model update into the optimization process. For the CEM, if the model has an accurate prediction on D_s^t in each iteration, then the next distribution can be calculated solely from the model prediction. Note that D_s^t ⊂ D is sampled from the distribution p(x) and is an unlabeled pool of designs. As discussed in Section 4.1, the size of D_s^t can still be large, so we want to further improve the sample efficiency by selecting several designs from D_s^t to update the model. The sub-problem to solve then is: how to select a few designs in D_s^t and update the model on the true values of these designs, so that the model has minimal error on D_s^t. More specifically, we want to find D_core^t ⊂ D_s^t, so that the model has minimal error on D_s^t when the parameter θ is updated on D_core^t.</p><p>4.2.1 Selective Labeling for Model Update. Several heuristics have been proposed to tackle the data selection problem in active learning. These heuristics primarily consider two objectives, uncertainty and diversity. Both objectives are important for minimizing the model error given a fixed labeling budget.</p><p>However, the uncertainty-based approaches can only characterize the distribution shift, i.e., the change of p(x). In task transfer, it is also important to be aware of the change in the function R(x), or p(y|x). 
While it is hard to estimate this change without actually observing the true values, it is still reasonable to sample diverse designs under a fixed budget, so that more information can be revealed about the unknown function. Therefore, we choose the coreset approach <ref type="bibr">[25]</ref> to acquire diverse data.</p><p>The coreset-based sampling algorithm approximates the model error on the whole pool of data with a distance measurement between the selected data points and the rest of the data. <ref type="bibr">[25]</ref> and <ref type="bibr">[20]</ref> model the problem as K-Center and K-Means clustering, respectively. We utilize the K-Means algorithm on the hidden representation due to its efficiency. Since we have both a regression model R_θr to predict the QoR and a classification model R_θc to predict the validity of a design, we need samples that provide information for both models. We thus divide the labeling budget K into K_1 and K_0 and divide the unlabeled pool D_s^t into two subsets, D_s1^t and D_s0^t. D_s1^t contains designs that the classification model R_θc predicts valid, and D_s0^t contains those that the model predicts invalid. We then run K-Means to get K_1 cluster centroids of the hidden embeddings of R_θr on D_s1^t and K_0 cluster centroids of the hidden embeddings of R_θc on D_s0^t. Finally, the designs closest to the cluster centroids are selected to be added to the database.</p><p>4.2.2 Active-CEM. We now describe the complete exploration algorithm, which consists of an outer loop that updates the sampling distribution according to the model prediction and an inner loop that fits the model to the current distribution. 
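<![CDATA[
A sketch of the K-Means-based selective labeling, shown for a single subset of the pool; a plain k-means is implemented here for self-containment, and the tiny 2-D "embeddings" stand in for the hidden representations of R_θr (the real pipeline clusters each of the two subsets with its own budget K_1 or K_0).

```python
# Cluster the pool's hidden embeddings and label the designs nearest to
# the centroids, so the labeled coreset covers diverse regions of the pool.
import numpy as np

def kmeans_select(embeddings, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(embeddings, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    # pick the pool member closest to each centroid
    return sorted(set(int(d[:, j].argmin()) for j in range(k)))

# Two tight groups of embeddings; a budget of k = 2 picks one per group.
pool = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(kmeans_select(pool, k=2))
```
]]>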
We abbreviate the name as Active-CEM.</p><p>As shown in Algorithm 2, the sampling distribution p is initialized to random, and the models R_θr, R_θc are initialized with the old models trained on the old database. A database B that holds designs with true labels is initialized to empty. For each iteration t, an unlabeled pool D_s^t is sampled according to the distribution p. Then, the unlabeled pool D_s^t, the database B, and the current model are forwarded to the inner active learning loop with the K-Means algorithm. The active learning loop returns an updated model. The updated model then predicts the labels for all designs in D_s^t. According to the predicted labels, the (1 − ρ) quantile of the design performance is calculated on the valid designs that satisfy the resource constraint, and the distribution is updated according to Equation <ref type="formula">1</ref>. The only difference is that the true value R(x) in Equation 1 is replaced with the updated prediction models R_θr(x) and R_θc(x).</p><p>Algorithm 2 Active-CEM. Input: num iterations n, population size N, true function R, quantile ρ, old models R′_θr, R′_θc, active learning rounds m, batch size b. Output: best design d_best, labeled database B.</p><list><item>Initialize distribution p(x) to random.</item><item>R_θr, R_θc ← R′_θr, R′_θc</item><item>for t = 1 to n do</item><item>D_s^t ← N samples without replacement from p(x)</item><item># actively update model</item><item>for j = 1 to m do</item><item>D_core^t ← select b samples from D_s^t # K-Means</item><item>B = B ∪ (D_core^t, R(D_core^t)) # run HLS on each instance in D_core^t, add to B</item><item>R_θr, R_θc ← train_model(B, R_θr, R_θc)</item><item>end for</item><item># use the model prediction to update the sampling distribution; invalid points assigned −∞</item><item>R_s^t(x) = R_θr(x) + ∞ · (R_θc(x) − 1) for x ∈ D_s^t</item><item>q ← the (1 − ρ) quantile of R_s^t</item><item>Update p(x) according to R_s^t and q</item><item>end for</item><item>return best design found, B, (R_θr, R_θc)</item></list><p>4.2.3 Sample efficiency and runtime of the algorithm. 
We briefly discuss the sample efficiency and the runtime of the proposed algorithm. By sample efficiency, we mean the number of designs evaluated with the HLS toolchain during exploration, which is the major runtime bottleneck. As shown in Algorithm 2, the model-free version of the CEM algorithm needs to sample N · n design points, while the model-based version only requires m · b · n. As for the runtime, since designs are evaluated in batches, the runtime is bounded by the number of parallel HLS workers and the number of samples; the overheads of sampling unlabeled designs and updating the distribution are negligible. Suppose there are C parallel workers to run HLS and the timeout for running an HLS design is T. Then the runtime of the model-free version is T · (n · N)/C, while the runtime of the model-based version is T · (n · m · b)/C. While proper scheduling could reduce the overall runtime of evaluating N designs with C workers, we provide an upper bound on the runtime here.</p></div>
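<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the control flow concrete, the following is a minimal, self-contained Python sketch of the Active-CEM loop described above: an outer loop that refits the sampling distribution to the predicted elites, and an inner loop that labels a diverse K-Means batch with the HLS tool and retrains the model. The function names (sample, embed, run_hls, train) are hypothetical placeholders, and the distribution update is simplified to resampling around the elite set rather than our exact parameterization.</p><p><![CDATA[
```python
import numpy as np

def kmeans_select(embeddings, k, iters=10, seed=0):
    """Pick the designs closest to K-Means centroids of their embeddings."""
    rng = np.random.default_rng(seed)
    k = min(k, len(embeddings))
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute centroids
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = embeddings[labels == c].mean(axis=0)
    d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    return np.unique(d.argmin(axis=0))  # index of the closest design per centroid

def active_cem(sample, embed, predict_qor, predict_valid, run_hls, train,
               n_iters=8, pop_size=1000, rounds=2, batch=16, rho=0.05):
    """Sketch of Algorithm 2: outer CEM loop + inner K-Means active learning."""
    database = []   # (design, true QoR) pairs labeled by the HLS tool
    dist = None     # None = uniform random sampling distribution
    for _ in range(n_iters):
        pool = sample(dist, pop_size)                  # D_s^t ~ p(x)
        for _ in range(rounds):                        # inner active-learning loop
            idx = kmeans_select(embed(pool), batch)    # diverse batch D_core^t
            database += [(pool[i], run_hls(pool[i])) for i in idx]
            train(database)                            # update the proxy models
        qor = np.array([predict_qor(x) for x in pool])
        # invalid designs are assigned -inf, mirroring R_s^t in the text
        qor[~np.array([predict_valid(x) for x in pool])] = -np.inf
        q = np.quantile(qor[np.isfinite(qor)], 1 - rho)  # elite threshold
        elites = [x for x, s in zip(pool, qor) if s >= q]
        dist = elites                                  # refit p(x) toward the elites
    best = max(database, key=lambda t: t[1])
    return best, database
```
]]></p><p>In the notation above, pool corresponds to D_s^t, kmeans_select picks D_core^t, and the −∞ assignment mirrors R_s^t(x) = R̂_θr(x) + ∞ · (R̂_θc(x) − 1).</p></div>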
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Design Space Pruning</head><p>Another way to improve sample efficiency is to prune away part of the large design space so that the explorer can focus on a smaller one. While design space pruning is not the focus of this work, we empirically find that the "TILE" pragma is not as effective as the "PARALLEL" and "PIPELINE" pragmas for optimizing a design. Moreover, whether "TILE" pragmas are effective is determined mostly by the program itself (whether it is memory-bound or compute-bound) and is largely agnostic to changes in the toolchain. Therefore, we consider pruning away "TILE" pragmas for some programs. We adopt a conservative heuristic that prunes "TILE" pragmas only if (1) fewer than 10% of the designs in the old database contain a "TILE" pragma, (2) the top-10 designs do not contain a "TILE" pragma, and (3) the memory footprint of the workload fits on-chip. We prune the "TILE" pragma only if all three conditions are satisfied. With this heuristic, the model also experiences less distribution shift at the early stage of exploration.</p></div>
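<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a sketch, the three conservative conditions can be written as a single predicate. The database schema (pragma-name keys on each design) and the footprint/capacity arguments are hypothetical illustrations; the text does not specify these interfaces.</p><p><![CDATA[
```python
def should_prune_tile(old_database, memory_footprint_bytes,
                      on_chip_capacity_bytes, top_k=10):
    """Prune "TILE" pragmas only when all three conservative conditions hold.

    old_database: list of (design, qor) pairs from the old toolchain, where
    `design` maps pragma names to values (a hypothetical schema).
    """
    has_tile = lambda design: any("TILE" in name for name in design)
    # (1) fewer than 10% of the old designs contain a "TILE" pragma
    tile_fraction = sum(has_tile(d) for d, _ in old_database) / len(old_database)
    # (2) none of the top-10 designs (by QoR) contains a "TILE" pragma
    top = sorted(old_database, key=lambda t: t[1], reverse=True)[:top_k]
    top_has_tile = any(has_tile(d) for d, _ in top)
    # (3) the workload's memory footprint fits on-chip
    fits_on_chip = memory_footprint_bytes <= on_chip_capacity_bytes
    return tile_fraction < 0.10 and not top_has_tile and fits_on_chip
```
]]></p></div>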
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EVALUATION</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experiment setup</head><p>We evaluate our approach on two task transfer scenarios: from AMD Xilinx Vitis 2020.2 (V20) to Vitis 2021.1 (V21), and from Vitis 2021.1 (V21) to Vitis 2023.2 (V23). We also utilize the AMD Xilinx Merlin Compiler <ref type="bibr">[10]</ref>. We assess our methods on 13 programs from the HLSyn <ref type="bibr">[5]</ref> dataset when transferring from toolchain V20 to V21, excluding programs with very small design spaces. The workloads cover dense linear algebra, data analytics, and stencil operations.</p><p>To evaluate the robustness of the method to different label shifts, we further test it on the transfer from V21 to V23 on 5 programs with very large design spaces. We list the domain and the size of the design space for each program in Table <ref type="table">2</ref>; "MM" stands for matrix-matrix multiplication, and "MV" stands for matrix-vector multiplication. Since we focus on assessing task transfer to new toolchains, we fix the evaluation platform to the Xilinx U200 with a frequency of 250 MHz. Since the proposed model architecture introduces extra training overhead, we simply fine-tune the HARP model during the CEM exploration. After collecting a database with the explorer, we train the new model architecture with the invariant embedding on that database and conduct one last round of DSE with a BFS search. The model is trained for 1500 epochs with a batch size of 64; the learning rate is set to 1e-3 with a linear warmup and is cosine-annealed to 1e-5. When training the model with the invariant embedding, we additionally sample random mini-batches of 64 from the V18 and the V20 databases and forward all three batches in a single iteration to calculate the loss. We run each experiment three times with different random seeds under the same dataset split and report the mean and standard deviation of each method.</p></div>
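<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the stated learning-rate schedule (linear warmup to 1e-3, cosine annealing to 1e-5 over 1500 epochs); the warmup length is an assumed parameter, since the text does not specify it.</p><p><![CDATA[
```python
import math

def lr_at(epoch, total_epochs=1500, warmup_epochs=50,
          peak_lr=1e-3, final_lr=1e-5):
    """Linear warmup to peak_lr, then cosine annealing down to final_lr.

    warmup_epochs is a hypothetical choice; only the peak/final rates and
    the 1500-epoch budget are stated in the text.
    """
    if epoch < warmup_epochs:
        # linear ramp from 0 up to peak_lr
        return peak_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```
]]></p></div>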
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Model accuracy</head><p>We validate the effectiveness of learning an invariant embedding for HLS designs via model accuracy. We compare the proposed model (Inv-Hybrid) with baselines: the HARP model trained from scratch on the new-version data (HARP scratch), and the HARP model initialized with parameters pre-trained on the old data and fine-tuned on the new data (HARP finetune). For the old-version data, we utilize the HLSyn database, which consists of data collected by AutoDSE on V18 and V20 of the toolchain. For the new-version data, we utilize the data collected by AutoDSE on the V21 toolchain for the 13 selected programs, following the data collection pipeline described in HLSyn. In total, 3,975 regression samples are gathered for the 13 programs on the V21 toolchain, while the V18 and V20 databases contain 9,439 and 5,194 samples, respectively. We split the V21 data into training, validation, and test sets with proportions of 40%, 10%, and 50%, to mimic the scenario where only a small amount of data is gathered on the large design space. We select the model with the minimum validation error and record the test accuracy. Table <ref type="table">3</ref> summarizes the accuracy of each model, using the regression RMSE as the metric. We observe that the HARP model performs much better when initialized with pre-trained parameters, which is consistent with the observation made previously in <ref type="bibr">[27]</ref>. With the Inv-Hybrid model, the RMSE on the test set further improves by 11%, and training is more stable, with a smaller standard deviation. This validates our assumption that we should model both the invariance and the difference between toolchains to achieve good transfer learning accuracy.</p></div>
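<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation protocol, a 40/10/50 train/validation/test split and RMSE on the regression target, can be sketched as follows; this is a generic illustration, not our exact pipeline.</p><p><![CDATA[
```python
import numpy as np

def split_40_10_50(n, seed=0):
    """Shuffle indices and split them 40% train / 10% validation / 50% test."""
    idx = np.random.default_rng(seed).permutation(n)
    a, b = int(0.4 * n), int(0.5 * n)
    return idx[:a], idx[a:b], idx[b:]

def rmse(y_true, y_pred):
    """Root-mean-square error between predicted and true regression targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```
]]></p></div>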
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">DSE Performance</head><p>We next verify the effectiveness of the Active-CEM algorithm, comparing against AutoDSE and HARP. For the AutoDSE baseline, we follow previous work and run all three explorers in parallel. Following <ref type="bibr">[27]</ref>, we set a timeout of 24 hours and do not terminate jobs that are already running, resulting in a maximum runtime of 25 hours. For the HARP baseline, we adopt the same approach as HARP for running its explorer. Specifically, after collecting data with AutoDSE, we initialize the model with the parameters pre-trained on the old version and then fine-tune the HARP model on the new-version data. We then run a BFS search for 1 hour with HARP and validate the designs with the top-10 predicted performance using the HLS toolchain. When transferring to V21, we initialize the model with parameters pre-trained on the V20 database; when transferring to V23, with parameters pre-trained on the V21 database. For a fair comparison, when pre-training the model on databases from existing versions, we only utilize the data collected by AutoDSE. For Active-CEM, similar to HARP, we initialize the model with parameters pre-trained on the old-version data. For each program, we run Active-CEM for 8 iterations until convergence. Table <ref type="table">4</ref> and Table <ref type="table">5</ref> illustrate the design performance when transferring from V20 to V21 and from V21 to V23. When transferring to V21, exploration with Active-CEM performs 2.34× better than AutoDSE and 1.18× better than the best of AutoDSE and HARP, and achieves this with 5.75× total sample efficiency and 1.74× sample efficiency in terms of valid data. 
Note that Active-CEM tends to explore more valid designs during policy updates, while AutoDSE contains no heuristics to explicitly sample valid designs; therefore, a larger portion of the designs explored by AutoDSE are invalid. As for design space pruning, among all 13 programs, only the design spaces of 3 programs (gemver, 3mm, and fdtd-2d) are pruned so that "TILE" pragmas are not searched. We leave learning a more robust design space pruning strategy to future work. When transferring to V23, Active-CEM still performs 1.3× better than AutoDSE and 1.27× better than the best of AutoDSE and HARP, while achieving 8.85× total sample efficiency and 3.47× sample efficiency in terms of valid designs, demonstrating the robustness of the proposed method. Better design performance and sample efficiency are achieved by inserting model predictions into an importance sampling method, utilizing the knowledge learned on the old toolchains, and addressing the label and distribution shifts.</p><p>We further train the HARP model and the Invariant-hybrid model on the database collected by the CEM method and conduct one last round of DSE with a BFS search. Our results in Table <ref type="table">4</ref> reveal that we can locate a better design in this last round of exploration, achieving another moderate 1.01× speedup on average.</p><p>We further perform an ablation study by running AutoDSE for the same amount of time as Active-CEM. Specifically, we pick the checkpoint of AutoDSE at 10 hours and report the best design found by all three explorers. As seen in Table <ref type="table">5</ref>, Active-CEM outperforms AutoDSE on average by 1.39× when the runtime budget is the same. Table <ref type="table">6</ref> shows the result of transferring the model while freezing the first several layers of the GNN encoder. 
Similar to the previous experiment, we split the data into 40%, 10%, and 50%, conduct each experiment 3 times with different random seeds, and report the mean and standard deviation. We observe that the model achieves the best accuracy when we do not freeze any parameters, which differs from the conclusion drawn from various transfer learning works on image data <ref type="bibr">[20]</ref>. We hypothesize that this is because of the large label shift caused by switching to a different toolchain, and the lack of proper pre-training on a large database that contains multiple versions of data.</p><p>5.4.2 Separately modeling the invariance and the difference. We add another baseline to confirm the effectiveness of the proposed model architecture, which models both the invariance and the difference between toolchains. Specifically, we implement the model architecture shown in Fig. <ref type="figure">7</ref>, which simply uses different decoders to decode a shared embedding into different outputs. Moreover, we fix the parameter counts of both architectures to be the same so that we can strictly study the effect of decoupling the shared and the private embeddings. In Table <ref type="table">7</ref>, the test RMSE of "Inv-Hybrid" is 11% lower than that of "Inv-Share", demonstrating the importance of modeling both the invariance and the difference between toolchains.</p><p>5.4.3 Design space pruning. We evaluate the effectiveness of design space pruning by comparing design performance with pruning enabled and disabled. Only 3 out of 13 programs satisfy the pruning conditions, and we report the performance difference for them. Table <ref type="table">8</ref> illustrates how design performance improves when design space pruning is enabled.</p></div>
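<div xmlns="http://www.tei-c.org/ns/1.0"><p>The contrast between the two ablated architectures can be illustrated with toy linear layers in NumPy: "Inv-Share" decodes a single shared embedding with a per-toolchain decoder, while "Inv-Hybrid" decouples a shared (invariant) encoder from per-toolchain private encoders. All layer sizes and names here are illustrative assumptions, not our actual implementation.</p><p><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
def mat(d_in, d_out):
    return rng.normal(0, 0.1, size=(d_in, d_out))
def vec(d):
    return rng.normal(0, 0.1, size=d)

class InvShare:
    """One shared embedding; a separate decoder per toolchain version."""
    def __init__(self, d_in=16, d_hid=8, versions=("V20", "V21")):
        self.enc = mat(d_in, d_hid)
        self.dec = {v: vec(d_hid) for v in versions}
    def predict(self, x, version):
        return float(np.tanh(x @ self.enc) @ self.dec[version])

class InvHybrid:
    """Decoupled shared (invariant) and per-toolchain private embeddings."""
    def __init__(self, d_in=16, d_hid=8, versions=("V20", "V21")):
        self.shared_enc = mat(d_in, d_hid)                      # invariant part
        self.private_enc = {v: mat(d_in, d_hid) for v in versions}  # per-version part
        self.dec = vec(2 * d_hid)                               # decodes [shared; private]
    def predict(self, x, version):
        z = np.concatenate([np.tanh(x @ self.shared_enc),
                            np.tanh(x @ self.private_enc[version])])
        return float(z @ self.dec)
```
]]></p><p>In the hybrid variant, only the private encoder differs between toolchain versions, so the shared encoder is free to capture what is invariant across them.</p></div>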
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>In this work, we discuss the sample efficiency challenge in the task transfer learning problem for HLS DSE. We mitigate the challenge with a novel approach to both sampling and modeling. On the sampling side, we propose Active-CEM, a model-based explorer that efficiently samples high-quality designs across the whole design space.</p><p>On the modeling side, we propose to model both the invariance and the difference between toolchains and jointly train a single model with data collected from multiple toolchains. Moving forward, we plan to evaluate our sampling method in the domain-transfer setting and consider task-transfer learning across different devices. We also plan to incorporate domain knowledge into the sampling process for better efficiency. In <ref type="bibr">[16]</ref>, we extend our task transfer results to the complete HLSyn dataset on the V21 toolchain.</p></div>
		</text>
</TEI>
