<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Improving Realistic Worst-Case Performance of NVCiM DNN Accelerators through Training with Right-Censored Gaussian Noise</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/29/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10462379</idno>
					<idno type="doi"></idno>
					<title level='j'>IEEE/ACM 2023 International Conference on Computer-Aided Design</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Z. Yan</author><author>Y. Qin</author><author>W. Wen</author><author>X. Hu</author><author>Y. Shi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Compute-in-Memory (CiM), built upon non-volatile memory (NVM) devices, is promising for accelerating deep neural networks (DNNs) owing to its in-situ data processing capability and superior energy efficiency. To battle device variations, noise injection training is commonly used, which perturbs weights with Gaussian noise during training to make the model more robust to weight variations. Despite its prevalence, however, existing successes are mostly empirical, and very little theoretical support is available. Even the most fundamental questions, such as why Gaussian rather than other types of noise should be used, remain unanswered. In this work, through formally analyzing the effect of injecting Gaussian noise in training to improve the k-th percentile performance (KPP), a realistic worst-case performance metric, for the first time we provide a theoretical justification of the effectiveness of the approach. We further show that, surprisingly, Gaussian noise is not the best option, contrary to what has been taken for granted in the literature. Instead, a right-censored Gaussian noise significantly improves the KPP of DNNs. We further propose an automated method to determine the optimal hyperparameters for injecting this right-censored Gaussian noise during the training process. Our method achieves up to a 26% improvement in KPP compared to the state-of-the-art methods employed to enhance DNN robustness under the impact of device variations.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Deep neural networks (DNNs) have demonstrated remarkable advancements, surpassing human performance in a wide range of perception tasks. The recent emergence of deep learning-based generation models, such as DALL-E <ref type="bibr">[1]</ref> and the GPT family <ref type="bibr">[2]</ref>, has further reshaped our workflows. To date, the trend of incorporating on-device intelligence across edge platforms such as mobile phones, watches, and cars has become evident <ref type="bibr">[3]</ref>&#8211;<ref type="bibr">[5]</ref>, transforming every walk of life. However, the limited computational resources and strict power constraints of these edge platforms present challenges. These circumstances necessitate more energy-efficient DNN hardware beyond general-purpose CPUs and GPUs.</p><p>Compute-in-Memory (CiM) DNN accelerators <ref type="bibr">[6]</ref>, on the other hand, are competitive alternatives to CPUs and GPUs for accelerating DNN inference on the edge. In contrast to traditional von Neumann architecture platforms, which involve frequent data movements between memory and computation components, CiM DNN accelerators reduce energy consumption by enabling in-situ computation directly at the storage location of weight data. Moreover, emerging non-volatile memory (NVM) devices, such as ferroelectric field-effect transistors (FeFETs) and resistive random-access memories (RRAMs), allow NVCiM accelerators to achieve higher memory density and improved energy efficiency compared to conventional MOSFET-based designs <ref type="bibr">[4]</ref>. However, the reliability of NVM devices can be a concern due to device-to-device (D2D) variations incurred by fabrication defects and cycle-to-cycle (C2C) variations due to thermal, radiation, and other physical impacts. 
These variations can have a notable negative impact on the inference accuracy of NVCiM DNN accelerators, as they may introduce significant differences between the weight values read out from NVM devices during inference and their intended values.</p><p>Various strategies have been proposed to mitigate the impact of device variations. These strategies fall broadly into two categories: reducing device value deviations and enhancing the robustness of DNNs in the presence of device variations. Device value deviations can be reduced through methods such as write-verify <ref type="bibr">[7]</ref>, which iteratively applies programming pulses to reduce the deviation of the device value from the desired value after each write. On the other hand, there exist various approaches that enhance DNN robustness in the presence of device variations. One direction is to identify novel DNN topologies that are more robust in the presence of device variations. This can be achieved through techniques such as neural architecture search <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref> or by leveraging Bayesian Neural Networks <ref type="bibr">[10]</ref>, which use variational training to improve DNN robustness. Another line of methods focuses on training more robust DNN weights using noise injection training <ref type="bibr">[3]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>. In this approach, randomly sampled noise is injected into DNN weights during the forward and backpropagation phases of DNN training. After the gradient is calculated through backpropagation, the noise is removed and the noise-free weight value is updated by gradient descent. Despite its wide adoption, very little theoretical support is available for noise injection training. Even the most fundamental questions, such as why Gaussian rather than other types of noise should be used, remain unanswered. 
To date, only empirical analyses have been offered, suggesting that this method is effective because it simulates the noisy inference environment. However, no theoretical explanation has been provided. In this work, in light of this gap between implementation and understanding, we propose to formally explain the effect of noise injection training. By mathematically analyzing its impact on improving the worst-case performance of DNN models under the influence of device variations, for the first time, we provide a theoretical justification of the effectiveness of the approach. We demonstrate that, surprisingly, vanilla noise injection training, which merely mirrors the inference noise by injecting Gaussian noise, is not the best approach for enhancing the worst-case performance of DNN models, contrary to what has been taken for granted in the literature. Instead, a right-censored Gaussian noise injection framework better improves DNN robustness under the influence of device variations.</p><p>In detail, worst-case analysis is crucial for safety-critical applications, so we propose to analyze the effect of noise injection training in improving the worst-case performance of DNN models <ref type="bibr">[13]</ref>. To capture realistic worst-case scenarios precisely in the presence of device variations, in this work, we propose to use the k-th percentile performance (KPP) metric, instead of the average or absolute worst-case performance. With a predetermined k value, the KPP metric identifies a performance score that the model's performance exceeds in all but k% of cases. For example, if a model has a KPP of 0.912 when k = 1, the model's performance is greater than 0.912 with a probability of 99% (that is, in all but 1% of cases). 
When a realistically small k value is given, such as k = 1, KPP can capture a realistic worst-case performance of a DNN model because it (1) guarantees a lower bound of the model's performance and (2) filters out extreme corner cases. Given the same k value, a higher KPP for a DNN model is desirable as it signifies that the model can consistently deliver high performance within a certain probability threshold.</p><p>Since improving KPP guarantees higher realistic worst-case performance of a DNN model, we revisit the state-of-the-art (SOTA) Gaussian noise injection training method to analyze its effectiveness in improving KPP. Gaussian noise injection training is widely used simply because it injects noise that statistically mirrors the noise in the inference environment. Although it is empirically valid to state that a precise simulation of the inference environment during training would yield optimal results, there is no theoretical proof for it. Thus, to examine the effectiveness of Gaussian noise injection training, we thoroughly analyze the relationship between the KPP of a DNN model and the properties of DNN weights to show what kind of models would provide higher KPP. Surprisingly, our analysis shows that Gaussian noise injection training is far from optimal in generating robust DNN models in the presence of device variations.</p><p>In particular, our key observation is that achieving a higher KPP in the presence of device variations requires satisfying the following three requirements simultaneously: (1) higher DNN accuracy under no device variation; (2) smaller second derivatives w.r.t. DNN weights; and (3) larger first derivatives (in absolute value) w.r.t. DNN weights. However, our analysis (see Section III-C) shows that the conventional Gaussian noise injection training approaches can only fulfill the first two requirements, but not the third, making them ineffective for KPP improvement. 
Specifically, the third requirement necessitates distributions with non-zero expected values, a condition that the Gaussian distribution fails to satisfy.</p><p>To this end, we develop TRICE (Training with RIght-Censored Gaussian NoisE), a method that injects adaptively optimized right-censored Gaussian (RC-Gaussian) noise in the training process to address all three aforementioned requirements simultaneously. TRICE differs from existing approaches in several aspects: (1) rather than using general Gaussian noise, TRICE uses RC-Gaussian noise, which exhibits a unique feature: every sampled value greater than a designated threshold is fixed (i.e., censored) to the threshold. This results in a negative expected value for the injected noise, thus meeting the third requirement. (2) TRICE requires additional hyperparameter tuning, which we handle with a dedicated adaptive training method that identifies the optimal noise hyperparameters within a single run of DNN training; this differs from conventional Gaussian noise-based approaches, which use the same noise hyperparameters in training and inference. The main contributions of this work are multi-fold:</p><p>&#8226; We analytically derive the effect of noise injection training in improving KPP and show that the common practice of injecting Gaussian noise is not the optimal solution.</p><p>In a crossbar array, the input is fed horizontally and multiplied by weights stored in the NVM devices at each cross point. The multiplication results are summed up vertically and the sum serves as an output. The outputs are converted to the digital domain and further processed using digital units such as non-linear activation and pooling.</p><p>The computation engine driving NVCiM DNN accelerators is the crossbar array structure, which can perform matrix-vector multiplication in a single clock cycle. 
Crossbar arrays store matrix values (e.g., weights in DNNs) at the intersection of vertical and horizontal lines using NVM devices (e.g., RRAMs and FeFETs), while vector values (e.g., inputs for DNNs) are fed through horizontal data lines (word lines) in the form of voltage. The output is then transmitted through vertical lines (bit lines) in the form of current. While the crossbar array performs calculations in the analog domain according to Kirchhoff's laws, peripheral digital circuits are needed for other key DNN operations such as shift &amp; add, pooling, and non-linear activation. Additional buffers are also needed to store intermediate data. Digital-to-analog and analog-to-digital conversions are also needed between components in different domains.</p><p>Crossbar arrays based on NVM devices are subject to a number of sources of variations and noise, including spatial and temporal variations. Spatial variations arise from defects that occur during fabrication and can be both local and global in nature. In addition, NVM devices are susceptible to temporal variations that result from stochastic fluctuations in the device material. These variations in conductance can occur when the device is programmed at different times. Unlike spatial variations, temporal variations are usually independent of the device but could be subject to the programmed value <ref type="bibr">[14]</ref>. For the purpose of this study, we have considered the nonidealities to be uncorrelated among the NVM devices. However, our framework can be adapted to account for other sources of variations with appropriate modifications.</p></div>
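<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ideal analog computation of a crossbar array described above can be sketched in a few lines. This is a minimal illustration only, not a circuit model: the function name <hi rend="italic">crossbar_mvm</hi>, the array shapes, and the numeric values are assumptions made for the example, and the sketch ignores converters, peripheral circuits, and device variations.</p><ab><![CDATA[
```python
import numpy as np

def crossbar_mvm(conductances, voltages):
    """Ideal crossbar read-out (hypothetical helper, not from the paper).

    conductances[i, j] is the device where word line i crosses bit line j;
    voltages[i] is the input voltage applied to word line i.
    """
    conductances = np.asarray(conductances, dtype=float)
    voltages = np.asarray(voltages, dtype=float)
    # Ohm's law gives a per-device current G * V; Kirchhoff's current law
    # sums the currents along each bit line (column).
    return voltages @ conductances

# Illustrative values: 3 word lines (inputs), 2 bit lines (outputs).
G = np.array([[1.0, 0.5],
              [0.2, 0.3],
              [0.0, 1.0]])
v = np.array([1.0, 2.0, 3.0])
currents = crossbar_mvm(G, v)
```
]]></ab><p>Each entry of <hi rend="italic">currents</hi> is the dot product of the input vector with one column of stored values, which is why a single read of the array yields a full matrix-vector product.</p></div>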
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Evaluating DNN Robustness in the Presence of Device Variations</head><p>Most existing research uses Monte Carlo (MC) simulations to assess the robustness of NVCiM DNN accelerators in the presence of device variations. This process typically involves extracting a device variation model and a circuit model from physical measurements. The DNN to be evaluated is then mapped onto the circuit model, and the desired value for each NVM device is calculated. In each MC run, one instance of a non-ideal device is randomly sampled from the device variation model, and the actual conductance value of each NVM device is determined. DNN performance (e.g., classification accuracy) in this non-ideal accelerator is then recorded. This process is repeated numerous times until the collected DNN performance distribution converges. Existing practices <ref type="bibr">[11]</ref>, <ref type="bibr">[15]</ref> generally include around 300 MC runs. This number of MC runs is empirically sufficient according to the central limit theorem <ref type="bibr">[8]</ref>.</p><p>Only a few researchers focus on the worst-case scenarios of NVCiM DNN accelerators in the presence of device variations. A line of research <ref type="bibr">[13]</ref>, <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref> focuses on determining the worst-case performance by identifying weight perturbation patterns that cause the most significant decrease in DNN inference performance, while still adhering to the physical bounds of device value deviations. One representative work <ref type="bibr">[13]</ref> shows that DNN classification accuracy can drop to the level of random guessing when a perturbation of less than 3% is added to the weights. However, the likelihood of such a worst-case scenario occurring is lower than $10^{-100}$, which can be safely ignored in common natural environments <ref type="bibr">[13]</ref>. 
Such worst-case analyses are therefore impractical for assessing the robustness of an NVCiM DNN accelerator.</p><p>In this work, we instead advocate using the k-th percentile performance, a metric that is both practical and precise, for capturing the worst-case performance of a DNN model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Addressing Device Variations</head><p>Various approaches have been proposed to deal with the issue of device variations in NVCiM DNN accelerators. Here we briefly review the two most common types: enhancing DNN robustness and reducing device variations.</p><p>A common method used to enhance DNN robustness in the presence of device variations is variation-aware training <ref type="bibr">[3]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[18]</ref>. Also known as noise injection training, the method injects variation into DNN weights during training, which can provide a DNN model that is statistically robust in the presence of device variations. In each iteration, in addition to traditional gradient descent, an instance of variation is sampled from a variation distribution and added to the weights in the forward pass. In the backpropagation pass, the same noisy weights and noisy feature maps are used to calculate the gradient of the weights in a deterministic and noise-free manner. Once the gradients are collected, this variation is cleared and the variation-free weights are updated according to the previously collected gradients. The details of noise injection training are shown in Alg. 1. Another way to train more robust DNN weights is CorrectNet <ref type="bibr">[19]</ref>. This approach uses a modified Lipschitz constant regularization during DNN training so that the regularized weights are less prone to the impact of device variations. Other approaches include designing more robust DNN architectures <ref type="bibr">[3]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[10]</ref> and pruning <ref type="bibr">[20]</ref>.</p><p>To reduce the device value deviations induced by device variations, write-verify <ref type="bibr">[7]</ref>, <ref type="bibr">[21]</ref> is commonly used during the programming process. 
An NVM device is first programmed to an initial state using a pre-defined pulse pattern. Then the value of the device is read out to verify whether its conductance falls within a certain margin of the desired value (i.e., whether its value is precise). If not, an additional update pulse is applied, aiming to bring the device conductance closer to the desired one. This process is repeated until the difference between the value programmed into the device and the desired value is acceptable. This approach is highly effective in reducing the device value deviations, but the process typically requires several iterations, which is time-consuming. There are also various circuit design efforts <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref> that try to mitigate the device variations.</p></div>
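<div xmlns="http://www.tei-c.org/ns/1.0"><p>The noise injection training loop reviewed in this subsection can be sketched as follows. This is a minimal sketch under stated assumptions rather than the Alg. 1 listing itself: it uses a toy quadratic loss with an analytic gradient in place of a DNN, and the names <hi rend="italic">noise_injection_train</hi> and <hi rend="italic">grad_fn</hi>, the learning rate, and the step count are hypothetical choices for illustration.</p><ab><![CDATA[
```python
import numpy as np

def noise_injection_train(grad_fn, w_init, sigma, lr=0.05, steps=500, seed=0):
    """Toy noise injection training (hypothetical helper).

    grad_fn(w) returns the gradient of the loss evaluated at weights w.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w_init, dtype=float)
    for _ in range(steps):
        # Sample one instance of variation and add it to the weights.
        noise = rng.normal(0.0, sigma, size=w.shape)
        # Forward/backward pass uses the noisy weights.
        grad = grad_fn(w + noise)
        # The noise is discarded; the clean weights receive the update.
        w -= lr * grad
    return w

# Toy loss f(w) = 0.5 * ||w - target||^2, so its gradient is (w - target).
target = np.array([1.0, -2.0])
w_trained = noise_injection_train(lambda w: w - target, np.zeros(2), sigma=0.1)
```
]]></ab><p>The key design point mirrored here is that the gradient is evaluated at the perturbed weights while the update is applied to the unperturbed ones.</p></div>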
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROPOSED METHOD</head><p>In this section, we introduce a novel variant of the noise injection training method designed to improve the k-th percentile performance (KPP) of a DNN model. Conventional noise injection training injects Gaussian noise in the training process simply because it mirrors the impact of device variations occurring in inference. There is no theoretical proof that such practice would offer the most robust DNN models. In this section, we show through mathematical analysis that Gaussian noise injection training is far from optimal in improving KPP. Specifically, this section begins with a formal definition of KPP and an analysis of its relationship with DNN weights. Next, we analyze the noise injection training framework and identify the requirements for the noise injected during training. We show that Gaussian noise does not satisfy all requirements. Thus, we propose several candidate noise types and select right-censored Gaussian noise through experimentation. Moreover, we develop an adaptive training method that automatically determines the optimal hyperparameters for the right-censored Gaussian noise injection. The resulting framework is called Training with RIght-Censored Gaussian NoisE (TRICE).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. K-th Percentile Performance</head><p>The KPP of a DNN model is derived from the k-th percentile of a distribution. The k-th percentile of a distribution is the value $z_{pk}$ that separates the lowest k% of the observations from the highest (100-k)% of the observations. Formally, given a random variable Z following a distribution Dist, there exists a value $z_{pk}$ such that a value $z_i$ sampled from Z satisfies $z_i \le z_{pk}$ with probability k%. This is equivalent to:</p><p>$$\mathrm{cdf}_{Dist}(z_{pk}) = \frac{k}{100} \quad (1)$$</p><p>where $\mathrm{cdf}_{Dist}$ is the cumulative distribution function of Dist.</p><p>In the context of a DNN model's performance in the presence of device variations, the KPP represents the minimum performance level that the model achieves with a probability of at least (100-k)%. For example, as shown in Fig. <ref type="figure">2</ref>, the 5th percentile performance in terms of top-1 accuracy (i.e., the k-th percentile accuracy with k = 5) of this DNN model in the presence of device variations is 0.4623, which means that in 5% of cases the DNN accuracy will be lower than 0.4623, and in 95% of cases the DNN accuracy is greater than 0.4623.</p><p>Algorithm 2 (closing steps): add value $perf_i$ to list $perf_l$; end for; $perf_l$ = sort($perf_l$); $perf_q$ = $perf_l$[q &#215; len($perf_l$)]; return $perf_q$.</p><p>The KPP of a DNN model can be easily evaluated through Monte Carlo simulation. Specifically, with $N_{sample}$ Monte Carlo runs, $N_{sample}$ performance values are collected. These performance values are then sorted in ascending order, and the $(N_{sample} \times k\%)$-th element of this sorted array is the estimate of KPP. The overall process is shown in Algorithm 2.</p></div>
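<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Monte Carlo estimation of KPP described above can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: <hi rend="italic">estimate_kpp</hi> and <hi rend="italic">toy_performance</hi> are hypothetical names, the performance function is a stand-in for running inference on a dataset, the default of 300 runs follows the number of MC runs mentioned earlier, and the index is clamped to the array bounds as an implementation choice.</p><ab><![CDATA[
```python
import numpy as np

def estimate_kpp(perf_fn, weights, sigma_d, k, n_samples=300, seed=0):
    """Estimate the k-th percentile performance by Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    perfs = []
    for _ in range(n_samples):
        # One Monte Carlo run: perturb every weight with Gaussian noise.
        perturbed = weights + rng.normal(0.0, sigma_d, size=weights.shape)
        perfs.append(perf_fn(perturbed))
    perfs = np.sort(np.asarray(perfs))  # ascending order
    # Pick the (n_samples * k%)-th element, clamped to a valid index.
    idx = min(int(n_samples * k / 100), n_samples - 1)
    return perfs[idx]

def toy_performance(w):
    """Hypothetical stand-in for accuracy: 1.0 at w == 1, lower elsewhere."""
    return 1.0 / (1.0 + np.sum((w - 1.0) ** 2))

w = np.ones(4)
kpp_1 = estimate_kpp(toy_performance, w, sigma_d=0.1, k=1)
kpp_50 = estimate_kpp(toy_performance, w, sigma_d=0.1, k=50)
```
]]></ab><p>With a fixed seed both calls see the same samples, so the 1st percentile value can never exceed the 50th percentile value; the gap between them indicates how far the realistic worst case sits below the median.</p></div>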
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Relationship Between Weights and k-th Percentile Performance</head><p>After establishing the definition of the KPP, we proceed to analyze how it relates to the trained weights of the DNN model. We use the loss function as the metric for assessing performance throughout this analysis.</p><p>Given a neural network model M and its trained weight vector w, the output of this model for the input x can be described as out = M(w, x). Given further the ground truth label GT and the loss function f, its loss can be described as loss = f(M(w, x), GT). Because the values of x and GT are fixed when performing inference on a given dataset, the loss expression can be simplified as a function of w, i.e., loss = f(w).</p><p>Here we study the impact of perturbing one element $w_0$ in the weight vector w. Specifically, because this weight value is subject to device variations, it is perturbed to $w_0 + \Delta w$, where $\Delta w$ is the device variation-induced perturbation. We can then apply a Taylor expansion to the loss function w.r.t. the perturbed weight:</p><p>$$f(w_0 + \Delta w) \approx f(w_0) + f'(w_0)\,\Delta w + \frac{f''(w_0)}{2}\,(\Delta w)^2 \quad (2)$$</p><p>We can observe in Eq. 2 that the loss function can be approximated by a quadratic function of $\Delta w$. Given that the weight perturbation $\Delta w$ follows the distribution of device variations ($\Delta w \sim Dist$), we can calculate the k-th percentile of the loss as follows:</p><p>First, let q = k/100 be the probability corresponding to the k-th percentile, and let the unknown k-th percentile of the loss be $loss_q$. According to the property of quadratic functions, along with the fact that $f''(w_0) \ge 0$ <ref type="bibr">[24]</ref> and that $loss_q$ is greater than the minimum value of Eq. 2, we know that there exist two real numbers $\Delta w_1$ and $\Delta w_2$, with $\Delta w_1 &lt; \Delta w_2$, such that if $\Delta w_1 &lt; \Delta w &lt; \Delta w_2$, then $f(w_0 + \Delta w) &lt; loss_q$.</p><p>Because a lower loss is better, the definition of KPP gives q as the probability that $f(w_0 + \Delta w) \ge loss_q$, and therefore $1 - q$ as the probability that $\Delta w_1 \le \Delta w \le \Delta w_2$. Recalling that the weight perturbation $\Delta w$ follows the device variation distribution ($\Delta w \sim Dist$), we have:</p><p>$$1 - q = \mathrm{cdf}_{Dist}(\Delta w_2) - \mathrm{cdf}_{Dist}(\Delta w_1) \quad (3)$$</p><p>where $\mathrm{cdf}_{Dist}$ is the cumulative distribution function (CDF) of Dist.</p><p>By the definition of $\Delta w_1$, $\Delta w_2$ and $loss_q$, we also know that $\Delta w_1$ and $\Delta w_2$ are the two roots of Eq. 2 at the level $loss_q$:</p><p>$$f(w_0) + f'(w_0)\,\Delta w_i + \frac{f''(w_0)}{2}\,(\Delta w_i)^2 = loss_q, \quad i \in \{1, 2\} \quad (4)$$</p><p>Combining Eq. 3 and Eq. 4, we obtain an analytical relationship between q and $loss_q$ and can thus calculate $loss_q$ given the device value deviation distribution Dist and the trained model weight $w_0$.</p><p>In this work, we target a device model in which the device value deviation follows the Gaussian distribution $N(0, \sigma_d)$, whose CDF is:</p><p>$$\mathrm{cdf}_{N(0,\sigma_d)}(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sigma_d\sqrt{2}}\right)\right] \quad (5)$$</p><p>Combining Eq. 3, Eq. 4, and the first-order approximation of Eq. 5, we obtain:</p><p>$$loss_q \approx -\frac{f'(w_0)^2}{2 f''(w_0)} + f(w_0) + \frac{\pi (1-q)^2 \sigma_d^2}{4}\, f''(w_0) \quad (6)$$</p><p>Considering $f'(w_0)$ as a variable, it is clear that $loss_q$ is a quadratic function of $f'(w_0)$. Extensive research <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref> has shown that when using cross-entropy loss with softmax as the loss function, the second derivative of the loss w.r.t. the weights is positive, i.e., $f''(w_0) &gt; 0$. Thus, Eq. 6 reaches its maximum value when $f'(w_0) = 0$ and decreases as $f'(w_0)$ moves away from 0. Therefore, by observing the first term of Eq. 6, to obtain a low enough $loss_q$, and hence a high enough KPP, a smaller $f''(w_0)$ and an $f'(w_0)$ with a larger absolute value are required. Similarly, by observing the second and third terms of Eq. 6, a smaller $f(w_0)$ and a smaller $f''(w_0)$ are required. 
Thus, to improve the KPP of a DNN model, the DNN training process needs to simultaneously minimize $f(w_0)$ and $f''(w_0)$, and maximize $|f'(w_0)|$.</p></div>
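<div xmlns="http://www.tei-c.org/ns/1.0"><p>The intermediate steps that lead from the percentile condition and the quadratic loss approximation to a closed form for $loss_q$ can be sketched as follows. This is a reconstruction under stated assumptions, not a derivation quoted from the source: it assumes that the first-order approximation of the Gaussian CDF means linearizing the error function as $\mathrm{erf}(x) \approx 2x/\sqrt{\pi}$, and that $\sigma_d$ denotes the standard deviation of the device variation.</p><ab><![CDATA[
```latex
% Assumption: erf(x) \approx 2x/\sqrt{\pi} (first-order expansion around 0).
\begin{align*}
\mathrm{cdf}_{N(0,\sigma_d)}(x)
  &\approx \frac{1}{2} + \frac{x}{\sigma_d\sqrt{2\pi}}
  && \text{(linearized Gaussian CDF)} \\
1 - q
  &= \mathrm{cdf}(\Delta w_2) - \mathrm{cdf}(\Delta w_1)
   \approx \frac{\Delta w_2 - \Delta w_1}{\sigma_d\sqrt{2\pi}}
  && \text{(percentile condition)} \\
\Delta w_2 - \Delta w_1
  &= \frac{2\sqrt{f'(w_0)^2 + 2 f''(w_0)\,\bigl(loss_q - f(w_0)\bigr)}}{f''(w_0)}
  && \text{(root gap of the quadratic)} \\
f'(w_0)^2 + 2 f''(w_0)\,\bigl(loss_q - f(w_0)\bigr)
  &\approx \frac{\pi (1-q)^2 \sigma_d^2}{2}\, f''(w_0)^2
  && \text{(substitute and square)} \\
loss_q
  &\approx -\frac{f'(w_0)^2}{2 f''(w_0)} + f(w_0)
     + \frac{\pi (1-q)^2 \sigma_d^2}{4}\, f''(w_0)
\end{align*}
```
]]></ab><p>Read as a function of $f'(w_0)$, the last line is a downward-opening parabola with its maximum at $f'(w_0) = 0$, which matches the three requirements stated at the end of this subsection.</p></div>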
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. The Effect of Noise Injection Training</head><p>According to the conclusion in Section III-B, the DNN training process needs to minimize $f(w_0)$ and $f''(w_0)$ while maximizing $|f'(w_0)|$ at the same time. We now analyze the noise injection training process to see how to satisfy these requirements.</p><p>Using notation similar to Section III-B and recalling Alg. 1, one iteration of the noise injection training process can be written as:</p><p>$$w_{t+1} = w_t - \alpha\, f'(w_t + \Delta w) \quad (7)$$</p><p>where $w_t$ is the current weight value, $w_{t+1}$ is the updated weight value after this iteration of training, and $\alpha$ is the learning rate. By applying a Taylor expansion to $f'(w_t + \Delta w)$, we obtain:</p><p>$$f'(w_t + \Delta w) \approx f'(w_t) + \Delta w\, f''(w_t) + \frac{(\Delta w)^2}{2}\, f'''(w_t) \quad (8)$$</p><p>Considering a noise injection training process where, in each iteration of training, the device variation-induced weight value perturbation $\Delta w$ is sampled many times instead of only once, the statistical behavior of such noise injection training is:</p><p>$$w_{t+1} = w_t - \alpha\left( f'(w_t) + E[\Delta w]\, f''(w_t) + \frac{E[(\Delta w)^2]}{2}\, f'''(w_t) \right) \quad (9)$$</p><p>where $E[\Delta w]$ is the expected value (i.e., mean) of $\Delta w$. Recalling the requirements derived through Eq. 6, namely that the DNN training process needs to (1) minimize $f(w_0)$, (2) minimize $f''(w_0)$, and (3) maximize $|f'(w_0)|$ at the same time, we can analyze the three terms after $\alpha$ in Eq. 9 to design the noise distribution to be injected.</p><p>Of the three terms after $\alpha$, the first term $f'(w_t)$ is the first-order gradient used in vanilla gradient descent, which minimizes the value of $f(w_{t+1})$. This satisfies the first requirement obtained in Section III-B. A side effect is that, when the training process is close to converging, this term pushes the first-order gradient toward zero.</p><p>The third term, $\frac{E[(\Delta w)^2]}{2} f'''(w_t)$, affects the second derivatives. Because $E[(\Delta w)^2]$ is always positive, this term minimizes the value of $f''(w_{t+1})$. 
This satisfies the second requirement obtained in Section III-B.</p><p>The second term, $E[\Delta w]\, f''(w_t)$, affects the first derivatives. As $E[\Delta w]$ can be positive, zero, or negative, this term would respectively minimize, not change, or maximize the first-order gradient. Combined with the first term, which pushes the first-order gradient toward zero, injecting noise with a negative mean would result in a maximized positive first-order gradient, and vice versa. Because Eq. 6 requires a first-order gradient of larger absolute value, a noise distribution with a non-zero mean is required. The widely used Gaussian distribution, whose mean is zero, does not meet this requirement. Therefore, a new type of noise needs to be used for noise injection training.</p></div>
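<div xmlns="http://www.tei-c.org/ns/1.0"><p>The statement that the averaged noisy gradient is governed by the first two moments of the injected noise can be checked numerically on a toy loss. The sketch below is only an illustration under assumptions: the loss $f(w) = w^4/4$, the weight value 1.5, and the noise parameters are arbitrary choices, and the noise is a Gaussian censored from above so that its mean is non-zero.</p><ab><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
w_t = 1.5

# Noise with a non-zero (negative) mean: Gaussian samples capped at 0.05.
d = np.minimum(rng.normal(0.0, 0.1, size=200_000), 0.05)

# Derivatives of the toy loss f(w) = w**4 / 4 (an arbitrary example).
def f1(w):  # first derivative
    return w ** 3

def f2(w):  # second derivative
    return 3 * w ** 2

def f3(w):  # third derivative
    return 6 * w

# Monte Carlo average of the gradient evaluated at the noisy weight.
mc_grad = f1(w_t + d).mean()

# Moment-based expression: f' + E[dw] * f'' + E[dw^2] / 2 * f'''.
taylor_grad = f1(w_t) + d.mean() * f2(w_t) + 0.5 * (d ** 2).mean() * f3(w_t)
```
]]></ab><p>For this particular loss the two quantities differ only by the third moment of the noise, so they agree closely when the noise is small; the negative mean of the noise is what shifts the effective gradient away from $f'(w_t)$.</p></div>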
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Candidate Noise Distributions</head><p>According to Section III-C, to improve model robustness, the distribution injected in the training process needs to satisfy two requirements: $E[(\Delta w)^2] &gt; 0$ and $E[\Delta w] \neq 0$. We also need this distribution to yield a model with high enough accuracy in the noise-free case, according to Section III-B. We consider four candidate noise distributions for our study, all of which are variations of the Gaussian distribution: (a) Right-Censored Gaussian (RC-Gaussian), (b) Left-Censored Gaussian (LC-Gaussian), (c) Right-Truncated Gaussian (RT-Gaussian), and (d) Left-Truncated Gaussian (LT-Gaussian). In a Right-Censored Gaussian distribution, all values follow a Gaussian distribution except that those greater than a certain threshold are set (censored) to the threshold value. The LC-Gaussian distribution is defined similarly, except that values smaller than the threshold are censored. The property of RC-Gaussian is shown in Eq. 10. Unlike the RC- and LC-Gaussian distributions, in the Right-Truncated Gaussian distribution any value greater than a threshold is cut off, which means there is zero probability for the perturbation value to be greater than the threshold. The LT-Gaussian distribution is defined similarly. The distribution histograms of the four candidates are shown in Fig. <ref type="figure">3</ref>. Based on extensive experiments, we choose to inject Right-Censored Gaussian noise during noise injection training because it results in the best KPP. The results of this study are shown in the experiment section.</p></div>
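<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between censoring and truncation in the four candidates can be made concrete with a short sampling sketch. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the threshold <hi rend="italic">th</hi> is taken to be an absolute value rather than a multiple of the standard deviation, and truncation is implemented by simple rejection sampling, which becomes slow when the threshold lies far in the rejected tail.</p><ab><![CDATA[
```python
import numpy as np

def rc_gaussian(rng, sigma, th, size):
    """Right-censored: samples above th are set to th."""
    return np.minimum(rng.normal(0.0, sigma, size), th)

def lc_gaussian(rng, sigma, th, size):
    """Left-censored: samples below th are set to th."""
    return np.maximum(rng.normal(0.0, sigma, size), th)

def rt_gaussian(rng, sigma, th, size):
    """Right-truncated: samples above th are discarded and redrawn."""
    out = np.empty(0)
    while out.size < size:
        draw = rng.normal(0.0, sigma, size)
        out = np.concatenate([out, draw[draw <= th]])
    return out[:size]

def lt_gaussian(rng, sigma, th, size):
    """Left-truncated: mirror image of right truncation at -th."""
    return -rt_gaussian(rng, sigma, -th, size)

rng = np.random.default_rng(0)
n = 100_000
rc = rc_gaussian(rng, 0.1, 0.05, n)
lc = lc_gaussian(rng, 0.1, -0.05, n)
rt = rt_gaussian(rng, 0.1, 0.05, n)
lt = lt_gaussian(rng, 0.1, -0.05, n)
```
]]></ab><p>Censoring leaves a point mass exactly at the threshold, whereas truncation removes that probability mass altogether; in both right-sided variants the mean becomes negative, which is the non-zero-mean property required by the analysis in Section III-C.</p></div>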
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Automated Hyperparameter Selection through Adaptive Training</head><p>Right-Censored Gaussian noise injection training requires extensive hyperparameter tuning. Unlike traditional Gaussian noise injection training, which sets the training noise hyperparameters equal to those of the device variation-induced weight value deviation in order to replicate the inference environment, RC-Gaussian noise injection uses different types of noise during training and inference. Thus, the two hyperparameters, $\sigma_t$ and th, need to be calibrated for each DNN model and $\sigma_d$ value. The process of determining the optimal hyperparameters can be time-consuming and requires significant human effort. AutoML <ref type="bibr">[9]</ref>-based methods are possible solutions, but they typically require multiple trials to determine the optimal hyperparameters. Therefore, we propose an adaptive training method to find the optimal noise hyperparameters during the training process. This method requires no hyperparameter tuning and takes only a single training run to train the optimal model. To develop this method, we first conduct a grid search of hyperparameters. As shown in Fig. <ref type="figure">4</ref>, for both hyperparameters ($\sigma_t$ and th), as the value of the hyperparameter increases, the DNN performance initially increases and then decreases after reaching an optimal point. This property allows us to use a binary search-like method to find the optimal hyperparameter values.</p><p>Algorithm 3 (excerpt): // Train only one model when start == end. &#8230; NoiseTrain(M, w1, RC-Gauss(th, start), 1, D, &#945;); &#8230; NoiseTrain(M, w1, RC-Gauss(th, left), 1, D, &#945;); NoiseTrain(M, w2, RC-Gauss(th, mid), 1, D, &#945;); &#8230; // Only evaluate performance and update hyperparameters after warmup. perf1 = QuantileEval(M, w1, $\sigma_d$, q, D, Ntrain); &#8230; if max(perf1, perf2, perf3) == perf2 then start, end, w1, w3 = left, right, w2, w2; &#8230; end if; end for.</p><p>Note that there are hyperparameter tuning algorithms that are more efficient than our method, in that the optimal solution requires training fewer models with different hyperparameters. However, our approach is better suited for noise injection training for the following reasons. (1) It involves more estimations of model performance using different hyperparameters, thereby reducing the impact of imperfect KPP estimations obtained from a small number of Monte Carlo runs. (2) It continuously trains a model using a hyperparameter of mid = (start + end)/2, which is closer to the final optimal hyperparameter. This makes it easier for the training process to converge.</p><p>In our practice, we use adaptive search to automatically find the perturbation scale $\sigma_t$ and manually determine th.</p><p>The whole training framework with automated hyperparameter tuning is named Training with RIght-Censored Gaussian NoisE (TRICE) and is shown in Algorithm 3.</p></div>
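<div xmlns="http://www.tei-c.org/ns/1.0"><p>The binary search-like idea behind the adaptive method, which relies on performance first rising and then falling as the hyperparameter grows, can be sketched as an interval-shrinking search. This is a simplified stand-in and not the TRICE algorithm: it omits model training, weight sharing between candidates, and the warmup phase, and <hi rend="italic">adaptive_search</hi> and <hi rend="italic">score_fn</hi> are hypothetical names, with <hi rend="italic">score_fn</hi> playing the role of a KPP estimate obtained for a given perturbation scale.</p><ab><![CDATA[
```python
def adaptive_search(score_fn, start, end, rounds=20):
    """Shrink [start, end] around the maximum of a unimodal score function."""
    for _ in range(rounds):
        mid = (start + end) / 2
        left = (start + mid) / 2
        right = (mid + end) / 2
        # Evaluate three candidate hyperparameter values.
        scores = {left: score_fn(left), mid: score_fn(mid), right: score_fn(right)}
        best = max(scores, key=scores.get)
        if best == mid:
            # Optimum lies between the two outer candidates.
            start, end = left, right
        elif best == left:
            # Optimum lies in the lower half.
            end = mid
        else:
            # Optimum lies in the upper half.
            start = mid
    return (start + end) / 2

# Hypothetical unimodal score with its peak at 0.3.
best_sigma = adaptive_search(lambda s: -(s - 0.3) ** 2, 0.0, 1.0)
```
]]></ab><p>Each round halves the search interval while always keeping the midpoint as one of the evaluated candidates, which loosely mirrors the role of mid = (start + end)/2 in the discussion above.</p></div>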
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTAL EVALUATION</head><p>In this section, we comprehensively evaluate our proposed TRICE method in terms of KPP improvement for CiM DNN accelerators suffering from device variations. We first discuss how to link the device value variations to additive noise on weights based on the noise model. We then compare the effectiveness of TRICE against SOTA baselines using different datasets, models, and different types of NVM devices that can be used to build NVCiM DNN accelerators. Ablation studies that show the advantages of RC-Gaussian noise over different noise candidates are also conducted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Modeling of Device Variation-induced Weight Perturbation</head><p>Without loss of generality, we mainly focus on device variations originating from the programming process, in which the conductance value programmed to NVM devices can deviate from the desired value. Next, we show how to model the impact of device variations on DNN weights.</p><p>For an H-bit DNN weight, the desired weight value after quantization (Wdes) can be represented as:</p><p>where hj &#8712; {0, 1} is the value of the j-th bit of the desired weight value, W is the floating-point weight value, and max |W| is the maximum absolute value of the weights. For an NVM device capable of representing B bits of data, since each weight value can be represented by H/B devices<ref type="foot">foot_0</ref>, the corresponding mapping process can be expressed as:</p><p>where gi is the desired conductance of the i-th device representing a weight. Note that negative weights are mapped in a similar manner.</p><p>Considering the impact of device variations, the actually programmed conductance value gpi is as follows:</p><p>gpi = gi + &#8710;g (13)</p><p>where &#8710;g is the deviation from the desired conductance value gi. Thus, when a weight is programmed, the actual value Wp mapped onto the devices is:</p><p>To simulate the above process, we adopt settings consistent with existing works. Specifically, we set B = 2 following <ref type="bibr">[3]</ref>, <ref type="bibr">[24]</ref>, while H is specified for each model. For the device variation model, we adopt &#8710;g &#8764; N (0, &#963;d) (if not specified), which indicates that &#8710;g follows a Gaussian distribution with a mean of zero and a standard deviation of &#963;d. We constrain &#963;d &#8804; 0.4 because, according to measurement results, this range can be realized by device-level optimizations such as write-verify.
Our model and parameter settings are in line with those of the RRAM devices reported in <ref type="bibr">[7]</ref>. </p></div>
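<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mapping from a weight to noisy device conductances can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the numbered equations: it assumes the common bit-sliced forms Wdes = max|W| / (2^H - 1) * sum_j hj 2^j, gi = sum_j h(iB+j) 2^j, and Wp = max|W| / (2^H - 1) * sum_i 2^(iB) gpi, which agree with the symbol definitions above. The function names are hypothetical, and only weight magnitudes are handled.</p>
<p><![CDATA[
```python
import random


def quantize_bits(w, w_max, H):
    """Return the H-bit code of |w| as a list of bits, least significant first."""
    code = round(abs(w) / w_max * (2 ** H - 1))
    return [(code >> j) & 1 for j in range(H)]


def map_to_devices(bits, B):
    """Group bits into B-bit conductance levels g_i, least significant device first."""
    return [sum(bits[i * B + j] * 2 ** j for j in range(B))
            for i in range(len(bits) // B)]


def program(levels, sigma_d, rng):
    """Programmed conductances gp_i = g_i + dg, with dg ~ N(0, sigma_d)."""
    return [g + rng.gauss(0.0, sigma_d) for g in levels]


def read_weight(conductances, w_max, H, B):
    """Recombine device values into a weight magnitude (assumed weighting 2**(i*B))."""
    total = sum(gp * 2 ** (i * B) for i, gp in enumerate(conductances))
    return w_max / (2 ** H - 1) * total


H, B = 4, 2                      # 4-bit weight stored on two 2-bit devices
levels = map_to_devices(quantize_bits(0.6, 1.0, H), B)
ideal = read_weight(levels, 1.0, H, B)
noisy = read_weight(program(levels, 0.1, random.Random(0)), 1.0, H, B)
print(levels, ideal, noisy)
```
]]></p>
<p>Under these assumptions, a deviation on the device holding the more significant bits is scaled by 2^B relative to the least significant device, which is why the same &#963;d can perturb weights quite unevenly.</p></div>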
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Experimental Setup</head><p>Platforms and Metrics: All experiments are conducted in PyTorch on an off-the-shelf GPU. To precisely capture the performance (accuracy) of the DNN model under device variations, each reported data point is averaged over 5 identical runs. For the evaluation metric, if not specified, we report the 1st percentile accuracy, which is KPP using accuracy as the performance metric and with k = 1. To obtain the KPP of a DNN model with sufficiently high precision, we run 10,000 Monte Carlo simulations (Nsample = 10,000), since our experiments show that 10,000 runs yield a 1st percentile accuracy whose 95% confidence interval is &#177;0.009 based on the central limit theorem.</p><p>Baselines for Comparison: We compare TRICE with three baselines that are built upon training: (1) training w/o noise injection, (2) CorrectNet <ref type="bibr">[19]</ref>, and (3) injecting Gaussian noise in training <ref type="bibr">[3]</ref>, <ref type="bibr">[12]</ref>. For a fair comparison, we do not compare TRICE with orthogonal methods such as NAS-based DNN topology design <ref type="bibr">[3]</ref>, <ref type="bibr">[8]</ref> or Bayesian Neural Networks <ref type="bibr">[10]</ref>, since TRICE can be used together with them.</p><p>Hyperparameters Setting: For all experiments, TRICE uses the same hyperparameter setup: start = 0, end = 2 &#215; &#963;d, th = 2, ep = 100, warm = 5 and Ntrain = 300, where &#963;d is the standard deviation of the device variation. We limit the range of &#963;d as suggested by Sect. IV-A and report the effectiveness of TRICE across different &#963;d values within that range. For other training hyperparameters such as learning rate, batch size, and learning rate schedulers, we follow the best practice in training a noise-free model.</p></div>
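<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 1st percentile accuracy described above can be estimated from Monte Carlo samples as sketched below. This is a minimal illustration: toy_accuracy is a hypothetical stand-in for one inference run of a model under freshly sampled device variations, and the nearest-rank percentile definition is an assumption, since the exact estimator is not stated here.</p>
<p><![CDATA[
```python
import math
import random


def kth_percentile(samples, k):
    """Nearest-rank k-th percentile: smallest value with at least k% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(k / 100 * len(ordered)))
    return ordered[rank - 1]


def toy_accuracy(rng, sigma_d):
    """Hypothetical stand-in for the accuracy of one run under sampled device variation."""
    return max(0.0, 0.98 - 0.2 * abs(rng.gauss(0.0, sigma_d)))


rng = random.Random(0)
n_sample = 10_000                                  # matches Nsample above
accs = [toy_accuracy(rng, 0.25) for _ in range(n_sample)]
kpp = kth_percentile(accs, 1)                      # k = 1
mean_acc = sum(accs) / len(accs)
print(f"1st percentile accuracy: {kpp:.4f}, average accuracy: {mean_acc:.4f}")
```
]]></p>
<p>The gap between the two printed numbers shows why the average can hide the tail: the 1st percentile depends only on the worst one percent of the sampled device instances.</p></div>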
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. The Effectiveness of TRICE on MNIST Dataset</head><p>We first compare TRICE with the aforementioned baselines using LeNet on MNIST <ref type="bibr">[26]</ref>, a 10-class handwritten digit dataset. LeNet is a plain convolutional neural network consisting of two convolution layers and three fully connected layers. All weights and layer outputs (i.e., activations) are quantized to four bits (H = 4). As an ablation study, we also compare TRICE with injecting right-censored Gaussian noise with handpicked hyperparameters (RC-Manual). Table <ref type="table">I</ref> [...] for it in the latter experiments. The ablation study also shows that TRICE outperforms injecting right-censored Gaussian noise with handpicked hyperparameters (RC-Manual), with an improvement in 1st percentile accuracy of up to 9.95%, so we do not show the results of RC-Manual in the remainder of this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. The Effectiveness of TRICE in Large Models</head><p>Having shown the effectiveness of TRICE on the small LeNet model for MNIST, we further demonstrate its effectiveness by comparing it with the baselines on larger DNN models and a larger dataset. We choose two representative models, VGG-8 <ref type="bibr">[27]</ref> and ResNet-18 <ref type="bibr">[28]</ref>. Both models use 6-bit quantization (H = 6) for weights and activations, and both perform image classification on CIFAR-10 <ref type="bibr">[29]</ref>. As shown in Fig. <ref type="figure">5a</ref> and Fig. <ref type="figure">5b</ref>, TRICE clearly outperforms all baselines for most device value deviations and performs similarly to the baselines in the rare cases where the device value deviation is too small to make an impact or too large to allow valid classification. Compared with injecting Gaussian noise, TRICE improves the 1st percentile accuracy by up to 25.09% and 26.01% for VGG-8 and ResNet-18 on CIFAR-10, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. The Effectiveness of TRICE in Different Devices</head><p>To demonstrate the scalability of TRICE, we also show the effectiveness of TRICE on NVCiM platforms using different types of NVM devices. [...] <ref type="bibr">[3]</ref>, <ref type="bibr">[24]</ref>. More specifically, it is a four-level RRAM device whose device value deviation model is &#8710;g &#8764; N (0, &#963;d), which means &#8710;g follows a Gaussian distribution with a mean of zero and a standard deviation of &#963;d, independent of the programmed device conductance.</p><p>We further analyze the effectiveness of TRICE on two real-world FeFET devices whose device value deviation magnitude varies with the programmed conductance. Their device models are derived from the measurement results in <ref type="bibr">[30]</ref>. Specifically, a generalized device value variation model for a four-level device is:</p><p>which means &#8710;g follows a Gaussian distribution with a mean of zero and a standard deviation of &#963;h, where the value of &#963;h depends on the programmed conductance. We abstract the behaviors of the two FeFET devices to be: </p><p>This means the devices suffer from more variation when they are programmed to values 1 and 2 and from less variation when they are programmed to values 0 and 3. As a comparison, we show the conductance (gp) distribution of the previously used RRAM device and FeFET2 in Fig. <ref type="figure">7a</ref> and Fig. <ref type="figure">7b</ref>, respectively.</p><p>We report the effectiveness of TRICE on NVCiM platforms using FeFET1 and FeFET2 in Fig. <ref type="figure">6a</ref> and Fig. <ref type="figure">6b</ref>, respectively. As expected, TRICE again outperforms all baselines for most &#963;d values and performs similarly to the baselines where the device value deviation is too small to make an impact.
Compared with injecting Gaussian noise, TRICE improves the 1st percentile accuracy by up to 15.61% and 12.34% for FeFET1 and FeFET2, respectively.</p></div>
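<div xmlns="http://www.tei-c.org/ns/1.0"><p>A level-dependent variation model of the kind described above can be sketched as follows. The per-level standard deviations in example_sigmas are hypothetical placeholders chosen only to reproduce the qualitative behavior (levels 1 and 2 vary more than levels 0 and 3); they are not the measured FeFET1 or FeFET2 parameters, and program_level is a hypothetical helper name.</p>
<p><![CDATA[
```python
import random
import statistics


def program_level(level, sigma_by_level, rng):
    """Programmed conductance of a four-level device with level-dependent spread."""
    return level + rng.gauss(0.0, sigma_by_level[level])


# Hypothetical spreads, NOT measured device values.
example_sigmas = {0: 0.05, 1: 0.15, 2: 0.15, 3: 0.05}
# Level-independent model, as used for the RRAM device.
uniform_sigmas = {h: 0.10 for h in range(4)}

rng = random.Random(0)
samples = {h: [program_level(h, example_sigmas, rng) for _ in range(2000)]
           for h in range(4)}
spread = {h: statistics.pstdev(samples[h]) for h in range(4)}
for h in range(4):
    print(f"level {h}: mean {statistics.mean(samples[h]):.3f}, std {spread[h]:.3f}")
```
]]></p>
<p>Swapping example_sigmas for uniform_sigmas recovers the level-independent case, so the same evaluation loop can cover both device families.</p></div>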
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Ablation Study for Different Noise Candidates</head><p>[Fig. 8. x-axis: injected noise type (Gaussian, RC-Gaussian, LC-Gaussian, RT-Gaussian, LT-Gaussian).]</p><p>We also show the effectiveness of injecting RC-Gaussian noise in the training process by comparing it against three other noise candidates: LC-Gaussian, RT-Gaussian, and LT-Gaussian noise. The result of training with Gaussian noise is also included as a baseline. Without loss of generality, we perform this study on LeNet for the MNIST dataset using uniform RRAM devices with &#963;d = 0.25. As shown in Fig. <ref type="figure">8</ref>, training with RC-Gaussian noise shows a clear advantage over training with the other types of noise, by at least 8.76%. Note that training with left- and right-truncated Gaussian noise performs even worse than injecting Gaussian noise because the resulting models exhibit lower accuracy in the absence of device variations.</p></div>
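<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between the censored and truncated noise candidates can be made concrete with a short sampling sketch. It uses the standard statistical definitions, which are assumed here to match the ones used in this study: censoring clamps out-of-range draws to the cut value, whereas truncation discards them and redraws. The cut argument is an absolute value; how it relates to th and &#963;t is left open.</p>
<p><![CDATA[
```python
import random


def right_censored(sigma, cut, rng):
    """Draw from N(0, sigma) and clamp values above `cut` to `cut`."""
    return min(rng.gauss(0.0, sigma), cut)


def left_censored(sigma, cut, rng):
    """Draw from N(0, sigma) and clamp values below `cut` to `cut`."""
    return max(rng.gauss(0.0, sigma), cut)


def right_truncated(sigma, cut, rng):
    """Redraw from N(0, sigma) until the sample is at or below `cut`."""
    while True:
        x = rng.gauss(0.0, sigma)
        if x <= cut:
            return x


def left_truncated(sigma, cut, rng):
    """Redraw from N(0, sigma) until the sample is at or above `cut`."""
    while True:
        x = rng.gauss(0.0, sigma)
        if x >= cut:
            return x


rng = random.Random(0)
rc = [right_censored(1.0, 0.5, rng) for _ in range(5000)]
rt = [right_truncated(1.0, 0.5, rng) for _ in range(5000)]
print("share of RC samples exactly at the cut:", rc.count(0.5) / len(rc))
print("share of RT samples exactly at the cut:", rt.count(0.5) / len(rt))
```
]]></p>
<p>Censoring keeps a point mass at the cut value, so large perturbations are still represented during training, only capped. Truncation removes that mass entirely and renormalizes the remaining distribution.</p></div>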
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSIONS</head><p>In this work, we offer a mathematical explanation for the effectiveness of noise injection training in improving the robustness of DNN models under the influence of device variations. We also propose to use k-th percentile performance (KPP), instead of the widely used average performance, as a metric to evaluate the realistic worst-case performance of a DNN model. By analyzing the properties of DNN models and noise injection-based training, we show that conventional Gaussian noise injection training improves KPP but is far from optimal. Instead, we propose a novel noise injection training framework, TRICE, that injects right-censored Gaussian noise during training. Extensive experiments show that TRICE clearly outperforms SOTA baselines in improving the k-th percentile performance of DNN models.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Without loss of generality, we assume that H is a multiple of B.</p></note>
		</body>
		</text>
</TEI>
