<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Generalizing Functional Error Correction for Language and Vision-Language Models</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>12/18/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10597903</idno>
					<idno type="doi">10.1109/ICMLA61862.2024.00110</idno>
					
					<author>Wenyu Peng</author><author>Simeng Zheng</author><author>Michael Baluja</author><author>Tao Xie</author><author>Anxiao Jiang</author><author>Paul H. Siegel</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The goal of functional error correction is to preserve neural network performance when stored network weights are corrupted by noise. To achieve this goal, a selective protection (SP) scheme was proposed to optimally protect the functionally important bits in binary weight representations in a layer-dependent manner. Although it proved effective in image classification tasks on some relatively simple networks such as ResNet-18 and VGG-16, it becomes inadequate for emerging complex machine learning tasks arising in the natural language processing and vision-language association domains. To solve this problem, we extend the SP scheme in three directions: task complexity, model complexity, and storage complexity. Extensions to complex natural language and vision-language tasks include text categorization and "zero-shot" textual classification of images. Extensions to more complex models with deeper block structures and attention mechanisms consist of Very Deep Convolutional Neural Network (VDCNN) and Contrastive Language-Image Pre-Training (CLIP) networks. Extensions to more complex storage configurations focus on distributed storage architectures to support model parallelism. Experimental results show that the optimized SP scheme preserves network performance in all of these settings. The results also provide insights into redundancy-performance trade-offs, generalizability of SP across datasets and tasks, and robustness of partitioned network architectures.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Over the past decades, machine learning has witnessed an exponential increase in its application across various domains, ranging from computer vision and natural language processing (NLP) to complex systems like NAND flash memory systems <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref> and storage systems <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>. As the scope of deep learning applications expands, the architectures of neural networks have also become deeper and larger <ref type="bibr">[5]</ref>, with the recent trend of using Large Language Models (LLMs). For instance, GPT-3.5 has 175 billion parameters and a model size of approximately 700 GB <ref type="bibr">[6]</ref>, while the recently announced GPT-4.0 <ref type="bibr">[7]</ref> has even more parameters, further increasing its capacity and complexity. Model parallelism has been used to train such large neural networks. At the same time, as neural networks grow larger, their weights may need to be stored across different devices <ref type="bibr">[8]</ref>.</p><p>This work was supported in part by NSF Grant CCF-2416362.</p><p>When a neural network is trained, its weights need to be stored in a memory device. Noise such as retention errors in such devices can accumulate over time, degrading the performance of the neural network (e.g., reduced classification accuracy <ref type="bibr">[9]</ref>). To mitigate these errors, it is crucial to protect the neural network weights using error correction codes. Designing an optimal error correction scheme for neural networks presents several challenges. Modern neural networks often have millions of parameters, which entails extensive redundancy overhead for error correction. Moreover, the relationship between a neural network's weights and its performance is highly complex, further complicating the design. 
Experiments using independent binary symmetric errors in the bits representing the weights of a ResNet-18 model trained for image classification reveal two key insights <ref type="bibr">[10]</ref>. First, beyond a threshold bit error rate (BER), the network performance falls off rapidly. Second, errors in different network layers have different impacts on the overall network performance. These observations suggest the need for a layer-dependent error correction scheme that protects the functionally important bits in each layer while minimizing the redundancy required to preserve network performance.</p><p>The Selective Protection (SP) scheme proposed in <ref type="bibr">[10]</ref> achieved an effective trade-off between redundancy and performance in VGG-16 <ref type="bibr">[11]</ref> and ResNet-18 <ref type="bibr">[12]</ref> neural networks trained for image classification tasks. The SP approach leverages deep reinforcement learning (DRL), utilizing the Deep Deterministic Policy Gradient (DDPG) algorithm to learn an optimal policy that identifies a subset of important bits in each layer for protection using an error-correcting code (ECC).</p><p>However, the network architectures of VGG-16 and ResNet-18 are relatively simple, with a limited number of trainable parameters compared with state-of-the-art networks, and more complex machine learning tasks have arisen in the natural language processing and vision-language association domains. Given the recent trend toward using large language models (LLMs) to tackle increasingly complex tasks, along with the corresponding need to use distributed computing and storage in the realization of these massive models <ref type="bibr">[13]</ref>, the question arises whether the SP scheme can be generalized to these more challenging scenarios.</p><p>In this paper, we extend the SP paradigm to handle increased problem complexity in three directions: task complexity, model complexity, and storage complexity. 
Increased task complexity arises from the shift from simple image classification to language-related tasks such as text classification and vision-language tasks such as "zero-shot" generation of textual descriptions of images. Increased model complexity stems from the trend toward more complex models, such as models with self-defined block structures and attention-based LLMs. Storage complexity is an outgrowth of the trend toward model partitioning across multiple storage devices with possibly different reliability. Specifically, we generalize the SP scheme to text classification on VDCNNs (Very Deep Convolutional Neural Networks) <ref type="bibr">[14]</ref> and "zero-shot" textual classification of images on CLIP (Contrastive Language-Image Pre-Training) <ref type="bibr">[15]</ref> networks. To handle the more complex models, we modify the DRL algorithm to support different types of network layers. For the CLIP networks, we also extend SP to the scenario where the text encoder and vision encoder are stored on separate devices. We compare the performance of the optimized SP scheme to a natural baseline scheme in which all layers of the neural network receive the same level of protection from ECCs. Experimental results confirm the superiority of the optimized SP scheme in all of the considered scenarios. We demonstrate the generalizability of an optimized SP scheme transferred from a VDCNN trained on one dataset for a specific text classification task to another VDCNN trained on a different dataset for a different text classification task. We also show that the SP scheme performs well in a distributed scenario, where the model weights are stored on different devices with different BERs. To the best of our knowledge, we are the first to apply the SP scheme to these more complex learning scenarios. 
The code and data used in our experiments are available at <ref type="url">https://github.com/w6peng/GeneralizingSP</ref>.</p><p>The rest of the paper is organized as follows. In Section II, we review related works in more detail. In Section III, we introduce the SP scheme and present a modified DRL algorithm for distributed storage scenarios. In Section IV, we experimentally evaluate the SP scheme and demonstrate that it can substantially improve the redundancy-performance trade-off for neural networks. In Section V, we present a detailed analysis of the generalizability of the SP scheme and its limitations. In Section VI, we conclude the paper and discuss directions for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORKS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Robustness of Neural Networks</head><p>Researchers have studied the robustness of neural networks in different contexts. In machine learning, Carlini et al. <ref type="bibr">[16]</ref> discuss the vulnerability of neural networks to adversarial examples. They build an attack benchmark to illustrate that defensive distillation does not significantly increase the robustness of neural networks. Alshemali et al. <ref type="bibr">[17]</ref> also study the reliability of neural networks by "attacking" them with adversarial examples. They create a taxonomy to categorize the approaches for generating adversarial examples and discuss various types of defensive strategies against adversarial examples in NLP.</p><p>In coding theory, several authors have evaluated the performance impact of errors in the binary representations of neural network weights. Qin et al. <ref type="bibr">[9]</ref> consider the robustness of neural networks in the presence of storage media bit-flip errors. Fernandes et al. <ref type="bibr">[18]</ref> focus on the errors that arise when neural networks are running on a GPU and use ECCs to correct them. Upadhyaya et al. <ref type="bibr">[19]</ref> study how the performance of a neural network degrades when noise is present. They use analog ECCs to correct such errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Model Parallelism of Neural Networks and Storage</head><p>The recent trend toward using LLMs brings a new challenge of model scale in deep learning research. GPT-3.5 has 175 billion parameters, and M6-10T <ref type="bibr">[20]</ref> even has 10 trillion parameters. It is impossible for a single GPU to train such large networks. Model parallelism is a promising solution to this problem <ref type="bibr">[13]</ref>. Model parallelism is the technique that shards a neural network architecture graph into subgraphs and assigns each subgraph to a different device. These shards refer to groups of layers in a feedforward network. Facebook <ref type="bibr">[8]</ref> has applied model parallelism to train some of their applications. The layers of their models are grouped and distributed to optimize for throughput between machines. To guarantee data consistency, the weights of the grouped layers need to be stored via checkpointing during the training process.</p></div>
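The layer-sharding idea described above can be sketched in a few lines of code. The following is our own illustrative sketch, not the paper's or Facebook's actual system; `shard_layers` and the parameter counts are hypothetical, and real systems also optimize for inter-machine throughput.

```python
# Illustrative sketch of model parallelism: shard a feedforward network's
# layers into contiguous groups, one group per device.
def shard_layers(param_counts, num_devices):
    """Greedily group consecutive layers so each device holds a roughly
    equal share of the total parameters."""
    total = sum(param_counts)
    target = total / num_devices
    shards, current, acc = [], [], 0
    for i, n in enumerate(param_counts):
        current.append(i)
        acc += n
        # Close this shard once it holds its share, keeping at least
        # one layer available for each remaining device.
        if acc >= target and len(shards) < num_devices - 1:
            shards.append(current)
            current, acc = [], 0
    shards.append(current)
    return shards

# Example: six layers (parameter counts) split across three devices.
print(shard_layers([300, 100, 250, 150, 100, 100], 3))  # [[0, 1], [2, 3], [4, 5]]
```

In this greedy scheme each shard is a run of consecutive layer indices, matching the observation that shards refer to groups of layers in a feedforward network.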
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Very Deep Convolutional Networks</head><p>Text classification is a widely used machine learning application in the NLP domain. For example, given a text sample, the model classifies it into a predetermined class, such as "sports" or "arts". VDCNN architectures, introduced in <ref type="bibr">[14]</ref>, incorporate many convolutional layers and have been successful when applied to text classification tasks. Their design is based on the observation that although a fully connected neural network with one hidden layer can in theory learn any real-valued function, problem-specific deep architectures can develop hierarchical representations that yield better network performance. Inspired by ResNet, the VDCNN incorporates a stack of "convolutional blocks", each consisting of two sequences of a convolutional layer, a temporal batch normalization layer, and a ReLU activation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Contrastive Language-Image Pre-Training</head><p>CLIP <ref type="bibr">[15]</ref> is a large vision-language model focusing on the "zero-shot" prediction task. Traditional computer vision models rely on a fixed set of predetermined object categories, which limits their applicability to other tasks. In contrast, CLIP leverages raw text descriptions associated with images as a flexible way of supervision. Based on those observations, OpenAI built CLIP using 400 million image-text pairs from the Internet. The model excels in "zero-shot" prediction, where it labels an input image using text descriptions without needing explicit training for specific tasks. The performance of CLIP is comparable with ResNet-50 on ImageNet in a zero-shot setting. CLIP achieves that without using any predetermined object categories from ImageNet.</p><p>Using CLIP involves a text classifier that includes object classes and text templates. For example, object classes might be cat, dog, or airplane, and text templates could be "a photo of a {}" or "a photo of my {}". The {} here represents the name of an object class. Given an input image, such as a dog, CLIP calculates similarity scores for the text classifier. The text description with the highest score is selected as the label, resulting in descriptions like "a photo of a dog".</p></div>
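The zero-shot labeling procedure just described (score every prompt against the image, pick the best) can be sketched with toy embeddings. This is a hedged illustration of the mechanism only: real CLIP text and vision encoders are replaced here by hand-made two-dimensional vectors, and `zero_shot_label` is our own name.

```python
import numpy as np

# Sketch of CLIP-style zero-shot prediction: cosine similarity between an
# image embedding and the embeddings of the prompt texts.
def zero_shot_label(image_emb, text_embs, prompts):
    # Normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = text_embs @ image_emb          # one similarity score per prompt
    return prompts[int(np.argmax(scores))]  # highest-scoring text wins

prompts = ["a photo of a cat", "a photo of a dog", "a photo of an airplane"]
text_embs = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, 0.5]])  # toy encoder outputs
image_emb = np.array([0.2, 0.9])            # toy "dog-like" image embedding
print(zero_shot_label(image_emb, text_embs, prompts))  # a photo of a dog
```

The `{}` templates in the section ("a photo of a {}") would simply be formatted with each class name before encoding.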
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. SELECTIVE PROTECTION SCHEME BY DEEP REINFORCEMENT LEARNING</head><p>In this section, we review the Selective Protection (SP) scheme for functional error correction <ref type="bibr">[10]</ref>. Then, we present the optimization of the SP scheme that can support all the layers with trainable weights. Lastly, we consider a distributed scenario for CLIP to which we will extend the SP scheme.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Selective Protection Scheme</head><p>When the weights of neural networks are stored in memory devices, we assume two ways to represent them as bits. 1) Floating-Point Representation: Each weight is stored in a standard 32-bit floating-point format. 2) Fixed-Point Representation: The fixed-point representation is widely used in the model quantization area for compression and acceleration of neural network inference <ref type="bibr">[21]</ref>. The weights in [-c, c] are linearly quantized and represented as bits. Let m be the number of bits used for the quantization, and let s = c/(2^(m-1) - 1) be its scaling factor. A weight w &#8712; [-c, c] is then represented by the m-bit binary encoding of the integer round(w/s).</p><p>Next, we present the main idea of the SP scheme. Suppose a neural network has N layers with trainable weights. For i = 1, 2, ..., N, let L_i denote the ith layer with trainable weights W_i, and let m denote the number of bits used to represent each weight in W_i. The SP scheme selects a bit-mask vector M_i = (&#181;_{i,0}, &#181;_{i,1}, ..., &#181;_{i,m-1}) &#8712; {0, 1}^m for each layer L_i. For j = 0, 1, ..., m-1, if &#181;_{i,j} = 1, we protect the jth bit position of each weight in W_i. Every weight could in principle have its own bit-mask vector, but finding all of them would incur a large overhead. So in this work, we apply one bit-mask vector to all the weights in the corresponding layer. The SP scheme aims to achieve the best neural network performance while maintaining a low redundancy. Assume we use an (n, k) linear code as the ECC, where n is the codeword length and k is the number of information bits. The redundancy r is defined as the number of parity-check bits divided by the total number of bits used to represent the weights: r = ((n - k)/k) * (k_pro / k_total), where k_pro represents the number of bits protected by the ECC, and k_total represents the total number of bits. For a network with N layers, we have k_pro = &#8721;_{i=1}^{N} |W_i| &#8721;_{j=0}^{m-1} &#181;_{i,j} and k_total = m &#8721;_{i=1}^{N} |W_i|. Then, the redundancy of the SP scheme is r = ((n - k)/k) * (&#8721;_{i=1}^{N} |W_i| &#8721;_{j=0}^{m-1} &#181;_{i,j}) / (m &#8721;_{i=1}^{N} |W_i|). Let P denote the performance of the neural network, and let r&#772; be the target redundancy. 
We optimize the SP scheme using DRL to maximize P subject to the redundancy constraint r = r&#772;. By setting different values of k_pro/k_total and the target redundancy r&#772;, we can control the protection level of the network.</p><p>We assume that the bits suffer from independent bit-flip errors with probability p; that is, the errors are modeled by a binary symmetric channel (BSC) with BER p. We choose an ECC that can correct the errors induced by the BSC with probability close to 1. The SP scheme uses DRL to learn the most important bits for ECC protection. We briefly describe the main components of the DRL algorithm (state space, action space, reward function, and policy of agents) and then review the learning process in the context of the SP scheme.</p><p>1) State Space: A global state space and a set of local state spaces are used in the SP scheme. The global state space &#920; contains the global network configurations and the set of local state spaces. The local state space &#928;_i &#8834; &#920; contains the states used by the agent of layer L_i.</p><p>For a neural network with N layers, let i be the index of the ith layer L_i. Let c_in^i and c_out^i be the numbers of input and output channels (feature maps). Let s_kernel^i be the kernel size and s_stride^i the stride of the convolution. Let s_feat^i be the size of the input feature map. Let A denote the action space, which is discussed later, and let a_i &#8712; A be the most recent action taken by the agent for L_i. The local state &#960;_i &#8712; &#928;_i for each layer L_i is defined as &#960;_i = (i, c_in^i, c_out^i, s_kernel^i, s_stride^i, s_feat^i, a_{i-1}). (7) When i = 1, the action a_{i-1} = a_0 can be taken to be the baseline action, where every layer shares the same level of protection.</p><p>2) Action Space: The agent of layer L_i takes an action to set the value a_i &#8712; {0, 1}^m of its m-bit bit-mask M_i. So the overall action is the sequence of actions (a_1, a_2, ..., a_N). There are two ways to update the actions: the BitMask method and the TopBits method. 
BitMask chooses the value of a_i based on the local state &#960;_i and the reward function; the bits in the bit-mask vector M_i are selected without any restrictions. Unlike BitMask, TopBits only selects the first few most significant bits for ECC protection.</p><p>3) Reward Function: The reward function is set differently for TopBits and BitMask. Let P_0 denote the performance of the neural network without any errors. In each iteration, the current performance P is measured after we set the bit-masks and apply the noise based on the BER p. For the TopBits method, the reward function after the iteration is set as R = P - P_0. The reward function of BitMask also needs to account for the distance between the redundancy r after the iteration and the target redundancy r&#772;. Let &#946;+ and &#946;- be two positive real scalars. We define a function f(r, r&#772;) as f(r, r&#772;) = -&#946;+ (r - r&#772;) if r &gt; r&#772;, and f(r, r&#772;) = -&#946;- (r&#772; - r) otherwise, and the reward function as R = P - P_0 + f(r, r&#772;). Note that the penalty f(r, r&#772;) &#8804; 0 helps the DRL algorithm reduce the redundancy while keeping the performance P near P_0.</p><p>4) Policy of Agents and the Learning Process: The policy of the DRL algorithm guides each agent A_i to choose the corresponding action a_i based on the local state &#960;_i and reward R. To learn the optimal policy for each agent A_i, we build four four-layer multilayer perceptron networks: the Actor Network, Target Actor Network, Critic Network, and Target Critic Network.</p><p>We use the DDPG algorithm to train these four neural networks. After an iteration, the local states, the actions, and the reward of the iteration are stored in a circular buffer in memory. At the same time, samples are randomly drawn from the buffer to train those networks to find the optimal policy.</p></div>
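The core mechanics of the section (fixed-point quantization with scale s = c/(2^(m-1) - 1), a per-layer bit-mask, BSC bit-flips on unprotected positions, and the redundancy formula) can be sketched as follows. This is our own simplified illustration under stated assumptions, not the paper's implementation: weights are stored as signed integers, and `flip_unprotected_bits` is a stand-in for the bit-level channel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: m-bit fixed-point quantization with scale s = c / (2**(m-1) - 1).
def quantize(w, c, m):
    s = c / (2 ** (m - 1) - 1)
    q = np.clip(np.round(np.asarray(w) / s),
                -(2 ** (m - 1) - 1), 2 ** (m - 1) - 1)
    return q.astype(np.int64), s

def flip_unprotected_bits(q, mask, m, p):
    """mask[j] == 1 means bit position j is ECC-protected (never flipped)."""
    noisy = q.copy()
    for j in range(m):
        if mask[j]:
            continue
        flips = rng.random(q.shape) < p        # BSC with BER p
        noisy ^= flips.astype(np.int64) << j   # flip bit j where errors hit
    return noisy

def redundancy(n, k, k_pro, k_total):
    # Parity bits per information bit, times the protected fraction of bits.
    return ((n - k) / k) * k_pro / k_total

w = rng.normal(0.0, 0.1, size=1000)
q, s = quantize(w, c=0.5, m=8)
# Protect only the two most significant bit positions (a TopBits-style mask).
noisy = flip_unprotected_bits(q, [0, 0, 0, 0, 0, 0, 1, 1], 8, p=0.01)
```

With a full bit-mask (all ones) the noise has no effect, which is the behavior the SP scheme pays redundancy to approximate on the important positions only.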
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Optimizations</head><p>In <ref type="bibr">[10]</ref>, the authors only protect the linear and convolutional layers of VGG-16 and ResNet-18. The SP scheme should protect all layers with trainable weights, and it will need to protect other types of layers in the future. We therefore developed a script that detects all layer types carrying trainable weights. Once the SP scheme obtains those layer types, the state parameters are set accordingly. In this work, we consider four kinds of layers: linear layers, convolutional layers, batch normalization layers, and embedding layers. Note that CLIP uses transformers but still contains one convolutional layer at the head of the vision encoder. The linear layer can be seen as a special case of a convolutional layer whose input feature map has the same size as the kernel. The batch normalization layer's input and output feature maps are the same. For the embedding layer, the input feature map is 0, and the output feature map is the size of each embedding vector. The kernel size and stride of the batch normalization and embedding layers are set to 0.</p></div>
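The mapping of layer types onto a common state tuple can be sketched as below. This is a hedged illustration of the rules stated in the section, not the authors' script: the dictionary-based layer descriptors and the function name `layer_state` are our own, and the linear-layer kernel value is our reading of "input feature map has the same size as the kernel".

```python
# Sketch: map each supported layer type to a common (c_in, c_out, kernel,
# stride) tuple for the DRL state, following the rules in Section III-B.
def layer_state(layer):
    kind = layer["type"]
    if kind == "conv":
        return (layer["c_in"], layer["c_out"], layer["kernel"], layer["stride"])
    if kind == "linear":
        # A linear layer is a conv whose kernel matches the input feature map.
        return (layer["c_in"], layer["c_out"], layer["c_in"], 1)
    if kind == "batchnorm":
        # Input and output feature maps are identical; kernel and stride are 0.
        return (layer["c"], layer["c"], 0, 0)
    if kind == "embedding":
        # Input map is 0; output is the embedding dimension; kernel/stride 0.
        return (0, layer["dim"], 0, 0)
    raise ValueError(f"layer type without trainable weights: {kind}")

print(layer_state({"type": "embedding", "dim": 16}))  # (0, 16, 0, 0)
```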
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Distributed Weights of Networks</head><p>In the distributed scenario, we consider different devices with varying BERs, each storing different layers. For i = 1, 2, ..., N, let L_i denote the ith layer, and let W_i denote the weights of layer L_i. For t = 1, 2, ..., T, we have T devices D_1, D_2, ..., D_T. Let S_1, S_2, ..., S_T denote the indices of the first layer stored on each device. Then device D_t and its set of layers can be represented as D_t = {L_{S_t}, L_{S_t + 1}, ..., L_{S_{t+1} - 1}}, where we set S_{T+1} = N + 1.</p><p>Suppose f is a function that takes weights W and a BER p and generates the noisy weights to be corrected. Then the noisy weights W&#771;_i of the ith layer L_i stored on device D_t with BER p_t are defined as W&#771;_i = f(W_i, p_t). The SP scheme learns from the distributed scenario and balances the protection levels to generate suitable bit-masks.</p><p>For CLIP, we consider placing the vision encoder and the text encoder on two devices D_1 and D_2. The indices of the first layers on these two devices are S_1 = 1 and S_2 = 64, respectively. Consideration of more complex models with more devices is left for future work.</p></div>
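The distributed noise model (device t stores a contiguous range of layers and corrupts them with its own BER p_t) can be sketched as follows. This is our own minimal sketch with 0-based indices; the sign-flip function `f` is a simple stand-in for the paper's bit-level channel, not its actual noise model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the distributed scenario: device t stores layers
# starts[t] .. starts[t+1]-1 and applies its own BER.
def device_of(i, starts):
    """Map a layer index i to the device whose range contains it."""
    t = 0
    while t + 1 < len(starts) and i >= starts[t + 1]:
        t += 1
    return t

def apply_distributed_noise(weights, starts, bers, f):
    """Corrupt each layer with the BER of the device that stores it."""
    return [f(w, bers[device_of(i, starts)]) for i, w in enumerate(weights)]

def f(w, p):  # stand-in noise: flip the sign of each weight with prob. p
    return np.where(rng.random(w.shape) < p, -w, w)

weights = [np.ones(4) for _ in range(5)]
starts = [0, 3]                  # device 0: layers 0-2, device 1: layers 3-4
noisy = apply_distributed_noise(weights, starts, [0.0, 1.0], f)
```

With BERs 0.0 and 1.0 the first device's layers are untouched while the second's are fully corrupted, the extreme version of the asymmetry the SP scheme must balance.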
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTAL RESULTS</head><p>In this section, we first present the experimental setup. Then, we quantitatively evaluate the impact of noise, introduced globally as well as layer-by-layer, on the performance of VDCNN and CLIP networks as a function of the BER. Next, we compare the redundancy-performance trade-offs of the baseline method and the SP scheme. Finally, we demonstrate the generalizability of SP bit-masks for VDCNN across datasets and text classification tasks, and we show the effectiveness of SP in distributed CLIP networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experimental setup</head><p>We apply the SP scheme to VDCNN and CLIP using a server equipped with NVIDIA RTX 4090 and NVIDIA RTX A6000 GPUs. VDCNN has 15 layers with trainable weights and 14.3 million parameters. CLIP has 125 layers with trainable weights and 156 million parameters. For the text classification task, we focus on news categorization using two datasets, AG News and Sogou News <ref type="bibr">[22]</ref>. AG News contains four classes with English content. For each class, the dataset has 30,000 training samples and 1,900 testing samples. Sogou News has five classes with Chinese content. For each class, the dataset has 90,000 training samples and 12,000 testing samples. For the "zero-shot" prediction task, we use the ImageNetV2 dataset <ref type="bibr">[23]</ref>. ImageNetV2 contains three test sets, each with 10,000 color images in a variety of sizes. The original accuracies of the VDCNNs for AG News and Sogou News are 90.28% and 95.78%, respectively. The original accuracy of CLIP for ImageNetV2 is 83.39%. The performance of the neural networks, measured before and after application of the SP bit-masks, is the average of 10 runs.</p><p>We use the same ECCs as in <ref type="bibr">[10]</ref>, namely an ideal capacity-achieving code and two Bose-Chaudhuri-Hocquenghem (BCH) codes selected for the floating-point and fixed-point weight representations, respectively. The ideal ECC has code rate equal to the Shannon capacity of the BSC with BER p, that is, 1 - H(p), where H(p) is the binary entropy function. We assume in our experiments that decoding of the ideal ECC is always successful and that the code protects all of the bits selected by the bit-mask. The BCH codes have parameters (n, k) = (8191, 6722) and (n, k) = (8191, 6787) for the floating-point and fixed-point representations, respectively. 
On a BSC with BER p = 0.01, both codes can decode with a sufficiently small failure probability to minimally impact network performance in a full weight protection scenario.</p></div>
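The code-rate arithmetic behind this setup is easy to verify. The short sketch below (our own, using only the parameters quoted in the section) computes the ideal code rate 1 - H(p) at p = 0.01 and the per-information-bit parity overhead (n - k)/k of the two BCH codes.

```python
import math

# Rate arithmetic for the ECCs in the experimental setup.
def binary_entropy(p):
    """H(p) for a binary symmetric channel with crossover probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ideal_rate = 1 - binary_entropy(0.01)   # Shannon capacity of BSC, ~0.9192
bch_float = (8191 - 6722) / 6722        # parity overhead, float-repr BCH code
bch_fixed = (8191 - 6787) / 6787        # parity overhead, fixed-repr BCH code
print(round(ideal_rate, 4), round(bch_float, 4), round(bch_fixed, 4))
```

The ideal code thus spends far fewer parity bits per protected bit than either BCH code, which is consistent with the lower redundancies reported for the ideal ECC throughout Section IV.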
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Noise Impact on VDCNN and CLIP</head><p>We now demonstrate the noise impact on VDCNN, CLIP, and distributed CLIP networks. Figure <ref type="figure">1a</ref> shows the noise impact on VDCNN using the AG News dataset. Figure <ref type="figure">1b</ref> shows the layer-by-layer noise impact when errors are introduced into only one layer of the network at a time. Most layers need to be protected to avoid serious performance degradation, with the exception of layers 1, 14, and 15. Figure <ref type="figure">2a</ref> shows the noise impact on CLIP using the ImageNetV2 dataset with floating-point representation. Since CLIP uses the attention mechanism and is trained on 400 million image-text pairs, it is more robust than VDCNN. Figure <ref type="figure">2b</ref> shows the noise impact using fixed-point representation. The performance is seen to degrade more slowly as a function of increasing BER compared to the floating-point representation. Figure <ref type="figure">2c</ref> shows the noise impact on each layer in CLIP with floating-point representation. Since the number of layers is large, namely 125, we omit the legend for the layer indices in the figure. However, one layer is clearly the most vulnerable to noise: layer 125, the last layer in the text encoder. If it fails to properly encode the text classifier, it is impossible to find the best label for the given image.</p><p>Lastly, we discuss the distributed CLIP scenario. We assume the weights of the vision encoder and text encoder are stored separately on devices D_1 and D_2 with possibly different BERs. Assuming a floating-point representation, we test the noise impact by fixing the BER for one device at 10^-6 and varying the BER of the other device from 10^-7 to 10^-3. 
The choice of 10^-6 for the fixed BER is motivated by Figure <ref type="figure">2a</ref>, which shows that the CLIP performance decreases by 25% when this BER is applied to all of the weights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Redundancy-performance trade-off</head><p>Now, we discuss the extension of the SP method to text classification tasks and "zero-shot" prediction tasks. The BER is assumed to be 0.01 for both the VDCNN and CLIP models. Figure <ref type="figure">4</ref> shows the redundancy-performance trade-off for the ideal ECC and the BCH code on VDCNN with floating-point representation. The SP scheme outperforms the baseline method in all cases at the same redundancy levels. In Figure <ref type="figure">4a</ref>, when we use the ideal ECC for decoding, the BitMask method can protect the weights of VDCNN with a redundancy of only 0.008, compared with 0.021 for the baseline method. TopBits can protect the weights with a redundancy of 0.019, which is also better than the baseline method. In Figure <ref type="figure">4b</ref>, when we use the BCH code, BitMask can protect the weights with a redundancy of 0.020, compared with 0.055 for the baseline method. TopBits also performs better, protecting the majority of the weights with a small redundancy of 0.047. The redundancy-performance trade-off of CLIP is shown in Figure <ref type="figure">5</ref>. The SP scheme again outperforms the baseline method under the same redundancy constraint. When we use the fixed-point representation and the ideal ECC for decoding, as in Figure <ref type="figure">5c</ref>, TopBits can protect the weights of CLIP with a redundancy of 0.043, compared with 0.076 for the baseline method. BitMask can protect a majority of the weights with a redundancy of 0.054, which is still better than the baseline method. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. VDCNN Bit-mask Generalizability</head><p>We now evaluate the generalizability of the SP scheme in text classification tasks by transferring the bit-masks learned from AG News to Sogou News, and vice versa. The BER is set to 0.01 for the VDCNN. We aim to protect up to 25% of the total number of bits representing the weights of the VDCNN. Note that the bit-masks learned by the SP scheme may correspond to a smaller fraction of protected bits if satisfactory performance is achieved by protecting fewer bits. The redundancy values corresponding to the learned bit-masks for the various combinations of SP method, weight representation, dataset, and code are shown in Table <ref type="table">I</ref>.</p><p>Table I also shows the accuracy of the SP scheme using the different representations and codes. The value in the column labeled "Accuracy" is the accuracy of the SP scheme using the bit-masks learned from the specified dataset for the specified weight representation and code. The value in the column labeled "Transferred Accuracy" is the accuracy achieved using the bit-masks learned from the other dataset with the same SP method, weight representation, and code. The "% loss" in accuracy is calculated as 100 &#215; (1 - transferred accuracy/accuracy). For example, the "Transferred Accuracy" in the first row of Table <ref type="table">I</ref> refers to the accuracy achieved on the AG News VDCNN using the bit-masks learned from the Sogou News VDCNN by the BitMask method with floating-point representation and an ideal ECC.</p><p>The average accuracy loss over all experiments is 1.16%. The maximum loss of 3.61% arises when we transfer the bit-masks learned from the AG News VDCNN by the TopBits method with fixed-point representation and a BCH code to the Sogou News VDCNN. 
Overall, these results demonstrate that the bit-masks learned from AG News for the English text classification task can also be generalized to Sogou News for the Chinese text classification task, and vice versa.  </p></div>
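The "% loss" metric used above is a one-line formula; the sketch below (our own) reproduces it and checks it against the first row of Table I (90.21% native accuracy, 88.97% transferred accuracy).

```python
# The Table I "% loss" column: 100 * (1 - transferred accuracy / accuracy).
def pct_loss(accuracy, transferred):
    return 100.0 * (1.0 - transferred / accuracy)

# First row of Table I: 90.21% native vs. 88.97% transferred accuracy.
print(round(pct_loss(90.21, 88.97), 2))  # 1.37, matching the table
```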
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Distributed CLIP</head><p>We next evaluate the redundancy-performance trade-off in a distributed CLIP setting. We set the BER p = 0.01 for the vision encoder and the BER p = 10^-6 for the text encoder. Figure <ref type="figure">6</ref> shows the redundancy-performance trade-off in this setting. The SP scheme outperforms the baseline method in all cases. In Figure <ref type="figure">6c</ref>, the SP scheme protects the weights of CLIP with redundancies of 0.032 and 0.103 using the ideal ECC and the BCH code for the fixed-point representation, respectively. In comparison, the baseline method protects the weights of CLIP with redundancies of 0.054 and 0.155, respectively.</p><p>Method | Representation | Dataset | Code | Redundancy | Accuracy | Transferred Accuracy | % loss
BitMask | Float | AG News | Ideal | 0.0218 | 90.21% | 88.97% | 1.37%
BitMask | Float | Sogou News | Ideal | 0.0218 | 95.64% | 94.73% | 0.95%
BitMask | Float | AG News | BCH | 0.0518 | 90.36% | 88.97% | 1.54%
BitMask | Float | Sogou News | BCH | 0.0547 | 95.49% | 94.73% | 0.80%
BitMask | Fixed | AG News | Ideal | 0.0029 | 90.16% | 87.78% | 2.64%
BitMask | Fixed | Sogou News | Ideal | 0.0036 | 95.12% | 93.42% | 1.79%
BitMask | Fixed | AG News | BCH | 0.0082 | 89.93% | 88.76% | 1.30%
BitMask | Fixed | Sogou News | BCH | 0.0096 | 95.15% | 92.85% | 2.42%
TopBits | Float | AG News | Ideal | 0.0205 | 90.49% | 90.15% | 0.38%
TopBits | Float | Sogou News | Ideal | 0.0206 | 95.76% | 95.47% | 0.30%
TopBits | Float | AG News | BCH | 0.0545 | 90.42% | 90.13% | 0.32%
TopBits | Float | Sogou News | BCH | 0.0537 | 95.78% | 95.66% | 0.13%
TopBits | Fixed | AG News | Ideal | 0.0190 | 90.50% | 90.17% | 0.36%
TopBits | Fixed | Sogou News | Ideal | 0.0202 | 95.75% | 95.61% | 0.15%
TopBits | Fixed | AG News | BCH | 0.0265 | 90.43% | 90.05% | 0.42%
TopBits | Fixed | Sogou News | BCH | 0.0492 | 95.78% | 92.32% | 3.61%</p><p>TABLE I: VDCNN SP Scheme and Bit-mask Generalizability Results</p><p>Since the overall noise level is smaller than for the scenario shown in Figure <ref type="figure">5</ref>, the "zero-shot" prediction 
accuracy for the distributed scenario is higher than the non-distributed case given the same redundancy constraints.</p></div>
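To make the per-device noise model described above concrete, the bit-error injection can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name, the flat bit layout, and treating ideal ECC as perfect protection of masked bit positions are our assumptions.

```python
import numpy as np

def flip_bits(weights, ber, protected_mask=None, seed=0):
    """Flip each stored bit of float32 `weights` independently with
    probability `ber`; bit positions where `protected_mask` is True
    are treated as error-free (an idealized ECC)."""
    rng = np.random.default_rng(seed)
    # View the stored weights as a flat stream of bits.
    bits = np.unpackbits(np.ascontiguousarray(weights, dtype=np.float32).view(np.uint8))
    flips = rng.random(bits.shape) < ber
    if protected_mask is not None:
        flips &= ~protected_mask  # protected bits never flip
    bits ^= flips.astype(np.uint8)
    return np.packbits(bits).view(np.float32).reshape(np.shape(weights))

# In the distributed setting each device's shard gets its own BER,
# e.g. vision-encoder weights at p = 0.01 and text-encoder weights
# at p = 1e-6, matching the experiment above.
w_vision = np.random.randn(256).astype(np.float32)
w_noisy = flip_bits(w_vision, ber=0.01)
```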
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. DISCUSSION</head><p>In this section, we first discuss the layer-wise distribution of selected bits learned by the SP scheme using VDCNN. Then, we discuss the bit-masks learned by the SP scheme using a distributed CLIP. Lastly, we discuss the limitations of the SP scheme observed in our experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Layer-Wise Distribution of Selected Bits</head><p>We use the bit-masks learned from the generalizability experiments on VDCNN, shown in Table <ref type="table">I</ref>, to illustrate the layer-wise distribution of selected bits. Figure <ref type="figure">7</ref> shows the layer-wise distribution using floating-point representation for AG News and Sogou News (Line 1 of the table). Although layers 13 and 14 account for 36% and 24% of the total number of weights in VDCNN, Figure <ref type="figure">7b</ref> shows that those two layers do not need as high a level of protection as other layers. The optimal SP scheme sacrifices the protection level of those wide layers to provide a higher level of protection for the narrower layers that are more vulnerable to errors.</p><p>To better understand the observed generalizability of the SP bit-masks in the VDCNN experiments, we introduce a similarity score for two sets of bit-masks. For a given SP method, weight representation, and code in Table <ref type="table">I</ref>, we measure the Hamming distance between the bit-masks obtained for each layer corresponding to AG News and Sogou News. We normalize the distance by the number of bits used in the specified weight representation and compute the average over all 15 layers of the VDCNN. As an example, consider the first and second rows in Table <ref type="table">I</ref>, corresponding to the BitMask method, floating-point representation, and ideal code with comparable redundancies. The lowest normalized Hamming distance among the 15 layers is 0.53125 and the highest is 0.8125, with an average of 0.6917 as the resulting similarity score of the two sets of bit-masks. The average of the similarity scores corresponding to the four BitMask scenarios represented in the first eight rows of Table I is 0.738. 
These results provide a possible explanation of the generalizability of the SP scheme: the bit-masks are similar for the two VDCNNs trained on different datasets.</p><p>Lastly, we discuss the difference between the layer-wise distributions of selected bits obtained using TopBits and BitMask for the floating-point representation. The bits selected by TopBits and BitMask differ mainly in the fraction bits. The sign bit and the exponent bits remain the dominant bits requiring more protection, so those bits are similar in the two sets of bit-masks.</p><figure><head>Fig. 7: Distribution of the selected bits learned from the SP scheme for VDCNN using AG News (a) and Sogou News (b) with floating-point representation.</head></figure></div>
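The similarity-score computation described in Section V-A can be sketched as follows. This is a minimal illustration assuming each layer's bit-mask is a 0/1 vector over the bit positions of the weight representation (32 for floating point); the function and variable names are ours.

```python
import numpy as np

def mask_similarity(masks_a, masks_b, bits_per_weight=32):
    """Per-layer Hamming distance between two sets of bit-masks,
    normalized by the representation width, and the average over
    layers used as the similarity score."""
    dists = [np.count_nonzero(np.asarray(a) != np.asarray(b)) / bits_per_weight
             for a, b in zip(masks_a, masks_b)]
    return min(dists), max(dists), sum(dists) / len(dists)

# Two hypothetical 32-bit masks that differ in 17 positions give a
# normalized distance of 17/32 = 0.53125, the smallest per-layer
# value reported for the first two rows of Table I.
```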
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Bit-Masks of Distributed CLIP</head><p>When the redundancy r is high enough for ECC protection, the bit-masks learned by the SP scheme are similar to those obtained in the non-distributed case (corresponding to equal BER on all devices). When r is low and the bit-masks cannot protect all the important bits, the SP scheme adapts to the distributed scenario. In the scenario discussed above, where the vision encoder experiences a higher BER, the SP scheme sacrifices bits that preserve the text encoder and increases the level of protection for the vision encoder. We use Figure <ref type="figure">6a</ref> as an example to illustrate this observation. When r = 0.024, the baseline method can protect the network and achieve the original accuracy; the DRL agent found this local optimum and kept the corresponding action. When r = 0.019, the baseline method cannot achieve the original accuracy, so the agent sacrifices bits that preserve the text encoder and increases the level of protection for the vision encoder: the average number of protected bits in the vision encoder is 11 at r = 0.019, compared with 8 at r = 0.024. When we instead set the BER of the text encoder higher than that of the vision encoder, the SP scheme likewise adapts and increases the protection level for the text encoder.</p></div>
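The shift in protection toward the noisier encoder can be quantified by counting protected bit positions per encoder, roughly as below. This is a hypothetical sketch: the layer-name prefixes ("visual.", "text.") follow common CLIP implementations, and the function name and per-layer averaging are our assumptions.

```python
import numpy as np

def avg_protected_bits(layer_masks, prefix):
    """Average number of protected bit positions across the layers
    whose name starts with `prefix`. `layer_masks` maps each layer
    name to a 0/1 bit-mask over bit positions."""
    counts = [int(np.sum(mask)) for name, mask in layer_masks.items()
              if name.startswith(prefix)]
    return sum(counts) / len(counts)

# With a higher BER on the vision encoder, the learned masks shift
# protection toward it -- e.g. the vision-encoder average rising
# from about 8 to 11 bits as r drops from 0.024 to 0.019.
```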
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Limitations of The SP Scheme</head><p>The SP scheme works well with VDCNN, for which it finds the optimal bit-masks at low redundancy. However, it suffers from a convergence problem with CLIP. There are three possible reasons. First, CLIP has 125 layers with trainable weights, so the DRL search space for CLIP is much larger than for VDCNN. Second, the search process appears random when we use the BitMask method, which corresponds to the policy of the agent. TopBits assumes that the leading bits are important for error protection and always protects the first few bits; BitMask needs many training epochs to learn that the first few bits are important, even when we use the baseline method as the starting action. We therefore need to design a better policy to guide the BitMask method toward the optimal bit-masks. Third, training efficiency must also be considered to support a large model like CLIP. The training algorithm of the SP scheme is a classic reinforcement learning approach. In the future, we may use other learning algorithms such as the Soft Actor-Critic (SAC) algorithm <ref type="bibr">[24]</ref> or the Proximal Policy Optimization (PPO) algorithm <ref type="bibr">[25]</ref> to help resolve the convergence problem.</p></div>
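The gap between the two search spaces discussed above can be made concrete with a small sketch; the function name is ours, not the authors' code.

```python
import numpy as np

def topbits_mask(k, width=32):
    """TopBits-style mask: protect the first k bit positions of the
    stored representation (sign bit and leading exponent bits)."""
    mask = np.zeros(width, dtype=bool)
    mask[:k] = True
    return mask

# TopBits picks one integer k per layer: at most `width` actions.
# BitMask picks an arbitrary 0/1 vector: 2**width actions per
# layer, which is one source of the convergence problem above.
```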
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSIONS</head><p>We demonstrated that selective protection (SP) for functional error correction extends to different machine learning tasks, to more complex model structures, and to distributed neural networks. It achieves a better redundancy-performance trade-off than the baseline protection method when applied to text classification and "zero-shot" prediction tasks, and it scales to more complex models such as VDCNN and CLIP. The bit-masks learned by the SP scheme can be transferred to other datasets using the same model, with limited loss in classification accuracy. The SP scheme can also be adapted to the partitioned-model scenario in which network weights reside on distributed storage devices, providing stronger protection to noisier devices. However, improving learning efficiency and reducing convergence time of the optimized SP scheme remains an open problem. Further investigation of SP schemes in the distributed storage setting is also warranted.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>(a) Floating-point ideal ECC. (b) Floating-point BCH code. (c) Fixed-point ideal ECC. (d) Fixed-point BCH code.</p></note>
		</body>
		</text>
</TEI>
