<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Masking Feedforward Neural Networks Against Power Analysis Attacks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/20/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10357582</idno>
					<idno type="doi">10.2478/popets-2022-0025</idno>
					<title level='j'>Proceedings on Privacy Enhancing Technologies</title>
<idno>2299-0984</idno>
<biblScope unit="volume">2022</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Konstantinos Athanasiou</author><author>Thomas Wahl</author><author>A. Adam Ding</author><author>Yunsi Fei</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Abstract                          Recent advances in machine learning have enabled Neural Network (NN) inference directly on constrained embedded devices. This local approach enhances the privacy of user data, as the inputs to the NN inference are not shared with third-party cloud providers over a communication network. At the same time, however, performing local NN inference on embedded devices opens up the possibility of Power Analysis attacks, which have recently been shown to be effective in recovering NN parameters, as well as their activations and structure. Knowledge of these NN characteristics constitutes a privacy threat, as it enables highly effective Membership Inference and Model Inversion attacks, which can recover information about the sensitive data that the NN model was trained on. In this paper we address the problem of securing sensitive NN inference parameters against Power Analysis attacks. Our approach employs              masking              , a countermeasure well-studied in the context of cryptographic algorithms. We design a set of              gadgets              , i.e., masked operations, tailored to NN inference. We prove our proposed gadgets secure against power attacks and show, both formally and experimentally, that they are composable, resulting in secure NN inference. We further propose optimizations that exploit intrinsic characteristics of NN inference to reduce the masking’s runtime and randomness requirements. We empirically evaluate the performance of our constructions, showing them to incur a slowdown by a factor of about 2–5.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Following ground-breaking advances in the field of machine learning, an increasing number of real-world applications, such as image and speech recognition <ref type="bibr">[17,</ref><ref type="bibr">19,</ref><ref type="bibr">26,</ref><ref type="bibr">48]</ref>, perform classification and prediction tasks using Neural Network (NN) inference. NN inference has also found numerous applications in domains that operate on real-time data. Wearable sensors for health monitoring <ref type="bibr">[36,</ref><ref type="bibr">38]</ref>, smart devices <ref type="bibr">[34,</ref><ref type="bibr">54]</ref> and autonomous vehicles <ref type="bibr">[32,</ref><ref type="bibr">56]</ref> are typical classes of embedded systems that collect data and perform a classification task in real-time.</p><p>In this paper we consider the scenario of NN inference algorithms running locally on an embedded device (rather than on a cloud computer), as employed by many popular smart applications and end-user health monitoring devices <ref type="bibr">[34,</ref><ref type="bibr">54]</ref>. This scenario is preferred in cases where data privacy is paramount (which may be compromised in the cloud either during transmission or by the cloud service itself), or where network connectivity is unreliable, e.g., in autonomous driving <ref type="bibr">[32,</ref><ref type="bibr">56]</ref>.</p><p>Even in the scenario of local inference, however, the privacy of the directly or indirectly involved data is not fully guaranteed. By pushing trained NN models to end-user devices, the NN model providers may be inadvertently making the privacy-sensitive training data partially recoverable to skilled attackers, which is particularly concerning in situations that deal with medical records, health data and biometrics. Membership Inference <ref type="bibr">[37,</ref><ref type="bibr">44,</ref><ref type="bibr">47]</ref> and Model Inversion <ref type="bibr">[11,</ref><ref type="bibr">12]</ref> attacks target the sensitive data that a NN model was trained on. When executed in a white-box setting, i.e. with detailed knowledge of the NN model (including specific values of weight parameters), both types of attacks have been shown to outperform attacks without such knowledge <ref type="bibr">[11,</ref><ref type="bibr">31]</ref>, by reducing the false-positive rate during training data recovery.</p><p>Knowledge of some NN model's parameters is a natural requirement of white-box attacks. Di erential power analysis (DPA) of the embedded device <ref type="bibr">[23]</ref> can provide this knowledge, paving the way for stealing privacy-relevant training data. DPA is a form of sidechannel attack in which an adversary recovers sensitive data involved in the computation of the device, by ob-serving its power consumption. Recent work has demonstrated that DPA can be successfully launched against NN inference models to reverse-engineer the number of layers and neurons in each layer, the activations as well as the values of its various parameters <ref type="bibr">[3,</ref><ref type="bibr">10,</ref><ref type="bibr">55]</ref>.</p><p>Motivated by the above privacy concerns, we ask whether model providers can be better equipped to limit the leakage of private information from the NN models they deploy. Specifically, we address the problem of protecting the internal NN parameters-which, in the above spirit, we consider to be secrets-against DPA. We assume the attacker has physical access to the device executing NN inference (e.g., by purchasing a health monitoring device) and can therefore control its inputs, and observe its outputs and power consumption. Our proposed DPA protection mechanism is masking <ref type="bibr">[5,</ref><ref type="bibr">16]</ref>, a form of secret sharing. We present a suite of masked operations fundamental in NN inference that are resistant against DPA. Masked computations are carried out in a finite ring and performed in a shared fashion, i.e., they may only operate on randomized shares of a secret, without reconstructing it.</p><p>Two characteristics of NN inference algorithms make their masking a challenging task. The first is related to data representation, as NN inference operates on numerical values which are usually represented by floating-point arithmetic. Masking of floating-point values is problematic due to significant precision losses during secret sharing and complex floating-point semantics. In this work we use fixed-point arithmetic for masking NN inference, as it represents numerical values using (signed or unsigned) bounded integers, which in turn are amenable to masking, with marginal losses in prediction accuracy <ref type="bibr">[14,</ref><ref type="bibr">27]</ref>.</p><p>The second is related to masking of the required operations themselves, which include primitive operations on fixed-point numbers as well as operations unique to NN inference. Multiplication in fixed-point arithmetic requires a truncation operation, which must be masked as well. We perform masked truncation by adapting an approach originally proposed in the context of multiparty computation <ref type="bibr">[35]</ref>. Among the non-linear operations, masking of the activations requires a masked sign operation, which we accomplish using the bitwise representation of signed integers and masked multiplication.</p><p>After obtaining a complete set of masked operations for NN inference, we turn our focus to computational and randomness requirements. To improve the e ciency of masked NN inference along both axes, we take into account its intrinsic characteristics, such as the chaining of layer operations, and propose e cient masked al-gorithms that are resistant against DPA. Consider the dot product operation (which takes up a bulk of the NN inference): instead of a straightforward substitution of primitive addition, multiplication and truncation by their masked counterparts, we fuse the masked primitives into a masked dot product that is more ecient and minimizes precision loss due to truncation. As for randomness requirements, given the large number of secret weight parameters (compared to the number of secret key values of cryptographic algorithms), we devise e ective ways to reduce the demands for randomness that is required for the weight parameters to be secret-shared, as well as to perform masked operations on them.</p><p>In summary our contribution are: 1. We devise a library of masked operations fundamental in NN inference; we call each of them a gadget.</p><p>We prove the security of our gadgets under the notion of 1-strong-non-interference (1-SNI) <ref type="bibr">[2]</ref>. This notion guarantees that an attacker observing single points in the shared computation cannot learn any information on the secrets, and that 1-SNI gadgets can be securely composed into larger constructions. 2. We show how to compose our proposed gadgets into larger constructions, e.g., Multi-Layer Perceptron (MLP) inference models. We reason about their security both formally-using again the notion of 1-SNI-and experimentally, using Test Vector Leakage Assessment (TVLA), a standard methodology for the security evaluation of DPA targets. 3. We describe a tightening of the randomness requirements for our proposed inference models. Our approach vastly reduces the required random numbers, a scarce resource in embedded devices, from linear in the number of NN parameters to linear in the depth of the NN, without impacting security guarantees. 4. Finally, we provide secure implementations of MLP and Convolutional NN (CNN) inference in a publicly available C library, and experimentally evaluate the runtime penalty incurred by our proposed constructions, as well as the improvements of the tightening. We provide results for both x86 and ARM, across di erent NN architectures. We show that our masked inference models do not hinder the accuracy performance and incur about a 5-fold slowdown for MLP and 2-3-fold slowdown for CNN (including the cost of generating the required amount of random numbers), comparable to or lower than the slowdowns su ered by commonly encountered masked cryptographic software <ref type="bibr">[42]</ref> (3-to 20-fold).</p><p>Masking of NN in the context of DPA has only been explored very recently <ref type="bibr">[9,</ref><ref type="bibr">10]</ref>. The prior work targets a special class of NN, namely Binarized NN (BNN) <ref type="bibr">[39]</ref>, which operate mostly on Boolean values, and implements a masked version of the latter on an FPGA. To our knowledge, our work is the first to support implementing arbitrary masked Feedforward NN.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Motivating Example</head><p>Side-channel attacks exploit physical characteristics of computational systems, such as timing <ref type="bibr">[24]</ref>, power consumption <ref type="bibr">[23]</ref> and electromagnetic emanations <ref type="bibr">[13]</ref>, to recover confidential data of the computation. In their seminal work on Di erential Power Analysis (DPA) <ref type="bibr">[23]</ref>, <ref type="bibr">Kocher et al.</ref> show that an attacker able to observe power traces of a cryptographic implementation, i.e., variations of power consumption during the steps of the computation, can exploit the correlation between the values produced along the computation and the amount of power required to process them, in order to recover cryptographic secrets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Threat Model</head><p>The main objective of this work is to devise algorithms for NN inference that are resistant against Power Analysis. The pre-trained NNs are executed as bare-metal applications on a microprocessor. Given a classification input, the microprocessor performs an inference round using the NN model. The parameters constitute the secrets of the NN. Our threat model assumes a non-invasive, passive attacker <ref type="bibr">[49]</ref> that is able to probe the power consumption of the processing unit. The attacker knows the algorithm executed on the microprocessor, but doesn't have access to the exact implementation. The attacker is non-invasive since they cannot depackage the processing unit in order to read secrets directly from the wires of the device. The attacker is passive as they observe the device's behavior during the processing without tampering with its functionality, e.g., they don't inject faults to the computation or provide malformed inputs. Finally, the attacker is bounded, i.e., there is a limit on the number of probes that monitor the device's power consumption. Attackers that violate these assumptions, e.g., by using an unlimited number of probes <ref type="bibr">[25]</ref> fall outside the scope of this work.</p><p>Example Setting Human Activity Recognition (HAR) uses mobile sensors of real-time and embedded systems to enable a wide array of applications such as life-logging, healthcare, senior care etc. Deep Learning is a beneficial technology for HAR, as it improves the inference accuracy without burdening the underlying hardware <ref type="bibr">[18,</ref><ref type="bibr">29,</ref><ref type="bibr">57]</ref>. A malicious user with access to a sensor-based device can launch DPA to reverse engineer the parameters of the trained network performing HAR. We demonstrate DPA on a MLP with a single hidden layer of 512 neurons (common in HAR tasks <ref type="bibr">[29]</ref>) that expects as input motion readings and classifies them into one of 14 possible activities. The attacker provides multiple inference inputs to the device, recording the input and the device's power consumption for the inference round. The power consumption of an inference round for input i is the power trace t i ; each power trace consists of the same number of samples. The power consumption of the device for the trace t i at sample s is denoted by t i [s]. As a result, the attacker has collected sets of inputs I and power traces T (|I| = |T |).</p><p>Having recorded each inference input, the attacker targets the multiplication operation between inputs and weights of the fully connected layer, and makes a guess w g for the value of a specific weight. Using the guessed value and the set of inference inputs the attacker computes the set V = {v i := HW (i &#8226; w g )|i oe I}, where HW returns the Hamming weight of a value. Then, the attacker computes Pearson's correlation fl V,Ts , where</p><p>|i oe I} is the set of power consumption values at sample s. Figure <ref type="figure">1</ref> shows the values of fl for each sample of the power traces, for a total of 10K power traces, for an MLP performing HAR. Around sample 11K we observe a correlation peak, i.e., a high linear relationship between v i and the power consumption at sample 11K, which indicates that at this point of the computation the MLP performs a multiplication between an input and the value w g . By performing multiple such correlation tests for di erent weight guesses, the attacker can eventually recover the weights of the MLP <ref type="bibr">[3,</ref><ref type="bibr">49]</ref>. In our work we aim to provide NN inference algorithms that eliminate any correlations between the values processed by the device and its power consumption. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">(Deep) Feedforward Neural Networks</head><p>NN are computing systems that, given some input, ultimately render a decision, e.g., assign to the input a particular classification label. In the context of classification, Neural networks (NN) can be thought of as mathematical functions that map an input vector to a vector of probabilities, each representing the likelihood for one of the classification outcomes.</p><p>Feedforward neural networks (FNN) are the quintessential deep learning model and form the basis of many NN applications. They are called feedforward because information flows through the computation without any feedback. They are called networks because they can be thought of as composition of multiple functions. Each function represents a layer of computation, and the composition of multiple functions forms a chain of layers. The length of this chain is the depth of the network. The last layer of the chain is the output layer. All other layers are referred to as the hidden layers.</p><p>The number of units in a layer, named neurons, defines the layer's width. Each neuron's output is defined by an expression of the form Activation( q (weight&#8226; input) + bias): the results of the preceding layer form the inputs of the neuron, which undergo a linear transformation-i.e., a weighted summation followed by a bias addition-, and a non-linear activation step. An entire FNN architecture can be specified by its parameters along with the type of each layer. An architecture such that all the neurons in one layer are connected to the neurons in the next layer, i.e., it is Fully Connected (FC), and undergo a linear transformation followed by an activation function is known as a MLP. A MLP whose weight parameters are either -1 or 1 is a BNN. A FNN that uses convolutional layer and potentially pooling and FC layers is a CNN.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Fixed-Point Arithmetic</head><p>Fixed-point arithmetic is an approximation of real arithmetic that lends itself well to representation on computational platforms. Fixed-point numbers allocate a fixed amount of bits to represent their fractional part. An immediate caveat is that for a given word size &#184;, a fixedpoint number can represent a smaller range of decimals compared to a floating-point number. The fixed number of bits of the fractional part brings, however, the advantage that operations over fixed-point numbers can be carried out by means of simpler circuits. As an example, addition in fixed-point arithmetic amounts to addition over integers that implement wrap-around semantics when overflows occur. This trade-o between precision/range and simplicity of data representation and operations has made fixed-point arithmetic a popular choice on embedded systems and more generally environments with constrained resources.</p><p>Fixed-Point numbers can be represented by signed or unsigned machine integers parameterized by two positive numbers &#184;and &#184;d, where &#184;is the bit-length of the integer word, and &#184;d &lt; &#184;. A fixed-point number x can be now interpreted as a fraction whose numerator is the two's complement representation of the &#184;-bit integer x and whose denominator is 2 &#184;d . That is, in fixed-point representation there is an implicit binary point between bits &#184;d + 1 and &#184;d. A fixed-point number x can be decomposed into x 1 &#8226;2 &#184;d +x 2 , where x 1 is a (signed) integer and x 2 oe [0, 2 &#184;d ).</p><p>Fixed-point arithmetic comes in signed or unsigned forms, depending on whether the bits of the numerator are interpreted as signed or unsigned numbers. The difference in sign representation results in di erent representation ranges of a (&#184;, &#184;d) fixed-point number, namely</p><p>for signed and &#203; 0, 2 &#184;&#8800;1 2 &#184;d &#200; for unsigned. Addition over fixed-point numbers is straightforward: it amounts to addition over (signed) integers. Multiplication is more complicated as it results in a fixed-point number with 2 &#8226; &#184;d bits representing the result's fractional part. The solution is an extra operation that truncates &#184;d bits from the computed product. For the fixed-point number x = x 1 &#8226; 2 &#184;d + x 2 , we define the truncation operation as &#194;x&#202; := x 1 , which is equivalent to a right-shift by &#184;d bits of x.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Masking</head><p>Masking <ref type="bibr">[5,</ref><ref type="bibr">16]</ref>, the most widely used countermeasure against DPA, is a form of secret sharing <ref type="bibr">[46]</ref> in which a secret is split into shares that can be combined using a suitable function to recover the secret. Let Z L be a finite ring of L elements. A (t + 1)-sharing of an element x oe Z L is a (t + 1)-tuple &#200;x&#205; = &#200;x 0 , . . . , x t &#205; oe Z t+1 L such that x 0 &#305; . . . &#305; x t = x and x i oe $ Z L for i oe [0, t), where &#305; oe {+, &#252;}. Symbol + denotes ring addition, &#252; is bitwise XOR, and oe $ denotes uniform random selection of an element from a set. Intuitively, t of the shares are drawn at random, and the (t + 1)-th share is computed from x and the rest of the shares. If &#305; = +, we speak of (t + 1)-arithmetic sharing and denote the (t + 1)-tuple by &#200;x&#205; L . If &#305; = &#252;, we speak of (t + 1)-Boolean sharing and denote the (t + 1)-tuple by &#200;x&#205; B . Unless stated otherwise, we assume (t + 1)-arithmetic sharings, or simply sharings, over Z L and write &#200;x&#205;. In rest of the paper we use masking and sharing interchangeably.</p><p>We call each element of a sharing &#200;x&#205; a share and denote the i-th share by &#200;x&#205; i . Given a sharing &#200;x&#205;, we denote its set of shares by S x . We use uppercase letters for matrices and vectors, and lowercase for scalars. Sharings of matrices and vectors are defined as pointwise sharings of their elements. For an n &#9674; m matrix</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Security Model</head><p>Our proposed algorithms use masking as the DPA countermeasure <ref type="bibr">[5,</ref><ref type="bibr">16]</ref>. <ref type="bibr">Ishai</ref>  NN inference can be organized as a sequence of operations, each applied layer after layer, giving it a compositional structure. Motivated by this characteristic of NN inference, we provide a set of secure basic NN operations which can in turn be securely composed into full NN inference algorithms. The standard notion of tprobing security, however, does not guarantee that the composition of t-probing secure functions is also secure <ref type="bibr">[8]</ref>. In this paper, we prove all our algorithms secure under the notion of strong non-interference (SNI) <ref type="bibr">[2]</ref> that enables reasoning about security in a compositional fashion. Gadgets, i.e., fundamental units of a larger con-struction, whose inputs are (t + 1)-sharings and that are shown to satisfy SNI, can be composed into algorithms that satisfy probing security.</p><p>Broadly, strong non-interference di ers from the weaker notion of non-interference (NI) (which is a rephrased notion of t-probing security proposed after the latter) by distinguishing output shares from intermediate shares (for NI the output shares are also considered to be intermediate). When output shares are considered in isolation, their joint distribution must be uniform, i.e., they must be independent of the input shares. When considered in conjunction with intermediate shares, their joint distribution may only depend on as many input shares as the number of intermediate shares considered. These two properties allow SNI gadgets to be composed safely with other NI gadgets in a construction, as their output shares reveal no information on their input shares. We restate (and slightly reformulate) here the notions of NI and SNI, originally introduced by Barthe et al. <ref type="bibr">[2]</ref>.</p><p>and the shares in V can be perfectly simulated from &#200;A 0 , . . . , A n&#8800;1 &#205;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2. Let G be a gadget over n input sharings</head><p>&#200;x 0 &#205;, . . . , &#200;x n&#8800;1 &#205; and output sharing &#200;o&#205;. G is t-strongnon-interferent, t-SNI for short, if, for every set S &#8482; S o of at most t output shares and every set V of t &#213; := t&#8800; |S| intermediate shares, there exist n sets A 0 , . . . , A n&#8800;1 such that, for all i,</p><p>and the shares in V and S can be perfectly simulated from &#200;A 0 , . . . , A n&#8800;1 &#205;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Example 3. We illustrate the di erences between the notions of NI and SNI. Algorithm 1 shows two gadgets that add two numbers x, y oe Z L each given as a</head><p>2-sharing into the result sharing &#200;z&#205;. It is easy to see that both gadgets are correct: recombining the sharing &#200;z&#205; by computing &#200;z&#205; 0 + &#200;z&#205; 1 results in x + y for both gadgets. The gadget on the left is 1-NI as each of the intermediate results depends on at most one of the shares of each input sharing &#200;x&#205;, &#200;y&#205;. More precisely, using Definition 1 with n = 2 and t = 1, for the intermediate share &#200;z&#205; 0 we have that it can be simulated by the 2-tuple of sets &#200;A 0 , A 1 &#205; = &#200;{&#200;x&#205; 0 }, {&#200;y&#205; 0 }&#205;. The same can be shown for &#200;z&#205; 1 and therefore the gadget is 1-NI. The same gadget is not, however, 1-SNI: using Definition 2 with n = 2 and t = 1, for the output share set S = {&#200;z&#205; 0 } we have</p><p>but the share &#200;z&#205; 0 oe S cannot be simulated from &#200;&#255;, &#255;&#205;.</p><p>In order to turn the shared addition algorithm into a 1-SNI gadget, an additional source of fresh randomness is required, denoted r in the gadget on the right. The intermediate shares &#200;w&#205; 0 and &#200;w&#205; 1 are defined using this random value and can thus be simulated from the empty set, which settles the case S = &#255; in Definition 2. For the case S = {&#200;z&#205; 0 }, we get as above: t &#213; = 0, hence</p><p>However, unlike with the gadget on the left, &#200;z&#205; 0 can be simulated from the empty set: we have &#200;z&#205; 0 = (&#200;x&#205; 0 &#8800; r) + &#200;y&#205; 0 . Since r is uniformly sampled from Z L , this output share is uniform, too. The same can be shown for the case S = {&#200;z&#205; 1 }.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Two gadgets implementing shared addition:</head><p>given 2-sharings &#200;x&#205; and &#200;y&#205;, return 2-sharing &#200;z&#205; s.t. z = x + y. The gadget on the right receives an additional input, r oe $ Z L .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1: &#200;z&#205;</head><p>To see why NI is not composable, consider a call to the addition gadget on the left above, followed by a call to another gadget, name it G, with input sharings &#200;z&#205;, &#200;x&#205;. Under NI, an intermediate result in G that can be perfectly simulated from shares &#200;z&#205; 0 and &#200;x&#205; 1 does not reveal any information on neither x nor z. In this case however, due to the preceding addition gadget we have that &#200;z&#205; 0 = &#200;x&#205; 0 + &#200;y&#205; 0 , which implies that the ability to simulate &#200;z&#205; 0 and &#200;x&#205; 1 results in leakage of the value of x. In the rest of this paper we focus on devising gadgets that are 1-SNI and therefore lend themselves to implementing secure NN inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Masking Neural Networks</head><p>In this section we present our masking scheme for secure Neural Network (NN) inference. The main ideas behind our approach are to use a finite ring of integers to represent numeric values as fixed-point numbers, to arithmetically mask this representation, and finally to design operations that take masked representations of numeric values as input, and return them as output.</p><p>The masked NN hold their masked parameters. At each inference round, the masked NN must mask the unshared input and refresh the parameters' masks (but never recover them). In order to design operations over masked representations of fixed-point values, we begin with gadgets for basic operations (primitives) common in NN inference. We show masked implementations of these primitives and prove them to be 1-SNI. An intermediate result is the value stemming from a primitive operation in an algorithm's statement. In the case of multiple operations in a single statement, evaluation order is dictated first by parentheses, in their absence by the "multiplication-before-addition" rule, and is otherwise left-to-right. We provide some security proofs and lemma statements in the main paper and include the rest of the proofs in the Appendix, along with auxiliary shared algorithms. The correctness of our algorithms follows directly from recombining their output shares; the correctness of our constructions follows by composing correct gadgets. We then give examples of masking the computation of network layers by composing suitable masked gadgets for the basic operations. Crucially, we tap into the strength of SNI, which guarantees that such compositions preserve the security properties a orded by masking. Finally, we give an example full construction by implementing a MLP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Representation</head><p>For our secure NN inference, we use a finite ring (Z L , +, &#8226;) of integers to represent signed fixed-point arithmetic. An integer x oe Z L with &#184;d bits allocated for its fractional part, where &#184;d &lt; &#184;:= log 2 L, represents the numeric value tc(x) 2 &#184;d where tc(x) interprets x in two's complement and returns its signed value. At the same time, x-being an element of Z L -readily lends itself to arithmetic masking, as described in Section 3.3, i.e., it can be split into shares &#200;x&#205; = &#200;x 0 , x 1 &#205;, where x 1 oe $ Z L and x 0 = x &#8800; x 1 . The sharings of the gadgets that follow in the rest of this section represent signed fixed-point values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Basic Operations</head><p>Dot Product These operations are fundamental in the linear components of FNN. In MLP, each neuron computes a dot product between inputs and weight param-eters. In CNN, each feature is extracted by convolving part of the input with a kernel, i.e., by performing a dot product between the two. Algorithm 2 describes a masked dot product gadget. It uses sharing &#200;c&#205; as an accumulator and performs point-wise secure multiplication between the elements of its two input vectors.</p><p>Algorithm 2 DotProd(&#200;A&#205;, &#200;B&#205;, r): shared dot product</p><p>Algorithm 2 is similar to the secure multiplication gadget between two masked secrets <ref type="bibr">[42]</ref>. It uses a single fresh random number to mask the accumulator variable. It is crucial that the accumulator is masked first, before proceeding with the point-wise masked multiplications. Alternatively, each masked multiplication can use a fresh random number, but this would require as many random numbers as the size of the vectors. The correctness of the algorithm follows by expanding &#200;c&#205; 0 + &#200;c&#205; 1 and arriving to c. We show below that the optimization of using a single random number does not compromise security:</p><p>Proof. We use Definition 2 with t = 1 and distinguish two cases. In the first, we consider the output shares, i.e., S = {&#200;c&#205; 0 } or S = {&#200;c&#205; 1 }, hence t &#213; = 0 and V = A 0 = A 1 = &#255;. We have:</p><p>Due to the random value r, both &#200;c&#205; 0 and &#200;c&#205; 1 are uniform and thus can be perfectly simulated by the empty sets of shares &#200;A 0 , A 1 &#205;.</p><p>In the second case we consider the intermediate shares, i.e., S = &#255;, hence t &#213; = 1. We show in Table <ref type="table">1</ref> all single intermediate shares forming set V (left), and the tuples of sets of input shares with cardinality at most 1 that can perfectly simulate them (right).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intermediate shares</head><p>Sets of input shares Truncation The arithmetic 2-sharings we use for masked NN inference represent signed integers, which in turn when interpreted as fixed-point integers represent decimal numbers. As explained in Section 3.2, one of the benefits of using the fixed-point representation of decimals is that decimal addition and multiplication can be immediately performed in fixed-point by means of signed integer addition and multiplication. In the case of multiplication, an extra truncation step is, however, required in order to keep the number of bits that represent the fraction part constant. More precisely, the result of each signed integer multiplication must be followed by the truncation of the result's &#184;d least significant bits.</p><p>In the context of multi-party computation of machine learning models, Mohassel et al. <ref type="bibr">[35]</ref> show a probabilistically approximately correct way to truncate an arithmetic 2-sharing. Their approach is easy and e cient to implement, and it incurs a small error on the least significant bit of the reconstructed result, with high probability. We recall here a theorem from the prior work:</p><p>Broadly, Theorem 5 states that, under some assumptions and with high probability, the truncated individual shares denoted by &#200;&#194;x&#202;&#205; 0 and &#200;&#194;x&#202;&#205; 1 may have an o -by-one error from the un-shared truncated value &#194;x&#202; when recombined. That is, in contrast to all other shared algorithms in this paper, the truncation operation may incur a precision loss. Moreover, if the assumptions of Theorem 5 are not satisfied, the truncation operation may result in a faulty value altogether. Specifically, for the shared truncation to return a value o -by-one from its unshared counterpart, the unshared value x must fall into a specific region of Z L , parameterized by the value &#184;t. Additionally the random number r used in the sharing &#200;x&#205; must be in the range [2 &#184;t , 2 &#184;&#8800;&#184;t ), therefore the probability of a correct result increases exponentially with the value of &#184;t. We discuss the choice of appropriate values for &#184;t further in Section 8. Algorithm 3 shows our rendering of Theorem 5 as a 1-SNI algorithm. The main di erence between the two is the additional fresh randomness at the end of Algorithm 3, which guarantees 1-SNI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 3 Trunc(&#200;x&#205;, r): shared truncation</head><p>Activation These functions constitute the nonlinear component of NN inference. In the context of this work we consider the rectifier function defined as ReLU (x) = max(0, x). A straightforward way to compute ReLU (x) in fixed-point arithmetic is via its derivative ReLU &#213; : Z ae {0, 1}:</p><p>Computing ReLU &#213; (x) essentially amounts to checking the sign of x. The fixed-point representation that we consider in this work uses signed integers, which are in turn represented in two's complement. The sign of the number therefore equals its most significant bit (MSB). However, the shared representation of the MSB is not simply given by the MSB of the two shares, as</p><p>In our approach we use known transformation functions between arithmetic and Boolean sharing. We transform the arithmetic 2-sharing &#200;x&#205; to its corresponding Boolean 2-sharing &#200;x&#205; B , which satisfies &#200;x&#205; B 0 &#252;&#200;x&#205; B 1 = &#200;x&#205; B . We then compute the MSB of x by bit extraction.</p><p>Algorithm 4 shows how to compute ReLU &#213; . First it calls the ArithToBoolean gadget, to transform the sharing from arithmetic to Boolean. It then extracts the MSB of the Boolean shares. If their XOR is zero then-according to two's complement representation-</p><p>x is positive, so ReLU &#213; must return 1, otherwise it must return 0. Hence, we negate one of the extracted bits (Line 3, via &#252; 1). Finally, to return a value that is arithmetically shared, we transform the Boolean 2-sharing of the extracted bit to an arithmetic sharing.</p><p>ArithToBoolean and BooleanToArith were originally introduced by Goubin <ref type="bibr">[15]</ref> and shown to be 1-NI. They have since been optimized and extended to higherorders of protection <ref type="bibr">[7]</ref> as well as shown to be 1-SNI <ref type="bibr">[6]</ref>. The conversion gadgets require 2 fresh random numbers each, in order to guarantee 1-SNI, so Algorithm 4 expects an array R consisting of 4 fresh random numbers.</p><p>ReLU &#213; is the first non-primitive gadget so far, constructed by composition of other 1-SNI gadgets. We provide a lemma useful in proving non-primitive gadgets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 6.</head><p>Let G be a 1-SNI gadget over input sharings &#200;x 0 &#205;, . . . , &#200;x n&#8800;1 &#205;, output sharing &#200;o&#205;. Each output share of G can be perfectly simulated from the empty set.</p><p>Theorem 7. The ReLU &#213; gadget (Alg. 5) is 1-SNI.</p><p>The final step toward computing the ReLU activation is to multiply the result of ReLU &#213; (x) with x. Algorithm 5 implements the shared ReLU activation, using the Mul gadget to multiply two sharings. Mul is well studied in the side-channel literature <ref type="bibr">[20,</ref><ref type="bibr">42]</ref> and has been shown to be SNI. While Algorithm 5 performs a multiplication, notice that it is not followed by a truncation operation. This is due to the first operand of the multiplication being the result of ReLU &#213; (x) which is either 0 or 1. This implies that the final multiplication result will be either 0 or x, neither of which requires truncating excess fractional bits. Algorithm 5 expects a vector of 5 fresh random numbers as parts of its inputs: 4 dedicated to the shared ReLU &#213; computation, and 1 to Mul. after their linear and non-linear components and are meant to provide a summary statistic of nearby elements of a layer. A commonly used statistic is that of the maximum element, i.e., MaxPool. To compute MaxPool over n shared elements, we iteratively compute the maximum of two values: the i-th element and the maximum of the i &#8800; 1 previous elements and return the result of the (n &#8800; 1)st maximum operation.</p><p>The above scheme allows us to focus on the problem of computing the maximum of two values, max(x, y). We observe that ReLU is a special case of max with y = 0 and use the same idea of extracting the sign of a value, only this time of the di erence x &#8800; y. We do this via the comparison function Cmp(x, y) := ReLU &#213; (x &#8800; y). The result of Cmp, i.e., the sign bit of x &#8800; y, along with its negation can now be used to compute the maximum as max(x, y) = Cmp(x, y) &#8226; x + (Cmp(x, y) &#252; 1) &#8226; y.</p><p>Algorithm 6 implements shared MaxPool. At iteration i, the shared Cmp(&#200;X i+1 &#205;, &#200;y i &#205;) gadget returns a tuple of sign sharings &#200;s i &#205;, &#200;s &#213; i &#205; where</p><p>The sign sharings are multiplied with the current shared element &#200;X i+1 &#205; and the previous maximum &#200;y i &#205;, resp., and their results are added to compute the maximum of the two.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 6 MaxPool(&#200;X&#205;, R): shared maxpool</head><p>As with ReLU , the two shared multiplications do not need to be followed by a shared truncation, as one of their operands is always a sign sharing, representing either 0 or 1. The Cmp gadget (shown in the Appendix) closely follows the shared ReLU &#213; gadget: it prepends Al-gorithm 4 with a shared subtraction operation and appends it with an additional call to BooleanToArith for the complementary sign sharing. Due to these two additional operations, shared Cmp requires 6 fresh random numbers. Shared MaxPool requires 9 random numbers per binary max computation, for a total of 9(n &#8800; 1) for a MaxPool of n elements.</p><p>Theorem 8 summarizes the security proofs for the remaining gadgets of basic operations. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Layer Operations and Full Constructions</head><p>We demonstrate how the gadgets for basic operations defined in the previous section can be composed into layer operations, and provide a full construction of a shared MLP. We compose the DotProd, Trunc, Add gadgets into the Linear gadget of Algorithm 7, which computes the linear component of a FC connected layer. To avoid unnecessary loss of precision we perform truncation once per neuron, after the dot product and before bias addition. The Linear gadget requires a number of fresh randoms linear in the layer's width m. A convolutional gadget can be constructed (and shown to be 1-SNI) using the same gadgets and invoking them in the same order (see Appendix). The di erence lies in the selection of inputs that are multiplied with the convolution kernel in the DotProd gadget.</p><p>Algorithm 7 Linear(&#200;I&#205;, &#200;W &#205;, &#200;B&#205;, R): shared FC linear component</p><p>The Activation gadget, shown in Algorithm 8, is straightforward and computes the non-linear component of a layer by applying the ReLU gadget to all neurons of a layer. As each call to ReLU requires 5 fresh random numbers the total cost of the activation layer in terms of fresh randoms is 5m where m the layer's width.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 8 Activation(&#200;X&#205;, R): shared Activation Component</head><p>3: return &#200;Z&#205;</p><p>The layer operations are composed exclusively of 1-SNI gadgets. Thus, by Lemma 6, all their intermediate and output shares can be simulated by tuples containing empty sets of shares. The proof of security for the layer operations follows directly from this observation. Theorem 9. The Linear, Activation gadgets are 1-SNI.</p><p>Having access to 1-SNI layer operations, we can compose multiple layers in a larger chain construction and form a shared MLP. In Algorithm 9 we show a shared MLP with n shared inputs, a single hidden layer of width m and an output layer of width k. It requires (3 + 5)m + 3k fresh random numbers in total, i.e., linear in the size of the network's width. The precision loss that can be incurred in the network, when compared to its unmasked fixed-point counterpart, grows with the depth of the network, as each linear layer operation requires a truncation step. Similarly to the layer operations, the security of Algorithm 9 follows directly from the 1-SNI layer gadgets.</p><p>Algorithm 9 FNN: 2-shared MLP. (1 hidden layer) </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Security Evaluation</head><p>In Section 5 we provided masked gadgets whose composition leads to masked FNN inference algorithms. We reasoned about their security in the probing model by showing the inference algorithms to be 1-SNI. This implies that an attacker probing 1 intermediate result of the masked computation cannot recover any informa-tion on the secret, i.e., the network's parameters. The probing model however, relies on some assumptions, e.g., that each intermediate result leaks independently <ref type="bibr">[40]</ref>. In this section we investigate the leakage behavior of our proposed algorithm in practice, and experimentally validate its security.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Test Vector Leakage Assessment (TVLA)</head><p>This is a leakage evaluation methodology <ref type="bibr">[45]</ref> that uses Welch's t-test to asses the security for implementations of masked algorithms. TVLA requires the collection of two power measurement datasets: one where the inputs to the masked implementation are fixed to a specific value, and one where the input of each measurement is random. We then compute the so-called fixed-vs-random t-test values for the two datasets on all points (i.e., samples) of the measurement traces and test the hypothesis that the traces in the two datasets have the same means on all points. High t-test values indicate violation of the null hypothesis, i.e., that the traces have the same means on all points. By increasing the degree of freedom of the t-test, i.e., the number of measurement traces in the datasets, the confidence on accepting or rejecting the null hypothesis also increases. Usually a predefined threshold is set to reject the null hypothesis based on the t-test value. In common industry standards for TVLA, the t-value threshold is set to |th| = 4.5. When the tvalues on all points of the traces are lower than the threshold, it confirms that there are no leakages on the traces. Attacking an implementation whose t-test values lie within |th| is expected to have the same success rate as a random guess <ref type="bibr">[43]</ref>.</p><p>Experimental Setup We use a STM32F407IG 32bit ARM Cortex-M4 CPU, clocked at 168Mhz and with 1MB of internal flash memory, to execute our masked implementations as bare-metal applications. We collect power measurements using a LeCroy oscilloscope with a sampling rate of 1GS/s (GS =giga-samples) in order to collect approximately 6 samples per clock cycle.</p><p>The CPU is part of the Pinata Development board <ref type="bibr">[41]</ref>, which bypasses the on-board voltage regulator and provides less-noisy power signals.</p><p>Figure <ref type="figure">2</ref> shows a diagram of our collection setup. A control unit (a PC) generates the input sharings as well as any random sources required by the gadgets, and communicates them to the board. The board reads the parameters and sets a trigger signal, prior to and after executing the masked algorithm, and communicates the algorithm's result to the control. The oscilloscope captures the power measurement within the trigger window and communicates it to the control unit: we call each power measurement collected in this way a trace. For reproducibility purposes, we use an AES-based seeded pseudo-random number generator (PRNG) to generate the random numbers required for the input and parameter sharings and for the fresh random numbers of the gadget computation. Evaluation We evaluate a masked C implementation of a MLP similar to the one presented in Algorithm 9, with an architecture of 2 inputs, a single hidden layer of 2 neurons, and an output layer of width 2. We chose this minimal architecture to keep the number of samples collected for each trace small, while still being able to exercise all the basic and layer operations described in Section 5. Following Algorithm 9, our implementation expects as input the input and parameter sharings (&#200;I&#205;, &#200;W 1&#205;, &#200;B1&#205; etc.) as well as the fresh random numbers (R), and outputs a classification probability vector. The sharings and fresh randoms are generated o ine (in the control unit). In our TVLA we use the non-specific t-test, which makes no assumptions on the leakage model of the board and targets all intermediate shares. During trace collection we interleave (at random) the traces of the fixed and random datasets to avoid any external influences or dependencies on the internal state of the board.</p><p>Figure <ref type="figure">3</ref> shows fixed vs random t-test results for the whole inference computation of the MLP described in the previous paragraph. We have performed three experiments. In the first, called PRNG-O , instead of using the PRNG to supply the random values used in sharings and the fresh random numbers, we set these values to a constant. This is the baseline, as the lack of randomness in the sharings should result in high leakage. This conjecture is confirmed in Figure <ref type="figure">3 (a)</ref>: at 100K traces, we see clear deviations from set threshold |th| = 4.5, for samples throughout the trace, rejecting the null hypothesis and indicating the existence of power leakage.</p><p>In the second experiment, called PRNG-On, we use the seeded PRNG to supply the random values. The result is shown in Figure <ref type="figure">3 (b)</ref>: at 1 million traces, there is no indication of exploitable power leakage, as the t-test values fall within the desired threshold, for all sample points along the whole trace, validating that our masked gadgets do eliminate first-order power leakage. In the third experiment, we increase the order of the t-test in the TVLA. A second-order TVLA checks the hypothesis that two datasets have the same second statistical moments (i.e., no second-order leakage). Figure <ref type="figure">3 (c</ref>) invalidates this hypothesis by showing values outside of th, which indicate that there are still second-order leakages in the traces. Since our implementations are 1-SNI, they do not have first-order leakage but do not protect against second-order leakage. With this experiment we validate the correctness of our measurement setup and the security protection of our gadgets. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Tightening Random Numbers</head><p>In Section 5.3 we gave an example of a shared MLP with n inputs, a single hidden layer of width m and an output layer of width k and found that, for an inference round, masking of the shared layer operations requires a number of randoms linear in the layer's width. In addition to masking operations, at each inference round the inputs as well as the weights of the MLP must be masked with fresh random numbers, increasing the randomness requirements to linear in the number of parameters of the MLP. Random numbers are an expensive resource, as their generation requires either hardware support or frequent invocation of PRNG routines.</p><p>In this section we describe our approach to tightening the number of required fresh randoms for masking of both the NN parameters and operations involved in a layer. The key observation that enables this tightening is that within a layer, each neuron's (in the case of MLP) or feature's (in the case of CNN) computation can be done in parallel. The main ideas behind our tightening approach are i) to re-use the same random numbers in the sharings of the parameters of a layer, and ii) to re-use the fresh random numbers in the gadgets across di erent neurons/features of the same layer. We show that, despite aggressively re-using random numbers, the gadgets of Section 5 can be made to satisfy 1-SNI.</p><p>We begin with tightening the number of randoms in a layer's parameter sharings and propose to use a single random number for all parameters of a layer. We focus on the DotProd gadget as both MLP and CNN compute their linear components first. Theorem 10 formalizes the idea behind using a single random number in sharings of parameters of a layer and a single random number in sharings of inputs. Its proof is similar to that of Theorem 4 after swapping the parameter (and input) shares for a single random number.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 10. The DotProd gadget with</head><p>We use Definition 2 with t = 1 and distinguish two cases. In the first, we consider the output shares, i.e., S = {&#200;c&#205; 0 } or S = {&#200;c&#205; 1 }, hence t &#213; = 0 and V = A 0 = A 1 = &#255;. We have:</p><p>Due to the random value r, both &#200;c&#205; 0 and &#200;c&#205; 1 are uniform and can be perfectly simulated by the empty sets of shares &#200;A 0 , A 1 &#205;.</p><p>In the second case we consider the intermediate shares, i.e., S = &#255;, hence t &#213; = 1. We show in Table <ref type="table">1</ref> all single intermediate shares forming set V (left), and the tuples of sets of input shares with cardinality at most 1 that can perfectly simulate them (right).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intermediate shares</head><p>Sets of input shares</p><p>Sets of input shares that can perfectly simulate the intermediate shares.</p><p>We now turn to tightening the fresh random numbers required for sharing layer operations. We discuss the Linear gadget of MLP; the goal is to use the same fresh random values across its neurons. More precisely, the DotProd, Add and Trunc gadgets, which Linear is composed of, can use the same fresh random values at each neuron. The reason is that their computations are completely parallel. We can therefore rewrite Algorithm 7 so that it requires R oe $ Z 3 L instead of R oe $ Z 3m L and re-uses the random values R 0 , R 1 and R 2 at each invocation of DotProd, Add and Trunc respectively, across the m neurons. A similar type of reasoning can be applied to the Activation gadget. All invocations of ReLU at di erent neurons can use the same random sources. Therefore we can rewrite Algorithm 8 so that it requires</p><p>Following the above observations, a theorem similar to Theorem 9 can be easily shown for the tightened version of the Linear and Activation gadgets.</p><p>We make two remarks. First, the bias parameter tightening does not compromise 1-SNI of the Add gadget as all the calls to Add are invoked in parallel, and the results of DotProd are masked by R 0 . Second, the results of each neuron of a layer depend on the same fresh random numbers. This satisfies the condition of Theorem 10 that both its inputs (i.e., the results of the previous layer) and its parameters can use a single fresh random value each. We can therefore apply the tightening strategy in a chain of layers.</p><p>Table <ref type="table">3</ref> compares the randomness requirements of the original gadgets described in Section 5 and their tightened versions described in this section, which re-use random numbers. The original layer gadgets (Linear, Activation) require a number of randoms linear in their parameter count. The tightened versions require a constant number of randoms, without compromising the 1-SNI security property of the gadgets. We have experimentally confirmed that the tightening has no discernible impact on the leakage behavior of our implementations. We have performed the same fixed-vsrandom t-test experiment as in Section 6 with PRNGon for the tightened implementation and for 1 million traces. Figure <ref type="figure">3 (d)</ref> shows that, as in the original version, there is no indication of power leakage.</p><p>The tightened Linear and Activation gadgets can be readily used in CNN inference. The MaxPool gadget, however, is not amenable to our tightening approach. The reason is that the output of MaxPool depends on the intermediate shares it introduces (which hold the max value up to the i-th iteration). Using the same random values for these intermediate shares would cancel</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Parameter sharings</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Neuron computation</head><p>Total</p><p>Table <ref type="table">3</ref>. Randomness requirements for a layer of n inputs and m neurons, invoking the Linear and Activation gadgets.</p><p>the masks and result in leakage. In practice, this is not problematic as MaxPool is commonly performed over 4 elements, so its randomness requirements do not depend on the number of CNN parameters. Multiple invocation of the MaxPool gadget can reuse the same random numbers, as their results are independent of each other.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Performance Evaluation</head><p>In this section we investigate the performance characteristics of our proposed gadgets and pursue the following questions: 1) How much overhead does the shared NN inference incur, compared to its unshared counterpart?, 2) How much does the shared NN inference impact the model's accuracy?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Operation Complexity</head><p>We describe first a simple computational cost model and use it to provide a complexity analysis for shared and unshared inference. The complexity model counts primitive operations of the same type; it ignores loads and stores from memory. It consist of 4 types, namely addition, multiplication, bitwise, and comparison operations. We apply the model at the layer level, and compare the operation counts per type, for the unshared and shared NN inference. We focus our analysis on FC layers and the MaxPool operation. The gadgets used in FC layers also appear in convolution layers, so their results transfer directly. Since our shared NN gadgets rely on fixed-point representation of the numeric values of the network, we use the fixed-point representation for unshared NN inference to perform a fair comparison. Each cell of Table <ref type="table">4</ref> shows how often each type of operation (shown in columns) occurs, for every unshared subcomponent (shown in rows) of. We have implemented the unshared ReLU as (x &gt; 0) &#250; x, to allow a fair comparison with the shared ReLU gadget. The truncation operation is implemented as a right-shift. Asymp- totically, for a FC layer, the cost of inference is linear in the product of its number of inputs and its width. The rows in Table <ref type="table">5</ref> are now the shared gadgets we presented in Section 5. The operation counts per type can be inferred from the corresponding gadget. For example, the Add gadget is invoked m times and performs 4 primitive additions; therefore, its entry in column ADD has a count of 4 &#8226; m. Notice that there is no CMP operation, as we perform comparison by 0 by means of the ReLU &#213; or Cmp gadgets. ReLU and MaxPool are the most complicated gadgets as they make use of the Mul, BooleanToArith and ArithToBoolean (the implementations of the conversion and comparison gadgets along with a detailed operation breakdown of all basic gadgets are shown in Appendices A, C. Asymptotically, the cost of shared NN inference is still linear in the layer size. However, the dominating factor m &#8226; n is multiplied by a constant factor of 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">Experimental Evaluation</head><p>We evaluate the performance of our unshared, shared and tightened implementations of NN inference in terms of runtime and accuracy. We investigate whether the operation overhead incurred by the sharing matches the slowdown predicted by our simple complexity model in the previous section, and whether sharing influences the accuracy of NN models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NN Models</head><p>We have trained NN of di erent classes (MLP, CNN, BNN) in PyTorch on the MNIST dataset, extracted their floating-point parameters, and converted them to fixed-point parameters for shared inference. We choose a collection of NN ranging from standard networks for MNIST <ref type="bibr">[30]</ref>, to models that were used in prior work related to secure inference <ref type="bibr">[10,</ref><ref type="bibr">35,</ref><ref type="bibr">53]</ref>. We scale up these networks by increasing the depth, width, and number of channels to further evaluate how our algorithms respond. The architectures we work with are common in the domain of HAR <ref type="bibr">[18]</ref>.</p><p>Inference Library We have implemented the shared algorithms for secure inference in a C library 1 .</p><p>The library includes the unshared, fixed-point counterparts of the algorithms, which we use as a baseline. For each NN model we provide four inference implementations. The unshared implementation is straightforward and implements ReLU (x) as (x &gt; 0) &#8226; x to match the behavior of the shared implementation. The shared implementations follow the algorithms of Section 5 and use a software PRNG (KECCAK-based <ref type="bibr">[4]</ref>) to generate the random numbers that they require for sharing the inputs and parameters, as well as those for layer operations. We have implemented three shared versions. shared-prngoff doesn't generate random values but uses constants in their place. The intention of this version is twofold: it is meant to separate the cost of sharing from the cost of randomness generation, and it serves as a baseline for comparing implementations with di erent randomness requirements. shared-original uses a number of randoms linear in the size of each layer and in the number of the model's parameters. Finally, shared-tightened uses a constant number of random numbers for each layer as well as for the parameters of each layer, as described in Table <ref type="table">3</ref> of Section 7.</p><p>We represent fixed-point numbers using signed 32bit integers. Our shared implementations substitute each fixed-point value by a tuple of shared values, effectively doubling the memory requirements of their unshared counterparts. We defer a space optimized version of our library for future work.</p><p>We recall from Section 5.2 that the Trunc gadget's probability of correctness is parametric to the value &#184;t &lt; &#184;= 32, and discuss how to choose values for &#184;t, &#184;d. &#184;t constrains the range of integers in Z L that can be shared. In detail, only the integers in [0, 2 &#184;t ) fi [2 &#184;&#8800;&#184;t , 2 &#184;d ) = [&#8800;1025, 1025). Accuracy Our goal is to evaluate the impact of sharing, and specifically the additional precision losses of shared truncation, in the accuracy performance of fixed-point NN inference. We use the accuracy % of unshared as our baseline and report the di erences in accuracy % with the baseline for the shared-original and shared-tightened implementations (columns Acc. " in Table <ref type="table">6</ref>). We observe small di erences in the range [&#8800;0.33, 0.19] % (negative numbers indicate losses).</p><p>Neither the class nor the architecture of NN suggest trends for the accuracy ". Tightening the randomness requirements has no significant impact on the accuracy " as the tightened algorithms perform the same number of truncation operations, but on di erent random values. This is reflected in the small di erences in the accuracy " between shared-original and sharedtightened. Optimizing the accuracy of the unshared DNN by means of training, and translating the models from floating-point to fixed-point arithmetic constitute orthogonal concerns. We omit results on the accuracy of shared-prng-off since its main goal is to provide runtime insights.</p><p>Runtime We measure the runtime of the four implementations of each DNN in a desktop Intel-i7-4770@3.40GHz with 16GB RAM. Working with x86 allows us to evaluate larger networks so we use it to investigate how our algorithms scale. We measure runtime in cycles using CPU time-stamp counters and report them in Table <ref type="table">6</ref>. Figure <ref type="figure">4</ref> shows the slowdown of the shared implementations relative to the unshared version. We note that in MLP models shared-prng-off is about 5.5 times slower than unshared across all networks, which matches the complexity analysis of the previous section. shared-original incurs slowdowns larger than an order of magnitude in all MLP. shared-tightened reverses the "trend": compared to unshared its slowdown is negligibly larger than that of shared-prng-off compared to unshared. We also report in Figure <ref type="figure">4</ref> the slowdown of a MLP executed in ARM Cortex-M4, as embedded devices are the nominal target of power analysis attacks. Due to memory limitations, the MLP executed in ARM expects 250 inputs. The depth and width of hidden layers as well as the underlying execution platform have no impact on the MLP implementations' slowdown as it remains the same across MLP models and platforms. The BNN implementation's slowdown follow that of the MLP models, which is expected as our library handles BNN by setting their weight parameters as either -1 or 1, and otherwise treats them as regular MLP.</p><p>CNN models demonstrate di erent slowdown characteristics when compared to MLP, but also between di erent CNN architectures. Broadly in CNN, sharedprng-off shows smaller slowdown that in MLP. We attribute this to the better data locality between subsequent calls to the dot product operation of a convolutional layer. The slowdown of shared-original in CNN is again smaller when compared to its equivalent of MLP (an order of magnitude). CNN models have a vastly smaller number of parameters compared to MLP and require less random values to mask them in sharedoriginal. shared-tightened in CNN has the same impact as in MLP, i.e., it makes the slowdown minimally larger than that of shared-prng-off. Finally, we see that between the two CNN models, the one with fewer FC layers (and total number of neurons) performs much faster, despite the fact that it's convolution layers have a much larger number of channels.</p><p>The above trends demonstrate that shared convolution layers perform invariably better than FC layers. We consider the reported slowdowns to be encouraging toward the general direction of application of masking in NN inference. The 2.3-5.6&#9674; slowdown of shared NN inference is lower than the slowdowns su ered by com-monly encountered masked cryptographic algorithms, e.g., masked AES <ref type="bibr">[42]</ref>, in software implementations.   <ref type="bibr">[51]</ref>. BoMaNet's masked circuits have been used to implement only MLP architectures and no formal guarantees of its masked circuits is provided. The constructions they provide are shown to experimentally satisfy 1-NI security, however in principle, we conjecture they could be refined to also satisfy 1-SNI.</p><p>BNN MLP inference utilizes the fact that parameters are binary values perform space and time optimizations. As an example, a dot product computation over n elements can be carried out as an XNOR operation between two n-bit registers. The constructions in BoMaNet are specific to such BNN characteristics and do not extend to non-binarized NN that have parameters of arbitrary values. While BoMaNet's masked adder and MSB extraction for the ReLU computation could be extended to non-binarized NN, implementing masked multiplication by means of LUT is prohibited when dealing with arbitrary weight values. On the flipside, while our library can implement NN inference for any value representable in fixed-point arithmetic, including BNN (by simply representing weights as fixedpoint 1 or -1), our gadgets are not optimized to exploit the binary weights (e.g., by performing dot product via a single XNOR). In terms of randomness requirements, BoMaNet does not exploit the layered structure of NN inference to reduce its randomness requirements. Each secret parameter requires a fresh random value to be masked and so does each non-linear operation. Finally, BoMaNet's secure circuits have been used to implement a 2-layer MLP and have not been applied to CNN.</p><p>BoMaNet uses additional flip-flops to partially account for known issues related to hardware glitches in CMOS circuitry <ref type="bibr">[33]</ref>. Our work considers platform specific leakage characteristics, e.g., glitches or transition based leakage <ref type="bibr">[1]</ref>, an orthogonal question.</p><p>Secure Multi-Party Computation (MPC) A large body of work uses secret sharing techniques for NN and machine learning models in the context of MPC, both for training <ref type="bibr">[28,</ref><ref type="bibr">35,</ref><ref type="bibr">53]</ref> and inference <ref type="bibr">[21,</ref><ref type="bibr">22,</ref><ref type="bibr">52]</ref>.</p><p>Masking and MPC target orthogonal security properties, e.g., a party in MPC might be the exclusive owner of a secret that must be protected against DPA. In masking there is a single party that holds all t+1 shares at all times, and an attacker observing at most t shares must learn no information about the secret. In MPC, there are at least two parties and a number of shares. An attacker, being either one of the parties or an external observer of the parties' communication, must at no point of hold all shares of the secret. Compared to masking, MPC incurs additional computational overhead due to communication and cryptographic assumptions which are not present in masking. Without including the communication overhead, prior work on secure inference <ref type="bibr">[35,</ref><ref type="bibr">53]</ref> reported total runtimes of 4.88 and 0.043 seconds per inference for a MLP with two hidden layers of 128 neurons. The cycles of our shared-tightened implementation (MLP 128-128 of Table <ref type="table">6</ref>) translate to runtime of 0.001753 seconds, i.e., at least an order of magnitude faster than the MPC based approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">Conclusion and Future Work</head><p>We have proposed a library of masked gadgets, resistant against DPA, that are the first to our knowledge that can be safely composed into full MLP and CNN constructions. We demonstrated the security of our construction formally and experimentally. We described how to reduce the random number requirements of our NN constructions, and showcased that they incur about a 2-5&#9674; slowdown to their unmasked counterparts, with minimal, if any, accuracy impairment. We set as future work to apply our library to more complex DNN targeting advanced datasets, as well as expanding to di erent architectures such as recurrent NN.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Auxiliary Shared Algorithms</head><p>Algorithms 10 [6], 11 <ref type="bibr">[6]</ref>, 12 are used by the gadgets that appear in the main paper. for j from 0 to m do 3:</p><p>&#200;X i,j &#205; := DotProd(&#200;K&#205;, &#200;I i&#8800;h:i+h,j&#8800;w:j+w &#205;, R 3(i+m&#8226;j) )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4:</head><p>&#200;Y i,j &#205; := Trunc(&#200;X i,j &#205;, R 3(i+m&#8226;j)+1 )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5:</head><p>&#200;Z i,j &#205; := Add(&#200;Y i,j &#205;, &#200;b i &#205;, R 3(i+m&#8226;j)+2 )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proofs of Lemmas and Theorems</head><p>We provide here the proofs for the lemmas and theorems stated in the main paper. We also provide proofs of security for the auxiliary algorithms of Appendix A. The numbers of the theorems and lemmas correspond to the numbers of theorems and lemmas in the main paper. Proof : We distinguish two cases. In the first, we consider the output shares, i.e., S = {&#200;y&#205; 0 } or S = {&#200;y&#205; 1 }, hence t &#213; = 0 and V = A 0 = &#255;, where &#200;y&#205; 0 = &#200;BooleanToArith(&#200;w&#205; B , R 2 , R 3 )&#205; 0 and &#200;y&#205; 1 = &#200;BooleanToArith(&#200;w&#205; B , R 2 , R 3 )&#205; 1 . From Lemma 6 we have that both &#200;y&#205; 0 , &#200;y&#205; 1 can be simulated by &#200;A 0 &#205;. In the second case, we consider the intermediate shares, i.e., S = &#255;, hence t &#213; = 1. From Lemma 6 we have that both &#200;z&#205; 0 , &#200;z&#205; 1 can be simulated by &#200;&#255;&#205;. Consequently, &#200;w&#205; 0 , &#200;w&#205; 1 can also be simulated by &#200;&#255;&#205; as they only depend on shares of &#200;z&#205;.  Proof : From Lemma 6 and Lemma 14, <ref type="bibr">Lemma 15</ref> we have that the shares &#200;y i &#205;, &#200;s i &#205; i , &#200;s &#213; i &#205;, &#200;w i &#205;, (i oe {0, . . . , n &#8800; 1}) can be perfectly simulated by the empty set of shares. We distinguish two cases: </p></div></body>
		</text>
</TEI>
