<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Improving Private Random Forest Prediction Using Matrix Representation</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI 2025</publisher>
				<date>02/25/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10633720</idno>
					<idno type="doi"></idno>
					
					<author>Arisa Tajima</author><author>Joie Wu</author><author>Amir Houmansadr</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce a novel matrix representation for differentially private training and prediction methods tailored to random forest classifiers. Our approach involves representing each root-to-leaf decision path in all trees as a row vector in a matrix. Similarly, inference queries are represented as a matrix. This representation enables us to collectively analyze privacy across multiple trees and inference queries, resulting in optimal DP noise allocation under the Laplace Mechanism. Our experimental results show significant accuracy improvements of up to 40% compared to state-of-the-art methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Recent advances in machine learning services empower users to make inference queries using deployed models through black-box APIs. However, the outcomes of these predictions raise concerns about potential information disclosure regarding the training data, making them vulnerable to attacks such as membership inference <ref type="bibr">(Shokri et al. 2017;</ref><ref type="bibr">Hu et al. 2022</ref>). To address this, differential privacy (DP) <ref type="bibr">(Dwork and Lei 2009)</ref> has emerged as a promising solution, aiming to mitigate the risk of model predictions revealing sensitive information.</p><p>Two primary DP-based approaches exist for private prediction: DP training and DP prediction <ref type="bibr">(Ponomareva et al. 2023)</ref>. DP training <ref type="bibr">(Chaudhuri, Monteleoni, and Sarwate 2011;</ref><ref type="bibr">Jayaraman and Evans 2019;</ref><ref type="bibr">Abadi et al. 2016;</ref><ref type="bibr">Fletcher and Islam 2019)</ref> integrates DP noise into model parameters, ensuring that predictions from privately trained models do not reveal information about the underlying training data. By contrast, DP prediction <ref type="bibr">(Dwork and Feldman 2018;</ref><ref type="bibr">Nissim, Raskhodnikova, and Smith 2007;</ref><ref type="bibr">Bassily, Thakkar, and Guha Thakurta 2018)</ref> introduces perturbations to the predictions of non-private models. Unfortunately, despite the widespread adoption of DP in machine learning, a significant utility-privacy gap persists between DP-enabled machine learning and its non-private counterparts. DP-enabled Random Forests: In this paper, we focus on random forest classifiers, a versatile machine learning algorithm known for its strong performance on tabular data such as demographic surveys and medical records. 
We address the challenge of private prediction through both DP training and prediction approaches while maintaining high accuracy.</p><p>DP random forest classifiers have been extensively studied. Past approaches primarily differ in their use of base tree algorithms and node splitting functions to balance privacy budget usage and accuracy <ref type="bibr">(Patil and Singh 2014;</ref><ref type="bibr">Jagannathan, Pillaipakkamnatt, and Wright 2009;</ref><ref type="bibr">Fletcher and Islam 2017, 2019;</ref><ref type="bibr">Hou et al. 2019)</ref>. Regarding the tree-building process, there are greedy approaches, where optimal split points are determined <ref type="bibr">(Hou et al. 2019;</ref><ref type="bibr">Patil and Singh 2014)</ref>, and random approaches, where split attributes are chosen randomly to save privacy budget for leaf nodes <ref type="bibr">(Fletcher and Islam 2017;</ref><ref type="bibr">Jagannathan, Pillaipakkamnatt, and Wright 2009;</ref><ref type="bibr">Holohan et al. 2019</ref>). The latter method, which we focus on, is known as random decision trees. It is predominantly adopted in recent works for its consistently strong performance; see the comprehensive survey by Fletcher et al. <ref type="bibr">(Fletcher and Islam 2019)</ref>.</p><p>While distributing the privacy budget is crucial for DP random forests, many approaches allocate it equally among trees <ref type="bibr">(Patil and Singh 2014;</ref><ref type="bibr">Jagannathan, Pillaipakkamnatt, and Wright 2009;</ref><ref type="bibr">Fletcher and Islam 2017;</ref><ref type="bibr">Hou et al. 2019)</ref>. However, this per-tree privacy analysis faces challenges when a large number of trees are used, resulting in low accuracy despite the ensemble learning principle that training more trees generally improves performance.</p><p>To our knowledge, no prior work has addressed DP prediction techniques specifically for random forests. 
Alternatively, a well-known model-agnostic DP prediction method, the subsample-and-aggregate framework, allocates the privacy budget equally to each inference query <ref type="bibr">(Dwork and Feldman 2018;</ref><ref type="bibr">Nissim, Raskhodnikova, and Smith 2007)</ref>. However, this heuristic allocation results in poor accuracy when many inference queries are required (van der Maaten and Hannun 2020).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Main Contributions</head><p>Existing methods for DP training and DP prediction add suboptimal levels of noise, hampering utility. Our primary contribution lies in addressing the inefficiencies of current methods by translating random forest training and prediction into matrix multiplication. This new representation allows us to optimize solutions for training models and predicting class labels under DP, introducing finer-grained noise than the state-of-the-art.</p><p>We introduce DP batch training, which allocates the privacy budget across an ensemble of trees. The key insight behind our approach is the observation that some leaf values can be expressed as linear combinations of others. We optimize budget allocation based on leaf values along decision paths, preserving high accuracy even as the number of trees learned increases. Our allocation strategy is data-independent, unlike a recent method that uses weighted privacy budget allocation for trees, which incurs budget expenditure <ref type="bibr">(Li et al. 2022)</ref>. Thus, our approaches offer privacy-free hyperparameter tuning, a significant advantage over existing methods including DP-SGD <ref type="bibr">(Abadi et al. 2016)</ref>.</p><p>Additionally, we introduce DP batch prediction, which takes into account prediction results collectively rather than isolating individual queries. This technique optimizes DP noise addition across a set of inference queries for majority voting in an ensemble. Batch privacy analysis closes the gap between DP prediction and DP training performance caused by the former's limit on inference queries. Our results show that this approach maintains good accuracy, even as the number of inference queries increases.</p><p>We validate our methods on real-world datasets, demonstrating a significant accuracy improvement of up to 40% compared to existing approaches. 
When tested on our Car dataset, both our DP batch training and DP batch prediction approaches achieve 85% accuracy under a privacy budget of &#1013; = 2 when predicting 345 test samples on 128 random decision trees, exceeding the accuracy of existing solutions by 30%. Finally, our matrix representation can be extended to work within the subsample-and-aggregate framework, yielding an improvement of up to 50%. Our code and technical appendix will be available at: <ref type="url">https://github.com/arisa77/mrf-public.git</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background: Data and Schema</head><p>Schema. Consider a schema consisting of d attributes: {A 1 , A 2 , . . . , A d }. Each attribute A i has a finite domain &#934;(A i ) of size n i . The full domain size is then ∏_{i=1}^{d} n i . For clarity, we assume the first d − 1 attributes represent feature attributes, with the feature domain denoted as X = &#934;(A 1 ) &#215; · · · &#215; &#934;(A d−1 ) of size n = ∏_{i=1}^{d−1} n i . The last attribute is the target class attribute, denoted as Y, with k possible values. We distinguish X as the (d − 1)-dimensional feature space and Y as the 1-dimensional target space.</p><p>Example 1 (Tennis Schema) We use a simplified tennis dataset with attributes {outlook, windy, play} and domains &#934;(outlook) = {sunny, overcast, rainy}, &#934;(windy) = {false, true}, and &#934;(play) = {no, yes}. The features are 'outlook' and 'windy', and the target is 'play'. The feature domain X contains all tuples from &#934;(outlook) &#215; &#934;(windy), resulting in X = {(sunny, false), (sunny, true), (overcast, false), (overcast, true), (rainy, false), (rainy, true)}. The target domain is Y = {no, yes}.</p><p>Data Matrix. We have a sensitive training dataset consisting of N individuals, represented as an N &#215; d matrix X.</p><p>Each row X i is an individual record and X i,j &#8712; &#934;(A j ) is the value of attribute A j . We often represent this dataset X as a 2-way contingency table of feature variables X by target variables Y, denoted as a matrix D X &#8712; R n&#215;k . We may drop the subscript X and write D if the context is clear. This matrix captures the frequency of every possible tuple (x, y) &#8712; X &#215; Y in X. Although the frequency representation is favored for mathematical convenience, our implementation uses the record-by-record format for efficiency. Example 2 (Tennis Dataset) Consider the tennis dataset with six samples shown in Figure <ref type="figure">1a</ref>. The corresponding 6 by 2 frequency matrix D is illustrated in Figure <ref type="figure">1c</ref>. 
For instance, D 1,1 = 1 indicates a single sample with features (sunny, false) and a target class of no.</p></div>
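<p>To make the representation concrete, the contingency table D can be built in a few lines of Python. This is our own illustrative sketch: the records, attribute lists, and helper names below are invented for the tennis example and are not the paper's code, which the authors note works record-by-record for efficiency.</p>

```python
import numpy as np

# Tennis schema: feature domain X = outlook x windy (6 tuples), target play = {no, yes}.
OUTLOOK = ["sunny", "overcast", "rainy"]
WINDY = [False, True]
PLAY = ["no", "yes"]

# Six illustrative training records (outlook, windy, play).
records = [
    ("sunny", False, "no"),
    ("sunny", True, "no"),
    ("overcast", False, "yes"),
    ("overcast", True, "yes"),
    ("rainy", False, "yes"),
    ("rainy", True, "no"),
]

def feature_index(outlook, windy):
    """Row index of the tuple (outlook, windy) in the flattened feature domain."""
    return OUTLOOK.index(outlook) * len(WINDY) + WINDY.index(windy)

# n x k contingency table D: D[x, y] counts records with feature tuple x and class y.
D = np.zeros((len(OUTLOOK) * len(WINDY), len(PLAY)), dtype=int)
for o, w, p in records:
    D[feature_index(o, w), PLAY.index(p)] += 1
```

<p>Flattening (outlook, windy) in row-major order fixes the index convention; with these records, D[0, 0] = 1 corresponds to a single (sunny, false, no) sample, matching Example 2.</p>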
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Differential Privacy</head><p>Differential privacy (DP) <ref type="bibr">(Dwork, Roth et al. 2014)</ref> is the de facto standard for data privacy, formally defined as follows. Definition 1 (Differential Privacy) A randomized mechanism M satisfies &#1013;-DP if for any two neighboring datasets X, X &#8242; &#8712; D and for any subset of outputs S &#8838; Range(M) it holds that: Pr[M(X) &#8712; S] ≤ e^&#1013; · Pr[M(X &#8242;) &#8712; S].</p><p>DP offers valuable properties such as the post-processing and composition theorems <ref type="bibr">(Dwork, Roth et al. 2014)</ref>, which we utilize in this work. Following the literature on DP random forests, we focus on employing the Laplace mechanism <ref type="bibr">(Fletcher and Islam 2019;</ref><ref type="bibr">Holohan et al. 2019)</ref>.</p><p>Matrix Mechanism. The Matrix Mechanism <ref type="bibr">(Li et al. 2015)</ref> is a variant of the Laplace mechanism designed for answering input queries defined by matrix W by using a set of underlying queries A such that there exists some solution matrix X for which W = XA. A is referred to as the strategy matrix. The naive case where A = W corresponds to the vectorized Laplace mechanism. Heuristics for choosing the best strategy matrix have been widely studied <ref type="bibr">(Xiao, Gardner, and Xiong 2012;</ref><ref type="bibr">Xiao, Wang, and Gehrke 2010)</ref>. Definition 2 (Matrix Mechanism) Given an l by n data-independent query matrix W, an m by n data-independent strategy matrix A (possibly induced by W), and an n by k data matrix D X , the following Matrix Mechanism satisfies &#1013;-DP: M(D X ) = WD X + WA⁺Lap(||A|| 1 /&#1013;) m&#215;k .</p><p>The sensitivity of a query matrix is defined as its L1 norm ||A|| 1 . A⁺ denotes the pseudoinverse of A. The original work utilizes a data vector instead of a data matrix, although both expressions are essentially interchangeable. The mean-squared error of query answers to W under the strategy matrix A is given as follows, independent of the actual data input. || &#8226; || F denotes the Frobenius norm. 
Definition 3 (Error of strategy query answering) Given query matrix W and strategy matrix A, the total mean squared error of the Matrix Mechanism is given as: Err &#1013; (W, A) = (2k/&#1013;²) ||A|| 1 ² ||WA⁺|| F ².</p><p>The state-of-the-art minimizes this error through parameterized optimization, identifying a (p + n) &#215; n strategy matrix A by finding a p &#215; n parameter matrix <ref type="bibr">(McKenna et al. 2018)</ref>. We treat the optimization routine as a black box, denoted as A &#8592; OPT p (W), where p is a hyperparameter.</p><p>[Figure 1: the tennis running example, including (a) the training data, (c) the frequency matrix D, (d) the inference queries and query matrix Q, and (f) the weighted query matrix.]</p></div>
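<p>The mechanism and its data-independent error can be sketched as follows, assuming the standard Laplace-mechanism form. Here matrix_mechanism and expected_error are our own hypothetical names, the factor k in the error accounts for the k columns of the data matrix, and a real deployment would obtain A from the OPT routine rather than the naive choice A = W shown in the demo:</p>

```python
import numpy as np

def matrix_mechanism(W, A, D, eps, rng):
    """Answer WD under eps-DP: WD plus W A_pinv times Lap(||A||_1/eps) noise."""
    sens = np.abs(A).sum(axis=0).max()          # L1 sensitivity ||A||_1 (max column sum)
    noise = rng.laplace(scale=sens / eps, size=(A.shape[0], D.shape[1]))
    return W @ D + W @ np.linalg.pinv(A) @ noise

def expected_error(W, A, eps, k=1):
    """Data-independent total MSE of the mechanism (Definition 3 sketch)."""
    sens = np.abs(A).sum(axis=0).max()
    return 2 * k * sens**2 / eps**2 * np.linalg.norm(W @ np.linalg.pinv(A), "fro") ** 2

# Naive vectorized Laplace mechanism corresponds to the strategy choice A = W.
W = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
ans = matrix_mechanism(W, W, np.array([[2.0], [3.0], [5.0]]), eps=1.0,
                       rng=np.random.default_rng(0))
```

<p>Because the error depends only on W, A, and the budget, it can be evaluated before any data is touched, which is what makes the later privacy-free tuning possible.</p>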
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Random Decision Trees</head><p>Random decision tree models were initially proposed as efficient classifiers for large databases in non-private settings <ref type="bibr">(Fan et al. 2003;</ref><ref type="bibr">Geurts, Ernst, and Wehenkel 2006)</ref>. They have since become a dominant algorithm for state-of-the-art DP random forests due to their data-independent tree construction <ref type="bibr">(Jagannathan, Pillaipakkamnatt, and Wright 2009;</ref><ref type="bibr">Holohan et al. 2019;</ref><ref type="bibr">Fletcher and Islam 2017)</ref>. Unlike traditional decision trees such as ID3 and CART, random decision trees are constructed by randomly selecting test attributes for nodes. Because this attribute selection occurs before examining any training data, there is no need to allocate a privacy budget for finding optimal split points.</p><p>Base Classifiers. We consider &#964; random decision trees each with depth h, denoted as F = {T 1 , . . . , T &#964; }. We may denote F S as those built over a specific feature subset S.</p><p>We may build q ensembles of such random decision trees, each associated with a random feature subset, denoted as F = {F Sj } q j=1 . The tree structure, including random feature selection, is pre-computed using schema information alone. Training data X is then used to calculate class count distributions at all leaves. Each tree is fitted on the same training dataset unless stated otherwise. We provide the full algorithm of random decision trees in the technical appendix. During prediction, the trained random decision trees F take a sample s &#8712; X and output a predicted label y &#8712; Y, denoted as y &#8592; F (s). Each tree casts a vote, and the label with the most votes is chosen as the final prediction.</p><p>Private Prediction and Problem Statement. 
Given b inference queries Q &#8712; X b , our goal is to release prediction results y &#8592; F (Q) (abusing the notation above) under DP to limit information leakage about the training data X from the predictions. We will introduce formulations for both DP training and DP prediction. DP training is achieved by making the random forest F DP. Thus, the subsequent prediction algorithm maintains privacy due to the post-processing theorem. DP prediction, on the other hand, is achieved by adding DP noise to the predictions of the non-DP random forest F .</p></div>
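<p>Because attribute selection uses only the schema, the tree skeleton can be built before any data is seen. Below is a minimal sketch of one multi-way random decision tree; random_tree_paths is our own hypothetical helper, and the paper's full algorithm appears in its technical appendix:</p>

```python
import itertools
import random

# Schema for the tennis example; tree construction touches no training data.
ATTRS = {"outlook": ["sunny", "overcast", "rainy"], "windy": [False, True]}

def random_tree_paths(attrs, depth, rng):
    """Root-to-leaf paths of one multi-way random tree, as {attribute: value} dicts."""
    order = rng.sample(list(attrs), k=min(depth, len(attrs)))  # random split attributes
    return [dict(zip(order, combo))
            for combo in itertools.product(*(attrs[a] for a in order))]

paths = random_tree_paths(ATTRS, depth=2, rng=random.Random(7))
```

<p>Each returned path fixes one value per chosen attribute, which is exactly the kind of predicate encoded as a row of the decision path matrix introduced next.</p>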
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Matrix Random Forests</head><p>We introduce a novel approach to representing random forest training and prediction using matrices, which we call Matrix Random Forest (M-RF). This approach is essential in generating the optimal amount of DP noise, as we will detail later.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inference Query Matrix</head><p>A sample s &#8712; X is represented as an indicator vector q of length n over X , where the cell at the corresponding tuple is set to 1. Otherwise, it is set to 0. Thus, a set of b inference queries {s 1 , . . . , s b } &#8712; X b can be expressed as a b &#215; n matrix Q = (q 1 , . . . , q b )&#8890; whose i-th row is the indicator vector q i &#8890; of s i . In contrast to the data matrix D, the inference query matrix Q is not treated as sensitive. Example 3 (Inference Query) Figure <ref type="figure">1d</ref> shows three unlabeled samples and the corresponding 3 &#215; 6 inference query matrix Q. For instance, the first row of Q encodes the tuple (sunny, true) from the dataset.</p></div>
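<p>A minimal sketch of this one-hot encoding over the tennis feature domain; query_matrix and DOMAIN are our own illustrative names, and the three samples mirror Example 3:</p>

```python
import numpy as np

OUTLOOK = ["sunny", "overcast", "rainy"]
WINDY = [False, True]
DOMAIN = [(o, w) for o in OUTLOOK for w in WINDY]   # flattened feature domain X, n = 6

def query_matrix(samples):
    """Encode unlabeled samples as one-hot rows over the feature domain (b x n)."""
    Q = np.zeros((len(samples), len(DOMAIN)), dtype=int)
    for i, s in enumerate(samples):
        Q[i, DOMAIN.index(s)] = 1
    return Q

Q = query_matrix([("sunny", True), ("rainy", False), ("overcast", True)])
```
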
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decision Path Query Matrix for Training</head><p>We introduce the decision path matrix that encodes every root-to-leaf decision path in random decision trees. Each decision path represents a predicate P (x) over the feature domain X . This predicate can be represented as a binary vector of length n = |X |, denoted as p, where p i = 1 if the corresponding element in X is in the truth set (the set of all elements in X that make P (x) true); otherwise p i = 0. For instance, if P (x) stands for "outlook is sunny" in Example 1, the truth set is {(sunny, false), (sunny, true)} and thus the corresponding predicate results in p = (1, 1, 0, 0, 0, 0).</p><p>A predicate is used in a counting query to evaluate the number of instances in the dataset satisfying its corresponding decision path. For a single predicate p, the expression pD &#8712; R k will yield instance counts for each class label.</p><p>Thus, a set of o decision paths can be organized into an o by n matrix T = (p 1 , . . . , p o )&#8890;, where each row corresponds to a predicate. The matrix encodes the data-independent tree structure. The class counts at every leaf node can be computed as C = TD &#8712; R o&#215;k . Each C i,j represents the frequency of data instances that reach the i-th leaf node whose class label is y j &#8712; Y; see Figure <ref type="figure">1c</ref>. We may write C D to emphasize its dependence on D.</p></div>
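<p>The predicate vectors and the leaf-count product C = TD can be sketched directly. The depth-1 tree and frequency matrix below are illustrative stand-ins of our own, not the paper's Figure 1 data, and predicate_row is a hypothetical helper:</p>

```python
import numpy as np

OUTLOOK = ["sunny", "overcast", "rainy"]
WINDY = [False, True]
DOMAIN = [(o, w) for o in OUTLOOK for w in WINDY]   # feature domain X, n = 6

def predicate_row(outlook=None, windy=None):
    """Indicator vector over X for a decision-path predicate (None = unconstrained)."""
    return np.array([int((outlook is None or o == outlook) and
                         (windy is None or w == windy)) for o, w in DOMAIN])

p_sunny = predicate_row(outlook="sunny")            # "outlook is sunny"

# Decision path matrix T of a depth-1 tree splitting on outlook (one row per leaf),
# and leaf class counts C = TD for an illustrative frequency matrix D.
T = np.stack([predicate_row(outlook=o) for o in OUTLOOK])
D = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])
C = T @ D                                           # o x k class counts per leaf
```

<p>The row p_sunny reproduces the predicate (1, 1, 0, 0, 0, 0) from the running example; each row of C tallies the classes of the records routed to that leaf.</p>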
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decision Path Query Matrix for Prediction</head><p>For prediction, we construct the decision path matrix for individual instances. Given the decision path matrix T &#8712; R o&#215;n and the inference query matrix Q &#8712; R b&#215;n introduced earlier, their matrix multiplication W = QT &#8890; &#8712; R b&#215;o produces the specific decision paths for each sample. Specifically, W i,j = 1 if the i-th sample reaches the j-th leaf node, i.e., it satisfies the decision path defined by T j ; otherwise, W i,j = 0. Thus, each row vector W i encodes the decision paths taken by sample i across the random decision trees.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Weight Voting Matrix Operation</head><p>We introduce the matrix operations underlying the majority voting process for random decision trees. This method, which we call weight voting, involves aggregating leaf values across trees to make predictions. The aggregated votes represent a weighted count of data points that agree on the predicted class for each sample.</p><p>Given random decision trees, let T denote the decision path matrix and C D the leaf value matrix. For an inference query matrix Q, the weight votes for each target class are calculated using the matrix product: V = QT &#8890; C D &#8712; R b&#215;k . Here V i,j represents the number of training instances that agree with the class label y j &#8712; Y for the i-th inference query. When C = TD, this operation can be expressed in terms of D as: V = QT &#8890; TD. Here, QT &#8890; T represents weighted queries that aggregate tuples in the feature domain; see Figure <ref type="figure">1f</ref> for an example matrix.</p></div>
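<p>Weight voting then reduces to two matrix products followed by an argmax. A toy sketch with one depth-1 tree and one query; all matrices are illustrative choices of ours:</p>

```python
import numpy as np

# One depth-1 tree splitting on outlook; its three leaves as predicate rows of T.
T = np.array([[1, 1, 0, 0, 0, 0],     # leaf: outlook = sunny
              [0, 0, 1, 1, 0, 0],     # leaf: outlook = overcast
              [0, 0, 0, 0, 1, 1]])    # leaf: outlook = rainy
D = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])  # illustrative counts
C = T @ D                              # leaf class counts

Q = np.array([[0, 1, 0, 0, 0, 0]])    # one query: the tuple (sunny, true)
V = Q @ T.T @ C                        # b x k weighted votes, equals Q T^T T D here
labels = V.argmax(axis=1)              # index 0 = "no", 1 = "yes"
```

<p>The query lands in the sunny leaf, so its weighted votes are that leaf's class counts; with more trees, V sums the counts of every leaf the sample reaches.</p>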
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DP M-RF Training and Prediction</head><p>In this section, we present novel techniques for DP training and DP prediction of random forests. Our approach leverages the matrix representation formulated in the previous sections, enabling the derivation of optimal solutions that significantly improve the accuracy of DP random forests. Optimizing Decision Paths. Utilizing our matrix representation, leaf values are computed as the matrix product C = TD, where T represents root-to-leaf paths in random decision trees. This serves as a query matrix, where each row vector counts the number of training data instances satisfying the classification rule of the corresponding leaf node. Many existing DP approaches apply the Laplace mechanism by adding Laplace noise to the leaf values with an equal privacy budget allocation: TD + Lap(&#964;/&#1013;) o&#215;k , assigning each tree a budget of &#1013;/&#964;. However, this method can lead to suboptimal accuracy if decision paths in one tree are linearly dependent on those in others, wasting the privacy budget.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DP Batch Training Approach</head><p>To address this, we optimize leaf-level counts by finding an optimal strategy A &#8592; OPT(T). By passing the decision path query matrix T to the optimization procedure OPT, we derive an alternative strategy matrix A &#8712; R m&#215;n that minimizes the error in estimating class counts at leaf nodes, i.e., Err(T, A). Under this strategy, &#1013;-DP leaf values are estimated as C = TD + TA⁺Lap(||A|| 1 /&#1013;) m&#215;k .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DP Matrix Random Forest Training</head><p>Algorithm 1 presents our approach for training DP random forests. The decision path matrix T is precomputed by building random decision trees. With the data matrix D and the decision path matrix T, we estimate leaf values under DP, following our above optimization approach. The resulting DP leaf values have the following error, a direct implication from Definition 3. Finally, the most frequent class is assigned as the label for each leaf node based on the noisy leaf values.</p><p>Algorithm 1: DP M-RF Training. Input: data matrix D, decision path matrix T, privacy budget &#1013;. Output: DP leaf values and labels. 1: A &#8592; OPT(T). 2: C = TD + TA⁺Lap(||A|| 1 /&#1013;) m&#215;k . 3: label i = arg max 1&#8804;j&#8804;k C i,j , &#8704; leaf i.</p><p>Theorem 4 Leaf values C resulting from Algorithm 1 have MSE of Err &#1013; (T, A).</p><p>Theorem 5 Algorithm 1 satisfies &#1013;-DP.</p><p>The strategy matrix selection in Line 1 incurs no privacy loss since it only depends on the data-independent decision path matrix T. The computation on leaf values in Line 2 satisfies &#1013;-DP, following the privacy of the Matrix Mechanism. Finally, from the post-processing theorem, the computation of leaf labels in Line 3 does not degrade privacy.</p></div>
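<p>The training procedure can be sketched end-to-end in a few lines. Here opt_strategy is a stand-in for the black-box OPT; we fall back to the naive strategy A = T, which recovers the vectorized Laplace mechanism, while a refined strategy would plug into the same code. All matrices and the large demonstration budget are illustrative:</p>

```python
import numpy as np

def opt_strategy(T):
    """Stand-in for the black-box OPT(T); the naive choice A = T recovers the
    vectorized Laplace mechanism (a real strategy would be refined)."""
    return T.astype(float)

def dp_batch_train(T, D, eps, rng):
    """Training sketch: noisy leaf values TD + T A_pinv Lap(||A||_1/eps), then labels."""
    A = opt_strategy(T)                                   # data-independent selection
    sens = np.abs(A).sum(axis=0).max()                    # L1 sensitivity ||A||_1
    noise = rng.laplace(scale=sens / eps, size=(A.shape[0], D.shape[1]))
    C_noisy = T @ D + T @ np.linalg.pinv(A) @ noise       # Matrix Mechanism step
    return C_noisy.argmax(axis=1)                         # labels via post-processing

T = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]])
D = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 1]])
labels = dp_batch_train(T, D, eps=1e6, rng=np.random.default_rng(0))
```

<p>Only the noisy-count step consumes budget; the strategy choice and the final argmax are free by data independence and post-processing, mirroring the proof of Theorem 5.</p>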
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DP Batch Prediction Approach</head><p>We introduce our DP random forest prediction approach, referred to as batch prediction. Our approach predicts labels for given inference queries under DP while maintaining high accuracy. Unlike DP batch training, which optimizes leaf values for classifiers, DP batch prediction focuses on optimizing prediction results for specified inference queries.</p><p>Algorithm 2: DP M-RF Prediction. Input: data matrix D, decision path matrix T, inference query matrix Q, privacy budget &#1013;. Output: DP predicted label vector &#7929;. 1: A &#8592; OPT(QT &#8890; T). 2: &#7804; = QT &#8890; TD + QT &#8890; TA⁺Lap(||A|| 1 /&#1013;) m&#215;k . 3: &#7929; i = arg max 1&#8804;j&#8804;k &#7804; i,j , &#8704; inference query i.</p><p>Optimizing Decision Paths for Prediction. The random forest prediction problem using weight voting can be expressed as the matrix product V = QT &#8890; TD, where Q denotes the inference query matrix and T represents decision paths in the trees. We consider QT &#8890; T as a query matrix where each row vector aggregates instance counts (i.e., weighted votes) at predicted leaf nodes across all trees for the corresponding inference query. Using the optimization procedure OPT, we find an m &#215; n strategy matrix A &#8592; OPT(W), where W = QT &#8890; T, that minimizes the error of answering votes for inference queries, i.e., Err(W, A). Under this strategy, &#1013;-DP vote counts are estimated as: &#7804; = QT &#8890; TD + QT &#8890; TA⁺Lap(||A|| 1 /&#1013;) m&#215;k .</p><p>DP Matrix Random Forest Prediction. Algorithm 2 shows our DP prediction approach for random forests. Given non-private random decision trees, the resulting tree structure is encoded into a decision path matrix T. Training data and inference queries are similarly transformed into matrices D and Q. We then estimate votes tallied across the forest for every inference query under DP, following the above optimization approach. 
The resulting DP vote counts have the following error, a direct implication from Definition 3. Lastly, class labels with the majority of noisy votes are returned as the final prediction results.</p><p>Theorem 6 The vote counts &#7804; in Algorithm 2 have MSE of Err &#1013; (QT &#8890; T, A).</p><p>Theorem 7 Algorithm 2 satisfies &#1013;-DP. Similar to the privacy proof of Theorem 5, the privacy of Algorithm 2 primarily follows from the privacy of the Matrix Mechanism.</p></div>
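<p>A sketch of the batch prediction steps, with the naive strategy A = W = QT⊺T standing in for the optimized one; the matrices and the large test budget are illustrative only:</p>

```python
import numpy as np

def dp_batch_predict(Q, T, D, eps, rng):
    """Prediction sketch; A = W = Q T^T T stands in for the optimized strategy."""
    W = Q @ T.T @ T                                      # weighted vote queries
    A = W.astype(float)                                  # naive stand-in for OPT(W)
    sens = np.abs(A).sum(axis=0).max()                   # ||A||_1
    noise = rng.laplace(scale=sens / eps, size=(A.shape[0], D.shape[1]))
    V_noisy = W @ D + W @ np.linalg.pinv(A) @ noise      # eps-DP vote counts
    return V_noisy.argmax(axis=1)                        # majority label per query

Q = np.array([[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]])  # (sunny,true), (rainy,false)
T = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]])
D = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 1]])
y = dp_batch_predict(Q, T, D, eps=1e6, rng=np.random.default_rng(1))
```

<p>Because the noise is calibrated to the whole batch of vote queries at once, the sensitivity does not grow one-for-one with the number of queries as in per-query splitting.</p>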
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimizing Subsample-and-Aggregate Framework</head><p>We enhance the subsample-and-aggregate method (Dwork and Feldman 2018) using our batch prediction technique. In this approach, non-private random decision trees are fitted on disjoint training datasets to estimate weighted votes V = QT &#8890; C under DP. Since C &#8800; TD, the above DP batch prediction is not directly applicable. To address this, our approach identifies an m &#215; o strategy matrix A &#8592; OPT(QT &#8890; ) that minimizes Err &#1013; (QT &#8890; , A). The DP vote counts are then computed as &#7804; = QT &#8890; C + QT &#8890; A⁺Lap(||A|| 1 /&#1013;) m&#215;k . This method ensures query sensitivity remains independent of the number of inference queries, unlike the existing method, which allocates an equal budget among queries, significantly degrading accuracy with more queries. Further details appear in the technical appendix.</p></div>
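<p>This variant differs only in that the leaf values C come from disjoint fits, so the strategy is derived from QT⊺ and applied to C directly. A sketch with the naive strategy A = QT⊺ standing in for OPT; dp_sa_predict and all matrices are our own illustrative names, and the sensitivity calibration shown is the simplified ||A||_1 form from the displayed formula, not the paper's full analysis:</p>

```python
import numpy as np

def dp_sa_predict(Q, T, C, eps, rng):
    """Subsample-and-aggregate sketch: noisy votes QT'C + QT' A_pinv Lap(||A||_1/eps)."""
    W = (Q @ T.T).astype(float)                     # b x o vote-aggregation queries
    A = W.copy()                                    # naive stand-in for OPT(QT')
    sens = np.abs(A).sum(axis=0).max()              # ||A||_1
    noise = rng.laplace(scale=sens / eps, size=(A.shape[0], C.shape[1]))
    V_noisy = W @ C + W @ np.linalg.pinv(A) @ noise
    return V_noisy.argmax(axis=1)

Q = np.array([[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]])
T = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]])
C = np.array([[2, 0], [0, 2], [0, 2]])              # leaf counts from disjoint fits
y = dp_sa_predict(Q, T, C, eps=1e6, rng=np.random.default_rng(2))
```
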
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Our Full DP M-RF Framework</head><p>Our complete framework for private prediction with random decision trees is shown in Algorithm 3, capturing both DP batch training and DP batch prediction. The framework involves: 1) building trees, 2) optimizing strategies, 3) fitting training data, and 4) making predictions, detailed below.</p><p>We construct q ensembles of random decision trees, where each ensemble F i is built over a random feature subset S i of size d̂. Our method involves performing strategy optimizations within each ensemble over the schema S i , allowing potentially expensive computations to be performed in parallel across ensembles. The best strategy matrix A (i) is chosen from OPT for generating refined DP noise. The query matrix W (i) is instantiated with T (i) for batch training or Q(T (i) ) &#8890; T (i) for batch prediction, recalling the prior sections. Here, T (i) denotes the decision path matrix associated with ensemble i. The optimality of our approach comes from the theoretical optimality of the strategy matrix returned by OPT for a fixed strategy size. The derived strategy matrix remains optimal within the constraints of each individual ensemble.</p><p>Private training data is fit on each ensemble forest by updating the leaf values and labels. For DP training, the refined DP noise is added to leaf values, and the resulting random decision trees F̃ i are privatized. For DP prediction, the DP noise is added to the predicted vote counts, with predictions made on non-DP random decision trees. Finally, prediction labels are determined collectively across the q ensembles. Algorithm 3 can capture our improved subsample-and-aggregate method as well by setting W (i) = Q(T (i) ) &#8890; and fitting disjoint subsets of data on each ensemble forest.</p><p>For privacy, following Theorems 5 and 7, each ensemble algorithm with a privacy budget of &#1013;/q satisfies &#1013;/q-DP for batch training and prediction. 
Thus, from the composition theorem, Algorithm 3 satisfies &#1013;-DP.</p><p>Complexity. The primary overhead of DP M-RF in Algorithm 3 comes from the optimization for refining DP noise, crucial for balancing accuracy and efficiency. This process is parallelizable across ensembles for efficiency. Assuming each feature has n values, the total domain size is n^d̂, and the query matrix W (i) has size o &#215; n^d̂ for DP batch training, b &#215; n^d̂ for DP batch prediction, and b &#215; o for subsample-and-aggregate. Here, o is the total number of leaves per ensemble and b is the number of inference queries. Our implementation adopts an implicit matrix representation, effectively reducing the matrix size to, e.g., o &#215; d̂n for batch training <ref type="bibr">(McKenna et al. 2018)</ref>.</p><p>The optimization cost depends on the chosen strategy parameterization. The space and time complexity of noise generation in Line 6 are both O(n^d̂) for DP batch training and prediction; for subsample-and-aggregate, they are O(o + b) and O((o + b)o), independent of domain sizes.</p><p>Tree building and optimization do not require the actual training data and can be preprocessed efficiently. Thus, the overhead for DP training and prediction is negligible. Moreover, matrices Y (i) and V (i) are computed without materializing the data matrix D, ensuring efficient space complexity.</p><p>Algorithm 3: DP M-RF Framework. Input: features S, training data X, inference queries Q, privacy budget &#1013;, number of ensembles q, numbers of trees &#964; 1 , . . . , &#964; q , tree depths h 1 , . . . , h q , max feature size d̂. Output: DP predicted label vector &#7929;. 1: for every ensemble i = 1 . . . q do 2: S i &#8592; d̂ random features from S [lines 3-9 partially lost in extraction: build F i over S i , instantiate W (i) , compute A (i) &#8592; OPT(W (i) ), draw noise B (i) ] Let Y (i) be leaf values of F i 10: F̃ i &#8592; update F i with leaf values Y (i) + B (i) 11: [...] 12: else if DP Prediction then 13: [...] 14: end if 15: end for 16: &#7929; i = arg max 1&#8804;l&#8804;k ∑ j=1..q &#7804; (j) i,l , &#8704; inference query i.</p><p>Setting Hyperparameters. 
Following Theorems 4 and 6, the utility of Algorithm 3 is measured by Err &#1013;/q (W (i) , A (i) ). This metric corresponds to errors in leaf values for DP training and vote counts for DP prediction. The error rates are affected by the matrix W (i) , which is defined by the number of trees &#964; i , depth h i , and schema information, including the number of features d. Since this error metric does not depend on the actual training data, we can effectively perform privacy-free hyperparameter tuning without consuming any privacy budget.</p><p>The hyperparameters q (number of ensembles) and d̂ (size of the feature subspace) involve a trade-off between accuracy and efficiency. Smaller feature sets enable more efficient optimizations but might exclude important features, reducing learning capability. Increasing the number of ensembles reduces the privacy budget per ensemble, potentially degrading accuracy. For smaller datasets, we recommend using a single ensemble forest (q = 1) with the original feature set (d̂ = |S|) to maximize accuracy. For larger datasets, d̂ and q should be chosen to balance efficiency and accuracy.</p></div>
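<p>Privacy-free tuning can be illustrated with the data-independent error formula on toy matrices: because the expected error is computed from schema-derived quantities alone, candidate tree configurations and strategies can be compared without touching any data. In this sketch (our own toy example, with an assumed factor k for the data-matrix columns), a workload of eight structurally identical trees is answered far more accurately through a deduplicated strategy than through the naive per-leaf Laplace strategy:</p>

```python
import numpy as np

def expected_err(W, A, eps, k=2):
    """Data-independent total MSE of the Matrix Mechanism (Definition 3 sketch)."""
    sens = np.abs(A).sum(axis=0).max()                   # ||A||_1
    return 2 * k * sens**2 / eps**2 * np.linalg.norm(W @ np.linalg.pinv(A), "fro") ** 2

# Workload of 8 structurally identical one-split trees over 4 leaves: rows of the
# decision path matrix repeat, so leaf values are linear combinations of others.
T_eight = np.vstack([np.eye(4)] * 8)

# Naive strategy A = T spends budget on every redundant row; the deduplicated
# strategy A = I answers the same workload far more accurately. Both numbers use
# only schema-derived matrices, so comparing them costs no privacy budget.
err_naive = expected_err(T_eight, T_eight.copy(), eps=1.0)
err_dedup = expected_err(T_eight, np.eye(4), eps=1.0)
```

<p>This mirrors the paper's observation that redundant decision paths waste budget under equal allocation, and why more trees need not hurt accuracy under an optimized strategy.</p>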
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments</head><p>We empirically evaluate the performance of our DP training and prediction techniques for random forests, demonstrating higher accuracy compared to competing techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Setup</head><p>Datasets. We use six popular classification datasets from the UCI ML Repository <ref type="bibr">(Kelly, Longjohn, and Nottingham 2023)</ref> with feature dimensions ranging from 4 to 128: Car, Iris, Scale, Adult, Heart, and Mushroom. Certain datasets with continuous values were preprocessed using public domain knowledge, including discretization. The Adult dataset was already discretized <ref type="bibr">(Chang and Lin 2011)</ref>.</p><p>Implementation and Competing Techniques. We evaluate Algorithm 3 against various competing techniques. For consistency, all DP methods use random decision trees. All implementations are in Python and experiments were conducted on a MacBook Air (M2 chip with 16GB RAM). We adopt a multi-way tree structure as in ID3; it can easily be extended to binary trees.</p><p>We compare our DP batch training against two widely-used methods. The first baseline employs the same batch training but applies the Laplace mechanism, commonly used in state-of-the-art works <ref type="bibr">(Jagannathan, Pillaipakkamnatt, and Wright 2009;</ref><ref type="bibr">Maddock et al. 2022)</ref>. This corresponds to Algorithm 3 with B (i) = Lap(q&#964; i /&#1013;) o&#215;k and q = 1, d̂ = |S|. The second baseline trains each tree on disjoint datasets using the Laplace mechanism, as seen in prior works <ref type="bibr">(Holohan et al. 2019;</ref><ref type="bibr">Fletcher and Islam 2017)</ref>. For all methods above, we adopt the standard hard voting for prediction. We do not compare against work that uses a weaker definition of DP, such as <ref type="bibr">(Rana, Gupta, and Venkatesh 2015)</ref>. For DP prediction techniques, we compare our DP batch prediction and subsample-and-aggregate approach from Algorithm 3 against the existing subsample-and-aggregate <ref type="bibr">(Dwork and Feldman 2018)</ref>. 
Additionally, we benchmark our techniques against an optimized non-private random forest algorithm, the Extra-Trees classifier from scikit-learn <ref type="bibr">(Pedregosa et al. 2011)</ref>, to obtain an empirical upper bound on the accuracy achievable by private algorithms. For runtime comparisons, we evaluate the non-private version of Algorithm 3 to ensure consistency.</p></div>
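Measuring such a non-private ceiling is straightforward with scikit-learn; the snippet below is a sketch on a stand-in dataset (Iris from sklearn.datasets, with an 80:20 split), not our exact evaluation pipeline or hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Non-private empirical upper bound: an Extra-Trees ensemble
# evaluated on a held-out 20% test set.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # accuracy with no privacy constraint
```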
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>We measure the accuracy and runtime of Algorithm 3 with at least 5 trials under fixed parameters. Unless explicitly denoted, each dataset is split into train and test subsets with an 80:20 ratio. Each ensemble consists of τ/q trees of depth h. Main Results. Figure <ref type="figure">2</ref> shows the test accuracy of our DP batch training and prediction techniques for varying values of ε. We consider commonly used values of τ between 64 and 128. We use q = 1, d = d - 1 for small to medium-sized datasets like Car, Iris, and Balance; for larger datasets, multiple ensembles are employed. Our optimized approaches consistently outperform the baselines, showing a significant accuracy improvement of 20-40% for small to medium-sized datasets. Larger datasets such as Heart, Mushroom, and Adult show a notable accuracy improvement of 10-20%, particularly with smaller ε values. Our novel matrix representation improves DP batch training and prediction, enabling the learning of numerous base classifiers and the prediction of a large number of test samples, all without sacrificing accuracy.</p><p>Increasing the number of trees does not negatively impact the accuracy of DP batch training. With random forests, predictive power is expected to increase with the number of trees. However, DP batch and disjoint training with the Laplace mechanism both suffer from tree scaling issues, making a smaller number of trees optimal. Batch training with the Laplace mechanism employs equal budget allocation; as the number of trees increases, the privacy budget per tree decreases. The same tree scaling issue occurs during disjoint training; as the number of trees increases, the number of training samples per tree decreases. In both cases, the predictive power of a single tree is inversely proportional to the total number of trees. 
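The ensemble scaling principle can be illustrated numerically: majority voting over τ independent weak learners, each correct with probability p > 0.5, becomes more accurate as τ grows — precisely the gain that equal budget splitting or data partitioning erodes. The simulation below is a self-contained sketch of this principle, not the paper's algorithm.

```python
import numpy as np

def majority_vote_accuracy(p, tau, trials=20000, seed=0):
    """Monte Carlo estimate of the accuracy of a strict-majority vote
    over tau independent classifiers, each correct with probability p."""
    rng = np.random.default_rng(seed)
    correct = rng.random((trials, tau)) < p      # per-learner correctness
    return float((correct.sum(axis=1) * 2 > tau).mean())

# Accuracy rises toward 1 as the ensemble grows -- provided each
# learner's privacy budget or training data is not shrunk in proportion.
accs = [majority_vote_accuracy(0.6, tau) for tau in (1, 16, 128)]
```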
On the other hand, our DP batch training method preserves good accuracy as the number of trees increases, reflecting the scaling principle of ensemble learning; see Figure <ref type="figure">3a</ref>.</p><p>Increasing the number of test samples has minimal impact on the accuracy of DP batch prediction. Figure <ref type="figure">3b</ref> illustrates the relationship between accuracy and the number of inference queries for the Car dataset, comparing our DP batch prediction against the subsample-and-aggregate baseline. Because of the limited number of samples, we use the entire Car dataset to train a model and report the prediction accuracy for 5 to 1000 inference queries. The existing approach degrades as the number of inference queries increases, falling to 30% accuracy at 100 samples; its per-sample budget allocation leads to an immediate loss of accuracy. In contrast, with our DP batch prediction, accuracy remains at 90% even when predicting as many as 1000 samples. This addresses a key challenge that existing DP prediction approaches face when making predictions on numerous samples.</p><p>Applying to the subsample-and-aggregate framework.</p><p>We show the generalizability of our techniques by adapting them to the subsample-and-aggregate framework. Figure <ref type="figure">4</ref> shows the test accuracy of our optimized approach compared to the existing non-optimized version when performing prediction on 345 examples from the Car dataset. Our technique improves the accuracy of existing solutions by up to 50%, with similar improvements on other datasets.</p><p>Runtime. Table <ref type="table">1</ref> compares the runtime of Algorithm 3 to non-private random decision trees with the same hyperparameters as in Figure <ref type="figure">2</ref>. The runtime for data-independent tree building is excluded as it incurs no DP overhead. The primary overhead comes from optimization, which is crucial for high accuracy. 
Despite this, our approach maintains minimal runtime overhead during training and prediction.</p><p>Although our DP baselines have similar training and prediction complexities, they lack noise optimization. Our results show that the DP M-RF algorithm is feasible on commodity hardware, even with large datasets. The optimization overhead can be reduced by decreasing the number of features per ensemble, potentially at the expense of some accuracy. Further experiments are detailed in the technical appendix.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>The Thirty-Ninth AAAI Conference on Artificial Intelligence </p></note>
		</body>
		</text>
</TEI>
