<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Predictive Modeling of an Unbalanced Binary Outcome in Food Insecurity Data</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 15th International Conference on Data Science (2019)</publisher>
				<date>07/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10109275</idno>
					<idno type="doi"></idno>
					<title level='j'>Predictive Modeling of an Unbalanced Binary Outcome in Food Insecurity Data</title>
					<idno></idno>
					<biblScope unit="volume"></biblScope>
					<biblScope unit="issue"></biblScope>
					<author>J Fabish</author><author>L. Davis</author><author>S. Kim</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
		<abstract><ab><![CDATA[Predictive modeling of a rare event using an unbalanced data set leads to poor prediction sensitivity. Although this obstacle is often accompanied by other analytical issues such as a large number of predictors and multicollinearity, little has been done to address these issues simultaneously. The objective of this study is to compare several predictive modeling techniques in this setting. The unbalanced data set is addressed using four resampling methods: undersampling, oversampling, hybrid sampling, and ROSE synthetic data generation. The large number of predictors is addressed using penalized regression methods and ensemble methods. The predictive models are evaluated in terms of sensitivity and F1 score via simulation studies and applied to the prediction of food deserts in North Carolina. Our results show that balancing the data via resampling methods leads to an improved prediction sensitivity for every classifier. The application analysis shows that resampling also leads to an increase in F1 score for every classifier, while the simulated data showed that the F1 score tended to decrease slightly in most cases. Our findings may help improve classification performance for unbalanced rare event data in many other applications.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As defined by the USDA, food insecurity is a household-level economic and social condition of limited or uncertain access to adequate food. This is in contrast to hunger, which is an individual-level physiological condition that results from food insecurity. According to the Food Research and Action Center, in 2016 North Carolina had the 14th highest food insecurity level among all states, with 15.1% (603,094) of households experiencing food insecurity <ref type="bibr">[1]</ref>. Coined in Scotland in the early 1990s, the term "food desert" describes communities which have limited access to affordable and nutritious foods <ref type="bibr">[2]</ref>. The U.S. Census Bureau conducts the American Community Survey (ACS) [3] on a yearly basis to provide social, housing, economic, and demographic data. A census block is the smallest geographic area for which it collects and records population data. Census block groups, typically containing between 600 and 3,000 people, are one level above census blocks in terms of geographical area and are the smallest unit for which the Census Bureau collects and records sample data. They do not cross census tract, county, or state boundaries <ref type="bibr">[4]</ref>.</p><p>The motivation of this study stems from an unsuccessful attempt to build a model to predict the binary food desert status of U.S. census block groups in North Carolina. The objective is to build a predictive model for food insecurity in North Carolina using statistically selected predictors. Unfortunately, due to a severe imbalance between the classes of the response variable, i.e., only 3.3% of observations are food deserts, prediction sensitivity was low and no trustworthy inferences could be made. 
It has been shown that resampling methods such as oversampling and undersampling are effective in improving prediction performance in such a situation <ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref>.</p><p>Much of the recent research on resampling methods in predictive modeling has involved decision tree methods and support vector machines. However, the data sets used in these studies did not suffer from the large p problem encountered here. Our data set has 2,780 covariates, i.e., there are 2^2,780 possible combinations of predictors. This renders traditional variable selection methods, such as forward and backward selection, too computationally intensive. Furthermore, many of the predictors in the food insecurity data set are known to be linear combinations of others. Thus, multicollinearity is an additional obstacle to overcome. Modern penalized regression methods perform simultaneous parameter estimation and variable selection in a setting with large p and multicollinearity.</p><p>We apply oversampling, undersampling, a hybrid of over- and undersampling, and Random Over-Sampling Examples (ROSE) <ref type="bibr">[8]</ref> synthetic data generation to an unbalanced binary response variable. These resampling methods are applied to unbalanced data sets and result in an equal distribution of observations from the majority and minority response classes. 
It has been shown that these sampling methods lead to improvement in classification sensitivity, although each method has drawbacks <ref type="bibr">[7]</ref>.</p><p>Using the original and resampled data sets, we train four penalized logistic regression models, the least absolute shrinkage and selection operator (LASSO) <ref type="bibr">[12]</ref>, elastic net (ENET) <ref type="bibr">[15]</ref>, smoothly clipped absolute deviation (SCAD) <ref type="bibr">[17]</ref>, and minimax concave penalty (MCP) <ref type="bibr">[18]</ref>, and two ensemble classifiers, random forest <ref type="bibr">[21]</ref> and boosting <ref type="bibr">[22]</ref>. These classifiers have been successfully applied to big data sets. The penalized regression methods used here are highly interpretable since they shrink many regression coefficients to exactly zero. Ensemble methods tend to improve prediction accuracy but lose interpretability by combining the results of many classifiers.</p><p>Correctly classifying the minority observations is the main purpose of our research, which makes overall accuracy an unsuitable performance measure. Sensitivity, often called recall, measures the proportion of the minority observations which are correctly classified. Precision measures the proportion of positive predictions which are correct. The F1 score is the harmonic mean of precision and sensitivity. To achieve a high F1 score, a model needs both high precision and high sensitivity, which makes it ideal for assessing a predictive model focused on correctly identifying observations from the minority class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Statistical Methods</head><p>In this study, we apply novel combinations of well-known methods for dealing with a sparse parameter set and an unbalanced binary response variable. Penalized regression methods such as LASSO, ENET, SCAD, and MCP, as well as the ensemble methods random forest and boosted trees, are well known to handle classification problems involving a large number of predictors, p. Oversampling, undersampling, hybrid sampling methods, and synthetic data generation have been used successfully to overcome an unbalanced response variable. We apply combinations of these methods to real and simulated data exhibiting large p and an unbalanced response variable. In this section, we briefly summarize the theoretical background of the selected statistical methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Penalized Regression Methods</head><p>Let Y be the binary response vector, let X be an n × p predictor matrix, and let β be the p × 1 vector of regression coefficients. The penalized logistic regression estimate is given by</p><p>β̂ = argmin_β { −(1/n) ℓ(β) + Σ_{j=1}^p P_{λ,γ}(|β_j|) },    (1)</p><p>where</p><p>ℓ(β) = Σ_{i=1}^n [ y_i x_i^T β − log(1 + exp(x_i^T β)) ]</p><p>is the binomial log-likelihood and P_{λ,γ}(·) is one of the penalty functions described in the following.</p><p>Penalized regression methods impose a unique penalty function on the coefficients as in equation <ref type="bibr">(1)</ref>. The methods considered here perform simultaneous variable selection and parameter estimation and are applicable for both continuous and discrete response variables, making them suitable for regression or classification problems with large p. For all the following penalized regression methods, we assume that the covariate matrix has been standardized by subtracting the mean and dividing by the standard deviation of each respective column.</p><p>LASSO. The LASSO, proposed by <ref type="bibr">Tibshirani (1996)</ref> <ref type="bibr">[12]</ref>, imposes the L1 penalty, whose constraint region is a diamond-shaped cross-polytope. The penalty is given by</p><p>P_λ(β) = λ Σ_{j=1}^p |β_j|,</p><p>which is called the L1 penalty. The regularization parameter, λ, is data-driven and calculated via cross-validation for the LASSO and each of the following penalties as well.</p><p>The cyclical coordinate descent algorithm (CCDA) along a regularization path is used to compute LASSO solutions efficiently. The LASSO solution and other penalized regression solutions are typically computed via the CCDA, which solves a series of univariate optimization problems until some convergence criterion is met.</p><p>ENET. The LASSO shrinks many coefficients to exactly zero, which is a highly interpretable form of variable selection, but it can be unstable in a setting involving multicollinearity. Ridge regression, which imposes the L2 penalty, is known to perform well in such a setting. 
The ENET penalty, proposed by <ref type="bibr">Zou and Hastie (2005)</ref> <ref type="bibr">[16]</ref>, is a convex combination of the L1 and L2 penalties given by</p><p>P_{λ,α}(β) = λ [ α Σ_{j=1}^p |β_j| + ((1 − α)/2) Σ_{j=1}^p β_j² ].</p><p>Here 0 ≤ α ≤ 1 controls the trade-off between the L1 and L2 penalties, with α = 1 equivalent to the LASSO and α = 0 equivalent to ridge regression. We set α = 0.5 for the elastic net penalty, which yields a strictly convex constraint region that still has non-differentiable corners, enabling it to perform variable selection while remaining more stable than the LASSO among highly correlated predictors <ref type="bibr">[13]</ref>. SCAD. LASSO and ENET are both biased estimators and employ convex penalties. SCAD and MCP are continuous piecewise non-convex penalties which start out equivalent to the LASSO penalty but weaken as the magnitude of β_j increases. SCAD was introduced by Fan and Li (2001) <ref type="bibr">[17]</ref>. The SCAD penalty, defined on [0, ∞) for γ &gt; 2, is given by</p><p>P_{λ,γ}(β) = λβ for β ≤ λ; (2γλβ − β² − λ²) / (2(γ − 1)) for λ &lt; β ≤ γλ; λ²(γ + 1)/2 for β &gt; γλ.</p><p>γ is known as the threshold parameter and determines the point at which the penalty transitions to the subsequent piece of the function, for both SCAD and the MCP which follows.</p><p>MCP. The MCP, proposed by Zhang (2010) <ref type="bibr">[18]</ref>, was designed to approach unbiased estimates faster than SCAD. The MCP penalty, defined on [0, ∞) for γ &gt; 1, is given by</p><p>P_{λ,γ}(β) = λβ − β²/(2γ) for β ≤ γλ; γλ²/2 for β &gt; γλ.</p><p>Unlike the LASSO or ENET, both SCAD and MCP are known to achieve the oracle property <ref type="bibr">[19]</ref>; that is, as n → ∞ the model identifies the zero regression coefficients correctly with probability approaching 1 while remaining consistent for the non-zero coefficients <ref type="bibr">[20]</ref>.</p></div>
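The penalty functions above can be checked numerically. The following is a minimal illustrative sketch in Python (not the R packages used in the paper) of the LASSO, SCAD, and MCP penalties for a single non-negative coefficient, together with the soft-thresholding operator that cyclical coordinate descent applies at each univariate step; the default γ values (3.7 for SCAD, 3 for MCP) are conventional choices, not taken from the paper.

```python
def lasso_penalty(beta, lam):
    # L1 penalty: lambda * |beta|
    return lam * abs(beta)

def scad_penalty(beta, lam, gamma=3.7):
    # SCAD (Fan and Li, 2001): L1 near zero, quadratic transition,
    # then constant, so large coefficients are not shrunk.
    b = abs(beta)
    if b <= lam:
        return lam * b
    elif b <= gamma * lam:
        return (2 * gamma * lam * b - b**2 - lam**2) / (2 * (gamma - 1))
    else:
        return lam**2 * (gamma + 1) / 2

def mcp_penalty(beta, lam, gamma=3.0):
    # MCP (Zhang, 2010): tapers to a constant at gamma * lam.
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b**2 / (2 * gamma)
    else:
        return gamma * lam**2 / 2

def soft_threshold(z, lam):
    # Univariate LASSO solution used at each coordinate descent step.
    if z > lam:
        return z - lam
    elif z < -lam:
        return z + lam
    return 0.0
```

Each piecewise penalty is continuous at its breakpoints (e.g., SCAD equals λ² at β = λ from both sides), which is what makes the coordinate-wise updates well behaved.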
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Ensemble Methods</head><p>Ensemble methods combine the results of multiple base statistical learning algorithms to construct an improved learning algorithm. In this study we apply two ensemble methods, random forest and boosting, for classification.</p><p>Random Forest. Random forest employs a resampling method, closely related to bagging, which decreases the variance of predictions. In bagging, we draw with replacement M random samples of size n from our training data set of n observations. A different model is trained and the statistic of interest is calculated on each of the M random samples. The final estimate is the mean or the mode, for continuous or discrete random variables respectively, of the values computed across all M samples. Random forest differs from bagging in that, rather than training on all p predictors at each node, a random subset of size √p is drawn at each split in a given tree. The result is a collection of largely decorrelated classification trees. Random forest is an effective method for decreasing variance and improving prediction accuracy, but variable selection for individual predictors is not achieved. Rather, the out-of-bag sample is used to assess variable importance in terms of the mean decrease in accuracy upon permuting the values of each variable in succession, as compared to including all variables in the model <ref type="bibr">[7] [13]</ref>.</p><p>Boosting Trees. Boosting is another powerful ensemble method. Unlike random forest, it builds trees sequentially through an iterative weighting scheme. Boosting with classification trees consists of fitting a sequence of trees in which the first tree is fit to the response variable and each subsequent tree is fit to the residuals of the previous tree. Some benefits of boosted trees are their speed, insensitivity to the scale of the predictors, and relatively high accuracy. However, they have three hyperparameters to tune and are prone to overfitting the training data. Cross-validation can help to mitigate these issues <ref type="bibr">[23]</ref> [7] <ref type="bibr">[13]</ref>.</p></div>
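The bagging step described above can be sketched as follows; this is an illustrative toy example (hypothetical data, a trivial majority-class base learner), not the paper's random forest or boosting code.

```python
import random
import math
from collections import Counter

# Draw M bootstrap samples of size n with replacement, "train" a trivial
# majority-class classifier on each, and aggregate by majority vote.
random.seed(42)
n, M = 100, 25
y = [1] * 10 + [0] * 90            # unbalanced binary response (10% minority)

votes = []
for m in range(M):
    boot = [random.choice(y) for _ in range(n)]    # bootstrap sample
    majority = Counter(boot).most_common(1)[0][0]  # base "classifier"
    votes.append(majority)

# Final bagged prediction is the mode of the M base predictions.
prediction = Counter(votes).most_common(1)[0][0]

# Random forest additionally restricts each tree split to a random subset
# of sqrt(p) predictors; for p = 1000 that is 31 predictors per split.
p = 1000
mtry = int(math.sqrt(p))
```

The example also makes the classic failure mode visible: on unbalanced data, every bootstrap majority is the majority class, so the bagged vote is always 0, which is exactly why resampling is needed before training.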
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Resampling Methods for Unbalanced Data</head><p>This study applies resampling methods, namely undersampling, oversampling, a hybrid of both, and ROSE synthetic data generation, to overcome an unbalanced data set. All of these methods have been shown to improve prediction sensitivity, which is often a primary assessment measure for the type of classification problem considered here <ref type="bibr">[7] [10]</ref>. For the purpose of illustrating each resampling method in the following subsections, we consider a data set of n = 1000 observations of a binary response variable Y_i such that y_i ∈ {0, 1} for i = 1, …, 1000. Suppose also that Σ_{i=1}^{1000} y_i = 100, so that the minority class, labeled '1', makes up only 10% of the observations.</p><p>Oversampling. Oversampling balances the data set by randomly sampling with replacement, from the minority class, the same number of observations as make up the majority class <ref type="bibr">[8]</ref>, and combining the resampled minority class with the entire majority class into a single data set. In the example data set, oversampling results in a new data set with 1800 observations, of which 900 pertain to each class. Potential overfitting of the training data is a concern with this method.</p><p>Undersampling. Undersampling consists of randomly sampling without replacement, from the majority class, the same number of observations as make up the minority class, and combining the sampled majority class with the entire minority class into a balanced data set <ref type="bibr">[8]</ref>. In the example, undersampling generates a data set with 200 observations, of which 100 pertain to each class. A downside to this approach is that we eliminate a significant portion of our data set, which likely contains useful information.</p><p>Hybrid Sampling. 
This is a combination of oversampling and undersampling that results in a data set with the same dimensions as the original. First, the minority class is oversampled sequentially until the number of observations from the minority class reaches some proportion p of the desired final sample size n (both of which are required arguments for the corresponding function in the ROSE package in R). Next, the majority class is undersampled to yield the balanced data set of size n <ref type="bibr">[8]</ref>. In our example, we set n = 1000 and p = 0.5. The minority set is oversampled until there are 500 observations pertaining to class '1', and the majority set is undersampled to 500 observations, resulting in a data set with n = 1000 observations. ROSE. The ROSE method balances the data set by generating synthetic observations from a smoothed (kernel) estimate of the distribution of the covariates within each class. The process of data generation for our example data set is as follows.</p><p>Let the class labels pertain to the set G = {0, 1}. Let n_g for g = 0, 1 be the number of observations pertaining to class g, i.e., n_0 = 900 and n_1 = 100.</p><p>1) For i = 1, …, 1000: a) Randomly select g ∈ G with probability 0.5. b) From the training set, randomly select with replacement an observation (y_g, x_g) from the n_g observations pertaining to class g. c) Sample x* from K_{H_g}(·, x_g), where K_{H_g} is an estimated probability distribution centered at x_g with covariance matrix H_g. Upon completing the loop, the ROSE procedure has generated an independent data set of 1000 observations, each with an equal chance of coming from either class. Notably, the synthetic data set does not contain any of the original observations, which leaves them available for model validation <ref type="bibr">[8]</ref> [9].</p></div>
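The three non-synthetic resampling schemes can be illustrated directly on the running example (n = 1000, 100 minority observations). A minimal Python sketch, not the ROSE package itself:

```python
import random

# Running example: 100 minority ('1') and 900 majority ('0') observations.
random.seed(1)
data = [1] * 100 + [0] * 900
minority = [y for y in data if y == 1]
majority = [y for y in data if y == 0]

# Oversampling: sample the minority WITH replacement up to the majority size,
# then keep the entire majority class -> 1800 observations, 900 per class.
over = majority + [random.choice(minority) for _ in range(len(majority))]

# Undersampling: sample the majority WITHOUT replacement down to the minority
# size, then keep the entire minority class -> 200 observations, 100 per class.
under = minority + random.sample(majority, len(minority))

# Hybrid: oversample the minority to p*n and undersample the majority to the
# remainder -> a balanced data set of the original size n.
n_final, prop = 1000, 0.5
k = int(n_final * prop)
hybrid = ([random.choice(minority) for _ in range(k)]
          + random.sample(majority, n_final - k))
```

The resulting sizes (1800, 200, and 1000) match the counts worked out in the text above.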
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Assessment Measures for Prediction Performance</head><p>On an unbalanced testing data set, a classifier that predicts that every observation comes from the majority class will appear to do well in terms of accuracy. This is a common scenario when a classifier is trained on an unbalanced data set. In many such binary classification tasks, correctly identifying observations from the minority class is the primary goal. Therefore, we need an alternative to mean accuracy to assess the quality of predictions made by our model. Sensitivity, also known as recall, measures the proportion of positive cases correctly predicted. It is insensitive to the unbalanced class distribution. However, a classifier which predicts that every observation comes from the minority class will achieve a high sensitivity,</p><p>Sensitivity = TruePositive / (TruePositive + FalseNegative).</p><p>Precision, also known as positive predictive value, measures the proportion of positive predictions that were correct. This evaluates the validity of attaining a high sensitivity,</p><p>Precision = TruePositive / (TruePositive + FalsePositive).</p><p>The F1 score is the harmonic mean of precision and sensitivity,</p><p>F1 = 2 × (Precision × Sensitivity) / (Precision + Sensitivity).</p><p>A high sensitivity at the expense of a low precision, or a high precision at the expense of a low sensitivity, translates to a low F1 score, making the F1 score an ideal assessment measure for rare event classification. All of these measures can easily be obtained from the confusion matrix.</p></div>
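For concreteness, the three assessment measures can be computed from confusion matrix counts; the counts below are hypothetical illustrations, not results from the paper.

```python
# Hypothetical confusion matrix counts for a rare-event classifier:
# 80 true positives, 20 false negatives, 120 false positives.
tp, fn, fp = 80, 20, 120

sensitivity = tp / (tp + fn)   # recall: share of minority cases found
precision = tp / (tp + fp)     # share of positive predictions that are correct
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```

Here the classifier finds 80% of minority cases but only 40% of its positive calls are correct, and the harmonic mean pulls the F1 score down toward the weaker of the two, illustrating why F1 penalizes a lopsided precision/sensitivity trade-off.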
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Simulation Studies</head><p>We applied four sampling methods to real and simulated data in order to balance the classes of the response variable prior to prediction using penalized and ensemble classification methods. For comparison, the six selected classification algorithms were also applied to a simulated data set that was balanced from the outset. The mean prediction sensitivity and F1 score computed over 300 iterations are reported to assess the effectiveness of the various classification algorithm/sampling method combinations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Design of Simulations</head><p>Each iteration generates independent testing and training data using the mvrnorm function in the MASS package in R. We fixed p = 1000 and n = 4000, and considered two covariance structures, independent and first-order autoregressive (AR(1)) with ρ = 0.5 and σ² = 1.</p><p>In the AR(1) covariance structure, correlation is highest among adjacent predictors and decreases exponentially with distance. The true parameter vector is β = [3, 3, 0, 3, 2, 0, 2, 0, 0, …, 0], in which only 5 coefficients are non-zero and all remaining coefficients are zero. To generate the simulated response variable, the intercept in the logistic regression was adjusted such that the minority class comprises approximately 10% of all observations.</p><p>Next, using the ROSE <ref type="bibr">[8]</ref> package in R, oversampling, undersampling, hybrid sampling, and ROSE were applied to the unbalanced training data set. Finally, the six classification algorithms were trained using the now-balanced training data sets, as well as the unbalanced training set for reference. Predictions were then made by all models on the testing data. The mean sensitivity and F1 score and their respective standard deviations were calculated over all iterations.</p></div>
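The AR(1) covariance structure used in the simulations can be constructed explicitly. The sketch below uses a small p for illustration only (the paper uses p = 1000 and R's mvrnorm to draw the predictors):

```python
# AR(1) covariance: Sigma[i][j] = sigma^2 * rho**|i - j|, so correlation
# between predictors decays exponentially with their distance in index.
rho, sigma2 = 0.5, 1.0
p = 5  # small p for illustration; the paper uses p = 1000

Sigma = [[sigma2 * rho ** abs(i - j) for j in range(p)] for i in range(p)]
```

Adjacent predictors have correlation 0.5, predictors two apart have 0.25, and so on, which is the multicollinearity pattern under which ROSE later performs comparatively well.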
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Results</head><p>The results of the Monte Carlo simulations are presented in Tables <ref type="table">1</ref> and <ref type="table">2</ref>. In each cell, the top number is the mean and the bottom number in parentheses is the standard deviation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Empirical Study</head><head n="4.1">Data Source</head><p>An ideal data set for the study at hand would include a binary response variable representing the food desert status of each block group in North Carolina and a set of mutually uncorrelated predictors associated with food desert status. A fixed-time data set for the current analysis was compiled from the following sources.</p><p>PolicyMap. The binary response variable for this study, limited supermarket access (LSA) status by block group in NC, was obtained from Reinvestment Fund (2016) via PolicyMap <ref type="bibr">[6]</ref>. LSA status is a measure of whether a block group is well served by a supermarket or experiences limited access. This aligns nicely with the definition of a food desert, i.e., an area in which residents have limited access to an affordable and healthy diet. For this reason, we chose LSA status as a surrogate variable for food desert status. The data for supermarket location was acquired by Reinvestment Fund from the 2017 Nielsen TDLinx database [24] and includes supermarkets, supercenters, limited assortment stores, and natural food stores, but excludes superettes and dollar stores due to their lack of healthy food options. To account for variability in urban and rural areas, the population density of each block group was considered upon assignment of LSA status. This is achieved by considering how far the distance to the nearest supermarket would need to be reduced to equal the distance of a well-served block group of the same population density class. The LSA status was eventually encoded as a dummy variable having the value of 1 for positive LSA status and 0 for negative LSA status <ref type="bibr">[6]</ref>, which we refer to here as positive and negative food desert status, respectively. US Census Bureau. 
The predictor variables used in the analysis were published by the US Census Bureau in the 2016 American Community Survey [3] and acquired from American Fact Finder <ref type="bibr">[5]</ref>, a public database of U.S. Census Bureau data. The covariate matrix consists of 2,780 variables representing 5-year estimates of various social, housing, economic, and demographic data from the 2016 ACS. Each variable is discrete numeric and represents the count of a particular characteristic with respect to the given block group.</p><p>Data Wrangling. The data from Reinvestment Fund and the US Census Bureau were cleaned using the R statistical programming language via the tidyverse package. The data were combined using census block group number as a common key. Prior to cleaning, the data set had 6,066 observations, each representing a different block group, and 2,835 predictor variables associated with each observation. Any predictor that was missing data for 10 or more observations was excluded, and then all remaining incomplete observations were removed. After cleaning, the data set had final dimensions 6,062 × 2,780 (45,769 KB).</p></div>
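The two-step cleaning rule described above (drop predictors with too many missing values, then drop the remaining incomplete rows) can be sketched on toy records. The column names, values, and threshold below are hypothetical; the paper's actual cutoff was 10 missing observations and the work was done with the tidyverse in R.

```python
# Toy records with missing values encoded as None (hypothetical data).
rows = [
    {"x1": 1, "x2": None, "x3": 5},
    {"x1": 2, "x2": None, "x3": None},
    {"x1": 3, "x2": 7,    "x3": 6},
]
threshold = 2  # illustrative; the paper excluded predictors missing >= 10 times

# Step 1: drop any predictor missing in `threshold` or more rows.
cols = list(rows[0])
n_missing = {c: sum(r[c] is None for r in rows) for c in cols}
keep_cols = [c for c in cols if n_missing[c] < threshold]

# Step 2: drop rows that still contain a missing value.
cleaned = [{c: r[c] for c in keep_cols} for r in rows]
cleaned = [r for r in cleaned if all(v is not None for v in r.values())]
```

Here x2 is missing twice and is dropped as a column, while the row still missing x3 is dropped afterwards, mirroring the column-first, row-second order used in the paper.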
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Analysis and Results</head><p>Using 5-fold cross-validation, the classification algorithms were applied before and after balancing the data set using the four resampling methods. On each fold, 80% of the food insecurity data set was allocated to the training set. Using the ROSE package for R <ref type="bibr">[8]</ref>, oversampling, undersampling, hybrid sampling, and ROSE were each applied to the training portion of the data set. Then, the balanced training data were used to train the six selected classification algorithms. LASSO and ENET were applied using the glmnet function from the glmnet <ref type="bibr">[15]</ref> package, with α = 1 for LASSO and α = 0.5 for ENET. SCAD and MCP were applied using the ncvreg <ref type="bibr">[19]</ref> package. Random forest was applied using the randomForest <ref type="bibr">[21]</ref> package. Gradient boosting was applied using the gbm <ref type="bibr">[22]</ref> package with interaction.depth = 1. Five hundred trees were grown for both random forest and gradient boosting. Predictions were then made on the testing portion of the data set. The mean prediction sensitivity and F1 score were calculated for each fold and averaged. The results are presented in Table <ref type="table">3</ref>.</p><p>Since LASSO, SCAD, and MCP all performed well on the undersampled data set, we chose to examine which predictor variables were selected as significant by these models in this setting. We took the intersection of the significant predictors across all 5 folds of cross-validation. Ideally, knowing which predictors are associated with positive food desert status could provide information to help address the issue of food insecurity throughout the region being studied, North Carolina in this case. The intersection of significant predictors contained about 20 variables, among them '… Years And Over - Information'. 
Some of these variables appear to conflict, such as 'Value - $125,000-$149,999' and 'Value - $60,000-$69,999'. It could be informative to separate food deserts by population density and run an individual analysis for each group, since urban and rural food deserts likely have different profiles with respect to these predictors. This may account for the apparent conflict among the covariates selected here.</p></div>
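A subtle but important point in the evaluation scheme above is that resampling is applied only to the training portion of each fold, so the test fold retains the original unbalanced class distribution. A minimal Python sketch of this arrangement (illustrative only, using undersampling; the paper used the ROSE package in R):

```python
import random

# 5-fold CV in which undersampling is applied ONLY to the training folds.
random.seed(7)
n = 1000
y = [1] * 100 + [0] * 900              # 10% minority, as in the running example
idx = list(range(n))
random.shuffle(idx)
folds = [idx[i::5] for i in range(5)]  # 5 disjoint folds of 200 indices

for k in range(5):
    test_idx = set(folds[k])           # kept unbalanced for evaluation
    train_idx = [i for i in idx if i not in test_idx]
    train_min = [i for i in train_idx if y[i] == 1]
    train_maj = [i for i in train_idx if y[i] == 0]
    # Undersample the majority class within the training portion only.
    balanced = train_min + random.sample(train_maj, len(train_min))
    # ... train a classifier on `balanced`, predict on `test_idx` ...
```

Balancing before the split would leak resampled copies of training observations into the test fold and inflate the reported sensitivity and F1 score.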
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>The resampling methods were vital to the results of each model on the food insecurity data. Their application led to an increase in sensitivity and F1 score for essentially every model. No usable results were obtained by any model on the unbalanced data set. The ensemble methods failed to obtain usable results for every combination of resampling. Among all the resampling methods, undersampling led to the largest improvement in terms of sensitivity, while the other resampling methods improved the F1 score slightly.</p><p>With respect to the simulated data with independent covariance, the only resampling method for which every classifier performed well was again undersampling. The highest sensitivity was obtained by penalized regression methods applied to the undersampled data set. No model was able to improve over the F1 score attained by SCAD or MCP on the unbalanced data set. Additionally, the sensitivity of SCAD and MCP on the unbalanced data set was superior to that of many of the other models on the balanced data sets. If we knew that the F1 score was a more relevant measure than sensitivity, we would prefer to use SCAD or MCP on the unbalanced data set rather than another model with a resampling method. SCAD and MCP attained a high sensitivity for every resampling method.</p><p>Considering the simulated data with AR(1) covariance, similar to the simulations with independent covariance, no method improved on the F1 score of SCAD or MCP on the unbalanced data. Only if we knew that prediction sensitivity were a more relevant measure than the F1 score would we choose to apply resampling methods. With this covariance structure, ROSE led to prediction results that were competitive with those from undersampling. This is important because the severity of the imbalance can sometimes make undersampling impractical. 
Boosting with the ROSE data set achieved the highest sensitivity. That being said, SCAD and MCP performed consistently across all resampled data sets again and should be considered attractive options in this and the previous settings.</p><p>Overall, the combination of penalized regression methods with resampling enabled us to improve the sensitivity and F1 scores such that we could identify predictors associated with positive food desert status in NC. This is promising in that it can help us to gain insight into the social, economic, demographic, and housing data profiles of food deserts in NC or elsewhere in the world. This certainly provides motivation to expand the food desert study to include larger regions. Additionally, this could help us to develop creative ideas to address societal problems such as food insecurity or other similar issues involving rare events.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In every simulation setting but one, undersampling with penalized regression led to the highest sensitivity. With AR(1) covariance, ROSE with boosting outperformed undersampling for each classifier with respect to sensitivity, but only by a slim margin. In every case, random forest performed poorly on the unbalanced, oversampled, and hybrid sampled data sets. For random forest, the introduction of duplicate observations due to sampling with replacement crippled performance. Boosting outperformed SCAD and MCP by a small margin in a few cases but was also unstable when applied to the ROSE food insecurity data. SCAD and MCP were consistent performers across every data set, in particular when undersampling was performed. The results of undersampling are promising, but as the number of minority observations m decreases this method becomes less useful, since the final number of observations in the resampled data set is 2m. Interestingly, ROSE tended to perform better on data with AR(1) covariance than on data with independent covariance, making it a highly attractive option when multicollinearity is present.</p><p>Many facets of this investigation warrant future work. One could consider new predictive modeling techniques which reduce false positive results. Also, there are historical time series data from the ACS that could be incorporated into this work, and the LSA status is available for all of the U.S. on PolicyMap. This would enable a more systematic investigation of the parameters predicted to be associated with food desert status, to gain insight into the socioeconomic and demographic profiles of food deserts in various regions of the U.S. The food insecurity data had only 3% minority observations, but our simulated data contained approximately 10% minority observations due to instability in the logistic regression. 
It is worthwhile to address this issue as well.</p></div></body>
		</text>
</TEI>
