<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Application of Convolutional Neural Network Algorithms for Advancing Sedentary and Activity Bout Classification</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10227390</idno>
					<idno type="doi">10.1123/jmpb.2020-0016</idno>
					<title level='j'>Journal for the Measurement of Physical Behaviour</title>
<idno>2575-6605</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Supun Nakandala</author><author>Marta M. Jankowska</author><author>Fatima Tuz-Zahra</author><author>John Bellettiere</author><author>Jordan A. Carlson</author><author>Andrea Z. LaCroix</author><author>Sheri J. Hartman</author><author>Dori E. Rosenberg</author><author>Jingjing Zou</author><author>Arun Kumar</author><author>Loki Natarajan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Background: Machine learning has been used for classification of physical behavior bouts from hip-worn accelerometers; however, this research has been limited due to the challenges of directly observing and coding human behavior “in the wild.” Deep learning algorithms, such as convolutional neural networks (CNNs), may offer better representation of data than other machine learning algorithms without the need for engineered features and may be better suited to dealing with free-living data. The purpose of this study was to develop a modeling pipeline for evaluation of a CNN model on a free-living data set and compare CNN inputs and results with the commonly used machine learning random forest and logistic regression algorithms. Method: Twenty-eight free-living women wore an ActiGraph GT3X+ accelerometer on their right hip for 7 days. A concurrently worn thigh-mounted activPAL device captured ground truth activity labels. The authors evaluated logistic regression, random forest, and CNN models for classifying sitting, standing, and stepping bouts. The authors also assessed the benefit of performing feature engineering for this task. Results: The CNN classifier performed best (average balanced accuracy for bout classification of sitting, standing, and stepping was 84%) compared with the other methods (56% for logistic regression and 76% for random forest), even without performing any feature engineering. Conclusion: Using the recent advancements in deep neural networks, the authors showed that a CNN model can outperform other methods even without feature engineering. This has important implications for both the model’s ability to deal with the complexity of free-living data and its potential transferability to new populations.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Numerous studies have shown that sedentary behavior can be deleterious to human health. This research has demonstrated that even for individuals with moderate levels of physical activity, the overall amount and the pattern of sitting have been connected to health outcomes such as cardiovascular disease, diabetes, and cancer mortality <ref type="bibr">(Bellettiere et al., 2019;</ref><ref type="bibr">Chang et al., 2020;</ref><ref type="bibr">Knaeps et al., 2018)</ref>. Accurately quantifying sedentary behavior is the foundation of studying its relationship with health. Previously, many studies assessed sedentary behavior with self-reported questionnaires <ref type="bibr">(Patterson et al., 2018)</ref>. Due to the ubiquity of sitting behaviors, self-reporting of sedentary time is subject to high recall bias, leading to unreliable or inaccurate results in younger <ref type="bibr">(Atkin et al., 2012)</ref> and older adults <ref type="bibr">(LaMonte et al., 2019)</ref>.</p><p>There is increasing interest among researchers and health care providers in objective methods for measuring sedentary time and patterns; such measurements have been most commonly achieved using hip-worn accelerometers. In a review of 46 studies of sedentary behavior using objective measurement methods, 34 utilized a hip-worn accelerometer; 31 out of these 34 used an ActiGraph device <ref type="bibr">(Powell, Herring, Dowd, Donnelly, &amp; Carson, 2018)</ref>. Objective measurement of an adult's sedentary time from hip-worn accelerometers is most often quantified using a cut-point-based threshold of &lt;100 counts/min that is applied to the vertical axis <ref type="bibr">(Migueles et al., 2017)</ref>, even on triaxial accelerometers. 
While this approach has good accuracy for measuring the total amount of time spent sedentary, it misclassifies standing without ambulation and vehicle sitting, and is inaccurate for measuring sit-to-stand transitions and other sitting pattern metrics <ref type="bibr">(Barreira, Zderic, Schuna, Hamilton, &amp; Tudor-Locke, 2015;</ref><ref type="bibr">Carlson et al., 2019)</ref>.</p><p>The desire for more accurate measurement of free-living behavior has led to alternate data processing techniques, such as machine learning (ML) algorithms. Numerous studies from the computer science domains have demonstrated the utility of ML methods for successful human activity recognition from sensor and accelerometer data <ref type="bibr">(Ramasamy Ramamurthy &amp; Roy, 2018)</ref>. In a recent review focused specifically on ML models for predicting type, class, and intensity of physical/sedentary activity domains using data acquired from a single body-fixed accelerometer, 62 studies were identified as using a variety of ML models, including artificial neural networks (32), support vector machines (18), random forest (RF) (12), decision trees (11), and logistic regression (LR) (7) <ref type="bibr">(Farrahi, Niemel&#228;, Kangas, Korpelainen, &amp; J&#228;ms&#228;, 2019)</ref>. <ref type="bibr">Farrahi et al.</ref> noted that most of the studies included in the review trained ML models on laboratory or prescribed activity data sets, leading to high accuracy levels. However, once algorithms are applied to free-living populations, accuracy falls below 80% <ref type="bibr">(Farrahi et al., 2019)</ref>. 
Possible reasons for this include skew of behaviors of interest in natural settings (e.g., very few transitions or stepping behavior compared with sedentary time) and data exclusion from laboratory-based data sets of messy or nonclassifiable behaviors, which are often abundantly present in free-living data <ref type="bibr">(Bastian et al., 2015;</ref><ref type="bibr">Sasaki et al., 2016)</ref>. It is becoming increasingly apparent that in order for ML algorithms to become more generalizable, they will need to be trained and calibrated on free-living populations that can provide continuous and multiday data in order to adequately account for a diversity of behaviors as well as to obtain enough data to potentially balance class types <ref type="bibr">(Keadle, Lyden, Strath, Staudenmayer, &amp; Freedson, 2019;</ref><ref type="bibr">Kerr et al., 2018)</ref>.</p><p>Research on the application of deep learning methods to human activity recognition is showing promising results in computer science domains, which may translate into improving the flexibility and generalizability of activity classification <ref type="bibr">(Nweke, Teh, Al-Garadi, &amp; Alo, 2018;</ref><ref type="bibr">Wang, Chen, Hao, Peng, &amp; Hu, 2019)</ref>. Two main types of deep models have been applied in activity recognition: convolutional neural networks (CNNs) <ref type="bibr">(Krizhevsky, Sutskever, &amp; Hinton, 2012)</ref> and long short-term memory models <ref type="bibr">(Guan &amp; Pl&#246;tz, 2017)</ref>. The hallmark feature of deep neural network models is their ability to learn relevant features without relying on hand-engineered features (e.g., researcher-processed features, such as mean vector magnitude), which can take considerable data processing and development time, potentially introduce bias, and make the generalizability of an algorithm to new populations challenging. 
The ability to learn relevant features is particularly useful for accelerometer data in free-living settings (relative to laboratory settings) due to the variability and complexity of behavior during free-living. CNNs have been shown to excel at adapting to new data sets, opening up possibilities for reducing the need for ML models to be trained for each new cohort or context <ref type="bibr">(Rokni, Nourollahi, &amp; Ghasemzadeh, 2018;</ref><ref type="bibr">Saeedi, Norgaard, &amp; Gebremedhin, 2018)</ref>. These aspects of CNNs make them a good ML candidate for identifying physical behaviors in free-living populations as well as offering flexibility in transferring developed CNNs from one population to another.</p><p>The objective of this study was to develop a modeling pipeline to evaluate a CNN model on a free-living data set of 28 individuals and compare CNN results with the commonly used ML RF and LR algorithms. We built off of previous work that developed an RF classifier for estimating sedentary time using several engineered features <ref type="bibr">(Kerr et al., 2018)</ref>, and in this study, we detailed considerations and steps for application of a CNN model to the same data set, comparing the use of engineered and raw features. Our goal was to differentiate sitting postures from upright postures (sitting, standing, and stepping) in raw triaxial accelerometer data obtained from hip-worn accelerometers during free-living for a 7-day period during waking hours. Ground truth activity labels were produced from a thigh-mounted activPAL device (PAL Technologies, Glasgow, Scotland, United Kingdom), which contains starting and ending events of standing, stepping, and sitting. 
ActivPAL has been shown to be a good measure of sitting time and of sit-stand transitions and has been used in previous studies for ground truth posture labeling <ref type="bibr">(Barreira et al., 2015;</ref><ref type="bibr">Carlson et al., 2019;</ref><ref type="bibr">Kerr et al., 2018;</ref><ref type="bibr">Powell et al., 2018)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data</head><p>Data for this study were collected from women (mean age = 62.7 years, SD = 7.3 years) enrolled in a cross-sectional study of sedentary behavior and breast cancer-related biomarkers among breast cancer survivors. Eligible participants were women diagnosed with Stages I-III breast cancer within the past 5 years who had completed active treatment (e.g., radiation, chemotherapy) and were fluent in English. Women were excluded if they had a primary or recurrent invasive cancer within the last 10 years (other than nonmelanomic skin cancer or carcinoma of the cervix in situ), were over 85 years of age, recently had bariatric surgery, were taking insulin or corticosteroid medications, or were diabetic <ref type="bibr">(Hartman et al., 2018)</ref>. All participants provided written informed consent, and ethical approval was obtained from the institutional review board of University of California, San Diego.</p><p>Data collection included two accelerometers (hip ActiGraph and thigh-worn activPAL). While all 30 participants had hip accelerometer data, two participants did not have thigh accelerometer data, and the final n for the study was 28. Participants wore an ActiGraph GT3X+ accelerometer device (ActiGraph LLC, Pensacola, FL) on a belt on the hip for 7 days during waking hours for an average of 854 min/day (SD of 46.7 min). Raw accelerometer data were collected in a time series format recorded at a 30-Hz frequency on three axes. Participants also wore the activPAL triaxial accelerometer (PAL Technologies Ltd., Glasgow, Scotland) on the anterior aspect of the thigh over the same 7-day period. Event files were output from activPAL software (version 7.2.32; PAL Technologies) as a time series with starting and ending times of sitting, standing, and stepping bouts. GT3X+ and activPAL data were time-stamp matched to create one output file at the resolution of 30 Hz for each user. 
The lower resolution activPAL data were repeated to match the higher resolution GT3X+ data, resulting in 9,239,038 s of concomitant activPAL and ActiGraph data across all participants and days. Periods of nonwear time greater than 60 s were identified using the Choi algorithm applied to the ActiGraph data <ref type="bibr">(Choi, Ward, Schnelle, &amp; Buchowski, 2012)</ref>. Nonwear time was then removed from the combined (activPAL and ActiGraph) output file.</p></div>
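<div xmlns="http://www.tei-c.org/ns/1.0"><p>The label-repetition step described above can be illustrated with a minimal sketch. This is not the study's code: the function name expand_events_to_samples, the (start, end, label) event layout, and the "unknown" placeholder for uncovered samples are assumptions made for illustration. Only the 30-Hz rate comes from the text.</p>
<ab type="code"><![CDATA[
```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 30  # GT3X+ sampling rate reported in the text


def expand_events_to_samples(
    events: List[Tuple[float, float, str]],
    n_samples: int,
    rate_hz: int = SAMPLE_RATE_HZ,
) -> List[str]:
    """Repeat each event label for every sample that falls inside the event.

    events: hypothetical (start_s, end_s, label) tuples, in seconds from the
    start of the recording. Samples not covered by any event stay "unknown".
    """
    labels = ["unknown"] * n_samples
    for start_s, end_s, label in events:
        first = max(0, int(round(start_s * rate_hz)))
        last = min(n_samples, int(round(end_s * rate_hz)))
        for i in range(first, last):
            labels[i] = label
    return labels


# Toy event list invented for illustration: 2 s sitting, then 1 s standing.
events = [(0.0, 2.0, "sitting"), (2.0, 3.0, "standing")]
labels = expand_events_to_samples(events, n_samples=4 * SAMPLE_RATE_HZ)
```
]]></ab>
<p>In this toy case, the first 60 samples carry the sitting label, the next 30 carry the standing label, and the final second remains unlabeled.</p></div>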
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Exploratory Analysis-ActivPAL Data</head><p>Exploratory data analysis of the activPAL data was conducted prior to setting up ML models in order to assist with modeling decisions. Time distributions of the three activity types (sitting, standing, and stepping) were evaluated both in aggregate (across participants and days) and as bout-time distributions (Figure <ref type="figure">1</ref>). Figure <ref type="figure">1a</ref> shows a large skew between the three activity types. Sitting accounted for an average of 57% of total activity time, while stepping accounted for 13%. Box plots of the activity types (Figure <ref type="figure">1b</ref>) showed large variations in the bout times of the activities. The median bout times were more than 2 min for sitting, 14 s for standing, and 7 s for stepping. Furthermore, 18% of stepping bouts and 16% of standing bouts were less than 3 s long. Figure <ref type="figure">1</ref> illustrates a cohort that did significantly more sitting than standing or stepping, often for longer periods that did not involve many transitions between sitting and standing.</p><p>Further data exploration was conducted by visualizing the raw accelerometer data. Figure <ref type="figure">2</ref> shows eight random 5-s time windows of accelerometer instances for the three activPAL activity classes. Overall, variation in the accelerometer data was low for sitting and standing and high for stepping activity, as expected. However, outliers in these patterns were apparent, such as the fifth example of sitting, which has high variation and is unlikely to be a true sitting instance. Similarly, the second example of standing exhibits no variation and is unlikely to be a standing instance. 
These examples of discrepancies between the GT3X+ data and activPAL activity labels demonstrate the unique challenge for classifying activities in free-living compared with the laboratory data commonly used in the literature.</p><p>The exploratory data analysis provided information for decision points in the remaining analysis, which had to account for: (a) the highly skewed nature of the data toward sitting time, (b) high variation among bout times of different activities in which sitting occurred for longer periods than stepping or standing, (c) unreliability of very small bout times, and (d) the need for a filtering procedure to remove highly unlikely activPAL labels.</p></div>
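<div xmlns="http://www.tei-c.org/ns/1.0"><p>The bout-level summaries reported above (median bout time per activity and the share of bouts shorter than 3 s) could be computed along the following lines. This is a sketch under stated assumptions: the function name bout_summary and the (label, duration) input layout are hypothetical, and the toy durations are invented and are not study data.</p>
<ab type="code"><![CDATA[
```python
from collections import defaultdict
from statistics import median


def bout_summary(bouts, short_threshold_s=3.0):
    """Return {label: (median_duration_s, share_of_bouts_below_threshold)}.

    bouts: iterable of hypothetical (label, duration_s) pairs.
    """
    by_label = defaultdict(list)
    for label, duration_s in bouts:
        by_label[label].append(duration_s)
    summary = {}
    for label, durations in by_label.items():
        n_short = sum(1 for d in durations if d < short_threshold_s)
        summary[label] = (median(durations), n_short / len(durations))
    return summary


# Toy bouts invented for illustration only.
toy_bouts = [
    ("sitting", 150.0), ("sitting", 90.0), ("sitting", 300.0),
    ("stepping", 2.0), ("stepping", 7.0), ("stepping", 12.0), ("stepping", 8.0),
]
summary = bout_summary(toy_bouts)
```
]]></ab>
<p>The 3-s default mirrors the threshold used in the text to describe very short bouts.</p></div>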
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prediction Time Window</head><p>The time window was used to extract windowed input from the time series data for feature engineering and to be fed into the ML algorithm. In the literature, different temporal window (or input context) sizes have been used, ranging from 1 to 60 s <ref type="bibr">(Farrahi et al., 2019)</ref>. In selecting a time window size, we considered activity bout times in the data set (which were small for stepping and standing bouts) and confidence in activPAL labels for very small bout times. More than 80% of activity bouts in the data set had bout times greater than 3 s; therefore, we selected a sliding window of 3 s (3 &#215; 30 = 90 data points) as our input context, which also served to reduce noise from bouts smaller than 3 s. The sliding window approach was applied to a continuous stream of input data, called segments, in order to extract the input contexts. Within these segments, the sliding window may map into regions in which the activPAL ground truth label is not the same; in other words, the window is entering into a transition with another label. Because the sliding window is small, we filtered out these border cases and considered only the time windows for which the ground truth label was consistent in the window. In total, 3.4% of the total data set was removed in filtering border cases.</p></div>
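<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the windowing and border-case filtering described above follows. The 90-sample window comes from the text; the function name consistent_windows is hypothetical, and the stride is an assumption (the text does not state it), so it is exposed as a parameter that defaults to non-overlapping windows.</p>
<ab type="code"><![CDATA[
```python
WINDOW = 90  # 3 s x 30 Hz, as stated in the text


def consistent_windows(samples, labels, window=WINDOW, step=WINDOW):
    """Return (window_samples, label) pairs whose labels all agree.

    Windows that straddle a transition between two labels (border cases)
    are dropped. The stride `step` is an assumption, not a source value.
    """
    kept = []
    for start in range(0, len(samples) - window + 1, step):
        window_labels = labels[start:start + window]
        if len(set(window_labels)) == 1:
            kept.append((samples[start:start + window], window_labels[0]))
    return kept


# Toy segment: 4 s of sitting followed by 5 s of standing (invented values).
samples = [(0.0, 0.0, 1.0)] * 270
labels = ["sitting"] * 120 + ["standing"] * 150
windows = consistent_windows(samples, labels)
```
]]></ab>
<p>Here the middle window (samples 90 to 179) spans the sitting-to-standing transition and is discarded, leaving one sitting and one standing window.</p></div>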
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Filtering</head><p>Filtering consisted of two steps: gravity removal and unlikely label removal. Participants were not instructed on which way (up or down) to wear the accelerometer, resulting in discrepancies in the orientation of the accelerometer between participants. Separating out the gravity component from the accelerometer signal helped to determine the orientation of the accelerometer device, which could differ between moments for the same person and between people. Gravity filtering was performed by applying a low-pass filter to the time series data, as shown in the following pseudocode. The algorithm took as input an accelerometer window acc, an n &#215; 3 real-valued matrix, where n is the number of time steps in the window. For a 3-s input window, n is equal to 90 (30 &#215; 3). Removing the gravity components transforms all axial acceleration components to the same scale and amplifies the local changes in the signal. Figure <ref type="figure">3</ref> shows examples of sitting, standing, and stepping activities before and after removing the gravity components from the accelerometer signal.</p><p>Algorithm: Remove Gravity Component<lb/>1: procedure RemoveGravity(acc)<lb/>2: &#945; = .9<lb/>3: temp &#8592; ZEROS(acc.shape)</p><p>The second step in filtering removed unlikely activPAL labels for sitting, standing, and stepping, in line with the previously discussed visual inspection (e.g., Figure <ref type="figure">2</ref>, Window 5 for sitting and Window 2 for standing). Wrong labels introduce noise into the ML models and hinder their learning and generalization capabilities. Therefore, we removed likely false labels using a simple heuristic. If a person is standing or stepping, the chance of the SD of their total acceleration being close to zero is very low. Input time windows with labels corresponding to standing or stepping activities with an SD &#8804; 10&#8315;&#8308; were removed.</p></div>
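<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two filtering steps can be sketched as follows. The text specifies only that gravity was removed with a low-pass filter, that &#945; = .9, and that a zero-initialized array temp was used; the recursive smoothing form below, and seeding the gravity estimate with the first sample, are assumptions and may differ from the study's algorithm. Treating total acceleration as the vector magnitude of the three axes is likewise an assumption. The function names are hypothetical.</p>
<ab type="code"><![CDATA[
```python
import math
from statistics import pstdev

ALPHA = 0.9  # smoothing factor given in the source pseudocode


def remove_gravity(acc, alpha=ALPHA):
    """Subtract a low-pass (exponentially smoothed) gravity estimate.

    acc: list of (x, y, z) samples for one window (n = 90 for 3 s at 30 Hz).
    Assumption: the gravity estimate is seeded with the first sample.
    """
    gravity = list(acc[0])
    linear = []
    for sample in acc:
        # Low-pass update: slow-moving part of the signal tracks gravity.
        gravity = [alpha * g + (1 - alpha) * a for g, a in zip(gravity, sample)]
        linear.append(tuple(a - g for a, g in zip(sample, gravity)))
    return linear


def is_unlikely_upright(acc, label, sd_threshold=1e-4):
    """Flag standing/stepping windows whose total acceleration barely varies."""
    if label not in ("standing", "stepping"):
        return False
    # Assumption: total acceleration = vector magnitude of the three axes.
    magnitude = [math.sqrt(x * x + y * y + z * z) for x, y, z in acc]
    return pstdev(magnitude) <= sd_threshold


still = [(0.0, 0.0, 1.0)] * 90                                  # constant signal
moving = [(0.0, 0.0, 1.0 + 0.2 * (i % 2)) for i in range(90)]   # alternating signal
```
]]></ab>
<p>With these toy windows, a perfectly constant window labeled standing is flagged for removal, the same window labeled sitting is kept, and the alternating window is kept for any label.</p></div>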
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature Engineering</head><p>Increased discriminative power, reduction of noise, and removal of gravity can be achieved from feature engineering and extraction of triaxial accelerometer data. Removal of gravity helps to account for orientation mismatches between devices and acts as a data standardization step. In this study, we employed a feature engineering procedure for each 3-s window utilizing our group's previously developed and well-documented procedures, resulting in a total of 41 feature vectors <ref type="bibr">(Ellis, Godbole, et al., 2014;</ref><ref type="bibr">Ellis, Kerr, et al., 2014;</ref><ref type="bibr">Kerr et al., 2016)</ref>. Engineered features were divided into two main groups: time-domain features and frequency-domain features. Time-domain features included mean, SD, and percentiles, and frequency-domain features included entropy and power of certain frequencies, which were obtained by performing fast Fourier transform over the temporal window. All engineered features are listed and described in Supplementary Table <ref type="table">1</ref> (available online). Engineered features were standardized into a (-1, 1) scale.</p></div>
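<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the two feature groups concrete, the sketch below computes a handful of time-domain and frequency-domain features for one 3-s window. It is not the 41-feature procedure cited above: the specific features, the naive discrete Fourier transform (used here in place of a fast Fourier transform to avoid external libraries), and min-max scaling as the way to reach a (-1, 1) range are all assumptions chosen for illustration.</p>
<ab type="code"><![CDATA[
```python
import cmath
import math
from statistics import mean, pstdev

RATE_HZ = 30  # sampling rate from the text


def vector_magnitude(acc):
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in acc]


def power_spectrum(signal):
    """Naive DFT power at non-negative frequencies (O(n^2), fine for n = 90)."""
    n = len(signal)
    centre = mean(signal)
    centred = [s - centre for s in signal]
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centred))
        powers.append(abs(coeff) ** 2)
    return powers


def window_features(acc, rate_hz=RATE_HZ):
    """Return a few illustrative features for one window of (x, y, z) samples."""
    vm = vector_magnitude(acc)
    powers = power_spectrum(vm)
    total = sum(powers)
    entropy = 0.0
    if total > 0:
        entropy = -sum((p / total) * math.log(p / total)
                       for p in powers if p / total > 0)
    bin_1hz = round(1.0 * len(vm) / rate_hz)  # DFT bin closest to 1 Hz
    return {
        "mean_vm": mean(vm),
        "sd_vm": pstdev(vm),
        "power_1hz": powers[bin_1hz],
        "spectral_entropy": entropy,
    }


def scale_to_unit_range(values):
    """Min-max scale one feature column to [-1, 1] (assumed scaling method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


# Toy window: a 1-Hz oscillation on the z axis (invented signal).
acc = [(0.0, 0.0, 1.0 + 0.1 * math.sin(2 * math.pi * t / RATE_HZ)) for t in range(90)]
features = window_features(acc)
spectrum = power_spectrum(vector_magnitude(acc))
```
]]></ab>
<p>For a 90-sample window at 30 Hz, the frequency resolution is 1/3 Hz, so the 1-Hz component falls in the fourth bin (index 3), which is where the toy signal concentrates its power.</p></div>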
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ML Models</head><p>Three model sets were run (LR, RF, and CNN) for the purpose of classifying sitting, standing, and stepping activities. For all models, a train-validation-test split of 60-20-20 percentages was used.</p><p>The split was based on individual users (16 users were selected as training data and six users each were selected as validation and test data) rather than sampling from the data pool of all users in order to allow for generalization to unseen users in the future. Models were trained to predict ground truth (activPAL) labels in independent 3-s intervals, thus ignoring the ordered time sequence and potential dependency across time intervals. Hyperparameters are model-specific configurations that cannot be directly learned from the data, while parameters can be learned from the data. Hyperparameters are manually tuned until the model with the best prediction accuracy is found. The training data set is used to estimate the parameters of a model for prespecified hyperparameters, a validation set is used to find the best hyperparameters among the set of all hyperparameters used, and the test set is used to estimate the accuracy of the final model and hyperparameter selected. For tuning the hyperparameters in the models, the validation data set was used. Balanced accuracy rate (BAR) was chosen to be the performance metric for the study due to the significant activity class skew in the data set. BAR is the proportional average of accuracy in each category or class. Accuracy is the proportion of correctly classified instances out of the total. BAR is preferred over regular accuracy if the data are imbalanced. Results report BARs for training, validation, and test data sets for all the hyperparameters evaluated, and the test accuracy corresponding to the model that had the best validation accuracy was selected as the final accuracy metric for comparison. 
Classification reports were generated for the best model with precision, recall, and leave-one-out validation results. Precision is the proportion of correctly classified instances out of the predicted instances for each respective class. Recall is the proportion of correctly classified instances out of the actual or ground truth instances for each respective class. Leave-one-out cross-validation accuracy is estimated by training the model on all participants except one; the data of the participant who is left out are used as a test set. This process is repeated for all participants, and the average accuracy across all the left-out participants is the leave-one-out cross-validation accuracy.</p><p>Our baseline model utilized an LR model, which was run using the scikit-learn (version 0.20.0) ML library in Python (version 2.7; Python Software Foundation, Fredericksburg, VA). Because the problem was a multiclass classification problem, we used multinomial logistic loss as the loss function. We used L2 regularization and tuned the model parameters using a validation data set. When feeding in raw data, triaxial accelerometer data in the 3-s window were flattened to produce a one-dimensional (1D) feature vector, which did not preserve the time series information. Flattening can be thought of as representing a matrix in row major order. The LR model was run on raw data (with and without removing the gravity component) and using engineered features.</p><p>The second model was an RF model, which was also run using scikit-learn. RF models are classifiers with high representational power compared with simpler linear models like LR. Thus, they have more learning and generalization capability. However, if not properly regularized, they tend to overfit the training data due to their high representational power. The RF model used in this study had 100 trees and was regularized by setting the maximum depth of the trees, which we tuned by using a validation data set. 
The RF model was run on raw triaxial accelerometer data (with and without removing the gravity component) and using engineered features.</p><p>The third model was a CNN model. CNNs are a specialized form of neural network that excel at exploiting the spatial locality of information among features, such as relationships between neighboring pixels in images. Therefore, they have yielded near-human accuracy on benchmark image classification tasks, such as ImageNet <ref type="bibr">(Krizhevsky et al., 2012)</ref>. This same notion of locality of information can also be applied to the time domain, with temporal patterns resembling pixel variations. The dimensionality in this application is reduced from a two-dimensional (2D) spread of pixels to a 1D spread of time series values. Thus, such CNNs are also called 1D CNNs.</p><p>We trained a 1D CNN on raw accelerometer data as well as gravity component-removed accelerometer data. The CNN model had 6 layers, including convolution, pooling, dropout <ref type="bibr">(Srivastava, Hinton, Krizhevsky, Sutskever, &amp; Salakhutdinov, 2014)</ref>, fully connected, and softmax layers. The architecture of the CNN model is detailed in Supplementary Table 2 (available online). For dropout layers, a keep probability of .5 was used, and cross entropy loss was deployed as the loss function for training. To account for the significant class skew present in the data, we modified the cross entropy loss by multiplying it with values proportional to the class frequencies per the following equation, where I_l(&#183;) is the indicator function, and &#945;_l is a value proportional to the class frequencies of label l.</p><p>The CNN model was trained on the complete data set using back propagation and the Adam optimizer for 15 iterations <ref type="bibr">(Kingma &amp; Ba, 2015)</ref>. The learning rate was tuned using a validation data set. 
Furthermore, we drew on recent work on wide and deep networks, which combined learning from deep neural networks and structured or engineered features, to generate a modified version of our CNN model called wide-CNN <ref type="bibr">(Cheng et al., 2016)</ref>. Wide-CNN essentially augments the CNN model by concatenating the flattened Pooling5 features with the previously described hand-engineered features and adds two more fully connected layers. The architecture of the wide-CNN model is shown in Supplementary Figure <ref type="figure">1</ref> (available online). This model was trained like the original CNN model, and the learning rate was tuned using a validation data set. The final wide-CNN model and test data set are available through the GitHub repository github.com/ADALabUCSD/DeepPostures.</p></div>
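<div xmlns="http://www.tei-c.org/ns/1.0"><p>The class-weighted cross entropy described for the CNN can be sketched as below. The equation itself is not reproduced in this text, so the form used here, L = -(1/N) &#931;_i &#931;_l &#945;_l I_l(y_i) log p_i,l, is an assumed reading built from the stated ingredients (the indicator function I_l and the per-class weight &#945;_l); averaging over the N windows is also an assumption. The weights are left to the caller, and the values in the example are invented.</p>
<ab type="code"><![CDATA[
```python
import math


def weighted_cross_entropy(probs, labels, alpha):
    """Mean over windows of -alpha[y_i] * log(p_i[y_i]).

    probs:  per-window class probabilities (e.g., softmax outputs).
    labels: integer class index y_i for each window.
    alpha:  per-class weights alpha_l (how they are derived is not assumed here).
    The indicator I_l(y_i) simply selects the term for the true class.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -alpha[y] * math.log(p[y])
    return total / len(labels)


# Toy predictions for two windows over (sitting, standing, stepping).
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
labels = [0, 1]
plain_loss = weighted_cross_entropy(probs, labels, alpha=[1.0, 1.0, 1.0])
weighted_loss = weighted_cross_entropy(probs, labels, alpha=[1.0, 2.0, 1.0])
```
]]></ab>
<p>With unit weights the expression reduces to the ordinary cross entropy; raising the weight of one class increases the penalty for errors on windows of that class.</p></div>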
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Setup</head><p>All experiments were run in a single node machine on CloudLab, a free and flexible cloud for research. The machine had two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz, 192 GB RAM (Intel, Santa Clara, CA), and one Nvidia P100 GPU (Nvidia, Santa Clara, CA). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ground Truth Activity Measures</head><p>On average, over the 7 days of wear time, participants engaged in 452.9 min of sitting per day, 231.7 min of standing, and 93.5 min of stepping. Participants wore devices for an average of 778.1 min per day.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Logistic Regression and RF Models</head><p>Results for the LR model are displayed in Table <ref type="table">1</ref>. Both the LR and RF models were evaluated with raw features, raw features after removing the gravity component, and engineered features. For tuning the L2 regularization factor, four different values were evaluated (0.1, 0.2, 1, and 10). The model with engineered features significantly outperformed the raw feature and gravity-removed feature models, with the highest BAR value at 0.76 for the L2 regularization value of 1. Results for the RF model are shown in Table <ref type="table">2</ref>. The tree depth was evaluated for four different values (20, 40, 60, and 80). For the RF model, BAR values across the three models were more similar than for the LR model, with all reaching values in the 0.7 range. The highest BAR (0.79) was achieved by the engineered features model with a tree depth of 20.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CNN Model</head><p>Results for the CNN and wide-CNN models are displayed in Table <ref type="table">3</ref>. The CNN model was evaluated with raw features and raw features with gravity removed, while the wide-CNN combined raw features with engineered features. Four learning rate values were evaluated for all models (0.01, 0.001, 0.0001, and 0.00001). All CNN models performed well, with most achieving BAR values higher than 0.8. The best performing model was the CNN with gravity removed features at a learning rate of 0.0001, with a BAR result of 0.84. Classification statistics were calculated for the CNN with gravity removed features, because it was the best performing model. Precision values for the three activities (sitting, standing, and stepping) were 0.93, 0.78, and 0.72, respectively, with an average precision value of 0.81. Recall values were 0.92, 0.74, and 0.86, respectively, with an average recall value of 0.84. F1-scores (harmonic mean of precision and recall) were 0.93, 0.76, and 0.78, respectively, with an average of 0.82. Finally, accuracy values were 0.92, 0.74, and 0.85, respectively, with an average value of 0.84. Confusion matrix results for prediction events (3-s windows) (Table <ref type="table">4</ref>) showed that sitting was misclassified as standing 15.7% of the time and as stepping at a rate of 3.1%. Standing was misclassified as sitting 6.7% of the time and as stepping at a rate of 31.0%. Stepping was misclassified as sitting 0.2% of the time and as standing at a rate of 4.6%. Leave-one-out cross-validation was performed on the CNN with gravity removed features model for each individual using a fixed learning rate of 0.0001. The maximum BAR was 0.93, the minimum was 0.67, and the average was 0.84.</p></div>
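<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked check of how the reported classification statistics relate to one another, the sketch below recomputes the F1-scores and the class averages from the precision and recall values given above. It assumes the usual definition of balanced accuracy as the unweighted mean of per-class recall. Because the inputs are already rounded to two decimals, recomputed values can differ from the reported ones by up to about 0.01.</p>
<ab type="code"><![CDATA[
```python
# Values reported in the text for the CNN with gravity-removed features.
precision = {"sitting": 0.93, "standing": 0.78, "stepping": 0.72}
recall = {"sitting": 0.92, "standing": 0.74, "stepping": 0.86}
reported_f1 = {"sitting": 0.93, "standing": 0.76, "stepping": 0.78}


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


f1_scores = {c: f1_score(precision[c], recall[c]) for c in precision}

# Unweighted class averages.
mean_precision = sum(precision.values()) / len(precision)
balanced_accuracy = sum(recall.values()) / len(recall)  # mean per-class recall
mean_f1 = sum(f1_scores.values()) / len(f1_scores)

for name in precision:
    print(name, round(f1_scores[name], 3), "reported:", reported_f1[name])
print("mean precision:", round(mean_precision, 2))
print("balanced accuracy:", round(balanced_accuracy, 2))
```
]]></ab>
<p>The mean of the three recall values reproduces the reported BAR of 0.84, and the mean precision reproduces the reported 0.81.</p></div>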
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>The results provide several insights into the task of identifying activities from accelerometer data using different types of ML models.</p><p>The LR model on raw accelerometer features performed poorly, yielding a BAR of 0.47, which was marginally better than a trivial random baseline classifier accuracy of 0.33. Even removing the gravity component from the accelerometer data did not improve the prediction accuracy. However, when engineered features were introduced, the accuracy was boosted to 0.76, demonstrating the benefit of performing feature engineering. From these results, we can conclude that raw accelerometer features are not easily linearly separable, and the transformation performed by feature engineering makes the features more linearly separable. Results in Table <ref type="table">1</ref> show that the accuracy of the LR model was not sensitive to the L2 regularization parameter. The training and test accuracies were also comparable, indicating that there was not much overfitting. It is also important to note that out of all the models, the LR is the most interpretable. Analysis of the corresponding coefficients of the learned model showed that the highest contributing feature for identifying sitting and stepping activities was the power of the 1 Hz frequency in the frequency domain. For standing, the highest contributing feature was the SD of the vector magnitude of the triaxial acceleration. For the RF models, the top three features were roll, pitch, and mean of vector magnitude of the triaxial acceleration. Unlike the LR model, the RF model performed well even with the raw accelerometer features, yielding a balanced accuracy of 0.73. Removing the gravity component increased the accuracy by 3%, and using engineered features improved the accuracy up to 0.79. The relatively better performance of the RF over the LR model can be attributed to the high representational power of the model. 
However, Table <ref type="table">2</ref> shows how the RF model overfits to the training data, resulting in perfect classification for some tree depths. Regularization of the model by limiting the tree depth is important for improving generalization capability. However, the test accuracy dropped with regularization, suggesting that more labeled data would be needed to mitigate this overfitting.</p><p>The CNN model outperformed the other two models even without using engineered features (raw accelerometer BAR = 0.83, raw accelerometer without gravity component BAR = 0.84). Augmenting the CNN model with engineered features using the wide-CNN architecture did not improve the accuracy. This could be because the CNN model already learned 19 relevant features automatically and including engineered features did not provide new information. Table <ref type="table">3</ref> shows that the CNN models were sensitive to the learning rate parameter.</p><p>When training for 15 iterations, a learning rate of 0.001 yielded the best accuracy for the CNN model with raw features and for the wide-CNN. A learning rate of 0.0001 yielded the best accuracy for the CNN model with raw features without gravity. Learning rates that were too low or too high resulted in suboptimal final accuracies, showing the importance of hyperparameter tuning when training complex neural network ML models. Confusion matrix results showed that the CNN gravity-removed model performed well for identifying sitting/stepping activities but only moderately for identifying standing activities. Precision, recall, and F1-score for sitting were high; however, standing and stepping activities had relatively low precision. Standing also had a low recall, while stepping had a better recall value. 
The model struggled most with identifying standing activities, which can be partially explained by Figure <ref type="figure">2</ref>, in which standing raw accelerometer data at times looks like sitting and at other times like stepping.</p><p>Because the sample size was small in this data set, it was computationally feasible to further evaluate the generalization capability of the CNN model by performing leave-one-out cross-validation. The balanced test accuracies showed that the model performed well for most participants, with an average BAR of 0.84. BAR exceeded 0.75 for all participants except one, whose BAR was 0.67. However, note that the training accuracy reached only 0.86, unlike that of the RF. This suggests that amplifying the representation power of the CNN by making it deeper and larger could be beneficial, under the caveat that it may lead to more overfitting unless there is enough labeled data. We leave such exploration to future work.</p></div>
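<div xmlns="http://www.tei-c.org/ns/1.0"><p>The leave-one-out procedure discussed above can be sketched as follows, assuming that each split holds out all windows of a single participant and trains on the remaining participants. This is a minimal illustration rather than the pipeline used in the study: the function names and the train_and_score callback (which would train a model and return the held-out BAR) are hypothetical.</p>
<eg><![CDATA[
```python
from statistics import mean
from typing import Callable, Dict, Hashable, Iterator, List, Sequence, Tuple

Split = Tuple[Hashable, List[int], List[int]]


def leave_one_participant_out(participant_ids: Sequence[Hashable]) -> Iterator[Split]:
    """Yield (held_out_id, train_indices, test_indices), one split per participant.

    participant_ids[i] is the participant who produced window i.
    """
    # dict.fromkeys keeps the first-appearance order of the participants.
    for held_out in dict.fromkeys(participant_ids):
        train_idx = [i for i, p in enumerate(participant_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train_idx, test_idx


def run_cross_validation(
    participant_ids: Sequence[Hashable],
    train_and_score: Callable[[List[int], List[int]], float],
) -> Dict[Hashable, float]:
    """Call the (hypothetical) train_and_score callback once per held-out participant."""
    return {
        held_out: train_and_score(train_idx, test_idx)
        for held_out, train_idx, test_idx in leave_one_participant_out(participant_ids)
    }


def summarize(scores: Dict[Hashable, float]) -> Dict[str, float]:
    """Report the minimum, maximum, and mean of the per-participant scores."""
    values = list(scores.values())
    return {"min": min(values), "max": max(values), "mean": mean(values)}
```
]]></eg>
<p>Splitting by participant rather than by window keeps windows from the same person out of both the training and test sets, so the per-participant scores reflect performance on an unseen individual.</p></div>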
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions</head><p>This study evaluated the effectiveness of several ML methods, including CNNs, for the task of identifying activity classes from hip-worn accelerometer data in a free-living sample of 28 women. We employed the thigh-worn activPAL to specify the true labels; the activPAL has been shown to be similar to gold standard observations in previous studies <ref type="bibr">(Steeves et al., 2015)</ref>. However, hip or wrist accelerometry is still the most often utilized form of activity measurement in research studies. There is a need to improve not only the accuracy and precision of activity classification from hip-worn accelerometer data but also the generalizability of generated models to other populations and contexts <ref type="bibr">(Farrahi et al., 2019)</ref>. Using the recent advancements in deep neural networks, we showed that a CNN model can outperform other methods and, furthermore, can do so without any feature engineering. Because CNNs learn features from the data itself, they can significantly reduce data processing time, which is a key advantage over previously utilized machine learning models. Furthermore, these models have been shown to be highly generalizable to new populations <ref type="bibr">(Rokni et al., 2018)</ref>.</p><p>Ensuring accurate classification of free-living data with minimal feature engineering through researcher engagement can allow for larger data sets to be analyzed, with better quantification of dose-response relationships between behaviors and health. An important next step in this research will be to independently validate the developed CNN model in a different population in order to test its generalizability. Another avenue of future research will be to apply combined CNN and long short-term memory models, which explicitly model the sequence information of the data. 
Previous research has shown that machine learning models have difficulty in identifying activity transitions, particularly in free-living data <ref type="bibr">(Kerr et al., 2018)</ref>. The application of a combined unstructured and structured ML model may be able to derive improvements for classification of activity transitions.</p><p>Limitations of the study include the small sample size. A larger sample will be especially important for assessing ML approaches in identifying transitions (such as sit to stand), because there tend to be far fewer occurrences of transitional behaviors in free-living populations compared with sitting, standing, and moving. Another limitation was a lack of sufficiently rich temporal features in the engineered data, which may contain useful information for predicting what behavior is most likely to be next within a sequence. In future studies, we will explore the utility of time as a feature in the models by combining CNN and long short-term memory models, which explicitly model longer-term temporal information in the data. A significant limitation of the current study was the exclusion of transitions. Algorithms for identifying behaviors in free-living populations must include identification of transitions from one behavior to another. Future development of the CNN model will focus on transition periods in order to allow for application in large free-living cohort studies.</p><p>Based on our findings in this free-living population, CNN models are a possible tool for dealing with the complexity of free-living data; however, future work focused on transitions is needed. Work in the computer science domain and even public health has relied to a large extent on laboratory or activity-prescribed data sets. 
While these data offer clean examples of activities, with messier transitions often removed, they may yield overly optimistic accuracy values for algorithms whose accuracy then falls when they are applied to free-living data <ref type="bibr">(Farrahi et al., 2019)</ref>. This study provides compelling results for the ability of CNNs to adapt to free-living data.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>(Ahead of Print)</p></note>
		</body>
		</text>
</TEI>
