skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Categorical variable coding for machine learning in engineering education
This work-in-progress research paper describes a study of different categorical data coding procedures for machine learning(ML) in engineering education. Often left out of methodology sections, preprocessing steps in data analysis can have important ramifications on project outcomes. In this study, we applied three different coding schemes (i.e., scalar conversion, one-hot encoding, and binary) for the categorical variable of Race across three different ML models (i.e., Neural Network, Random Forest, and Naive Bayes classifiers) looking at the four standard measures of ML classification models (i.e., accuracy, precision, recall, and F1-score). Results showed that, in general, the coding scheme did not affect predictive outcomes as much as ML model type did. However, one-hot encoding – the strategy of transforming a categorical variable with k possible values to k binary nodes, a common practice in educational research – does not work well with a Naive Bayes classifier model. Our results indicate that such sensitivity studies at the beginning of ML modeling projects are necessary. Future work includes performing a full range of sensitivity studies on our complete, grant-funded project dataset that has been collected, and publishing our findings.  more » « less
Award ID(s):
2335725
PAR ID:
10572151
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
Subject(s) / Keyword(s):
engineering education persistence expectancy-value theory machine learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The COVID-19 pandemic was a catalyst for many different trends in our daily life worldwide. While there has been an overall rise in cybercrime during this time, there has been relatively little research done about malicious COVID-19 themed AndroidOS applications. With the rise in reports of users falling victim to malicious COVID-19 themed AndroidOS applications, there is a need to learn about the detection of malware for pandemics-themed mobile apps.. In this project, we extracted the permissions requests from 1959 APK files from a dataset containing benign and malware COVID-19 themed apps. We then created and compared eight unique models of four varying classifiers to determine their ability to identify potentially malicious APK files based on the permissions the APK file requests: support vector machine, neural network, decision trees, and categorical naive bayes. These classifiers were then trained using Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset due to the lack of samples of malware compared to non-malware APKs. Finally, we evaluated the models using K-Fold Cross-Validation and found the decision tree classifier to be the best performing classifier. 
    more » « less
  2. The profession debates how to encode a categorical variable for input to machine learning algorithms, such as neural networks. A conventional approach is to convert a categorical variable into a collection of binary variables, which causes a burdensome number of correlated variables. TerrSet’s Land Change Modeler proposes encoding a categorical variable onto the continuous closed interval from 0 to 1 based on each category’s Population Evidence Likelihood (PEL) for input to the Multi-Layer Perceptron, which is a type of neural network. We designed examples to test the wisdom of these encodings. The results show that encoding a categorical variable based on each category’s Sample Empirical Probability (SEP) produces results similar to binary encoding and superior to PEL encoding. The Multi-Layer Perceptron’s sigmoidal smoothing function can cause PEL encoding to produce nonsensical results, while SEP encoding produces straightforward results. We reveal the encoding methods by illustrating how a dependent variable gains across an independent variable that has four categories. The results show that PEL can differ substantially from SEP in ways that have important implications for practical extrapolations. If users must encode a categorical variable for input to a neural network, then we recommend SEP encoding, because SEP efficiently produces outputs that make sense. 
    more » « less
  3. IntroductionAI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing datasets, has gained significant attention as a solution to data scarcity. In particular, diffusion models have become a powerful technique for generating synthetic data, especially in fields like computer vision. MethodsThis paper explores the potential of diffusion models to generate synthetic tabular data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model adaptable to any tabular dataset and capable of handling various feature types, was utilized with different amounts of generated data for data augmentation. Additionally, reweighting samples from AIF360 was employed to further enhance AI fairness. Five traditional machine learning models—Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF)—were used to validate the proposed approach. Results and discussionExperimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification. 
    more » « less
  4. Abstract Previous work using logistic regression suggests that cognitive control‐related frontoparietal activation in early psychosis can predict symptomatic improvement after 1 year of coordinated specialty care with 66% accuracy. Here, we evaluated the ability of six machine learning (ML) algorithms and deep learning (DL) to predict “Improver” status (>20% improvement on Brief Psychiatric Rating Scale [BPRS] total score at 1‐year follow‐up vs. baseline) and continuous change in BPRS score using the same functional magnetic resonance imaging‐based features (frontoparietal activations during the AX‐continuous performance task) in the same sample (individuals with either schizophrenia (n =65, 49M/16F, mean age 20.8 years) or Type I bipolar disorder (n= 17, 9M/8F, mean age 21.6 years)). 138 healthy controls were included as a reference group. “Shallow” ML methods included Naive Bayes, support vector machine, K Star, AdaBoost, J48 decision tree, and random forest. DL included an explainable artificial intelligence (XAI) procedure for understanding results. The best overall performances (70% accuracy for the binary outcome and root mean square error = 9.47 for the continuous outcome) were achieved using DL. XAI revealed left DLPFC activation was the strongest feature used to make binary classification decisions, with a classification activation threshold (adjusted beta = .017) intermediate to the healthy control mean (adjusted beta = .15, 95% CI = −0.02 to 0.31) and patient mean (adjusted beta = −.13, 95% CI = −0.37 to 0.11). Our results suggest DL is more powerful than shallow ML methods for predicting symptomatic improvement. The left DLPFC may be a functional target for future biomarker development as its activation was particularly important for predicting improvement. 
    more » « less
  5. We study the convergence properties of the Expectation-Maximization algorithm in the Naive Bayes model. We show that EM can get stuck in regions of slow convergence, even when the features are binary and i.i.d. conditioning on the class label, and even under random (i.e. non worst-case) initialization. In turn, we show that EM can be bootstrapped in a pre-training step that computes a good initialization. From this initialization we show theoretically and experimentally that EM converges exponentially fast to the true model parameters. Our bootstrapping method amounts to running the EM algorithm on appropriately centered iterates of small magnitude, which as we show corresponds to effectively performing power iteration on the covariance matrix of the mixture model, although power iteration is performed under the hood by EM itself. As such, we call our bootstrapping approach “power EM.” Specifically for the case of two binary features, we show global exponentially fast convergence of EM, even without bootstrapping. Finally, as the Naive Bayes model is quite expressive, we show as corollaries of our convergence results that the EM algorithm globally converges to the true model parameters for mixtures of two Gaussians, recovering recent results of Xu et al.’2016 and Daskalakis et al. 2017. 
    more » « less