Unsupervised feature selection aims to select the subset of the original features that is most useful for downstream tasks without any external guidance. Most unsupervised feature selection methods rank features according to intrinsic properties of the data but pay little attention to the relationships between features, which often leads to redundancy among the selected features. In this paper, we propose a two-stage Second-Order unsupervised Feature selection via knowledge contrastive disTillation (SOFT) model that combines the second-order covariance matrix with the first-order data matrix for unsupervised feature selection. In the first stage, we learn a sparse attention matrix that represents second-order relations between features by contrastively distilling the intrinsic structure of the data. In the second stage, we build a relational graph from the learned attention matrix and perform graph segmentation; we then select only one feature from each resulting cluster, which reduces feature redundancy. Experimental results on 12 public datasets show that SOFT outperforms both classical and recent state-of-the-art methods, demonstrating the effectiveness of the proposed approach, and in-depth experiments further explore several key factors of SOFT.
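The abstract gives only a high-level recipe. As a rough illustration of the second stage (not the authors' implementation), the sketch below clusters a feature-relation graph and keeps one representative feature per cluster; the absolute feature-correlation matrix stands in for the sparse attention matrix that SOFT would learn in stage one via contrastive distillation.

```python
# Rough sketch of the second stage only, under assumptions stated above.
import numpy as np
from sklearn.cluster import SpectralClustering

def select_one_feature_per_cluster(X, n_selected):
    # Stand-in relation matrix: |Pearson correlation| between features.
    A = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(A, 0.0)

    # Graph segmentation: spectral clustering on the feature-feature affinity.
    labels = SpectralClustering(
        n_clusters=n_selected, affinity="precomputed", random_state=0
    ).fit_predict(A)

    # Keep the feature most strongly connected within each cluster.
    selected = []
    for c in range(n_selected):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        within = A[np.ix_(members, members)].sum(axis=1)
        selected.append(int(members[np.argmax(within)]))
    return sorted(selected)

# Toy usage: keep 5 of 20 features from random data.
X = np.random.default_rng(0).normal(size=(200, 20))
print(select_one_feature_per_cluster(X, n_selected=5))
```

Choosing the most strongly connected member of each cluster is one simple way to pick a representative; SOFT's actual selection criterion may differ.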
Towards Interpretable Automated Machine Learning for STEM Career Prediction
In this paper, we describe our solution for predicting student STEM career choices in the 2017 ASSISTments Data Mining Competition. We built a machine learning system that automatically reformats the dataset, generates new features and prunes redundant ones, and performs model and feature selection. The system is designed to automatically find a model that optimizes prediction performance, yet the final model is a simple logistic regression that allows researchers to discover important features and study their effects on STEM career choices. We also compared our approach with alternative methods, which revealed that the key to good prediction is proper feature enrichment in the early stage of the data analysis, while feature selection in a later stage allows a simpler final model.
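The pipeline is described only at a high level. The following scikit-learn sketch mirrors its shape (enrich features, prune to a compact subset, fit an interpretable logistic regression); the feature generator, selector, and k below are assumed choices, not the competition system itself.

```python
# Hedged sketch of an enrich-select-interpret pipeline; all component choices are assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

pipeline = Pipeline([
    # Feature enrichment: add pairwise interaction terms to the raw columns.
    ("enrich", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("scale", StandardScaler()),
    # Feature selection: prune the enriched set down to a compact subset.
    ("select", SelectKBest(mutual_info_classif, k=20)),
    # Simple final model so coefficients remain interpretable.
    ("model", LogisticRegression(max_iter=1000)),
])

print(cross_val_score(pipeline, X, y, cv=5).mean())
```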
- PAR ID: 10190398
- Journal Name: Journal of Educational Data Mining
- Volume: 12
- Issue: 2
- ISSN: 2157-2100
- Sponsoring Org: National Science Foundation
More Like this
Background: The number of survivors of cancer is growing, and they often experience negative long-term behavioral outcomes due to cancer treatments. Better computational methods are needed to model and predict these outcomes so that physicians and health care providers can implement preventive treatments.

Objective: This study aimed to create a new feature selection algorithm that improves the ability of machine learning classifiers to predict negative long-term behavioral outcomes in survivors of cancer.

Methods: We devised a hybrid deep learning–based feature selection approach to support early detection of negative long-term behavioral outcomes in survivors of cancer. Within a data-driven, clinical domain–guided framework for selecting the best set of features among cancer treatments, chronic health conditions, and socioenvironmental factors, we developed a 2-stage feature selection algorithm, consisting of a multimetric, majority-voting filter and a deep dropout neural network, that dynamically and automatically selects the best set of features for each behavioral outcome. We also conducted an experimental case study on existing data from 102 survivors of acute lymphoblastic leukemia (aged 15-39 years at evaluation and >5 years post cancer diagnosis) who were treated in a public hospital in Hong Kong. Finally, we designed and implemented radial charts to illustrate the significance of the selected features for each behavioral outcome and to support clinical professionals' future treatment and diagnoses.

Results: In this pilot study, we demonstrated that our approach outperforms traditional statistical and computational methods, including linear and nonlinear feature selectors, on the addressed top-priority behavioral outcomes, achieving higher F1, precision, and recall scores than existing feature selection methods. The models select several significant clinical and socioenvironmental variables as risk factors associated with the development of behavioral problems in young survivors of acute lymphoblastic leukemia.

Conclusions: Our feature selection algorithm has the potential to improve machine learning classifiers' capability to predict adverse long-term behavioral outcomes in survivors of cancer.
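As a rough, assumption-laden illustration of the first stage only, a multimetric, majority-voting filter could look like the sketch below; the specific metrics, top-k size, and vote threshold are placeholders, and the second-stage deep dropout network is omitted.

```python
# Illustrative first stage only (not the authors' implementation): several filter
# metrics each vote for their top-k features; features backed by a majority survive.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

def majority_vote_filter(X, y, k=10, min_votes=2):
    scores = [
        f_classif(X, y)[0],                          # ANOVA F statistic
        mutual_info_classif(X, y, random_state=0),   # mutual information
        np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]),  # |correlation|
    ]
    votes = np.zeros(X.shape[1], dtype=int)
    for s in scores:
        votes[np.argsort(s)[-k:]] += 1               # each metric votes for its top k features
    return np.where(votes >= min_votes)[0]           # keep features named by most metrics

X, y = make_classification(n_samples=300, n_features=25, n_informative=8, random_state=0)
print(majority_vote_filter(X, y))
```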
Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming both to improve performance and to make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline such as model selection, hyperparameter tuning, and feature selection, relatively few have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
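The toy sketch below shows only the general augment-then-prune shape the abstract describes, not ARDA's actual join discovery or feature-selection algorithm; the table names, join key, label construction, and importance threshold are all illustrative assumptions.

```python
# Toy augment-then-prune sketch under the assumptions stated above.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def augment_and_prune(base, candidate, key, target, importance_threshold=0.02):
    # (1) Join a candidate repository table onto the input dataset on a shared key.
    augmented = base.merge(candidate, on=key, how="left")
    X = augmented.drop(columns=[key, target]).fillna(0.0)   # numeric features assumed
    y = augmented[target]

    # (2) Prune joined features that add little signal, using model-based importances.
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    keep = X.columns[model.feature_importances_ >= importance_threshold]
    return augmented[[key, target, *keep]]

# Made-up tables sharing a "zip" key; the label depends only on the base feature x1.
rng = np.random.default_rng(0)
base = pd.DataFrame({"zip": rng.integers(0, 50, size=300), "x1": rng.normal(size=300)})
base["label"] = (base["x1"] > 0).astype(int)
candidate = pd.DataFrame({"zip": np.arange(50), "income": rng.normal(size=50)})
print(augment_and_prune(base, candidate, key="zip", target="label").columns.tolist())
```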
Cite: Abulfaz Hajizada and Sharmin Jahan. 2023. Feature Selections for Phishing URLs Detection Using Combination of Multiple Feature Selection Methods. In 2023 15th International Conference on Machine Learning and Computing (ICMLC 2023), February 17–20, 2023, Zhuhai, China. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3587716.3587790

Abstract: In this internet era, users are highly exposed to phishing attacks, in which attackers apply social engineering to persuade and manipulate them. The core goal of the attack is to steal users' sensitive information or to install malicious software that gives the attacker control over their devices. Attackers use different approaches to persuade the user, but a common one is sending a phishing URL that looks legitimate and is difficult to distinguish from a genuine one. Machine learning is a prominent approach for phishing URL detection, and several established models already exist for this purpose. However, a model's performance depends on selecting appropriate features during model building. In this paper, we combine multiple filter methods for feature selection in a procedural way that reduces a large feature list to a much smaller one, and we then apply a wrapper method to select the final features for our phishing detection model. The results show that combining multiple feature selection methods improves the model's detection accuracy. Moreover, because the backward feature selection (wrapper) step runs on the dataset with the reduced feature set, its computation time is also lower.
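A minimal sketch of that two-step procedure, with assumed filter metrics, k values, classifier, and final feature count (not the authors' exact setup), might look like this:

```python
# Step 1: several filters shrink the feature list; Step 2: backward wrapper search on the reduced set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       chi2, f_classif, mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=400, n_features=40, n_informative=10, random_state=0)
X = MinMaxScaler().fit_transform(X)                 # chi2 requires non-negative inputs

# Step 1: combine several filter methods; keep features chosen by at least two of them.
filters = [SelectKBest(f_classif, k=15), SelectKBest(chi2, k=15),
           SelectKBest(mutual_info_classif, k=15)]
votes = np.sum([f.fit(X, y).get_support() for f in filters], axis=0)
reduced = np.where(votes >= 2)[0]

# Step 2: backward wrapper selection runs only on the reduced list, so it is much cheaper
# than running it on all 40 features.
n_keep = min(8, len(reduced) - 1)                   # backward search needs fewer than len(reduced)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=n_keep, direction="backward")
sfs.fit(X[:, reduced], y)
print(reduced[sfs.get_support()])                   # indices of the finally selected features
```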
Background: Science internships have been suggested as a powerful way to engage high school students in conducting authentic science inquiry. However, despite the recognized significance of high school science internships, little research has examined how these experiences affect high school students' career choices.

Purpose: Our study drew on the theoretical framework of social cognitive career theory to examine how a 7-month science internship might shape high school students' career choices.

Method: 88 students were interviewed 6–8 months after their internship graduation.

Findings: The analysis suggests that the science internships altered more than 90% of the participating students' career choices by enhancing, expanding, narrowing down, or even replacing their original career choices. Students reported that the internships boosted their self-efficacy through first-hand mastery of authentic STEM practices, through directly observing scientists' STEM performance, through hearing scientists' opinions on the students' capabilities and potential in STEM, and through the impact of the students' own physiological and affective states on their STEM practices.

Implications: These findings help educators better understand how a unique learning environment such as a science internship may influence high school students' career choices; they have important implications for internship design, career counseling, and education policy.