Title: The Maximum Separation Subspace in Sufficient Dimension Reduction with Categorical Response
Sufficient dimension reduction (SDR) is a very useful concept for exploratory analysis and data visualization in regression, especially when the number of covariates is large. Many SDR methods have been proposed for regression with a continuous response, where the central subspace (CS) is the target of estimation. Various conditions, such as the linearity condition and the constant covariance condition, are imposed so that these methods can estimate at least a portion of the CS. In this paper we study SDR for regression and discriminant analysis with categorical response. Motivated by the exploratory analysis and data visualization aspects of SDR, we propose a new geometric framework to reformulate the SDR problem in terms of manifold optimization and introduce a new concept called Maximum Separation Subspace (MASES). The MASES naturally preserves the “sufficiency” in SDR without imposing additional conditions on the predictor distribution, and directly inspires a semi-parametric estimator. Numerical studies show MASES exhibits superior performance as compared with competing SDR methods in specific settings.
Award ID(s):
1908969 1613154 1617691
NSF-PAR ID:
10192086
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Journal of machine learning research
Volume:
21
Issue:
29
ISSN:
1533-7928
Page Range / eLocation ID:
1-36
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Effects of High Impact Educational Practices on Engineering and Computer Science Student Participation, Persistence, and Success at Land Grant Universities: Award# RIEF-1927218 – Year 2 Abstract Funded by the National Science Foundation (NSF), this project aims to investigate and identify associations (if any) that exist between student participation in High Impact Educational Practices (HIP) and their educational outcomes in undergraduate engineering and computer science (E/CS) programs. To understand the effects of HIP participation among E/CS students from groups historically underrepresented and underserved in E/CS, this study takes place within the rural, public university context at two western land grant institutions (one of which is an Hispanic-serving institution). Conceptualizing diversity broadly, this study considers gender, race and ethnicity, and first-generation, transfer, and nontraditional student status to be facets of identity that contribute to the diversity of academic programs and the technical workforce. This sequential, explanatory, mixed-methods study is guided by the following research questions: 1. To what extent do E/CS students participate in HIP? 2. What relationships (if any) exist between E/CS student participation in HIP and their educational outcomes (i.e., persistence in major, academic performance, and graduation)? 3. How do contextual factors (e.g., institutional, programmatic, personal, social, financial, etc.) affect E/CS student awareness of, interest in, and participation in HIP? During Project Year 1, a survey driven quantitative study was conducted. A survey informed by results of the National Survey of Student Engagement (NSSE) from each institution was developed and deployed. Survey respondents (N = 531) were students enrolled in undergraduate E/CS programs at either institution. 
Frequency distribution analyses were conducted to assess the respondents’ level of participation in extracurricular HIPs (i.e., global learning and study abroad, internships, learning communities, service and community-based learning, and undergraduate research) that have been shown in the literature to positively impact undergraduate student success. Further statistical analysis was conducted to understand the effects of HIP participation, coursework enjoyability, and confidence at completing a degree on the academic success of underrepresented and nontraditional E/CS students. Exploratory factor analysis was used to derive an "academic success" variable from five items that sought to measure how students persevere to attain academic goals. Results showed that a linear relationship exists in the target population and that the resultant multiple regression model is a good fit for the data. During Project Year 2, survey results were used to develop focus group interview protocols and guide the purposive selection of focus group participants. Focus group interviews were conducted with a total of 27 undergraduates (12 males, 15 females; 16 engineering students, 11 computer science students) across both institutions via video conferencing (i.e., Zoom) during the spring and fall 2021 semesters. Currently, verified focus group transcripts are being systematically analyzed and coded by a team of four trained coders to identify themes and answer the research questions. This paper will provide an overview of the preliminary themes identified so far. Future project activities during Project Year 3 will focus on refining themes identified during the focus group transcript analysis. Survey and focus group data will then be combined to develop deeper understandings of why and how E/CS students participate in HIP at their university, taking into account the institutional and programmatic contexts at each institution. 
Ultimately, the project will develop and disseminate recommendations for improving diverse E/CS student awareness of, interest in, and participation in HIP, at similar land grant institutions nationally. 
  2. Visualizations of data provide a proven method for analysts to explore and make data-driven discoveries. However, current visualization tools provide only limited support for hypothesis-driven analyses, and often lack capabilities that would allow users to visually test the fit of their conceptual models against the data. This imbalance could bias users to overly rely on exploratory visual analysis as the principal mode of inquiry, which can be detrimental to discovery. To address this gap, we propose a new paradigm for ‘concept-driven’ visual analysis. In this style of analysis, analysts share their conceptual models and hypotheses with the system. The system then uses those inputs to drive the generation of visualizations, while providing plots and interactions to explore places where models and data disagree. We discuss key characteristics and design considerations for concept-driven visualizations, and report preliminary findings from a formative study. 
  3. Over the years, researchers have found that student engagement facilitates desired academic success outcomes for college undergraduate students. Much research on student engagement has focused on academic tasks and classroom context. High impact engagement practices (HIEP) have been shown to be effective for undergraduate student academic success. However, less is known about the effects of HIEP specifically on engineering and computer science (E/CS) student outcomes. Given the high attrition rates for E/CS students, student involvement in HIEP could be effective in improving student outcomes for E/CS students, including those from various underrepresented groups. More generally, student participation in specific HIEP activities has been shown to shape their everyday experiences in school, both academically and socially. Hence, the primary goal of this study is to examine the factors that predict academic success in E/CS using multiple regression analysis. Specifically, this study seeks to understand the effects of HIEP, coursework enjoyability, and confidence at completing a degree on the academic success of underrepresented and nontraditional E/CS students. We used exploratory factor analysis to derive an “academic success” variable from five items that sought to measure how students persevere to attain academic goals. A secondary goal of the present study is to address the gap in the research literature concerning how participation in HIEP affects student persistence and success in E/CS degree programs. Our research team developed and administered an online survey to investigate and identify factors that affect participation in HIEP among underrepresented and nontraditional E/CS students. Respondents (N = 531) were students enrolled in two land grant universities in the Western U.S. 
Multiple regression analyses were conducted to examine the proportion of the variation in the dependent variable (academic success) explained by the independent variables (i.e., HIEP, coursework enjoyability, and confidence at completing a degree). We hypothesized that (1) high impact engagement practices will predict academic success; (2) coursework enjoyability will predict academic success; and (3) confidence at completing a degree will predict academic success. Results showed that the multiple regression model statistically predicted academic success, F(3, 270) = 33.064, p = .001, adjusted R² = .27. These results indicate that there is a linear relationship in the population and that the multiple regression model is a good fit for the data. Further, findings show that confidence at completing a degree is significantly predictive of academic success. In addition, coursework enjoyability is a strong predictor of academic success. Specifically, the result shows that an increase in high impact engagement activity is associated with an increase in students’ academic success. In sum, these findings suggest that student participation in High Impact Engagement Practices might improve academic success and course retention. Theoretical and practical implications are discussed. 
  4. Abstract Background

    Differential correlation networks are increasingly used to delineate changes in interactions among biomolecules. They characterize differences between omics networks under two different conditions, and can be used to delineate mechanisms of disease initiation and progression.

    Results

    We present a new R package, , that facilitates the estimation and visualization of differential correlation networks using multiple correlation measures and inference methods. The software is implemented in , and , and is available at https://github.com/sqyu/CorDiffViz. Visualization has been tested for the Chrome and Firefox web browsers. A demo is available at https://diffcornet.github.io/CorDiffViz/demo.html.

    Conclusions

    Our software offers considerable flexibility by allowing the user to interact with the visualization and choose from different estimation methods and visualizations. It also allows the user to easily toggle between correlation networks for samples under one condition and differential correlations between samples under two conditions. Moreover, the software facilitates integrative analysis of cross-correlation networks between two omics data sets.

     
  5.
    The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish another major goal, supporting modern search and browse capabilities for a large collection of tweets from the Twitter social media platform, web pages, and electronic theses and dissertations (ETDs). The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization is accomplished with containers in a pipelined fashion, whether in the cluster or on virtual machines, for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system supports text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by five teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. The teams on this project include three collection management groups -- Electronic Theses and Dissertations (ETD), Tweets (TWT), and Web-Pages (WP) -- as well as the Front-end (FE) group and the Integration (INT) group to help provide the overarching structure for the application. This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. 
Each container is a customized application environment that is specific to the needs of the corresponding team. Each team will have several of these containers set up in a pipeline formation to allow scaling and extension of the current system. The INT team also contributes to a cross-team effort for exploring the use of Elasticsearch and its internally associated database. The INT team administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and the Ceph filesystem. During formative stages of development, the INT team also has a role in guiding team evaluations of prospective container components and workflows. The INT team is responsible for the overall project architecture and facilitating the tools and tutorials that assist the other teams in deploying containers in a development environment according to mutual specifications agreed upon with each team. The INT team maintains the status of the Kubernetes cluster, deploying new containers and pods as needed by the collection management teams as they expand their workflows. This team is responsible for utilizing a continuous integration process to update existing containers. During the development stage the INT team collaborates specifically with the collection management teams to create the pipeline for the ingestion and processing of new collection documents, crossing services between those teams as needed. The INT team develops a reasoner engine to construct workflows with information goal as input, which are then programmatically authored, scheduled, and monitored using Apache Airflow. The INT team is responsible for the flow, management, and logging of system performance data and making any adjustments necessary based on the analysis of testing results. 
The INT team has established a Gitlab repository for archival code related to the entire project and has provided the other groups with the documentation to deposit their code in the repository. This repository will be expanded using Gitlab CI in order to provide continuous integration and testing once it is available. Finally, the INT team will provide a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. The INT team will archive this distribution on the Virginia Tech Docker Container Registry and deploy it on the Virginia Tech CS Cloud. The INT-2020 team owes a sincere debt of gratitude to the work of the INT-2019 team. This is a very large undertaking and the wrangling of all of the products and processes would not have been possible without their guidance in both direct and written form. We have relied heavily on the foundation they and their predecessors have provided for us. We continue their work with systematic improvements, but also want to acknowledge their efforts Ibid. Without them, our progress to date would not have been possible. 