Title: Automatic Domain Model Creation and Improvement
We describe a data mining pipeline to convert data from educational systems into knowledge component (KC) models. In contrast to other approaches, ours employs and compares multiple model search methodologies (e.g., sparse factor analysis, covariance clustering) within a single pipeline. In this preliminary work, we describe our approach's results on two datasets when using two model search methodologies for inferring item or KC relations (i.e., implied transfer). The first method clusters item covariances to determine related KCs, and the second uses sparse factor analysis to derive the relationship matrix for clustering. We evaluate these methods on data from experimentally controlled practice of statistics items as well as data from the Andes physics system. We explain our plans to upgrade the pipeline with additional methods for finding item relationships and creating domain models. We also discuss advantages of improving the domain model that go beyond model fit, including the fact that models with clustered item KCs result in performance predictions that transfer between KCs, enabling the learning system to be more adaptive and better able to track student knowledge.
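To make the first method concrete, here is a minimal sketch of covariance-based item clustering in Python. The toy data, the linkage method, and the cluster count are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: cluster items into candidate KCs via their covariances.
# Toy data and all clustering choices are assumptions, not the paper's code.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows = students, columns = items; binary accuracy scores (simulated)
scores = np.random.default_rng(0).integers(0, 2, size=(200, 12)).astype(float)

cov = np.cov(scores, rowvar=False)        # 12 x 12 item covariance matrix
dist = cov.max() - cov                    # turn similarity into dissimilarity
np.fill_diagonal(dist, 0.0)

# condensed upper-triangle distances for scipy's hierarchical clustering
condensed = dist[np.triu_indices_from(dist, k=1)]
Z = linkage(condensed, method="average")
kc_of_item = fcluster(Z, t=4, criterion="maxclust")  # assume 4 candidate KCs
print(kc_of_item)                         # cluster (KC) label for each item
```

Items that land in the same cluster would be treated as sharing a KC, which is what allows practice on one item to inform performance predictions for the others.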
Award ID(s): 1934745
NSF-PAR ID: 10291622
Journal Name: Proceedings of The 14th International Conference on Educational Data Mining
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Popular parametric and semiparametric hazards regression models for clustered survival data are inappropriate and inadequate when the unknown effects of different covariates and clustering are complex. This calls for a flexible modeling framework to yield efficient survival prediction. Moreover, for some survival studies involving time to occurrence of some asymptomatic events, survival times are typically interval censored between consecutive clinical inspections. In this article, we propose a robust semiparametric model for clustered interval‐censored survival data under a paradigm of Bayesian ensemble learning, called soft Bayesian additive regression trees or SBART (Linero and Yang, 2018), which combines multiple sparse (soft) decision trees to attain excellent predictive accuracy. We develop a novel semiparametric hazards regression model by modeling the hazard function as a product of a parametric baseline hazard function and a nonparametric component that uses SBART to incorporate clustering, unknown functional forms of the main effects, and interaction effects of various covariates. In addition to being applicable to left‐censored, right‐censored, and interval‐censored survival data, our methodology is implemented using a data augmentation scheme which allows for existing Bayesian backfitting algorithms to be used. We illustrate the practical implementation and advantages of our method via simulation studies and an analysis of a prostate cancer surgery study where dependence on the experience and skill level of the physicians leads to clustering of survival times. We conclude by discussing our method's applicability in studies involving high‐dimensional data with complex underlying associations.

     
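One plausible way to write the hazard decomposition described in this abstract, with notation chosen here rather than taken from the article:

```latex
% Schematic form of the semiparametric hazard model; symbols are
% illustrative and not the article's own notation.
\lambda(t \mid \mathbf{x}_{ij}) \;=\;
  \underbrace{\lambda_0(t;\, \boldsymbol{\theta})}_{\text{parametric baseline}}
  \, \exp\bigl\{ f(\mathbf{x}_{ij},\, c_i) \bigr\},
\qquad
f(\cdot) \;=\; \sum_{m=1}^{M} g(\cdot\,;\, T_m, \mathcal{M}_m)
```

Here subject \(j\) in cluster \(i\) has covariates \(\mathbf{x}_{ij}\) and cluster label \(c_i\), and \(f\) is the SBART ensemble: a sum of \(M\) soft trees \(g\) with structures \(T_m\) and leaf parameters \(\mathcal{M}_m\).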
  2. Knowledge components (KCs) have many applications. In computing education, determining which specific KCs a student's code demonstrates has been challenging. This paper introduces an entirely data-driven approach for (i) discovering KCs and (ii) demonstrating KCs, using students' actual code submissions. Our system is based on two expected properties of KCs: (i) they generate learning curves that follow the power law of practice, and (ii) they are predictive of response correctness. We train a neural architecture (named KC-Finder) that classifies the correctness of student code submissions and captures problem-KC relationships. Our evaluation on data from 351 students in an introductory Java course shows that the learned KCs can generate reasonable learning curves and predict code submission correctness. At the same time, some KCs can be interpreted to identify programming skills. We compare the learning curves described by our model to four baselines, showing that (i) identifying KCs with naive methods is a difficult task and (ii) our learning curves exhibit a substantially better curve fit. Our work represents a first step in solving the data-driven KC discovery problem in computing education.
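The first expected property can be illustrated with a quick curve fit: under the power law of practice, a KC's error rate should decay as a power of the opportunity count. The data below are invented for illustration.

```python
# Toy check of the power-law-of-practice property: error ~ a * n^(-b).
# The learning-curve data here are invented, not from the paper.
import numpy as np
from scipy.optimize import curve_fit

opportunities = np.arange(1, 11)                       # 1st..10th practice
error_rate = np.array([0.62, 0.48, 0.41, 0.36, 0.33,
                       0.30, 0.28, 0.27, 0.25, 0.24])  # made-up curve

def power_law(n, a, b):
    return a * n ** (-b)

(a, b), _ = curve_fit(power_law, opportunities, error_rate, p0=(0.6, 0.5))
print(f"fitted curve: error = {a:.2f} * n^(-{b:.2f})")
```

A KC candidate whose empirical curve fits this form well, and whose presence predicts correctness, would satisfy both properties the authors test.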
  3. A longstanding goal of learner modeling and educational data mining is to improve the domain model of knowledge that is used to make inferences about learning and performance. In this report we present a tool for finding domain models that is built into an existing modeling framework, logistic knowledge tracing (LKT). LKT allows the flexible specification of learner models in logistic regression by allowing the modeler to select whatever features of the data are relevant to prediction. Each of these features (such as the count of prior opportunities) is a function computed for a component of data (such as a student or knowledge component). In this context, we have developed the “autoKC” component, which clusters knowledge components and allows the modeler to compute features for the clustered components. For an autoKC, the input component (initial KC or item assignment) is clustered prior to computing the feature, and the feature is a function of that cluster. Another recent addition to LKT, which allows us to specify interactions between the logistic regression predictor terms, is combined with autoKC for this report. Interactions allow us to move beyond assuming that the cluster information has additive effects and to model situations where a second factor of the data moderates a first factor.
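A hedged Python sketch of the autoKC idea follows; LKT itself is an R package, and the clustering criterion, toy data, and predictors here are invented for illustration.

```python
# Sketch: cluster KCs, recompute a prior-opportunity feature at the cluster
# level, and add an interaction term. Data and criteria are invented.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_obs, n_kc = 500, 8
kc = rng.integers(0, n_kc, n_obs)           # KC practiced at each observation
correct = rng.integers(0, 2, n_obs)         # toy correctness outcomes

# cluster KCs by a toy similarity criterion (per-KC accuracy)
acc = np.array([correct[kc == k].mean() for k in range(n_kc)])
Z = linkage(acc.reshape(-1, 1), method="ward")
cluster_of_obs = fcluster(Z, t=3, criterion="maxclust")[kc]

# autoKC-style feature: prior opportunities on the *clustered* component
prior = np.zeros(n_obs)
seen = {}
for i, c in enumerate(cluster_of_obs):
    prior[i] = seen.get(c, 0)
    seen[c] = seen.get(c, 0) + 1

ability = rng.normal(size=n_obs)            # stand-in second predictor
X = np.column_stack([prior, ability, prior * ability])  # interaction term
model = LogisticRegression().fit(X, correct)
print(model.coef_)                          # additive + moderating effects
```

The interaction column is what lets the model express moderation, rather than purely additive cluster effects.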
  4. The problem of knowledge graph (KG) reasoning has been widely explored by traditional rule-based systems and more recently by knowledge graph embedding methods. While logical rules can capture deterministic behavior in a KG, they are brittle, and mining rules that infer facts beyond the known KG is challenging. Probabilistic embedding methods are effective in capturing global soft statistical tendencies, and reasoning with them is computationally efficient. While embedding representations learned from rich training data are expressive, incompleteness and sparsity in real-world KGs can impact their effectiveness. We aim to leverage the complementary properties of both methods to develop a hybrid model that learns both high-quality rules and embeddings simultaneously. Our method uses a cross feedback paradigm wherein an embedding model is used to guide the search of a rule mining system to mine rules and infer new facts. These new facts are sampled and further used to refine the embedding model. Experiments on multiple benchmark datasets show the effectiveness of our method over other competitive standalone and hybrid baselines. We also show its efficacy in a sparse KG setting.
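A toy skeleton of the cross-feedback loop is sketched below. The embedding, rule miner, and fact sampler are deliberately simplistic stand-ins; only the loop structure mirrors the abstract.

```python
# Toy cross-feedback loop: embeddings score inferred facts, the best are
# fed back, and embeddings are refined. All components are naive stand-ins.
import itertools
import random

def train_embedding(facts, dim=4, seed=0):
    rng = random.Random(seed)
    ents = {e for h, _, t in facts for e in (h, t)}
    return {e: [rng.gauss(0, 1) for _ in range(dim)] for e in ents}

def plausibility(emb, h, t):
    # stand-in score: closer entity embeddings = more plausible link
    return -sum((a - b) ** 2 for a, b in zip(emb[h], emb[t]))

def mine_rules(facts):
    # exhaustive search over rules of the form r1(x, y) => r2(x, y)
    rels = {r for _, r, _ in facts}
    return [(r1, r2) for r1, r2 in itertools.permutations(rels, 2)
            if any((h, r2, t) in facts for h, r, t in facts if r == r1)]

def cross_feedback(kg, rounds=3, keep=5):
    facts = set(kg)
    for i in range(rounds):
        emb = train_embedding(facts, seed=i)
        rules = mine_rules(facts)
        inferred = {(h, r2, t) for h, r1, t in facts
                    for ra, r2 in rules if r1 == ra} - facts
        # the embedding guides which inferred facts are fed back
        ranked = sorted(inferred, reverse=True,
                        key=lambda f: plausibility(emb, f[0], f[2]))
        facts |= set(ranked[:keep])
    return facts

kg = [("a", "parent_of", "b"), ("a", "ancestor_of", "b"),
      ("b", "parent_of", "c")]
print(cross_feedback(kg))
```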
  5. Abstract

    Studies of spatial point patterns (SPPs) are often used to examine the role that density‐dependence (DD) and environmental filtering (EF) play in community assembly and species coexistence in forest communities. However, SPP analyses often struggle to distinguish the opposing effects that DD and EF may have on the distribution of tree species.

    We tested percolation threshold analysis on simulated tree communities as a method to distinguish the importance of thinning from DD and of EF on SPPs. We then compared the performance of percolation threshold analysis and a Gibbs point process model in detecting environmental associations as well as clustering patterns or overdispersion. Finally, we applied percolation threshold analysis and the Gibbs point process model to observed SPPs of 12 dominant tree species in a Puerto Rican forest to detect evidence of DD and EF.

    Percolation threshold analysis using simulated SPPs detected a decrease in clustering due to DD and an increase in clustering from EF. In contrast, the Gibbs point process model clearly detected the effects of EF but only identified DD thinning in two of the four types of simulated SPPs. Percolation threshold analysis of the 12 observed tree species' SPPs found that the SPPs of two species were consistent with thinning from DD processes only, the SPPs of four species were consistent with EF only, and the SPPs of five species reflected a combination of both processes. Gibbs models of observed SPPs of living trees detected significant environmental associations for 11 species and clustering consistent with DD processes for seven species.

    Percolation threshold analysis is a robust method for detecting community assembly processes in simulated SPPs. By applying percolation threshold analysis to natural communities, we found that tree SPPs were consistent with thinning from both DD and EF. Percolation threshold analysis was better suited to detect DD thinning than Gibbs models for clustered simulated communities. Percolation threshold analysis improves our understanding of forest community assembly processes by quantifying the relative importance of DD and EF in forest communities.

     
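As a rough illustration of the percolation idea (a generic sketch, not the authors' implementation): connect trees whose pairwise distance falls below a radius r, and find the smallest r at which a single cluster dominates the pattern.

```python
# Generic percolation-threshold sketch on a toy stem map: find the smallest
# connection radius at which one cluster spans most points. Not the authors'
# analysis; the data and cutoffs are invented.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
pts = rng.uniform(0, 100, size=(150, 2))   # toy stem map (x, y in meters)
d = squareform(pdist(pts))

def largest_cluster_fraction(radius):
    adj = csr_matrix((d > 0) & (d <= radius))     # link trees within radius
    _, labels = connected_components(adj, directed=False)
    return np.bincount(labels).max() / len(pts)

# scan radii; the threshold is where the largest cluster first dominates
for r in np.arange(2, 30, 2):
    frac = largest_cluster_fraction(r)
    if frac >= 0.5:
        print(f"percolation threshold ~ {r} m (largest cluster: {frac:.0%})")
        break
```

How this threshold compares to a null expectation is what would carry the DD versus EF signal in the analysis described above.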