Abstract. Many applications in science require that computational models and data be combined. In a Bayesian framework, this is usually done by defining likelihoods based on the mismatch of model outputs and data. However, matching model outputs and data in this way can be unnecessary or impossible. For example, using large amounts of steady-state data is unnecessary because these data are redundant. It is numerically difficult to assimilate data in chaotic systems. It is often impossible to assimilate data of a complex system into a low-dimensional model. As a specific example, consider a low-dimensional stochastic model for the dipole of the Earth's magnetic field, which ignores all other field components. The above issues can be addressed by selecting features of the data and defining likelihoods based on these features, rather than on the usual mismatch of model output and data. Our goal is to contribute to a fundamental understanding of such a feature-based approach that allows us to assimilate selected aspects of data into models. We also explain how the feature-based approach can be interpreted as a method for reducing an effective dimension, and we derive new noise models, based on perturbed observations, that lead to computationally efficient solutions. Numerical implementations of our ideas are illustrated in four examples.
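To make the feature-based idea concrete, the following is a minimal sketch (not the paper's code) of a likelihood defined on a feature of the model output rather than on the raw output-data mismatch. The forward model run_model, the binned-spectrum feature, and the diagonal noise covariance on the features are all illustrative assumptions.

```python
import numpy as np

def feature(series, n_bins=8):
    """Illustrative feature map: a coarse, binned power spectrum of a time series."""
    psd = np.abs(np.fft.rfft(series - series.mean()))**2
    return np.array([b.mean() for b in np.array_split(psd, n_bins)])

def feature_log_likelihood(theta, data_feature, run_model, noise_cov):
    """Gaussian log-likelihood on the feature mismatch, not on the raw output-data mismatch."""
    model_output = run_model(theta)               # hypothetical forward model
    resid = feature(model_output) - data_feature
    return -0.5 * resid @ np.linalg.solve(noise_cov, resid)

# Usage sketch: a toy AR(1) "model" whose spectral feature is matched to "data".
rng = np.random.default_rng(0)

def run_model(theta, n=2048):
    x = np.zeros(n)
    for k in range(1, n):
        x[k] = theta * x[k - 1] + rng.standard_normal()
    return x

data_feature = feature(run_model(0.8))
cov = np.diag(np.maximum(data_feature, 1e-6))     # crude noise model on the features
print(feature_log_likelihood(0.5, data_feature, run_model, cov))
```

The point of the sketch is only the structure: the likelihood sees the model and the data exclusively through the feature map.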
- Award ID(s):
- 1740858
- NSF-PAR ID:
- 10111644
- Date Published:
- Journal Name:
- Nonlinear Processes in Geophysics
- Volume:
- 25
- Issue:
- 2
- ISSN:
- 1607-7946
- Page Range / eLocation ID:
- 355 to 374
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Network intrusion detection systems (NIDS) today must quickly provide visibility into anomalous behavior on a growing amount of data. Meanwhile, different data models have evolved over time, each providing a different set of features to classify attacks. Defenders have limited time to retrain classifiers, while the scale of data and feature mismatch between data models can affect the ability to periodically retrain. Much work has focused on classification accuracy, yet feature selection is a key part of machine learning that, when optimized, reduces training time and can increase accuracy by removing poorly performing features that introduce noise. With a larger feature space, the pursuit of more features is not as valuable as selecting better features. In this paper, we use an ensemble approach of filter methods to rank features, followed by a voting technique to select a subset of features. We evaluate our approach using three datasets to show that, across datasets and network topologies, similar features have a trivial effect on classifier accuracy after removal. Our approach identifies poorly performing features to remove in a classifier-agnostic manner, which can significantly save time for periodic retraining of production NIDS.
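A minimal sketch of the ensemble-of-filters idea described above, assuming scikit-learn filter scores (mutual information and the ANOVA F statistic) and a simple top-k voting rule; the synthetic data, score functions, and thresholds are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, f_classif

def vote_select(X, y, score_funcs, top_k=10, min_votes=2):
    """Rank features with each filter method, then keep features voted into the
    top_k by at least min_votes of the methods."""
    votes = np.zeros(X.shape[1], dtype=int)
    for score in score_funcs:
        s = score(X, y)
        s = s[0] if isinstance(s, tuple) else s   # f_classif returns (F, p-values)
        votes[np.argsort(s)[-top_k:]] += 1
    return np.flatnonzero(votes >= min_votes)

X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=0)
selected = vote_select(X, y, [mutual_info_classif, f_classif], top_k=10, min_votes=2)
print("kept features:", selected)
```

Because only filter scores are used, the selection is classifier-agnostic in the sense described above: the retained subset can be handed to any downstream classifier.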
-
We consider a stochastic differential equation model for Earth's axial magnetic dipole field. The model's parameters are estimated using diverse and independent data sources that had previously been treated separately. The result is a numerical model that is informed by the full paleomagnetic record on kyr to Myr time scales and whose outputs match data of Earth's dipole in a precisely defined feature-based sense. Specifically, we compute model parameters and associated uncertainties that lead to model outputs that match spectral data of Earth's axial magnetic dipole field, but our approach also reveals difficulties with simultaneously matching spectral data and reversal rates. This could be due to model deficiencies or to inaccuracies in the limited amount of data. More generally, the approach we describe can be seen as an example of an effective strategy for combining diverse data sets, one that is particularly useful when the amount of data is limited.
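The sketch below illustrates the kind of computation involved, assuming a generic bistable (double-well) stochastic differential equation as a stand-in for the dipole model and a band-averaged periodogram as the spectral feature; the drift, parameter values, and feature definition are illustrative and not taken from the paper.

```python
import numpy as np

def simulate_dipole(theta, n_steps=20000, dt=0.5, x0=1.0, seed=0):
    """Euler-Maruyama simulation of a bistable SDE, dx = a*(x - x^3) dt + sigma dW,
    a stand-in for a low-dimensional axial-dipole model (parameters illustrative)."""
    a, sigma = theta
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x0
    for k in range(1, n_steps):
        drift = a * (x[k - 1] - x[k - 1]**3)       # double-well drift, two stable polarities
        x[k] = x[k - 1] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

def spectral_feature(x, dt, n_bands=12):
    """Band-averaged periodogram: the kind of spectral summary matched to data."""
    psd = np.abs(np.fft.rfft(x - x.mean()))**2 * dt / len(x)
    return np.array([b.mean() for b in np.array_split(psd[1:], n_bands)])

x = simulate_dipole((0.1, 0.3))
print(spectral_feature(x, dt=0.5))
```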
-
The sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great success in many real-world applications. However, for large-scale problems involving a huge number of samples and extremely high-dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsity in both the feature and sample spaces, we propose a novel approach, based on accurate estimates of the primal and dual optima of sparse SVMs, to simultaneously identify the features and samples that are guaranteed to be irrelevant to the outputs. We can thus remove the identified inactive samples and features from the training phase, leading to substantial savings in both memory usage and computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVMs. Experiments on both synthetic and real datasets (e.g., the kddb dataset with about 20 million samples and 30 million features) demonstrate that our approach significantly outperforms state-of-the-art methods, and the speedup gained by our approach can be orders of magnitude.
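The sketch below does not reproduce the paper's static screening rules; it only illustrates, with scikit-learn's l1-penalized linear SVM, the dual sparsity that such screening exploits: few nonzero feature weights and few samples inside the margin. The dataset sizes and regularization constant are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=200, n_informative=15, random_state=0)

# The l1 penalty induces feature sparsity; the hinge-type loss leaves many samples inactive.
clf = LinearSVC(C=0.05, penalty="l1", loss="squared_hinge", dual=False, max_iter=5000)
clf.fit(X, y)

active_features = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-8)
margins = (2 * y - 1) * clf.decision_function(X)
active_samples = np.flatnonzero(margins <= 1.0)   # samples on or inside the margin

print(f"{len(active_features)} / {X.shape[1]} features active")
print(f"{len(active_samples)} / {X.shape[0]} samples active")
```

A static reduction method of the kind described above would certify, before training, that the remaining features and samples cannot become active, so they can be discarded without changing the solution.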
-
Machine learning models exhibit strong performance on datasets with abundant labeled samples. However, for tabular datasets with an extremely high number of features d but a limited number of samples n (i.e., d ≫ n), machine learning models struggle to achieve strong performance due to the risk of overfitting. Here, our key insight is that there is often abundant, auxiliary domain information describing the input features which can be structured as a heterogeneous knowledge graph (KG). We propose PLATO, a method that achieves strong performance on tabular data with d ≫ n by using an auxiliary KG describing the input features to regularize a multilayer perceptron (MLP). In PLATO, each input feature corresponds to a node in the auxiliary KG. In the MLP's first layer, each input feature also corresponds to a weight vector. PLATO is based on the inductive bias that two input features corresponding to similar nodes in the auxiliary KG should have similar weight vectors in the MLP's first layer. PLATO captures this inductive bias by inferring the weight vector for each input feature from its corresponding node in the KG via a trainable message-passing function. Across six d ≫ n datasets, PLATO outperforms 13 state-of-the-art baselines by up to 10.19%.
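A minimal numpy sketch of the wiring described above, with random placeholder node embeddings standing in for the KG message-passing output and a fixed (rather than trainable) map from embeddings to first-layer weight vectors; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, emb_dim, hidden = 500, 40, 16, 32           # d features >> n samples (illustrative sizes)

# One embedding per input feature; in the real method these come from message passing
# over the auxiliary knowledge graph (here: random placeholders).
node_emb = rng.standard_normal((d, emb_dim))

# Shared map from a node embedding to that feature's first-layer weight vector
# (trainable in the real method; fixed random here just to show the structure).
W1 = rng.standard_normal((emb_dim, hidden)) / np.sqrt(emb_dim)

def first_layer_weights(node_emb):
    """Infer the MLP's first-layer weight matrix from the KG node embeddings,
    so features with similar KG nodes get similar weight vectors."""
    return np.tanh(node_emb @ W1)                 # shape (d, hidden)

X = rng.standard_normal((n, d))
H = np.tanh(X @ first_layer_weights(node_emb))   # first MLP layer output, shape (n, hidden)
print(H.shape)
```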
-
Summary. A new approach has been developed for combining and enhancing the results from an existing computational fluid dynamics model with experimental data using the weighted least-squares finite element method (WLSFEM). Development of the approach was motivated by the existence of both limited experimental blood velocity data in the left ventricle and inexact numerical models of the same flow. Limitations of the experimental data include measurement noise and the availability of data only along a two-dimensional plane. Most numerical modeling approaches do not provide the flexibility to assimilate noisy experimental data. We previously developed an approach that could assimilate experimental data into the process of numerically solving the Navier–Stokes equations, but that approach was limited because it required specific finite element methods for solving all model equations and did not support alternative numerical approximation methods. The new approach presented here allows virtually any numerical method to be used to approximately solve the Navier–Stokes equations; the WLSFEM is then used to combine the experimental data with the numerical solution of the model equations in a final step. The approach dynamically adjusts the influence of the experimental data on the numerical solution so that more accurate data are matched more closely by the final solution and less accurate data are not. The new approach is demonstrated on different test problems and provides significantly reduced computational costs compared with many previous methods for data assimilation. Copyright © 2016 John Wiley & Sons, Ltd.
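The weighted least-squares idea can be sketched in a drastically simplified algebraic form (no finite elements): stay close to a given numerical solution while pulling toward point data, with each data point weighted by the inverse of its assumed noise level so that less accurate data exert less influence. All names and values below are illustrative.

```python
import numpy as np

def assimilate(u_num, data_idx, data_val, data_sigma, alpha=1.0):
    """Weighted least squares: stay close to the numerical solution u_num while
    pulling toward data points, each weighted by 1/sigma (noisier data count less)."""
    n = len(u_num)
    # Stack the two objectives as one linear system A u ~ b and solve in the LS sense.
    A_model = np.sqrt(alpha) * np.eye(n)
    b_model = np.sqrt(alpha) * u_num
    w = 1.0 / np.asarray(data_sigma)
    A_data = np.zeros((len(data_idx), n))
    A_data[np.arange(len(data_idx)), data_idx] = w
    b_data = w * np.asarray(data_val)
    A = np.vstack([A_model, A_data])
    b = np.concatenate([b_model, b_data])
    return np.linalg.lstsq(A, b, rcond=None)[0]

x = np.linspace(0, 1, 50)
u_num = np.sin(np.pi * x)                                  # imperfect "numerical solution"
idx = np.array([10, 25, 40])
obs = np.sin(np.pi * x[idx]) * 1.1 + 0.01                  # noisy "experimental" data
u = assimilate(u_num, idx, obs, data_sigma=[0.05, 0.01, 0.2])
print(u[idx], obs)
```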