Multi-instance learning (MIL) is an area of machine learning that handles data that is organized into sets of instances known as bags. Traditionally, MIL is used in the supervised-learning setting and is able to classify bags which can contain any number of instances. This property allows MIL to be naturally applied to solve the problems in a wide variety of real-world applications from computer vision to healthcare. However, many traditional MIL algorithms do not scale efficiently to large datasets. In this paper we present a novel Primal-Dual Multi-Instance Support Vector Machine (pdMISVM) derivation and implementation that can operate efficiently on large scale data. Our method relies on an algorithm derived using a multi-block variation of the alternating direction method of multipliers (ADMM). The approach presented in this work is able to scale to large-scale data since it avoids iteratively solving quadratic programming problems which are generally used to optimize MIL algorithms based on SVMs. In addition, we modify our derivation to include an additional optimization designed to avoid solving a least-squares problem during our algorithm; this optimization increases the utility of our approach to handle a large number of features as well as bags. Finally, we apply our approach to synthetic and real-world multi-instance datasets to illustrate the scalability, promising predictive performance, and interpretability of our proposed method. We end our discussion with an extension of our approach to handle non-linear decision boundaries. Code and data for our methods are available online at: https://github.com/minds-mines/pdMISVM.jl.
more »
« less
A linear primal–dual multi-instance SVM for big data classifications
Multi-instance learning (MIL) handles data that is organized into sets of instances known as bags. Traditionally, MIL is used in the supervised-learning setting for classifying bags which contain any number of instances. However, many traditional MIL algorithms do not scale efficiently to large datasets. In this paper, we present a novel primal–dual multi-instance support vector machine that can operate efficiently on large-scale data. Our method relies on an algorithm derived using a multi-block variation of the alternating direction method of multipliers. The approach presented in this work is able to scale to large-scale data since it avoids iteratively solving quadratic programming problems which are broadly used to optimize MIL algorithms based on SVMs. In addition, we improve our derivation to include an additional optimization designed to avoid solving a least-squares problem in our algorithm, which increases the utility of our approach to handle a large number of features as well as bags. Finally, we derive a kernel extension of our approach to learn nonlinear decision boundaries for enhanced classification capabilities. We apply our approach to both synthetic and real-world multi-instance datasets to illustrate the scalability, promising predictive performance, and interpretability of our proposed method.
more »
« less
- PAR ID:
- 10469934
- Publisher / Repository:
- Springer
- Date Published:
- Journal Name:
- Knowledge and Information Systems
- ISSN:
- 0219-1377
- Subject(s) / Keyword(s):
- Multi-instance learning Support vector machine Alternating direction method of multipliers Scalability
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
With the prevalence of machine learning in many high-stakes decision-making processes, e.g., hiring and admission, it is important to take fairness into account when practitioners design and deploy machine learning models, especially in scenarios with imperfectly labeled data. Multiple-Instance Learning (MIL) is a weakly supervised approach where instances are grouped in labeled bags, each containing several instances sharing the same label. However, current fairness-centric methods in machine learning often fall short when applied to MIL due to their reliance on instance-level labels. In this work, we introduce a Fair Multiple-Instance Learning (FMIL) framework to ensure fairness in weakly supervised learning. In particular, our method bridges the gap between bag-level and instance-level labeling by leveraging the bag labels, inferring high-confidence instance labels to improve both accuracy and fairness in MIL classifiers. Comprehensive experiments underscore that our FMIL framework substantially reduces biases in MIL without compromising accuracy.more » « less
-
A major challenge in computational biology regards recognizing one or more biologically- active/native tertiary protein structures among thousands of physically-realistic structures generated via template-free protein structure prediction algorithms. Clustering structures based on structural similarity remains a popular approach. However, clustering orga- nizes structures into groups and does not directly provide a mechanism to select individual structures for prediction. In this paper, we provide a few algorithms for this selection prob- lem. We approach the problem under unsupervised multi-instance learning and address it in three stages, first organizing structures into bags, identifying relevant bags, and then drawing individual structures/instances from these bags. We present both non-parametric and parametric algorithms for drawing individual instances. In the latter, parameters are trained over training data and evaluated over testing data via rigorous metrics.more » « less
-
Learning from label proportions (LLP) is a weakly supervised classification problem where data points are grouped into bags, and the label proportions within each bag are observed instead of the instance-level labels. The task is to learn a classifier to predict the labels of future individual instances. Prior work on LLP for multi-class data has yet to develop a theoretically grounded algorithm. In this work, we propose an approach to LLP based on a reduction to learning with label noise, using the forward correction (FC) loss of Patrini et al. [30]. We establish an excess risk bound and generalization error analysis for our approach, while also extending the theory of the FC loss which may be of independent interest. Our approach demonstrates improved empirical performance in deep learning scenarios across multiple datasets and architectures, compared to the leading methods.more » « less
-
Learning from label proportions (LLP) is a weakly supervised classification problem where data points are grouped into bags, and the label proportions within each bag are observed instead of the instance-level labels. The task is to learn a classifier to predict the labels of future individual instances. Prior work on LLP for multi-class data has yet to develop a theoretically grounded algorithm. In this work, we propose an approach to LLP based on a reduction to learning with label noise, using the forward correction (FC) loss of Patrini et al. [30]. We establish an excess risk bound and generalization error analysis for our approach, while also extending the theory of the FC loss which may be of independent interest. Our approach demonstrates improved empirical performance in deep learning scenarios across multiple datasets and architectures, compared to the leading methods.more » « less