Title: Distance‐weighted discrimination of face images for gender classification

We illustrate the advantages of distance‐weighted discrimination for classification and feature extraction in a high‐dimension low sample size (HDLSS) situation. The HDLSS context is a gender classification problem of face images in which the dimension of the data is several orders of magnitude larger than the sample size. We compare distance‐weighted discrimination with Fisher's linear discriminant, support vector machines and principal component analysis by exploring their classification interpretation through insightful visuanimations and by examining the classifiers' discriminant errors. This analysis enables us to make new contributions to the understanding of the drivers of human discrimination between men and women. Copyright © 2017 John Wiley & Sons, Ltd.

 
NSF-PAR ID:
10036431
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Stat
Volume:
6
Issue:
1
ISSN:
2049-1573
Page Range / eLocation ID:
p. 231-240
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Multi‐view data, which are matched sets of measurements on the same subjects, have become increasingly common with advances in multi‐omics technology. Often, it is of interest to find associations between the views that are related to the intrinsic class memberships. Existing association methods cannot directly incorporate class information, while existing classification methods do not take into account between‐views associations. In this work, we propose a framework for Joint Association and Classification Analysis of multi‐view data (JACA). Our goal is not to merely improve the misclassification rates, but to provide a latent representation of high‐dimensional data that is both relevant for the subtype discrimination and coherent across the views. We motivate the methodology by establishing a connection between canonical correlation analysis and discriminant analysis. We also establish the estimation consistency of JACA in high‐dimensional settings. A distinct advantage of JACA is that it can be applied to multi‐view data with block‐missing structure, that is, to cases where a subset of views or class labels is missing for some subjects. The application of JACA to quantify the associations between RNAseq and miRNA views with respect to consensus molecular subtypes in colorectal cancer data from The Cancer Genome Atlas project leads to improved misclassification rates and stronger associations compared to existing methods.

     
  2. We investigate the nonparametric, composite hypothesis testing problem for arbitrary unknown distributions in the asymptotic regime where both the sample size and the number of hypotheses grow exponentially large. Such asymptotic analysis is important in many practical problems, where the number of variations that can exist within a family of distributions can be countably infinite. We introduce the notion of discrimination capacity, which captures the largest exponential growth rate of the number of hypotheses relative to the sample size so that there exists a test with asymptotically vanishing probability of error. Our approach is based on various distributional distance metrics in order to incorporate the generative model of the data. We provide analyses of the error exponent using the maximum mean discrepancy and Kolmogorov–Smirnov distance and characterize the corresponding discrimination rates, i.e., lower bounds on the discrimination capacity, for these tests. Finally, an upper bound on the discrimination capacity based on Fano's inequality is developed. Numerical results are presented to validate the theoretical results.
  3. Summary

    We introduce an L2-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed on the basis of the pairwise distance covariance and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by the existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small albeit the dimension is high and is shown to identify non-linear dependence in empirical data analysis successfully. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of good power properties for our distance-covariance-based test, we further show that an infeasible version of our test statistic has the rate optimality in the class of Gaussian distributions with equal correlation.

     
  4. Abstract

    High-resolution millimeter-wave imaging (HR-MMWI), with its high discrimination contrast and sufficient penetration depth, can potentially provide affordable tissue diagnostic information noninvasively. In this study, we evaluate the application of a real-time system of HR-MMWI for in-vivo skin cancer diagnosis. 136 benign and malignant skin lesions from 71 patients, including melanoma, basal cell carcinoma, squamous cell carcinoma, actinic keratosis, melanocytic nevi, angiokeratoma, dermatofibroma, solar lentigo, and seborrheic keratosis, were measured. Lesions were classified using a 3-D principal component analysis followed by five classifiers including linear discriminant analysis (LDA), K-nearest neighbor (KNN) with different K-values, linear and Gaussian support vector machine (LSVM and GSVM) with different margin factors, and multilayer perceptron (MLP). Our results suggested that the best classification was achieved by using five PCA components followed by MLP with 97% sensitivity and 98% specificity. Our findings establish that real-time millimeter-wave imaging can be used to distinguish malignant tissues from benign skin lesions with high diagnostic accuracy comparable with clinical examination and other methods.

     
  5. Abstract

    Doubled haploids (DHs) are an important breeding tool for creating maize inbred lines. One bottleneck in the DH process is the manual separation of haploids from among the much larger pool of hybrid siblings in a haploid induction cross. Here, we demonstrate the ability of single‐kernel near‐infrared reflectance spectroscopy (skNIR) to identify haploid kernels. The skNIR is a high‐throughput device that acquires an NIR spectrum to predict individual kernel traits. We collected skNIR data from haploid and hybrid kernels in 15 haploid induction crosses and found significant differences in multiple traits, such as percent oil, seed weight, or volume, within each cross. The two kernel classes were separated by their NIR profile using Partial Least Squares Linear Discriminant Analysis (PLS‐LDA). A general classification model, in which all induction crosses were used in the discrimination model, was compared with a specific model, in which only kernels within a specific induction cross were used. Specific models outperformed the general model and were able to enrich a haploid selection pool to above 50% haploids. Applications for the instrument are discussed.

     