Title: A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data
Classification is an important statistical tool whose importance has grown since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data does not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a testing observation from an unseen cluster will be misclassified, because a classification rule based on the sample cannot capture a cluster not observed in the training data (sample). To overcome this issue, we propose a two-stage classification method. We also propose a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
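The abstract does not spell out the test or the classifier, so the following is only a minimal sketch of the general two-stage idea, not the authors' method. It assumes a simple stand-in: stage one flags a test observation as a possible unseen-cluster member when it lies farther from every training centroid than a cutoff, and stage two applies a nearest-centroid rule to the rest. The function names, the 99th-percentile cutoff, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_centroids(X, y):
    """Return sorted class labels, per-class centroids, and a distance cutoff."""
    labels = np.unique(y)
    centroids = np.stack([X[y == k].mean(axis=0) for k in labels])
    # Distance from each training point to the centroid of its own class.
    own = centroids[np.searchsorted(labels, y)]
    dists = np.linalg.norm(X - own, axis=1)
    # Hypothetical cutoff (an assumption): 99th percentile of those distances.
    threshold = np.quantile(dists, 0.99)
    return labels, centroids, threshold

def two_stage_predict(X_new, labels, centroids, threshold, unseen_label=-1):
    """Stage 1 flags far-away points; stage 2 labels the rest."""
    # Pairwise distances, shape (n_new, n_classes).
    d = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    # Stage 1: far from every known centroid -> possible unseen cluster.
    flagged = d.min(axis=1) > threshold
    # Stage 2: nearest-centroid label for everything not flagged.
    pred = labels[nearest].copy()
    pred[flagged] = unseen_label
    return pred

# Synthetic data (an assumption): two training clusters, plus a third
# cluster that appears only in the test set.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
y_train = np.repeat([0, 1], 200)
X_test = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal([0, 12], 1, (50, 2))])

labels, centroids, threshold = fit_centroids(X_train, y_train)
pred = two_stage_predict(X_test, labels, centroids, threshold)
```

In this toy setup, the test points drawn from the third cluster receive the placeholder label -1 instead of being forced into one of the two training classes. A distance cutoff is only one possible screening rule; the paper's own test may differ substantially.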
Award ID(s):
2015320
PAR ID:
10542505
Author(s) / Creator(s):
;
Publisher / Repository:
School of Statistics and the Center for Applied Statistics, Renmin University of China.
Date Published:
Journal Name:
Journal of Data Science
ISSN:
1680-743X
Page Range / eLocation ID:
1 to 20
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Galaxy clusters have a triaxial matter distribution. The weak-lensing signal, an important part in cosmological studies, measures the projected mass of all matter along the line of sight, and therefore changes with the orientation of the cluster. Studies suggest that the shape of the brightest cluster galaxy (BCG) in the centre of the cluster traces the underlying halo shape, enabling a method to account for projection effects. We use 324 simulated clusters at four redshifts between 0.1 and 0.6 from ‘The Three Hundred Project’ to quantify correlations between the orientation and shape of the BCG and the halo. We find that haloes and their embedded BCGs are aligned, with an average ∼20 degree angle between their major axes. The bias in weak lensing cluster mass estimates correlates with the orientation of both the halo and the BCG. Mimicking observations, we compute the projected shape of the BCG, as a measure of the BCG orientation, and find that it is most strongly correlated to the weak-lensing mass for relaxed clusters. We also test a 2D cluster relaxation proxy measured from BCG mass isocontours. The concentration of stellar mass in the projected BCG core compared to the total stellar mass provides an alternative proxy for the BCG orientation. We find that the concentration does not correlate to the weak-lensing mass bias, but does correlate with the true halo mass. These results indicate that the BCG shape and orientation for large samples of relaxed clusters can provide information to improve weak-lensing mass estimates.
  2. Airborne remote sensing offers unprecedented opportunities to efficiently monitor vegetation, but methods to delineate and classify individual plant species using the collected data are still actively being developed and improved. The Integrating Data science with Trees and Remote Sensing (IDTReeS) plant identification competition openly invited scientists to create and compare individual tree mapping methods. Participants were tasked with training taxon identification algorithms based on two sites, to then transfer their methods to a third unseen site, using field-based plant observations in combination with airborne remote sensing image data products from the National Ecological Observatory Network (NEON). These data were captured by a high resolution digital camera sensitive to red, green, blue (RGB) light, hyperspectral imaging spectrometer spanning the visible to shortwave infrared wavelengths, and lidar systems to capture the spectral and structural properties of vegetation. As participants in the IDTReeS competition, we developed a two-stage deep learning approach to integrate NEON remote sensing data from all three sensors and classify individual plant species and genera. The first stage was a convolutional neural network that generates taxon probabilities from RGB images, and the second stage was a fusion neural network that “learns” how to combine these probabilities with hyperspectral and lidar data. Our two-stage approach leverages the ability of neural networks to flexibly and automatically extract descriptive features from complex image data with high dimensionality. Our method achieved an overall classification accuracy of 0.51 based on the training set, and 0.32 based on the test set which contained data from an unseen site with unknown taxa classes. 
    Although transferability of classification algorithms to unseen sites with unknown species and genus classes proved to be a challenging task, developing methods with openly available NEON data that will be collected in a standardized format for 30 years allows for continual improvements and major gains for members of the computational ecology community. We outline promising directions related to data preparation and processing techniques for further investigation, and provide our code to contribute to open reproducible science efforts.
  3. Current approaches to A/B testing in networks focus on limiting interference, the concern that treatment effects can “spill over” from treatment nodes to control nodes and lead to biased causal effect estimation. Prominent methods for network experiment design rely on two-stage randomization, in which sparsely-connected clusters are identified and cluster randomization dictates the node assignment to treatment and control. Here, we show that cluster randomization does not ensure sufficient node randomization and it can lead to selection bias in which treatment and control nodes represent different populations of users. To address this problem, we propose a principled framework for network experiment design which jointly minimizes interference and selection bias. We introduce the concepts of edge spillover probability and cluster matching and demonstrate their importance for designing network A/B testing. Our experiments on a number of real-world datasets show that our proposed framework leads to significantly lower error in causal effect estimation than existing solutions.
  4. Avidan, S. (Ed.)
    The subpopulation shifting challenge, in which some subpopulations of a category are not seen during training, severely limits the classification performance of state-of-the-art convolutional neural networks. To mitigate this practical issue, we explore incremental subpopulation learning (ISL), which adapts the original model by incrementally learning the unseen subpopulations without retaining the seen population data. Striking a good balance between subpopulation learning and seen population forgetting is the main challenge in ISL, but it is not well studied by existing approaches. These incremental learners simply use a pre-defined and fixed hyperparameter to balance the learning objective and forgetting regularization, but their learning is usually biased towards either side in the long run. In this paper, we propose a novel two-stage learning scheme that explicitly disentangles acquisition and forgetting to achieve a better balance between subpopulation learning and seen population forgetting. In the first “gain-acquisition” stage, we progressively learn a new classifier based on the margin-enforce loss, which gives the hard samples and population a larger weight in classifier updating and avoids uniformly updating all the population. In the second “counter-forgetting” stage, we search for the proper combination of the new and old classifiers by optimizing a novel objective based on proxies of forgetting and acquisition. We benchmark the representative and state-of-the-art non-exemplar-based incremental learning methods on a large-scale subpopulation shifting dataset for the first time. Under almost all the challenging ISL protocols, we outperform other methods by a large margin, demonstrating the effectiveness of our approach in alleviating the subpopulation shifting problem (Code is released in https://github.com/wuyujack/ISL).
  5. In this paper, we present the results of a blind survey for compact sources in 243 galaxy clusters that were identified using the thermal Sunyaev–Zel'dovich effect (tSZ). The survey was carried out at 90 GHz using MUSTANG2 on the Green Bank Telescope and achieved a 5σ detection limit of 1 mJy in the center of each cluster. We detected 24 discrete sources. The majority (18) of these correspond to known radio sources, and of these, five show signs of significant variability, either with time or in spectral index. The remaining sources have no clear counterparts at other wavelengths. Searches for galaxy clusters via the tSZ strongly rely on observations at 90 GHz, and the sources found have the potential to bias mass estimates of clusters. We compare our results to the Websky simulation that can be used to estimate the source contamination in galaxy cluster catalogs. While the simulation shows a good match to our observations at the clusters’ centers, it does not match our source distribution further out. Sources over 104″ from a cluster’s center bias the tSZ signal high, for some of the sources found, by over 50%. When averaged over the whole cluster population, the effect is smaller but still at a level of 1%–2%. We also discovered that, unlike previous measurements and simulations, we see an enhancement of source counts in the outer regions of the clusters and fewer sources than expected in the centers of this tSZ-selected sample.