

Title: Detection of unknown galaxy types in large databases of galaxy images

Modern digital sky surveys use robotic telescopes to collect extremely large, multi-PB astronomical databases. While these databases can contain billions of galaxies, most of them are “regular” galaxies of known types. A small fraction, however, are rare “peculiar” galaxies of types that are not yet known. These unknown galaxies are of paramount scientific interest, but due to the enormous size of astronomical databases they are practically impossible to find without automation. Since these novelty galaxies are, by definition, not known, supervised machine learning models cannot be trained to detect them. In this paper, an unsupervised machine learning method for automatic detection of novelty galaxies in large databases is proposed. The method is based on a large and comprehensive set of numerical image content descriptors weighted by their entropy, and the farthest neighbors are rank-ordered to handle the self-similar peculiar galaxies that are expected in very large datasets. Experimental results using data from the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) show that the method detects novelty galaxies better than other shallow learning methods such as one-class SVM, Local Outlier Factor, and K-Means, and also better than newer deep learning-based methods such as auto-encoders. The dataset used to evaluate the method is publicly available and can be used as a benchmark to test future algorithms for automatic detection of peculiar galaxies.
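
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each feature is weighted by the Shannon entropy of its histogram (the direction and form of the weighting are assumptions), and each galaxy is scored by the distance to its r-th nearest neighbor rather than its nearest one, so that a handful of mutually similar peculiar galaxies cannot hide each other. The function names, the `rank` and `bins` parameters, and the use of a weighted Euclidean distance are all hypothetical and are not taken from the paper.

```python
import numpy as np

def feature_entropy(column, bins=10):
    """Shannon entropy (in bits) of the histogram of one feature column."""
    counts, _ = np.histogram(column, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def novelty_scores(features, rank=3, bins=10):
    """Score each row by the weighted distance to its rank-th nearest neighbor.

    Assumption: features with higher histogram entropy receive higher weight.
    The all-pairs distance matrix needs O(n^2 * d) memory, so this sketch is
    only suitable for small samples.
    """
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero
    z = (x - x.mean(axis=0)) / std           # z-score every feature

    entropies = np.array([feature_entropy(z[:, j], bins) for j in range(z.shape[1])])
    total = entropies.sum()
    if total > 0:
        weights = entropies / total
    else:
        weights = np.full(z.shape[1], 1.0 / z.shape[1])

    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((weights * diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # a sample is not its own neighbor

    ordered = np.sort(dist, axis=1)          # rank-order the neighbors of each sample
    return ordered[:, rank - 1]
```

With `rank=1`, two identical peculiar objects would each receive a score of zero because they are each other's nearest neighbor; choosing a rank larger than the expected number of look-alikes avoids that failure mode in this sketch.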

 
Award ID(s):
1903823
NSF-PAR ID:
10268489
Journal Name:
EPiC Series in Computing
Volume:
76
ISSN:
2398-7340
Page Range / eLocation ID:
29 to 18
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Modern digital sky surveys use robotic telescopes to collect galaxy images on the order of multiple PB. Although a plethora of imaging data is available, the majority of the captured images show “regular” galaxies, i.e., galaxy types that are already known and studied. However, “novelty” galaxy types, i.e., little-known galaxy types, are encountered on occasion. These novelty galaxies are of paramount interest to the astronomy community because of their potential for scientific discovery. Because they are rare, identifying them is not trivial and requires automation. Since these novelty galaxies are, by definition, not known, supervised machine learning models cannot be trained to detect them. In this paper, an unsupervised machine learning method for automatic detection of novelty galaxies in large databases is proposed. The method uses a large set of image features weighted by their entropy. To handle the impact of self-similar novelty galaxies, the most similar galaxies are rank-ordered. In addition, Bag of Visual Words (BOVW) is adapted to the problem of detecting novelty galaxies. Each image in the dataset is represented as a set of features made up of key-points and descriptors. A histogram of the features is constructed and used to identify the neighbors of each image. Experimental results using data from the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) show that the methods detect novelty galaxies better than other shallow learning methods such as one-class SVM, Local Outlier Factor, and K-Means, and also better than newer deep learning-based methods such as auto-encoders. The dataset used to evaluate the method is publicly available and can be used as a benchmark to test future algorithms for automatic detection of peculiar galaxies.
  2.
    ABSTRACT Rare extragalactic objects can carry substantial information about the past, present, and future universe. Given the size of astronomical data bases in the information era, it can be assumed that very many outlier galaxies are included in existing and future astronomical data bases. However, manual search for these objects is impractical due to the required labour, and therefore the ability to detect such objects largely depends on computer algorithms. This paper describes an unsupervised machine learning algorithm for automatic detection of outlier galaxy images, and its application to several Hubble Space Telescope fields. The algorithm does not require training, and therefore is not dependent on the preparation of clean training sets. The application of the algorithm to a large collection of galaxies detected a variety of outlier galaxy images. The algorithm is not perfect in the sense that not all objects detected by the algorithm are indeed considered outliers, but it reduces the data set by two orders of magnitude to allow practical manual identification. The catalogue contains 147 objects that would be very difficult to identify without using automation. 
  3. ABSTRACT

    Current large-scale astrophysical experiments produce unprecedented amounts of rich and diverse data. This creates a growing need for fast and flexible automated data inspection methods. Deep learning algorithms can capture subtle variations in rich data sets and are fast to apply once trained. Here, we study the applicability of an unsupervised and probabilistic deep learning framework, the probabilistic auto-encoder, to the detection of peculiar objects in galaxy spectra from the SDSS survey. Unlike supervised algorithms, this algorithm is not trained to detect a specific feature or type of anomaly; instead, it learns the complex and diverse distribution of galaxy spectra from training data and identifies outliers with respect to the learned distribution. We find that the algorithm assigns consistently lower probabilities (higher anomaly scores) to spectra that exhibit unusual features. For example, the majority of outliers among quiescent galaxies are E+A galaxies, whose spectra combine features from old and young stellar populations. Other identified outliers include LINERs, supernovae, and overlapping objects. Conditional modelling further allows us to incorporate additional information. Specifically, we evaluate the probability of an object being anomalous given a certain spectral class, but other information, such as metrics of data quality or estimated redshift, could be incorporated as well. We make our code publicly available.

     
  4. The Arecibo Pisces-Perseus Supercluster Survey (APPSS) attempts to detect the infall of galaxies onto the Pisces-Perseus Supercluster (PPS). The ALFALFA survey has greatly augmented the known redshifts across the region. APPSS sources will complement the ALFALFA sources, with the goal of building a sample large enough to make a high-confidence measurement of infall and backflow onto the PPS filament via peculiar velocity estimates from the Tully-Fisher (TFR) and Baryonic Tully-Fisher (BTFR) relations. APPSS galaxies are selected using photometric data from the Sloan Digital Sky Survey (SDSS), with the aim of detecting low-mass, nearby, gas-rich objects below the ALFALFA detection limit. The L-band wide receiver at Arecibo Observatory in Puerto Rico is used to obtain a five-minute ON-OFF measurement for each galaxy. Since the candidate galaxy redshifts are unknown, the receiver and spectrograph system are used in a search mode that spans the expected frequencies of HI emission from PPS galaxies. We will describe the goals, target selection, and data reduction process for the survey. Our collaboration has divided the PPS into two-degree-wide declination strips for data reduction; we report preliminary results for strips 23 and 33. We have completed the initial data reduction for more than 200 targets and determined the systemic velocity, line width, integrated flux density, and HI mass for each candidate detection. We will compare results for our two declination strips and point out interesting detections found along the way as examples of the data reduction process. This work has been supported by NSF grants AST-1211005 and AST-1637339. Publication: American Astronomical Society, AAS Meeting #233, id.356.07 Pub Date: January 2019 Bibcode: 2019AAS...23335607L
  5. Abstract

    Among the many structural assessment methods, tracking changes in modal characteristics is a well-accepted approach to damage detection. However, environmental or operational variations may contaminate the baseline and prevent a dependable assessment of the change. In recent years, machine learning algorithms have gained interest within the structural health community, especially because of their success in eliminating ambient uncertainty. This paper proposes an end-to-end architecture to detect damage reliably by employing machine learning algorithms. The proposed approach streamlines (a) collection of structural response data, (b) modal analysis using system identification, (c) the learning model, and (d) novelty detection. The proposed system aims to extract latent features of accessible modal parameters, such as natural frequencies and mode shapes, measured on the undamaged target structure under temperature uncertainty, and to reconstruct a new representation of these features that is similar to the original using well-established machine learning methods for damage detection. The deviation between measured and reconstructed parameters, also known as the novelty index, is the essential information for detecting critical changes in the system. The approach is evaluated by analyzing structural response data obtained from finite element models and experimental structures. For the machine learning component of the approach, both principal component analysis (PCA) and autoencoder (AE) are examined. While mode shapes are a well-researched damage indicator in the literature, to the best of our knowledge this research is the first to apply unsupervised machine learning with PCA and AE to mode shapes in addition to natural frequencies for effective damage detection. The detection performance of this pipeline is compared to a similar approach whose learning model does not use mode shapes. The results demonstrate that the effectiveness of damage detection under temperature variability improves significantly when mode shapes are used to train the learning algorithm. Especially for small damage, the proposed algorithm performs better in discriminating system changes.

     
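
The novelty index described in item 5 above (the deviation between measured and reconstructed modal parameters) can be illustrated with a short PCA sketch. This is not the authors' pipeline: the function names, the number of retained components, and the use of the Euclidean norm of the reconstruction residual as the index are assumptions made for illustration only.

```python
import numpy as np

def fit_pca(baseline, n_components):
    """Learn the mean and principal directions from undamaged (baseline) data."""
    x = np.asarray(baseline, dtype=float)
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def novelty_index(samples, mean, components):
    """Norm of the residual after projecting onto the retained components."""
    centered = np.asarray(samples, dtype=float) - mean
    reconstructed = centered @ components.T @ components
    return np.linalg.norm(centered - reconstructed, axis=1)
```

In this sketch, the retained components are meant to absorb the variation caused by environmental factors such as temperature, so that a measurement which cannot be reconstructed well from them receives a large index. A threshold on the index, for example a high percentile of the baseline values, would then flag candidate damage; the choice of threshold is left open here.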