skip to main content

Title: Beyond the hubble sequence – exploring galaxy morphology with unsupervised machine learning
ABSTRACT We explore unsupervised machine learning for galaxy morphology analyses using a combination of feature extraction with a vector-quantized variational autoencoder (VQ-VAE) and hierarchical clustering (HC). We propose a new methodology that includes: (1) consideration of the clustering performance simultaneously when learning features from images; (2) allowing for various distance thresholds within the HC algorithm; (3) using the galaxy orientation to determine the number of clusters. This set-up provides 27 clusters created with this unsupervised learning that we show are well separated based on galaxy shape and structure (e.g. Sérsic index, concentration, asymmetry, Gini coefficient). These resulting clusters also correlate well with physical properties such as the colour–magnitude diagram, and span the range of scaling relations such as mass versus size amongst the different machine-defined clusters. When we merge these multiple clusters into two large preliminary clusters to provide a binary classification, an accuracy of $\sim 87{{\ \rm per\ cent}}$ is reached using an imbalanced data set, matching real galaxy distributions, which includes 22.7 per cent early-type galaxies and 77.3 per cent late-type galaxies. Comparing the given clusters with classic Hubble types (ellipticals, lenticulars, early spirals, late spirals, and irregulars), we show that there is an intrinsic vagueness in visual classification systems, in particular more » galaxies with transitional features such as lenticulars and early spirals. Based on this, the main result in this work is not how well our unsupervised method matches visual classifications and physical properties, but that the method provides an independent classification that may be more physically meaningful than any visually based ones. « less
; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Monthly Notices of the Royal Astronomical Society
Sponsoring Org:
National Science Foundation
More Like this

    In this work, we explore the possibility of applying machine learning methods designed for 1D problems to the task of galaxy image classification. The algorithms used for image classification typically rely on multiple costly steps, such as the point spread function deconvolution and the training and application of complex Convolutional Neural Networks of thousands or even millions of parameters. In our approach, we extract features from the galaxy images by analysing the elliptical isophotes in their light distribution and collect the information in a sequence. The sequences obtained with this method present definite features allowing a direct distinction between galaxy types. Then, we train and classify the sequences with machine learning algorithms, designed through the platform Modulos AutoML. As a demonstration of this method, we use the second public release of the Dark Energy Survey (DES DR2). We show that we are able to successfully distinguish between early-type and late-type galaxies, for images with signal-to-noise ratio greater than 300. This yields an accuracy of $86{{\ \rm per\ cent}}$ for the early-type galaxies and $93{{\ \rm per\ cent}}$ for the late-type galaxies, which is on par with most contemporary automated image classification approaches. The data dimensionality reduction of our novelmore »method implies a significant lowering in computational cost of classification. In the perspective of future data sets obtained with e.g. Euclid and the Vera Rubin Observatory, this work represents a path towards using a well-tested and widely used platform from industry in efficiently tackling galaxy classification problems at the peta-byte scale.

    « less

    Galaxy morphology is a fundamental quantity, which is essential not only for the full spectrum of galaxy-evolution studies, but also for a plethora of science in observational cosmology (e.g. as a prior for photometric-redshift measurements and as contextual data for transient light-curve classifications). While a rich literature exists on morphological-classification techniques, the unprecedented data volumes, coupled, in some cases, with the short cadences of forthcoming ‘Big-Data’ surveys (e.g. from the LSST), present novel challenges for this field. Large data volumes make such data sets intractable for visual inspection (even via massively distributed platforms like Galaxy Zoo), while short cadences make it difficult to employ techniques like supervised machine learning, since it may be impractical to repeatedly produce training sets on short time-scales. Unsupervised machine learning, which does not require training sets, is ideally suited to the morphological analysis of new and forthcoming surveys. Here, we employ an algorithm that performs clustering of graph representations, in order to group image patches with similar visual properties and objects constructed from those patches, like galaxies. We implement the algorithm on the Hyper-Suprime-Cam Subaru-Strategic-Program Ultra-Deep survey, to autonomously reduce the galaxy population to a small number (160) of ‘morphological clusters’, populated by galaxiesmore »with similar morphologies, which are then benchmarked using visual inspection. The morphological classifications (which we release publicly) exhibit a high level of purity, and reproduce known trends in key galaxy properties as a function of morphological type at z < 1 (e.g. stellar-mass functions, rest-frame colours, and the position of galaxies on the star-formation main sequence). Our study demonstrates the power of unsupervised machine learning in performing accurate morphological analysis, which will become indispensable in this new era of deep-wide surveys.

    « less

    Misalignments between the rotation axis of stars and gas are an indication of external processes shaping galaxies throughout their evolution. Using observations of 3068 galaxies from the SAMI Galaxy Survey, we compute global kinematic position angles for 1445 objects with reliable kinematics and identify 169 (12 per cent) galaxies which show stellar-gas misalignments. Kinematically decoupled features are more prevalent in early-type/passive galaxies compared to late-type/star-forming systems. Star formation is the main source of gas ionization in only 22 per cent of misaligned galaxies; 17 per cent are Seyfert objects, while 61 per cent show Low-Ionization Nuclear Emission-line Region features. We identify the most probable physical cause of the kinematic decoupling and find that, while accretion-driven cases are dominant, for up to 8 per cent of our sample, the misalignment may be tracing outflowing gas. When considering only misalignments driven by accretion, the acquired gas is feeding active star formation in only ∼1/4 of cases. As a population, misaligned galaxies have higher Sérsic indices and lower stellar spin and specific star formation rates than appropriately matched samples of aligned systems. These results suggest that both morphology and star formation/gas content are significantly correlated with the prevalence and timescales of misalignments. Specifically, torques on misaligned gas discs are smaller for more centrallymore »concentrated galaxies, while the newly accreted gas feels lower viscous drag forces in more gas-poor objects. Marginal evidence of star formation not being correlated with misalignment likelihood for late-type galaxies suggests that such morphologies in the nearby Universe might be the result of preferentially aligned accretion at higher redshifts.

    « less
  4. Introduction: Vaso-occlusive crises (VOCs) are a leading cause of morbidity and early mortality in individuals with sickle cell disease (SCD). These crises are triggered by sickle red blood cell (sRBC) aggregation in blood vessels and are influenced by factors such as enhanced sRBC and white blood cell (WBC) adhesion to inflamed endothelium. Advances in microfluidic biomarker assays (i.e., SCD Biochip systems) have led to clinical studies of blood cell adhesion onto endothelial proteins, including, fibronectin, laminin, P-selectin, ICAM-1, functionalized in microchannels. These microfluidic assays allow mimicking the physiological aspects of human microvasculature and help characterize biomechanical properties of adhered sRBCs under flow. However, analysis of the microfluidic biomarker assay data has so far relied on manual cell counting and exhaustive visual morphological characterization of cells by trained personnel. Integrating deep learning algorithms with microscopic imaging of adhesion protein functionalized microfluidic channels can accelerate and standardize accurate classification of blood cells in microfluidic biomarker assays. Here we present a deep learning approach into a general-purpose analytical tool covering a wide range of conditions: channels functionalized with different proteins (laminin or P-selectin), with varying degrees of adhesion by both sRBCs and WBCs, and in both normoxic and hypoxic environments. Methods: Our neuralmore »networks were trained on a repository of manually labeled SCD Biochip microfluidic biomarker assay whole channel images. Each channel contained adhered cells pertaining to clinical whole blood under constant shear stress of 0.1 Pa, mimicking physiological levels in post-capillary venules. The machine learning (ML) framework consists of two phases: Phase I segments pixels belonging to blood cells adhered to the microfluidic channel surface, while Phase II associates pixel clusters with specific cell types (sRBCs or WBCs). Phase I is implemented through an ensemble of seven generative fully convolutional neural networks, and Phase II is an ensemble of five neural networks based on a Resnet50 backbone. Each pixel cluster is given a probability of belonging to one of three classes: adhered sRBC, adhered WBC, or non-adhered / other. Results and Discussion: We applied our trained ML framework to 107 novel whole channel images not used during training and compared the results against counts from human experts. As seen in Fig. 1A, there was excellent agreement in counts across all protein and cell types investigated: sRBCs adhered to laminin, sRBCs adhered to P-selectin, and WBCs adhered to P-selectin. Not only was the approach able to handle surfaces functionalized with different proteins, but it also performed well for high cell density images (up to 5000 cells per image) in both normoxic and hypoxic conditions (Fig. 1B). The average uncertainty for the ML counts, obtained from accuracy metrics on the test dataset, was 3%. This uncertainty is a significant improvement on the 20% average uncertainty of the human counts, estimated from the variance in repeated manual analyses of the images. Moreover, manual classification of each image may take up to 2 hours, versus about 6 minutes per image for the ML analysis. Thus, ML provides greater consistency in the classification at a fraction of the processing time. To assess which features the network used to distinguish adhered cells, we generated class activation maps (Fig. 1C-E). These heat maps indicate the regions of focus for the algorithm in making each classification decision. Intriguingly, the highlighted features were similar to those used by human experts: the dimple in partially sickled RBCs, the sharp endpoints for highly sickled RBCs, and the uniform curvature of the WBCs. Overall the robust performance of the ML approach in our study sets the stage for generalizing it to other endothelial proteins and experimental conditions, a first step toward a universal microfluidic ML framework targeting blood disorders. Such a framework would not only be able to integrate advanced biophysical characterization into fast, point-of-care diagnostic devices, but also provide a standardized and reliable way of monitoring patients undergoing targeted therapies and curative interventions, including, stem cell and gene-based therapies for SCD. Disclosures Gurkan: Dx Now Inc.: Patents & Royalties; Xatek Inc.: Patents & Royalties; BioChip Labs: Patents & Royalties; Hemex Health, Inc.: Consultancy, Current Employment, Patents & Royalties, Research Funding.« less
  5. Galaxy images of the order of multi-PB are collected as part of modern digital sky surveys using robotic telescopes. While there is a plethora of imaging data available, the majority of the images that are captured resemble galaxies that are “regular”, i.e., galaxy types that are already known and probed. However, “novelty" galaxy types, i.e., little-known galaxy types are encountered on occasion. The astronomy community shows paramount interest in the novelty galaxy types since they contain the potential for scientific discovery. However, since these galaxies are rare, the identification of such novelty galaxies is not trivial and requires automation techniques. Since these novelty galaxies are by definition, not known, supervised machine learning models cannot be trained to detect them. In this paper, an unsupervised machine learning method for automatic detection of novelty galaxies in large databases is proposed. The method uses a large set of image features weighted by their entropy. To handle the impact of self-similar novelty galaxies, the most similar galaxies are ranked-ordered. In addition, Bag of Visual Words (BOVW) is assimilated to the problem of detecting novelty galaxies. Each image in the dataset is represented as a set of features made up of key-points and descriptors. Amore »histogram of the features is constructed and is leveraged to identify the neighbors of each of the images. Experimental results using data from the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) show that the performance of the methods in detecting novelty galaxies is superior to other shallow learning methods such as one-class SVM, Local Outlier Factor, and K-Means, and also newer deep learning-based methods such as auto-encoders. The dataset used to evaluate the method is publicly available and can be used as a benchmark to test future algorithms for automatic detection of peculiar galaxies.« less