Abstract We propose a novel exact method to solve the probabilistic catalog matching problem faster than previously possible. Our approach uses mixed integer programming and introduces quadratic constraints that shrink the problem by multiple orders of magnitude. We also provide a method that uses a known feasible solution to dramatically speed up our algorithm; the performance gain depends on how close to optimal that feasible solution is. In addition, good solutions can be obtained by stopping the mixed integer programming solver early. Using simulated catalogs, we empirically show that our mixed integer program with quadratic constraints can be set up and solved much faster than previous large linear formulations. We also demonstrate our approach on real-world data from the Hubble Source Catalog. This paper is accompanied by publicly available software that demonstrates the proposed method.
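To make the abstract's three computational ingredients concrete (quadratic constraints, a warm start from a feasible solution, and early stopping), here is a minimal sketch assuming the gurobipy solver API rather than the paper's released software. The toy scores, the particular quadratic constraint, and the warm-start assignment are all illustrative placeholders, not the paper's actual formulation.

```python
import gurobipy as gp
from gurobipy import GRB

# Toy instance: 4 detections, 3 hypothesized objects, made-up scores
# (e.g., log association weights) for a subset of candidate pairs.
scores = {
    (0, 0): 2.1, (0, 1): 0.3, (1, 0): 1.8, (1, 1): 0.2,
    (2, 1): 2.5, (2, 2): 0.4, (3, 1): 0.1, (3, 2): 1.9,
}
catalog_of = {0: 0, 1: 1, 2: 0, 3: 1}  # source catalog of each detection

m = gp.Model("matching_sketch")
x = m.addVars(scores.keys(), vtype=GRB.BINARY, name="x")

# Each detection is associated with exactly one hypothesized object.
for d in catalog_of:
    m.addConstr(gp.quicksum(x[dd, o] for dd, o in scores if dd == d) == 1)

# Illustrative quadratic constraints: two detections from the same catalog
# cannot share an object, encoded as vanishing products of binaries.
m.Params.NonConvex = 2  # allow nonconvex quadratic constraints
for d1, o1 in scores:
    for d2, o2 in scores:
        if d1 < d2 and o1 == o2 and catalog_of[d1] == catalog_of[d2]:
            m.addConstr(x[d1, o1] * x[d2, o2] == 0)

m.setObjective(gp.quicksum(s * x[k] for k, s in scores.items()), GRB.MAXIMIZE)

# Warm start from any feasible assignment (here, a greedy guess)...
for k in scores:
    x[k].Start = 0.0
for k in [(0, 0), (1, 0), (2, 1), (3, 2)]:
    x[k].Start = 1.0

# ...and optionally stop early once the relative optimality gap is small.
m.Params.MIPGap = 0.05
m.optimize()
```

In this sketch, the warm start corresponds to the feasible-solution speedup described above, and loosening MIPGap corresponds to stopping the solver early while retaining a certified bound on suboptimality.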
Globally Optimal and Scalable N-way Matching of Astronomy Catalogs
Abstract Building on previous Bayesian approaches, we introduce a novel formulation of probabilistic cross-identification in which detections are directly associated with (hypothesized) astronomical objects in a globally optimal way. We show that this new method scales better for processing multiple catalogs than enumerating all possible candidates, especially in the limit of crowded fields, which is the most challenging observational regime for new-generation astronomy experiments such as the Rubin Observatory Legacy Survey of Space and Time. Here we study simulated catalogs where the ground truth is known and report on the statistical and computational performance of the method. The paper is accompanied by a public software tool to perform globally optimal catalog matching based on directional data.
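As a concrete illustration of the two-catalog special case, matching with pairwise association costs reduces to a linear assignment problem that can be solved to global optimality with the Hungarian algorithm. The sketch below uses a small-angle Gaussian cost as a stand-in for the directional (Bayes factor) weights; the N-way formulation treated in the paper is substantially harder than this special case.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
true_pos = rng.uniform(0.0, 1.0, size=(5, 2))       # toy field, degrees
cat1 = true_pos + rng.normal(0, 1e-4, size=(5, 2))  # catalog 1 detections
cat2 = true_pos + rng.normal(0, 1e-4, size=(5, 2))  # catalog 2 detections

# Pairwise squared separations -> Gaussian negative log-likelihood costs.
sigma2 = 2 * (1e-4) ** 2  # combined positional variance of a pair
d2 = ((cat1[:, None, :] - cat2[None, :, :]) ** 2).sum(axis=-1)
cost = d2 / (2 * sigma2)

# Globally optimal one-to-one association between the two catalogs.
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows.tolist(), cols.tolist())))
```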
- PAR ID: 10485095
- Publisher / Repository: DOI PREFIX: 10.3847
- Date Published:
- Journal Name: The Astronomical Journal
- Volume: 163
- Issue: 6
- ISSN: 0004-6256
- Format(s): Medium: X
- Size(s): Article No. 296
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Repeating earthquakes—sequences of colocated, quasi-periodic earthquakes of similar size—are widespread along California’s San Andreas fault (SAF) system. Catalogs of repeating earthquakes are vital for studying earthquake source processes and fault properties and for improving seismic hazard models. Here, we introduce an unsupervised machine learning-based method for detecting repeating earthquake sequences (RES) to expand existing RES catalogs or to perform initial, exploratory searches. We implement the “SpecUFEx” algorithm (Holtzman et al., 2018) to reduce earthquake spectrograms into low-dimensional, characteristic fingerprints and apply hierarchical clustering to group similar fingerprints together independent of location, allowing for a global search for potential RES throughout the data set. We then relocate the potential RES and subject them to the same detection criteria as Waldhauser and Schaff (2021). We apply our method to ∼4000 small (ML 0–3.5) earthquakes located on a 10 km long segment of the creeping SAF and double the number of detected RES, allowing for greater spatial coverage of slip-rate estimations at seismogenic depths. Our method is novel in its ability to detect RES independent of initial locations and is complementary to existing cross-correlation-based methods, leading to more complete RES catalogs and a better understanding of slip rates at depth.
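The clustering stage of this pipeline can be sketched as below. The random fingerprints stand in for SpecUFEx output, and the linkage method and distance threshold are illustrative assumptions; the relocation step and the formal RES detection criteria are not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
fingerprints = rng.normal(size=(200, 16))  # 200 events x 16-dim fingerprints

# Agglomerative clustering of fingerprints, independent of event location.
Z = linkage(fingerprints, method="ward")
labels = fcluster(Z, t=25.0, criterion="distance")  # cut the tree into groups

# Each cluster of similar fingerprints is a candidate repeating sequence,
# to be relocated and screened against formal RES detection criteria.
for c in np.unique(labels)[:3]:
    print(int(c), np.where(labels == c)[0][:5])
```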
-
Abstract Biodiversity catalogs are an invaluable resource for biological research. Efforts to scientifically document biodiversity have not been evenly applied, either because of charisma or because of ease of study. Spiders are among the most precisely cataloged and diverse invertebrates, having surpassed 50,000 described species globally. The World Spider Catalog presents a unique opportunity to assess the disproportionate documentation of spider diversity. In the present article, we develop a taxonomic ratio relating new species descriptions to other taxonomic activity as a proxy for taxonomic effort, using spiders as a case study. We use this taxonomic effort metric to examine biases along multiple axes: phylogeny, zoogeography, and socioeconomics. We also use this metric to estimate the number of species that remain to be described. This work informs arachnologists in identifying high-priority taxa and regions for species discovery and highlights the benefits of maintaining open-access taxonomic databases—a necessary step in overcoming bias and documenting the world's biodiversity.
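As an illustration only, an effort proxy of this general kind might be computed as below; the exact definition of the paper's taxonomic ratio is not reproduced here, and the counts are hypothetical.

```python
def taxonomic_effort_ratio(new_species: int, other_acts: int) -> float:
    """New species descriptions as a fraction of all recorded taxonomic
    activity for a group; a hypothetical stand-in for the paper's metric."""
    total = new_species + other_acts
    return new_species / total if total else 0.0

# A group with 120 new descriptions out of 600 total taxonomic acts.
print(taxonomic_effort_ratio(new_species=120, other_acts=480))  # 0.2
```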
-
Abstract We present the second data release of the Massive and Distant Clusters of WISE Survey 2 (MaDCoWS2). We expand from the equatorial first data release to most of the Dark Energy Camera Legacy Survey area, covering a total area of 6498 deg². The catalog consists of 133,036 signal-to-noise ratio (S/N) ≥ 5 galaxy cluster candidates at 0.1 ≤ z ≤ 2, including 6790 candidates at z > 1.5. We train a convolutional neural network (CNN) to identify spurious detections and include CNN-based cluster probabilities in the final catalog. We also compare the MaDCoWS2 sample with literature catalogs in the same area. The larger sample provides robust results that are consistent with our first data release. At S/N ≥ 5, we rediscover 59%–91% of clusters in existing catalogs that lie in the unmasked area of MC2. The median positional offsets are under 250 kpc, and the standard deviation of the redshifts is 0.031(1 + z). We fit a redshift-dependent power law to the relation between MaDCoWS2 S/N and observables from existing catalogs. Over the redshift ranges where the surveys overlap with MaDCoWS2, the lowest scatter is found between S/N and observables from optical/infrared surveys. We also assess the performance of our method using a mock light cone, measuring purity and completeness as a function of cluster mass. The purity is above 90%, and we estimate the 50% completeness threshold at a virial mass of log(M/M⊙) ≈ 14.3. The completeness estimate is uncertain due to the small number of massive halos in the light cone, but consistent with the recovery fraction found by comparing to other cluster catalogs.
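A redshift-dependent power-law calibration of the kind described can be sketched as follows; the functional form, pivot, and synthetic data are illustrative assumptions rather than the catalog's actual fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(X, A, B, C):
    obs, z = X  # observable from another survey, cluster redshift
    return A * obs**B * ((1.0 + z) / 1.5) ** C  # power law with z evolution

rng = np.random.default_rng(2)
obs = rng.uniform(1.0, 10.0, 300)  # e.g., richness from an overlapping survey
z = rng.uniform(0.1, 2.0, 300)
snr = 2.0 * obs**0.8 * ((1.0 + z) / 1.5) ** -0.5 * rng.lognormal(0.0, 0.1, 300)

params, _ = curve_fit(model, (obs, z), snr, p0=(1.0, 1.0, 0.0))
print(params)  # recovered (A, B, C), close to (2.0, 0.8, -0.5)
```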
-
We study the problem of learning to choose from $m$ discrete treatment options (e.g., news item or medical drug) the one with the best causal effect for a particular instance (e.g., user or patient), where the training data consist of passive observations of covariates, treatment, and the outcome of the treatment. The standard approach to this problem is regress and compare: split the training data by treatment, fit a regression model in each split, and, for a new instance, predict all $m$ outcomes and pick the best. By reformulating the problem as a single learning task rather than $m$ separate ones, we propose a new approach based on recursively partitioning the data into regimes where different treatments are optimal. We extend this approach to an optimal partitioning approach that finds a globally optimal partition, achieving a compact, interpretable, and impactful personalization model. We develop new tools for validating and evaluating personalization models on observational data and use these to demonstrate the power of our novel approaches in a personalized medicine and a job training application.
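The regress-and-compare baseline described in the passage, which the paper's partitioning approach improves upon, can be sketched as below; the data-generating process and choice of regressor are illustrative assumptions, and larger outcomes are assumed to be better.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))      # covariates
t = rng.integers(0, 3, size=500)   # observed treatment in {0, 1, 2}
y = np.choose(t, [X[:, 0], X[:, 1], X[:, 2]]) + rng.normal(0, 0.1, 500)

# Regress: fit one outcome model per treatment split.
models = {k: RandomForestRegressor(random_state=0).fit(X[t == k], y[t == k])
          for k in range(3)}

# Compare: for a new instance, predict all outcomes and pick the best.
x_new = rng.normal(size=(1, 5))
best = max(range(3), key=lambda k: float(models[k].predict(x_new)[0]))
print("recommended treatment:", best)
```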