Six Maxims of Statistical Acumen for Astronomical Data Analysis
Abstract The acquisition of complex astronomical data is accelerating, especially with newer telescopes producing ever more large-scale surveys. The increased quantity, complexity, and variety of astronomical data demand a parallel increase in skill and sophistication in developing, deciding, and deploying statistical methods. Understanding limitations and appreciating nuances in statistical and machine learning methods and the reasoning behind them is essential for improving data-analytic proficiency and acumen. Aiming to facilitate such improvement in astronomy, we delineate cautionary tales in statistics via six maxims, with examples drawn from the astronomical literature. Inspired by the significant quality improvement in business and manufacturing processes by the routine adoption of Six Sigma, we hope the routine reflection on these six maxims will improve the quality of both data analysis and scientific findings in astronomy.
- Award ID(s): 2113397
- PAR ID: 10555821
- Publisher / Repository: DOI PREFIX: 10.3847
- Journal Name: The Astrophysical Journal Supplement Series
- Volume: 275
- Issue: 2
- ISSN: 0067-0049
- Format(s): Medium: X; Size: Article No. 30
- Sponsoring Org: National Science Foundation
More Like this
- Abstract This paper presents a new statistical method that enables the use of systematic errors in the maximum-likelihood regression of integer-count Poisson data to a parametric model. The method is primarily aimed at the characterization of the goodness-of-fit statistic in the presence of the over-dispersion that is induced by sources of systematic error, and is based on a quasi-maximum-likelihood method that retains the Poisson distribution of the data. We show that the Poisson deviance, which is the usual goodness-of-fit statistic and is commonly referred to in astronomy as the Cash statistic, can be easily generalized in the presence of systematic errors, under rather general conditions. The method and the associated statistics are first developed theoretically, then tested with the aid of numerical simulations, and further illustrated with real-life data from astronomical observations. The statistical methods presented in this paper are intended as a simple general-purpose framework to include additional sources of uncertainty for the analysis of integer-count data in a variety of practical data analysis situations.
- Abstract Building on previous Bayesian approaches, we introduce a novel formulation of probabilistic cross-identification, where detections are directly associated with (hypothesized) astronomical objects in a globally optimal way. We show that this new method scales better for processing multiple catalogs than enumerating all possible candidates, especially in the limit of crowded fields, which is the most challenging observational regime for new-generation astronomy experiments such as the Rubin Observatory Legacy Survey of Space and Time. Here we study simulated catalogs where the ground truth is known and report on the statistical and computational performance of the method. The paper is accompanied by a public software tool to perform globally optimal catalog matching based on directional data.
- Achieving GPT-4o level performance in astronomy with a specialized 8B-parameter large language model
  Abstract AstroSage-Llama-3.1-8B is a domain-specialized natural-language AI assistant tailored for research in astronomy, astrophysics, cosmology, and astronomical instrumentation. Trained on the complete collection of astronomy-related arXiv papers from 2007 to 2024 along with millions of synthetically-generated question-answer pairs and other astronomical literature, AstroSage-Llama-3.1-8B demonstrates remarkable proficiency on a wide range of questions. AstroSage-Llama-3.1-8B scores 80.9% on the AstroMLab-1 benchmark, greatly outperforming all models (proprietary and open-weight) in the 8-billion parameter class, and performing on par with GPT-4o. This achievement demonstrates the potential of domain specialization in AI, suggesting that focused training can yield capabilities exceeding those of much larger, general-purpose models. AstroSage-Llama-3.1-8B is freely available, enabling widespread access to advanced AI capabilities for astronomical education and research.
- Abstract Many astronomical surveys are limited by the brightness of the sources, and gravitational-wave searches are no exception. The detectability of gravitational waves from merging binaries is affected by the mass and spin of the constituent compact objects. To perform unbiased inference on the distribution of compact binaries, it is necessary to account for this selection effect, which is known as Malmquist bias. Since systematic error from selection effects grows with the number of events, it will be increasingly important over the coming years to accurately estimate the observational selection function for gravitational-wave astronomy. We employ density estimation methods to accurately and efficiently compute the compact binary coalescence selection function. We introduce a simple pre-processing method, which significantly reduces the complexity of the required machine-learning models. We demonstrate that our method has smaller statistical errors, at comparable computational cost, than the method currently most widely used, allowing us to probe narrower distributions of spin magnitudes. The currently used method leaves 10%–50% of the interesting black hole spin models inaccessible; our new method can probe >99% of the models and has a lower uncertainty for >80% of the models.
