skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.

Title: Multivariate time series dataset for space weather data analytics

We introduce and make openly accessible a comprehensive, multivariate time series (MVTS) dataset extracted from solar photospheric vector magnetograms in Spaceweather HMI Active Region Patch (SHARP) series. Our dataset also includes a cross-checked NOAA solar flare catalog that immediately facilitates solar flare prediction efforts. We discuss methods used for data collection, cleaning and pre-processing of the solar active region and flare data, and we further describe a novel data integration and sampling methodology. Our dataset covers 4,098 MVTS data collections from active regions occurring between May 2010 and December 2018, includes 51 flare-predictive parameters, and integrates over 10,000 flare reports. Potential directions toward expansion of the time series, either “horizontally” – by adding more prediction-specific parameters, or “vertically” – by generalizing flare into integrated solar eruption prediction, are also explained. The immediate tasks enabled by the disseminated dataset include: optimization of solar flare prediction and detailed investigation for elusive flare predictors or precursors, with both operational (research-to-operations), and basic research (operations-to-research) benefits potentially following in the future.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Scientific Data
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Solar flares are explosions on the Sun. They happen when energy stored in magnetic fields around solar active regions (ARs) is suddenly released. Solar flares and accompanied coronal mass ejections are sources of space weather, which negatively affects a variety of technologies at or near Earth, ranging from blocking high-frequency radio waves used for radio communication to degrading power grid operations. Monitoring and providing early and accurate prediction of solar flares is therefore crucial for preparedness and disaster risk management. In this article, we present a transformer-based framework, named SolarFlareNet, for predicting whether an AR would produce a$$\gamma$$γ-class flare within the next 24 to 72 h. We consider three$$\gamma$$γclasses, namely the$$\ge$$M5.0 class, the$$\ge$$M class and the$$\ge$$C class, and build three transformers separately, each corresponding to a$$\gamma$$γclass. Each transformer is used to make predictions of its corresponding$$\gamma$$γ-class flares. The crux of our approach is to model data samples in an AR as time series and to use transformers to capture the temporal dynamics of the data samples. Each data sample consists of magnetic parameters taken from Space-weather HMI Active Region Patches (SHARP) and related data products. We survey flare events that occurred from May 2010 to December 2022 using the Geostationary Operational Environmental Satellite X-ray flare catalogs provided by the National Centers for Environmental Information (NCEI), and build a database of flares with identified ARs in the NCEI flare catalogs. This flare database is used to construct labels of the data samples suitable for machine learning. We further extend the deterministic approach to a calibration-based probabilistic forecasting method. The SolarFlareNet system is fully operational and is capable of making near real-time predictions of solar flares on the Web.

    more » « less
  2. Abstract We present a case study of solar flare forecasting by means of metadata feature time series, by treating it as a prominent class-imbalance and temporally coherent problem. Taking full advantage of pre-flare time series in solar active regions is made possible via the Space Weather Analytics for Solar Flares (SWAN-SF) benchmark data set, a partitioned collection of multivariate time series of active region properties comprising 4075 regions and spanning over 9 yr of the Solar Dynamics Observatory period of operations. We showcase the general concept of temporal coherence triggered by the demand of continuity in time series forecasting and show that lack of proper understanding of this effect may spuriously enhance models’ performance. We further address another well-known challenge in rare-event prediction, namely, the class-imbalance issue. The SWAN-SF is an appropriate data set for this, with a 60:1 imbalance ratio for GOES M- and X-class flares and an 800:1 imbalance ratio for X-class flares against flare-quiet instances. We revisit the main remedies for these challenges and present several experiments to illustrate the exact impact that each of these remedies may have on performance. Moreover, we acknowledge that some basic data manipulation tasks such as data normalization and cross validation may also impact the performance; we discuss these problems as well. In this framework we also review the primary advantages and disadvantages of using true skill statistic and Heidke skill score, two widely used performance verification metrics for the flare-forecasting task. In conclusion, we show and advocate for the benefits of time series versus point-in-time forecasting, provided that the above challenges are measurably and quantitatively addressed. 
    more » « less
  3. Abstract

    Solar flares, especially the M- and X-class flares, are often associated with coronal mass ejections. They are the most important sources of space weather effects, which can severely impact the near-Earth environment. Thus it is essential to forecast flares (especially the M- and X-class ones) to mitigate their destructive and hazardous consequences. Here, we introduce several statistical and machine-learning approaches to the prediction of an active region’s (AR) flare index (FI) that quantifies the flare productivity of an AR by taking into account the number of different class flares within a certain time interval. Specifically, our sample includes 563 ARs that appeared on the solar disk from 2010 May to 2017 December. The 25 magnetic parameters, provided by the Space-weather HMI Active Region Patches (SHARP) from the Helioseismic and Magnetic Imager on board the Solar Dynamics Observatory, characterize coronal magnetic energy stored in ARs by proxy and are used as the predictors. We investigate the relationship between these SHARP parameters and the FI of ARs with a machine-learning algorithm (spline regression) and the resampling method (Synthetic Minority Oversampling Technique for Regression with Gaussian Noise). Based on the established relationship, we are able to predict the value of FIs for a given AR within the next 1 day period. Compared with other four popular machine-learning algorithms, our methods improve the accuracy of FI prediction, especially for a large FI. In addition, we sort the importance of SHARP parameters by the Borda count method calculated from the ranks that are rendered by nine different machine-learning methods.

    more » « less
  4. Abstract

    We develop a mixed long short‐term memory (LSTM) regression model to predict the maximum solar flare intensity within a 24‐hr time window 0–24, 6–30, 12–36, and 24–48 hr ahead of time using 6, 12, 24, and 48 hr of data (predictors) for each Helioseismic and Magnetic Imager (HMI) Active Region Patch (HARP). The model makes use of (1) the Space‐Weather HMI Active Region Patch (SHARP) parameters as predictors and (2) the exact flare intensities instead of class labels recorded in the Geostationary Operational Environmental Satellites (GOES) data set, which serves as the source of the response variables. Compared to solar flare classification, the model offers us more detailed information about the exact maximum flux level, that is, intensity, for each occurrence of a flare. We also consider classification models built on top of the regression model and obtain better results in solar flare classifications as compared to Chen et al. (2019, Our results suggest that the most efficient time period for predicting the solar activity is within 24 hr before the prediction time using the SHARP parameters and the LSTM model.

    more » « less
  5. Solar flare prediction is a central problem in space weather forecasting and has captivated the attention of a wide spectrum of researchers due to recent advances in both remote sensing as well as machine learning and deep learning approaches. The experimental findings based on both machine and deep learning models reveal significant performance improvements for task specific datasets. Along with building models, the practice of deploying such models to production environments under operational settings is a more complex and often time-consuming process which is often not addressed directly in research settings. We present a set of new heuristic approaches to train and deploy an operational solar flare prediction system for ≥M1.0-class flares with two prediction modes: full-disk and active region-based. In full-disk mode, predictions are performed on full-disk line-of-sight magnetograms using deep learning models whereas in active region-based models, predictions are issued for each active region individually using multivariate time series data instances. The outputs from individual active region forecasts and full-disk predictors are combined to a final full-disk prediction result with a meta-model. We utilized an equal weighted average ensemble of two base learners’ flare probabilities as our baseline meta learner and improved the capabilities of our two base learners by training a logistic regression model. The major findings of this study are: 1) We successfully coupled two heterogeneous flare prediction models trained with different datasets and model architecture to predict a full-disk flare probability for next 24 h, 2) Our proposed ensembling model, i.e., logistic regression, improves on the predictive performance of two base learners and the baseline meta learner measured in terms of two widely used metrics True Skill Statistic (TSS) and Heidke Skill Score (HSS), and 3) Our result analysis suggests that the logistic regression-based ensemble (Meta-FP) improves on the full-disk model (base learner) by ∼9% in terms TSS and ∼10% in terms of HSS. Similarly, it improves on the AR-based model (base learner) by ∼17% and ∼20% in terms of TSS and HSS respectively. Finally, when compared to the baseline meta model, it improves on TSS by ∼10% and HSS by ∼15%. 
    more » « less