Honeybees, as natural crop pollinators, play a significant role in biodiversity and food production for human civilization. Bees actively regulate hive temperature (homeostasis) to maintain a colony's proper functionality. Deviations from usual thermoregulation behavior due to external stressors (e.g., extreme environmental temperature, parasites, pesticide exposure) indicate an impending colony collapse. Anticipating such threats by forecasting hive temperature and finding changes in temperature patterns would allow beekeepers to take early preventive measures and avoid critical issues. In that case, how can we model bees' thermoregulation behavior for an interpretable and effective hive monitoring system? In this article, we propose the principled Electronic Bee-Veterinarian Plus (EBV+) method based on the thermal diffusion equation and a novel "sigmoid" feedback-loop (P) controller for analyzing hive health, with the following properties: (i) it is effective on multiple, real-world beehive time sequences (recorded and streaming), (ii) it is explainable with only a few parameters (e.g., hive health factor) that beekeepers can easily quantify and trust, (iii) it issues proactive alerts to beekeepers before any potential issue affecting homeostasis becomes detrimental, and (iv) it is scalable with a time complexity of \(O(t)\) for reconstructing and \(O(t \times m)\) for finding \(m\) cuts of a sequence with \(t\) time-ticks. Experimental results on multiple real-world time sequences showcase the potential and practical feasibility of EBV+. Our method yields accurate forecasting (up to 72% improvement in RMSE) with up to 600 times fewer parameters compared to baselines (ARX, seasonal ARX, Holt-Winters, and DeepAR), and it detects discontinuities and raises alerts that coincide with domain experts' opinions. Moreover, EBV+ is scalable and fast, taking less than 1 minute on a stock laptop to reconstruct 2 months of sensor data.
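To make the modeling idea concrete, here is a minimal sketch (not the authors' released implementation) of a thermal-diffusion term combined with a saturating, sigmoid-shaped proportional controller. The parameter names (`health`, `k_env`, `gain`) and the 35 °C brood set-point are illustrative assumptions; the simulation runs in a single pass over the series, in line with the \(O(t)\) reconstruction cost stated above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_hive(temp_env, t_hive0=35.0, set_point=35.0,
                  k_env=0.05, health=1.0, gain=2.0, dt=1.0):
    """Forward-simulate hive temperature under a diffusion term plus a
    sigmoid P-controller standing in for the bees' collective
    thermoregulation. All names and values here are illustrative."""
    t_hive = np.empty(len(temp_env))
    t_hive[0] = t_hive0
    for i in range(1, len(temp_env)):
        # Passive heat exchange with the environment (Newton's law of cooling).
        diffusion = k_env * (temp_env[i - 1] - t_hive[i - 1])
        # Active correction by the bees: a saturating (sigmoid) response to the
        # deviation from the brood set-point, scaled by a 'health' factor.
        error = set_point - t_hive[i - 1]
        control = health * (2.0 * sigmoid(gain * error) - 1.0)
        t_hive[i] = t_hive[i - 1] + dt * (diffusion + control)
    return t_hive

# Example: one day of hourly ambient readings swinging around 20 degrees C.
ambient = 20.0 + 8.0 * np.sin(np.linspace(0, 2 * np.pi, 24))
print(simulate_hive(ambient).round(2))
```

A low `health` value lets the hive temperature drift toward ambient, which is the kind of deviation from homeostasis that the alerts described above are meant to catch early.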
EBV: Electronic Bee-Veterinarian for Principled Mining and Forecasting of Honeybee Time Series
Honeybees are vital for pollination and food production. Among many factors, extreme temperature (e.g., due to climate change) is particularly dangerous for bee health. Anticipating such extremities would allow beekeepers to take early preventive action. Thus, given sensor (temperature) time series data from beehives, how can we find patterns and do forecasting? Forecasting is crucial as it helps spot unexpected behavior and thus issue warnings to the beekeepers. In that case, what are the right models for forecasting? ARIMA, RNNs, or something else? We propose the EBV (Electronic Bee-Veterinarian) method, which has the following desirable properties: (i) principled: it is based on (a) diffusion equations from physics and (b) control theory for feedback-loop controllers; (ii) effective: it works well on multiple, real-world time sequences, (iii) explainable: it needs only a handful of parameters (e.g., bee strength) that beekeepers can easily understand and trust, and (iv) scalable: it performs linearly in time. We applied our method to multiple real-world time sequences and found that it yields accurate forecasting (up to 49% improvement in RMSE compared to baselines) and segmentation. Specifically, discontinuities detected by EBV mostly coincide with domain experts' opinions, showcasing our approach's potential and practical feasibility. Moreover, EBV is scalable and fast, taking about 20 minutes on a stock laptop to reconstruct two months of sensor data.
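The segmentation idea (finding cuts where the fitted behavior changes) can likewise be illustrated with a hedged sketch. The real method fits the physics-based model per segment; this toy instead scores each segment by its residual around the segment mean, using prefix sums so that one scan over all candidate cuts stays linear in the sequence length, and repeating the scan for m cuts mirrors the \(O(t \times m)\) flavor quoted above.

```python
import numpy as np

def sse_table(series):
    """Prefix sums so the sum of squared deviations from the mean of any
    slice [i, j) can be read off in O(1)."""
    s1 = np.concatenate([[0.0], np.cumsum(series)])
    s2 = np.concatenate([[0.0], np.cumsum(series ** 2)])
    def sse(i, j):
        n = j - i
        mean = (s1[j] - s1[i]) / n
        return (s2[j] - s2[i]) - n * mean ** 2
    return sse

def best_single_cut(series, min_len=10):
    """Scan all candidate cut points in one linear pass and return the cut
    that minimizes the total two-segment reconstruction error."""
    sse = sse_table(series)
    t = len(series)
    best_cut, best_err = None, sse(0, t)
    for cut in range(min_len, t - min_len):
        err = sse(0, cut) + sse(cut, t)
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut, best_err

# Example: a level shift at index 50 should be recovered as the cut.
rng = np.random.default_rng(0)
series = np.concatenate([np.full(50, 34.5), np.full(50, 30.0)])
series = series + 0.1 * rng.standard_normal(100)
print(best_single_cut(series))
```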
- Award ID(s):
- 1954644
- PAR ID:
- 10658843
- Publisher / Repository:
- Society for Industrial and Applied Mathematics
- Date Published:
- Page Range / eLocation ID:
- 298 to 306
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Advances in sensor technology have enabled the collection of large-scale datasets. Such datasets can be extremely noisy and often contain a significant number of outliers that result from sensor malfunction or human operation faults. In order to utilize such data for real-world applications, it is critical to detect outliers so that models built from these datasets will not be skewed by them. In this paper, we propose a new outlier detection method that utilizes the correlations in the data (e.g., taxi trip distance vs. trip time). Different from existing outlier detection methods, we build a robust regression model that explicitly models the outliers and detects them simultaneously with the model fitting. We validate our approach on real-world datasets against methods specifically designed for each dataset as well as state-of-the-art outlier detectors. Our method achieves better performance, demonstrating its robustness and generality. Lastly, we report interesting case studies on some outliers that result from atypical events.
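To illustrate the general idea of a regression that explicitly carries an outlier term (a hypothetical sketch, not the paper's estimator), one can alternate between fitting on currently inlying points and re-flagging points whose residuals are large relative to a robust scale estimate.

```python
import numpy as np

def robust_fit(x, y, n_iter=10, k=3.0):
    """Alternate between (1) a least-squares fit on points not flagged as
    outliers and (2) flagging points whose residual exceeds k * 1.4826 * MAD.
    Returns slope, intercept, and a boolean outlier mask."""
    outlier = np.zeros(len(x), dtype=bool)
    slope = intercept = 0.0
    for _ in range(n_iter):
        inlier = ~outlier
        slope, intercept = np.polyfit(x[inlier], y[inlier], 1)
        resid = y - (slope * x + intercept)
        mad = np.median(np.abs(resid[inlier] - np.median(resid[inlier])))
        outlier = np.abs(resid) > k * 1.4826 * max(mad, 1e-9)
    return slope, intercept, outlier

# Example: taxi-like data where trip time grows with distance, plus two
# corrupted records that the mask should flag.
rng = np.random.default_rng(1)
dist = rng.uniform(1, 20, 200)
time = 3.0 * dist + 5 + rng.normal(0, 1, 200)
time[:2] = [500.0, 0.1]                 # sensor / operation faults
print(robust_fit(dist, time)[2][:2])    # expect: [ True  True]
```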
Continuous-time dynamics models, e.g., neural ordinary differential equations, enable accurate modeling of underlying dynamics in time-series data. However, employing neural networks to parameterize the dynamics makes it challenging for humans to identify dependence structures, especially in the presence of delayed effects. Consequently, these models are not an attractive option when capturing dependence carries more importance than accurate modeling, e.g., in tsunami forecasting. In this paper, we present a novel method for identifying dependence structures in continuous-time dynamics models. We take a two-step approach: (1) during training, we promote weight sparsity in the model's first layer; (2) after training, we prune the sparse weights to identify the dependence structures. In evaluation, we test our method in scenarios where the exact dependence structures of the time series are known. Compared to baselines, our method is more effective in uncovering dependence structures in data even when there are delayed effects. Moreover, we evaluate our method on a real-world tsunami forecasting task, where the exact dependence structures are unknown beforehand. Even in this challenging scenario, our method still effectively learns physically consistent dependence structures and achieves high forecasting accuracy.
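A hedged toy version of the two-step recipe, assuming a plain linear dynamics fit in place of a neural ODE: promote sparsity with an L1 proximal step during training, then prune small weights to read off which inputs each output depends on. All names and thresholds here are illustrative.

```python
import numpy as np

def fit_sparse_dynamics(X, dXdt, lam=0.05, lr=0.01, steps=2000):
    """Fit dx/dt ~= W x with an L1 penalty on W via proximal gradient
    (soft-thresholding), a toy stand-in for sparsifying the first layer
    of a continuous-time dynamics model."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for _ in range(steps):
        grad = (X @ W.T - dXdt).T @ X / len(X)
        W = W - lr * grad
        W = np.sign(W) * np.maximum(np.abs(W) - lr * lam, 0.0)  # prox step
    return W

def prune(W, threshold=1e-2):
    """Step 2: prune small weights; the remaining nonzeros give the
    dependence structure (which inputs drive which outputs)."""
    return np.abs(W) > threshold

# Example: x0 depends on x1, x1 on itself, and x2 on itself.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 3))
true_W = np.array([[0.0, 0.8, 0.0],
                   [0.0, -0.5, 0.0],
                   [0.0, 0.0, 0.3]])
dXdt = X @ true_W.T + 0.01 * rng.standard_normal((500, 3))
print(prune(fit_sparse_dynamics(X, dXdt)).astype(int))
```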
Recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which brings serious privacy risks. For example, it has been shown that over 1% of tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on exact memorization behaviors. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., differing by only a couple of tokens out of 100) in the training corpus. A major challenge of conducting this evaluation is the huge computational cost incurred by near-duplicate sequence searches, because modern LLMs are trained on larger and larger corpora with up to 1 trillion tokens. Worse, the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm. It can find (almost) all the near-duplicate sequences of the query sequence in a large corpus, with guarantees. Specifically, the algorithm generates and groups the min-hash values of all the sequences with at least \(t\) tokens (as very short near-duplicates are often irrelevant noise) in the corpus, in time linear in the corpus size. We formally prove that only \(\frac{2(n+1)}{t+1} - 1\) min-hash values are generated for a text with \(n\) tokens in expectation, so the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on several large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable.
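A self-contained sketch of the min-hash plus inverted-index idea (illustrative only; the paper's specific algorithm, guarantees, and prefix filtering are not reproduced here): index min-hash signatures of corpus windows, then retrieve candidates that share enough hash values with a query.

```python
import hashlib
from collections import defaultdict

def minhash(tokens, num_hashes=16):
    """MinHash signature of a token sequence's 3-gram shingle set
    (toy version using salted MD5 in place of a real hash family)."""
    shingles = {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    return tuple(min(int(hashlib.md5(f"{h}:{s}".encode()).hexdigest(), 16)
                     for s in shingles)
                 for h in range(num_hashes))

def build_index(corpus_windows, num_hashes=16):
    """Inverted index: each (slot, hash value) posts the window ids that
    produced it, so a lookup only touches windows sharing some hash."""
    index = defaultdict(set)
    for wid, tokens in corpus_windows.items():
        for slot, val in enumerate(minhash(tokens, num_hashes)):
            index[(slot, val)].add(wid)
    return index

def near_duplicates(query_tokens, index, min_shared=6, num_hashes=16):
    """Return window ids sharing at least min_shared of num_hashes values
    with the query (a proxy for high Jaccard similarity)."""
    votes = defaultdict(int)
    for slot, val in enumerate(minhash(query_tokens, num_hashes)):
        for wid in index[(slot, val)]:
            votes[wid] += 1
    return [wid for wid, v in votes.items() if v >= min_shared]

# Example: window "a" is a near-duplicate of the query (one token changed).
corpus = {"a": "the cat sat on the mat today".split(),
          "b": "completely unrelated text about hydropower stations".split()}
index = build_index(corpus)
print(near_duplicates("the cat sat on the mat yesterday".split(), index))
```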
Hydropower is the largest renewable energy source for electricity generation in the world, with numerous benefits in terms of environmental protection (near-zero air pollution and climate impact), cost-effectiveness (long-term use without significant impact from market fluctuations), and reliability (quick response to surges in demand). However, the effectiveness of hydropower plants is affected by multiple factors such as reservoir capacity, rainfall, temperature, and fluctuating electricity demand, and particularly by their complicated relationships, which make the prediction/recommendation of station operational output a difficult challenge. In this paper, we present DeepHydro, a novel stochastic method for modeling multivariate time series (e.g., water inflow/outflow and temperature) and forecasting the power generation of hydropower stations. DeepHydro captures temporal dependencies in co-evolving time series with a new conditioned latent recurrent neural network, which not only considers the hidden states of observations but also preserves the uncertainty of latent variables. We introduce a generative network parameterized on a continuous normalizing flow to approximate the complex posterior distribution of multivariate time series data, and further use neural ordinary differential equations to estimate the continuous-time dynamics of the latent variables constituting the observable data. This allows our model to deal with discrete observations in the context of continuous dynamic systems while being robust to noise. We conduct extensive experiments on real-world datasets from a large power generation company consisting of cascade hydropower stations. The experimental results demonstrate that the proposed method can effectively predict the power production and significantly outperforms possible candidate baseline approaches.
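As a rough illustration of the latent continuous-time dynamics idea (a highly simplified, hypothetical sketch that omits the conditioned latent RNN details and the normalizing-flow posterior): encode past sensor channels into a latent state, roll it forward with a learned ODE function via Euler steps, and decode each latent state to a power forecast.

```python
import torch
import torch.nn as nn

class TinyLatentODE(nn.Module):
    """Toy latent-dynamics forecaster: GRU encoder, Euler-integrated latent
    ODE, linear decoder. Purely illustrative; not DeepHydro itself."""
    def __init__(self, n_features=3, latent_dim=8):
        super().__init__()
        self.encoder = nn.GRU(n_features, latent_dim, batch_first=True)
        self.ode_func = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                                      nn.Linear(32, latent_dim))
        self.decoder = nn.Linear(latent_dim, 1)

    def forward(self, history, horizon=24, dt=0.1):
        _, h = self.encoder(history)        # h: (1, batch, latent_dim)
        z = h[0]
        preds = []
        for _ in range(horizon):
            z = z + dt * self.ode_func(z)   # Euler step of dz/dt = f(z)
            preds.append(self.decoder(z))
        return torch.stack(preds, dim=1)    # (batch, horizon, 1)

# Example: forecast 24 steps of power from 48 past steps of 3 sensor channels.
model = TinyLatentODE()
history = torch.randn(5, 48, 3)
print(model(history, horizon=24).shape)     # torch.Size([5, 24, 1])
```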