-
Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks were developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forests from Google, Hummingbird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model into one User Defined Function (UDF), there also exists a relation-centric representation that breaks down the decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing them to the aforementioned decoupled inference pipelines and to existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
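To make the distinction between the two in-database representations concrete, the sketch below runs the same tiny forest both ways in SQLite. It is an illustration only, not the paper's implementation: the two-tree forest, the table names (`nodes`, `samples_wide`, `samples_long`), the function name `forest_predict`, and the use of a recursive query for the relation-centric variant are all assumptions made for this example.

```python
import sqlite3

# Hypothetical forest: (tree_id, node_id, feature, threshold, left_id, right_id, leaf_value).
# Internal nodes carry a feature index and threshold; leaves carry only leaf_value.
NODES = [
    (0, 0, 0, 0.5, 1, 2, None),
    (0, 1, None, None, None, None, 10.0),
    (0, 2, None, None, None, None, 20.0),
    (1, 0, 1, 2.0, 1, 2, None),
    (1, 1, None, None, None, None, 1.0),
    (1, 2, None, None, None, None, 3.0),
]
# Hypothetical samples in wide format: (sample_id, f0, f1).
WIDE = [(0, 0.2, 5.0), (1, 0.9, 1.0)]


def forest_predict(*features):
    """UDF body: walk every tree in Python and average the leaf values."""
    lookup = {(t, n): (f, thr, l, r, v) for t, n, f, thr, l, r, v in NODES}
    tree_ids = sorted({row[0] for row in NODES})
    total = 0.0
    for t in tree_ids:
        node = 0
        while True:
            f, thr, l, r, v = lookup[(t, node)]
            if f is None:  # reached a leaf
                total += v
                break
            node = l if features[f] <= thr else r
    return total / len(tree_ids)


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (tree_id INT, node_id INT, feature INT, "
        "threshold REAL, left_id INT, right_id INT, leaf_value REAL)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?,?,?,?,?,?,?)", NODES)
    conn.execute("CREATE TABLE samples_wide (sample_id INT, f0 REAL, f1 REAL)")
    conn.executemany("INSERT INTO samples_wide VALUES (?,?,?)", WIDE)
    # Long format (sample_id, feature, value) lets SQL join features to nodes.
    long_rows = [(sid, j, v) for sid, *vals in WIDE for j, v in enumerate(vals)]
    conn.execute("CREATE TABLE samples_long (sample_id INT, feature INT, value REAL)")
    conn.executemany("INSERT INTO samples_long VALUES (?,?,?)", long_rows)
    # UDF-centric: the whole model hides behind one SQL function.
    conn.create_function("forest_predict", 2, forest_predict)
    return conn


def udf_centric(conn):
    sql = "SELECT sample_id, forest_predict(f0, f1) FROM samples_wide ORDER BY sample_id"
    return conn.execute(sql).fetchall()


def relation_centric(conn):
    # The forest is a relation; traversal is a chain of joins, then an aggregate.
    sql = """
    WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
        SELECT s.sample_id, n.tree_id, n.node_id
        FROM (SELECT DISTINCT sample_id FROM samples_long) AS s
        JOIN nodes AS n ON n.node_id = 0
        UNION ALL
        SELECT w.sample_id, w.tree_id,
               CASE WHEN f.value <= n.threshold THEN n.left_id ELSE n.right_id END
        FROM walk AS w
        JOIN nodes AS n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
        JOIN samples_long AS f
          ON f.sample_id = w.sample_id AND f.feature = n.feature
        WHERE n.feature IS NOT NULL
    )
    SELECT w.sample_id, AVG(n.leaf_value)
    FROM walk AS w
    JOIN nodes AS n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
    WHERE n.feature IS NULL
    GROUP BY w.sample_id
    ORDER BY w.sample_id
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    db = build_db()
    print(udf_centric(db))
    print(relation_centric(db))
```

Both functions return the same per-sample averages. In the UDF variant the database sees the model as an opaque function call per row, whereas in the relational variant the query planner sees joins and an aggregation over the `nodes` table, which is the property the abstract associates with better scaling for large models.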
-
Existing approaches to automatic data transformation are insufficient to meet the requirements of many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge easily. Second, they require significant training data collection overheads. Third, the accuracy suffers from complicated schema changes. To address these shortcomings, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms the source datasets into the target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation in which our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
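The prompt-generation and iterative-repair loop described above can be sketched roughly as follows. This is not SQLMorpher's code: the function names (`build_prompt`, `transform`, `fake_llm`), the prompt wording, the temperature-conversion task, and the choice of "flaw detection" as simply catching the database error are all assumptions for illustration, and the LLM is replaced by a scripted stub so the example runs offline.

```python
import sqlite3
from typing import Callable, Optional


def build_prompt(source_schema: str, target_schema: str,
                 domain_knowledge: Optional[str] = None,
                 feedback: Optional[str] = None) -> str:
    """Assemble the prompt from the base task plus optional extras."""
    parts = [
        "Write one SQLite INSERT ... SELECT statement that transforms "
        "the source table into the target table.",
        f"Source schema: {source_schema}",
        f"Target schema: {target_schema}",
    ]
    if domain_knowledge:
        parts.append(f"Domain knowledge: {domain_knowledge}")
    if feedback:
        parts.append(f"The previous attempt failed with: {feedback}")
    return "\n".join(parts)


def transform(conn: sqlite3.Connection, llm: Callable[[str], str],
              source_schema: str, target_schema: str,
              domain_knowledge: Optional[str] = None,
              max_rounds: int = 3) -> Optional[str]:
    """Ask the model for SQL, run it, and feed any error back into the prompt."""
    feedback = None
    for _ in range(max_rounds):
        prompt = build_prompt(source_schema, target_schema, domain_knowledge, feedback)
        sql = llm(prompt)
        try:
            conn.execute(sql)
            return sql  # the statement ran without a database error
        except sqlite3.Error as exc:
            feedback = str(exc)  # simplest possible flaw signal
    return None


def fake_llm(prompt: str) -> str:
    """Scripted stand-in for a real model: wrong column first, then a fix."""
    if "previous attempt failed" in prompt:
        return "INSERT INTO target SELECT (temp_f - 32) * 5.0 / 9.0 FROM source"
    return "INSERT INTO target SELECT (temperature - 32) * 5.0 / 9.0 FROM source"


def make_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (temp_f REAL)")
    conn.execute("CREATE TABLE target (temp_c REAL)")
    conn.executemany("INSERT INTO source VALUES (?)", [(32.0,), (212.0,)])
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    final_sql = transform(db, fake_llm, "source(temp_f REAL)", "target(temp_c REAL)",
                          domain_knowledge="temp_f is in degrees Fahrenheit")
    print(final_sql)
    print(db.execute("SELECT temp_c FROM target ORDER BY temp_c").fetchall())
```

A real system would need a richer flaw check than a raised exception, for example comparing the produced rows against the expected target data, since syntactically valid SQL can still be semantically wrong.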
-
Abstract. Subseasonal-to-seasonal (S2S) prediction, especially the prediction of extreme hydroclimate events such as droughts and floods, is not only scientifically challenging, but also has substantial societal impacts. Motivated by preliminary studies, the Global Energy and Water Exchanges (GEWEX)/Global Atmospheric System Study (GASS) has launched a new initiative called “Impact of Initialized Land Surface Temperature and Snowpack on Subseasonal to Seasonal Prediction” (LS4P) as the first international grass-roots effort to introduce spring land surface temperature (LST)/subsurface temperature (SUBT) anomalies over high mountain areas as a crucial factor that can lead to significant improvement in precipitation prediction through the remote effects of land–atmosphere interactions. LS4P focuses on process understanding and predictability, and hence it is different from, and complements, other international projects that focus on the operational S2S prediction. More than 40 groups worldwide have participated in this effort, including 21 Earth system models, 9 regional climate models, and 7 data groups. This paper provides an overview of the history and objectives of LS4P, provides the first-phase experimental protocol (LS4P-I) which focuses on the remote effect of the Tibetan Plateau, discusses the LST/SUBT initialization, and presents the preliminary results. Multi-model ensemble experiments and analyses of observational data have revealed that the hydroclimatic effect of the spring LST on the Tibetan Plateau is not limited to the Yangtze River basin but may have a significant large-scale impact on summer precipitation beyond East Asia and its S2S prediction.
Preliminary studies and analysis have also shown that LS4P models are unable to preserve the initialized LST anomalies in producing the observed anomalies largely for two main reasons: (i) inadequacies in the land models arising from total soil depths which are too shallow and the use of simplified parameterizations, which both tend to limit the soil memory; (ii) reanalysis data, which are used for initial conditions, have large discrepancies from the observed mean state and anomalies of LST over the Tibetan Plateau. Innovative approaches have been developed to largely overcome these problems.