This content will become publicly available on January 1, 2026

Title: Data Management Perspectives on Prescriptive Analytics (Invited Talk)
Decision makers in a broad range of domains, such as finance, transportation, manufacturing, and healthcare, often need to derive optimal decisions given a set of constraints and objectives. Traditional solutions to such constrained optimization problems are typically application-specific, complex, and do not generalize. Further, the usual workflow requires slow, cumbersome, and error-prone data movement between a database and separate predictive-modeling and optimization packages. All of these problems are exacerbated by the unprecedented size of modern data-intensive optimization problems. The emerging research area of in-database prescriptive analytics aims to provide seamless, domain-independent, declarative, and scalable approaches powered by the system where the data typically resides: the database. Integrating optimization with database technology opens up prescriptive analytics to a much broader community, amplifying its benefits. We discuss how deep integration between the DBMS, predictive models, and optimization software creates opportunities for rich prescriptive-query functionality with good scalability and performance. Summarizing some of our main results and ongoing work in this area, we highlight challenges related to usability, scalability, data uncertainty, and dynamic environments, and argue that perspectives from data management research can drive novel strategies and solutions.
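To make the flavor of such prescriptive queries concrete, here is a minimal, purely illustrative sketch: it selects a "package" of rows from a toy relation to maximize one attribute under a budget constraint on another. The relation, attribute names, and budget are invented, and the brute-force search merely stands in for the ILP machinery a real in-database system would use.

```python
# Hypothetical toy example of a prescriptive (package-style) query:
# choose a subset of rows that maximizes total protein subject to a
# budget on total cost. Brute force is used only because the data is
# tiny; real systems formulate this as an integer linear program.
from itertools import combinations

# Toy relation: (name, cost, protein). All values are made up.
meals = [
    ("oatmeal", 2.0, 10),
    ("chicken", 5.0, 35),
    ("salad",   4.0,  8),
    ("beans",   3.0, 20),
    ("salmon",  9.0, 40),
]

BUDGET = 12.0  # linear constraint: total cost must not exceed BUDGET

best_package, best_protein = (), -1
for r in range(len(meals) + 1):
    for package in combinations(meals, r):
        cost = sum(m[1] for m in package)
        protein = sum(m[2] for m in package)
        if cost <= BUDGET and protein > best_protein:
            best_package, best_protein = package, protein

print("chosen package:", [m[0] for m in best_package])
print("total protein:", best_protein)
```

The point of in-database prescriptive analytics is that a declarative query over the stored relation replaces this kind of hand-written search-plus-ETL code.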
Award ID(s):
2211918
PAR ID:
10627357
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
Roy, Sudeepa; Kara, Ahmet
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
328
ISSN:
1868-8969
ISBN:
978-3-95977-364-5
Page Range / eLocation ID:
2:1-2:12
Subject(s) / Keyword(s):
Prescriptive analytics; decision making; scalable constrained optimization; Information systems → Decision support systems; Information systems → Database design and models; Information systems → Query languages; Information systems → Database query processing; Computing methodologies → Modeling and simulation; Applied computing → Decision analysis
Format(s):
Medium: X; Size: 12 pages, 837869 bytes; Other: application/pdf
Size(s):
12 pages; 837869 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. In the first part of the review, we observed that there exists a significant gap between the predictive and prescriptive models pertaining to crash risk prediction and minimization, respectively. In this part, we review and categorize the optimization/prescriptive analytic models that focus on minimizing crash risk. Although the majority of works in this segment of the literature are related to hazardous materials (hazmat) trucking problems, we show that (with some exceptions) many can also be utilized in non-hazmat scenarios. To highlight the effect of the crash risk prediction model on the accumulated risk obtained from the prescriptive model, we present a simulated example in which we use four risk indicators (obtained from logistic regression, Poisson regression, XGBoost, and a neural network) in the k-shortest path algorithm. From our example, we demonstrate two major takeaways: (a) the shortest path may not always result in the lowest crash risk (see the sketch after this list), and (b) a similarity in overall predictive performance may not always translate to similar outcomes from the prescriptive models. Based on the review and example, we highlight several avenues for future research.
  2. A package query returns a package (a multiset of tuples) that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) (see the sketch after this list) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as the kd-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex, respectively, that are highly accurate and orders of magnitude faster.
  3. Recent advances in Graph Neural Networks (GNNs) have changed the landscape of modern graph analytics. The complexity of GNN training and the scalability challenges have also sparked interest from the systems community, with efforts to build systems that provide higher efficiency and schemes to reduce costs. However, we observe that many such systems essentially reinvent much of the work done in the database world on scalable graph analytics engines. Further, they often tightly couple the scalability treatment of graph data processing with that of GNN training, resulting in entangled, complex problems and systems that often do not scale well along one of those axes. In this paper, we ask a fundamental question: How far can we push existing systems for scalable graph analytics and deep learning (DL) instead of building custom GNN systems? Are compromises inevitable on scalability and/or runtimes? We propose Lotan, the first scalable and optimized data system for full-batch GNN training with decoupled scaling that bridges the hitherto siloed worlds of graph analytics systems and DL systems. Lotan offers a series of technical innovations, including re-imagining GNN training as query plan-like dataflows, execution plan rewriting, optimized data movement between systems, a GNN-centric graph partitioning scheme, and the first known GNN model batching scheme. We prototyped Lotan on top of GraphX and PyTorch. An empirical evaluation using several real-world benchmark GNN workloads reveals a promising, nuanced picture: Lotan significantly surpasses the scalability of state-of-the-art custom GNN systems, while often matching, and in some cases only slightly trailing, their time-to-accuracy. We also show the impact of our system optimizations. Overall, our work shows that the GNN world can indeed benefit from building on top of scalable graph analytics engines. Lotan's new level of scalability can also empower new ML-oriented research on ever-larger graphs and GNNs.
  4. Managing large-scale systems often involves simultaneously solving thousands of unrelated stochastic optimization problems, each with limited data. Intuition suggests that one can decouple these unrelated problems and solve them separately without loss of generality. We propose a novel data-pooling algorithm called Shrunken-SAA that disproves this intuition (a toy illustration of the pooling idea appears after this list). In particular, we prove that combining data across problems can outperform decoupling, even when there is no a priori structure linking the problems and data are drawn independently. Our approach does not require strong distributional assumptions and applies to constrained, possibly nonconvex, nonsmooth optimization problems such as vehicle routing, economic lot-sizing, or facility location. We compare and contrast our results with a similar phenomenon in statistics (Stein’s phenomenon), highlighting unique features that arise in the optimization setting but are not present in estimation. We further prove that, as the number of problems grows large, Shrunken-SAA learns whether pooling can improve upon decoupling and, if so, the optimal amount to pool, even if the average amount of data per problem is fixed and bounded. Importantly, we offer a simple intuition based on stability that explains when and why data pooling offers a benefit, elucidating this perhaps surprising phenomenon. This intuition further suggests that data pooling offers the most benefit when there are many problems, each of which has a small amount of relevant data. Finally, we demonstrate the practical benefits of data pooling using real data from a chain of retail drug stores in the context of inventory management. This paper was accepted by Chung Piaw Teo, Special Issue on Data-Driven Prescriptive Analytics.
  5. Query optimization is a key component in database management systems (DBMS) and distributed data processing platforms. Recent research in the database community has incorporated techniques from artificial intelligence to enhance query optimization. Various learning models have been extended and applied to query optimization tasks, including query execution planning, query rewriting, and cost estimation. The tasks involved in query optimization differ based on the type of data being processed, such as relational data or spatial geometries. This tutorial reviews recent learning-based approaches for spatial query optimization tasks. We go over methods designed specifically for spatial data, as well as solutions proposed for high-dimensional data. Additionally, we present learning-based spatial indexing and spatial partitioning methods, which are also vital components in spatial data processing. We also identify several open research problems in these fields.
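Toy illustration for item 1, takeaway (a): on a small invented road network, the distance-shortest path is not the path with the lowest accumulated risk score. This sketch uses a plain Dijkstra search under two different edge weights rather than the paper's k-shortest-path procedure and fitted risk models; the graph, distances, and risk scores are all made up.

```python
# Hypothetical example: the same origin-destination pair routed once by
# distance and once by a per-edge crash-risk score.
import heapq

# edges: node -> list of (neighbor, distance_km, risk_score)
graph = {
    "A": [("B", 4.0, 0.30), ("C", 6.0, 0.05)],
    "B": [("D", 3.0, 0.25)],
    "C": [("D", 2.0, 0.05)],
    "D": [],
}

def best_path(source, target, weight_index):
    """Dijkstra using either distance (index 1) or risk (index 2) as the edge cost."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue
        for edge in graph[u]:
            v, cost = edge[0], edge[weight_index]
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[target]

print("shortest by distance:", best_path("A", "D", 1))  # A -> B -> D, 7 km
print("lowest risk score:   ", best_path("A", "D", 2))  # A -> C -> D, 0.10
```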
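Item 2 states that package queries are equivalent to Integer Linear Programs. The sketch below encodes a tiny package query as an ILP using the PuLP modeling library; the relation, repeat bound, and solver are illustrative choices, not the paper's SketchRefine or Progressive Shading systems.

```python
# Hypothetical ILP encoding of a package query: pick a multiset of tuples
# (each tuple repeatable up to 3 times) maximizing total protein subject
# to a linear budget constraint on total cost.
from pulp import LpMaximize, LpProblem, LpVariable, lpSum

tuples = [("oatmeal", 2.0, 10), ("chicken", 5.0, 35), ("beans", 3.0, 20)]
budget = 15.0

prob = LpProblem("package_query", LpMaximize)
x = [LpVariable(f"x_{name}", lowBound=0, upBound=3, cat="Integer")
     for name, _, _ in tuples]                      # multiplicity of each tuple

prob += lpSum(p * xi for (_, _, p), xi in zip(tuples, x))            # objective
prob += lpSum(c * xi for (_, c, _), xi in zip(tuples, x)) <= budget  # constraint

prob.solve()
for (name, _, _), xi in zip(tuples, x):
    print(name, int(xi.varValue))
```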
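Finally, a rough illustration of the data-pooling intuition in item 4: many tiny newsvendor problems are solved either with each problem's own five data points or with those points blended with draws from the pooled data. The blending rule here is a deliberate simplification for illustration, not the paper's Shrunken-SAA estimator, and every distribution and cost parameter is invented.

```python
# Hypothetical comparison of decoupled vs. pooled decisions across many
# small, unrelated newsvendor problems.
import random

random.seed(0)
K, N = 200, 5        # 200 problems, 5 demand observations each
CRITICAL = 0.8       # newsvendor critical fractile (underage cost 4, overage 1)

true_means = [random.uniform(20, 40) for _ in range(K)]   # unknown to the solver
samples = [[random.gauss(mu, 10) for _ in range(N)] for mu in true_means]
pooled = [d for s in samples for d in s]

def quantile(data, q):
    data = sorted(data)
    return data[min(int(q * len(data)), len(data) - 1)]

def true_cost(order, mu, trials=2000):
    # Monte Carlo estimate of the expected newsvendor cost for one problem.
    total = 0.0
    for _ in range(trials):
        d = random.gauss(mu, 10)
        total += 4 * max(d - order, 0) + 1 * max(order - d, 0)
    return total / trials

def avg_cost(alpha):
    # alpha = 0 is decoupled SAA; alpha > 0 mixes in draws from the pooled data.
    cost = 0.0
    for mu, own in zip(true_means, samples):
        blended = own + random.sample(pooled, alpha)
        cost += true_cost(quantile(blended, CRITICAL), mu)
    return cost / K

for alpha in (0, 5, 20):
    print(f"alpha={alpha:>3}: average true cost = {avg_cost(alpha):.2f}")
```

Whether pooling helps in this toy depends on how similar the problems happen to be, which echoes the abstract's point that pooling offers the most benefit when there are many problems, each with little data.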