skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Fast and Accurate Spreading Process Temporal Scale Estimation
Spreading processes on graphs arise in a host of application domains, from the study of online social networks to viral marketing to epidemiology. Various discrete-time probabilistic models for spreading processes have been proposed. These are used for downstream statistical estimation and prediction problems, often involving messages or other information that is transmitted along with infections caused by the process. These models generally model cascade behavior at a small time scale but are insufficiently flexible to model cascades that exhibit intermittent behavior governed by multiple scales. We argue that the presence of such time scales that are unaccounted for by a cascade model can result in degradation of performance of models on downstream statistical and time-sensitive optimization tasks. To address these issues, we formulate a model that incorporates multiple temporal scales of cascade behavior. This model is parameterized by a \emph{clock}, which encodes the times at which sessions of cascade activity start. These sessions are themselves governed by a small-scale cascade model, such as the discretized independent cascade (IC) model. Estimation of the multiscale cascade model parameters leads to the problem of \emph{clock estimation} in terms of a natural distortion measure that we formulate. Our framework is inspired by the optimization problem posed by DiTursi et al, 2017, which can be seen as providing one possible estimator (a maximum-proxy-likelihood estimator) for the parameters of our generative model. We give a clock estimation algorithm, which we call FastClock, that runs in linear time in the size of its input and is provably statistically accurate for a broad range of model parameters when cascades are generated from any spreading process model with well-concentrated session infection set sizes and when the underlying graph is at least in the semi-sparse regime. We exemplify our algorithm for the case where the small-scale model is the discretized independent cascade process and extend substantially to processes whose infection set sizes satisfy a general martingale difference property. We further evaluate the performance of FastClock empirically in comparison to the state of the art estimator from DiTursi et al, 2017. We find that in a broad parameter range on synthetic networks and on a real network, our algorithm substantially outperforms that algorithm in terms of both running time and accuracy. In all cases, our algorithm's running time is asymptotically lower than that of the baseline.  more » « less
Award ID(s):
2212327
PAR ID:
10464547
Author(s) / Creator(s):
; ;
Editor(s):
Massoulie, Laurent
Date Published:
Journal Name:
Transactions on machine learning research
ISSN:
2835-8856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Woodruff, David P. (Ed.)
    Matroids are a fundamental object of study in combinatorial optimization. Three closely related and important problems involving matroids are maximizing the size of the union of $$k$$ independent sets (that is, \emph{$$k$$-fold matroid union}), computing $$k$$ disjoint bases (a.k.a.\ \emph{matroid base packing}), and covering the elements by $$k$$ bases (a.k.a.\ \emph{matroid base covering}). These problems generalize naturally to integral and real-valued capacities on the elements. This work develops faster exact and/or approximation problems for these and some other closely related problems such as optimal reinforcement and matroid membership. We obtain improved running times both for general matroids in the independence oracle model and for the graphic matroid. The main thrust of our improvements comes from developing a faster and unifying \emph{push-relabel} algorithm for the integer-capacitated versions of these problems, building on previous work by [FM12]. We then build on this algorithm in two directions. First we develop a faster augmenting path subroutine for $$k$$-fold matroid union that, when appended to an approximation version of the push-relabel algorithm, gives a faster exact algorithm for some parameters of $$k$$. In particular we obtain a subquadratic-query running time in the uncapacitated setting for the three basic problems listed above. We also obtain faster approximation algorithms for these problems with real-valued capacities by reducing to small integral capacities via randomized rounding. To this end, we develop a new randomized rounding technique for base covering problems in matroids that may also be of independent interest. 
    more » « less
  2. Large Language Models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of LLM inference makes them infeasible in many such tasks. We propose online cascade learning as an approach to address this challenge. The objective here is to learn a “cascade” of models, starting with lower-capacity models (such as logistic regression) and ending with a powerful LLM, along with a deferral policy that determines the model to be used on a given input. We formulate the task of learning cascades online as an imitation-learning problem, where smaller models are updated over time imitating the LLM expert demonstrations, and give a no-regret algorithm for the problem. Experimental results across four benchmarks show that our method parallels LLMs in accuracy while cutting down inference costs by as much as 90% with strong robustness against input distribution shifts, underscoring its efficacy and adaptability in stream processing. 
    more » « less
  3. The presence of interference, where the outcome of an individual may depend on the treatment assignment and behavior of neighboring nodes, can lead to biased causal effect estimation. Current approaches to network experiment design focus on limiting interference through cluster-based randomization, in which clusters are identified using graph clustering, and cluster randomization dictates the node assignment to treatment and control. However, cluster-based randomization approaches perform poorly when interference propagates in cascades, whereby the response of individuals to treatment propagates to their multi-hop neighbors. When we have knowledge of the cascade seed nodes, we can leverage this interference structure to mitigate the resulting causal effect estimation bias. With this goal, we propose a cascade-based network experiment design that initiates treatment assignment from the cascade seed node and propagates the assignment to their multi-hop neighbors to limit interference during cascade growth and thereby reduce the overall causal effect estimation error. Our extensive experiments on real-world and synthetic datasets demonstrate that our proposed framework outperforms the existing state-of-the-art approaches in estimating causal effects in network data. 
    more » « less
  4. Stiff dynamics continue to pose challenges for power system dynamic state estimation. In particular, models of inverters with control schemes designed to support grid voltage and frequency, namely, grid-forming inverters (GFMs), are highly prone to numerical instability. This paper develops a novel analytical modeling technique derived from two cascading subsystems, namely synchronization and dq-frame voltage control. This allows us to obtain a closed-form discrete-time state-space model based on the matrix exponential function. The resulting model enables a numerically stable and decentralized dynamic state estimator that can track the dynamics of GFMs at standard synchrophasor reporting rates. In contrast, existing dynamic state estimators are subject to numerical issues. The proposed algorithm is tested on a 14-bus power system with a GFM and compared with the standard algorithm whose process model is discretized using well-known Runge-Kutta methods. Numerical results demonstrate the superiority of the proposed method under various conditions. 
    more » « less
  5. We study the fundamental problem of estimating the mean of a d-dimensional distribution with covariance Σ≼σ2Id given n samples. When d=1, \cite{catoni} showed an estimator with error (1+o(1))⋅σ2log1δn−−−−−√, with probability 1−δ, matching the Gaussian error rate. For d>1, a natural estimator outputs the center of the minimum enclosing ball of one-dimensional confidence intervals to achieve a 1−δ confidence radius of 2dd+1−−−√⋅σ(dn−−√+2log1δn−−−−−√), incurring a 2dd+1−−−√-factor loss over the Gaussian rate. When the dn−−√ term dominates by a log1δ−−−−√ factor, \cite{lee2022optimal-highdim} showed an improved estimator matching the Gaussian rate. This raises a natural question: Is the 2dd+1−−−√ loss \emph{necessary} when the 2log1δn−−−−−√ term dominates? We show that the answer is \emph{no} -- we construct an estimator that improves over the above naive estimator by a constant factor. We also consider robust estimation, where an adversary is allowed to corrupt an ϵ-fraction of samples arbitrarily: in this case, we show that the above strategy of combining one-dimensional estimates and incurring the 2dd+1−−−√-factor \emph{is} optimal in the infinite-sample limit. 
    more » « less