
Title: Variance-Reduced Stochastic Gradient Descent on Streaming Data
We present an algorithm STRSAGA for efficiently maintaining a machine learning model over data points that arrive over time, quickly updating the model as new training data is observed. We present a competitive analysis comparing the sub-optimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG.
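The abstract describes STRSAGA only at a high level. As a rough illustration of the general streaming variance-reduced idea, the sketch below runs SAGA-style updates over a sample set that grows as points arrive. The synthetic data, least-squares loss, step size, and two-updates-per-arrival schedule are all assumptions made for this illustration; this is not the paper's exact STRSAGA schedule or its analysis.

```python
# Minimal sketch of a SAGA-style update over a growing sample set of streaming points.
# Illustration only: the arrival simulation, loss, step size, and per-arrival schedule
# below are assumptions, not the STRSAGA algorithm from the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_true = rng.normal(size=d)

def make_point():
    """Simulate one arriving (x, y) pair from a noisy linear model (assumption)."""
    x = rng.normal(size=d)
    return x, x @ w_true + 0.1 * rng.normal()

def grad(w, x, y):
    """Gradient of the squared loss 0.5 * (x @ w - y)**2 for a single point."""
    return (x @ w - y) * x

w = np.zeros(d)
eta = 0.01                          # illustrative step size, not tuned as in the paper
xs, ys, grad_table = [], [], []     # sample set seen so far and per-point stored gradients
grad_sum = np.zeros(d)              # running sum of the stored gradients

for t in range(2000):               # one arrival per outer iteration
    x, y = make_point()
    xs.append(x)
    ys.append(y)
    g = grad(w, x, y)
    grad_table.append(g)
    grad_sum += g

    for _ in range(2):              # a few SAGA-style steps per arrival (assumed schedule)
        i = rng.integers(len(xs))
        g_i = grad(w, xs[i], ys[i])
        # Variance-reduced direction: fresh gradient minus stored gradient plus table mean.
        w -= eta * (g_i - grad_table[i] + grad_sum / len(xs))
        grad_sum += g_i - grad_table[i]
        grad_table[i] = g_i

print("parameter error:", np.linalg.norm(w - w_true))
```

The paper's contribution is the risk-competitive comparison of such a scheme against an offline algorithm given all the data up front; the sketch only shows the mechanics of a variance-reduced update applied to an arriving stream.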
Authors:
Award ID(s):
1725702
Publication Date:
NSF-PAR ID:
10110864
Journal Name:
Advances in Neural Information Processing Systems
Page Range or eLocation-ID:
9928--9937
ISSN:
1049-5258
Sponsoring Org:
National Science Foundation
More Like this
  1. Efficient allocation of tasks to workers is a central problem in crowdsourcing. In this paper, we consider a special setting inspired by spatial crowdsourcing platforms where both workers and tasks arrive dynamically. Additionally, we assume all tasks are heterogeneous and each worker-task assignment brings a known reward. The natural challenge lies in how to incorporate the uncertainty in the arrivals of both workers and tasks into our online allocation policy such that the total expected reward is maximized. To formulate this, we assume the arrival patterns of worker “types” and task “types” can be predicted from historical data. Specifically, we consider a finite time horizon T and assume that in each time-step, a single worker and task are sampled (i.e., “arrive”) from two respective distributions independently, and that this sampling process repeats identically and independently for the entire T online time-steps. Our model, called Online Task Assignment with Two-Sided Arrival (OTA-TSA), is a significant generalization of the classical online task assignment problem, where the set of tasks is assumed to be available offline. For the general version of OTA-TSA, we present an optimal non-adaptive algorithm which achieves an online competitive ratio of 0.295. For the special case of OTA-TSA where the reward is a function of just the worker type, we present an improved (adaptive) algorithm that achieves a competitive ratio of at least 0.343. On the hardness side, along with showing that the ratio obtained by our non-adaptive algorithm is the best possible among all non-adaptive algorithms, we further show that no (adaptive) algorithm can achieve a ratio better than 0.581 (unconditionally), even for the special case of OTA-TSA with homogeneous tasks (i.e., all rewards are the same). At the heart of our analysis lies a new technical tool, a refined notion of the birth-death process called the two-stage birth-death process, which may be of independent interest. Finally, we perform numerical experiments on two real-world datasets obtained from crowdsourcing platforms to complement our theoretical results. (A toy simulation of this two-sided arrival process is sketched after this list.)
  2. Online bipartite matching and allocation models are widely used to analyze and design markets such as Internet advertising, online labor, and crowdsourcing. Traditionally, vertices on one side of the market are fixed and known a priori, while vertices on the other side arrive online and are matched by a central agent to the offline side. The issue of possible conflicts among offline agents emerges in various real scenarios when we need to match each online agent with a set of offline agents. For example, in event-based social networks (e.g., Meetup), offline events conflict for some users since they will be unable to attend mutually distant events at proximate times; in advertising markets, two competing firms may prefer not to be shown to one user simultaneously; and in online recommendation systems (e.g., Amazon Books), books of the same type “conflict” with each other in some sense due to the diversity requirement for each online buyer. The conflicts inherent among certain offline agents raise significant challenges in both modeling and online algorithm design. In this paper, we propose a unifying model, generalizing the conflict models proposed in (She et al., TKDE 2016) and (Chen et al., TKDE 2016). Our model can capture not only a broad class of conflict constraints on the offline side (which are even allowed to be sensitive to each online agent), but also a general arrival pattern for the online side (which is allowed to change over the online phase). We propose an efficient linear programming (LP)-based online algorithm and prove theoretically that it has nearly optimal online performance. Additionally, we propose two LP-based heuristics and test them against two natural baselines on both real and synthetic datasets. Our LP-based heuristics experimentally dominate the baseline algorithms, aligning with our theoretical predictions and supporting our unified approach. (A greedy, conflict-aware assignment sketch follows this list.)
  3. With the popularity of the Internet, traditional offline resource allocation has evolved into a new form, called online resource allocation. It features the online arrival of agents in the system and requires a real-time decision upon the arrival of each online agent. Both offline and online resource allocation have wide applications in various real-world matching markets ranging from ridesharing to crowdsourcing. Some emerging applications, such as rebalancing in bike sharing and trip-vehicle dispatching in ridesharing, involve a two-stage resource allocation process: an offline phase followed by a sequential online phase, with both phases competing for the same set of resources. In this paper, we propose a unified model which incorporates both offline and online resource allocation into a single framework. Our model assumes non-uniform and known arrival distributions for online agents in the second, online phase, which can be learned from historical data. We propose a parameterized linear programming (LP)-based algorithm, which is shown to be within a constant factor of 1/4 of the optimal. Experimental results on a real dataset show that our LP-based approaches outperform the LP-agnostic heuristics in terms of robustness and effectiveness. (A toy two-phase allocation sketch follows this list.)

  4. Mixed-phase Southern Ocean clouds are challenging to simulate, and their representation in climate models is an important control on climate sensitivity. In particular, the amount of supercooled water and frozen mass that they contain in the present climate is a predictor of their planetary feedback in a warming climate. The recent Southern Ocean Clouds, Radiation, Aerosol Transport Experimental Study (SOCRATES) vastly increased the amount of in situ data available from mixed-phase Southern Ocean clouds useful for model evaluation. Bulk measurements distinguishing liquid and ice water content are not available from SOCRATES, so single-particle phase classifications from the Two-Dimensional Stereo (2D-S) probe are invaluable for quantifying mixed-phase cloud properties. Motivated by the presence of large biases in existing phase discrimination algorithms, we develop a novel technique for single-particle phase classification of binary 2D-S images using a random forest algorithm, which we refer to as the University of Washington Ice–Liquid Discriminator (UWILD). UWILD uses 14 parameters computed from binary image data, as well as particle inter-arrival time, to predict phase. We use liquid-only and ice-dominated time periods within the SOCRATES dataset as training and testing data. This novel approach to model training avoids major pitfalls associated with using manually labeled data, including reduced model generalizability and high labor costs. We find that UWILD is well calibrated and has an overall accuracy of 95 %, compared to 72 % and 79 % for two existing phase classification algorithms that we compare it with. UWILD improves classifications of small ice crystals and large liquid drops in particular and has more flexibility than the other algorithms to identify both liquid-dominated and ice-dominated regions within the SOCRATES dataset. UWILD misclassifies a small percentage of large liquid drops as ice. Such misclassified particles are typically associated with model confidence below 75 % and can easily be filtered out of the dataset. UWILD phase classifications show that particles with area-equivalent diameter (Deq) < 0.17 mm are mostly liquid at all temperatures sampled, down to −40 °C. Larger particles (Deq > 0.17 mm) are predominantly frozen at all temperatures below 0 °C. Between 0 and 5 °C, there are roughly equal numbers of frozen and liquid mid-sized particles (0.17 mm < Deq < 0.33 mm), and larger particles (Deq > 0.33 mm) are mostly frozen. We also use UWILD's phase classifications to estimate sub-1 Hz phase heterogeneity, and we show examples of meter-scale cloud phase heterogeneity in the SOCRATES dataset. (A schematic random forest classification sketch follows this list.)
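Item 1 above describes a two-sided i.i.d. arrival process in detail. The toy simulation below (referenced at the end of that item) samples a worker type and a task type independently in each of T time-steps and applies a naive greedy baseline that matches the arriving worker to the best task that has arrived and is still unassigned. The reward matrix, the distributions, and the rule that unmatched tasks wait are assumptions for illustration only; this is not the paper's OTA-TSA model nor its 0.295-competitive algorithm.

```python
# Toy simulation of a two-sided i.i.d. arrival process with a greedy baseline policy.
# All numbers and the matching rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
T = 1000                                   # finite online horizon
n_worker_types, n_task_types = 3, 4
worker_dist = np.array([0.5, 0.3, 0.2])    # assumed arrival distribution over worker types
task_dist = np.array([0.25, 0.25, 0.25, 0.25])
reward = rng.uniform(0.0, 1.0, size=(n_worker_types, n_task_types))  # known rewards

unassigned_tasks = []                      # tasks that arrived but are not matched yet
total_reward = 0.0

for t in range(T):
    # In each time-step, one worker and one task are sampled independently.
    w_type = rng.choice(n_worker_types, p=worker_dist)
    task_type = rng.choice(n_task_types, p=task_dist)
    unassigned_tasks.append(task_type)

    # Naive greedy baseline: match the arriving worker to the best waiting task.
    best = max(range(len(unassigned_tasks)),
               key=lambda j: reward[w_type, unassigned_tasks[j]])
    total_reward += reward[w_type, unassigned_tasks[best]]
    unassigned_tasks.pop(best)

print("greedy total reward over horizon:", round(total_reward, 2))
```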
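Item 2 centers on pairwise conflict constraints among offline agents. The sketch below (referenced in that item) shows the constraint in the simplest possible form: a greedy rule assigns each arriving online agent a set of offline agents while skipping any agent that conflicts with one already chosen for that arrival. The conflict pairs, capacities, and greedy rule are illustrative assumptions, and a single global conflict set is a simplification of the paper's model, which allows conflicts to depend on the online agent; the paper's LP-based algorithm is not reproduced here.

```python
# Greedy conflict-aware assignment of offline agents to arriving online agents.
# Illustration only: conflicts, capacities, and the greedy rule are assumptions.
import random

random.seed(2)

offline_agents = ["a", "b", "c", "d", "e"]
conflicts = {("a", "b"), ("c", "d")}                # assumed symmetric conflict pairs
capacity = {agent: 2 for agent in offline_agents}   # each offline agent serves at most 2 arrivals

def conflict(x, y):
    return (x, y) in conflicts or (y, x) in conflicts

def assign(online_agent, demand):
    """Greedily pick up to `demand` non-conflicting offline agents with remaining capacity."""
    # `online_agent` is unused in this toy rule; it is kept to mirror the model.
    chosen = []
    for agent in offline_agents:
        if capacity[agent] == 0:
            continue
        if any(conflict(agent, c) for c in chosen):
            continue                                # respect the pairwise conflict constraint
        chosen.append(agent)
        if len(chosen) == demand:
            break
    for agent in chosen:
        capacity[agent] -= 1
    return chosen

# A short stream of online arrivals, each requesting a set of offline agents.
for step in range(6):
    demand = random.randint(1, 3)
    print(f"arrival {step}: assigned {assign(f'u{step}', demand)}")
```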
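Item 3 describes an offline phase and an online phase competing for the same resources. The toy simulation below (referenced in that item) makes that tension concrete: a parameter controls how much of a shared budget is held back for the online phase, and holding back too little or too much loses reward. The single budget, reward values, and hold-back policy are assumptions for illustration; the paper's parameterized LP-based algorithm is not reproduced here.

```python
# Toy two-phase allocation: an offline phase and an online phase share one budget.
# All values and the hold-back policy are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
budget = 100                        # shared resource pool used by both phases
offline_reward = 0.5                # reward per unit allocated in the offline phase
T = 200                             # online horizon
online_rewards = np.array([0.2, 1.0, 2.0])
arrival_probs = np.array([0.5, 0.3, 0.2])   # assumed known, non-uniform arrival distribution

def run(hold_back_fraction):
    """Allocate the rest of the budget offline, then greedily serve valuable online arrivals."""
    remaining = budget * hold_back_fraction
    total = offline_reward * budget * (1.0 - hold_back_fraction)
    arrivals = rng.choice(len(online_rewards), size=T, p=arrival_probs)
    for a in arrivals:
        if remaining >= 1 and online_rewards[a] > offline_reward:
            remaining -= 1
            total += online_rewards[a]
    return total

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"hold back {f:.2f} of the budget -> total reward {run(f):.1f}")
```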
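Item 4 describes fitting a random forest to 14 image-derived parameters plus particle inter-arrival time, training on liquid-only and ice-dominated periods, and filtering out low-confidence predictions. The sketch below (referenced in that item) mirrors that recipe on synthetic placeholder data using scikit-learn; the features, labels, and hyperparameters are placeholders, not the UWILD model or the SOCRATES data.

```python
# Schematic random forest phase classifier on placeholder features.
# The data and hyperparameters are synthetic stand-ins, not UWILD or SOCRATES.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Placeholder inputs: 14 image-derived parameters plus particle inter-arrival time.
n_particles = 5000
image_params = rng.normal(size=(n_particles, 14))
inter_arrival = rng.exponential(scale=1.0, size=(n_particles, 1))
X = np.hstack([image_params, inter_arrival])

# Placeholder labels standing in for liquid-only (0) vs. ice-dominated (1) periods.
y = (image_params[:, 0] + 0.5 * image_params[:, 1]
     + 0.3 * rng.normal(size=n_particles) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Per-particle class probabilities let low-confidence predictions (< 75 %) be filtered out,
# as the abstract describes for misclassified large drops.
proba = clf.predict_proba(X_test)
confident = proba.max(axis=1) >= 0.75
print(f"overall accuracy on synthetic data: {clf.score(X_test, y_test):.2f}")
print(f"fraction of particles kept at >= 75% confidence: {confident.mean():.2f}")
```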