skip to main content


Title: RRR: Rank-Regret Representative
Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: different users may consider different attributes more important, and hence arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement to include the best item for each user, and instead just limit the users’ “regret”. Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full data set, for any chosen ranking function. However, the score is often not a meaningful number and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the data set. In contrast, users do understand the notion of rank ordering. Therefore, we consider the position of the items in the ranked list for defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is NP-complete. We use a geometric interpretation of items to bound their ranks on ranges of functions and to utilize combinatorial geometry notions for developing effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.  more » « less
Award ID(s):
1741022
NSF-PAR ID:
10108005
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proc. ACM SIGMOD Intl Conf. on Mgmt of Data
Page Range / eLocation ID:
263 to 280
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and, instead, just limit the users’ “regret”. Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function. However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in 2D space but is NP-hard on 3 or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, etc. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets. 
    more » « less
  2. In this paper, we consider the Collaborative Ranking (CR) problem for recommendation systems. Given a set of pairwise preferences between items for each user, collaborative ranking can be used to rank un-rated items for each user, and this ranking can be naturally used for recommendation. It is observed that collaborative ranking algorithms usually achieve better performance since they directly minimize the ranking loss; however, they are rarely used in practice due to the poor scalability. All the existing CR algorithms have time complexity at least O(|Ω|r) per iteration, where r is the target rank and |Ω| is number of pairs which grows quadratically with number of ratings per user. For example, the Netflix data contains totally 20 billion rating pairs, and at this scale all the current algorithms have to work with significant subsampling, resulting in poor prediction on testing data. In this paper, we propose a new collaborative ranking algorithm called Primal-CR that reduces the time complexity toO(|Ω|+d1d2r), where d1 is number of users and d2 is the averaged number of items rated by a user. Note that d1, d2 is strictly smaller and open much smaller than |Ω|. Furthermore, by exploiting the fact that most data is in the form of numerical ratings instead of pairwise comparisons, we propose Primal-CR++ with O(d1d2(r + log d2)) time complexity. Both algorithms have better theoretical time complexity than existing approaches and also outperform existing approaches in terms of NDCG and pairwise error on real data sets. To the best of our knowledge, this is the first collaborative ranking algorithm capable of working on the full Netflix dataset using all the 20 billion rating pairs, and this leads to a model with much better recommendation compared with previous models trained on subsamples. Finally, compared with classical matrix factorization algorithm which also requires O(d1 d2r) time, our algorithm has almost the same efficiency while making much better recommendations since we consider the ranking loss. 
    more » « less
  3. Items from a database are often ranked based on a combination of criteria. The weight given to each criterion in the combination can greatly affect the fairness of the produced ranking, for example, preferring men over women. A user may have the flexibility to choose combinations that weigh these criteria differently, within limits. In this paper, we develop a system that helps users choose criterion weights that lead to greater fairness. We consider ranking functions that compute the score of each item as a weighted sum of (numeric) attribute values, and then sort items on their score. Each ranking function can be expressed as a point in a multidimensional space. For a broad range of fairness criteria, including proportionality, we show how to efficiently identify regions in this space that satisfy these criteria. Using this identification method, our system is able to tell users whether their proposed ranking function satisfies the desired fairness criteria and, if it does not, to suggest the smallest modification that does. Our extensive experiments on real datasets demonstrate that our methods are able to find solutions that satisfy fairness criteria effectively (usually with only small changes to proposed weight vectors) and efficiently (in interactive time, after some initial pre-processing). 
    more » « less
  4. null (Ed.)
    We consider the problem of selling perishable items to a stream of buyers in order to maximize social welfare. A seller starts with a set of identical items, and each arriving buyer wants any one item, and has a valuation drawn i.i.d. from a known distribution. Each item, however, disappears after an a priori unknown amount of time that we term the horizon for that item. The seller knows the (possibly different) distribution of the horizon for each item, but not its realization till the item actually disappears. As with the classic prophet inequalities, the goal is to design an online pricing scheme that competes with the prophet that knows the horizon and extracts full social surplus (or welfare). Our main results are for the setting where items have independent horizon distributions satisfying the monotone-hazard-rate (MHR) condition. Here, for any number of items, we achieve a constant-competitive bound via a conceptually simple policy that balances the rate at which buyers are accepted with the rate at which items are removed from the system. We implement this policy via a novel technique of matching via probabilistically simulating departures of the items at future times. Moreover, for a single item and MHR horizon distribution with mean, we show a tight result: There is a fixed pricing scheme that has competitive ratio at most 2 - 1/μ, and this is the best achievable in this class. We further show that our results are best possible. First, we show that the competitive ratio is unbounded without the MHR assumption even for one item. Further, even when the horizon distributions are i.i.d. MHR and the number of items becomes large, the competitive ratio of any policy is lower bounded by a constant greater than 1, which is in sharp contrast to the setting with identical deterministic horizons. 
    more » « less
  5. We consider the problem of subset selection in the online setting, where data arrive incrementally. Instead of storing and running subset selection on the entire dataset, we propose an incremental subset selection framework that, at each time instant, uses the previously selected set of representatives and the new batch of data in order to update the set of representatives. We cast the problem as an integer binary optimization minimizing the encoding cost of the data via representatives regularized by the number of selected items. As the proposed optimization is, in general, NP-hard and non-convex, we study a greedy approach based on unconstrained submodular optimization and also propose an efficient convex relaxation. We show that, under appropriate conditions, the solution of our proposed convex algorithm achieves the global optimal solution of the non-convex problem. Our results also address the conventional problem of subset selection in the offline setting, as a special case. By extensive experiments on the problem of video summarization, we demonstrate that our proposed online subset selection algorithms perform well on real data, capturing diverse representative events in videos, while they obtain objective function values close to the offline setting. 
    more » « less