skip to main content


Title: Revisiting Subset Selection
In the subset-selection approach to ranking and selection, a decision-maker seeks a subset of simulated systems that contains the best with high probability. We present a new, generalized framework for constructing these subsets and demonstrate that some existing subset-selection procedures are situated within this framework. The subsets are built by calculating, for each system, a minimum standardized discrepancy between the observed performances and the space of problem instances for which that system is the best. A system’s minimum standardized discrepancy is then compared to a cutoff to determine whether the system is included in the subset. We examine the problem of finding the tightest statistically valid cutoff for each system and draw connections between our approach and other subset-selection methodologies. Simulation experiments demonstrate how the screening power and subset size are affected by the choice of standardized discrepancy.  more » « less
Award ID(s):
1634982
NSF-PAR ID:
10287343
Author(s) / Creator(s):
; ;
Editor(s):
Bae, K.-H.; Feng, B.; Kim, S.; Lazarova-Molnar, S.; Zheng, Z.; Roeder, T.; Thiesing, R.
Date Published:
Journal Name:
Proceedings of the 2020 Winter Simulation Conference
Page Range / eLocation ID:
2972-2983
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Subset selection is an integral component of AI systems that is increasingly affecting people’s livelihoods in applications ranging from hiring, healthcare, education, to financial decisions. Subset selections powered by AI-based methods include top- analytics, data summarization, clustering, and multi-winner voting. While group fairness auditing tools have been proposed for classification systems, these state-of-the-art tools are not directly applicable to measuring and conceptualizing fairness in selected subsets. In this work, we introduce the first comprehensive auditing framework, FINS, to support stakeholders in interpretably quantifying group fairness across a diverse range of subset-specific fairness concerns. FINS offers a family of novel measures that provide a flexible means to audit group fairness for fairness goals ranging from item-based, score-based, and a combination thereof. FINS provides one unified easy-to-understand interpretation across these different fairness problems. Further, we develop guidelines through the FINS Fair Subset Chart, that supports auditors in determining which measures are relevant to their problem context and fairness objectives. We provide a comprehensive mapping between each fairness measure and the belief system (i.e., worldview) that is encoded within its measurement of fairness. Lastly, we demonstrate the interpretability and efficacy of FINS in supporting the identification of real bias with case studies using AirBnB listings and voter records. 
    more » « less
  2. The key to optimizing the performance of an anycast-based sys- tem (e.g., the root DNS or a CDN) is choosing the right set of sites to announce the anycast prefix. One challenge here is predicting catchments. A naïve approach is to advertise the prefix from all subsets of available sites and choose the best-performing subset, but this does not scale well. We demonstrate that by conducting pairwise experiments between sites peering with tier-1 networks, we can predict the catchments that would result if we announce to any subset of the sites. We prove that our method is effective in a simplified model of BGP, consistent with common BGP routing policies, and evaluate it in a real-world testbed. We then present AnyOpt, a system that predicts anycast catchments. Using AnyOpt, a network operator can find a subset of anycast sites that minimizes client latency without using the naïve approach. In an experiment using 15 sites, each peering with one of six transit providers, AnyOpt predicted site catchments of 15,300 clients with 94.7% accuracy and client RTTs with a mean error of 4.6%. AnyOpt identified a subset of 12 sites, announcing to which lowers the mean RTT to clients by 33ms compared to a greedy approach that enables the same number of sites with the lowest average unicast latency. 
    more » « less
  3. Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: different users may consider different attributes more important, and hence arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement to include the best item for each user, and instead just limit the users’ “regret”. Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full data set, for any chosen ranking function. However, the score is often not a meaningful number and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the data set. In contrast, users do understand the notion of rank ordering. Therefore, we consider the position of the items in the ranked list for defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is NP-complete. We use a geometric interpretation of items to bound their ranks on ranges of functions and to utilize combinatorial geometry notions for developing effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets. 
    more » « less
  4. We consider the problem of subset selection in the online setting, where data arrive incrementally. Instead of storing and running subset selection on the entire dataset, we propose an incremental subset selection framework that, at each time instant, uses the previously selected set of representatives and the new batch of data in order to update the set of representatives. We cast the problem as an integer binary optimization minimizing the encoding cost of the data via representatives regularized by the number of selected items. As the proposed optimization is, in general, NP-hard and non-convex, we study a greedy approach based on unconstrained submodular optimization and also propose an efficient convex relaxation. We show that, under appropriate conditions, the solution of our proposed convex algorithm achieves the global optimal solution of the non-convex problem. Our results also address the conventional problem of subset selection in the offline setting, as a special case. By extensive experiments on the problem of video summarization, we demonstrate that our proposed online subset selection algorithms perform well on real data, capturing diverse representative events in videos, while they obtain objective function values close to the offline setting. 
    more » « less
  5. We consider the problem of subset selection in the online setting, where data arrive incrementally. Instead of storing and running subset selection on the entire dataset, we propose an incremental subset selection framework that, at each time instant, uses the previously selected set of representatives and the new batch of data in order to update the set of representatives. We cast the problem as an integer bi- nary optimization minimizing the encoding cost of the data via representatives regularized by the number of selected items. As the proposed optimization is, in general, NP-hard and non-convex, we study a greedy approach based on un- constrained submodular optimization and also propose an efficient convex relaxation. We show that, under appropriate conditions, the solution of our proposed convex algorithm achieves the global optimal solution of the non-convex problem. Our results also address the conventional problem of subset selection in the offline setting, as a special case. By extensive experiments on the problem of video summarization, we demonstrate that our proposed online subset selection algorithms perform well on real data, capturing diverse representative events in videos, while they obtain objective function values close to the offline setting. 
    more » « less