skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Equitable Top-k Results for Long Tail Data
For datasets exhibiting long tail phenomenon, we identify a fairness concern in existing top-k algorithms, that return a fixed set of k results for a given query. This causes a handful of popular records (products, items, etc) getting overexposed and always be returned to the user query, whereas, there exists a long tail of niche records that may be equally desirable (have similar utility). To alleviate this, we propose θ-Equiv-top-k-MMSP inside existing top-k algorithms - instead of returning a fixed top-k set, it generates all (or many) top-k sets that are equivalent in utility and creates a probability distribution over those sets. The end user will be returned one of these sets during the query time proportional to its associated probability, such that, after many draws from many end users, each record will have as equal exposure as possible (governed by uniform selection probability). θ-Equiv-top-k-MMSP is formalized with two sub-problems. (a) θ-Equiv-top-k-Sets to produce a set S of sets, each set has k records, where the sets are equivalent in utility with the top-k set; (b) MaxMinFair to produce a probability distribution over S, that is, PDF(S), such that the records in S have uniform selection probability. We formally study the hardness of θ-Equiv-top-k-MMSP. We present multiple algorithmic results - (a) An exact solution for θ-Equiv-top-k-Sets, and MaxMinFair. (b) We design highly scalable algorithms that solve θ-Equiv-top-k-Sets through a random walk and is backed by probability theory, as well as a greedy solution designed for MaxMinFair. (c) We finally present an adaptive random walk based algorithm that solves θ-Equiv-top-k-Sets and MaxMinFair at the same time. We empirically study how θ-Equiv-top-k-MMSP can alleviate a equitable exposure concerns that group fairness suffers from. We run extensive experiments using 6 datasets and design intuitive baseline algorithms that corroborate our theoretical analysis.  more » « less
Award ID(s):
1942913 2007935
PAR ID:
10508225
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Management of Data
Volume:
1
Issue:
4
ISSN:
2836-6573
Page Range / eLocation ID:
1 to 24
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r -near neighbor ( r -NN) problem asks for a data structure that, given any query point q , returns a point p within distance at most r from q. In this paper, we study the r -NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. The problem is of special interest in high dimensions, where Locality Sensitive Hashing (LSH), the theoretically leading approach to similarity search, does not provide any fairness guarantee. In this work, we show that LSH-based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also carried out an experimental evaluation that highlights the inherent unfairness of existing NN data structures. 
    more » « less
  2. Guruswami, Venkatesan (Ed.)
    We provide a simple (1-O(1/(√{k)}))-selectable Online Contention Resolution Scheme for k-uniform matroids against a fixed-order adversary. If A_i and G_i denote the set of selected elements and the set of realized active elements among the first i (respectively), our algorithm selects with probability 1-1/(√{k)} any active element i such that |A_{i-1}| + 1 ≤ (1-1/(√{k)})⋅ 𝔼[|G_i|]+√k. This implies a (1-O(1/(√{k)})) prophet inequality against fixed-order adversaries for k-uniform matroids that is considerably simpler than previous algorithms [Alaei, 2014; Azar et al., 2014; Jiang et al., 2022]. We also prove that no OCRS can be (1-Ω(√{(log k)/k}))-selectable for k-uniform matroids against an almighty adversary. This guarantee is matched by the (known) simple greedy algorithm that selects every active element with probability 1-Θ(√{(log k)/k}) [Hajiaghayi et al., 2007]. 
    more » « less
  3. For a graph G on n vertices, naively sampling the position of a random walk of at time t requires work Ω(t). We desire local access algorithms supporting positionG(t) queries, which return the position of a random walk from some fixed start vertex s at time t, where the joint distribution of returned positions is 1/ poly(n) close to those of a uniformly random walk in ℓ1 distance. We first give an algorithm for local access to random walks on a given undirected d-regular graph with eO( 1 1−λ √ n) runtime per query, where λ is the second-largest eigenvalue of the random walk matrix of the graph in absolute value. Since random d-regular graphs G(n, d) are expanders with high probability, this gives an eO(√ n) algorithm for a graph drawn from G(n, d) whp, which improves on the naive method for small numbers of queries. We then prove that no algorithm with subconstant error given probe access to an input d-regular graph can have runtime better than Ω(√ n/ log(n)) per query in expectation when the input graph is drawn from G(n, d), obtaining a nearly matching lower bound. We further show an Ω(n1/4) runtime per query lower bound even with an oblivious adversary (i.e. when the query sequence is fixed in advance). We then show that for families of graphs with additional group theoretic structure, dramatically better results can be achieved. We give local access to walks on small-degree abelian Cayley graphs, including cycles and hypercubes, with runtime polylog(n) per query. This also allows for efficient local access to walks on polylog degree expanders. We show that our techniques apply to graphs with high degree by extending or results to graphs constructed using the tensor product (giving fast local access to walks on degree nϵ graphs for any ϵ ∈ (0, 1]) and Cartesian product. 
    more » « less
  4. In the last two decades, the IR community has seen numerous advances in top-k query processing and inverted index compression techniques. While newly proposed methods are typically compared against several baselines, these evaluations are often very limited, and we feel that there is no clear overall picture on the best choices of algorithms and compression methods. In this paper, we attempt to address this issue by evaluating a number of state-of-the-art index compression methods and safe disjunctive DAAT query processing algorithms. Our goal is to understand how much index compression performance impacts overall query processing speed, how the choice of query processing algorithm depends on the compression method used, and how performance is impacted by document reordering techniques and the number of results returned, keeping in mind that current search engines typically use sets of hundreds or thousands of candidates for further reranking. 
    more » « less
  5. null (Ed.)
    Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important high-dimensional feature vectors. Existing Locality Sensitive Hashing (LSH) techniques require users to find top-k similar feature vectors for each of the feature vectors that represent the query object. This leads to wasted and redundant work due to two main reasons: 1) not all feature vectors may contribute equally in finding the top-k similar multimedia objects, and 2) feature vectors are treated independently during query processing. Additionally, there is no theoretical guarantee on the returned multimedia results. In this work, we propose a practical and efficient indexing approach for finding top-k approximate nearest neighbors for multimedia data using LSH called mmLSH, which can provide theoretical guarantees on the returned multimedia results. Additionally, we present a buffer-conscious strategy to speed up the query processing. Experimental evaluation shows significant gains in performance time and accuracy for different real multimedia datasets when compared against state-of-the-art LSH techniques. 
    more » « less