NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Fair Set Cover

https://doi.org/10.1145/3690624.3709184

Dehghankar, Mohsen; Raychaudhury, Rahul; Sintos, Stavros; Asudeh, Abolfazl (July 2025, ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

Free, publicly-accessible full text available July 20, 2026
Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons

https://doi.org/10.48786/edbt.2025.15

Ebrahimi, Sana; Advani, Rishi; Asudeh, Abolfazl (March 2025, EDBT'25: Proceedings of the International Conference on Extending Database Technology)
Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups

Dehghankar, Mohsen; Asudeh, Abolfazl (January 2025, Proceedings of the VLDB Endowment)

Free, publicly-accessible full text available January 10, 2026
Axolotl: Fairness through Assisted Prompt Rewriting of Large Language Model Outputs

https://doi.org/10.1109/ICKG63256.2024.00017

Ebrahimi, Sana; Chen, Kaiwen; Asudeh, Abolfazl; Das, Gautam; Koudas, Nick (December 2024, IEEE)

Free, publicly-accessible full text available December 11, 2025
Improved Approximation Algorithms for Relational Clustering

https://doi.org/10.1145/3695831

Esmailpour, Aryan; Sintos, Stavros (November 2024, Proceedings of the ACM on Management of Data)

Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. Particularly in relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need for pre-computing the join query results. For the relational k-median clustering, we propose the first efficient relative approximation algorithm. For the relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer either from large constant approximation factors, or expensive running time. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+ε)γ-approximation algorithms that run in roughly O(k²N^fhw)+T_γ(k²) time, where ε ∈ (0,1) is a constant parameter decided by the user, \fhw is the fractional hyper-tree width of Q, while γ and T_γ(x) represent the approximation factor and the running time, respectively, of a traditional clustering algorithm in the standard computational setting over x points.
more » « less
Free, publicly-accessible full text available November 4, 2025
Computing A Well-Representative Summary of Conjunctive Query Results

https://doi.org/10.1145/3695835

Agarwal, Pankaj K; Esmailpour, Aryan; Hu, Xiao; Sintos, Stavros; Yang, Jun (November 2024, Proceedings of the ACM on Management of Data)

Data summarization is a powerful approach to deal with large-scale data analytics, which has wide applications in web search, recommendation systems, approximate query processing, etc. It computes a small, compact summary that preserves vital properties of the original data. In this paper, we study the data summarization problem of conjunctive query results, i.e., computing a k-size subset of a conjunctive query output, for any given k>0, that optimizes a certain objective. More specifically, we are interested in two commonly studied objectives: cohesion, which measures the maximum distance between a tuple in the query result tuples and its closest tuple in the summary (k-center clustering); and diversity, which measures the pairwise distances between the summary items. A simple approach that computes the entire query output and then applies existing algorithms on top of these materialized tuples suffers from high computational complexity because the query output can be large, e.g., for a relational database of N tuples, the number of result tuples can be N^O(1).We propose O(1)-approximation algorithms that compute well-representative summaries of size k in time O(N*k^O(1)), or even O(N+ k^O(1)) in some cases, without computing all result tuples. We also propose the first efficient (2+\eps)-approximation algorithm for the k-center clustering problem over relational data. Our main idea is to formulate a few oracles that enable us to access specific query result tuples with certain properties, to show how these oracles can be implemented efficiently, and to compute desired summaries with few invocations of these oracles.
more » « less
Free, publicly-accessible full text available November 4, 2025
k-Clustering with Comparison and Distance Oracles

https://doi.org/10.1145/3695830

Galhotra, Sainyam; Raychaudhury, Rahul; Sintos, Stavros (November 2024, Proceedings of the ACM on Management of Data)

In this paper, we address clustering problems in scenarios where accurate direct access to the full dataset is impractical or impossible. Instead, we leverage oracle-based methods, which are particularly valuable in real-world applications where the data may be noisy, restricted due to privacy concerns or sheer volume. We utilize two oracles, the quadruplet and the distance oracle. The quadruplet oracle is a weaker oracle that only approximately compares the distances of two pairs of vertices. In practice, these oracles can be implemented using crowdsourcing or training classifiers or other predictive models. On the other hand, the distance oracle returns exactly the distance of two vertices, so it is a stronger and more expensive oracle to implement. We consider two noise models for the quadruplet oracle. In the adversarial noise model, if two pairs have similar distances, the response is chosen by an adversary. In the probabilistic noise model, the pair with the smaller distance is returned with a constant probability. We consider a set V of n vertices in a metric space that supports the quadruplet and the distance oracle. For each of the k-center, k-median, and k-means clustering problem on V, we design constant approximation algorithms that perform roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle in both noise models. When the dataset has low intrinsic dimension, we significantly improve the approximation factors of our algorithms by performing a few additional calls to the distance oracle. We also show that for k-median and k-means clustering there is no hope to return any sublinear approximation using only the quadruplet oracle. Finally, we give constant approximation algorithms for estimating the clustering cost induced by any set of k vertices, performing roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle.
more » « less
Free, publicly-accessible full text available November 4, 2025
FairEM360: A Suite for Responsible Entity Matching

Shahbazi, Nima; Erfanian, Mahdi; Asudeh, Abolfazl; Nargesian, Fatemeh; Srivastava, Divesh (August 2024, Proceedings of the VLDB Endowment)

Full Text Available
Reliability evaluation of individual predictions: a data-centric approach

https://doi.org/10.1007/s00778-024-00857-w

Shahbazi, Nima; Asudeh, Abolfazl (July 2024, The VLDB Journal)

Full Text Available
Chameleon: Foundation Models for Fairness-Aware Multi-Modal Data Augmentation to Enhance Coverage of Minorities

https://doi.org/10.14778/3681954.3682014

Erfanian, Mahdi; Jagadish, H V; Asudeh, Abolfazl (July 2024, Proceedings of the VLDB Endowment)

Potential harms from the under-representation of minorities in data, particularly in multi-modal settings, is a well-recognized concern. While there has been extensive effort in detecting such under-representation, resolution has remained a challenge. With recent generative AI advancements, large language and foundation models have emerged as versatile tools across various domains. In this paper, we propose Chameleon, a system that efficiently utilizes these tools to augment a dataset with minimal addition of synthetically generated tuples to enhance the coverage of the under-represented groups. Our system applies quality and outlier-detection tests to ensure the quality and semantic integrity of the generated tuples. In order to minimize the rejection chance of the generated tuples, we propose multiple strategies to provide a guide for the foundation model. Our experiment results, in addition to confirming the efficiency of our proposed algorithms, illustrate our approach's effectiveness, as the model's unfairness in a downstream task significantly dropped after data repair using Chameleon.
more » « less
Full Text Available

« Prev Next »

Search for: All records