skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport
Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.  more » « less
Award ID(s):
2012266 2112606
PAR ID:
10547514
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Proceedings of the ACM on Management of Data
Date Published:
Journal Name:
Proceedings of the ACM on Management of Data (SIGMOD)
Volume:
2
Issue:
3
ISSN:
2836-6573
Page Range / eLocation ID:
1 to 26
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We commonly encounter the problem of identifying an optimally weight-adjusted version of the empirical distribution of observed data, adhering to predefined constraints on the weights. Such constraints often manifest as restrictions on the moments, tail behavior, shapes, number of modes, etc., of the resulting weight-adjusted empirical distribution. In this article, we substantially enhance the flexibility of such a methodology by introducing a nonparametrically imbued distributional constraint on the weights and developing a general framework leveraging the maximum entropy principle and tools from optimal transport. The key idea is to ensure that the maximum entropy weight-adjusted empirical distribution of the observed data is close to a pre-specified probability distribution in terms of the optimal transport metric, while allowing for subtle departures. The proposed scheme for the re-weighting of observations subject to constraints is reminiscent of the empirical likelihood and related ideas, but offers greater flexibility in applications where parametric distribution-guided constraints arise naturally. The versatility of the proposed framework is demonstrated in the context of three disparate applications where data re-weighting is warranted to satisfy side constraints on the optimization problem at the heart of the statistical task—namely, portfolio allocation, semi-parametric inference for complex surveys, and ensuring algorithmic fairness in machine learning algorithms. 
    more » « less
  2. Despite the power and flexibility of configuration interaction (CI) based methods in computational chemistry, their broader application is limited by an exponential increase in both computational and storage requirements, particularly due to the substantial memory needed for excitation lists that are crucial for scalable parallel computing. The objective of this work is to develop a new CI framework, namely, the small tensor product distributed active space (STP-DAS) framework, aimed at drastically reducing memory demands for extensive CI calculations on individual workstations or laptops, while simultaneously enhancing scalability for extensive parallel computing. Moreover, the STP-DAS framework can support various CI-based techniques, such as complete active space (CAS), restricted active space, generalized active space, multireference CI, and multireference perturbation theory, applicable to both relativistic (two- and four-component) and non-relativistic theories, thus extending the utility of CI methods in computational research. We conducted benchmark studies on a supercomputer to evaluate the storage needs, parallel scalability, and communication downtime using a realistic exact-two-component CASCI (X2C-CASCI) approach, covering a range of determinants from 109 to 1012. Additionally, we performed large X2C-CASCI calculations on a single laptop and examined how the STP-DAS partitioning affects performance. 
    more » « less
  3. In this work, we develop a new framework for dynamic network flow problems based on optimal transport theory. We show that the dynamic multicommodity minimum-cost network flow problem can be formulated as a multimarginal optimal transport problem, where the cost function and the constraints on the marginals are associated with a graph structure. By exploiting these structures and building on recent advances in optimal transport theory, we develop an efficient method for such entropy-regularized optimal transport problems. In particular, the graph structure is utilized to efficiently compute the projections needed in the corresponding Sinkhorn iterations, and we arrive at a scheme that is both highly computationally efficient and easy to implement. To illustrate the performance of our algorithm, we compare it with a state-of-the-art linear programming (LP) solver. We achieve good approximations to the solution at least one order of magnitude faster than the LP solver. Finally, we showcase the methodology on a traffic routing problem with a large number of commodities. Funding: This work was supported by KTH Digital Futures, Knut och Alice Wallenbergs Stiftelse [Grants KAW 2018.0349, KAW 2021.0274, the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation], Vetenskapsrådet [Grant 2020-03454], and the National Science Foundation [Grants 1942523 and 2206576]. 
    more » « less
  4. n this paper we study the bicausal optimal transport problem for Markov chains, an optimal transport formulation suitable for stochastic processes which takes into consideration the accumulation of information as time evolves. Our analysis is based on a relation between the transport problem and the theory of Markov decision processes. This way we are able to derive necessary and sufficient conditions for optimality in the transport problem, as well as an iterative algorithm, namely the value iteration, for the calculation of the transportation cost. Additionally, we draw the connection with the classic theory on couplings for Markov chains, and in particular with the notion of faithful couplings. 
    more » « less
  5. Network alignment is a critical steppingstone behind a variety of multi-network mining tasks. Most of the existing methods essentially optimize a Frobenius-like distance or ranking-based loss, ignoring the underlying geometry of graph data. Optimal transport (OT), together with Wasserstein distance, has emerged to be a powerful approach accounting for the underlying geometry explicitly. Promising as it might be, the state-of-the-art OT-based alignment methods suffer from two fundamental limitations, including (1) effectiveness due to the insufficient use of topology and consistency information and (2) scalability due to the non-convex formulation and repeated computationally costly loss calculation. In this paper, we propose a position-aware regularized optimal transport framework for network alignment named PARROT. To tackle the effectiveness issue, the proposed PARROT captures topology information by random walk with restart, with three carefully designed consistency regularization terms. To tackle the scalability issue, the regularized OT problem is decomposed into a series of convex subproblems and can be efficiently solved by the proposed constrained proximal point method with guaranteed convergence. Extensive experiments show that our algorithm achieves significant improvements in both effectiveness and scalability, outperforming the state-of-the-art network alignment methods and speeding up existing OT-based methods by up to 100 times. 
    more » « less