skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Detecting Temporal Dependencies in Data
Organizations collect data from various sources, and these datasets may have characteristics that are unknown. Selecting the appropriate statistical and machine learning algorithm for data analytical purposes benefits from understanding these characteristics, such as if it contains temporal attributes or not. This paper presents a theoretical basis for automatically determining the presence of temporal data in a dataset given no prior knowledge about its attributes. We use a method to classify an attribute as temporal, non-temporal, or hidden temporal. A hidden (grouping) temporal attribute can only be treated as temporal if its values are categorized in groups. Our method uses a Ljung-Box test for autocorrelation as well as a set of metrics we proposed based on the classification statistics. Our approach detects all temporal and hidden temporal attributes in 15 datasets from various domains.  more » « less
Award ID(s):
2027750 1822118
PAR ID:
10340373
Author(s) / Creator(s):
; ; ;
Editor(s):
Pirk, Holger; Heinis, Thomas
Date Published:
Journal Name:
Proceedings of the British International Conference on Databases
Page Range / eLocation ID:
29-39
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Pirk, Holger; Heinis, Thomas (Ed.)
    Organizations collect data from various sources, and these datasets may have characteristics that are unknown. Selecting the appropriate statistical and machine learning algorithm for data analytical purposes benefits from understanding these characteristics, such as if it contains temporal attributes or not. This paper presents a theoretical basis for automatically determining the presence of temporal data in a dataset given no prior knowledge about its attributes. We use a method to classify an attribute as temporal, non-temporal, or hidden temporal. A hidden (grouping) temporal attribute can only be treated as temporal if its values are categorized in groups. Our method uses a Ljung-Box test for autocorrelation as well as a set of metrics we proposed based on the classification statistics. Our approach detects all temporal and hidden temporal attributes in 15 datasets from various domains. 
    more » « less
  2. Subspace clustering algorithms are used for understanding the cluster structure that explains the patterns prevalent in the dataset well. These methods are extensively used for data-exploration tasks in various areas of Natural Sciences. However, most of these methods fail to handle confounding attributes in the dataset. For datasets where a data sample represent multiple attributes, naively applying any clustering approach can result in undesired output. To this end, we propose a novel framework for jointly removing confounding attributes while learning to cluster data points in individual subspaces. Assuming we have label information about these confounding attributes, we regularize the clustering method by adversarially learning to minimize the mutual information between the data representation and the confounding attribute labels. Our experimental result on synthetic and real-world datasets demonstrate the effectiveness of our approach. 
    more » « less
  3. Geospatial data visualization on a map is an essential tool for modern data exploration tools. However, these tools require users to manually configure the visualization style including color scheme and attribute selection, a process that is both complex and domain-specific. Large Language Models (LLMs) provide an opportunity to intelligently assist in styling based on the underlying data distribution and characteristics. This paper demonstrates LASEK, an LLM-assisted visualization framework that automates attribute selection and styling in large-scale spatio-temporal datasets. The system leverages LLMs to determine which attributes should be highlighted for visual distinction and even suggests how to integrate them in styling options improving interpretability and efficiency. We demonstrate our approach through interactive visualization scenarios, showing how LLM-driven attribute selection enhances clarity, reduces manual effort, and provides data-driven justifications for styling decisions. 
    more » « less
  4. The operationalization of algorithmic fairness comes with several practical challenges, not the least of which is the availability or reliability of protected attributes in datasets. In real-world contexts, practical and legal impediments may prevent the collection and use of demographic data, making it difficult to ensure algorithmic fairness. While initial fairness algorithms did not consider these limitations, recent proposals aim to achieve algorithmic fairness in classification by incorporating noisiness in protected attributes or not using protected attributes at all. To the best of our knowledge, this is the first head-to-head study of fair classification algorithms to compare attribute-reliant, noise-tolerant and attribute-unaware algorithms along the dual axes of predictivity and fairness. We evaluated these algorithms via case studies on four real-world datasets and synthetic perturbations. Our study reveals that attribute-unaware and noise-tolerant fair classifiers can potentially achieve similar level of performance as attribute-reliant algorithms, even when protected attributes are noisy. However, implementing them in practice requires careful nuance. Our study provides insights into the practical implications of using fair classification algorithms in scenarios where protected attributes are noisy or partially available. 
    more » « less
  5. Federated learning (FL) has been widely studied recently due to its property to collaboratively train data from different devices without sharing the raw data. Nevertheless, recent studies show that an adversary can still be possible to infer private information about devices' data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate the attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees. We address these issues and design a task-agnostic privacy-preserving presentation learning method for FL (TAPPFL) against attribute inference attacks. TAPPFL is formulated via information theory. Specifically, TAPPFL has two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device's data, and the other goal ensures the learnt data representations include as much information as possible about the device data to maintain FL utility. We also derive privacy guarantees of TAPPFL against worst-case attribute inference attacks, as well as the inherent tradeoff between utility preservation and privacy protection. Extensive results on multiple datasets and applications validate the effectiveness of TAPPFL to protect data privacy, maintain the FL utility, and be efficient as well. Experimental results also show that TAPPFL outperforms the existing defenses. 
    more » « less