skip to main content


Title: Detecting Temporal Dependencies in Data
Organizations collect data from various sources, and these datasets may have characteristics that are unknown. Selecting the appropriate statistical and machine learning algorithm for data analytical purposes benefits from understanding these characteristics, such as if it contains temporal attributes or not. This paper presents a theoretical basis for automatically determining the presence of temporal data in a dataset given no prior knowledge about its attributes. We use a method to classify an attribute as temporal, non-temporal, or hidden temporal. A hidden (grouping) temporal attribute can only be treated as temporal if its values are categorized in groups. Our method uses a Ljung-Box test for autocorrelation as well as a set of metrics we proposed based on the classification statistics. Our approach detects all temporal and hidden temporal attributes in 15 datasets from various domains.  more » « less
Award ID(s):
2027750 1822118
NSF-PAR ID:
10340373
Author(s) / Creator(s):
; ; ;
Editor(s):
Pirk, Holger; Heinis, Thomas
Date Published:
Journal Name:
Proceedings of the British International Conference on Databases
Page Range / eLocation ID:
29-39
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Pirk, Holger ; Heinis, Thomas (Ed.)
    Organizations collect data from various sources, and these datasets may have characteristics that are unknown. Selecting the appropriate statistical and machine learning algorithm for data analytical purposes benefits from understanding these characteristics, such as if it contains temporal attributes or not. This paper presents a theoretical basis for automatically determining the presence of temporal data in a dataset given no prior knowledge about its attributes. We use a method to classify an attribute as temporal, non-temporal, or hidden temporal. A hidden (grouping) temporal attribute can only be treated as temporal if its values are categorized in groups. Our method uses a Ljung-Box test for autocorrelation as well as a set of metrics we proposed based on the classification statistics. Our approach detects all temporal and hidden temporal attributes in 15 datasets from various domains. 
    more » « less
  2. The historical settlement data compilation for Spain (HISDAC-ES) is a geospatial dataset consisting of over 240 gridded surfaces measuring the physical, functional, age-related, and evolutionary characteristics of the Spanish building stock. We scraped, harmonized, and aggregated cadastral building footprint data for Spain, covering over 12,000,000 building footprints including construction year attributes, to create a multi-faceted series of gridded surfaces (GeoTIFF format), describing the evolution of human settlements in Spain from 1900 to 2020, at 100m spatial and 5 years temporal resolution. Also, the dataset contains aggregated characteristics and completeness statistics at the municipality level, in CSV and GeoPackage format.

    !!! UPDATE 08-2023 !!!: We provide a new, improved version of HISDAC-ES. Specifically, we fixed two bugs in the production code that caused an incorrect rasterization of the multitemporal BUFA layers and of the PHYS layers (BUFA, BIA, DWEL, BUNITS sum and mean). Moreover, we added decadal raster datasets measuring residential building footprint and building indoor area (1900-2020), and provide a country-wide, harmonized building footprint centroid dataset in GeoPackage vector data format.

    File descriptions:

    Datasets are available in three spatial reference systems:

    1. HISDAC-ES_All_LAEA.zip: Raster data in Lambert Azimuthal Equal Area (LAEA) covering all Spanish territory.
    2. HISDAC-ES_IbericPeninsula_UTM30.zip: Raster data in UTM Zone 30N covering all the Iberic Peninsula + Céuta and Melilla.
    3. HISDAC-ES_CanaryIslands_REGCAN.zip: Raster data in REGCAN-95, covering the Canary Islands only.
    4. HISDAC-ES_MunicipAggregates.zip: Municipality-level aggregates and completeness statistics (CSV, GeoPackage), in LAEA projection.
    5. ES_building_centroids_merged_spatjoin.gpkg: 7,000,000+ building footprint centroids in GeoPackage format, harmonized from the different cadastral systems, representing the input data for HISDAC-ES. These data can be used for sanity checks or for the creation of further, user-defined gridded surfaces.

    Source data:

    HISDAC-ES is derived from cadastral building footprint data, available from different authorities in Spain:

    • Araba province: https://geo.araba.eus/WFS_Katastroa?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetCapabilities
    • Bizkaia province: https://web.bizkaia.eus/es/inspirebizkaia
    • Gipuzkoa province: https://b5m.gipuzkoa.eus/web5000/es/utilidades/inspire/edificios/
    • Navarra region: https://inspire.navarra.es/services/BU/wfs
    • Other regions: http://www.catastro.minhap.es/INSPIRE/buildings/ES.SDGC.bu.atom.xml
    • Data source of municipality polygons: Centro Nacional de Información Geográfica (https://centrodedescargas.cnig.es/CentroDescargas/index.jsp)

    Technical notes:

    Gridded data

    File nomenclature:

    ./region_projection_theme/hisdac_es_theme_variable_version_resolution[m][_year].tif

    Regions:

    • all: complete territory of Spain
    • can: Canarian Islands only
    • ibe: Iberic peninsula + Céuta + Melilla

    Projections:

    • laea: Lambert azimuthal equal area (EPSG:3035)
    • regcan: REGCAN95 / UTM zone 28N (EPSG:4083)
    • utm: ETRS89 / UTM zone 30N (EPSG:25830)

    Themes:

    • evolution / evol: multi-temporal physical measurements
    • landuse: multi-temporal building counts per land use (i.e., building function) class
    • physical / phys: physical building characteristics in 2020
    • temporal / temp: temporal characteristics (construction year statistics)

    Variables: evolution

    • budens: building density (count per grid cell area)
    • bufa: building footprint area
    • deva: developed area (any grid cell containing at least one building)
    • resbufa: residential building footprint area
    • resbia: residential building indoor area

    Variables: physical

    • bia: building indoor area
    • bufa: building footprint area
    • bunits: number of building units
    • dwel: number of dwellings

    Variables: temporal

    • mincoy: minimum construction year per grid cell
    • maxcoy: minimum construction year per grid cell
    • meancoy: mean construction year per grid cell
    • medcoy: median construction year per grid cell
    • modecoy: mode (most frequent) construction year per grid cell
    • varcoy: variety of construction years per grid cell

    Variable: landuse

    Counts of buildings per grid cell and land use type.

    Municipality-level data

    • hisdac_es_municipality_stats_multitemporal_longform_v1.csv: This CSV file contains the zonal sums of the gridded surfaces (e.g., number of buildings per year and municipality) in long form. Note that a value of 0 for the year attribute denotes the statistics for records without construction year information.
    • hisdac_es_municipality_stats_multitemporal_wideform_v1.csv: This CSV file contains the zonal sums of the gridded surfaces (e.g., number of buildings per year and municipality) in wide form. Note that a value of 0 for the year suffix denotes the statistics for records without construction year information.
    • hisdac_es_municipality_stats_completeness_v1.csv: This CSV file contains the missingness rates (in %) of the building attribute per municipality, ranging from 0.0 (attribute exists for all buildings) to 100.0 (attribute exists for none of the buildings) in a given municipality.

    Column names for the completeness statistics tables:

    • NATCODE: National municipality identifier*
    • num_total: number of buildings per munic
    • perc_bymiss: Percentage of buildings with missing built year (construction year)
    • perc_lumiss: Percentage of buildings with missing landuse attribute
    • perc_luother: Percentage of buildings with landuse type "other"
    • perc_num_floors_miss: Percentage of buildings without valid number of floors attribute
    • perc_num_dwel_miss: Percentage of buildings without valid number of dwellings attribute
    • perc_num_bunits_miss: Percentage of buildings without valid number of building units attribute
    • perc_offi_area_miss: Percentage of buildings without valid official area (building indoor area, BIA) attribute
    • perc_num_dwel_and_num_bunits_miss: Percentage of buildings missing both number of dwellings and number of building units attribute

    The same statistics are available as geopackage file including municipality polygons in Lambert azimuthal equal area (EPSG:3035).

    *From the NATCODE, other regional identifiers can be derived as follows:

    • NATCODE: 34 01 04 04001
    • Country: 34
    • Comunidad autónoma (CA_CODE): 01
    • Province (PROV_CODE): 04
    • LAU code: 04001 (province + municipality code)
     
    more » « less
  3. Federated learning (FL) has been widely studied recently due to its property to collaboratively train data from different devices without sharing the raw data. Nevertheless, recent studies show that an adversary can still be possible to infer private information about devices' data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate the attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees. We address these issues and design a task-agnostic privacy-preserving presentation learning method for FL (TAPPFL) against attribute inference attacks. TAPPFL is formulated via information theory. Specifically, TAPPFL has two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device's data, and the other goal ensures the learnt data representations include as much information as possible about the device data to maintain FL utility. We also derive privacy guarantees of TAPPFL against worst-case attribute inference attacks, as well as the inherent tradeoff between utility preservation and privacy protection. Extensive results on multiple datasets and applications validate the effectiveness of TAPPFL to protect data privacy, maintain the FL utility, and be efficient as well. Experimental results also show that TAPPFL outperforms the existing defenses.

     
    more » « less
  4. Subspace clustering algorithms are used for understanding the cluster structure that explains the patterns prevalent in the dataset well. These methods are extensively used for data-exploration tasks in various areas of Natural Sciences. However, most of these methods fail to handle confounding attributes in the dataset. For datasets where a data sample represent multiple attributes, naively applying any clustering approach can result in undesired output. To this end, we propose a novel framework for jointly removing confounding attributes while learning to cluster data points in individual subspaces. Assuming we have label information about these confounding attributes, we regularize the clustering method by adversarially learning to minimize the mutual information between the data representation and the confounding attribute labels. Our experimental result on synthetic and real-world datasets demonstrate the effectiveness of our approach. 
    more » « less
  5. We consider the task of interorganizational data sharing, in which data owners, data clients, and data subjects have different and sometimes competing privacy concerns. One real-world scenario in which this problem arises concerns law-enforcement use of phone-call metadata: The data owner is a phone company, the data clients are law-enforcement agencies, and the data subjects are individuals who make phone calls. A key challenge in this type of scenario is that each organization uses its own set of proprietary intraorganizational attributes to describe the shared data; such attributes cannot be shared with other organizations. Moreover, data-access policies are determined by multiple parties and may be specified using attributes that are not directly comparable with the ones used by the owner to specify the data.

    We propose a system architecture and a suite of protocols that facilitate dynamic and efficient interorganizational data sharing, while allowing each party to use its own set of proprietary attributes to describe the shared data and preserving the confidentiality of both data records and proprietary intraorganizational attributes. We introduce the novel technique ofAttribute-Based Encryption with Oblivious Attribute Translation (OTABE), which plays a crucial role in our solution. This extension of attribute-based encryption uses semi-trusted proxies to enable dynamic and oblivious translation between proprietary attributes that belong to different organizations; it supports hidden access policies, direct revocation, and fine-grained, data-centric keys and queries. We prove that our OTABE-based framework is secure in the standard model and provide two real-world use cases.

     
    more » « less