Title: Managing Uncertainty in Evolving Geo-Spatial Data
Our ability to extract knowledge from evolving spatial phenomena and make it actionable is often impaired by unreliable, erroneous, obsolete, imprecise, sparse, and noisy data. Integrating the impact of this uncertainty is paramount when estimating the reliability or confidence of any time-varying query result derived from the underlying input data. The goal of this advanced seminar is to survey solutions for managing, querying, and mining uncertain spatial and spatio-temporal data. We survey different models and show examples of how to efficiently enrich query results with reliability information. We discuss both analytical solutions and approximate solutions based on geosimulation.
Award ID(s):
1637541
NSF-PAR ID:
10187151
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
21st IEEE International Conference on Mobile Data Management (MDM)
Page Range / eLocation ID:
5 to 8
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. An inherent challenge arising in any dataset containing spatial and/or temporal information is uncertainty due to various sources of imprecision. Integrating the impact of this uncertainty is paramount when estimating the reliability (confidence) of any query result derived from the underlying input data. To deal with uncertainty, solutions have been proposed independently in the geo-science and data-science research communities. This interdisciplinary tutorial bridges the gap between the two communities by providing a comprehensive overview of the challenges involved in dealing with uncertain geo-spatial data, by surveying solutions from both research communities, and by identifying similarities, synergies, and open research problems.
  2. A key barrier to applying any smart technology to a building is the requirement of locating and connecting to the necessary resources among the thousands of sensing and control points, i.e., the metadata mapping problem. Existing solutions depend on exhaustive manual annotation of sensor metadata, a laborious, costly, and hardly scalable process. To reduce the amount of manual effort required, this paper presents a multi-oracle selective sampling framework that leverages noisy labels from information sources with unknown reliability, such as existing buildings, which we refer to as weak oracles, for metadata mapping. This framework involves an interactive process in which a small set of sensor instances is progressively selected and labeled so that the framework learns how to aggregate the noisy labels as well as how to predict sensor types. Two key challenges arise in designing the framework, namely, weak oracle reliability estimation and instance selection for querying. To address the first challenge, we develop a clustering-based approach for weak oracle reliability estimation that capitalizes on the observation that weak oracles perform differently in different groups of instances. For the second challenge, we propose a disagreement-based query selection strategy that combines the potential effect of a labeled instance on both reducing classifier uncertainty and improving the quality of label aggregation. We evaluate our solution on a large collection of real-world building sensor data from 5 buildings with more than 11,000 sensors of 18 different types. The experimental results validate the effectiveness of our solution, which outperforms a set of state-of-the-art baselines.
  3. Skyline queries are used to find the Pareto-optimal solutions in datasets containing multi-dimensional data points. In this paper, we propose a new type of skyline query whose evaluation is constrained by a multi-cost transportation network (MCTN) and whose answers are off the network. This type of skyline query is useful in many applications. For example, a person wants to find an apartment by considering not only the price and the surrounding area of the apartment, but also the transportation cost, time, and distance between the apartment and his/her workplace. Most existing works that evaluate skyline queries on multi-cost networks (MCNs), which are either MCTNs or road networks, find interesting objects that are located on edges of the networks. Formally, our new type of skyline query takes as input an MCTN, a query point q, and a set of objects of interest D with spatial information, where q and the objects in D are off the network. The answers to such queries are the objects in D that are not dominated by other objects in D when considering the multiple attributes of these objects and the multiple network costs from q to the solution objects. To evaluate such queries, we propose an exact search algorithm and an improved version of it that exploits several properties. The space of exact skyline solutions is huge: it can easily reach the order of thousands and incur long evaluation times. We therefore design much more efficient heuristic methods to find approximate solutions. We run extensive experiments using both real and synthetic datasets to test the effectiveness and efficiency of our proposed approaches. The results show that the exact search algorithm can be dramatically improved by utilizing several properties. The heuristic approaches for finding approximate answers can greatly reduce the query time and retrieve results that are comparable to the exact solutions.
  4. Silva, Daniel de (Ed.)
    Biodiversity loss is a global ecological crisis that is both a driver of and response to environmental change. Understanding the connections between species declines and other components of human-natural systems extends across the physical, life, and social sciences. From an analysis perspective, this requires integration of data from different scientific domains, which often have heterogeneous scales and resolutions. Community science projects such as eBird may help to fill spatiotemporal gaps and enhance the resolution of standardized biological surveys. Comparisons between eBird and the more comprehensive North American Breeding Bird Survey (BBS) have found these datasets can produce consistent multi-year abundance trends for bird populations at national and regional scales. Here we investigate the reliability of these datasets for estimating patterns at finer resolutions, namely inter-annual changes in abundance within town boundaries. Using a case study of 14 focal species within Massachusetts, we calculated four indices of annual relative abundance using eBird and BBS datasets, including two different modeling approaches within each dataset. We compared the correspondence between these indices in terms of multi-year trends, annual estimates, and inter-annual changes in estimates at the state and town level. We found correspondence between eBird and BBS multi-year trends, but this was not consistent across all species and diminished at finer, inter-annual temporal resolutions. We further show that standardizing modeling approaches can increase index reliability even between datasets at coarser temporal resolutions. Our results indicate that multiple datasets and modeling methods should be considered when estimating species population dynamics at finer temporal resolutions, but standardizing modeling approaches may improve estimate correspondence between abundance datasets. In addition, reliability of these indices at finer spatial scales may depend on habitat composition, which can impact survey accuracy.
  5. We address the problem of maintaining the correct answer sets to the Conditional Maximizing Range-Sum (C-MaxRS) query in spatial data streams. Given a set of (possibly weighted) 2D point objects, the traditional MaxRS problem determines an optimal placement for an axes-parallel rectangle r so that the number – or the weighted sum – of objects in its interior is maximized. In many practical settings, the objects from a particular set – e.g., restaurants – can be of distinct types – e.g., fast-food, Asian, etc. The C-MaxRS problem deals with maximizing the overall sum given class-based existential constraints, i.e., a lower bound on the count of objects of interest from particular classes. We first propose an efficient algorithm for the static C-MaxRS query and then extend the solution to handle dynamic (data stream) settings. Our experiments over datasets of up to 100,000 objects show that the proposed solutions provide significant efficiency benefits.
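To make the traditional MaxRS problem defined in item 5 concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from that paper, and it ignores the class-based constraints of C-MaxRS. The function name `max_rs_bruteforce`, the `(x, y, weight)` tuple layout, and the use of a closed rectangle (points on the boundary count as covered) are assumptions made for illustration. It also assumes non-negative weights, under which some optimal rectangle can be slid until its left and bottom edges each touch a point, so only those placements need to be checked.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, weight); hypothetical layout


def max_rs_bruteforce(
    points: List[Point], width: float, height: float
) -> Tuple[Optional[Tuple[float, float]], float]:
    """Return (lower-left corner, covered weight) of a best rectangle placement.

    Assumes non-negative weights and a closed rectangle
    [x0, x0 + width] x [y0, y0 + height]. Runs in O(n^3) time.
    """
    best_corner: Optional[Tuple[float, float]] = None
    best_sum = 0.0
    # Candidate corners: left edge on some point's x, bottom edge on some point's y.
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    for x0 in xs:
        for y0 in ys:
            total = sum(
                w
                for (x, y, w) in points
                if x0 <= x <= x0 + width and y0 <= y <= y0 + height
            )
            if total > best_sum:
                best_sum = total
                best_corner = (x0, y0)
    return best_corner, best_sum


if __name__ == "__main__":
    sample = [(0, 0, 1), (1, 1, 1), (5, 5, 1), (1.5, 0.5, 2)]
    print(max_rs_bruteforce(sample, width=2, height=2))
```

The cubic running time makes this suitable only for small inputs; it is meant to show what is being maximized, not how to do so efficiently on streams.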