Handling Uncertainty in Geo-Spatial Data
An inherent challenge in any dataset containing information about space and/or time is uncertainty arising from various sources of imprecision. Integrating the impact of this uncertainty is paramount when estimating the reliability (confidence) of any query result derived from the underlying input data. To deal with uncertainty, solutions have been proposed independently in the geo-science and data-science research communities. This interdisciplinary tutorial bridges the gap between the two communities by providing a comprehensive overview of the challenges involved in dealing with uncertain geo-spatial data, by surveying solutions from both research communities, and by identifying similarities, synergies, and open research problems.
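To make the notion of query-result confidence concrete, the sketch below models an uncertain location as a discrete set of possible positions with associated probabilities, and answers a range query by returning the probability that the object lies inside the query rectangle. This is a minimal illustration of one common uncertainty model, not code from the tutorial; all names (`UncertainPoint`, `range_query_confidence`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class UncertainPoint:
    """An uncertain location: possible positions with probabilities summing to 1."""
    samples: List[Tuple[float, float]]
    weights: List[float]

def in_rect(p: Tuple[float, float], rect: Rect) -> bool:
    xmin, ymin, xmax, ymax = rect
    x, y = p
    return xmin <= x <= xmax and ymin <= y <= ymax

def range_query_confidence(obj: UncertainPoint, rect: Rect) -> float:
    """Probability that the uncertain object lies inside the query rectangle."""
    return sum(w for p, w in zip(obj.samples, obj.weights) if in_rect(p, rect))

# Example: an imprecise GPS fix with three possible true locations.
gps = UncertainPoint(samples=[(1.0, 1.0), (2.0, 2.0), (5.0, 5.0)],
                     weights=[0.5, 0.3, 0.2])
conf = range_query_confidence(gps, (0.0, 0.0, 3.0, 3.0))  # → 0.8
```

Instead of a single boolean "inside/outside" answer, the query returns a confidence value, which is exactly the quantity the abstract refers to when it speaks of estimating the reliability of a query result.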
- Award ID(s):
- 1637541
- Publication Date:
- NSF-PAR ID:
- 10036539
- Journal Name:
- 33rd IEEE International Conference on Data Engineering (ICDE)
- Page Range or eLocation-ID:
- 1467 to 1470
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The Chicxulub impact crater, on the Yucatán Peninsula of México, is unique. It is the only known terrestrial impact structure that has been directly linked to a mass extinction event and the only terrestrial impact with a global ejecta layer. Of the three largest impact structures on Earth, Chicxulub is the best preserved. Chicxulub is also the only known terrestrial impact structure with an intact, unequivocal topographic peak ring. Chicxulub’s role in the Cretaceous/Paleogene (K-Pg) mass extinction and its exceptional state of preservation make it an important natural laboratory for the study of both large impact crater formation on Earth and other planets and the effects of large impacts on the Earth’s environment and ecology. Our understanding of the impact process is far from complete, and despite more than 30 years of intense debate, we are still striving to answer the question as to why this impact was so catastrophic. During International Ocean Discovery Program (IODP) and International Continental Scientific Drilling Program (ICDP) Expedition 364, Paleogene sedimentary rocks and lithologies that make up the Chicxulub peak ring were cored to investigate (1) the nature and formational mechanism of peak rings, (2) how rocks are weakened during large impacts, (3) the…
-
Large-scale real-time analytics services continuously collect and analyze data from end-user applications and devices distributed around the globe. Such analytics requires data to be transferred over the wide-area network (WAN) to data centers (DCs) capable of processing the data. Since WAN bandwidth is expensive and scarce, it is beneficial to reduce WAN traffic by partially aggregating the data closer to end-users. We propose aggregation networks for performing aggregation on a geo-distributed edge-cloud infrastructure consisting of edge servers, transit and destination DCs. We identify a rich set of research questions aimed at reducing the traffic costs in an aggregation network. We present an optimization formulation for solving these questions in a principled manner, and use insights from the optimization solutions to propose an efficient, near-optimal practical heuristic. We implement the heuristic in AggNet, built on top of Apache Flink. We evaluate our approach using a geo-distributed deployment on Amazon EC2 as well as a WAN-emulated local testbed. Our evaluation using real-world traces from Twitter and Akamai shows that our approach is able to achieve 47% to 83% reduction in traffic cost over existing baselines without any compromise in timeliness.
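The core idea above, aggregating partially at the edge before crossing the WAN, can be sketched in a few lines. This is not the AggNet implementation (which runs on Apache Flink); it is a minimal, self-contained illustration of why partial aggregation reduces the number of records transferred, with hypothetical function names.

```python
from collections import Counter
from typing import Dict, List

def partial_aggregate(edge_records: List[str]) -> Dict[str, int]:
    # Each edge server collapses its raw events into per-key counts
    # before sending anything over the WAN.
    return dict(Counter(edge_records))

def merge_at_dc(partials: List[Dict[str, int]]) -> Dict[str, int]:
    # The destination DC merges the partial aggregates from all edges.
    total: Counter = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

edge1 = ["a", "a", "b", "a"]
edge2 = ["b", "b", "a"]

raw_transfer = len(edge1) + len(edge2)        # 7 raw records over the WAN
partials = [partial_aggregate(edge1), partial_aggregate(edge2)]
agg_transfer = sum(len(p) for p in partials)  # 4 key-count pairs instead
result = merge_at_dc(partials)                # → {'a': 4, 'b': 3}
```

Because counts are associative, the merged result at the DC is identical to aggregating the raw stream directly, while the WAN carries far fewer records; the skew between raw and aggregated volume is what the paper's optimization exploits.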
-
The Deep Learning Epilepsy Detection Challenge: design, implementation, and test of a new crowd-sourced AI challenge ecosystem. Isabell Kiral*, Subhrajit Roy*, Todd Mummert*, Alan Braz*, Jason Tsay, Jianbin Tang, Umar Asif, Thomas Schaffter, Eren Mehmet, The IBM Epilepsy Consortium◊, Joseph Picone, Iyad Obeid, Bruno De Assis Marques, Stefan Maetschke, Rania Khalaf†, Michal Rosen-Zvi†, Gustavo Stolovitzky†, Mahtab Mirmomeni†, Stefan Harrer†. * These authors contributed equally to this work. † Corresponding authors: rkhalaf@us.ibm.com, rosen@il.ibm.com, gustavo@us.ibm.com, mahtabm@au1.ibm.com, sharrer@au.ibm.com. ◊ Members of the IBM Epilepsy Consortium are listed in the Acknowledgements section. J. Picone and I. Obeid are with Temple University, USA. T. Schaffter is with Sage Bionetworks, USA. E. Mehmet is with the University of Illinois at Urbana-Champaign, USA. All other authors are with IBM Research in USA, Israel and Australia. Introduction: This decade has seen an ever-growing number of scientific fields benefitting from the advances in machine learning technology and tooling. More recently, this trend reached the medical domain, with applications reaching from cancer diagnosis [1] to the development of brain-machine-interfaces [2]. While Kaggle has pioneered the crowd-sourcing of machine learning challenges to incentivise data scientists from around the world to advance algorithm and model design, the increasing complexity of problem statements demands of participants to be expert data…