Title: A COMPARATIVE STUDY OF METHODS FOR DRIVE TIME ESTIMATION ON GEOSPATIAL BIG DATA: A CASE STUDY IN USA

Abstract. Travel time estimation is crucial for several geospatial research studies, particularly healthcare accessibility studies. This paper presents a comparative study of six methods for drive time estimation on geospatial big data in the USA. The comparison is done with respect to the cost, accuracy, and scalability of these methods. The six methods examined are Google Maps API, Bing Maps API, Esri Routing Web Service, ArcGIS Pro Desktop, OpenStreetMap NetworkX (OSMnx), and Open Source Routing Machine (OSRM). Our case study involves calculating driving times of 10,000 origin-destination (OD) pairs between ZIP code population centroids and pediatric hospitals in the USA. We found that OSRM provides a low-cost, accurate, and efficient solution for calculating travel time on geospatial big data. Our study provides valuable insight into selecting the most appropriate drive time estimation method and is a benchmark for comparing the six different methods. Our open-source scripts are published on GitHub (https://github.com/wybert/Comparative-Study-of-Methods-for-Drive-Time-Estimation) to facilitate further usage and research by the wider academic community.

 
Award ID(s):
1841403
NSF-PAR ID:
10492127
Publisher / Repository:
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Journal Name:
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume:
XLVIII-4/W7-2023
ISSN:
2194-9034
Page Range / eLocation ID:
53 to 60
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Abstract

    Background: Personal privacy is a significant concern in the era of big data. In the field of health geography, personal health data are collected with geographic location information, which may increase disclosure risk and threaten personal geoprivacy. Geomasking is used to protect individuals’ geoprivacy by masking the geographic location information, and spatial k-anonymity is widely used to measure the disclosure risk after geomasking is applied. With the emergence of individual GPS trajectory datasets that contain large volumes of confidential geospatial information, disclosure risk can no longer be comprehensively assessed by the spatial k-anonymity method.

    Methods: This study proposes and develops daily activity locations (DAL) k-anonymity as a new method for evaluating the disclosure risk of GPS data. Instead of calculating disclosure risk based on only one geographic location (e.g., home) of an individual, the new DAL k-anonymity is a composite evaluation of disclosure risk based on all activity locations of an individual and the time he/she spends at each location, abstracted from GPS datasets. With a simulated individual GPS dataset, we present case studies of applying DAL k-anonymity in various scenarios to investigate its performance. The results of applying DAL k-anonymity are also compared with those obtained with spatial k-anonymity under these scenarios.

    Results: The results of this study indicate that DAL k-anonymity provides a better estimation of the disclosure risk than does spatial k-anonymity. In various case-study scenarios of individual GPS data, DAL k-anonymity provides a more effective method for evaluating the disclosure risk by considering the probability of re-identifying an individual’s home and all the other daily activity locations.

    Conclusions: This new method provides a quantitative means for understanding the disclosure risk of sharing or publishing GPS data. It also helps shed new light on the development of new geomasking methods for GPS datasets. Ultimately, the findings of this study will help to protect individual geoprivacy while benefiting the research community by promoting and facilitating geospatial data sharing.
  2. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state of the art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. 
Not every pathological feature is annotated, meaning excluded areas can include focuses particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest performing labels were background, 97% correct identification, and artifact, 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose with how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. 
The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only in situations where a micropathologist diagnosed it as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists employ antigen targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative fields IHC slides provide could play an integral role in machine model pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase of model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to the decrease of 57% in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. 
A CSV version of the annotation file is also available, which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are 16,971 annotated regions, with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split up into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 were allotted to each of the development and evaluation sets, while the remaining 34 were allotted to the training set. The remaining 222 patients were split up to preserve the overall distribution of labels within the corpus. This was done in the hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository-facility) are being digitized in addition to slides provided by Temple University Hospital. These data include 18 different types of tissue, including approximately 38.5% urinary tissue and 16.5% gynecological tissue. 
These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnostic model could support pathologists’ workload and even help prioritize suspected cancerous cases.

    ACKNOWLEDGMENTS

    This material is supported by the National Science Foundation under grants nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

    REFERENCES

    [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104. https://www.springer.com/gp/book/9783030368432.
    [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.isip.piconepress.com/projects/nsf_dpath/.
    [3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040. https://doi.org/10.21437/interspeech.2020-3015.
    [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201.
    [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021].
    [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/.
    [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/.
    [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859.
    [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. https://doi.org/10.5858/arpa.2015-0238-OA.
  3. Abstract

    Quantifying movement and demographic events of free‐ranging animals is fundamental to studying their ecology, evolution and conservation. Technological advances have led to an explosion in sensor‐based methods for remotely observing these phenomena. This transition to big data creates new challenges for data management, analysis and collaboration.

    We present the Movebank ecosystem of tools used by thousands of researchers to collect, manage, share, visualize, analyse and archive their animal tracking and other animal‐borne sensor data. Users add sensor data through file uploads or live data streams and further organize and complete quality control within the Movebank system. All data are harmonized to a data model and vocabulary. The public can discover, view and download data to which they have been given access through the website, the Animal Tracker mobile app or by API. Advanced analysis tools are available through the EnvDATA System, the MoveApps platform and a variety of user‐developed applications. Data owners can share studies with select users or the public, with options for embargos, licenses and formal archiving in a data repository.

    Movebank is used by over 3,100 data owners globally, who manage over 6 billion animal location and sensor measurements across more than 6,500 studies, with thousands of active tags sending over 3 million new data records daily. These data underlie >700 published papers and reports. We present a case study demonstrating the use of Movebank to assess life‐history events and demography, and engage with citizen scientists to identify mortalities and causes of death for a migratory bird.

    A growing number of researchers, government agencies and conservation organizations use Movebank to manage research and conservation projects and to meet legislative requirements. The combination of historic and new data with collaboration tools enables broad comparative analyses and data acquisition and mapping efforts. Movebank offers an integrated system for real‐time monitoring of animals at a global scale and represents a digital museum of animal movement and behaviour. Resources and coordination across countries and organizations are needed to ensure that these data, including those that cannot be made public, remain accessible to future generations.

     
  4. SUMMARY

    Global variations in the propagation of fundamental-mode and overtone surface waves provide unique constraints on the low-frequency source properties and structure of the Earth’s upper mantle, transition zone and mid mantle. We construct a reference data set of multimode dispersion measurements by reconciling large and diverse catalogues of Love-wave (49.65 million) and Rayleigh-wave (177.66 million) dispersion measurements from eight groups worldwide. The reference data set summarizes measurements of dispersion of fundamental-mode surface waves and up to six overtone branches from 44 871 earthquakes recorded on 12 222 globally distributed seismographic stations. Dispersion curves are specified at a set of reference periods between 25 and 250 s to determine propagation-phase anomalies with respect to a reference Earth model. Our procedures for reconciling data sets include: (1) controlling quality and salvaging missing metadata; (2) identifying discrepant measurements and reasons for discrepancies; (3) equalizing geographic coverage by constructing summary rays for travel-time observations and (4) constructing phase velocity maps at various wavelengths with combinations of data types to evaluate inter-dataset consistency. We retrieved missing station and earthquake metadata in several legacy compilations and codified scalable formats to facilitate reproducibility, easy storage and fast input/output on high-performance-computing systems. Outliers can be attributed to cycle skipping, station polarity issues or overtone interference at specific epicentral distances. By assessing inter-dataset consistency across similar paths, we empirically quantified uncertainties in travel-time measurements. More than 95 per cent of measurements of fundamental-mode dispersion are internally consistent, but agreement deteriorates for overtones, especially branches 5 and 6. Systematic discrepancies between raw phase anomalies from various techniques can be attributed to discrepant theoretical approximations, reference Earth models and processing schemes. Phase-velocity variations yielded by the inversion of the summary data set are highly correlated (R ≥ 0.8) with those from the quality-controlled contributing data sets. Long-wavelength variations in fundamental-mode dispersion (50–100 s) are largely independent of the measurement technique, with high correlations extending up to degree ∼25. Agreement degrades with increasing branch number and period; highly correlated structure is found only up to degree ∼10 at longer periods (T > 150 s) and up to degree ∼8 for overtones. Only 2ζ azimuthal variations in phase velocity of fundamental-mode Rayleigh waves were required by the reference data set; maps of 2ζ azimuthal variations are highly consistent between catalogues (R = 0.6–0.8). Reference data with uncertainties are useful for improving existing measurement techniques, validating models of interior structure, calculating teleseismic data corrections in local or multiscale investigations and developing a 3-D reference Earth model.

     
  5. SUMMARY

    Cross-correlations of ambient seismic noise are widely used for seismic velocity imaging, monitoring and ground motion analyses. A typical step in analysing noise cross-correlation functions (NCFs) is stacking short-term NCFs over longer time periods to increase the signal quality. Spurious NCFs could contaminate the stack, degrade its quality and limit its use. Many methods have been developed to improve the stacking of coherent waveforms, including earthquake waveforms, receiver functions and NCFs. This study systematically evaluates and compares the performance of eight stacking methods, including arithmetic mean or linear stacking, robust stacking, selective stacking, cluster stacking, phase-weighted stacking, time–frequency phase-weighted stacking, Nth-root stacking and averaging after applying an adaptive covariance filter. Our results demonstrate that, in most cases, all methods can retrieve clear ballistic or first arrivals. However, they yield significant differences in preserving the phase and amplitude information. This study provides a practical guide for choosing the optimal stacking method for specific research applications in ambient noise seismology. We evaluate the performance using multiple onshore and offshore seismic arrays in the Pacific Northwest region. We compare these stacking methods for NCFs calculated from raw ambient noise (referred to as Raw NCFs) and from ambient noise normalized using a one-bit clipping time normalization method (referred to as One-bit NCFs). We evaluate six metrics, including signal-to-noise ratios, phase dispersion images, convergence rate, temporal changes in the ballistic and coda waves, relative amplitude decays with distance and computational time. We show that robust stacking is the best choice for all applications (velocity tomography, monitoring and attenuation studies) using Raw NCFs. 
For applications using One-bit NCFs, all methods but phase-weighted and Nth-root stacking are good choices for seismic velocity tomography. Linear, robust and selective stacking methods are all equally appropriate choices when using One-bit NCFs for monitoring applications. For applications relying on accurate relative amplitudes, the linear, robust, selective and cluster stacking methods all perform well with One-bit NCFs. The evaluations in this study can be generalized to a broad range of time-series analysis that utilizes data coherence to perform ensemble stacking. Another contribution of this study is the accompanying open-source software package, StackMaster, which can be used for general purposes of time-series stacking.

     