skip to main content

Title: TransRisk: Mobility Privacy Risk Prediction based on Transferred Knowledge
Human mobility data may lead to privacy concerns because a resident can be re-identified from these data by malicious attacks even with anonymized user IDs. For an urban service collecting mobility data, an efficient privacy risk assessment is essential for the privacy protection of its users. The existing methods enable efficient privacy risk assessments for service operators to fast adjust the quality of sensing data to lower privacy risk by using prediction models. However, for these prediction models, most of them require massive training data, which has to be collected and stored first. Such a large-scale long-term training data collection contradicts the purpose of privacy risk prediction for new urban services, which is to ensure that the quality of high-risk human mobility data is adjusted to low privacy risk within a short time. To solve this problem, we present a privacy risk prediction model based on transfer learning, i.e., TransRisk, to predict the privacy risk for a new target urban service through (1) small-scale short-term data of its own, and (2) the knowledge learned from data from other existing urban services. We envision the application of TransRisk on the traffic camera surveillance system and evaluate it with real-world mobility datasets already collected in a Chinese city, Shenzhen, including four source datasets, i.e., (i) one call detail record dataset (CDR) with 1.2 million users; (ii) one cellphone connection data dataset (CONN) with 1.2 million users; (iii) a vehicular GPS dataset (Vehicles) with 10 thousand vehicles; (iv) an electronic toll collection transaction dataset (ETC) with 156 thousand users, and a target dataset, i.e., a camera dataset (Camera) with 248 cameras. The results show that our model outperforms the state-of-the-art methods in terms of RMSE and MAE. Our work also provides valuable insights and implications on mobility data privacy risk assessment for both current and future large-scale services.  more » « less
Award ID(s):
1849238 2047822 1951890 1952096 2003874 1932223 2002985
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Page Range / eLocation ID:
1 to 19
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recently, the ubiquity of mobile devices leads to an increasing demand of public network services, e.g., WiFi hot spots. As a part of this trend, modern transportation systems are equipped with public WiFi devices to provide Internet access for passengers as people spend a large amount of time on public transportation in their daily life. However, one of the key issues in public WiFi spots is the privacy concern due to its open access nature. Existing works either studied location privacy risk in human traces or privacy leakage in private networks such as cellular networks based on the data from cellular carriers. To the best of our knowledge, none of these work has been focused on bus WiFi privacy based on large-scale real-world data. In this paper, to explore the privacy risk in bus WiFi systems, we focus on two key questions how likely bus WiFi users can be uniquely re-identified if partial usage information is leaked and how we can protect users from the leaked information. To understand the above questions, we conduct a case study in a large-scale bus WiFi system, which contains 20 million connection records and 78 million location records from 770 thousand bus WiFi users during a two-month period. Technically, we design two models for our uniqueness analyses and protection, i.e., a PB-FIND model to identify the probability a user can be uniquely re-identified from leaked information; a PB-HIDE model to protect users from potentially leaked information. Specifically, we systematically measure the user uniqueness on users' finger traces (i.e., connection URL and domain), foot traces (i.e., locations), and hybrid traces (i.e., both finger and foot traces). Our measurement results reveal (i) 97.8% users can be uniquely re-identified by 4 random domain records of their finger traces and 96.2% users can be uniquely re-identified by 5 random locations on buses; (ii) 98.1% users can be uniquely re-identified by only 2 random records if both their connection records and locations are leaked to attackers. Moreover, the evaluation results show our PB-HIDE algorithm protects more than 95% users from the potentially leaked information by inserting only 1.5% synthetic records in the original dataset to preserve their data utility. 
    more » « less
  2. Urban anomalies have a large impact on passengers' travel behavior and city infrastructures, which can cause uncertainty on travel time estimation. Understanding the impact of urban anomalies on travel time is of great value for various applications such as urban planning, human mobility studies and navigation systems. Most existing studies on travel time have been focused on the total riding time between two locations on an individual transportation modality. However, passengers often take different modes of transportation, e.g., taxis, subways, buses or private vehicles, and a significant portion of the travel time is spent in the uncertain waiting. In this paper, we study the fine-grained travel time patterns in multiple transportation systems under the impact of urban anomalies. Specifically, (i) we investigate implicit components, including waiting and riding time, in multiple transportation systems; (ii) we measure the impact of real-world anomalies on travel time components; (iii) we design a learning-based model for travel time component prediction with anomalies. Different from existing studies, we implement and evaluate our measurement framework on multiple data sources including four city-scale transportation systems, which are (i) a 14-thousand taxicab network, (ii) a 13-thousand bus network, (iii) a 10-thousand private vehicle network, and (iv) an automatic fare collection system for a public transit network (i.e., subway and bus) with 5 million smart cards. 
    more » « less
  3. Tarolli, P. ; Mudd, S. (Ed.)
    High-resolution topography (HRT) is a powerful observational tool for studying the Earth's surface, vegetation, and urban landscapes, with broad scientific, engineering, and education-based applications. Submeter resolution imaging is possible when collected with laser and photogrammetric techniques using the ground, air, and space-based platforms. Open access to these data and a cyberinfrastructure platform that enables users to discover, manage, share, and process then increases the impact of investments in data collection and catalyzes scientific discovery. Furthermore, open and online access to data enables broad interdisciplinary use of HRT across academia and in communities such as education, public agencies, and the commercial sector. OpenTopography, supported by the US National Science Foundation, aims to democratize access to Earth science-oriented, HRT data and processing tools. We utilize cyberinfrastructure, including large-scale data management, high-performance computing, and service-oriented architectures to provide efficient web-based visualization and access to large, HRT datasets. OT colocates data with processing tools to enable users to quickly access custom data and derived products for their application, with the ultimate goal of making these powerful data easier to use. OT's rapidly growing data holdings currently include 283 lidar and photogrammetric, point cloud datasets (>1.2 trillion points) covering 236,364km2. As a testament to OT's success, more than 86,000 users have processed over 5 trillion lidar points. This use has resulted in more than 290 peer-reviewed publications across numerous academic domains including Earth science, geography, computer science, and ecology. 
    more » « less
  4. With the trend of vehicles becoming increasingly connected and potentially autonomous, vehicles are being equipped with rich sensing and communication devices. Various vehicular services based on shared real-time sensor data of vehicles from a fleet have been proposed to improve the urban efficiency, e.g., HD-live map, and traffic accident recovery. However, due to the high cost of data uploading (e.g., monthly fees for a cellular network), it would be impractical to make all well-equipped vehicles to upload real-time sensor data constantly. To better utilize these limited uploading resources and achieve an optimal road segment sensing coverage, we present a real-time sensing task scheduling framework, i.e., RISC, for Resource-Constraint modeling for urban sensing by scheduling sensing tasks of commercial vehicles with sensors based on the predictability of vehicles' mobility patterns. In particular, we utilize the commercial vehicles, including taxicabs, buses, and logistics trucks as mobile sensors to sense urban phenomena, e.g., traffic, by using the equipped vehicular sensors, e.g., dash-cam, lidar, automotive radar, etc. We implement RISC on a Chinese city Shenzhen with one-month real-world data from (i) a taxi fleet with 14 thousand vehicles; (ii) a bus fleet with 13 thousand vehicles; (iii) a truck fleet with 4 thousand vehicles. Further, we design an application, i.e., track suspect vehicles (e.g., hit-and-run vehicles), to evaluate the performance of RISC on the urban sensing aspect based on the data from a regular vehicle (i.e., personal car) fleet with 11 thousand vehicles. The evaluation results show that compared to the state-of-the-art solutions, we improved sensing coverage (i.e., the number of road segments covered by sensing vehicles) by 10% on average. 
    more » « less
  5. The deep neural networks used in modern computer vision systems require enormous image datasets to train them. These carefully-curated datasets typically have a million or more images, across a thousand or more distinct categories. The process of creating and curating such a dataset is a monumental undertaking, demanding extensive effort and labelling expense and necessitating careful navigation of technical and social issues such as label accuracy, copyright ownership, and content bias.What if we had a way to harness the power of large image datasets but with few or none of the major issues and concerns currently faced? This paper extends the recent work of Kataoka et al. [15], proposing an improved pre-training dataset based on dynamically-generated fractal images. Challenging issues with large-scale image datasets become points of elegance for fractal pre-training: perfect label accuracy at zero cost; no need to store/transmit large image archives; no privacy/demographic bias/concerns of inappropriate content, as no humans are pictured; limitless supply and diversity of images; and the images are free/open-source. Perhaps surprisingly, avoiding these difficulties imposes only a small penalty in performance. Leveraging a newly-proposed pre-training task—multi-instance prediction—our experiments demonstrate that fine-tuning a network pre-trained using fractals attains 92.7-98.1% of the accuracy of an ImageNet pre-trained network. Our code is publicly available. 1 
    more » « less