Title: Removing Personally Identifiable Information from Shared Dataset for Keystroke Authentication Research
Research on keystroke dynamics has good potential to offer continuous authentication that complements conventional authentication methods in combating insider threats and identity theft before more harm can be done to genuine users. Unfortunately, the large amount of data required by free-text keystroke authentication often contains personally identifiable information (PII) and personally sensitive information, such as a user's first and last name, the username and password for an account, bank card numbers, and social security numbers. As a result, there are privacy risks associated with keystroke data that must be mitigated before the data are shared with other researchers. We conduct a systematic study to remove PII from a recent large keystroke dataset. We find substantial amounts of PII in the dataset, including names, usernames and passwords, social security numbers, and bank card numbers, which, if leaked, may lead to various harms to the user, including personal embarrassment, blackmail, financial loss, and identity theft. We thoroughly evaluate the effectiveness of our detection program for each kind of PII. We demonstrate that our PII detection program can achieve near-perfect recall at the expense of losing some useful information (lower precision). Finally, we demonstrate that the removal of PII from the original dataset has only a negligible impact on the detection error tradeoff of the free-text authentication algorithm by Gunetti and Picardi. We hope that this experience report will be useful in informing the design of PII removal in future keystroke-dynamics-based user authentication systems.
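To make the recall-versus-precision tradeoff described in the abstract concrete, here is a minimal sketch of pattern-based PII detection on text reconstructed from keystrokes. It is not the authors' detection program, which the abstract does not specify. The regular expressions, the Luhn checksum filter for card numbers, and the function names (`find_pii`, `redact`, `precision_recall`) are all illustrative assumptions.

```python
import re

# Hypothetical patterns, deliberately loose so that recall stays high;
# the paper's actual detection rules are not given in the abstract.
SSN_RE = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")
CARD_RE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if a digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str):
    """Return (kind, start, end) spans for candidate card numbers and SSNs."""
    hits = []
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):
            hits.append(("card", m.start(), m.end()))
    for m in SSN_RE.finditer(text):
        hits.append(("ssn", m.start(), m.end()))
    return hits


def redact(text: str) -> str:
    """Overwrite every detected span with '#' characters of equal length."""
    chars = list(text)
    for _, start, end in find_pii(text):
        chars[start:end] = "#" * (end - start)
    return "".join(chars)


def precision_recall(predicted: set, actual: set):
    """Precision and recall of predicted PII items against labeled ones."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall
```

Because the SSN pattern accepts any nine-digit run, it will also flag harmless numbers. That is the kind of precision loss the abstract accepts in exchange for near-perfect recall.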
Award ID(s):
1650503
PAR ID:
10136315
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
2019 IEEE 5th International Conference on Identity, Security, and Behavior Analysis
Page Range / eLocation ID:
1 to 7
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Personally Identifiable Information (PII) leakage can lead to identity theft, financial loss, reputation damage, and anxiety. However, individuals remain largely unaware of their PII exposure on the Internet, and whether providing individuals with information about the extent of their PII exposure can trigger privacy protection actions requires further investigation. In this pilot study, grounded in Protection Motivation Theory (PMT), we examine whether receiving privacy alerts in the form of threat and countermeasure information will trigger senior citizens to engage in protective behaviors. We also examine whether providing personalized information moderates the relationship between information and individuals' perceptions. We contribute to the literature by shedding light on the determinants of and barriers to adopting privacy protection behaviors.
  2. Facebook has become an important part of our daily life. From knowing the status of our relatives and showing off a new car to connecting with a high school classmate, abundant personally identifiable information (PII) is made visible to others through posts, images, and news. However, this free flow of information has also created significant cyber-security challenges that make us vulnerable to social engineering and cyber crimes. To confront these challenges, we propose a new behavioral biometric that verifies a user based on his or her widget interaction behavior when using Facebook. Specifically, we monitor activities on the user’s Facebook account using our own logging software and verify the user’s claimed identity with binary classifiers trained with two algorithms (SVM-rbf and GBM, Gradient Boosting Machines). Our novel dataset consists of eight users over a month of data collection, with an average of 2.95k rows of data per user. We convert these activity data into meaningful features such as day-of-week, hour-of-day, widget type, and duration of the mouse staying on a widget. The performance shows that our novel widget interaction modality is promising for authentication. The SVM-rbf classifiers achieve a mean Equal Error Rate (EER) and mean Accuracy (ACC) of 3.91% and 97.79%, while the GBM classifiers achieve a mean EER and ACC of 2.76% and 97.88%, respectively. In addition, we perform an ablation study to understand the impact of individual features on authentication performance. The features are ranked in descending order of importance as hour-of-day, day-of-week, and widget types and duration.
  3. Account recovery is ubiquitous across web applications but circumvents the username/password-based login step. Therefore, it deserves the same level of security as the user authentication process. A common simplistic procedure for account recovery requires that a user enter the same email used during registration, to which a password recovery link or a new username can be sent. Therefore, an impostor with access to a user’s registration email and other credentials can trigger an account recovery session to take over the user’s account. To prevent such attacks, beyond validating the email and other credentials entered by the user, our proposed recovery method utilizes keystroke dynamics to further secure the account recovery mechanism. Keystroke dynamics is a type of behavioral biometrics that uses the analysis of typing rhythm for user authentication. Using a new dataset with over 500,000 keystrokes collected from 44 students and university staff as they filled out a multi-field account recovery web form, we have evaluated the performance of five scoring algorithms on individual fields as well as feature-level fusion and weighted-score fusion. We achieve a best EER of 5.47% when keystroke dynamics from individual fields are used, 0% for a feature-level fusion of five fields, and 0% for a weighted-score fusion of seven fields. Our work represents a new kind of keystroke dynamics that we call ‘medium fixed-text’, as it sits between conventional (short) fixed-text and (long) free-text research.
  4. In critical infrastructure (CI) sectors such as emergency management or healthcare, researchers can analyze and detect useful patterns in data and help emergency management personnel efficaciously allocate limited resources or detect epidemiological spread patterns. However, all of this data contains personally identifiable information (PII) that needs to be safeguarded for legal and ethical reasons. Traditional techniques for safeguarding, such as anonymization, have been shown to be ineffective. Differential privacy is a technique that supports individual privacy while allowing the analysis of datasets for societal benefit. This paper motivates the use of differential privacy to answer a wide range of queries about CI data containing PII with better privacy guarantees than is possible with traditional techniques. Moreover, it introduces a new technique based on Multiple-attribute Workload Partitioning, which does not depend on the nature of the underlying dataset and provides better protection for privacy than current differential privacy approaches.
  5. Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representation from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in the conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against the conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experimental results indicate that the proposed BERT-LDA outperformed the standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII is shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage to their infrastructure.
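Several of the abstracts above report an Equal Error Rate (EER), and the main abstract refers to the detection error tradeoff. As a rough illustration only, the sketch below estimates an EER from two lists of match scores by sweeping a threshold and taking the point where the false accept and false reject rates are closest. The function name `equal_error_rate` and the convention that higher scores mean "more likely genuine" are assumptions. None of the cited papers state how they compute their EER.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from genuine and impostor match scores.

    Assumes higher scores indicate a more likely genuine user. A score is
    accepted when it is >= the threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False reject rate: genuine attempts scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False accept rate: impostor attempts scored at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With finite score lists the two error rates rarely cross exactly, so the sketch reports the average of the two rates at the closest threshold. Interpolating between neighboring thresholds is another common choice.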