Title: Removing Personally Identifiable Information from Shared Dataset for Keystroke Authentication Research
Research on keystroke dynamics has good potential to offer continuous authentication that complements conventional authentication methods in combating insider threats and identity theft before more harm can be done to genuine users. Unfortunately, the large amount of data required by free-text keystroke authentication often contains personally identifiable information (PII) and personally sensitive information, such as a user's first and last name, the username and password for an account, bank card numbers, and social security numbers. As a result, keystroke data carry privacy risks that must be mitigated before the data are shared with other researchers. We conduct a systematic study to remove PII from a recent large keystroke dataset. We find substantial amounts of PII in the dataset, including names, usernames and passwords, social security numbers, and bank card numbers, which, if leaked, may lead to various harms to the user, including personal embarrassment, blackmail, financial loss, and identity theft. We thoroughly evaluate the effectiveness of our detection program for each kind of PII. We demonstrate that our PII detection program can achieve near-perfect recall at the expense of losing some useful information (lower precision). Finally, we demonstrate that removing PII from the original dataset has only a negligible impact on the detection error tradeoff of the free-text authentication algorithm by Gunetti and Picardi. We hope that this experience report will be useful in informing the design of PII removal in future keystroke-dynamics-based user authentication systems.
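As an illustration of the kind of detection and evaluation the abstract describes, the following is a minimal sketch and not the authors' program. It assumes the raw keystroke events have already been reconstructed into a plain string, and every name in it (`SSN_RE`, `CARD_RE`, `luhn_ok`, `find_pii`, `redact`, `precision_recall`) is hypothetical. The patterns are deliberately loose, which mirrors the trade-off in the abstract: favoring recall means some harmless digit runs may be removed as well.

```python
import re

# Hypothetical patterns; the abstract does not state the actual detection rules.
SSN_RE = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list[tuple[int, int, str]]:
    """Return sorted (start, end, kind) spans for candidate card numbers and SSNs."""
    spans = []
    for m in CARD_RE.finditer(text):
        if luhn_ok(re.sub(r"\D", "", m.group())):
            spans.append((m.start(), m.end(), "card"))
    for m in SSN_RE.finditer(text):
        # Skip SSN candidates that overlap an already detected card number.
        if not any(s < m.end() and m.start() < e for s, e, _ in spans):
            spans.append((m.start(), m.end(), "ssn"))
    return sorted(spans)


def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each detected span with a placeholder such as [SSN]."""
    out, prev = [], 0
    for start, end, kind in spans:
        out.append(text[prev:start])
        out.append(f"[{kind.upper()}]")
        prev = end
    out.append(text[prev:])
    return "".join(out)


def precision_recall(predicted: set, truth: set) -> tuple[float, float]:
    """Precision and recall of predicted items against hand-labeled ground truth."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall


if __name__ == "__main__":
    sample = "my ssn is 123-45-6789 and card 4111 1111 1111 1111 ok"
    print(redact(sample, find_pii(sample)))
```

Names, usernames, and passwords cannot be found with fixed patterns like these; detecting them would need context such as the application or form field being typed into, which this sketch does not attempt.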
Award ID(s):
1650503
NSF-PAR ID:
10136315
Journal Name:
2019 IEEE 5th International Conference on Identity, Security, and Behavior Analysis
Page Range / eLocation ID:
1 to 7
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Facebook has become an important part of our daily life. From knowing the status of our relatives and showing off a new car to connecting with a high school classmate, abundant personally identifiable information (PII) is made visible to others through posts, images, and news. However, this free flow of information has also created significant cyber-security challenges that make us vulnerable to social engineering and cyber crimes. To confront these challenges, we propose a new behavioral biometric that verifies a user based on his or her widget interaction behavior when using Facebook. Specifically, we monitor activities on the user’s Facebook account using our own logging software and verify the user’s claimed identity with binary classifiers trained with two algorithms: SVM-rbf and Gradient Boosting Machines (GBM). Our novel dataset consists of eight users over a month of data collection, with an average of 2.95k rows of data per user. We convert this activity data into meaningful features such as day-of-week, hour-of-day, widget type, and the duration the mouse stays on a widget. The performance shows that our novel widget interaction modality is promising for authentication. The SVM-rbf classifiers achieve a mean Equal Error Rate (EER) and mean Accuracy (ACC) of 3.91% and 97.79%, while the GBM classifiers achieve a mean EER and ACC of 2.76% and 97.88%, respectively. In addition, we perform an ablation study to understand the impact of individual features on authentication performance. In descending order of importance, the features rank as hour-of-day, day-of-week, and widget type and duration.
  2. Account recovery is ubiquitous across web applications but circumvents the username/password-based login step. Therefore, it deserves the same level of security as the user authentication process. A common simplistic procedure for account recovery requires that a user enter the same email used during registration, to which a password recovery link or a new username can be sent. Therefore, an impostor with access to a user’s registration email and other credentials can trigger an account recovery session to take over the user’s account. To prevent such attacks, beyond validating the email and other credentials entered by the user, our proposed recovery method utilizes keystroke dynamics to further secure the account recovery mechanism. Keystroke dynamics is a type of behavioral biometrics that analyzes typing rhythm for user authentication. Using a new dataset with over 500,000 keystrokes collected from 44 students and university staff as they filled out a multi-field account recovery web form, we have evaluated the performance of five scoring algorithms on individual fields as well as feature-level fusion and weighted-score fusion. We achieve the best EER of 5.47% when keystroke dynamics from individual fields are used, 0% for a feature-level fusion of five fields, and 0% for a weighted-score fusion of seven fields. Our work represents a new kind of keystroke dynamics that we call ‘medium fixed-text’, as it sits between conventional (short) fixed-text and (long) free-text research.
  3. Recent studies have shown that several government and business organizations have experienced huge data breaches, and data breaches are increasing on a daily basis. The main target for attackers is an organization's sensitive data, which includes personally identifiable information (PII) such as social security numbers (SSN), dates of birth (DOB), and credit card/debit card (CCDC) numbers. The other target is encryption/decryption keys or passwords that give access to the sensitive data. Cloud computing is emerging as a solution to store, transfer, and process data in distributed locations over the Internet. Big data and the Internet of Things (IoT) have increased the possibility of sensitive data exposure. The methods most often used in attacks are hacking, unauthorized access, insider theft, and false data injection on the move. Most attacks happen during one of three states of the data life cycle: data-at-rest, data-in-use, and data-in-transit. Hence, protecting sensitive data in all states, particularly when data is moving to a cloud computing environment, needs special attention. The main purpose of this research is to analyze the risks caused by data breaches and the personal and organizational weaknesses in protecting sensitive data and privacy. The paper discusses methods such as data classification and data encryption at different states to protect personal and organizational sensitive data. The paper also presents a mathematical analysis that leverages the birthday paradox to demonstrate the encryption key attack. The analysis shows that using the same key to encrypt sensitive data in different data states makes the data less secure than using different keys. Our results show that, to improve the security of sensitive data and to reduce data breaches, different keys should be used in different states of the data life cycle.
  4. Keystroke dynamics are a powerful behavioral biometric capable of determining user identity and supporting continuous authentication. The method is unobtrusive and can complement an existing security system such as a password scheme. Existing methods record all keystrokes and use n-graphs, which measure the timing between consecutive keystrokes, to distinguish between users. Current state-of-the-art algorithms report EERs of 7.5% or higher with 1000 characters. With 1000 characters, it takes longer to detect an imposter, and significant damage could be done in the meantime. In this paper, we investigate how quickly a user can be authenticated, that is, how many digraphs are required to accurately detect an imposter in an uncontrolled free-text environment. We present and evaluate the effectiveness of three distance metrics, individually and fused with each other. We show that with just 100 digraphs, about the length of a single sentence, we achieve an EER of 35.3%. At 200 digraphs the EER drops to 15.3%. With more digraphs, the performance continues to improve steadily. With 1000 digraphs the EER drops to 3.6%, which is an improvement over the state of the art.
  5. Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representation from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in the conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against the conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experiment results indicate that the proposed BERT-LDA outperformed the standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII is shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage to their infrastructure.
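
Several of the abstracts above report an Equal Error Rate (EER), the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (genuine users rejected). The sketch below shows one simple way to estimate it from two lists of match scores. It assumes that a higher score means "more likely genuine" and that the threshold is swept over the observed scores; the function name `equal_error_rate` is hypothetical, and none of the listed papers is claimed to compute EER exactly this way.

```python
def equal_error_rate(genuine: list[float], impostor: list[float]) -> float:
    """Estimate the EER by sweeping a threshold over all observed scores.

    A sample is accepted when its score is >= the threshold.
    """
    best_gap = None
    eer = None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine users rejected
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where FAR and FRR are closest and report their mean.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    genuine_scores = [0.9, 0.8, 0.7, 0.4]
    impostor_scores = [0.1, 0.2, 0.3, 0.6]
    print(equal_error_rate(genuine_scores, impostor_scores))  # 0.25
```

With only a handful of scores the estimate is coarse; with larger score sets, interpolating between adjacent thresholds gives a smoother value.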