<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>PILOT: Password and PIN Information Leakage from Obfuscated Typing Videos</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10112529</idno>
					<idno type="doi"></idno>
					<title level='j'>Journal of computer security</title>
<idno>0926-227X</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Matteo Cardaioli Kiran Balagani</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper studies leakage of user passwords and PINs based on observations of typing feedback on screens or from projectors in the form of masked characters ( * or •) that indicate keystrokes. To this end, we developed an attack called Password and Pin Information Leakage from Obfuscated Typing Videos (PILOT). Our attack extracts inter-keystroke timing information from videos of password masking characters displayed when users type their password on a computer, or their PIN at an ATM. We conducted several experiments in various attack scenarios. Results indicate that, while in some cases leakage is minor, it is quite substantial in others. By leveraging inter-keystroke timings, PILOT recovers 8-character alphanumeric passwords in as little as 19 attempts. When guessing PINs, PILOT significantly improved on both random guessing and the attack strategy adopted in our prior work [4]. In particular, we were able to guess about 3% of the PINs within 10 attempts. This corresponds to a 26-fold improvement compared to random guessing. Our results strongly indicate that secure password masking GUIs must consider the information leakage identified in this paper.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Passwords and PINs are susceptible to shoulder surfing attacks <ref type="bibr">[26]</ref>, of which there are two main types: (1) input-based and (2) output-based. The former is more common; in it, the adversary observes an input device (keyboard or keypad) as the user enters a secret (password or PIN) and learns the key-presses. The latter involves the adversary observing an output device (screen or projector) while the user enters a secret which is displayed in cleartext. The principal distinction between the two types is the adversary's proximity: observing input devices requires the adversary to be closer to the victim than observing output devices, which tend to have larger form factors, i.e., physical dimensions.</p><p>Completely disabling on-screen feedback during password/PIN entry (as in, e.g., the Unix sudo command) mitigates output-based shoulder-surfing attacks. Unfortunately, it also impacts usability: when deprived of visual feedback, users cannot determine whether a given key-press was registered and are thus more likely to make mistakes. In order to balance security and usability, user interfaces typically implement password masking by displaying a generic symbol (e.g., "&#8226;" or " * ") after each keystroke. This technique is commonly used on desktops, laptops and smartphones as well as on public devices, such as Automated Teller Machines (ATMs) or Point-of-Sale (PoS) terminals at shops or gas stations.</p><p>Despite the popularity of password masking, little has been done to quantify how visual keystroke feedback impacts security. In particular, masking assumes that showing generic symbols does not reveal any information about the corresponding secret. This assumption seems reasonable, since visual representation of a generic symbol is independent of the key-press. However, in this paper we show that this assumption is incorrect. 
By leveraging precise inter-keystroke timing information leaked by the appearance of each masking symbol, we show that the adversary can significantly narrow down the password/PIN's search space. Put another way, the number of attempts required to brute-force decreases appreciably when the adversary has access to inter-keystroke timing information.</p><p>There are many realistic settings where visual inter-keystroke timing information (leaked via appearance of masking symbols) is readily available while the input information is not, i.e., the input device is not easily observable. For example, in a typical lecture or classroom scenario, the presenter's keyboard is usually out of sight, while the external projector display is wide-open for recording. Similarly, in a multi-person office scenario, an adversarial co-worker can surreptitiously record the victim's screen. The same holds in public scenarios, such as PoS terminals and ATMs, where displays (though smaller) tend to be easier to observe and record than entry keypads.</p><p>In this paper we consider two representative scenarios: (1) a presenter enters a password into a computer connected to an external projector; (2) a user enters a PIN at an ATM in a public location. The adversary is assumed to record keystroke feedback from the projector display or an ATM screen using a dedicated video camera or a smartphone. We note that a human adversary does not need to be present during the attack: recording might be done via an existing camera either pre-installed or pre-compromised by the adversary, possibly remotely, e.g., as in the infamous Mirai botnet <ref type="bibr">[17]</ref>.</p><p>Contributions. The main goal of this paper is to quantify the amount of information leaked through video recordings of on-screen keystroke feedback. To this end, we conducted extensive data collection experiments that involved 84 subjects. 
<ref type="foot">1</ref> Each subject was asked to type passwords or PINs while the screen or projector was video-recorded using either a commodity video camera or a smartphone camera. Based on this, we determined the key statistical properties of the resulting data, and set up an attack called PILOT: Password and Pin Information Leakage from Obfuscated Typing Videos. It allows us to quantify reduction in brute-force search space due to timing information. PILOT leverages multiple publicly available typing datasets to extract population timings, and applies this information to inter-keystroke timings extracted from videos.</p><p>Our results show that video recordings can be effective in extracting precise inter-keystroke timing information. Experiments show that PILOT substantially reduces the search space for each password, even when the adversary has no access to user-specific keystroke templates. When run on passwords, PILOT performed better than random guessing between 87% and 100% of the time, depending on the password and the machine learning technique used to instantiate the attack. The resulting average speedup is between 25% and 385% (depending on the password), compared to random dictionary-based guessing; some passwords were correctly guessed in as few as 68 attempts. A single password timing disclosure is enough for PILOT to successfully achieve these results. However, when the adversary observes the user entering the password three times, PILOT can crack the password in as few as 19 attempts. Clearly, PILOT's capabilities depend in part on the strength of a specific password. With very common passwords, benefits of PILOT are limited. Meanwhile, we show that PILOT substantially outperforms random guessing with less common passwords. With PINs, disclosure of timing poses an effective risk. The PIN guessing algorithm can reduce the number of attempts by up to 26 times compared to random guessing.</p><p>Paper Organization. 
Section 2 reviews the state-of-the-art in password guessing based on timing attacks. Section 3 presents PILOT and the adversary model. Section 4 discusses our data collection and experiments. We then present the results on password guessing using PILOT in Section 5, and on PIN guessing in Section 6. The paper concludes with the summary and future work directions in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>There is a large body of prior work on timing attacks in the context of keyboard-based password entry. Song et al. <ref type="bibr">[24]</ref> demonstrated a weakness that allows the adversary to extract information about passwords typed during SSH sessions. The attack relies on the fact that, to minimize latency, SSH transmits each keystroke immediately after entry, in a separate IP packet. By eavesdropping on such packets, the adversary can collect accurate inter-keystroke timing information. Song et al. <ref type="bibr">[24]</ref> showed that this information can be used to restrict the search space of passwords. The impact of this work is significant, because it shows the power of timing attacks on cracking passwords.</p><p>There are several studies of keystroke inference from analysis of video recordings. Balzarotti et al. <ref type="bibr">[5]</ref> addressed the typical shoulder-surfing scenario, where a camera tracks hand and finger movements on the keyboard. Text was automatically reconstructed from resulting videos. Similarly, Xu et al. <ref type="bibr">[33]</ref> recorded user's finger movements on mobile devices to infer keystroke information. Unfortunately, neither attack applies to our sample scenarios, where the keyboard is invisible to the adversary. Shukla et al. <ref type="bibr">[23]</ref> showed that text can be inferred even from videos where the keyboard/keypad is not visible. This attack involved analyzing video recordings of the back of the user's hand holding a smartphone in order to infer which location on the screen is tapped. By observing the motion of the user's hand, the path of the finger across the screen can be reconstructed, which yields the typed text. In a similar attack, Sun et al. 
<ref type="bibr">[25]</ref> successfully reconstructed text typed on tablets by recording and analyzing the tablet's movements, rather than movements of the user's hands.</p><p>The closest work to this paper is our prior work <ref type="bibr">[4]</ref>, in which we showed that passwords can be inferred with a higher probability than random guessing using the timing information from on-screen keystroke feedback. However, in <ref type="bibr">[4]</ref> we concluded that the timing information is not helpful in inferring PINs. In this paper, we revisit our earlier conclusion on inferring PINs and show that it is incorrect. In fact, the attack strategy employed in this paper yielded a 26-fold improvement in inferring PINs over random guesses and significantly outperforms <ref type="bibr">[4]</ref> in terms of number of PINs recovered within a small number of attempts.</p><p>Another line of work aimed to quantify keystroke information inadvertently leaked by motion sensors. Owusu et al. <ref type="bibr">[19]</ref> studied this in the context of a smartphone's inertial sensors while the user types using the on-screen keyboard. The application used to implement this attack does not require special privileges, since modern smartphone operating systems do not require explicit authorization to access inertial sensor data. Similarly, Wang et al. <ref type="bibr">[30]</ref> explored keystroke information leakage from inertial sensors on wearable devices, e.g., smartwatches and fitness trackers. By estimating the motion of a wearable device placed on the wrist of the user, movements of the user's hand over a keyboard can be inferred. This allows learning which keys were pressed during the hand's path. Compared to our work, both <ref type="bibr">[19]</ref> and <ref type="bibr">[30]</ref> require a substantially higher level of access to the user's device. 
To collect data from inertial sensors the adversary must have previously succeeded in deceiving the user into installing a malicious application, or otherwise compromised the user's device. In contrast, PILOT is a fully passive attack.</p><p>Acoustic emanations represent another effective side-channel for keystroke inference. This class of attacks is based on the observation that different keyboard keys emit subtly different sounds when pressed. This information can be captured (1) locally, using microphones placed near the keyboard <ref type="bibr">[3,</ref><ref type="bibr">35]</ref>, or (2) remotely, via Voice-over-IP <ref type="bibr">[10]</ref>. Also, acoustic emanations captured using multiple microphones can be used to extract the locations of keys on a keyboard. As shown by Zhou et al. <ref type="bibr">[34]</ref>, recordings from multiple microphones can be used to accurately quantify the time difference of arrival (TDoA), and thus triangulate the positions of pressed keys.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">System and Adversary Model</head><p>We now present the system and adversary model used in the rest of the paper. We model a user logging in (authenticating) to a computer system or an ATM using a PIN or a password (secret) entered via a keyboard or keypad (input device). The user receives immediate feedback about each key-press from a screen, a projector, or both (output device) in the form of dots or asterisks (masking symbols). The shape and/or location of each masking symbol does not depend on which key is pressed. The adversary can observe and record the output device(s), though not the input device or the user's hands. An example of this scenario is shown in Figure <ref type="figure">1</ref>. The adversary's goal is to learn the user's secret.</p><p>The envisaged attack setting is representative of many real-world scenarios that involve low-privilege adversaries, including:</p><p>(1) A presenter in a lecture or conference who types a password while the screen is displayed on a projector. The entire audience can see the timing of appearance of masking symbols, and the adversary can be anyone in the audience. (2) An ATM customer typing a PIN. The adversary who stands in line behind the user might have an unobstructed view of the screen, and the timing of appearance of masking symbols (see Figure <ref type="figure">2</ref>). (3) A customer enters her debit card PIN at a self-service gas-station pump. In this case, the adversary can be anyone in the surroundings with a clear view of the pump's screen.</p><p>Although these scenarios seem to imply that adversary is located near the user, proximity is not a requirement for our attack. 
For instance, the adversary could watch a prior recording of the lecture in scenario (1); or could monitor the ATM using a CCTV camera in (2); or could remotely view the screen in (3) through a compromised IoT camera.</p><p>Also, we assume that, in many cases, the attack involves multiple observations. For example, in scenario (1), the adversary could observe the presenter during multiple talks, without the presenter changing passwords between talks. Similarly, in scenario (2), customers often return to the same ATM.</p><p>were males in their 20s, with a technical background and good typing skills. We briefed each subject on the nature of the experiment, and asked them to type four alphanumerical passwords: "jillie02", "william1", "123brian", and "lamondre". We selected these passwords uniformly at random from the RockYou dataset <ref type="bibr">[1]</ref> in order to simulate realistic passwords. The subjects typed each password three times, while our data collection software recorded ground-truth keystroke timings of correctly typed passwords with millisecond accuracy. Timings from passwords that were typed incorrectly were discarded, and subjects were prompted to re-type the password whenever a mistake was made. The typing procedure lasted between 1 and 2 minutes, depending on the subject's typing skills. All subjects typed with the "touch typing" technique, i.e., using fingers from both hands.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">PINs</head><p>We recorded subjects entering 4-digit PINs on a simulated ATM, shown in Figure <ref type="figure">3</ref>. Our dataset was based on experiments with 22 participants; 19 subjects completed three data collection sessions, while 4 subjects completed only one session, resulting in a total of 61 sessions. At the beginning of each session, the subject was given 45 seconds to get accustomed to the keypad of the ATM simulator. During this time, they were free to type as they pleased. Next, a subject was shown a PIN on the screen for ten seconds (Figure <ref type="figure">4a</ref>), and, once it disappeared from the screen, asked to enter it four times (Figure <ref type="figure">4b</ref>). Subjects were advised not to read the PINs out loud. This process was repeated for 15 consecutive PINs. During each session, subjects were presented with the same 15-PIN sequence 3 times. Subjects were given a 30-second break at the end of each sequence. Specific 4-digit PINs were selected to test whether: (1) inter-keypress time is proportional to the Euclidean distance between keys on the keypad; and (2) the direction of movement (up, down, left, or right) between consecutive keys in a keypair impacts the corresponding inter-key time. We show an example of these two situations on the ATM keypad in Figure <ref type="figure">5</ref>. We chose a set of PINs that allowed collection of a significant number of key combinations appropriate for testing both hypotheses. For instance, PIN 3179 tested horizontal and vertical distance two, while 1112 tested distance 0 and horizontal distance 1.</p><p>keystroke timings <ref type="bibr">[7]</ref>. Input to RF is one inter-keystroke timing, and its output is a list of N digraphs ranked based on the probability of corresponding to input timing. NN is a more complex architecture designed to automatically determine and extract complex features from the input distribution. 
In our experiments, the input to NN is a list of inter-keystroke timings corresponding to a password. This enables NN to extract features, such as arbitrary n-grams, or timings corresponding to non-consecutive characters. NN's output is a guess for the entire password. We instantiated NN using the following parameters:</p><p>&#8226; number of units in the hidden layer: 128 (with ReLU activation functions);</p><p>&#8226; inclusion probability of the dropout layer: 0.2;</p><p>&#8226; number of input neurons: 25;</p><p>&#8226; number of output layers: 25 which represents one character in one-hot encoding. Output layers use the softmax activation function;</p><p>&#8226; training was performed using a batch size of 40 for 100 epochs. We used the Adam optimizer with a learning rate of 0.001.</p><p>Classifier Training. We trained PILOT on three public datasets <ref type="bibr">[6,</ref><ref type="bibr">21,</ref><ref type="bibr">29</ref>] that contain keystroke timing information collected from English free-text. Using these datasets for training, we modeled an attack that relies exclusively on population data. Without loss of generality, we filtered the datasets to remove all timings that do not correspond to digraphs composed of alphanumeric lowercase characters. This is motivated by the datasets' limited availability of digraph samples that contain special characters. In practice, the adversary could collect these timings using, for instance, crowdsourcing tools such as Amazon Mechanical Turk. To take care of uneven frequencies of digraphs, we under-represented the most frequent digraphs in the dataset. Data in public datasets was often gathered from free-text typing of volunteers. Therefore, more frequent digraphs in English were represented more than rarer ones. For example, considering lamondre, digraph re appears 43,606 times in the population dataset, while am has only 6,841. Similarly, in 123brian, digraph ri occurs 19,782 times, while 3b has only 138 occurrences. 
We therefore under-sampled each digraph appearing more than 1,000 times to 1,000 randomly selected occurrences. Similarly, we excluded infrequent digraphs that appeared under 100 times in the whole dataset.</p></div>
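<div xmlns="http://www.tei-c.org/ns/1.0"><p>The filtering and balancing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function name balance_digraphs, its parameters (cap, floor, seed), and the input format (an iterable of (digraph, timing) pairs) are hypothetical. Only the thresholds of 1,000 and 100 and the restriction to lowercase alphanumeric digraphs come from the text.</p>
<eg><![CDATA[
```python
import random
import string
from collections import defaultdict

# Characters allowed in a digraph: lowercase ASCII letters and digits.
ALLOWED = set(string.ascii_lowercase + string.digits)


def balance_digraphs(samples, cap=1000, floor=100, seed=0):
    """Group timings by digraph, then balance the groups.

    samples: iterable of (digraph, timing) pairs (hypothetical format).
    cap:     digraphs with more than `cap` timings are randomly
             under-sampled to exactly `cap` timings.
    floor:   digraphs with fewer than `floor` timings are dropped.
    Returns a dict mapping each retained digraph to its list of timings.
    """
    by_digraph = defaultdict(list)
    for digraph, timing in samples:
        # Keep only two-character lowercase alphanumeric digraphs.
        if len(digraph) == 2 and set(digraph) <= ALLOWED:
            by_digraph[digraph].append(timing)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = {}
    for digraph, timings in by_digraph.items():
        if len(timings) < floor:
            continue  # too rare to be represented reliably
        if len(timings) > cap:
            timings = rng.sample(timings, cap)  # sample without replacement
        balanced[digraph] = timings
    return balanced
```
]]></eg>
<p>With these thresholds, a digraph such as re (43,606 occurrences) would be reduced to 1,000 timings, while 3b (138 occurrences) would be kept unchanged.</p></div>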
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Attack Process.</head><p>To infer the user's secret from inter-keystroke timings, PILOT leverages a dictionary of passwords (e.g., a list of passwords leaked by online services <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">12,</ref><ref type="bibr">27]</ref>), possibly expanded using techniques such as probabilistic context-free grammars <ref type="bibr">[32]</ref> and generative adversarial networks <ref type="bibr">[15]</ref>. When evaluating PILOT, we assume that the user's secret is in the dictionary. In practice, this is often the case, as many users use the same weak passwords (e.g., only 36% of the passwords in RockYou are unique <ref type="bibr">[18]</ref>), and reuse them across many different services <ref type="bibr">[14,</ref><ref type="bibr">31]</ref>. Given that the size of a reasonable password dictionary is on the order of billions of entries,<ref type="foot">2</ref> the goal of PILOT is to narrow down the possible passwords to a small(er) list, e.g., to perform online attacks. This list is then ranked by the probability associated with each entry, computed from inter-keystroke timing data. Specifically:</p><p>(1) Using RF, for each inter-key time extracted from video (corresponding to a digraph), PILOT returns a list of N possible guesses, sorted by the classifier's confidence. Next, PILOT ranks the passwords in the dictionary by resulting probabilities as follows: for each password, PILOT identifies the position in the ranked list of predictions for the first digraph of the password being guessed, and assigns that position as a "penalty" to the password. By performing these steps for each digraph, PILOT obtains a total penalty score for each password, i.e., a score that indicates the probability of the password given the output of the RF. For example, to rank the password jillie02, PILOT first considers the digraph ji, and the list of predictions of RF for the first digraph. 
It notes that ji appears in this list as the X-th most probable; therefore, it assigns X as the penalty for jillie02. Then, it considers il, which appears in Y-th position in the list of predictions for the second digraph. The penalty for jillie02 is thus updated to X + Y. This operation is repeated for all the 7 digraphs, thus obtaining the final penalty score. (2) Using NN, PILOT computes a list of N possible guesses, sorted by the classifier's confidence of each guess. In this case, PILOT processes the entire list of flight times at once, rather than refining its guess with each digraph.</p><p>We considered the following attack settings: single-shot, and multiple recordings. With the former, the adversary trains PILOT with inter-keystroke timings from population data, i.e., from users other than the target, e.g., from publicly available datasets, or by recruiting users and asking them to type passwords. In this scenario, the adversary has access to the video recording of a single password entry session. With multiple recordings, the adversary trains PILOT as before, and additionally, has access to videos of multiple login instances by the same user.</p><p>Training PILOT exclusively with population data leads to more realistic attack scenarios than training it with user-specific data, because usually the adversary has limited access to keystroke samples from the target user. Further, access to user-specific data will likely improve the success rate of PILOT.</p></div>
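<div xmlns="http://www.tei-c.org/ns/1.0"><p>The penalty-based ranking used with RF in step (1) can be illustrated with a short sketch. It is not the implementation used in the experiments: the function name rank_by_penalty and the input format (one ranked list of digraphs per observed inter-key time) are hypothetical. The text does not say how a digraph that is absent from the list of N predictions is scored; the sketch assumes a penalty of one more than the list length. Breaking ties alphabetically is likewise an assumption.</p>
<eg><![CDATA[
```python
def rank_by_penalty(dictionary, predictions):
    """Rank candidate passwords by their total digraph penalty.

    dictionary:  iterable of candidate passwords (strings).
    predictions: predictions[i] is the classifier's list of digraphs for
                 the i-th observed inter-key time, most probable first.
    Returns the candidates of matching length, lowest penalty first.
    """
    # For each timing, map digraph -> 1-based position in the ranked list.
    positions = [
        {digraph: rank for rank, digraph in enumerate(ranked, start=1)}
        for ranked in predictions
    ]
    n_digraphs = len(predictions)

    scored = []
    for password in dictionary:
        # A password with k characters has k - 1 digraphs.
        if len(password) != n_digraphs + 1:
            continue
        penalty = 0
        for i in range(n_digraphs):
            digraph = password[i:i + 2]
            # ASSUMPTION: a digraph missing from the predictions is
            # penalized as if it ranked just past the end of the list.
            missing = len(predictions[i]) + 1
            penalty += positions[i].get(digraph, missing)
        scored.append((penalty, password))

    # Lowest total penalty first; ties are broken alphabetically.
    scored.sort()
    return [password for _, password in scored]
```
]]></eg>
<p>For an 8-character password such as jillie02, predictions would hold seven ranked lists, and the total penalty is the sum X + Y + ... over the seven digraph positions.</p></div>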
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Results</head><p>In this section, we report on PILOT's efficacy in reducing search time on the RockYou <ref type="bibr">[1]</ref> password dataset compared to random choice, weighted by probability. We restricted experiments to the subset of 8-character passwords from RockYou, since the adversary can always determine password length by counting the number of masking symbols shown on the screen. This resulted in 6,514,177 passwords, of which 2,967,116 were unique.</p><p>Attack Baseline. To establish the attack baseline, we consider an adversary that outputs password guesses from a leaked dataset in descending order of frequency. (Ties are broken using random selection from the candidate passwords.) Because password probabilities are far from uniform (e.g., in RockYou, the top 200 8-character passwords account for over 10% of the entire dataset), this is the best adversarial strategy given no additional information on the target user.</p><p>Passwords selected for our evaluation represent a mix of common and rare passwords. Thus, they have widely varying frequencies of occurrence in RockYou and the expected number of attempts needed to guess each password using the baseline attack varies significantly. For example, the expected number of attempts for:</p><p>&#8226; 123brian (appears 6 times) is 93,874;</p><p>We believe that the discrepancy between the performance of PILOT on various passwords is due to how frequently the digraphs in each password appear in training data. Specifically, even with our under-representation, all digraphs in william1, with the exception of m1, are far more frequent in the training data than 12, 23, 3b, or 02.</p><p>Regarding specific classifiers, RF overtakes NN in most instances. For example, when guessing 123brian (Figure <ref type="figure">8a</ref>), NN performs worse than random guessing for the first 800,000 attempts. Afterwards, NN outperforms both random guessing and RF. 
Furthermore, while RF can guess a substantial percentage of passwords within 20,000, 50,000 and 100,000 attempts, NN cannot achieve the same result.</p><p>In terms of minimum number of guesses per password, RF recovered william1 in 68, lamondre in 145, 123brian in 5,535, and jillie02 in 28,962 attempts. NN required a consistently higher minimum number of attempts for each password.</p><p>Multiple Recordings. Information from three login instances was used as follows. We averaged classifiers' predictions over three login instances for a given user, and ranked passwords accordingly.</p><p>The results are summarized in Table <ref type="table">2</ref> and Figure <ref type="figure">9</ref>. Although PILOT still consistently outperforms random guessing, using data from multiple authentication recordings leads to mostly identical results overall with both RF and NN. PILOT's guessing success rate for 123brian and jillie02 is slightly improved compared to the previous setting and minimum number of attempts to recover each password diminished slightly. We recovered william1 in 19, lamondre in 404, 123brian in 13,931, and jillie02 in 67,875 attempts. Overall, the results show that there are no substantial benefits in using timing data from three recordings from the same user.</p><p>compared to one in 250 with no timing information. While this result is not as dramatic as the one with passwords, it suggests that keystroke timing information should be carefully concealed by ATMs.</p><p>Clearly, the benefits of PILOT compared to our baseline attack vary depending on how common the user's password is. For very common (and therefore very easy to guess) passwords, our results show</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Where required, IRB approvals were duly obtained prior to the experiments.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>See for example the lists maintained by https://haveibeenpwned.com/.</p></note>
		</body>
		</text>
</TEI>
