skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Clid: Identifying TLS Clients With Unsupervised Learning on Domain Names
In this paper, we introduce Clid, a Transport Layer Security (TLS) client identification tool based on unsupervised learning on domain names from the server name indication (SNI) field. Clid aims to provide some information on a wide range of clients, even though it may not be able to identify a definitive characteristic about each one of the clients. This is a different approach from that of many existing rule-based client identification tools that rely on hardcoded databases to identify granular characteristics of a few clients. Often times, these tools can identify only a small number of clients in a real-world network as their databases grow outdated, which motivates an alternative approach like Clid. For this research, we utilize some 345 million anonymized TLS handshakes collected from a large university campus network. From each handshake, we create a TCP fingerprint – comprising IP flags, time-to-live (TTL), TCP window size, initial sequence number, window size, flags, header length, options, max segment size, and window scaling – that identifies each unique client that corresponds to a physical device on the network. Clid uses Bayesian optimization to find the optimal (in a precise sense that we define later) Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering of clients and domain names for a set of TLS connections. Clid maps each client cluster to one or more domain clusters that are most strongly associated with it based on the frequency and exclusivity of their TLS connections. While learning highly associated domain names of a client may not immediately tell us specific characteristics of the client like its the operating system, manufacturer, or TLS configuration, it may serve as a strong first step to doing so. There exists prior work [31, 22] that uses the SNI field for client identification. We evaluate Clid’s performance on various subsets of our captured TLS handshakes and on different parameter settings that affect the granularity of identification results. Our experiments show that Clid is able to identify the single most associated domain cluster (a group of similar domain names in a precise sense that we define in §5.3) for at most 90% of clients in 10,000 TLS connections for a real-world traffic. When one or more domain clusters were allowed to be mapped to a single client cluster, Clid identified such domain names for at least 60% of all clients in all our experiments.  more » « less
Award ID(s):
2319080 2124424
PAR ID:
10562783
Author(s) / Creator(s):
;
Publisher / Repository:
ArXiv
Date Published:
Format(s):
Medium: X
Institution:
Stanford University
Sponsoring Org:
National Science Foundation
More Like this
  1. ShadowTLS is a new type of circumvention tool where the relay forwards traffic to a legitimate (unblocked) TLS server until the end of the handshake, and then connects the client to a hidden proxy server (e.g. Shadowsocks). In contrast to previous probe-resistant proxies, this design can evade SNI- based blocking, since to the censor it appears as a legitimate TLS connection to an unblocked domain. In this paper, we describe several attacks against Shad- owTLS which would allow a censor to identify if a suspected IP is hosting a ShadowTLS relay or not (and block it accord- ingly), distinguishing it from the legitimate TLS servers it mimics. Our attacks require only a few TCP connections to the suspected IP, a capability that censors including China have already demonstrated in order to block previous proxies. We evaluate these vulnerabilities by performing Internet- wide scans to discover potential ShadowTLS relays, and find over 15K of them. We also describe mitigations against this attack that ShadowTLS (and proxies like it) can implement, and work with the ShadowTLS developers to deploy these fixes. 
    more » « less
  2. X.509 certificate revocation defends against man-in-the-middle attacks involving a compromised certificate. Certificate revocation strategies face scalability, effectiveness, and deployment challenges as HTTPS adoption rates have soared. We propose Certificate Revocation Table (CRT), a new revocation strategy that is competitive with or exceeds alternative state-of-the-art solutions in effectiveness, efficiency, certificate growth scalability, mass revocation event scalability, revocation timeliness, privacy, and deployment requirements. The CRT design assumes that locality of reference applies to the certificates accessed by an organization. The CRT periodically checks the revocation status of X.509 certificates recently used by the organization. Pre-checking the revocation status of certificates the clients are likely to use avoids the security problems of on-demand certificate revocation checking. To validate both the effectiveness and efficiency of our approach, we simulated a CRT using 60 days of TLS traffic logs from Brigham Young University to measure the effects of actively refreshing revocation status information for various certificate working set window lengths. A working set window size of 45 days resulted in an average of 99.86% of the TLS handshakes having revocation information cached in advance. The CRT storage requirements are small. The initial revocation status information requires downloading a 6.7 MB file, and subsequent updates require only 205.1 KB of bandwidth daily. Updates that include only revoked certificates require just 215 bytes of bandwidth per day. 
    more » « less
  3. We present the first formal-methods analysis of the Session Binding Proxy (SBP) protocol, which protects a vulnerable system by wrapping it and introducing a reverse proxy between the system and its clients. SBP mitigates thefts of authentication cookies by cryptographically binding the authentication cookie---issued by the server to the client---to an underlying Transport Layer Security (TLS) channel using the channel's master secret and a secret key known only by the proxy. An adversary who steals a bound cookie cannot reuse this cookie to create malicious requests on a separate connection because the cookie's channel binding will not match the adversary's channel. SBP seeks to achieve this goal without modifications to the client or the server software, rendering the client and server ``oblivious protocol participants'' that are not aware of the SBP session. Our analysis verifies that the original SBP design mitigates cookie stealing under the client's cryptographic assumptions but fails to authenticate the client to the proxy. Resulting from two issues, the proxy has no assurance that it shares a session context with a legitimate client: SBP assumes an older flawed version of TLS (1.2), and SBP relies on legacy server usernames and passwords to authenticate clients. Due to these issues, there is no guarantee of cookie-stealing resistance from the proxy's cryptographic perspective. Using the Cryptographic Protocol Shapes Analyzer (CPSA), we model and analyze the original SBP and three variations in the Dolev-Yao network intruder model. Our models differ in the version of TLS they use: 1.2 (original SBP), 1.2 with mutual authentication, 1.3, and {\it 1.3 with mutual authentication (mTLS-1.3)}. For comparison, we also analyze a model of the baseline scenario without SBP. We separately analyze each of our SBP models from two perspectives: client and proxy. In each SBP model, the client has assurance that the cookie is valid only for the client's legitimate session. Only in mTLS-1.3 does the proxy have assurance that it communicates with a legitimate client and that the client's cookie is valid. We formalize these results by stating and proving, or disproving, security goals for each model. SBP is useful because it provides a practical solution to the important challenge of protecting flawed legacy systems that cannot be patched. Our analysis of this obscure protocol sheds insight into the properties necessary for wrapper protocols to resist a Dolev-Yao adversary. When engineering wrapper protocols, designers must carefully consider authentication, freshness, and requirements of cryptographic bindings such as channel bindings. Our work exposes strengths and limitations of wrapper protocols and TLS channel bindings. 
    more » « less
  4. DNS latency is a concern for many service operators: CDNs exist to reduce service latency to end-users but must rely on global DNS for reachability and load-balancing. Today, DNS latency is monitored by active probing from distributed platforms like RIPE Atlas, with Verfploeter, or with commercial services. While Atlas coverage is wide, its 10k sites see only a fraction of the Internet. In this paper we show that passive observation of TCP handshakes can measure \emph{live DNS latency, continuously, providing good coverage of current clients of the service}. Estimating RTT from TCP is an old idea, but its application to DNS has not previously been studied carefully. We show that there is sufficient TCP DNS traffic today to provide good operational coverage (particularly of IPv6), and very good temporal coverage (better than existing approaches), enabling near-real time evaluation of DNS latency from \emph{real clients}. We also show that DNS servers can optionally solicit TCP to broaden coverage. We quantify coverage and show that estimates of DNS latency from TCP is consistent with UDP latency. Our approach finds previously unknown, real problems: \emph{DNS polarization} is a new problem where a hypergiant sends global traffic to one anycast site rather than taking advantage of the global anycast deployment. Correcting polarization in Google DNS cut its latency from 100ms to 10ms; and from Microsoft Azure cut latency from 90ms to 20ms. We also show other instances of routing problems that add 100--200ms latency. Finally, \emph{real-time} use of our approach for a European country-level domain has helped detect and correct a BGP routing misconfiguration that detoured European traffic to Australia. We have integrated our approach into several open source tools: Entrada, our open source data warehouse for DNS, a monitoring tool (ANTS), which has been operational for the last 2 years on a country-level top-level domain, and a DNS anonymization tool in use at a root server since March 2021. 
    more » « less
  5. R-tree is a foundational data structure used in spatial databases and scientific databases. With the advancement of networks and computer architectures, in-memory data processing for R-tree in distributed systems has become a common platform. We have observed new performance challenges to process R-tree as the amount of multidimensional datasets become increasingly high. Specifically, an R-tree server can be heavily overloaded while the network and client CPU are lightly loaded, and vice versa. In this article, we present the design and implementation of Catfish, an RDMA-enabled R-tree for low latency and high throughput by adaptively utilizing the available network bandwidth and computing resources to balance the workloads between clients and servers. We design and implement two basic mechanisms of using RDMA for a client-server R-tree data processing system. First, in the fast messaging design, we use RDMA writes to send R-tree requests to the server and let server threads process R-tree requests to achieve low query latency. Second, in the RDMA offloading design, we use RDMA reads to offload tree traversal from the server to the client, which rescues the server as it is overloaded. We further develop an adaptive scheme to effectively switch an R-tree search between fast messaging and RDMA offloading, maximizing the overall performance. Our experiments show that the adaptive solution of Catfish on InfiniBand significantly outperforms R-tree that uses only fast messaging or only RDMA offloading in both latency and throughput. Catfish can also deliver up to one order of magnitude performance over the traditional schemes using TCP/IP on 1 and 40 Gbps Ethernet. We make a strong case to use RDMA to effectively balance workloads in distributed systems for low latency and high throughput. 
    more » « less