This content will become publicly available on June 28, 2025
- PAR ID: 10520232
- Publisher / Repository: NAACL
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Boldyreva, A.; Kolesnikov, V. (Eds.) In recent work, Backendal, Haller, and Paterson identified several exploitable vulnerabilities in the cloud storage provider MEGA. They demonstrated an RSA key recovery attack in which a malicious server could recover a client's private RSA key after 512 client login attempts. We show how to exploit additional information revealed by MEGA's protocol vulnerabilities to give an attack that requires only six client logins to recover the secret key. Our optimized attack combines several cryptanalytic techniques. In particular, we formulate and give a solution to a variant of the hidden number problem with small unknown multipliers, which may be of independent interest. We show that our lattice construction for this problem can be used to give improved results for the implicit factorization problem of May and Ritzenhofen.
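The small-unknown-multiplier setting can be illustrated with a toy orthogonal-lattice computation. The sketch below (Python, assuming the fpylll library for LLL reduction) generates samples a_i = t_i·x + e_i mod N with small unknown t_i and e_i, and recovers the multipliers up to sign and a common factor from short vectors orthogonal to them. The construction, parameters, and names are illustrative, not the paper's exact lattice.

```python
# Toy sketch of the hidden number problem with small unknown multipliers:
# given a_i = t_i * x + e_i (mod N) with small unknown t_i and e_i and a
# hidden x, recover the t_i via an orthogonal-lattice step. Illustrative
# only; assumes fpylll is available for LLL reduction.
import random
from math import gcd
from fpylll import IntegerMatrix, LLL

N = (1 << 127) - 1                 # toy public modulus (a Mersenne prime)
x = random.randrange(N)            # the hidden number
T = E = 1 << 16                    # bounds on the unknown multipliers/errors
t = [random.randrange(1, T) for _ in range(3)]
a = [(ti * x + random.randrange(E)) % N for ti in t]

# Short vectors (s, c1, c2, c3) of this lattice satisfy
# c1*a1 + c2*a2 + c3*a3 = s/K (mod N) with s small, which generically
# forces c1*t1 + c2*t2 + c3*t3 = 0 over the integers.
K = 1 << 20                        # weight penalizing the residue coordinate
B = IntegerMatrix.from_matrix([
    [K * a[0], 1, 0, 0],
    [K * a[1], 0, 1, 0],
    [K * a[2], 0, 0, 1],
    [K * N,    0, 0, 0],
])
LLL.reduction(B)

# The two shortest reduced vectors span the lattice orthogonal to
# (t1, t2, t3); their cross product is parallel to it.
c1 = [B[0, j] for j in range(1, 4)]
c2 = [B[1, j] for j in range(1, 4)]
cross = [c1[1] * c2[2] - c1[2] * c2[1],
         c1[2] * c2[0] - c1[0] * c2[2],
         c1[0] * c2[1] - c1[1] * c2[0]]
g = gcd(gcd(cross[0], cross[1]), cross[2]) or 1
rec = [c // g for c in cross]
if rec[0] < 0:
    rec = [-c for c in rec]
print("true multipliers:", t)
print("recovered (up to sign/scale):", rec)   # typically equals t
```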
Motivated by ever-increasing concerns over personal data privacy and the rapidly growing data volume at local clients, federated learning (FL) has emerged as a new machine learning setting. An FL system comprises a central parameter server and multiple local clients. It keeps data at the local clients and learns a centralized model by sharing the model parameters learned locally. No local data needs to be shared, so privacy is well protected. Nevertheless, since it is the model rather than the raw data that is shared, the system can be exposed to model poisoning attacks launched by malicious clients, and it is challenging to identify such clients since no local client data is available on the server. Moreover, membership inference attacks can still be performed on the uploaded model to estimate the client's local data, leading to privacy disclosure. In this work, we first propose a model-update-based federated averaging algorithm to defend against Byzantine attacks such as additive noise attacks and sign-flipping attacks. An individual client model initialization method is presented to provide further privacy protection against membership inference attacks by hiding each client's local model. Combining these two schemes enhances both privacy and security. The proposed schemes are shown experimentally to converge under non-IID data distributions when there are no attacks, and under Byzantine attacks they perform much better than the classical model-based FedAvg algorithm.
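A minimal sketch of the update-based aggregation setting described above, on a toy quadratic loss with numpy. All names and parameters are illustrative; coordinate-wise median is used here as a standard robust aggregator standing in for the paper's exact defense, which the abstract does not spell out.

```python
# Update-based federated averaging under toy Byzantine behavior.
import numpy as np

rng = np.random.default_rng(0)
DIM, CLIENTS, ROUNDS, LR = 20, 10, 60, 0.1
w_star = rng.normal(size=DIM)            # target model of a toy quadratic loss
w_global = np.zeros(DIM)

def local_update(w, attack=None):
    """One local gradient step; the client uploads only the update (delta)."""
    delta = -LR * (w - w_star)           # gradient of 0.5 * ||w - w_star||^2
    if attack == "noise":                # additive-noise Byzantine client
        delta = delta + rng.normal(scale=5.0, size=DIM)
    elif attack == "sign_flip":          # sign-flipping Byzantine client
        delta = -delta
    return delta

attacks = ["noise", "sign_flip"] + [None] * (CLIENTS - 2)
for _ in range(ROUNDS):
    updates = np.stack([local_update(w_global, a) for a in attacks])
    # Aggregating updates (not raw models) lets each client keep a private,
    # individually initialized model; coordinate-wise median is a standard
    # robust aggregator used here as a stand-in.
    w_global = w_global + np.median(updates, axis=0)
print("distance to target:", float(np.linalg.norm(w_global - w_star)))
```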
Recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which brings serious privacy risks. For example, it has been shown that over 1% of tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on exact memorization behaviors. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., differing by only a couple of tokens out of 100) in the training corpus. A major challenge in conducting this evaluation is the huge computational cost of near-duplicate sequence search: modern LLMs are trained on ever-larger corpora with up to 1 trillion tokens, and the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm. It finds (almost) all the near-duplicate sequences of a query sequence in a large corpus, with guarantees. Specifically, the algorithm generates and groups the min-hash values of all the sequences with at least t tokens (since very short near-duplicates are often irrelevant noise) in the corpus, in time linear in the corpus size. We formally prove that only 2(n+1)/(t+1) − 1 min-hash values are generated for a text with n tokens in expectation, so the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on several large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable.
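The per-text min-hash generation can be sketched with a sliding-window minimum. The toy Python below hashes each token and computes the min-hash of every length-t window in O(n) with a monotonic deque; the actual index covers all windows of length at least t and groups equal values, which is where the 2(n+1)/(t+1) − 1 expectation applies. Function names are illustrative.

```python
# Min-hash of every fixed-length token window via a monotonic deque.
from collections import deque
import hashlib

def token_hash(tok: str) -> int:
    """Stable 64-bit hash of a token."""
    return int.from_bytes(
        hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")

def window_minhashes(tokens, t):
    """Min-hash of every length-t window, in O(n) overall."""
    h = [token_hash(tok) for tok in tokens]
    dq, out = deque(), []          # dq holds indices with increasing hashes
    for i, v in enumerate(h):
        while dq and h[dq[-1]] >= v:
            dq.pop()               # drop candidates dominated by the new hash
        dq.append(i)
        if dq[0] <= i - t:
            dq.popleft()           # expire indices that left the window
        if i >= t - 1:
            out.append(h[dq[0]])   # current window's minimum hash
    return out

tokens = "the cat sat on the mat and the cat sat on the rug".split()
mh = window_minhashes(tokens, t=5)
print(len(mh), "windows,", len(set(mh)), "distinct min-hash values")
# Few distinct values survive deduplication, which is why grouping keeps
# the index near-linear in the corpus size.
```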
ShadowTLS is a new type of circumvention tool in which the relay forwards traffic to a legitimate (unblocked) TLS server until the end of the handshake, and then connects the client to a hidden proxy server (e.g., Shadowsocks). In contrast to previous probe-resistant proxies, this design can evade SNI-based blocking, since to the censor it appears as a legitimate TLS connection to an unblocked domain. In this paper, we describe several attacks against ShadowTLS that would allow a censor to identify whether a suspected IP is hosting a ShadowTLS relay (and block it accordingly), distinguishing it from the legitimate TLS servers it mimics. Our attacks require only a few TCP connections to the suspected IP, a capability that censors including China have already demonstrated in order to block previous proxies. We evaluate these vulnerabilities by performing Internet-wide scans to discover potential ShadowTLS relays, and find over 15K of them. We also describe mitigations against this attack that ShadowTLS (and proxies like it) can implement, and work with the ShadowTLS developers to deploy these fixes.
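The relay behavior at the core of this design can be sketched as a toy TCP forwarder: parse TLS record headers, pass handshake records through to the decoy server, and splice the client to the hidden backend once application data begins. This is a simplification assuming TLS 1.2-style records; real ShadowTLS decides when and how to switch differently (and v3 additionally authenticates clients). Hosts and ports are placeholders.

```python
# Toy relay: handshake records go to a decoy TLS server, then splice to a
# hidden proxy backend. Illustrative only; not the ShadowTLS protocol.
import socket
import threading

DECOY = ("decoy.example.com", 443)    # legitimate TLS server (placeholder)
HIDDEN = ("127.0.0.1", 8388)          # hidden proxy backend (placeholder)

def pipe(src, dst):
    """Copy bytes from src to dst until either side closes."""
    try:
        while (data := src.recv(4096)):
            dst.sendall(data)
    except OSError:
        pass

def read_record(sock):
    """Read one TLS record: 5-byte header (type, version, length) + body."""
    hdr = sock.recv(5, socket.MSG_WAITALL)
    if len(hdr) < 5:
        return None, b""
    body = sock.recv(int.from_bytes(hdr[3:5], "big"), socket.MSG_WAITALL)
    return hdr[0], hdr + body

def handle(client):
    decoy = socket.create_connection(DECOY)
    # The decoy-to-client direction runs freely during the handshake.
    threading.Thread(target=pipe, args=(decoy, client), daemon=True).start()
    while True:
        rtype, rec = read_record(client)
        if rtype is None:
            return
        if rtype == 23:               # application data: handshake is over
            break
        decoy.sendall(rec)            # handshake records go to the decoy
    decoy.close()                     # a real relay must tear this down cleanly
    hidden = socket.create_connection(HIDDEN)
    hidden.sendall(rec)               # forward the buffered record, then splice
    threading.Thread(target=pipe, args=(hidden, client), daemon=True).start()
    pipe(client, hidden)

srv = socket.create_server(("0.0.0.0", 4433))  # placeholder listen port
while True:
    conn, _ = srv.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```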
Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DExperts: Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with “expert” LMs and/or “anti-expert” LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts, and unlikely by the anti-experts. We apply DExperts to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DExperts operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.
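The ensemble itself is a one-line operation on next-token logits, z = z_base + α(z_expert − z_anti), followed by a softmax. A minimal numpy sketch, with toy random logits standing in for real LM outputs:

```python
# DExperts-style product-of-experts combination of next-token logits.
import numpy as np

def dexperts_probs(z_base, z_expert, z_anti, alpha=2.0):
    """Combine base, expert, and anti-expert logits; return a distribution."""
    z = z_base + alpha * (z_expert - z_anti)
    z = z - z.max()                      # stabilize the softmax
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
vocab = 10                               # toy vocabulary size
z_base, z_expert, z_anti = (rng.normal(size=vocab) for _ in range(3))
p = dexperts_probs(z_base, z_expert, z_anti)
# Tokens the expert likes and the anti-expert dislikes gain probability.
print(p.round(3))
```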