Title: A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets
In the age of Big Genomics Data, institutions such as the National Human Genome Research Institute (NHGRI) are challenged in their efforts to share large volumes of data between researchers, a process that has been plagued by unreliable transfers and slow speeds. These problems stem from throughput bottlenecks of traditional transfer technologies. Two factors that affect the efficiency of data transmission are the channel bandwidth and the amount of data. Increasing the bandwidth is one way to transmit data efficiently, but may not always be possible due to resource limitations. Another way to maximize channel utilization is to decrease the number of bits needed to transmit a dataset. Traditionally, transmission of big genomic data between two geographical locations is done using general-purpose protocols such as the hypertext transfer protocol (HTTP) and the secure file transfer protocol (FTP). In this paper, we present a novel deep learning-based data minimization algorithm that 1) minimizes the datasets during transfer over the carrier channels and 2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content encoding) several times for the same dataset: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representations (codewords) of dataset characters using a deep learning-based convolutional neural network (CNN), ensuring that the shortest codewords are assigned to the highest-frequency characters at different time slots during the transfer. This algorithm transmits big genomic DNA datasets with minimal bits and latency, yielding an efficient and expedient process.
Our tested heuristic model, simulation, and real implementation results indicate that the proposed data minimization algorithm is up to 99 times faster and more secure than the content-encoding scheme currently used in HTTP, and 96 times faster than FTP on the tested datasets. The developed protocol, implemented in C#, will be made available to the wider genomics community and domain scientists.
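The abstract's core encoding idea can be illustrated with a minimal sketch. This is not the authors' C# implementation: the codeword table, the block size, and the helper names below are invented for illustration. Each block of the stream gets its own frequency-ranked code table, so the shortest codeword always goes to the block's most frequent character, and the same character can map to different bits in different parts of the dataset.

```python
from collections import Counter

# Illustrative prefix-free codewords, shortest first (not the paper's table).
CODEWORDS = ["0", "10", "110", "1110"]

def build_table(block: str) -> dict:
    """Map each symbol to a codeword: the block's most frequent symbol gets
    the shortest codeword, so the mapping changes whenever the block's
    frequency ranking changes."""
    ranked = [sym for sym, _ in Counter(block).most_common()]
    return {sym: CODEWORDS[i] for i, sym in enumerate(ranked)}

def encode(data: str, block_size: int = 4):
    """Encode per block; emit (table, bits) pairs so a receiver can decode."""
    chunks = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        table = build_table(block)
        chunks.append((table, "".join(table[s] for s in block)))
    return chunks

def decode(chunks) -> str:
    """Invert each per-block table and walk the bit string greedily."""
    text = []
    for table, bits in chunks:
        inverse = {v: k for k, v in table.items()}
        buf = ""
        for bit in bits:
            buf += bit
            if buf in inverse:       # prefix-free, so first match is correct
                text.append(inverse[buf])
                buf = ""
    return "".join(text)
```

Because the table is recomputed per block, an eavesdropper who learns one block's mapping cannot decode the rest of the stream with it, which is the intuition behind the MITM-resistance claim.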
Award ID(s):
1925960
NSF-PAR ID:
10090851
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IEEE Transactions on Big Data
ISSN:
2332-7790
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially humans, from multiple perspectives: understanding the correlation of specific mutations that play a significant role in increasing or decreasing the risk of developing a disease or condition, and finding the implications and connections between the genotype and the phenotype. Advancements in high-throughput sequencing techniques, tools, and equipment have helped generate big genomic datasets thanks to the tremendous decrease in DNA sequencing costs. However, these advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges; accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community, and it is critical to develop new ways to minimize data size. This paper presents a new real-time data minimization mechanism for big genomic datasets that shortens transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method applies the random sampling of Fourier transform theory to real-time generated big genomic datasets in both the FASTA and FASTQ formats and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm achieves up to 79% size reduction for FASTA datasets, while being 98-fold faster and more secure than the standard data-encoding method, and up to 45% size reduction for FASTQ datasets, while being 57-fold faster than the standard data-encoding approach.
Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets. 
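A back-of-the-envelope way to see why the achievable reduction differs between FASTA and FASTQ is to compare the Shannon entropy of each character stream with its 8-bit ASCII representation. This is a generic information-theoretic sketch, not the paper's random-sampling method; the function names are invented for illustration.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data: str) -> float:
    """Shannon entropy: a lower bound on the average codeword length
    any frequency-based encoder can achieve for this data."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def reduction_vs_ascii(data: str) -> float:
    """Fraction of size saved relative to 8-bit-per-character storage."""
    return 1 - entropy_bits_per_symbol(data) / 8
```

A pure-DNA stream (4-letter alphabet) has at most 2 bits of entropy per symbol, so at least 75% savings versus ASCII is possible; FASTQ quality strings draw from a much larger alphabet, leaving less headroom. This is consistent with the direction of the reported 79% versus 45% figures.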
  2. Abstract The ability to engineer the spatial wavefunction of photons has enabled a variety of quantum protocols for communication, sensing, and information processing. These protocols exploit the high dimensionality of structured light, enabling the encoding of multiple bits of information in a single photon, the measurement of small physical parameters, and unprecedented levels of security in cryptographic schemes. Unfortunately, the potential of structured light has been restricted to free-space platforms, in which the spatial profile of photons is preserved. Here, we take an important step toward using structured light for fiber-optic communication. We introduce a classical encryption protocol in which the propagation of high-dimensional spatial modes in multimode fibers is used as a natural mechanism for encryption, providing a secure communication channel for data transmission. The information encoded in spatial modes is retrieved using artificial neural networks trained on the intensity distributions of experimentally detected spatial modes. Our on-fiber communication platform allows us to use single spatial modes for information encoding and high-dimensional superposition modes for bit-by-bit and byte-by-byte encoding, respectively. This protocol enables one to recover messages and images with almost perfect accuracy. Our classical smart protocol for high-dimensional encryption in optical fibers provides a platform that can be adapted to address increased per-photon information capacity at the quantum level, while maintaining the fidelity of information transfer. This is key for quantum technologies relying on structured fields of light, particularly those that are challenged by free-space propagation.
    more » « less
  3. Abstract

    Quantum key distribution (QKD) has established itself as a groundbreaking technology, showcasing inherent security features that are fundamentally proven. Qubit-based QKD protocols that rely on binary encoding encounter an inherent constraint on the secret key capacity, which restricts the maximum to one secret bit per photon. Qudit-based QKD protocols, on the other hand, have advantages in scenarios where photons are scarce and noise is present, as they enable the transmission of more than one secret bit per photon. While proof-of-principle entangled-based qudit QKD systems have been successfully demonstrated over the years, the current limitation lies in the maximum distribution distance, which remains at 20 km of fiber. Moreover, in these entangled high-dimensional QKD systems, the witnessing and distribution of quantum steering have not been shown before. Here we present a high-dimensional time-bin QKD protocol based on energy-time entanglement that generates a secure finite-length key capacity of 2.39 bit/coincidence and secure cryptographic finite-length keys at 0.24 Mbit s−1 over a 50 km optical fiber link. Our system is built entirely from readily available commercial off-the-shelf components and is secured by a nonlocal dispersion cancellation technique against collective Gaussian attacks. Furthermore, we set new records for witnessing both energy-time entanglement and quantum steering over different fiber distances. When operating with a quantum channel loss of 39 dB, our system retains its inherent characteristic of utilizing a large alphabet, enabling a secure key rate of 0.30 kbit s−1 and a secure key capacity of 1.10 bit/coincidence, considering finite-key effects. Our experimental results closely match the theoretical upper bound on secure cryptographic keys in high-dimensional time-bin QKD protocols (Mower et al 2013 Phys. Rev. A 87 062322; Zhang et al 2014 Phys. Rev. Lett. 112 120506), and outperform recent state-of-the-art qubit-based QKD protocols in terms of secure key throughput using commercial single-photon detectors (Wengerowsky et al 2019 Proc. Natl Acad. Sci. 116 6684; Wengerowsky et al 2020 npj Quantum Inf. 6 5; Zhang et al 2014 Phys. Rev. Lett. 112 120506; Zhang et al 2019 Nat. Photon. 13 839; Liu et al 2019 Phys. Rev. Lett. 122 160501; Zhang et al 2020 Phys. Rev. Lett. 125 010502; Wei et al 2020 Phys. Rev. X 10 031030). The simple and robust entanglement-based high-dimensional time-bin protocol presented here offers potential for practical long-distance quantum steering and QKD with multiple secure bits per coincidence, and higher secure cryptographic key rates compared with mature qubit-based QKD protocols.

     
  4.
    A Quantum Key Distribution (QKD) protocol describes how two remote parties can establish a secret key by communicating over a quantum and a public classical channel that both can be accessed by an eavesdropper. QKD protocols using energy-time entangled photon pairs are of growing practical interest because of their potential to provide a higher secure key rate over long distances by carrying multiple bits per entangled photon pair. We consider a system where information can be extracted by measuring random times of a sequence of entangled photon arrivals. Our goal is to maximize the utility of each such pair. We propose a discrete-time model for the photon arrival process, and establish a theoretical bound on the number of raw bits that can be generated under this model. We first analyze a well-known simple binning encoding scheme, and show that it generates a significantly lower information rate than what is theoretically possible. We then propose three adaptive schemes that increase the number of raw bits generated per photon, and compute and compare the information rates they offer. Moreover, the effect of public channel communication on the secret key rates of the proposed schemes is investigated. 
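The simple binning scheme the abstract analyzes can be sketched as follows. The frame size and the keep-only-single-photon-frames rule are illustrative assumptions, not the paper's exact model: time is discretized into bins, bins are grouped into frames, and a frame containing exactly one photon yields log2(frame_size) raw bits given by the bin index.

```python
import math

def binning_encode(arrival_bins, frame_size=8):
    """Simple binning scheme (illustrative): split a discrete-time photon
    arrival sequence (1 = photon in bin, 0 = empty) into frames of
    `frame_size` bins. A frame with exactly one photon contributes
    log2(frame_size) raw bits (the binary bin index); frames with zero
    or multiple photons are discarded."""
    bits_per_frame = int(math.log2(frame_size))
    bits = []
    for i in range(0, len(arrival_bins), frame_size):
        frame = arrival_bins[i:i + frame_size]
        if sum(frame) == 1:
            idx = frame.index(1)
            bits.append(format(idx, f"0{bits_per_frame}b"))
    return "".join(bits)
```

Discarding multi-photon frames is exactly the inefficiency the abstract points to: the adaptive schemes it proposes aim to extract raw bits from photon arrivals that this fixed binning throws away.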
  5. Federated learning (FL) is an increasingly popular approach for machine learning (ML) when the training dataset is highly distributed. Clients perform local training on their datasets, and the updates are then aggregated into the global model. Existing protocols for aggregation are either inefficient or do not consider the case of malicious actors in the system. This is a major barrier to making FL an ideal solution for privacy-sensitive ML applications. In this talk, I will present ELSA, a secure aggregation protocol for FL that breaks this barrier: it is efficient and addresses the existence of malicious actors (clients and servers) at the core of its design. Similar to the prior works Prio and Prio+, ELSA provides a novel secure aggregation protocol built on distributed trust across two servers that keeps individual client updates private as long as one server is honest, defends against malicious clients, and is efficient end-to-end. Compared to prior works, the distinguishing theme in ELSA is that instead of the servers generating cryptographic correlations interactively, the clients act as untrusted dealers of these correlations without compromising the protocol's security. This leads to a much faster protocol while also achieving stronger security at that efficiency compared to prior work. We introduce new techniques that retain privacy even when a server is malicious at a small added cost of 7-25% in runtime, with a negligible increase in communication over the case of a semi-honest server. ELSA improves end-to-end runtime over prior work with similar security guarantees by large margins: single-aggregator RoFL by up to 305x (for the models we consider), and distributed-trust Prio by up to 8x (with up to 16x faster server-side protocol). Additionally, ELSA can be run in a bandwidth-saver mode for clients who are geographically bandwidth-constrained, an important property that is missing from prior works.
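The two-server distributed-trust idea that ELSA shares with Prio-style systems can be illustrated with plain additive secret sharing. This is a minimal sketch only: ELSA's actual protocol layers client-dealt correlations and malicious security on top of this, and the modulus and function names below are invented for illustration. Each client splits its update into two random shares, one per server; each server sums the shares it holds; adding the two server totals reveals only the aggregate, so neither server alone learns any client's update.

```python
import random

P = 2**31 - 1  # illustrative prime modulus for arithmetic shares

def share(update):
    """Split an update vector into two additive shares mod P.
    Each share alone is uniformly random and reveals nothing."""
    s0 = [random.randrange(P) for _ in update]
    s1 = [(u - a) % P for u, a in zip(update, s0)]
    return s0, s1

def server_aggregate(shares):
    """Each server sums the shares it received, coordinate-wise."""
    total = [0] * len(shares[0])
    for s in shares:
        total = [(t + x) % P for t, x in zip(total, s)]
    return total

def combine(t0, t1):
    """Adding the two servers' totals yields only the aggregate update."""
    return [(a + b) % P for a, b in zip(t0, t1)]
```

Correctness follows because the shares of each update sum to that update mod P, so the sum of the two server totals equals the sum of all client updates (assuming the true aggregate stays below P).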