DNA sequencing plays an important role in the bioinformatics research community. It is important to the study of all organisms, and especially of humans, from multiple perspectives, including understanding the correlation of specific mutations with an increased or decreased risk of developing a disease or condition, and uncovering the implications and connections between genotype and phenotype. Advancements in high-throughput sequencing techniques, tools, and equipment have helped generate big genomic datasets, owing to the tremendous decrease in DNA sequencing costs. However, these advancements pose great challenges for genomic data storage, analysis, and transfer: accessing, manipulating, and sharing big genomic datasets is costly in time and size, and raises privacy concerns. Because data size is central to all of these challenges, data minimization techniques have recently attracted much interest in the bioinformatics research community, and it is critical to develop new ways to reduce data size. This paper presents a new real-time data minimization mechanism for big genomic datasets that shortens transfer time and remains more secure should a data breach occur. Our method applies random sampling from Fourier transform theory to big genomic datasets generated in real time, in both FASTA and FASTQ formats, and assigns the shortest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm reduces FASTA dataset sizes by up to 79% and is 98 times faster and more secure than the standard data-encoding method, and reduces FASTQ dataset sizes by up to 45% and is 57 times faster than the standard data-encoding approach. Based on these results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big genomic datasets generated in real time.
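The abstract above does not spell out how codewords are assigned; as a point of reference, the sketch below shows the general frequency-to-codeword idea using standard Huffman coding as a stand-in for the paper's sampling-based scheme. The function name and the toy sequence are illustrative, not from the paper.

```python
# Minimal sketch of the core idea: count character frequencies in a genomic
# sequence and assign the shortest codewords to the most frequent characters.
# Standard Huffman coding is used here as a stand-in; the paper's exact
# sampling-based scheme is not reproduced.
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each character to a prefix-free codeword; frequent chars get short codes."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap items: (frequency, tie-breaker, {char: codeword-so-far})
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {c: "0" + w for c, w in left.items()}
        merged.update({c: "1" + w for c, w in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

seq = "ACGTACGTAAAACCCGGT"  # toy FASTA-style sequence
codes = huffman_codes(seq)
encoded = "".join(codes[c] for c in seq)
print(codes)                            # e.g. {'A': '0', 'C': '10', ...}
print(len(encoded), "bits vs", 8 * len(seq), "bits as plain ASCII")
```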
A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets
In the age of Big Genomics Data, institutions such as the National Human Genome Research Institute (NHGRI) are challenged in their efforts to share volumes of data between researchers, a process that has been plagued by unreliable transfers and slow speeds. These problems occur due to throughput bottlenecks of traditional transfer technologies. Two factors that affect the efficiency of data transmission are the channel bandwidth and the amount of data. Increasing the bandwidth is one way to transmit data efficiently, but might not always be possible due to resource limitations. Another way to maximize channel utilization is to decrease the number of bits needed to transmit a dataset. Traditionally, transmission of big genomic data between two geographical locations is done using general-purpose protocols, such as the hypertext transfer protocol (HTTP) and the secure file transfer protocol (FTP). In this paper, we present a novel deep learning-based data minimization algorithm that (1) minimizes the datasets during transfer over the carrier channels; and (2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content-encoding) several times for the same dataset: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codeword) of dataset characters using a deep learning-based convolutional neural network (CNN), ensuring that the shortest codewords are assigned to the highest-frequency characters at different time slots during the transfer. This algorithm ensures transmission of big genomic DNA datasets with minimal bits and latency, yielding an efficient and expedient process. Our tested heuristic model, simulation, and real implementation results indicate that the proposed data minimization algorithm is up to 99 times faster and more secure than the content-encoding scheme currently used in HTTP, and 96 times faster than FTP, on the tested datasets. The developed protocol, implemented in C#, will be made available to the wider genomics community and domain scientists.
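To make the per-time-slot recoding idea concrete, here is a minimal, hypothetical sketch in which each chunk of the stream gets its own codebook, so the same character maps to different bits in different slots. Plain frequency counting stands in for the paper's CNN-based frequency prediction, which is not reproduced here; the slot size and sequence are illustrative.

```python
# Hypothetical sketch of the per-time-slot recoding idea: each chunk ("time
# slot") gets its own codebook, so the same character maps to different bit
# patterns in different chunks. This is what frustrates a MITM attacker who
# has recovered the codebook for one slot.
from collections import Counter

CODEWORDS = ["0", "10", "110", "111"]  # prefix-free, shortest first, 4-letter DNA alphabet

def codebook_for(chunk):
    """Assign the shortest codeword to the most frequent character in this chunk."""
    ranked = [c for c, _ in Counter(chunk).most_common()]
    return {c: CODEWORDS[i] for i, c in enumerate(ranked)}

def encode_stream(seq, slot_size=8):
    bits, books = [], []
    for start in range(0, len(seq), slot_size):
        chunk = seq[start:start + slot_size]
        book = codebook_for(chunk)
        books.append(book)  # receiver needs the per-slot codebooks to decode
        bits.append("".join(book[c] for c in chunk))
    return bits, books

bits, books = encode_stream("AAAACGTGGGGTTTCA")
for slot, (b, book) in enumerate(zip(bits, books)):
    print(f"slot {slot}: {book} -> {b}")  # 'A' encodes differently in each slot
```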
- Award ID(s): 1925960
- PAR ID: 10090851
- Date Published:
- Journal Name: IEEE Transactions on Big Data
- ISSN: 2332-7790
- Page Range / eLocation ID: 1 to 1
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
This paper studies multi-party private set union (mPSU), a fundamental cryptographic problem that allows multiple parties to compute the union of their respective datasets without revealing any additional information. We propose an efficient mPSU protocol that is secure in the presence of any number of colluding semi-honest participants. Our protocol avoids computationally expensive homomorphic operations and generic multi-party computation, thus providing an efficient solution for mPSU. The crux of our protocol lies in the utilization of a new cryptographic tool, namely membership oblivious transfer (mOT), which we believe may be of independent interest. We implement our mPSU protocol and evaluate its performance. Our protocol shows an improvement of up to 80.84× in running time and 405.73× in bandwidth cost compared to the existing state-of-the-art protocols.
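For orientation, the snippet below shows only the ideal functionality that an mPSU protocol emulates: every party learns the union and nothing else. The mOT-based machinery that achieves this without a trusted party is deliberately not sketched.

```python
# Ideal functionality that a multi-party private set union (mPSU) protocol
# emulates: a trusted party outputs the union and nothing else. A real mPSU
# protocol replaces the trusted party with cryptography (here, membership
# oblivious transfer), which this sketch does not attempt.
def ideal_mpsu(party_sets):
    """What every party learns: the union, with no per-party attribution."""
    return set().union(*party_sets)

parties = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
print(sorted(ideal_mpsu(parties)))  # ['a', 'b', 'c', 'd']
# A secure protocol must reveal nothing beyond this output: not which party
# contributed 'b', nor the sizes of any pairwise intersections.
```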
Abstract: The ability to engineer the spatial wavefunction of photons has enabled a variety of quantum protocols for communication, sensing, and information processing. These protocols exploit the high dimensionality of structured light, enabling the encoding of multiple bits of information in a single photon, the measurement of small physical parameters, and the achievement of unprecedented levels of security in schemes for cryptography. Unfortunately, the potential of structured light has been confined to free-space platforms, in which the spatial profile of photons is preserved. Here, we take an important step toward using structured light for fiber optical communication. We introduce a classical encryption protocol in which the propagation of high-dimensional spatial modes in multimode fibers is used as a natural mechanism for encryption, providing a secure communication channel for data transmission. The information encoded in spatial modes is retrieved using artificial neural networks trained on the intensity distributions of experimentally detected spatial modes. Our on-fiber communication platform allows us to use single spatial modes for information encoding, as well as high-dimensional superposition modes for bit-by-bit and byte-by-byte encoding, respectively. This protocol enables one to recover messages and images with almost perfect accuracy. Our classical smart protocol for high-dimensional encryption in optical fibers provides a platform that can be adapted to address increased per-photon information capacity at the quantum level, while maintaining the fidelity of information transfer. This is key for quantum technologies relying on structured fields of light, particularly those that are challenged by free-space propagation.
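As a rough, hypothetical illustration of the decoding idea (not the paper's experimental setup), the sketch below treats each spatial mode as producing a fixed but scrambled intensity pattern after the fiber, and recovers the transmitted mode by nearest-centroid matching, a deliberately simple stand-in for the trained neural networks.

```python
# Illustrative sketch: each spatial mode yields an unknown-but-fixed intensity
# pattern after the multimode fiber. We simulate that with a random per-mode
# pattern plus detection noise, "train" by averaging labeled frames, and
# decode new frames by nearest-centroid matching.
import numpy as np

rng = np.random.default_rng(0)
n_modes, pixels, noise = 4, 64, 0.3

fiber_patterns = rng.random((n_modes, pixels))  # fixed fiber response per mode

def detect(mode):
    """One noisy camera frame for a transmitted mode."""
    return fiber_patterns[mode] + noise * rng.standard_normal(pixels)

# Training: average many labeled frames per mode (centroids stand in for a model).
centroids = np.stack(
    [np.mean([detect(m) for _ in range(50)], axis=0) for m in range(n_modes)]
)

# Decoding: classify each received frame to the nearest centroid.
message = [0, 3, 1, 2, 2, 0]
decoded = [int(np.argmin(np.linalg.norm(centroids - detect(m), axis=1)))
           for m in message]
print(message, "->", decoded)
```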
Federated learning (FL) is an increasingly popular approach to machine learning (ML) when the training dataset is highly distributed. Clients perform local training on their datasets, and the updates are then aggregated into the global model. Existing protocols for aggregation are either inefficient or do not consider the case of malicious actors in the system. This is a major barrier to making FL an ideal solution for privacy-sensitive ML applications. In this talk, I will present ELSA, a secure aggregation protocol for FL that breaks this barrier: it is efficient and addresses the existence of malicious actors (clients and servers) at the core of its design. Similar to the prior works Prio and Prio+, ELSA provides a novel secure aggregation protocol built on distributed trust across two servers that keeps individual client updates private as long as one server is honest, defends against malicious clients, and is efficient end-to-end. Compared to prior works, the distinguishing theme in ELSA is that instead of the servers generating cryptographic correlations interactively, the clients act as untrusted dealers of these correlations without compromising the protocol's security. This leads to a much faster protocol while also achieving stronger security at that efficiency compared to prior work. We introduce new techniques that retain privacy even when a server is malicious, at a small added cost of 7-25% in runtime and a negligible increase in communication over the case of a semi-honest server. ELSA improves end-to-end runtime over prior work with similar security guarantees by large margins: single-aggregator RoFL by up to 305x (for the models we consider), and distributed-trust Prio by up to 8x (with an up to 16x faster server-side protocol). Additionally, ELSA can be run in a bandwidth-saver mode for clients who are geographically bandwidth-constrained, an important property that is missing from prior works.
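The two-server distributed-trust idea can be illustrated with plain additive secret sharing. The toy sketch below omits ELSA's client-generated correlations and malicious-security checks; it only shows why neither server alone learns any client's update while the true aggregate is still recovered.

```python
# Toy sketch of two-server additive secret sharing for aggregation: each
# client splits its model update into two random shares, one per server.
# Neither server alone learns anything about an individual update, yet the
# sum of the servers' aggregates equals the aggregate of the updates.
import numpy as np

rng = np.random.default_rng(1)
dim, n_clients = 5, 3
updates = [rng.standard_normal(dim) for _ in range(n_clients)]

server0, server1 = np.zeros(dim), np.zeros(dim)
for u in updates:
    r = rng.standard_normal(dim)   # random mask: share0 = r, share1 = u - r
    server0 += r                   # server 0 sees only random-looking shares
    server1 += u - r               # server 1 likewise
aggregate = server0 + server1      # servers reveal only their running sums

assert np.allclose(aggregate, sum(updates))
print(aggregate)
```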
This paper introduces a novel LiDAR point cloud data encoding solution that is compact, flexible, and fully supports distributed data storage within the Hadoop distributed computing environment. The proposed data encoding solution is developed based on Sequence File and Google Protocol Buffers. Sequence File is a generic splittable binary file format built into the Hadoop framework for storage of arbitrary binary data. The key challenge in adopting the Sequence File format for LiDAR data is in the strategy for effectively encoding the LiDAR data as binary sequences in a way that the data can be represented compactly, while allowing necessary mutation. For that purpose, a data encoding solution based on Google Protocol Buffers (a language-neutral, cross-platform, extensible data serialisation framework) was developed and evaluated. Since neither of the underlying technologies is sufficient to completely and efficiently represent all necessary point formats for distributed computing, an innovative fusion of them was required to provide a viable data storage solution. This paper presents the details of such a data encoding implementation and rigorously evaluates the efficiency of the proposed data encoding solution. Benchmarking was done against a straightforward, naive text encoding implementation using a high-density aerial LiDAR scan of a portion of Dublin, Ireland. The results demonstrated a 6-times reduction in data volume, a 4-times reduction in database ingestion time, and up to a 5-times reduction in querying time.
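As a rough illustration of the compactness argument (the paper's actual Sequence File / Protocol Buffers schema is not reproduced; the record layout below is hypothetical), this sketch packs LiDAR point records into fixed-width binary and compares the size against a naive text encoding.

```python
# Rough illustration of binary versus text encoding of LiDAR points.
# The field layout is hypothetical: x, y, z as little-endian doubles,
# intensity and return number as unsigned bytes.
import struct

POINT = struct.Struct("<dddBB")  # 26 bytes per point, no padding

points = [(314159.26, 733721.18, 12.75, 180, 1),
          (314160.02, 733722.40, 13.10, 95, 2)]

binary = b"".join(POINT.pack(*p) for p in points)
text = "\n".join(",".join(str(v) for v in p) for p in points).encode()
print(f"binary: {len(binary)} bytes, text: {len(text)} bytes")

# Decoding is symmetric: iterate fixed-size records.
decoded = [POINT.unpack_from(binary, i * POINT.size) for i in range(len(points))]
assert decoded[0][3] == 180  # intensity of the first point survives round-trip
```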