Although developers of mHealth apps are responsible for protecting patient data and adhering to strict privacy and security requirements, many lack awareness of HIPAA regulations and struggle to distinguish between categories of HIPAA rules. Guidance on classifying HIPAA rule patterns is therefore essential for developing secure applications for the Google Play Store. In this work, we identify the limitations of traditional Word2Vec embeddings in processing code patterns. To address them, we adopt multilingual BERT (Bidirectional Encoder Representations from Transformers), which provides contextualized embeddings of the dataset's attributes. We apply BERT to embed the code patterns in our dataset and then feed the embedded code to various machine learning models. Our results demonstrate that this approach significantly improves classification performance, with Logistic Regression achieving an accuracy of 99.95%. We also obtained high accuracy with Support Vector Machine (99.79%), Random Forest (99.73%), and Naive Bayes (95.93%), outperforming existing approaches. This work underscores the effectiveness of the approach and showcases its potential for secure application development.
Fused Gromov-Wasserstein Variance Decomposition with Linear Optimal Transport
Wasserstein distances form a family of metrics on spaces of probability measures that have recently seen many applications. However, statistical analysis in these spaces is complex due to the nonlinearity of Wasserstein spaces. One potential solution to this problem is Linear Optimal Transport (LOT). This method allows one to find a Euclidean embedding, called LOT embedding, of measures in some Wasserstein spaces, but some information is lost in this embedding. So, to understand whether statistical analysis relying on LOT embeddings can make valid inferences about original data, it is helpful to quantify how well these embeddings describe that data. To answer this question, we present a decomposition of the Fréchet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by LOT embeddings of those measures. We then extend this decomposition to the Fused Gromov-Wasserstein setting. We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance explained by the embedding, and the classification accuracy of machine learning classifiers built on the embedded data. We use the MNIST handwritten digits dataset, IMDB-50000 dataset, and Diffusion Tensor MRI images for these experiments. Our results illustrate the effectiveness of low dimensional LOT embeddings in terms of the percentage of variance explained and the classification accuracy of models built on the embedded data.
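The idea of a "percentage of variance explained" can be illustrated in the simplest setting, measures on the real line, where the quantile (sorted-sample) embedding is isometric for the 2-Wasserstein distance and the barycenter is the pointwise average of quantile functions. The sketch below is not the decomposition presented in the paper: the block-averaging reduction `coarsen` and the helper names are assumptions introduced only to show how the explained fraction might vary with embedding dimension.

```python
import random
from statistics import fmean


def quantile_embedding(samples):
    """Sorted samples: the quantile function of a uniform empirical measure."""
    return sorted(samples)


def sq_dist(u, v):
    """Mean squared difference; equals W2^2 between equal-size 1-D quantile vectors."""
    return fmean((a - b) ** 2 for a, b in zip(u, v))


def variance_about_mean(vectors):
    """Mean squared distance of the vectors to their pointwise average."""
    center = [fmean(col) for col in zip(*vectors)]
    return fmean(sq_dist(v, center) for v in vectors)


def coarsen(vec, k):
    """Hypothetical k-dimensional reduction: replace each of k blocks by its mean."""
    size, rem = divmod(len(vec), k)
    if rem:
        raise ValueError("k must divide the embedding length")
    out = []
    for b in range(k):
        block = vec[b * size:(b + 1) * size]
        out.extend([fmean(block)] * size)
    return out


def explained_fraction(measures, k):
    """Share of the 1-D Fréchet variance kept by the k-block reduction."""
    embedded = [quantile_embedding(m) for m in measures]
    total = variance_about_mean(embedded)  # 1-D Fréchet variance
    kept = variance_about_mean([coarsen(e, k) for e in embedded])
    return kept / total


# Toy data (an assumption): three Gaussian samples with different means.
random.seed(0)
measures = [[random.gauss(mu, 1.0) for _ in range(64)] for mu in (-1.0, 0.0, 2.0)]
for k in (1, 4, 16, 64):
    print(k, round(explained_fraction(measures, k), 4))
```

Because block averaging is an orthogonal projection, the fraction lies between 0 and 1 and reaches 1 when `k` equals the full embedding length. In higher dimensions the embedding is no longer isometric, which is the case the decomposition in the abstract addresses.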
- Award ID(s): 1953087
- PAR ID: 10557057
- Publisher / Repository: arXiv.org
- Date Published:
- Format(s): Medium: X
- Institution: Florida State University
- Sponsoring Org: National Science Foundation
More Like this
-
Many data analysis and design problems involve reasoning about points in high-dimensional space. A common strategy is to embed points from this high-dimensional space into a low-dimensional one. As we will show in this paper, a critical property of good embeddings is that they preserve isometry, i.e., they preserve the geodesic distance between points on the original data manifold within their embedded locations in the latent space. However, enforcing isometry is non-trivial for common neural embedding models, such as autoencoders and generative models. Moreover, while theoretically appealing, it is not clear to what extent enforcing isometry is really necessary for a given design or analysis task. This paper answers these questions by constructing an isometric embedding via an isometric autoencoder, which we employ to analyze an inverse airfoil design problem. Specifically, the paper describes how to train an isometric autoencoder and demonstrates its usefulness compared to non-isometric autoencoders on both simple pedagogical examples and airfoil embeddings using the UIUC airfoil dataset. Our ablation study illustrates that enforcing isometry is necessary to accurately discover latent space clusters, a common analysis that researchers perform on low-dimensional embeddings. We also show how isometric autoencoders can uncover pathologies in typical gradient-based shape optimization solvers through an analysis of the SU2-optimized airfoil dataset, wherein we find an over-reliance of the gradient solver on angle of attack. Overall, this paper motivates the use of isometry constraints in neural embedding models, particularly in cases where researchers or designers intend to use distance-based analysis measures (such as clustering, k-Nearest Neighbors methods, etc.) to analyze designs within the latent space. While this work focuses on airfoil design as an illustrative example, it applies to any domain where analyzing isometric design or data embeddings would be useful.
-
Discriminating between distributions is an important problem in a number of scientific fields. This motivated the introduction of Linear Optimal Transportation (LOT), which embeds the space of distributions into an $L^2$-space. The transform is defined by computing the optimal transport of each distribution to a fixed reference distribution and has a number of benefits when it comes to speed of computation and to determining classification boundaries. In this paper, we characterize a number of settings in which LOT embeds families of distributions into a space in which they are linearly separable. This is true in arbitrary dimension, and for families of distributions generated through perturbations of shifts and scalings of a fixed distribution. We also prove conditions under which the $L^2$ distance of the LOT embedding between two distributions in arbitrary dimension is nearly isometric to the Wasserstein-2 distance between those distributions. This is of significant computational benefit, as one must only compute $N$ optimal transport maps to define the $N^2$ pairwise distances between $N$ distributions. We demonstrate the benefits of LOT on a number of distribution classification problems.
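The computational point made above (one transport map per distribution, then cheap Euclidean distances for all pairs) can be sketched on tiny uniform point clouds. This is an illustrative toy, not the authors' implementation: it assumes equal-size clouds with uniform weights, so optimal transport reduces to an assignment problem, and it solves that by brute force over permutations, which is only viable for a handful of points. All function names are hypothetical.

```python
import itertools
import math
import random


def sq(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def matching_cost(src, dst, perm):
    """Average squared distance when src[i] is matched to dst[perm[i]]."""
    return sum(sq(src[i], dst[j]) for i, j in enumerate(perm)) / len(src)


def optimal_matching(src, dst):
    """Brute-force optimal assignment; a real solver would replace this."""
    return min(itertools.permutations(range(len(src))),
               key=lambda perm: matching_cost(src, dst, perm))


def w2(src, dst):
    """Exact 2-Wasserstein distance between equal-size uniform point clouds."""
    return math.sqrt(matching_cost(src, dst, optimal_matching(src, dst)))


def lot_embed(reference, cloud):
    """Image of each reference point under the optimal map to `cloud`."""
    perm = optimal_matching(reference, cloud)
    return [cloud[j] for j in perm]


def lot_distance(emb_a, emb_b):
    """Weighted L2 distance between two embeddings on the same reference."""
    return math.sqrt(sum(sq(p, q) for p, q in zip(emb_a, emb_b)) / len(emb_a))


def random_cloud(n, shift):
    """Toy 2-D Gaussian cloud shifted along the first axis (an assumption)."""
    return [(random.gauss(shift, 1.0), random.gauss(0.0, 1.0)) for _ in range(n)]


random.seed(1)
reference = random_cloud(5, 0.0)
clouds = [random_cloud(5, s) for s in (0.0, 1.0, 2.0, 3.0)]

# N transport problems give N embeddings; every pairwise distance is then Euclidean.
embeddings = [lot_embed(reference, c) for c in clouds]
for i, j in itertools.combinations(range(len(clouds)), 2):
    print(i, j,
          round(lot_distance(embeddings[i], embeddings[j]), 3),
          round(w2(clouds[i], clouds[j]), 3))
```

In this discrete setting the LOT distance is never smaller than the true Wasserstein-2 distance, because composing the two matchings through the reference yields a valid (possibly suboptimal) matching between the clouds. How close the two are depends on the data and the reference.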
-
In this paper, we propose a comprehensive unsupervised framework that leverages existing and novel multiview learning models to obtain a single node embedding from a collection of node embeddings, combining the best of all worlds. Through extensive experiments, we demonstrate that the proposed multiview node embedding performs on par with or better than the best of its constituents and provides reliable performance across downstream tasks including node classification and graph reconstruction. Index Terms: multiview learning, node embedding, hybrid tensor decomposition, unsupervised learning
-
Contextualized word embeddings, such as ELMo, provide meaningful representations for words and their contexts. They have been shown to have a great impact on downstream applications. However, we observe that the contextualized embeddings of a word might change drastically when its contexts are paraphrased. As these embeddings are over-sensitive to the context, the downstream model may make different predictions when the input sentence is paraphrased. To address this issue, we propose a post-processing approach to retrofit the embedding with paraphrases. Our method learns an orthogonal transformation on the input space of the contextualized word embedding model, which seeks to minimize the variance of word representations on paraphrased contexts. Experiments show that the proposed method significantly improves ELMo on various sentence classification and inference tasks.

