A ubiquitous problem in aggregating data across different experimental and observational data sources is a lack of software infrastructure that enables flexible and extensible standardization of data and metadata. To address this challenge, we developed HDMF, a hierarchical data modeling framework for modern science data standards. With HDMF, we separate the process of data standardization into three main components: (1) data modeling and specification, (2) data I/O and storage, and (3) data interaction and data APIs. To enable standards to support the complex requirements and varying use cases throughout the data life cycle, HDMF provides object mapping infrastructure to insulate and integrate these various components. This approach supports the flexible development of data standards and extensions, optimized storage backends, and data APIs, while allowing the other components of the data standards ecosystem to remain stable. To meet the demands of modern, large-scale science data, HDMF provides advanced data I/O functionality for iterative data write, lazy data load, and parallel I/O. It also supports optimization of data storage via support for chunking, compression, linking, and modular data storage. We demonstrate the application of HDMF in practice to design NWB 2.0, a modern data standard for collaborative science across the neurophysiology community.
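As a concrete illustration of the I/O features described in the abstract, the sketch below shows how data might be wrapped for chunked, compressed storage and for iterative write of data that does not fit in memory. It is a minimal sketch assuming the `hdmf` Python package; the wrapper classes `H5DataIO` and `DataChunkIterator` are part of that package, but exact import paths and defaults may differ between versions.

```python
# Minimal sketch (assuming the `hdmf` Python package; import paths may differ
# across versions) of configuring chunked, compressed, and iteratively written
# data with HDMF's I/O wrappers.
import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from hdmf.data_utils import DataChunkIterator

# Wrap an in-memory array so the HDF5 backend stores it chunked and
# gzip-compressed when the enclosing container is written.
compressed = H5DataIO(data=np.random.randn(100_000),
                      compression='gzip',
                      chunks=True)

# Stream data that does not fit in memory: the iterator yields one block at a
# time, and the backend writes each block as it arrives (iterative write).
def read_blocks():
    for _ in range(10):
        yield np.random.randn(10_000)

streamed = DataChunkIterator(data=read_blocks(), maxshape=(None,))

# Either wrapper can then be passed wherever a container class expects a
# dataset and written through the HDF5 backend.
```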
The topology of data
Topological data analysis, which allows systematic investigations of the “shape” of data, has yielded fascinating insights into many physical systems.
- Award ID(s): 1720530
- PAR ID: 10414870
- Date Published:
- Journal Name: Physics Today
- Volume: 76
- Issue: 1
- ISSN: 0031-9228
- Page Range / eLocation ID: 36 to 42
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Recent studies have shown that several government and business organizations have experienced major data breaches, and the number of breaches increases daily. The main target for attackers is organizational sensitive data, which includes personally identifiable information (PII) such as social security numbers (SSN), dates of birth (DOB), and credit/debit card numbers (CCDC). The other target is encryption/decryption keys or passwords used to access the sensitive data. Cloud computing is emerging as a solution to store, transfer, and process data in distributed locations over the Internet. Big data and the Internet of Things (IoT) have increased the possibility of sensitive data exposure. The most common attack methods are hacking, unauthorized access, insider theft, and false data injection in transit. Most attacks occur during the three states of the data life cycle: data at rest, data in use, and data in transit. Hence, protecting sensitive data in all states, particularly when data is moving to a cloud computing environment, requires special attention. The main purpose of this research is to analyze the risks caused by data breaches and the personal and organizational weaknesses in protecting sensitive data and privacy. The paper discusses methods such as data classification and data encryption in different states to protect personal and organizational sensitive data. It also presents a mathematical analysis that leverages the birthday paradox to demonstrate attacks on encryption keys. The analysis shows that using the same key to encrypt sensitive data in different data states makes the data less secure than using different keys. Our results show that to improve the security of sensitive data and reduce data breaches, different keys should be used in the different states of the data life cycle.
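The birthday-paradox argument referenced in that abstract rests on a standard approximation. The sketch below is illustrative only and uses assumed example numbers, not the paper's actual analysis or parameters; it shows how collision probability grows as a single key or value space is reused.

```python
# Illustrative sketch of the standard birthday-bound approximation
# p(n, N) ~= 1 - exp(-n*(n-1) / (2*N)); numbers below are assumptions,
# not values from the paper.
import math

def collision_probability(n: int, space_bits: int) -> float:
    """Probability that at least two of n uniformly drawn values collide
    in a space of size N = 2**space_bits."""
    N = 2 ** space_bits
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * N))

# Reusing one key across many encryptions in a 64-bit value space becomes
# risky long before the space itself is exhausted.
print(collision_probability(2**32, 64))   # ~0.39 after 2^32 uses
print(collision_probability(2**16, 64))   # ~1.2e-10 after 2^16 uses
```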
Deep learning is an important technique for extracting value from big data. However, the effectiveness of deep learning requires large volumes of high-quality training data. In many cases, the size of the training data is not large enough to effectively train a deep learning classifier. Data augmentation is a widely adopted approach for increasing the amount of training data, but the quality of the augmented data may be questionable. Therefore, a systematic evaluation of training data is critical. Furthermore, if the training data is noisy, it is necessary to separate out the noisy data automatically. In this paper, we propose a deep learning classifier for automatically separating good training data from noisy data. To effectively train the deep learning classifier, the original training data must be transformed to suit the input format of the classifier. Moreover, we investigate different data augmentation approaches to generate a sufficient volume of training data from original training data of limited size. We evaluate the quality of the training data through cross-validation of the classification accuracy with different classification algorithms. We also check the pattern of each data item and compare the distributions of datasets. We demonstrate the effectiveness of the proposed approach through an experimental investigation of automated classification of massive biomedical images. Our approach is generic and easily adaptable to other big data domains.
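A generic sketch of image data augmentation follows; it is illustrative only and does not reproduce that paper's augmentation pipeline or classifier. It uses only NumPy so it runs without a deep learning framework installed, and the image sizes and transform choices are assumptions.

```python
# Generic image-augmentation sketch (illustrative only; not the paper's method).
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly transformed copy of a 2-D image array in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:                       # random horizontal flip
        out = np.fliplr(out)
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    noise = rng.normal(0.0, 0.01, size=out.shape)   # light additive noise
    return np.clip(out + noise, 0.0, 1.0)

# Expand a small labeled set into a larger training set.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(10)]   # stand-in for real images
augmented = [augment(img, rng) for img in images for _ in range(5)]
print(len(augmented))  # 50 augmented samples derived from 10 originals
```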
With the emergence of self-tracking devices that collect and produce real-time personal data, it is becoming increasingly necessary to innovate data literacy frameworks and student curricula to address new competencies in data handling, visualization, and use. We examine the evolution of data literacy frameworks across the past 7 years, specifically focusing on the inclusion of self-data competencies. We analyzed existing data literacy frameworks to identify common phases of data engagement. A scoping review of published data literacy frameworks was conducted, and 23 studies were included for analysis. Results from this scoping review demonstrate the existence of at least eight sequential phases of data engagement to develop data literacy. Two of these phases address personal or self-data competencies. We then describe a curriculum that addresses these eight phases of data engagement by pairing biometric devices with online tools and educational materials to scaffold self-data knowledge, skills, and attitudes. Based on this, we conclude with the need to propose holistic data literacy education programmes, considering the curriculum as a model to guide similar materials aimed at fostering emerging data competencies.

