The use of tattoos as a soft biometric is increasing in popularity among law enforcement communities. There is a great need for large-scale, publicly available tattoo datasets that can be used to standardize efforts to develop tattoo-based biometric systems. In this work, we introduce a large tattoo dataset (WVU-MediaTatt) collected from a social-media website. Additionally, we provide the source links to the images so that anyone can regenerate this dataset. Our WVU-MediaTatt database contains tattoo sample images from over 1,000 subjects, with two tattoo image samples per subject. To the best of our knowledge, this dataset is significantly larger than any currently released, publicly available tattoo dataset, including the recently released NIST Tatt-C dataset. The use of social media in deep learning, data mining, and biometrics has traditionally been a controversial issue in terms of data security and protection of privacy. In this work, we first discuss in full the issues associated with collecting data from social media sources for biometric system development, and provide a framework for data collection. In creating this new large-scale tattoo dataset, we consider these issues and attempt to protect the subjects' privacy and information, while ensuring that subjects remain in control of their data and that the use of the data adheres to the guidelines proposed by the Health Care Compliance Association (HCCA) and the U.S. Department of Health & Human Services.
Annotated Schema: Mapping Ontologies onto Dataset Schemas
The goal of this proposal, Annotated Schema (AS), is to provide a limited ontology of annotations for dataset metadata that inform a prospective user of the classes, properties, and identifiers contained in the data. The intended audience of this document includes Data Curators (annotators) who want to create meaningful dataset descriptions, and those searching for datasets in our catalog.
- Award ID(s): 2131987
- PAR ID: 10594562
- Publisher / Repository: CAIDA
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- This paper describes the dataset associated with the paper “Product-Specific Human Appropriation of Net Primary Production (HANPP) in US Counties” (Paudel et al., 2023). This dataset comprises human appropriation of net primary production (HANPP) values for 3101 counties in the conterminous US for the years 1997, 2002, 2007, and 2012. For this dataset, HANPP is the carbon content of specific crop, timber, and livestock grazing products appropriated by humans in a county in a year. To calculate HANPP, raw agricultural data were downloaded from public databases such as USDA-National Agricultural Statistics Service Quick Stats and Cropland Data Layer, US Forest Service Timber Product Output, and NPP data from MODIS. These data were processed in Microsoft Excel using stoichiometry derived from established scientific literature. HANPP was partitioned by year, county, product, used and unused, and above- and below-ground. This complete dataset is published in Mendeley Data, and the methods used to compile it are included to make our research well documented, reproducible, and useful for future studies.
- This paper introduces a new ROSbag-based multimodal affective dataset for emotional and cognitive states generated using the Robot Operating System (ROS). We utilized images and sounds from the International Affective Pictures System (IAPS) and the International Affective Digitized Sounds (IADS) to stimulate targeted emotions (happiness, sadness, anger, fear, surprise, disgust, and neutral), and a dual N-back game to stimulate different levels of cognitive workload. Thirty human subjects participated in the user study; their physiological data were collected using the latest commercial wearable sensors, behavioral data were collected using hardware devices such as cameras, and subjective assessments were carried out through questionnaires. All data were stored in single ROSbag files rather than in conventional Comma-Separated Values (CSV) files. This not only ensures synchronization of signals and videos in a dataset, but also allows researchers to easily analyze and verify their algorithms by connecting directly to this dataset through ROS. The generated affective dataset consists of 1,602 ROSbag files, and the size of the dataset is about 787 GB. The dataset is made publicly available. We expect that our dataset can be a great resource for many researchers in the fields of affective computing, Human-Computer Interaction (HCI), and Human-Robot Interaction (HRI).
- Network intrusion detection systems (IDS) have efficiently identified the profiles of normal network activities, extracted intrusion patterns, and constructed generalized models to evaluate (un)known attacks using a wide range of machine learning approaches. In spite of the effectiveness of machine learning-based IDS, it remains challenging to reduce the high number of false alarms caused by data misclassification. In this paper, by using multiple decision mechanisms, we propose a new classification method to identify misclassified data and then to classify them into three different classes: malicious, benign, and ambiguous datasets. In other words, the ambiguous dataset contains a majority of the misclassified data and is thus the most informative for improving the model and anomaly detection, because the model lacks confidence in classifying those data. We evaluate our approach with recent real-world network traffic data, the Kyoto2006+ datasets, and show that the ambiguous dataset contains 77.2% of the previously misclassified data. Re-evaluating the ambiguous dataset effectively reduces the false prediction rate with minimal overhead and improves accuracy by 15%.
- Keystroke data collected from CS1 student participants during the spring 2024 semester at Utah State University. See readme.txt for detailed information. This dataset has undergone deidentification, though it is possible, given that it is a complex, temporal, and ephemeral dataset, that some identifying keystrokes were missed. Ethical use of this dataset includes avoiding attempts at reconstructing identities. That said, if researchers discover anything identifiable in the data, they are encouraged to contact the dataset authors (john.edwards@usu.edu).

