skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: GeoDE: a Geographically Diverse Evaluation Dataset for Object Recognition
Current dataset collection methods typically scrape large amounts of data from the web. While this technique is extremely scalable, data collected in this way tends to reinforce stereotypical biases, can contain personally identifiable information, and typically originates from Europe and North America. In this work, we rethink the dataset collection paradigm and introduce GeoDE , a geographically diverse dataset with 61,940 images from 40 classes and 6 world regions, with no personally identifiable information, collected by soliciting images from people around the world. We analyse GeoDE to understand differences in images collected in this manner compared to web-scraping. We demonstrate its use as both an evaluation and training dataset, allowing us to highlight and begin to mitigate the shortcomings in current models, despite GeoDE’s relatively small size. We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/  more » « less
Award ID(s):
2145198
PAR ID:
10527153
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
NeurIPS 2023
Date Published:
ISSN:
1049-5258
ISBN:
978-3-319-45173-2
Format(s):
Medium: X
Location:
New Orleans, LA
Sponsoring Org:
National Science Foundation
More Like this
  1. Development of a comprehensive legal privacy framework in the United States should be based on identification of the common deficiencies of privacy policies. We attempt to delineate deficiencies by critically analyzing the privacy policies of mobile apps, application suites, social networks, Internet Service Providers, and Internet-of-Things devices. Whereas many studies have examined readability of privacy policies, few have specifically identified the information that should be provided in privacy policies but is not. Privacy legislation invariably starts a definition of personally identifiable information. We find that privacy policies’ definitions of personally identifiable information are far too restrictive, excluding information that does not itself identify a person but which can be used to reasonably identify a person, and excluding information paired with a device identifier which can be reasonably linked to a person. Legislation should define personally identifiable information to include such information, and should differentiate between information paired with a name versus information paired with a device identifier. Privacy legislation often excludes anonymous and de-identified information from notice and choice requirements. We find that privacy policies’ descriptions of anonymous and de-identified information are far too broad, including information paired with advertising identifiers. Computer science has repeatedly demonstrated that such information is reasonably linkable. Legislation should define these categories of information to align with technological abilities. Legislation should also not exempt de-identified information from notice requirements, to increase transparency. Privacy legislation relies heavily on notice requirements. We find that, because privacy policies’ disclosures of the uses of personal information are disconnected from their disclosures about the types of personal information collected, we are often unable to determine which types of information are used for which purposes. Often, we cannot determine whether location or web browsing history is used solely for functional purposes or also for advertising. Legislation should require the disclosure of the purposes for each type of personal information collected. We also find that, because privacy policies disclosures of sharing of personal information are disconnected from their disclosures about the types of personal information collected, we are often unable to determine which types of information are shared. Legislation should require the disclosure of the types of personal information shared. Finally, privacy legislation relies heavily on user choice. We find that free services often require the collection and sharing of personal information. As a result, users often have no choices. We find that whereas some paid services afford users a wide variety of choices, paid services in less competitive sectors often afford users few choices over use and sharing of personal information for purposes unrelated to the service. As a result, users are often unable to dictate which types of information they wish to allow to be shared, and which types they wish to allow to be used for advertising. Legislation should differentiate between take-it-or-leave it, opt-out, and opt-in approaches based on the type of use and on whether the information is shared. Congress should consider whether user choices should be affected by the presence of market power. 
    more » « less
  2. Automated monitoring of dark web (DW) platforms on a large scale is the first step toward developing proactive Cyber Threat Intelligence (CTI). While there are efficient methods for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA serves as the most prevalent and prohibiting type of these measures in the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. In the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulties in overcoming these dark web challenges. As such, solving dark web text-based CAPTCHA has been relying heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. This framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy background and variable character length. To eliminate the need for human involvement, the proposed framework utilizes Generative Adversarial Network (GAN) to counteract dark web background noise and leverages an enhanced character segmentation algorithm to handle CAPTCHA images with variable character length. Our proposed framework, DW-GAN, was systematically evaluated on multiple dark web CAPTCHA testbeds. DW-GAN significantly outperformed the state-of-the-art benchmark methods on all datasets, achieving over 94.4% success rate on a carefully collected real-world dark web dataset. We further conducted a case study on an emergent Dark Net Marketplace (DNM) to demonstrate that DW-GAN eliminated human involvement by automatically solving CAPTCHA challenges with no more than three attempts. Our research enables the CTI community to develop advanced, large-scale dark web monitoring. We make DW-GAN code available to the community as an open-source tool in GitHub. 
    more » « less
  3. null (Ed.)
    The information privacy of the Internet users has become a major societal concern. The rapid growth of online services increases the risk of unauthorized access to Personally Identifiable Information (PII) of at-risk populations, who are unaware of their PII exposure. To proactively identify online at-risk populations and increase their privacy awareness, it is crucial to conduct a holistic privacy risk assessment across the internet. Current privacy risk assessment studies are limited to a single platform within either the surface web or the dark web. A comprehensive privacy risk assessment requires matching exposed PII on heterogeneous online platforms across the surface web and the dark web. However, due to the incompleteness and inaccuracy of PII records in each platform, linking the exposed PII to users is a non-trivial task. While Entity Resolution (ER) techniques can be used to facilitate this task, they often require ad-hoc, manual rule development and feature engineering. Recently, Deep Learning (DL)-based ER has outperformed manual entity matching rules by automatically extracting prominent features from incomplete or inaccurate records. In this study, we enhance the existing privacy risk assessment with a DL-based ER method, namely Multi-Context Attention (MCA), to comprehensively evaluate individuals’ PII exposure across the different online platforms in the dark web and surface web. Evaluation against benchmark ER models indicates the efficacy of MCA. Using MCA on a random sample of data breach victims in the dark web, we are able to identify 4.3% of the victims on the surface web platforms and calculate their privacy risk scores. 
    more » « less
  4. Facebook has become an important part of our daily life. From knowing the status of our relatives, showing off a new car, to connecting with a high school classmate, abundant personally identifiable information (PII) are made visible to others by posts, images and news. However, this free flow of information has also created significant cyber-security challenges that make us vulnerable to social engineering and cyber crimes. To confront these challenges, we propose a new behavioral biometric that verifies a user based on his or her widget interaction behavior when using Facebook. Specifically, we monitor activities on the user’s Facebook account using our own logging software and verify the user’s claimed identity by binary classifiers trained with two algorithms (SVM-rbf and the GBM– Gradient Boosting Machines). Our novel dataset consists of eight users over a month of data collection with an average of 2.95k rows of data per user. We convert these activities data into meaningful features such as day-of-week, hour-of-day, and widget types and duration of mouse staying on a widget. The performance shows that our novel widget interaction modality is promising for authentication. The SVM-rbf classifiers achieve a mean Equal Error Rate (EER) and mean Accuracy (ACC) of 3.91% and 97.79%, while the GBM classifiers a mean EER and ACC of 2.76% and 97.88%, respectively. In addition, we perform an ablation study to understand the impact of individual features on authentication performance. The importance of features are ranked in the descending order of hour-of-day, day-of-week, and widget types and duration. 
    more » « less
  5. The use of tattoos as a soft biometric is increasing in popularity among law enforcement communities. There is great need for large scale, publicly available tattoo datasets that can be used to standardize efforts to develop tattoo-based biometric systems. In this work, we introduce a large tattoo dataset (WVUMediaTatt) collected from a social-media website. Additionally, we provide the source links to the images so that anyone can re-generate this dataset. Our WVU-MediaTatt database contains tattoo sample images from over 1,000 subjects, with two tattoo image samples per subject. To the best of our knowledge, this dataset is significantly bigger than any current released publicly available tattoo dataset, including the recently released NIST Tatt-C dataset. The use of social media in deep learning, data mining, and biometrics has traditionally been a controversial issue in terms of data security and protection of privacy. In this work, we first conduct a full discussion on the issues associated with data collection from social media sources for the use of biometric system development, and provide a framework for data collection. In this study, within the process of creating a new large scale tattoo dataset, we consider the issues and make attempts protect the subject’s privacy and information, while ensuring that subjects remain in control of their data in this study and the use of the data adheres to the guidelines proposed by the Heath Care Compliance Association (HCCA) and the U.S. Department of Health & Human Services. 
    more » « less