

Title: A STATISTICAL OVERVIEW ON DATA PRIVACY
The eruption of big data, with the increasing collection and processing of vast volumes and varieties of data, has led to breakthrough discoveries and innovation in science, engineering, medicine, commerce, criminal justice, and national security that would not have been possible in the past. While there are many benefits to the collection and usage of big data, there are also growing concerns among the general public about what personal information is collected and how it is used. In addition to legal policies and regulations, technological tools and statistical strategies also exist to promote and safeguard individual privacy while still allowing useful population-level information to be released and shared. In this overview, I introduce some of these approaches, as well as the existing challenges and opportunities in statistical data privacy research and applications to better meet the practical needs of privacy protection and information sharing.
Award ID(s):
1717417
NSF-PAR ID:
10187183
Author(s) / Creator(s):
Date Published:
Journal Name:
Notre Dame journal of law ethics public policy
Volume:
34
Issue:
2
ISSN:
0883-3648
Page Range / eLocation ID:
477-500
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Involving the public in scientific discovery offers opportunities for engagement, learning, participation, and action. Since its launch in 2007, the CitSci.org platform has supported hundreds of community-driven citizen science projects involving thousands of participants who have generated close to a million scientific measurements around the world. Members using CitSci.org follow their curiosities and concerns to develop, lead, or simply participate in research projects. While professional scientists are trained to make ethical determinations related to the collection of, access to, and use of information, citizen scientists and practitioners may be less aware of such issues and more likely to become involved in ethical dilemmas. In this era of big and open data, where data sharing is encouraged and open science is promoted, privacy and openness considerations can often be overlooked. Platforms that support the collection, use, and sharing of data and personal information need to consider their responsibility to protect the rights to and ownership of data, the provision of protection options for data and members, and at the same time provide options for openness. This requires critically considering both intended and unintended consequences of the use of platforms, data, and volunteer information. Here, we use our journey developing CitSci.org to argue that incorporating customization into platforms through flexible design options for project managers shifts the decision-making from top-down to bottom-up and allows project design to be more responsive to goals. To protect both people and data, we developed—and continue to improve—options that support various levels of “open” and “closed” access permissions for data and membership participation. 
These options support diverse governance styles that are responsive to data uses, traditional and indigenous knowledge sensitivities, intellectual property rights, personally identifiable information concerns, volunteer preferences, and sensitive data protections. We present a typology for citizen science openness choices, their ethical considerations, and strategies that we are actively putting into practice to expand privacy options and governance models based on the unique needs of individual projects using our platform. 
  2. In graph machine learning, data collection, sharing, and analysis often involve multiple parties, each of which may require varying levels of data security and privacy. To this end, preserving privacy is of great importance in protecting sensitive information. In the era of big data, the relationships among data entities have become unprecedentedly complex, and more applications utilize advanced data structures (i.e., graphs) that can support network structures and relevant attribute information. To date, many graph-based AI models have been proposed (e.g., graph neural networks) for various domain tasks, like computer vision and natural language processing. In this paper, we focus on reviewing privacy-preserving techniques of graph machine learning. We systematically review related works from the data to the computational aspects. We first review methods for generating privacy-preserving graph data. Then we describe methods for transmitting privacy-preserved information (e.g., graph model parameters) to realize the optimization-based computation when data sharing among multiple parties is risky or impossible. In addition to discussing relevant theoretical methodology and software tools, we also discuss current challenges and highlight several possible future research opportunities for privacy-preserving graph machine learning. Finally, we envision a unified and comprehensive secure graph machine learning system. 
  3. Between 2018 and 2021 PIs for National Science Foundation Awards # 1758781 and 1758814 EAGER: Collaborative Research: Developing and Testing an Incubator for Digital Entrepreneurship in Remote Communities, in partnership with the Tanana Chiefs Conference, the traditional tribal consortium of the 42 villages of Interior Alaska, jointly developed and conducted large-scale digital and in-person surveys of multiple Alaskan interior communities. The survey was distributed via a combination of in-person paper surveys, digital surveys, social media links, verbal in-person interviews and telephone-based responses. Analysis of this measure using SAS demonstrated the statistically significant need for enhanced digital infrastructure and reworked digital entrepreneurial and technological education in the Tanana Chiefs Conference region. 1. Two statistical measures were created during this research: Entrepreneurial Readiness (ER) and Digital Technology needs and skills (DT), both of which showed high measures of internal consistency (.89, .81). 2. The measures revealed entrepreneurial readiness challenges and evidence of specific addressable barriers that are currently preventing (serving as hindrances to) regional digital economic activity. The survey data showed statistically significant correlation with the mixed-methodological in-person focus groups and interview research conducted by the PIs and TCC collaborators in Hughes and Huslia, AK, which further corroborated stated barriers to entrepreneurship development in the region. 3. Data generated by the survey and fieldwork is maintained by the Tanana Chiefs Conference under data sovereignty agreements. The survey and focus group data contains aggregated statistical/empirical data as well as qualitative/subjective detail that runs the risk of becoming personally identifiable especially due to (but not limited to) concerns with exceedingly small Arctic community population sizes. 4. 
This metadata is being provided in order to serve as a record of the data collection and analysis conducted, and also to share some high-level findings that, while revealing no personal information, may be helpful for policymaking, regional planning and efforts towards educational curricular development and infrastructural investment. The sample demographics consist of 272 women, 79 men, and 4 with gender not indicated as a response. Barriers to Entrepreneurial Readiness were a component of the measure. Lack of education is the #1 barrier, followed closely by lack of access to childcare. Among women who participated in the survey measure, 30% with 2 or more children report lack of childcare to be a significant barrier to entrepreneurial and small business activity. For entrepreneurial readiness and digital economy, the scales perform well from a psychometric standpoint. The summary scores are roughly normally distributed. Cronbach’s alphas are greater than 0.80 for both. They are moderately correlated with each other (r = 0.48, p < .0001). Men and women do not differ significantly on either measure. Education is significantly related to the digital economy measure. The detail provided in the survey related to educational needs enabled optimized development of the Incubator for Digital Entrepreneurship in Remote Communities. Enhanced digital entrepreneurship training with clear cultural linkages to traditions and community needs, along with additional childcare opportunities are two among several specific recommendations provided to the TCC. The project PIs are working closely with the TCC administration and community members related to elements of culturally-aligned curricular development that respects tribal data sovereignty, local data management protocols, data anonymity and adherence to human subjects (IRB) protocols. 
While the survey data is currently embargoed and unable to be submitted publicly for reasons of anonymity, the project PIs are working with the NSF Arctic Data Center towards determining pathways for sharing personally-protected data with the larger scientific community. These approaches may consist of aggregating and digitally anonymizing sensitive data in ways that cannot be de-aggregated and that meet agency and scientific community needs (while also fully respecting and protecting participants’ rights and personal privacy). At present the data sensitivity protocols are not yet adapted to TCC requirements and the datasets will remain in their care. 
  4. Abstract

    One of the major challenges in ensuring global food security is the ever‐changing biotic risk affecting the productivity and efficiency of the global food supply system. Biotic risks that threaten food security include pests and diseases that affect pre‐ and postharvest terrestrial agriculture and aquaculture. Strategies to minimize this risk depend heavily on plant and animal disease research. As data collected at high spatial and temporal resolutions become increasingly available, epidemiological models used to assess and predict biotic risks have become more accurate and, thus, more useful. However, with the advent of Big Data opportunities, a number of challenges have arisen that limit researchers’ access to complex, multi‐sourced, multi‐scaled data collected on pathogens, and their associated environments and hosts. Among these challenges, one of the most limiting factors is data privacy concerns from data owners and collectors. While solutions, such as the use of de‐identifying and anonymizing tools that protect sensitive information are recognized as effective practices for use by plant and animal disease researchers, there are comparatively few platforms that include data privacy by design that are accessible to researchers. We describe how the general thinking and design used for data sharing and analysis platforms can intrinsically address a number of these data privacy‐related challenges that are a barrier to researchers wanting to access data. We also describe how some of the data privacy concerns confronting plant and animal disease researchers are addressed by way of the GEMS informatics platform.

     
  5. Reddy, S. ; Winter, J.S. ; Padmanabhan, S. (Ed.)
    AI applications are poised to transform health care, revolutionizing benefits for individuals, communities, and health-care systems. As the articles in this special issue aptly illustrate, AI innovations in healthcare are maturing from early success in medical imaging and robotic process automation, promising a broad range of new applications. This is evidenced by the rapid deployment of AI to address critical challenges related to the COVID-19 pandemic, including disease diagnosis and monitoring, drug discovery, and vaccine development. At the heart of these innovations is the health data required for deep learning applications. Rapid accumulation of data, along with improved data quality, data sharing, and standardization, enable development of deep learning algorithms in many healthcare applications. One of the great challenges for healthcare AI is effective governance of these data—ensuring thoughtful aggregation and appropriate access to fuel innovation and improve patient outcomes and healthcare system efficiency while protecting the privacy and security of data subjects. Yet the literature on data governance has rarely looked beyond important pragmatic issues related to privacy and security. Less consideration has been given to unexpected or undesirable outcomes of healthcare in AI, such as clinician deskilling, algorithmic bias, the “regulatory vacuum”, and lack of public engagement. Amidst growing calls for ethical governance of algorithms, Reddy et al. developed a governance model for AI in healthcare delivery, focusing on principles of fairness, accountability, and transparency (FAT), and trustworthiness, and calling for wider discussion. Winter and Davidson emphasize the need to identify underlying values of healthcare data and use, noting the many competing interests and goals for use of health data—such as healthcare system efficiency and reform, patient and community health, intellectual property development, and monetization. 
Beyond the important considerations of privacy and security, governance must consider who will benefit from healthcare AI, and who will not. Whose values drive health AI innovation and use? How can we ensure that innovations are not limited to the wealthiest individuals or nations? As large technology companies begin to partner with health care systems, and as personally generated health data (PGHD) (e.g., fitness trackers, continuous glucose monitors, health information searches on the Internet) proliferate, who has oversight of these complex technical systems, which are essentially a black box? To tackle these complex and important issues, it is important to acknowledge that we have entered a new technical, organizational, and policy environment due to linked data, big data analytics, and AI. Data governance is no longer the responsibility of a single organization. Rather, multiple networked entities play a role and responsibilities may be blurred. This also raises many concerns related to data localization and jurisdiction—who is responsible for data governance? In this emerging environment, data may no longer be effectively governed through traditional policy models or instruments. 