Title: Fund Asset Inference Using Machine Learning Methods: What’s in That Portfolio?
Abstract: Given only the historic net asset value (NAV) of a large-cap mutual fund, which members of some universe of stocks are held by the fund? Finding an exact solution is combinatorially intractable: there are, for example, C(500, 30) ≈ 1.4 × 10^48 possible portfolios of 30 stocks drawn from the S&P 500. The authors extend an existing linear-clones approach and introduce a new sequential oscillating selection method to produce a computationally efficient inference. Such techniques could inform efforts to detect fund window dressing of disclosure statements or to adjust market positions in advance of major fund disclosure dates. The authors test the approach by tasking the algorithm with inferring the constituents of exchange-traded funds, for which the true components can later be examined. Depending on the details of the specific problem, the algorithm runs on consumer hardware in 8 to 15 seconds and identifies target portfolio constituents with an accuracy of 88.2% to 98.6%.
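The linear-clones idea behind the approach can be sketched briefly: treat the fund's NAV-derived return series as a sparse, non-negative linear combination of candidate stock returns, and read candidate holdings off the nonzero weights. The toy below uses synthetic data and an off-the-shelf non-negative lasso; it illustrates the general idea only and is not the authors' sequential oscillating selection method.

```python
# Toy sketch of constituent inference via a sparse, non-negative "linear
# clone" of the fund. Synthetic data; NOT the paper's algorithm.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Universe: daily returns for 500 candidate stocks over ~2 trading years.
n_days, n_stocks = 504, 500
stock_returns = rng.normal(0.0005, 0.01, size=(n_days, n_stocks))

# Hidden target: an equal-weighted 30-stock portfolio we try to recover.
true_members = rng.choice(n_stocks, size=30, replace=False)
w_true = np.zeros(n_stocks)
w_true[true_members] = 1.0 / 30
fund_returns = stock_returns @ w_true + rng.normal(0, 1e-4, n_days)

# Sparse non-negative regression: weights >= 0, most driven exactly to zero,
# so the penalty stands in for the 30-name cardinality constraint.
clone = Lasso(alpha=1e-5, positive=True, max_iter=50_000)
clone.fit(stock_returns, fund_returns)

inferred = np.flatnonzero(clone.coef_ > 1e-4)
hits = len(set(inferred) & set(true_members))
print(f"inferred {len(inferred)} names, {hits}/30 true members recovered")
```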
Award ID(s):
1741026
PAR ID:
10112224
Author(s) / Creator(s):
Date Published:
Journal Name:
The Journal of Financial Data Science
ISSN:
2640-3943
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract: Models project that climate change is increasing the frequency of severe storm events such as hurricanes. Hurricanes are an important driver of ecosystem structure and function in tropical coastal and island regions and thus affect tropical forest carbon (C) cycling. We used the DayCent model to explore the effects of increased hurricane frequency on humid tropical forest C stocks and fluxes at decadal and centennial timescales. The model was parameterized with empirical data from the Luquillo Experimental Forest (LEF), Puerto Rico. DayCent replicated the well-documented cyclical pattern of forest biomass fluctuations in hurricane-impacted forests such as the LEF. At the historical hurricane frequency (a 60-year return interval), the dynamic steady-state mean forest biomass was 80.9 ± 0.8 Mg C/ha over the 500-year study period. Increasing hurricane frequency to 30- and 10-year intervals did not significantly affect net primary productivity but significantly decreased mean forest biomass, to 61.1 ± 0.6 and 33.2 ± 0.2 Mg C/ha, respectively (p < 0.001). Hurricane events at all intervals had a positive effect on soil C stocks, although the magnitude and rate of change of soil C varied with hurricane frequency. However, the gain in soil C stocks was insufficient to offset the larger losses of aboveground biomass C over the period. Heterotrophic respiration increased with hurricane frequency by 1.6 to 4.8%. Overall, increasing hurricane frequency decreased net ecosystem production by 0.2 ± 0.08 Mg C/ha/y (30-year interval) to 0.4 ± 0.04 Mg C/ha/y (10-year interval), significantly increasing the C source strength of this forest. These results demonstrate how changes in hurricane frequency can have major implications for the tropical forest C cycle and limit the potential of this ecosystem to serve as a net C sink.
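For intuition about why mean biomass falls as hurricane frequency rises even when regrowth is largely unchanged, a toy disturbance-recovery model is enough. The sketch below is an illustrative logistic-regrowth caricature, not DayCent, and every parameter value is an assumption:

```python
# Toy illustration (not DayCent): logistic biomass regrowth punctuated by
# periodic hurricanes that each remove a fixed fraction of biomass.
# K, r, and loss are invented values chosen only to show the qualitative effect.
import numpy as np

def mean_biomass(interval_years, years=500, K=85.0, r=0.15, loss=0.5):
    """Mean biomass (Mg C/ha) with a hurricane every `interval_years` years."""
    B, trajectory = K, []
    for year in range(years):
        B += r * B * (1.0 - B / K)          # logistic regrowth toward capacity K
        if (year + 1) % interval_years == 0:
            B *= 1.0 - loss                  # hurricane removes fraction `loss`
        trajectory.append(B)
    return np.mean(trajectory)

for interval in (60, 30, 10):
    print(f"{interval:>2}-yr interval: mean biomass ~ {mean_biomass(interval):.1f} Mg C/ha")
```

Shorter intervals leave less recovery time between disturbances, so the long-run mean settles lower, the same direction of effect the DayCent study reports.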
  2. This work models the costs and benefits of personal information sharing, or self-disclosure, in online social networks as a networked disclosure game. In a networked population where edges represent visibility among users, we assume a leader can influence network structure through content promotion, and we seek to optimize social welfare through network design. Our approach treats user interaction non-homogeneously: pairwise engagement between users may or may not involve sharing personal information. We prove that this problem is NP-hard. As a solution, we develop a mixed-integer linear programming (MILP) algorithm, which can achieve an exact solution, and we also develop a time-efficient heuristic algorithm that can be used at scale. We conduct numerical experiments to demonstrate the properties of the algorithms and map the theoretical results to a dataset of posts and comments from 2020 and 2021 in a COVID-related subreddit community, where privacy risks and sharing tradeoffs were particularly pronounced.
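As a rough illustration of the exact-solution route, the sketch below solves a toy budgeted content-promotion MILP with SciPy. The edge welfare values, costs, and budget are invented for illustration and do not reproduce the paper's networked disclosure game:

```python
# Toy MILP: a leader picks which edges to promote (binary x_e) to maximize
# total welfare under a promotion budget. Invented numbers, not the paper's model.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

welfare = np.array([4.0, 3.0, 5.0, 1.0, 2.0])  # welfare gain per candidate edge
cost = np.array([2.0, 1.0, 3.0, 1.0, 2.0])     # promotion cost per edge
budget = 5.0

res = milp(
    c=-welfare,                                       # milp minimizes, so negate
    constraints=LinearConstraint(cost[np.newaxis, :], ub=budget),  # cost <= budget
    integrality=np.ones_like(welfare),                # every x_e is integer
    bounds=Bounds(0, 1),                              # ... and in {0, 1}
)
print("promote edges:", np.flatnonzero(res.x > 0.5), "| welfare:", -res.fun)
```

An exact solver like this is what certifies optimality on small instances; the NP-hardness result is why the paper pairs it with a heuristic for use at scale.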
  3. Abstract: Background: Personal privacy is a significant concern in the era of big data. In the field of health geography, personal health data are collected with geographic location information, which may increase disclosure risk and threaten personal geoprivacy. Geomasking is used to protect individuals' geoprivacy by masking geographic location information, and spatial k-anonymity is widely used to measure the disclosure risk after geomasking is applied. With the emergence of individual GPS trajectory datasets that contain large volumes of confidential geospatial information, disclosure risk can no longer be comprehensively assessed by the spatial k-anonymity method. Methods: This study proposes and develops daily activity locations (DAL) k-anonymity as a new method for evaluating the disclosure risk of GPS data. Instead of calculating disclosure risk from only one geographic location (e.g., home) of an individual, DAL k-anonymity is a composite evaluation of disclosure risk based on all of an individual's activity locations and the time he/she spends at each, abstracted from GPS datasets. With a simulated individual GPS dataset, we present case studies applying DAL k-anonymity in various scenarios to investigate its performance, and we compare the results with those obtained with spatial k-anonymity under the same scenarios. Results: DAL k-anonymity provides a better estimation of disclosure risk than spatial k-anonymity. In various case-study scenarios of individual GPS data, DAL k-anonymity evaluates disclosure risk more effectively by considering the probability of re-identifying an individual's home and all of the other daily activity locations. Conclusions: This new method provides a quantitative means of understanding the disclosure risk of sharing or publishing GPS data. It also sheds new light on the development of new geomasking methods for GPS datasets. Ultimately, these findings will help protect individual geoprivacy while benefiting the research community by promoting and facilitating geospatial data sharing.
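The abstract does not give the paper's exact formula, but the core idea, a composite of per-location anonymity weighted by time spent at each daily activity location, can be sketched as follows. The weighting scheme here is an assumption, not the published definition:

```python
# Hypothetical sketch of a DAL-style composite k-anonymity score: combine
# per-location spatial k values, weighting each daily activity location by
# the fraction of time spent there. The paper's exact formula may differ.
def dal_k_anonymity(locations):
    """locations: list of (k_at_location, hours_per_day) tuples for one person."""
    total_hours = sum(hours for _, hours in locations)
    # Time-weighted harmonic-style combination: locations where the person
    # spends more time (and is easier to re-identify) dominate the score.
    risk = sum((hours / total_hours) / k for k, hours in locations)
    return 1.0 / risk

# Example: home (k=12, 14 h/day), office (k=40, 8 h), gym (k=150, 2 h).
print(f"composite k = {dal_k_anonymity([(12, 14), (40, 8), (150, 2)]):.1f}")
```

A home-only spatial k-anonymity check would report k = 12 here; a composite over all activity locations gives a fuller picture of re-identification risk.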
  4. Author Name Disambiguation (AND) is the task of clustering unique author names from publication records in scholarly or related databases. Although AND has been extensively studied and serves as an important preprocessing step for several tasks (e.g., calculating bibliometrics and scientometrics for authors), there are few publicly available tools for disambiguation in large-scale scholarly databases. Furthermore, most of the disambiguated data is embedded within the search engines of the scholarly databases, and existing application programming interfaces (APIs) have limited features and are often unavailable to users for various reasons. This makes it difficult for researchers and developers to use the data for applications (e.g., author search) or research. Here, we design a novel, web-based, RESTful API for searching disambiguated authors, using the PubMed database as a sample application. We offer two types of queries, attribute-based and record-based, which serve different purposes. Attribute-based queries retrieve authors by the attributes available in the database; we study different search engines to find the most appropriate one for processing these queries. Record-based queries retrieve the authors most likely to have written a query publication provided by a user. To accelerate record-based queries, we develop a novel algorithm that performs a fast record-to-cluster match. We show that our algorithm accelerates such queries by a factor of 4.01 compared to a naive baseline approach.
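A record-based query reduces to matching one incoming publication record against existing author clusters. The sketch below shows only a naive baseline for that step (block on a name key, then score coauthor overlap); the paper's contribution is a much faster record-to-cluster match, which is not reproduced here:

```python
# Naive record-to-cluster matching baseline: block candidate clusters by a
# normalized name key, then pick the cluster with the best coauthor overlap.
# Toy data; the paper's accelerated algorithm is not shown.
clusters = {
    # name key -> list of (cluster_id, known coauthor set)
    "smith_j": [
        ("smith_j_001", {"lee_k", "park_h", "chen_w"}),
        ("smith_j_002", {"garcia_m", "brown_t"}),
    ],
}

def match_record(name_key, coauthors):
    """Return the cluster id whose coauthor set best overlaps the record's."""
    candidates = clusters.get(name_key, [])
    if not candidates:
        return None
    best_id, best_score = max(
        ((cid, len(known & coauthors)) for cid, known in candidates),
        key=lambda pair: pair[1],
    )
    return best_id if best_score > 0 else None

print(match_record("smith_j", {"lee_k", "nguyen_t"}))  # -> smith_j_001
```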
  5. Background: The 2020 US Census will use a novel approach to disclosure avoidance, called TopDown, to protect respondents' data. The TopDown algorithm was applied to the 2018 end-to-end (E2E) test of the decennial census, and the computer code used for this test, along with accompanying exposition, has recently been released publicly by the Census Bureau. Methods: We used the available code and data to better understand the error introduced by the E2E disclosure avoidance system when the Census Bureau applied it to 1940 census data, and we developed an empirical measure of privacy loss to compare the error and privacy of the new approach with those of a (non-differentially-private) simple-random-sampling approach to protecting privacy. Results: We found that the empirical privacy loss of TopDown is substantially smaller than the theoretical guarantee for all privacy-loss budgets we examined. When run on the 1940 census data, TopDown with a privacy budget of 1.0 was similar in error and privacy loss to a simple random sample of 50% of the US population; with a privacy budget of 4.0, it was similar to a 90% sample. Conclusions: This work contributes to the beginning of a discussion on how best to balance privacy and accuracy in decennial census data collection, and continued discussion is needed.
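The simple-random-sampling baseline in this comparison is easy to make concrete: publish counts estimated from a p-fraction sample and measure the error that sampling alone introduces. The toy below does that for invented block-level counts; it illustrates the baseline only, not TopDown or the paper's empirical privacy-loss measure:

```python
# Toy simple-random-sampling baseline: estimate small-area counts from a
# p-fraction sample by inverse-probability scaling, then measure the error.
# Counts are invented; this is NOT the TopDown algorithm.
import numpy as np

rng = np.random.default_rng(1)
true_counts = rng.integers(10, 2_000, size=1_000)   # toy block-level populations

def srs_error(p):
    """Mean absolute error of scaled-up counts from a p-fraction sample."""
    sampled = rng.binomial(true_counts, p)           # who lands in the sample
    estimates = sampled / p                          # inverse-probability scale-up
    return np.mean(np.abs(estimates - true_counts))

for p in (0.5, 0.9):
    print(f"{int(p * 100)}% sample: mean abs error = {srs_error(p):.1f} persons/block")
```

Pairing such an error curve with a matching privacy measure is what lets the study say, for example, that TopDown at a budget of 1.0 behaves like a 50% sample.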