Clustering large financial time series data enables pattern extraction that facilitates risk management. The knowledge gathered from unsupervised learning is useful for improving portfolio optimization and making stock trading recommendations. Most methods available in the literature for clustering financial time series are based on exploiting linear relationships between time series. However, prices of different assets (stocks) may have non‐linear relationships which may be quantified using information based measures such as mutual information (MI). To estimate the empirical mutual information between time series of stock returns, we employ a novel kernel density estimator (KDE) based jackknife mutual information estimation (JMI), and compare it with the widely‐used binning method. We then propose an average distance gradient change algorithm and an algorithm based on the average silhouette criterion that use pairwise and groupwise MI of high‐frequency financial stock returns. Through numerical studies, we provide insights into the impact of the clustering on asset allocation and risk management based on the nonlinear information structure of the US stock market.
We develop a convex‐optimization clustering algorithm for heterogeneous financial networks, in the presence of arbitrary or even adversarial outliers. In the stochastic block model with heterogeneity parameters, we penalize nodes whose degree exhibit unusual behavior beyond inlier heterogeneity. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. In absence of any assumption on outliers, they are shown not to hinder the clustering of the inliers. We test the performance of the algorithm on semi‐synthetic heterogenous networks reconstructed to match aggregate data on the Korean financial sector. Our method allows for recovery of sub‐sectors with significantly lower error rates compared to existing algorithms. For overlapping portfolio networks, we uncover a clustering structure supporting diversification effects in investment management.
more » « less- NSF-PAR ID:
- 10442026
- Publisher / Repository:
- Wiley-Blackwell
- Date Published:
- Journal Name:
- Mathematical Finance
- Volume:
- 34
- Issue:
- 2
- ISSN:
- 0960-1627
- Format(s):
- Medium: X Size: p. 425-466
- Size(s):
- p. 425-466
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract -
Abstract A post-disaster recovery process necessitates significant financial and time investment. Previous studies have found the importance of post-disaster spatial recovery heterogeneity, but the recovery heterogeneity has not been extended to the directed recovery relationships despite the significance of sequential recovery plans. Identifying a causal structure between county-level time series data can reveal spatial relationships in the post-disaster recovery process. This study uses a causal discovery method to reveal the spatiotemporal relationships between counties before, during, and after Hurricane Irma in 2017. This study proposes node aggregation methods at different time scales to obtain internally validated causal links. This paper utilizes points of interest data with daily location information from mobile phones and county-level daily nighttime light data. We find intra-regional homogeneity, inter-regional heterogeneity, and a hierarchical structure among urban, suburban, and rural counties based on a network motif analysis. Subsequently, this article suggests county-level post-disaster sequential recovery plans using the causal graph methods. These results help policymakers develop recovery scenarios and estimate the corresponding spatial recovery impacts.
-
We consider the problem of clustering data sets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly for data sets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of data points. We analyze the performance of our algorithm under the assumption that the “good” data points are generated from a mixture of sub-Gaussians (we term these “inliers”), whereas the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches with the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world data sets to demonstrate that our algorithm is less sensitive to outliers compared with other state-of-the-art algorithms proposed in the literature. Funding: G. A. Hanasusanto was supported by the National Science Foundation Grants NSF ECCS-1752125 and NSF CCF-2153606. P. Sarkar gratefully acknowledges support from the National Science Foundation Grants NSF DMS-1713082, NSF HDR-1934932 and NSF 2019844. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2317 .more » « less
-
null (Ed.)Abstract Non-negative matrix factorization and its extensions were applied to various areas (i.e., dimensionality reduction, clustering, etc.). When the original data are corrupted by outliers and noise, most of non-negative matrix factorization methods cannot achieve robust factorization and learn a subspace with binary codes. This paper puts forward a robust semi-supervised non-negative matrix factorization method for binary subspace learning, called RSNMF, for image clustering. For better clustering performance on the dataset contaminated by outliers and noise, we propose a weighted constraint on the noise matrix and impose manifold learning into non-negative matrix factorization. Moreover, we utilize the discrete hashing learning method to constrain the learned subspace, which can achieve a binary subspace from the original data. Experimental results validate the robustness and effectiveness of RSNMF in binary subspace learning and image clustering on the face dataset corrupted by Salt and Pepper noise and Contiguous Occlusion.more » « less
-
We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points.more » « less