Search for: All records
Award ID contains: 1925845
Note: Clicking a Digital Object Identifier (DOI) link takes you to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites, whose policies may differ from this site's.

  1.
    The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is the determination of the number of spiked eigenvalues, K. For estimating K, most attention has focused on the top eigenvalues of the sample covariance matrix, and there has been little investigation into proper ways of using bulk eigenvalues to estimate K. We propose a principled approach to incorporating bulk eigenvalues in the estimation of K. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution. This motivates a two-step method: the first step uses the bulk eigenvalues to estimate the parameters of this distribution, and the second step leverages these parameters to assist the estimation of K. The resulting estimator K̂ aggregates information in a large number of bulk eigenvalues. We show the consistency of K̂ under a standard spiked covariance model. We also propose a confidence interval estimate for K. Our extensive simulation studies show that the proposed method is robust and outperforms existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray dataset and the 1000 Genomes dataset.
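    A minimal numerical sketch of the two-step idea is below, in Python. It replaces the paper's gamma working model with a plain Marchenko-Pastur calibration of the noise scale, so it is a simplified stand-in for the authors' estimator rather than the method itself; all function names and tuning choices are illustrative.

```python
# Toy two-step estimator of the number of spikes K, assuming white noise.
# Step 1 matches bulk eigenvalues to Marchenko-Pastur (MP) quantiles to
# calibrate the noise scale; step 2 counts eigenvalues above the MP edge.
import numpy as np

def mp_quantile(q, gamma, grid=20000):
    """Numerical q-quantile(s) of the MP law with aspect ratio gamma < 1."""
    a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    x = np.linspace(a + 1e-9, b - 1e-9, grid)
    dens = np.sqrt((b - x) * (x - a)) / (2 * np.pi * gamma * x)
    cdf = np.cumsum(dens) * (x[1] - x[0])
    return np.interp(q, cdf / cdf[-1], x)

def estimate_num_spikes(X, trim=0.2):
    """X: n-by-p data matrix with i.i.d. rows; returns an estimate of K."""
    n, p = X.shape
    lam = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]  # descending
    gamma = p / n
    # Step 1: the bulk is barely affected by a few spikes, so the ratio of
    # bulk eigenvalues to MP quantiles estimates the noise variance.
    idx = np.arange(int(trim * p), int((1 - trim) * p))
    q = 1 - (idx + 0.5) / p             # upper-quantile level of lam[idx]
    sig2 = np.median(lam[idx] / mp_quantile(q, gamma))
    # Step 2: count eigenvalues above the calibrated MP upper edge,
    # with a small slack for Tracy-Widom-scale edge fluctuations.
    edge = sig2 * (1 + np.sqrt(gamma)) ** 2 * (1 + 3 * n ** (-2 / 3))
    return int(np.sum(lam > edge))

# Quick check on synthetic data with K = 3 planted spikes.
rng = np.random.default_rng(0)
n, p, K = 2000, 400, 3
B = rng.normal(size=(p, K)) * 3.0
X = rng.normal(size=(n, K)) @ B.T + rng.normal(size=(n, p))
print(estimate_num_spikes(X))           # typically prints 3
```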
  2.
    A network may have weak signals and severe degree heterogeneity, and may be very sparse in one occurrence but very dense in another. SCORE (Ann. Statist. 43, 57–89, 2015) is a recent approach to network community detection. It accommodates severe degree heterogeneity and is adaptive to different levels of sparsity, but its performance for networks with weak signals is unclear. In this paper, we show that in a broad class of network settings that allow for weak signals, severe degree heterogeneity, and a wide range of network sparsity, SCORE achieves perfect clustering and has the so-called “exponential rate” in Hamming clustering errors. The proof uses recent advances on entry-wise bounds for the leading eigenvectors of the network adjacency matrix. The theoretical analysis assures us that SCORE continues to work well in weak-signal settings, but it does not rule out the possibility that SCORE may be further improved for better performance in real applications, especially for networks with weak signals. As a second contribution, we propose SCORE+ as an improved version of SCORE. We investigate SCORE+ on 8 network data sets and find that it outperforms several representative approaches. In particular, for the 6 data sets with relatively strong signals, SCORE+ performs similarly to SCORE, but for the 2 data sets (Simmons, Caltech) with possibly weak signals, SCORE+ has much lower error rates. SCORE+ introduces several changes to SCORE. We carefully explain the rationale underlying each of these changes, using a mixture of theoretical and numerical study.
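    For concreteness, here is a minimal Python sketch of the core SCORE step: entry-wise ratios of the leading eigenvectors of the adjacency matrix cancel degree heterogeneity, and the ratio rows are clustered with k-means. SCORE+'s refinements (e.g., Laplacian-type regularization and a possible extra eigenvector) are omitted, so this illustrates the idea rather than the paper's full procedure.

```python
# Minimal SCORE-style community detection: eigenvector ratios + k-means.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def score_cluster(A, K):
    """A: symmetric adjacency matrix (n x n); K: number of communities."""
    n = A.shape[0]
    vals, vecs = eigsh(A.astype(float), k=K, which='LA')  # leading eigenpairs
    xi = vecs[:, np.argsort(vals)[::-1]]    # columns sorted by eigenvalue
    lead = xi[:, 0].copy()
    lead[np.abs(lead) < 1e-12] = 1e-12      # guard against division by zero
    R = xi[:, 1:] / lead[:, None]           # entry-wise eigenvector ratios
    R = np.clip(R, -np.log(n), np.log(n))   # truncate extreme ratios
    return KMeans(n_clusters=K, n_init=10).fit_predict(R)

# Demo on a small two-block stochastic block model with unequal degrees.
rng = np.random.default_rng(1)
n = 400
z = np.repeat([0, 1], n // 2)
theta = rng.uniform(0.5, 1.5, n)                    # degree heterogeneity
P = np.where(z[:, None] == z[None, :], 0.15, 0.03)
P = P * theta[:, None] * theta[None, :]
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T
print(score_cluster(A, K=2)[:10])
```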
  3.
  4.
    State aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system's states into a small number of meta-states. The choice of the aggregation map often depends on the data analysts' knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system's trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two-step algorithm: the first step is a spectral decomposition of the empirical transition matrix, and the second step conducts a linear transformation of the singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit form, which is not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. The analysis relies on a new entry-wise deviation bound for the singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from the existing literature. The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with nice interpretations.
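    The two-step pipeline is simple enough to sketch in a few lines of Python. The version below uses the successive projection algorithm as a generic stand-in for the paper's vertex hunting step, so it follows the spirit of the method rather than its exact form; all names are illustrative.

```python
# Sketch of the pipeline: (1) SVD of the empirical transition matrix,
# (2) a simple vertex hunting step (successive projection) on the
# singular-vector embedding to locate the r anchor states.
import numpy as np

def empirical_transition(traj, num_states):
    """Row-normalized transition counts from a trajectory of integer states."""
    C = np.zeros((num_states, num_states))
    for s, t in zip(traj[:-1], traj[1:]):
        C[s, t] += 1.0
    C += 1e-6                                   # keep every row stochastic
    return C / C.sum(axis=1, keepdims=True)

def find_anchor_states(P_hat, r):
    """Step 1: top-r SVD; step 2: successive projection for anchor states."""
    U, s, _ = np.linalg.svd(P_hat)
    Y = U[:, :r] * s[:r]                        # embed each state in r dims
    anchors, R = [], Y.copy()
    for _ in range(r):
        i = int(np.argmax(np.linalg.norm(R, axis=1)))   # most extreme state
        anchors.append(i)
        u = R[i] / np.linalg.norm(R[i])
        R -= np.outer(R @ u, u)                 # project out that direction
    return anchors
```

    Given the anchors, aggregation distributions can then be read off by expressing each state's embedded row as a convex combination of the anchor rows.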
  5.
    We introduce a new text-mining methodology that extracts information from news articles to predict asset returns. Unlike more common sentiment scores used for stock return prediction (e.g., those sold by commercial vendors or built with dictionary-based methods), our supervised learning framework constructs a score that is specifically adapted to the problem of return prediction. Our method proceeds in three steps: 1) isolating a list of terms via predictive screening, 2) assigning prediction weights to these words via topic modeling, and 3) aggregating terms into an article-level predictive score via penalized likelihood. We derive theoretical guarantees on the accuracy of estimates from our model under minimal assumptions. In our empirical analysis, we study one of the most actively monitored streams of news articles in the financial system, the Dow Jones Newswires, and show that our supervised text model excels at extracting return-predictive signals in this context. Information in newswires is assimilated into prices with an inefficient delay that is broadly consistent with limits to arbitrage (i.e., more severe for smaller and more volatile firms), yet it can be exploited in a real-time trading strategy with reasonable turnover, net of transaction costs.
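    A toy version of the three-step pipeline (screen, weight, score) can be written compactly. The sketch below simplifies the penalization and topic-model details, so it mirrors the structure of the method rather than reproducing it; every function name and threshold is illustrative.

```python
# Toy three-step supervised sentiment scorer: (1) screen terms by how
# strongly their presence co-moves with the return sign, (2) weight the
# kept terms by their frequencies in positive vs. negative articles,
# (3) score a new article by maximum likelihood over a sentiment grid.
import numpy as np

def screen_terms(D, r, alpha=0.1, min_count=5):
    """D: article-term count matrix (n x m); r: article returns (n,)."""
    present = D > 0
    counts = present.sum(axis=0)
    f = present[r > 0].sum(axis=0) / np.maximum(counts, 1)  # P(pos | term)
    return np.where((np.abs(f - 0.5) > alpha) & (counts >= min_count))[0]

def term_weights(D, r, keep):
    """Mean term frequencies among positive vs. negative articles."""
    F = D[:, keep] / np.maximum(D[:, keep].sum(axis=1, keepdims=True), 1)
    O_pos = F[r > 0].mean(axis=0) + 1e-8
    O_neg = F[r <= 0].mean(axis=0) + 1e-8
    return O_pos / O_pos.sum(), O_neg / O_neg.sum()

def article_score(d, O_pos, O_neg):
    """Grid-search MLE of the sentiment weight p for one article's counts d."""
    grid = np.linspace(0.01, 0.99, 99)
    ll = [np.sum(d * np.log(p * O_pos + (1 - p) * O_neg)) for p in grid]
    return grid[int(np.argmax(ll))]
```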
  6.
    Consider a large social network with possibly severe degree heterogeneity and mixed memberships. We are interested in testing whether the network has only one community or more than one. The problem is known to be non-trivial, partially due to the presence of severe degree heterogeneity. We construct a class of test statistics using the numbers of short paths and short cycles, and the key to our approach is a general framework for canceling the effects of degree heterogeneity. The tests compare favorably with existing methods. We support our methods with careful analysis and a numerical study with simulated data and a real-data example.
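    The degree-cancellation idea can be illustrated with a short Python sketch. The version below uses a signed 3-cycle (triangle) count after a rank-one degree correction; the abstract's class of statistics also covers longer paths and cycles, so treat this as a simplified instance, not the paper's exact test.

```python
# Signed 3-cycle statistic: center the adjacency matrix by a rank-one
# degree fit (which absorbs degree heterogeneity under the one-community
# null), then sum the centered entries around all triangles. Large values,
# after standardization, suggest more than one community.
import numpy as np

def signed_triangle_stat(A):
    """A: symmetric 0/1 adjacency matrix with zero diagonal."""
    d = A.sum(axis=1).astype(float)
    eta = d / np.sqrt(d.sum())          # null fit: E[A] is roughly eta eta^T
    M = A - np.outer(eta, eta)          # degree-corrected centering
    np.fill_diagonal(M, 0.0)            # drop self-loop terms
    return np.trace(M @ M @ M)          # zero diagonal => distinct i, j, k

# The statistic stays small for a one-community graph and grows when a
# genuine block structure is present.
rng = np.random.default_rng(2)
n = 500
z = np.repeat([0, 1], n // 2)
for P in (np.full((n, n), 0.06),                              # null
          np.where(z[:, None] == z[None, :], 0.09, 0.03)):    # two blocks
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1); A = A + A.T
    print(signed_triangle_stat(A))
```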
  7.