Creators/Authors contains: "Yu, Bin"

  1. Abstract

    MicroRNAs (miRNAs) play a key role in regulating gene expression, and their biogenesis is precisely controlled by modulating the activity of the microprocessor. Here, we report that CWC15, a spliceosome-associated protein, acts as a positive regulator of miRNA biogenesis. CWC15 binds the promoters of genes encoding miRNAs (MIRs), promotes their activity, and increases the occupancy of DNA-dependent RNA polymerases at MIR promoters, suggesting that CWC15 positively regulates the transcription of primary miRNA transcripts (pri-miRNAs). In addition, CWC15 interacts with Serrate (SE) and HYL1, two key components of the microprocessor, and is required for efficient pri-miRNA processing and the HYL1-pri-miRNA interaction. Moreover, CWC15 interacts with the 20S proteasome and PRP4KA, facilitating SE phosphorylation by PRP4KA and the subsequent degradation of non-functional SE by the 20S proteasome. These data reveal that CWC15 ensures optimal miRNA biogenesis by maintaining proper SE levels and by modulating pri-miRNA levels. Taken together, this study uncovers the role of a conserved splicing-related protein in miRNA biogenesis.

  2. Abstract

    MicroRNAs (miRNAs) are important regulators of gene expression. Their levels are precisely controlled by modulating the activity of the microprocessor complex (MC). Here, we report that JANUS, a homolog of the conserved U2 snRNP assembly factor in yeast and human, is required for miRNA accumulation. JANUS associates with the MC components Dicer-like 1 (DCL1) and SERRATE (SE) and directly binds the stem-loop of pri-miRNAs. In a hypomorphic janus mutant, the activity of DCL1, the number of MCs, and the interaction of primary miRNA transcripts (pri-miRNAs) with the MC are reduced. These data suggest that JANUS promotes the assembly and activity of the MC through its interaction with the MC and/or pri-miRNAs. In addition, JANUS modulates the transcription of some pri-miRNAs, as it binds the promoters of pri-miRNA genes and facilitates Pol II occupancy at these promoters. Moreover, global splicing defects are detected in janus. Taken together, our study reveals a novel role of a conserved splicing factor in miRNA biogenesis.

  3. Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the CART algorithm to simultaneously grow a flexible number of trees in summation. By combining logical rules with addition, FIGS is able to adapt to additive structure while remaining highly interpretable. Extensive experiments on real-world datasets show that FIGS achieves state-of-the-art prediction performance. To demonstrate the usefulness of FIGS in high-stakes domains, we adapt FIGS to learn clinical decision instruments (CDIs), which are tools for guiding clinical decision-making. Specifically, we introduce a variant of FIGS known as G-FIGS that accounts for the heterogeneity in medical data. G-FIGS derives CDIs that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability. To provide further insight into FIGS, we prove that FIGS learns components of additive models, a property we refer to as disentanglement. Further, we show (under oracle conditions) that unconstrained tree-sum models leverage disentanglement to generalize more efficiently than single decision tree models when fitted to additive regression functions. Finally, to avoid overfitting with an unconstrained number of splits, we develop Bagging-FIGS, an ensemble version of FIGS that borrows the variance reduction techniques of random forests. Bagging-FIGS enjoys competitive performance with random forests and XGBoost on real-world datasets. 
    Free, publicly-accessible full text available July 1, 2024
  4. Free, publicly-accessible full text available May 1, 2024
  5. Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
  6. Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors. The amount of shrinkage is controlled by a single regularization parameter and the number of data points in each ancestor. Since HS is a post-hoc method, it is extremely fast, compatible with any tree-growing algorithm, and can be used synergistically with other regularization techniques. Extensive experiments over a wide variety of real-world datasets show that HS substantially increases the predictive performance of decision trees, even when used in conjunction with other regularization techniques. Moreover, we find that applying HS to each tree in an RF often improves its accuracy as well as its interpretability, by simplifying and stabilizing its decision boundaries and SHAP values. We further explain the success of HS in improving prediction performance by showing its equivalence to ridge regression on a (supervised) basis constructed of decision stumps associated with the internal nodes of a tree. All code and models are released in a full-fledged package available on GitHub.
  7. Recent deep-learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This chapter covers recent work aiming to interpret models by attributing importance to features and feature groups for a single prediction. Importantly, the proposed attributions assign importance to interactions between features, in addition to features in isolation. These attributions are shown to yield insights across real-world domains, including bio-imaging, cosmology image and natural-language processing. We then show how these attributions can be used to directly improve the generalization of a neural network or to distill it into a simple model. Throughout the chapter, we emphasize the use of reality checks to scrutinize the proposed interpretation techniques. (Code for all methods in this chapter is available at and, implemented in PyTorch [54]). 
  8. Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of instances in a dataset (e.g., medical patients grouped by age or treatment site), our method first estimates group membership probabilities for each instance. Then, it uses these estimates as instance weights in FIGS (Tan et al., 2022), to grow a set of decision trees whose values sum to the final prediction. We call this new method Group Probability-Weighted Tree Sums (G-FIGS). G-FIGS achieves state-of-the-art prediction performance on important clinical datasets; e.g., holding the level of sensitivity fixed at 92%, G-FIGS increases specificity for identifying cervical spine injury (CSI) by up to 10% over CART and up to 3% over FIGS alone, with larger gains at higher sensitivity levels. By keeping the total number of rules below 16 in FIGS, the final models remain interpretable, and we find that their rules match medical domain expertise. All code, data, and models are released on GitHub.
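The G-FIGS record above describes using estimated group-membership probabilities as instance weights when growing trees. The following is only a rough sketch of that weighting idea, not the authors' implementation: it fits a single one-split tree (a stump) on one feature by minimizing instance-weighted squared error, whereas G-FIGS grows a sum of trees with FIGS. The function names, the toy data, and the hard-coded probabilities `p_a` and `p_b` are all hypothetical; in practice the probabilities would be estimated by a separate model.

```python
from typing import Sequence, Tuple


def weighted_mean(y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted average of y; the caller guarantees sum(w) > 0."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def fit_weighted_stump(
    x: Sequence[float], y: Sequence[float], w: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (threshold, left_value, right_value) minimizing the
    instance-weighted squared error of a one-split tree on one feature."""
    best = None
    cuts = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2.0
        left = [i for i, xi in enumerate(x) if xi <= thr]
        right = [i for i, xi in enumerate(x) if xi > thr]
        w_left = [w[i] for i in left]
        w_right = [w[i] for i in right]
        if sum(w_left) <= 0 or sum(w_right) <= 0:
            continue  # a side with no weight cannot define a prediction
        m_left = weighted_mean([y[i] for i in left], w_left)
        m_right = weighted_mean([y[i] for i in right], w_right)
        err = sum(w[i] * (y[i] - m_left) ** 2 for i in left)
        err += sum(w[i] * (y[i] - m_right) ** 2 for i in right)
        if best is None or err < best[0]:
            best = (err, thr, m_left, m_right)
    if best is None:
        raise ValueError("need two distinct feature values with positive weight")
    return best[1], best[2], best[3]


# Toy data (hypothetical): the first four instances come from group A,
# the last four from group B, and the outcome switches at different points.
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [0, 0, 1, 1, 0, 1, 1, 1]
# Hypothetical estimated probabilities of membership in group A and group B.
p_a = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
p_b = [1.0 - p for p in p_a]

thr_a, _, _ = fit_weighted_stump(x, y, p_a)  # split tuned to group A
thr_b, _, _ = fit_weighted_stump(x, y, p_b)  # split tuned to group B
print(thr_a, thr_b)
```

Both fits see all eight instances, so data is pooled across groups, but the weights pull each split towards the group it is meant to serve.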
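The Hierarchical Shrinkage record (item 6) describes shrinking each node's prediction towards the sample means of its ancestors, with the amount set by one regularization parameter and the number of data points in each ancestor. A minimal sketch of one way to write this follows. It assumes the telescoping form in which each parent-to-child change in mean is divided by 1 + lambda / N(parent); this form is an assumption here, since the abstract itself does not state the formula. The `Node` class, the `predict_hs` function, and the toy tree are hypothetical and are not the API of the released package.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    mean: float                     # sample mean of training responses in this node
    n: int                          # number of training points in this node
    feature: int = 0                # index of the feature used to split (internal nodes)
    threshold: float = 0.0          # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def predict_hs(root: Node, x: Sequence[float], lam: float) -> float:
    """Predict for x by walking from root to leaf and shrinking each
    parent-to-child change in mean by 1 + lam / (points in the parent)."""
    pred = root.mean
    parent = root
    while parent.left is not None and parent.right is not None:
        child = parent.left if x[parent.feature] <= parent.threshold else parent.right
        pred += (child.mean - parent.mean) / (1.0 + lam / parent.n)
        parent = child
    return pred


# Toy fitted tree (hypothetical numbers, chosen so node means are consistent).
tree = Node(
    mean=5.0, n=10, feature=0, threshold=0.5,
    left=Node(mean=2.0, n=6),
    right=Node(
        mean=9.5, n=4, feature=1, threshold=0.0,
        left=Node(mean=8.0, n=2),
        right=Node(mean=11.0, n=2),
    ),
)

point = [1.0, 1.0]
unshrunk = predict_hs(tree, point, lam=0.0)    # equals the plain leaf mean
shrunk = predict_hs(tree, point, lam=10.0)     # pulled towards ancestor means
print(unshrunk, shrunk)
```

With lam set to 0 the prediction is the ordinary leaf mean, and as lam grows it moves towards the root mean. The tree structure is never modified, which is what makes the method post-hoc.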