Search for: All records

Creators/Authors contains: "Zhang, Si"

  1. Marshall, Christopher W. (Ed.)
    ABSTRACT Identification of genes encoding β-lactamases (BLs) from short-read sequences remains challenging due to the high frequency of shared amino acid functional domains and motifs in proteins encoded by BL genes and related non-BL gene sequences. Divergent BL homologs can be frequently missed during similarity searches, which has important practical consequences for monitoring antibiotic resistance. To address this limitation, we built ROCker models that targeted broad classes (e.g., class A, B, C, and D) and individual families (e.g., TEM) of BLs and challenged them with mock 150-bp- and 250-bp-read data sets of known composition. ROCker identifies most-discriminant bit score thresholds in sliding windows along the sequence of the target protein sequence and hence can account for nondiscriminative domains shared by unrelated proteins. BL ROCker models showed a 0% false-positive rate (FPR), a 0% to 4% false-negative rate (FNR), and an up-to-50-fold-higher F1 score [2 × precision × recall/(precision + recall)] compared to alternative methods, such as similarity searches using BLASTx with various e-value thresholds and BL hidden Markov models, or tools like DeepARG, ShortBRED, and AMRFinder. The ROCker models and the underlying protein sequence reference data sets and phylogenetic trees for read placement are freely available through . Application of these BL ROCker models to metagenomics, metatranscriptomics, and high-throughput PCR gene amplicon data should facilitate the reliable detection and quantification of BL variants encoded by environmental or clinical isolates and microbiomes and more accurate assessment of the associated public health risk, compared to the current practice.
    IMPORTANCE Resistance genes encoding β-lactamases (BLs) confer resistance to the widely prescribed antibiotic class β-lactams. Therefore, it is important to assess the prevalence of BL genes in clinical or environmental samples for monitoring the spreading of these genes into pathogens and estimating public health risk. However, detecting BLs in short-read sequence data is technically challenging. Our ROCker model-based bioinformatics approach showcases the reliable detection and typing of BLs in complex data sets and thus contributes toward solving an important problem in antibiotic resistance surveillance. The ROCker models developed substantially expand the toolbox for monitoring antibiotic resistance in clinical or environmental settings.
    Free, publicly-accessible full text available June 28, 2023
  2. Dense subgraph detection is a fundamental building block for a variety of applications. Most of the existing methods aim to discover dense subgraphs within either a single network or a multi-view network while ignoring the informative node dependencies across multiple layers of networks in a complex system. To date, it largely remains a daunting task to detect dense subgraphs on multi-layered networks. In this paper, we formulate the problem of dense subgraph detection on multi-layered networks based on the cross-layer consistency principle. We further propose a novel algorithm, Destine, based on projected gradient descent, with the following advantages. First, armed with the cross-layer dependencies, Destine is able to detect significantly more accurate and meaningful dense subgraphs at each layer. Second, it scales linearly w.r.t. the number of links in the multi-layered network. Extensive experiments demonstrate the efficacy of the proposed Destine algorithm in various cases.
  3. Networks (i.e., graphs) are often collected from multiple sources and platforms, such as social networks extracted from multiple online platforms, team-specific collaboration networks within an organization, and inter-dependent infrastructure networks. Such networks from different sources form multi-networks, which can exhibit unique patterns that are invisible when each network is mined separately. However, compared with single-network mining, multi-network mining is still under-explored due to its unique challenges. First (multi-network models), networks under different circumstances can be modeled in a variety of ways. How can multi-network models be properly built from complex data? Second (multi-network mining algorithms), it is often nontrivial either to extend single-network mining algorithms to multi-networks or to design new algorithms. How can effective and efficient mining algorithms be developed for multi-networks? The objectives of this tutorial are to: (1) comprehensively review the existing multi-network models, (2) elaborate the techniques in multi-network mining with a special focus on recent advances, and (3) elucidate open challenges and future research directions. We believe this tutorial could be beneficial to various application domains, and attract researchers and practitioners from data mining as well as other interdisciplinary fields.
  4. Cann, Isaac (Ed.)
    ABSTRACT Arsenic (As) metabolism genes are generally present in soils, but their diversity, relative abundance, and transcriptional activity in response to different As concentrations remain unclear, limiting our understanding of the microbial activities that control the fate of an important environmental pollutant. To address this issue, we applied metagenomics and metatranscriptomics to paddy soils showing a gradient of As concentrations to investigate As resistance genes (ars) including arsR, acr3, arsB, arsC, arsM, arsI, arsP, and arsH as well as energy-generating As respiratory oxidation (aioA) and reduction (arrA) genes. Somewhat unexpectedly, the relative DNA abundances and diversities of ars, aioA, and arrA genes were not significantly different between low and high (∼10 versus ∼100 mg kg⁻¹) As soils. Compared to available metagenomes from other soils, geographic distance rather than As levels drove the different compositions of microbial communities. Arsenic significantly increased ars gene abundance only when its concentration was higher than 410 mg kg⁻¹. In contrast, metatranscriptomics revealed that relative to low-As soils, high-As soils showed a significant increase in transcription of ars and aioA genes, which are induced by arsenite, the dominant As species in paddy soils, but not arrA genes, which are induced by arsenate. These patterns appeared to be community wide as opposed to taxon specific. Collectively, our findings advance understanding of how microbes respond to high As levels and the diversity of As metabolism genes in paddy soils and indicated that future studies of As metabolism in soil or other environments should include the function (transcriptome) level.
    IMPORTANCE Arsenic (As) is a toxic metalloid pervasively present in the environment. Microorganisms have evolved the capacity to metabolize As, and As metabolism genes are ubiquitously present in the environment even in the absence of high concentrations of As. However, these previous studies were carried out at the DNA level; thus, the activity of the As metabolism genes detected remains essentially speculative. Here, we show that the high As levels in paddy soils increased the transcriptional activity rather than the relative DNA abundance and diversity of As metabolism genes. These findings advance our understanding of how microbes respond to and cope with high As levels and have implications for better monitoring and managing an important toxic metalloid in agricultural soils and possibly other ecosystems.
  5. Network alignment plays an important role in a variety of applications. Many traditional methods explicitly or implicitly assume alignment consistency, which might suffer from over-smoothness, whereas some recent embedding-based methods could somewhat embrace alignment disparity by sampling negative alignment pairs. However, under different or even competing designs of negative sampling distributions, some methods advocate positive correlation, which could result in false negative samples incorrectly violating the alignment consistency, whereas others champion negative correlation or a uniform distribution to sample nodes, which may contribute little to learning meaningful embeddings. In this paper, we demystify the intrinsic relationships behind various network alignment methods and between these competing design principles of sampling. Specifically, in terms of model design, we theoretically reveal the close connections between a special graph convolutional network model and the traditional consistency-based alignment method. For model training, we quantify the risk of embedding learning for network alignment with respect to the sampling distributions. Based on these, we propose NeXtAlign, which strikes a balance between alignment consistency and disparity. We conduct extensive experiments that demonstrate the proposed method achieves significant improvements over the state of the art.
  6. Species retaining ancestral features, such as species called living fossils, are often regarded as less derived than their sister groups, but such discussions are usually based on qualitative enumeration of conserved traits. This approach creates a major barrier, especially when quantifying the degree of phenotypic evolution or degree of derivedness, since it focuses only on commonly shared traits, and newly acquired or lost traits are often overlooked. To provide a potential solution to this problem, especially for inter-species comparison of gene expression profiles, we propose a new method named “derivedness index” to quantify the degree of derivedness. In contrast to the conservation-based approach, which deals with expressions of commonly shared genes among species being compared, the derivedness index also considers those that were potentially lost or duplicated during evolution. By applying our method, we found that the gene expression profiles of penta-radial phases in echinoderm tended to be more highly derived than those of the bilateral phase. However, our results suggest that echinoderms may not have experienced much larger modifications to their developmental systems than chordates, at least at the transcriptomic level. In vertebrates, we found that the mid-embryonic and organogenesis stages were generally less derived than the earlier or later stages, indicating that the conserved phylotypic period is also less derived. We also found genes that potentially explain less derivedness, such as Hox genes. Finally, we highlight technical concerns that may influence the measured transcriptomic derivedness, such as read depth and library preparation protocols, for further improvement of our method through future studies. We anticipate that this index will serve as a quantitative guide in the search for constrained developmental phases or processes.
  7. Ranking on networks plays an important role in many high-impact applications, including recommender systems, social network analysis, bioinformatics and many more. In the age of big data, a recent trend is to address the variety aspect of network ranking. Among others, two representative lines of research include (1) heterogeneous information networks with different types of nodes and edges, and (2) networks of networks with edges at different resolutions. In this paper, we propose a new network model named Network of Heterogeneous Information Networks (NeoHIN for short) that is capable of simultaneously modeling both different types of nodes/edges and different edge resolutions. We further propose two new ranking algorithms on NeoHIN based on the cross-domain consistency principle. Experiments on synthetic and real-world networks show that our proposed algorithms are (1) effective, outperforming other existing methods, and (2) efficient, with no additional time cost per iteration compared to their counterparts.
  8. Network alignment is a fundamental task in many high-impact applications. Most of the existing approaches either explicitly or implicitly consider the alignment matrix as a linear transformation to map one network to another, and might overlook the complicated alignment relationship across networks. On the other hand, node representation learning based alignment methods are hampered by the incomparability among the node representations of different networks. In this paper, we propose a unified semi-supervised deep model (ORIGIN) that simultaneously finds the non-rigid network alignment and learns node representations in multiple networks in a mutually beneficial way. The key idea is to learn node representations by the effective graph convolutional networks, which subsequently enable us to formulate network alignment as a point set alignment problem. The proposed method offers two distinctive advantages. First (node representations), unlike the existing graph convolutional networks that aggregate the node information within a single network, we can effectively aggregate the auxiliary information from multiple sources, achieving far-reaching node representations. Second (network alignment), guided by the high-quality node representations, our proposed non-rigid point set alignment approach overcomes the bottleneck of the linear transformation assumption. We conduct extensive experiments that demonstrate the proposed non-rigid alignment method is (1) effective, outperforming both the state-of-the-art linear transformation-based methods and node representation based methods, and (2) efficient, with a comparable computational time between the proposed multi-network representation learning component and its single-network counterpart.
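
For reference, the F1 score quoted in the first abstract, 2 × precision × recall/(precision + recall), can be computed from raw counts as in the following minimal Python sketch. The function name and the example counts are illustrative assumptions; they are not taken from the paper or from the ROCker software.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2 * precision * recall / (precision + recall).

    Returns 0.0 when precision or recall is undefined (no predictions
    or no actual positives) or when both are zero.
    """
    predicted = true_positives + false_positives   # reads the method called positive
    actual = true_positives + false_negatives      # reads that truly are positive
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the paper): 96 target reads detected,
# no false positives, 4 target reads missed.
example = f1_score(true_positives=96, false_positives=0, false_negatives=4)
```

With these hypothetical counts, precision is 1.0 and recall is 0.96, so a method with no false positives and a few missed reads still scores close to 1.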