NSF PAR Search | NSF Public Access Repository

Transparent single-cell set classification with kernel mean embeddings

https://doi.org/10.1145/3535508.3545538

Shan, Siyuan; Baskaran, Vishal Athreya; Yi, Haidong; Ranek, Jolene; Stanley, Natalie; Oliva, Junier B. (August 2022, Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics)

Full Text Available
Scalable Algorithms for Convex Clustering

https://doi.org/10.1109/DSLW51110.2021.9523411

Zhou, Weilian; Yi, Haidong; Mishne, Gal; Chi, Eric (June 2021, IEEE Data Science Workshop (DSW))

Full Text Available
COBRAC: a fast implementation of convex biclustering with compression

https://doi.org/10.1093/bioinformatics/btab248

Yi, Haidong; Huang, Le; Mishne, Gal; Chi, Eric C (April 2021, Bioinformatics)
Mathelier, Anthony (Ed.)
Abstract: Biclustering is a generalization of clustering used to identify simultaneous grouping patterns in observations (rows) and features (columns) of a data matrix. Recently, the biclustering task has been formulated as a convex optimization problem. While this convex recasting of the problem has attractive properties, existing algorithms do not scale well. To address this problem and make convex biclustering a practical tool for analyzing larger data, we propose COBRAC, a fast implementation of convex biclustering that reduces computing time by iteratively compressing the problem size along the solution path. We apply COBRAC to several gene expression datasets to demonstrate its effectiveness and efficiency. In addition to the standalone version of COBRAC, we developed a related web server for online calculation and for visualization of the downloadable interactive results.
Availability: The source code and test data are available at https://github.com/haidyi/cvxbiclustr or https://zenodo.org/record/4620218. The web server is available at https://cvxbiclustr.ericchi.com.
Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
dbCAN-PUL: a database of experimentally characterized CAZyme gene clusters and their substrates

https://doi.org/10.1093/nar/gkaa742

Ausland, Catherine; Zheng, Jinfang; Yi, Haidong; Yang, Bowen; Li, Tang; Feng, Xuehuan; Zheng, Bo; Yin, Yanbin (September 2020, Nucleic Acids Research)
Abstract: PULs (polysaccharide utilization loci) are discrete gene clusters of CAZymes (Carbohydrate Active EnZymes) and other genes that work together to digest and utilize carbohydrate substrates. While PULs have been extensively characterized in Bacteroidetes, there exist PULs from other bacterial phyla, as well as archaea and metagenomes, that remain to be catalogued in a database for efficient retrieval. We have developed an online database dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) to display experimentally verified CAZyme-containing PULs from literature with pertinent metadata, sequences, and annotation. Compared to other online CAZyme and PUL resources, dbCAN-PUL has the following new features: (i) batch download of PUL data by target substrate, species/genome, genus, or experimental characterization method; (ii) annotation for each PUL that displays associated metadata such as substrate(s), experimental characterization method(s) and protein sequence information; (iii) links to external annotation pages for CAZymes (CAZy), transporters (UniProt) and other genes; (iv) display of homologous gene clusters in GenBank sequences via integrated MultiGeneBlast tool; and (v) an integrated BLASTX service available for users to query their sequences against PUL proteins in dbCAN-PUL. With these features, dbCAN-PUL will be an important repository for CAZyme and PUL research, complementing our other web servers and databases (dbCAN2, dbCAN-seq).
Full Text Available
AcrFinder: genome mining anti-CRISPR operons in prokaryotes and their viruses

https://doi.org/10.1093/nar/gkaa351

Yi, Haidong; Huang, Le; Yang, Bowen; Gomez, Javi; Zhang, Han; Yin, Yanbin (May 2020, Nucleic Acids Research)

Abstract: Anti-CRISPR (Acr) proteins encoded by (pro)phages/(pro)viruses have great potential to enable more controllable genome editing. However, genome mining for new Acr proteins is challenging due to the lack of a conserved functional domain and the low sequence similarity among experimentally characterized Acr proteins. We introduce here AcrFinder, a web server (http://bcb.unl.edu/AcrFinder) that combines three well-accepted ideas used by previous experimental studies to pre-screen genomic data for Acr candidates. These ideas include homology search, guilt-by-association (GBA), and CRISPR-Cas self-targeting spacers. Compared to existing bioinformatics tools, AcrFinder has the following unique functions: (i) it is the first online server specifically mining genomes for Acr-Aca operons; (ii) it provides a most comprehensive Acr and Aca (Acr-associated regulator) database (populated by GBA-based Acr and Aca datasets); (iii) it combines homology-based, GBA-based, and self-targeting approaches in one software package; and (iv) it provides a user-friendly web interface that takes both nucleotide and protein sequence files as inputs and outputs a result page with a graphical representation of the genomic contexts of Acr-Aca operons. Leave-one-out cross-validation on experimentally characterized Acr-Aca operons showed that AcrFinder had 100% recall. AcrFinder will be a valuable web resource to help experimental microbiologists discover new anti-CRISPRs.
Full Text Available
AcrDB: a database of anti-CRISPR operons in prokaryotes and viruses

https://doi.org/10.1093/nar/gkaa857

Huang, Le; Yang, Bowen; Yi, Haidong; Asif, Amina; Wang, Jiawei; Lithgow, Trevor; Zhang, Han; Minhas, Fayyaz ul Amir Afsar; Yin, Yanbin (October 2020, Nucleic Acids Research)
Abstract: CRISPR–Cas is an anti-viral mechanism of prokaryotes that has been widely adopted for genome editing. To make CRISPR–Cas genome editing more controllable and safer to use, anti-CRISPR proteins have recently been exploited to prevent excessive/prolonged Cas nuclease cleavage. Anti-CRISPR (Acr) proteins are encoded by (pro)phages/(pro)viruses, and have the ability to inhibit their host's CRISPR–Cas systems. We have built an online database AcrDB (http://bcb.unl.edu/AcrDB) by scanning ∼19 000 genomes of prokaryotes and viruses with AcrFinder, a recently developed Acr-Aca (Acr-associated regulator) operon prediction program. Proteins in Acr-Aca operons were further processed by two machine learning-based programs (AcRanker and PaCRISPR) to obtain numerical scores/ranks. Compared to other anti-CRISPR databases, AcrDB has the following unique features: (i) it is a genome-scale database with the largest collection of data (39 799 Acr-Aca operons containing Aca or Acr homologs); (ii) it offers a user-friendly web interface with various functions for browsing, graphically viewing, searching, and batch downloading Acr-Aca operons; (iii) it focuses on the genomic context of Acr and Aca candidates instead of individual Acr protein families; and (iv) it collects data with three independent programs, each having a unique data mining algorithm, for cross-validation. AcrDB will be a valuable resource to the anti-CRISPR research community.
Full Text Available
dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation

https://doi.org/10.1093/nar/gkx894

Huang, Le; Zhang, Han; Wu, Peizhi; Entwistle, Sarah; Li, Xueqiong; Yohe, Tanner; Yi, Haidong; Yang, Zhenglu; Yin, Yanbin (October 2017, Nucleic Acids Research)

Full Text Available
