Title: freeCount: A Coding Free Framework for Guided Count Data Visualization and Analysis
The analysis and interpretation of high-dimensional biological data sets is a challenging task. Exploratory data analysis of count data produced by next-generation sequencing technologies presents a common hurdle to researchers. Biologists often find it difficult to get started with the analysis process, which can be time consuming. The freeCount analysis framework takes advantage of the reactive features of R Shiny to deliver a set of modular and interactive tools and tutorials for the structured analysis and visualization of count data. Researchers are guided through the iterative steps of data assessment, processing, and analysis in a visual environment.
Award ID(s):
2220696
PAR ID:
10531545
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400704192
Page Range / eLocation ID:
1 to 4
Format(s):
Medium: X
Location:
Providence RI USA
Sponsoring Org:
National Science Foundation
More Like this
  1.
    The growth of biodiversity data sets generated by citizen scientists continues to accelerate. The availability of such data has greatly expanded the scale of questions researchers can address. Yet, error, bias, and noise continue to be serious concerns for analysts, particularly when data being contributed to these giant online data sets are difficult to verify. Counts of birds contributed to eBird, the world’s largest biodiversity online database, present a potentially useful resource for tracking trends over time and space in species’ abundances. We quantified counting accuracy in a sample of 1,406 eBird checklists by comparing numbers contributed by birders (N = 246) who visited a popular birding location in Oregon, USA, with numbers generated by a professional ornithologist engaged in a long-term study creating benchmark (reference) measurements of daily bird counts. We focused on waterbirds, which are easily visible at this site. We evaluated potential predictors of count differences, including characteristics of contributed checklists, of each species, and of time of day and year. Count differences were biased toward undercounts, with more than 75% of counts being below the daily benchmark value. Median count discrepancies were −29.1% (range: 0 to −42.8%; N = 20 species). Model sets revealed an important influence of each species’ reference count, which varied seasonally as waterbird numbers fluctuated, and of percent of species known to be present each day that were included on each checklist. That is, checklists indicating a more thorough survey of the species richness at the site also had, on average, smaller count differences. However, even on checklists with the most thorough species lists, counts were biased low and exceptionally variable in their accuracy. To improve utility of such bird count data, we suggest three strategies to pursue in the future. 
(1) Assess additional options for analytically determining how to select checklists that include less biased count data, as well as exploring options for correcting bias during the analysis stage. (2) Add options for users to provide additional information that helps analysts choose checklists, such as an option for users to tag checklists where they focused on obtaining accurate counts. (3) Explore opportunities to effectively calibrate citizen-science bird count data by establishing a formalized network of marquis sites where dedicated observers regularly contribute carefully collected benchmark data. 
  2. Background: Genome-wide association studies (GWASes) aim to identify single nucleotide polymorphisms (SNPs) associated with a given phenotype. A common approach for the analysis of GWAS is single marker analysis (SMA) based on linear mixed models (LMMs). However, LMM-based SMA usually yields a large number of false discoveries and cannot be directly applied to non-Gaussian phenotypes such as count data. Results: We present a novel Bayesian method to find SNPs associated with non-Gaussian phenotypes. To that end, we use generalized linear mixed models (GLMMs) and, thus, call our method Bayesian GLMMs for GWAS (BG2). To deal with the high dimensionality of GWAS analysis, we propose novel nonlocal priors specifically tailored for GLMMs. In addition, we develop related fast approximate Bayesian computations. BG2 uses a two-step procedure: first, BG2 screens for candidate SNPs; second, BG2 performs model selection that considers all screened candidate SNPs as possible regressors. A simulation study shows favorable performance of BG2 when compared to GLMM-based SMA. We illustrate the usefulness and flexibility of BG2 with three case studies on cocaine dependence (binary data), alcohol consumption (count data), and number of root-like structures in a model plant (count data).
  3. Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
  4. Spatial transcriptomics (ST) technologies are rapidly becoming the extension of single-cell RNA sequencing (scRNAseq), holding the potential of profiling gene expression at a single-cell resolution while maintaining cellular compositions within a tissue. Having both expression profiles and tissue organization enables researchers to better understand cellular interactions and heterogeneity, providing insight into complex biological processes that would not be possible with traditional sequencing technologies. Data generated by ST technologies are inherently noisy, high-dimensional, sparse, and multi-modal (including histological images, count matrices, etc.), thus requiring specialized computational tools for accurate and robust analysis. However, many ST studies currently utilize traditional scRNAseq tools, which are inadequate for analyzing complex ST datasets. On the other hand, many of the existing ST-specific methods are built upon traditional statistical or machine learning frameworks, which have been shown to be sub-optimal in many applications due to the scale, multi-modality, and limitations of spatially resolved data (such as spatial resolution, sensitivity, and gene coverage). Given these intricacies, researchers have developed deep learning (DL)-based models to alleviate ST-specific challenges. These methods include new state-of-the-art models in alignment, spatial reconstruction, and spatial clustering, among others. However, DL models for ST analysis are nascent and remain largely underexplored. In this review, we provide an overview of existing state-of-the-art tools for analyzing spatially resolved transcriptomics while delving deeper into the DL-based approaches. We discuss the new frontiers and the open questions in this field and highlight domains in which we anticipate transformational DL applications.
  5. We present a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. Cell counting is an important step in cell analysis. Typically, domain experts manually count cells in a microscopic image. Automated cell counting can potentially eliminate this tedious, time-consuming process. However, a good, labeled dataset is required for training an accurate machine learning model. Our dataset includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. Compared to existing publicly available datasets, our dataset has more images of cells stained with a greater variety of antibodies (protein components of immune responses against invaders) typically used for cell analysis. The experimental results on this dataset indicate that none of the five existing models evaluated in this study is able to achieve a sufficiently accurate count to replace the manual methods. The dataset is available at https://figshare.com/articles/dataset/Dataset/21970604.