Abstract
Motivation: Although large databases hold numerous RNA-seq samples, most RNA-seq analysis tools are evaluated on only a limited number of them. This creates a need for methods that select a representative subset from all available RNA-seq samples, so that bioinformatics tools can be evaluated comprehensively and without bias. Sequence-based approaches to representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples) must compute a full similarity matrix over the samples. Because of the large numbers of available RNA-seq samples and of k-mers/sequences in each sample, computing this matrix for the entire set of RNA-seq samples in a large database (e.g. the SRA) poses memory and runtime challenges, which makes direct representative set selection infeasible with limited computing resources.
Results: We developed a novel computational method called ‘hierarchical representative set selection’ to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection achieves summarization quality close to that of direct representative set selection, while substantially reducing the runtime and memory required to compute the full similarity matrix (up to an 8.4× runtime reduction for 10 000 samples and a 5.35× memory reduction for 12 000 samples, sizes that could still practically be run with direct subset selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases like the SRA.
Availability and implementation: The code is available at https://github.com/Kingsford-Group/hierrepsetselection and https://github.com/Kingsford-Group/jellyfishsim.
Supplementary information: Supplementary data are available at Bioinformatics online.
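The divide-and-conquer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repositories above for that): it assumes each sample is already reduced to a small numeric feature vector, uses Euclidean distance in place of k-mer similarity, and swaps in greedy farthest-point (k-center) selection as the per-chunk selector. The function names and the parameters `chunk_size` and `per_chunk` are hypothetical.

```python
import math
import random


def greedy_k_center(indices, features, k):
    """Pick up to k representatives by farthest-point traversal.

    Stand-in selector (an assumption, not the paper's objective).
    Distances are computed only within `indices`, never globally.
    """
    if len(indices) <= k:
        return list(indices)
    chosen = [indices[0]]
    # Distance from each unchosen sample to its nearest representative.
    nearest = {i: math.dist(features[i], features[chosen[0]])
               for i in indices[1:]}
    while len(chosen) < k:
        far = max(nearest, key=nearest.get)
        chosen.append(far)
        del nearest[far]
        for i in nearest:
            nearest[i] = min(nearest[i],
                             math.dist(features[i], features[far]))
    return chosen


def hierarchical_select(features, k, chunk_size, per_chunk, seed=0):
    """Select k representatives level by level.

    Each level splits the current pool into chunks, keeps `per_chunk`
    representatives per chunk, and merges the survivors into the next
    pool, so no distance computation ever spans more than one chunk.
    """
    if not 0 < k <= per_chunk < chunk_size:
        raise ValueError("need 0 < k <= per_chunk < chunk_size")
    pool = list(range(len(features)))
    random.Random(seed).shuffle(pool)
    while len(pool) > chunk_size:
        survivors = []
        for start in range(0, len(pool), chunk_size):
            chunk = pool[start:start + chunk_size]
            survivors.extend(greedy_k_center(chunk, features, per_chunk))
        pool = survivors  # strictly smaller, so the loop terminates
    return greedy_k_center(pool, features, k)
```

Because every distance computation stays inside one chunk, peak memory scales with `chunk_size` rather than with the total number of samples, which is the trade-off the abstract describes; the price is that the result only approximates direct selection.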
Twitter as research data: Tools, costs, skill sets, and lessons learned
Abstract Scholars increasingly use Twitter data to study the life sciences and politics. However, Twitter data collection tools often pose challenges for scholars who are unfamiliar with their operation. Equally important, although many tools indicate that they offer representative samples of the full Twitter archive, little is known about whether the samples are indeed representative of the targeted population of tweets. This article evaluates such tools in terms of costs, training, and data quality as a means to introduce Twitter data as a research tool. Further, using an analysis of COVID-19 and moral foundations theory as an example, we compared the distributions of moral discussions from two commonly used tools for accessing Twitter data (Twitter’s standard APIs and third-party access) to the ground truth, the Twitter full archive. Our results highlight the importance of assessing the comparability of data sources to improve confidence in findings based on Twitter data. We also review the major new features of Twitter’s API version 2.
- Award ID(s): 2027375
- PAR ID: 10302782
- Date Published:
- Journal Name: Politics and the Life Sciences
- ISSN: 0730-9384
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- This mixed-methods observational study analyzes Advanced Placement (AP) Biology teachers’ engagement in microblogging for their professional development (PD). Data from three hashtag-based Twitter communities, #apbiochat, #apbioleaderacad, and #apbioleaderacademy (121 users; 2,253 tweets), are analyzed using methodological approaches including educational data mining, qualitative two-cycle content analysis, social network analysis, linear and logistic regression analyses, and hierarchical linear modeling. Results indicate that Twitter adheres to standards of high-quality PD and has the potential to complement more traditional PD activities. Notably, Twitter’s non-hierarchical leadership affords shared content creation and distribution. Additionally, Twitter allows for different temporal participation patterns and supports the personalization of learning experiences aligned to teachers’ needs and preferences. Furthermore, teachers frame their interactions on Twitter positively, thus creating a supportive environment for professional learning that might reduce teachers’ perceived isolation. Therefore, policy makers and school leaders should feel empowered to encourage teachers to use microblogging as a complement to other PD activities.
- Social media has been transforming political communication dynamics for over a decade. Here using nearly a billion tweets, we analyse the change in Twitter’s news media landscape between the 2016 and 2020 US presidential elections. Using political bias and fact-checking tools, we measure the volume of politically biased content and the number of users propagating such information. We then identify influencers, the users with the greatest ability to spread news in the Twitter network. We observe that the fraction of fake and extremely biased content declined between 2016 and 2020. However, results show increasing echo chamber behaviours and latent ideological polarization across the two elections at the user and influencer levels.
- Political news is often slanted toward its publisher’s ideology and seeks to influence readers by focusing on selected aspects of contentious social and political issues. We investigate political slants in news and their influence on readers by analyzing election-related news and reader reactions to the news on Twitter. To this end, we collected election-related news from six major US news publishers who covered the 2020 US presidential elections. We computed each publisher’s political slant based on the favorability of its news toward the two major parties’ presidential candidates. We found that the election-related news coverage shows signs of political slant both in news headlines and on Twitter. The difference in news coverage of the two candidates between the left-leaning (LEFT) and right-leaning (RIGHT) news publishers is statistically significant. The effect size is larger for the news on Twitter than for headlines, and news on Twitter expresses stronger sentiments than the headlines. We identified moral foundations in reader reactions to the news on Twitter based on Moral Foundation Theory. Moral foundations in readers’ reactions to LEFT and RIGHT differ statistically significantly, though the effects are small. Further, these shifts in moral foundations differ across social and political issues. User engagement on Twitter is higher for RIGHT than for LEFT. We posit that an improved understanding of slant and influence can enable better ways to combat online political polarization.
- This paper introduces a spatiotemporal analysis framework for estimating hourly changing population distribution patterns in urban areas using geo-tagged tweets (messages containing users’ geospatial locations), land use data, and dasymetric maps. We collected geo-tagged social media messages (tweets) within the County of San Diego over one year (2015) using Twitter’s Streaming Application Programming Interfaces (APIs). A semi-manual Twitter content verification procedure for data cleaning was applied first to separate tweets created by humans from those created by non-human users (bots). The next step was to calculate the number of unique Twitter users every hour within census blocks. The final step was to estimate the actual population by transforming the numbers of unique Twitter users in each census block into estimated population densities with spatial and temporal factors using dasymetric maps. The temporal factor was estimated based on hourly changes of Twitter messages within San Diego County, CA. The spatial factor was estimated by using the dasymetric method with land use maps and 2010 census data. Compared with census data, our method provides better population estimates for airports, shopping malls, sports stadiums, zoos and parks, and business areas during the daytime.
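The counting step in the abstract above (unique Twitter users per hour within census blocks) can be sketched as follows. This is an illustration only, not the authors' pipeline: it assumes bots have already been filtered out and each tweet has already been assigned to a census block, it keys counts by hour of day (the abstract does not say whether hours are pooled across days), and it leaves out the spatial and temporal scaling factors, which the abstract does not specify. The function name and the record layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def unique_users_per_block_hour(tweets):
    """Count distinct users per (census block, hour of day).

    `tweets` is an iterable of (user_id, block_id, iso_timestamp)
    tuples; this record layout is an assumption for illustration.
    """
    users = defaultdict(set)
    for user_id, block_id, timestamp in tweets:
        hour = datetime.fromisoformat(timestamp).hour
        # A set ensures a user tweeting twice in one hour counts once.
        users[(block_id, hour)].add(user_id)
    return {key: len(ids) for key, ids in users.items()}


tweets = [
    ("u1", "block_A", "2015-03-01T08:15:00"),
    ("u1", "block_A", "2015-03-01T08:40:00"),  # same user, same hour
    ("u2", "block_A", "2015-03-01T08:55:00"),
    ("u2", "block_B", "2015-03-01T17:05:00"),
]
counts = unique_users_per_block_hour(tweets)
```

Counting users rather than tweets keeps a single prolific account from inflating a block's estimate, which is presumably why the abstract specifies unique users.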

