Title: Perturbation-based Detection and Resolution of Cherry-picking
In settings where an outcome, a decision, or a statement is made based on a single option among alternatives, it is common to cherry-pick the data to generate an outcome that is supported by the cherry-picked data but not in general. In this paper, we use perturbation as a technique to design a support measure to detect and resolve cherry-picking across different contexts. In particular, to demonstrate the general scope of our proposal, we study cherry-picking in two very different domains: (a) political statements based on trend-lines and (b) linear rankings. We also discuss sampling-based estimation as an effective and efficient approximation approach for detecting and resolving cherry-picking at scale.
Award ID(s):
2107290
PAR ID:
10300669
Author(s) / Creator(s):
Editor(s):
Wang, Haixun; Li, Chengkai; Yang, Jun
Date Published:
Journal Name:
A Quarterly bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering
Volume:
45
Issue:
3
ISSN:
1053-1238
Page Range / eLocation ID:
39-51
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The growing popularity of interactive time series exploration platforms has made data visualization more accessible to the public. However, the ease of creating polished charts with preloaded data also enables selective information presentation, often resulting in biased or misleading visualizations. Research shows that these tools have been used to spread misinformation, particularly in areas such as public health and economic policies during the COVID-19 pandemic. Post hoc fact-checking may be ineffective because it typically addresses only a portion of misleading posts and comes too late to curb the spread. In this work, we explore using visualization design to counteract cherry-picking, a common tactic in deceptive visualizations. We propose a design space of guardrails—interventions to expose cherry-picking in time-series explorers. Through three crowd-sourced experiments, we demonstrate that guardrails, particularly those superimposing data, can encourage skepticism, though with some limitations. We provide recommendations for developing more effective visualization guardrails. 
  2. Abstract Particle picking in cryo-electron tomograms (cryo-ET) is crucial for in situ structure detection of macromolecules and protein complexes. The traditional template-matching-based approaches for particle picking suffer from template-specific biases and have low throughput. Given these problems, learning-based solutions are necessary for particle picking. However, the paucity of annotated data for training poses substantial challenges for such learning-based approaches. Moreover, preparing extensively annotated cryo-ET tomograms for particle picking is extremely time-consuming and burdensome. Addressing these challenges, we present TomoPicker, an annotation-efficient particle-picking approach that can effectively pick particles when only a minuscule portion (∼0.3–0.5%) of the total particles in a cellular cryo-ET dataset is provided for training. TomoPicker regards particle picking as a voxel classification problem and solves it with two different positive-unlabeled learning approaches. We evaluated our method on a benchmark cryo-ET dataset of eukaryotic cells, where we observed about 30% improvement by TomoPicker against the most recent state-of-the-art annotation-efficient learning-based picking approaches.
  3. Abstract Background: Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Numerous DAA tools have been proposed in the past decade addressing the special characteristics of microbiome data such as zero inflation and compositional effects. Disturbingly, different DAA tools could sometimes produce quite discordant results, opening up the possibility of cherry-picking the tool in favor of one’s own hypothesis. To recommend the best DAA tool or practice to the field, a comprehensive evaluation, which covers as many biologically relevant scenarios as possible, is critically needed. Results: We performed by far the most comprehensive evaluation of existing DAA tools using real data-based simulations. We found that DAA methods explicitly addressing compositional effects such as ANCOM-BC, Aldex2, metagenomeSeq (fitFeatureModel), and DACOMP did have improved performance in false-positive control. But they are still not optimal: type 1 error inflation or low statistical power has been observed in many settings. The recent LDM method generally had the best power, but its false-positive control in the presence of strong compositional effects was not satisfactory. Overall, none of the evaluated methods is simultaneously robust, powerful, and flexible, which makes the selection of the best DAA tool difficult. To meet the analysis needs, we designed an optimized procedure, ZicoSeq, drawing on the strength of the existing DAA methods. We show that ZicoSeq generally controlled for false positives across settings, and the power was among the highest. Application of DAA methods to a large collection of real datasets revealed a pattern similar to that observed in simulation studies.
Conclusions: Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we design ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery.
  4. Boosting engagement with educational software has been promoted as a means of improving student performance. Various engagement factors have been explored, including choice, personalization, badges, bonuses, and competition. We examine two promising and relatively understudied manipulations from the realm of gambling: the near-win effect and anticipation. The near-win effect occurs when an individual comes close to achieving a goal, e.g., getting two cherries and a lemon in a slot machine. Anticipation refers to the build-up of suspense as an outcome is revealed, e.g., revealing cherry-cherry-lemon in that order drives expectations of winning more than revealing lemon-cherry-cherry. Gambling psychologists have long studied how near-wins affect engagement in pure-chance games but it is difficult to do the same in an educational context where outcomes are based on skill. In this paper, we manipulate the display of outcomes in a manner that allows us to introduce artificial near-wins largely independent of a student’s performance. In a study involving thousands of students using an online math tutor, we examine how this manipulation affects a behavioral measure of engagement—whether or not a student repeats a lesson. We find a near-win effect on engagement when the ‘win’ indicates to the student that they have attained critical competence on a lesson—the competence that allows them to continue to the next lesson. Nonetheless, when we experimentally induce near wins in a randomized controlled trial, we do not obtain a reliable effect of the near win. We discuss this mismatch of results in terms of the role of anticipation on making near wins effective. We conclude by describing manipulations that might increase the effect of near wins on engagement.
  5. Abstract An important criterion for understanding speciation is the geographic context of population divergence. Three major modes of allopatric, parapatric, and sympatric speciation define the extent of spatial overlap and gene flow between diverging populations. However, mixed modes of speciation are also possible, whereby populations experience periods of allopatry, parapatry, and/or sympatry at different times as they diverge. Here, we report clinal patterns of variation for 21 nuclear‐encoded microsatellites and a wing spot phenotype for cherry‐infesting Rhagoletis (Diptera: Tephritidae) across North America consistent with these flies having initially diverged in parapatry followed by a period of allopatric differentiation in the early Holocene. However, mitochondrial DNA (mtDNA) displays a different pattern; cherry flies at the ends of the clines in the eastern USA and Pacific Northwest share identical haplotypes, while centrally located populations in the southwestern USA and Mexico possess a different haplotype. We hypothesize that the mitochondrial difference could be due to lineage sorting but more likely reflects a selective sweep of a favorable mtDNA variant or the spread of an endosymbiont. The estimated divergence time for mtDNA suggests possible past allopatry, secondary contact, and subsequent isolation between USA and Mexican fly populations initiated before the Wisconsin glaciation. Thus, the current genetics of cherry flies may involve different mixed modes of divergence occurring in different portions of the fly's range. We discuss the need for additional DNA sequencing and quantification of prezygotic and postzygotic reproductive isolation to verify the multiple mixed‐mode hypothesis for cherry flies and draw parallels from other systems to assess the generality that speciation may commonly involve complex biogeographies of varying combinations of allopatric, parapatric, and sympatric divergence.